Lexical knowledge schemes for modeling words and
expressions in communication
Computational Lexicology & Terminology Lab
Wauter Bosma, Isa Maks, Roxane Segers, Hennie van der Vliet, Piek Vossen
LCC meeting, October 9th, 2008, VU University Amsterdam
Overview
• Genre as a knowledge scheme
• What do we do at CLTL?
• How does it relate to genre?
• Projects at CLTL
• Discussion
A view on genre
• Genre is an abstract knowledge scheme that natural language speakers can apply to structure communication effectively.
– How and where is such a scheme stored?
– How is this knowledge activated and applied in a communicative setting?
– How can we benefit from these insights in computerized information and communication systems?
[Diagram: genre in relation to communication, language and world knowledge. Labels: Social behaviour, Communication, targets, strategy, form, medium, language, lexicon, grammar, entities, relations, Participants, Intentions, Text: structure & content, genre, Attitudes, objects, relations, World Knowledge, Ontology]
Focus of Computational Lexicology and Terminology Lab (CLTL)
• Lexicon = model of abstract knowledge to efficiently process and produce natural language in communicative settings
• Symbolic & abstract representation of forms related to concepts:
– forms are variants that can refer to more or less the same semantic content:
  • shoot (V) – shooting (N) – aggression (N) – fight (N) – conflict (N) – war (N) – WWII (Name)
  • pay (V) – exchange (V) – buy (V) – sell (V) – merchandise (N) – trade (N) – business (N)
• Also encode pragmatic aspects of use:
– Sentiment, subjectivity & attitude
– Perspective
– Domain restrictions
Focus of Computational Lexicology and Terminology Lab (CLTL)
• Broad notion of knowledge:
– words & expressions (what is a word, what is a concept?)
– phrases, sentences and text (incorporating grammar)
– genres
• Abstract symbolic representations related to statistical expectation patterns
• Tagged corpus represents an 'experience' of language use:
– "X drinks beer", "Y drinks wine", "Z drinks milk"
• Lexicon is the highest abstraction of these experiences that gives the most effective prediction of how words and expressions behave:
– "XYZ drink beverages"
• Corpus-based lexicon or corpus data represented as a lexicon
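The step from corpus "experiences" to a lexicon entry can be sketched in a few lines. This is only an illustration of the idea on the slide, not the lab's method: the `HYPERNYM` table and the function name `abstract_patterns` are assumptions made for the example, and a real system would look up hypernyms in a wordnet.

```python
from collections import defaultdict

# Hypothetical hypernym table (assumption); a wordnet would supply this.
HYPERNYM = {"beer": "beverage", "wine": "beverage", "milk": "beverage"}


def abstract_patterns(observations):
    """Group (subject, verb, object) observations by verb and object hypernym."""
    patterns = defaultdict(set)
    for subject, verb, obj in observations:
        # Fall back to the object itself when no hypernym is known.
        patterns[(verb, HYPERNYM.get(obj, obj))].add(subject)
    return {key: sorted(subjects) for key, subjects in patterns.items()}


# The three tagged-corpus "experiences" from the slide.
corpus = [("X", "drink", "beer"), ("Y", "drink", "wine"), ("Z", "drink", "milk")]
print(abstract_patterns(corpus))
```

The three separate observations collapse into one pattern keyed on ("drink", "beverage"), which corresponds to the "XYZ drink beverages" abstraction.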
Focus of Computational Lexicology and Terminology Lab (CLTL)
• Validation of models and databases with lexical knowledge:
– Can we define types of structures (lexical and compositional expressions) that correctly predict their behavior in language use? For example: pluriform-object-count-noun (police), object-count-noun (police officer), group-object-count-noun (eikenbos (oak forest)), mass-object-uncount-noun (bos (forest))
– Can we build a comprehensive database using these types?
• Use the database in corpus research and analysis:
– import corpus data into the lexical database
– apply the database to textual corpora in computer applications:
  • Automatic tagging of corpora with features
  • Automatically mine textual data using the lexicon as a background knowledge resource, e.g. to find facts of causal relations for environmental phenomena
Text corpus with empirical data:
– linear text
– every word occurrence is unique
– domain and genre specific
Term database:
– generic list of terms
– derived from text corpus
– patterns and features that are dominant in domain and genre
Lexical database:
– generic list of words and terms
– abstracts from various text corpora
– differentiation for different domains and genres
– most generic representation in a language community
Ontology:
– concepts instead of words
– identity criteria
– language neutral
– domain and perspective neutral
– no genre dependency
– logically valid
– for inferencing
[Arrows between the levels: Derive, Map, Validate, Integrate]
Projects at CLTL
• Cornetto (Stevin project: STE05039)
• Kyoto (FP7 ICT Work Programme 2007 under Challenge 4 - Digital libraries and Content, project ICT-211423)
• Camera projects:
– From sentiments and opinions in text to positions of political parties
– The semantics of history
• A term bank for the Belastingdienst (Dutch Tax Administration) (Steunpunt Terminologie)
• DutchSemCor (NWO investment grant)
Cornetto
• COmbinatorial Relational NEtwork voor Taal TOepassingen (Combinatorial Relational Network for Language Applications)
• Goal: to develop a lexical semantic database for Dutch:
– 90K entries: generic and central part of the language
– Rich horizontal and vertical semantic relations
– Combinatoric information
– Ontological information
Lexical Unit & Synsets
• Lexical Unit = form-meaning relation, such that:
– form = abstract representation of certain realizations;
– part-of-speech is the same;
– meaning is the same, where meaning is defined by a reference to a unique Synset.
• Synset = set of synonyms (LUs) that refer to the same entities in most contexts.
– Defined by lexical semantic relations;
– Defined by reference to ontology Terms or logical expressions involving Terms from the ontology.
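One way to picture these two definitions is as a pair of record types. The sketch below is an assumption made for illustration, not the Cornetto schema: the class and field names, the synset identifiers, and the sense number given to "geluidsband" are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LexicalUnit:
    form: str        # abstract form (lemma), not a specific realization
    pos: str         # part of speech
    sense: int       # sense number of this form
    synset_id: str   # meaning = reference to exactly one synset


@dataclass
class Synset:
    synset_id: str
    members: list = field(default_factory=list)         # synonymous lexical units
    relations: dict = field(default_factory=dict)       # relation name -> synset ids
    ontology_terms: list = field(default_factory=list)  # ontology Terms or expressions

    def add(self, unit):
        # A lexical unit may only join the synset it refers to.
        if unit.synset_id != self.synset_id:
            raise ValueError("lexical unit refers to a different synset")
        self.members.append(unit)

    def synonyms(self):
        return [f"{unit.form}#{unit.sense}" for unit in self.members]


# Hypothetical identifiers; the sense number of "geluidsband" is assumed.
audio_tape = Synset("syn-audio-tape", relations={"has_hyperonym": ["syn-audio-carrier"]})
audio_tape.add(LexicalUnit("band", "noun", 3, "syn-audio-tape"))
audio_tape.add(LexicalUnit("geluidsband", "noun", 1, "syn-audio-tape"))
print(audio_tape.synonyms())
```

Keeping the synset reference on the lexical unit mirrors the definition above: two units are synonyms exactly when they point to the same synset.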
Data Organization
[Diagram: Cornetto data organization.
– Lexical Unit: corresponds to a word-meaning pair; records form, morphology, syntax, semantics, pragmatics, usage examples
– Synset: groups synonyms and models meaning relations (internal relations)
– Linked wordnets: Princeton Wordnet, Wordnet Domains, Spanish Wordnet, Czech Wordnet, German Wordnet, French Wordnet, Korean Wordnet, Arabic Wordnet
– SUMO/MILO: collection of Terms and Axioms]
Data overview
                       ALL       NOUNS    VERBS    ADJ.     ADV.   Other
Synsets                70,434    52,888   9,053    7,703    220    570
Lexical Units          118,466   85,278   17,363   15,731   73     21
Lemmas (form+pos)      91,991    70,556   9,055    12,307   73     n.a.
Synonyms in synsets    102,572   74,893   14,091   12,899   84     605
CID records            103,668   75,812   14,093   13,089   484    190
Synonyms per synset    1.46      1.42     1.56     1.67     0.38   1.06
Senses per lemma       1.29      1.21     1.92     1.28     1.00   n.a.
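The two ratio rows can be recomputed from the count rows, assuming "synonyms per synset" is synonyms in synsets divided by synsets and "senses per lemma" is lexical units divided by lemmas. That reading is an inference from the numbers, not something the slide states, and the dictionary layout and function name below are illustrative only.

```python
# Counts copied from the table: (synsets, lexical units, lemmas, synonyms in synsets).
# None marks the "n.a." cell.
COUNTS = {
    "ALL":   (70434, 118466, 91991, 102572),
    "NOUNS": (52888, 85278, 70556, 74893),
    "VERBS": (9053, 17363, 9055, 14091),
    "ADJ.":  (7703, 15731, 12307, 12899),
    "ADV.":  (220, 73, 73, 84),
    "Other": (570, 21, None, 605),
}


def ratios(synsets, lexical_units, lemmas, synonyms):
    """Return (synonyms per synset, senses per lemma), rounded to two decimals."""
    per_synset = round(synonyms / synsets, 2)
    per_lemma = round(lexical_units / lemmas, 2) if lemmas else None
    return per_synset, per_lemma


for column, row in COUNTS.items():
    print(column, ratios(*row))
```

Under that assumption, every computed value agrees with the corresponding cell in the table.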
[Diagram: Cornetto wordnet fragment for the senses of Dutch "band", with related synsets and English glosses]
band#1 (band): jazzband (jazz band), popgroep (pop group), muziekgezelschap (music group), gezelschap (group of people), groep (group), muzikant (musician), muziek (music), artiest (artist), musiceren (to make music)
band#2 (tire): ring (ring), voorwerp (object), fietsband (bike tire), buitenband (outer tire), binnenband (inner tire), autoband (car tire), zwemband (tire for swimming)
band#3 / geluidsband (audio tape): cassettebandje (audio cassette), geluidsdrager (audio carrier), informatiedrager (data carrier), middel (device), schrijven (write), lezen (read)
band#5 (bond): verhouding (relation), relatie (relation), toestand (state), bloedband (blood bond), familieband (family bond), moederband (mother bond)
Combinatorics (audio tape)
– de band starten (to start a tape)
– op de band opnemen (to record on a tape)
– de band afspelen (to play from a tape)
Combinatorics (bond)
– een goede/sterke band (a good/strong bond)
– de banden verbreken (to break all bonds)
– een band hebben met iemand (to have a bond with s.o.)
Combinatorics (music band)
– in een band spelen (to play in a band)
– een band oprichten (to start a band)
– de band speelt (the band plays)
Combinatorics (tire)
– de band oppompen (to pump air into a tire)
– een band plakken (to fix a hole in a tire)
– een lekke band (flat tire)
– de band springt (the tire explodes)
Integrating the ontology: SUMO terms and axioms
Lexicon versus Ontology
[Diagram: ontology fragment with lexicon entries attached.
– Ontology nodes: Abstract, Physical, Element, H2O, CO2, Process, Possession, Transaction, Organism, Dog, PoodleDog
– NAMES for TYPES: {poodle} EN, {poedel} NL, {pudoru} JP: ((instance x Poodle))
– LABELS for ROLES: {watchdog} EN, {waakhond} NL, {banken} JP: ((instance x Canine) (role x GuardingProcess))
– LABELS for ROLES: {bluswater} (water for firefighting), {theewater} (water for tea), {koffiewater} (water for coffee)
– {buy}, {sell}: syntactic arguments (subj, obj, ind obj) linked to the roles giver, receiver, goods]
Kyoto
• Knowledge Yielding Ontologies for Transition-Based Organization
• Funded:
– 7th Framework Programme (ICT) of the European Union: Intelligent Content and Semantics
• Goal:
– Platform for knowledge sharing across languages and cultures
– Enables knowledge transition and information search across different target groups, crossing linguistic, cultural and geographic boundaries
– Open text mining and deep semantic search
– Wiki environment that allows people in the field to maintain their knowledge and agree on meaning without knowledge engineering skills
• URL: http://www.kyoto-project.eu/
• Duration: March 2008 – March 2011
• Effort: 364 person months of work
KYOTO (ICT-211423) Overview
• Languages:
– English, Dutch, Italian, Spanish, Basque, Chinese, Japanese
• Domain:
– Environmental domain, BUT usable in any domain
• Global:
– Both European and non-European languages
• Available:
– Free: as open source system and data (GPL)
• Future perspective:
– Content standardization that supports worldwide communication
– Global Wordnet Grid
[Diagram: KYOTO overview.
– Capture: Images, Docs, URLs, Experts
– Index, Search, Dialogue (example topics: CO2 emission, water pollution)
– Users: Citizens, Governors, Companies, Environmental organizations
– Wikyoto, Domain wordnets, Domain ontology
– Universal Ontology: Top (Abstract, Physical), Middle (Process, Substance), with water and CO2 as examples
– Global Wordnet Grid
– Tybots: Concept Mining; Kybots: Fact Mining
– Example fact: "Sudden increase of CO2 emissions in 2008 in Europe"]
User perspective
• Ecosystem services
– nature as a resource: food, transport, recreation, medicine, material
– nature for waste absorption
– economic dependency
– state of nature
– footprint
– poverty
Lexicon versus Ontology
[Diagram: ontology fragment with domain terms that are qualified by user-perspective categories.
– Ontology nodes: Abstract, Physical, Element, H2O, CO2, Process, Possession, Transaction, Organism, Artifacts, Spider
– Ecosystem services categories (linked by "qualifies"): Nature as a resource, Nature for waste absorption, State of nature, Threats to nature
– Domain terms: branding rural products, sustainable products, green roof, alien invasive species, species migration, ecosystem-based drinking water production, green house gas]
System components
• Wikyoto = wiki environment for a social group:
– to model the terms and concepts of a domain and agree on their meaning, within the group, across languages and cultures
– to define the types of knowledge and facts of interest
• Tybots = Term yielding robots, extract term data from a text corpus
• Kybots = Knowledge yielding robots, extract facts from a text corpus
• Linguistic processors:
– tokenizers, segmentizers, taggers, grammars
– named entity recognition
– word sense disambiguation
– generate a layered text annotation in Kyoto Annotation Format (KAF)
[Diagram: KYOTO system architecture. Components:
– Capture Server
– Document Base (linear KAF)
– Tybot Server (term extraction)
– Extracted Terms (generic K-TMF)
– Term Editor (Wikyoto)
– Domain Ontology (OWL_DL)
– Domain Wordnet (K-LMF)
– Semantic Annotation
– Document Base (linear generic KAF)
– Kybot Editor
– Kybot Profiles
– Kybot Server (fact extraction)
– Users: Concept User, Fact User]
Conceptual modeling
[Diagram: concept mining by Tybots, from source documents to the ontology.
– Source Documents, Linguistic Processors, morpho-syntactic analysis: [[the emission]NP [of greenhouse gases]PP [in agricultural areas]PP]NP
– English Wordnet senses: emission:2, emission:3, natural process:1, gas:1, greenhouse gas:1, substance:1, area:1, rural area:1, geographical area:1, region:3, location:3, farmland:2
– Term hierarchy (TYBOT Concept Miners): emission, gas, greenhouse gas, area, agricultural area, linked by "of" and "in"
– Ontology: Abstract, Physical, Substance, H2O, CO2, Process, Chemical Reaction, CO2Emission, WaterPollution, GlobalWarming, GreenhouseGas
– Steps: Synthesize, Ontologize, Axiomatize, e.g. (instance s1 Substance) (instance e1 Warming) (catalyst s1 e1)]
Fact mining by Kybots
[Diagram: fact mining by Kybots, from source documents to facts.
– Source Documents, Linguistic Processors, morpho-syntactic analysis: [[the emission]NP [of greenhouse gases]PP [in agricultural areas]PP]NP
– Ontology (generic and domain): Abstract, Physical, Substance, H2O, CO2, Process, Chemical Reaction, CO2 emission, water pollution
– Wordnets & Linguistic Expressions, Logical Expressions
– Fact analysis: [the emission]NP = Process: e1; [of greenhouse gases]PP = Patient: s2; [in agricultural areas]PP = Location: a3]
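The fact analysis in this example can be imitated with a toy rule set. The sketch below is not how Kybots are implemented: the preposition-to-role table, the rule that the head noun phrase is the Process, and the function name `assign_roles` are all assumptions chosen to reproduce the single example on the slide. A real Kybot would consult the wordnets and the ontology.

```python
# Hypothetical mapping from a chunk's leading preposition to a semantic role.
PREPOSITION_ROLE = {"of": "Patient", "in": "Location"}


def assign_roles(chunks):
    """Assign a role to each (text, label) chunk of a chunked noun phrase."""
    facts = []
    for text, label in chunks:
        if label == "NP":
            # Assumption: the head noun phrase names the process (as in "the emission").
            role = "Process"
        else:
            first_word = text.split()[0].lower()
            role = PREPOSITION_ROLE.get(first_word, "Unknown")
        facts.append((role, text))
    return facts


# The chunked phrase from the slide.
phrase = [
    ("the emission", "NP"),
    ("of greenhouse gases", "PP"),
    ("in agricultural areas", "PP"),
]
for role, text in assign_roles(phrase):
    print(f"{role}: {text}")
```

A table keyed on prepositions is far too coarse for real text ("in 2003" would wrongly become a Location), which is why the roles in the diagram are tied to ontology types rather than to surface words.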
[Diagram: Wikyoto interviews a user to relate terms to the ontology.
– Text evidence (WIKIPEDIA): ".... populations declined", "..... terrestrial and marine species .. in forests ..... declined", ".... populations such as terrestrial and marine species ....."
– Term list: A ..... decline ... population ..... Z
– Interview questions: "Do populations always consist of marine species?", "Are terrestrial species never marine species?", "Do populations consist of marine species?", "Are terrestrial species a type of populations?"
– Simplified Term Fragment: population, marine species, terrestrial species
– Simplified Ontology Fragment: ?Population, Group
– Other labels: Kyoto Server, Hidden, Shown]
[Diagram: KYOTO components and data formats. Labels: Smart Kytext, KAF, Tybots, DE-TN, DE-WN, G-WN, DE-KON, G-KON, SUMO, DOLCE, FRAMENET, FactAF, KAF, Kybots, GEO plugin, plugin; Facts in RDF, Wordnets in LMF, Ontologies in OWL]
[Diagram: four levels of representation, from text to ontology.
– Ontology: Abstract, Physical, Substance, H2O, CO2, Process, ChemicalReaction, CO2Emission, GlobalWarming, GreenhouseGas
– Lexical database (wordnet): emission:2, emission:3, natural process:1, gas:1, greenhouse gas:1, substance:1, CO2
– Term database (generic, text based): gas; green house gas -> gas, -increase (AG), -in 2003 (TIME); CO2 -> green house gas, -emission (PA), -in European countries (LO)
– Text corpus (linear text): "Sudden increase of green house gases in 2003 ........ CO2 emission in European countries .... Green house gases such as CO2, ...."
– Level annotations: maximal abstraction & integrity; language neutral integrity
– Steps: Concept mining by Tybots, Text mining by Kybots, Synthesize, Ontologize, Axiomatize, e.g. (instance s1 Substance) (instance e1 Warming) (catalyst s1 e1)]
From sentiments and opinions in text to positions of political parties
• Most language use does not express facts but personal opinions and positions with respect to facts or issues, often disguised for some communicative or manipulative goal.
• CAMERA project involving 2 AIOs (PhD students) from FdL and 1 AIO from Political Sciences
• Combines contemporary theories and methods in linguistics and political science to develop an automated research tool for rich text mining:
– Complexity of language use, the linguistic modeling of subjectivity and the representation of this knowledge in a lexicon.
– Complex dimensionality of competition between political parties.
• Mining tool for language-meaning research can be applied to enhance the Kieskompas (Electoral Compass).
[Diagram: project organization. Labels: Corpus Linguistics, Political Text Corpus, Quantitative Text Analysis, Concordance Search, Lexical Analysis, Lexical database, Automated Tagging & Analysis, Manual Coding, Political Analysis, Search, Quantitative Data Analysis, Morpho-syntactic Parsers, Modeling, Political Database, Manual Coding & Tagging, Linguistic rules, Interpretation rules, Co-occurrence, Lexical acquisition, Derivation; staffing: aio-1, aio-2, aio-3, system integrator-4; Omstreden democratie (Contested Democracy): Jan Kleinnijenhuis, Wouter van Atteveldt]
AIO-1: Lexical model and acquisition for sentiment and opinion analysis in Dutch text
• Words & expressions in political text
• Model sentiment, subjectivity, lexical framing and attitudinal implications
• Build a lexicon encoding these layers
• Validate the lexicon in the mining application applied to the text corpus
Levels of subjectivity
• sentiment orientation, e.g.
– small (neutral), splendid (positive), dull (negative)
– funeral (negative), birthday party (positive), meeting (neutral)
• explicit attitudinal and deontic implications
– hate, love, favour, desire, want
– impossible, possible, can, cannot
– demand, beg, hope, wish
• implicit attitudinal and deontic implications
– neutral: describe, cite, quote
– subjective: tell my story, shout, cry out, suggest
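The first level, sentiment orientation, amounts to a lexicon lookup. The sketch below uses only the six example entries from the slide; the dictionary name, the function name `orientation_counts` and the whole-word matching strategy are assumptions for illustration, and the sketch does not model the attitudinal or deontic levels.

```python
import re
from collections import Counter

# Orientation labels taken from the examples on the slide.
ORIENTATION = {
    "small": "neutral",
    "splendid": "positive",
    "dull": "negative",
    "funeral": "negative",
    "birthday party": "positive",
    "meeting": "neutral",
}


def orientation_counts(text):
    """Count lexicon hits per orientation, matching whole words or phrases."""
    lowered = text.lower()
    counts = Counter()
    for entry, orientation in ORIENTATION.items():
        hits = len(re.findall(r"\b" + re.escape(entry) + r"\b", lowered))
        if hits:
            counts[orientation] += hits
    return dict(counts)


print(orientation_counts("A splendid birthday party after a dull meeting"))
```

Counting hits this way ignores negation and context, which is part of why the slide separates plain orientation from the attitudinal and deontic levels.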
Some concepts of saying
The reporter expresses attitude towards the subject (is not aware):
– nazeggen:1, herhalen:4, echoën:2
– meesmuilen:1
– herkauwen:2
– toesnauwen:1, aanblaffen:2, sissen:2, toebijten:1, toeblaffen:1
– toesmijten:2
– toevoegen:4
– uitputten:3
– verzuchten:1
– pretenderen:1, beweren:1
Subject of speech act has attitude towards (is aware):
– afzeggen:1, cancellen:1
– ontkennen:1, miskennen:1, ontveinzen:1
– toewensen:1, wensen:2
– verbieden:1
– aanzetten:12, beklemtonen:2, hameren:2, tamboereren:2
– onderstrepen:2, onderlijnen:1, accentueren:1
– toezeggen:1, beloven:1
– uitlaten:5, beoordelen:1
– distantiëren:1
– erkennen:2, toegeven:1
– opmerken:2, aantekenen:4
Synsets or lexical units
• {brilliant:3, glorious:4, magnificent:1, splendid:2}
• {bus:4, jalopy:1, heap:3}
– has_hyperonym: {car:1, auto:1, automobile:1, machine:4, motorcar:1}
• {fiets:1, brik:7, kar:3, karretje:2, rijwiel:1, velo:1}
The semantics of history
• Camera project involving 1 AIO from FdL and 1 AIO from FEW (Exact Science)
• Goal: an ontology and lexicon for a historical multimedia archive of the Rijksmuseum.
• Applied to an innovative information system for accessing the historical archive.
The semantics of history = semantics of change
• Represent different realities:
– related through causal changes over time
– representing different views or perspectives on the same reality, e.g. from a different historical angle or from different geographical or social parties.
• Changes are typed as events
Events as key notions
• Historical events:
– events considered from a distance in time and with abstraction of detail.
– referenced by names (WOII (WWII), de Val van Srebrenica (the Fall of Srebrenica)), nouns (war) or nominalizations (the violation of human rights)
• News events:
– Report on (the same) reality but more in the active verbal form: US soldiers shoot Iraqi citizens.
– Close to the actual event
– Lack historical abstraction and filtering.
• Both news and historical accounts imply subjectivity and perspective on these events but probably make different selections and use different genres to convey this information.
• News becomes history over time, and we therefore expect a smooth transition in the use of language to refer to the same events, adding more and more historical perspective.
“Val van Srebrenica” in Wikipedia
• Headings:
– 1992 ethnic cleansing campaign
– The conflict in eastern Bosnia
– Struggle for Srebrenica
• Text:
– A fierce struggle for territorial control then ensued among the three major groups in Bosnia: Bosniak (commonly known as 'Bosnian Muslims'), Serb and Croat. In the eastern part of Bosnia, close to Serbia, conflict was particularly fierce between Serbs and Bosniaks
– Serb military and paramilitary forces from the area and neighboring parts of eastern Bosnia and Serbia gained control of Srebrenica for several weeks in early 1992, killing and expelling Bosniak civilians. In May 1992, Bosnian government forces under the leadership of Naser Orić recaptured the town
– thus proceeded with the ethnic cleansing of Bosniaks from Bosniak ethnic territories in Eastern Bosnia and Central Podrinje
Letter from the Dutch minister of defense
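• De afgelopen zes maanden werd de uitvoering van deze taken aanzienlijk bemoeilijkt door de Bosnisch-Servische weigering de enclave voldoende te laten bevoorraden. Door een gebrek aan brandstof moesten patrouilles te voet worden uitgevoerd. Ook blokkeerden de Bosnische Serviërs sinds mei jl. de rotatie van het personeel van Dutchbat, waardoor de bezetting werd teruggebracht van 630 naar 430 blauwhelmen. De vijandelijkheden namen geleidelijk toe, waardoor op 3 juni jl. een observatiepost in het zuidoostelijke deel van de enclave moest worden opgegeven
• English translation: Over the past six months, the execution of these tasks was considerably hampered by the Bosnian Serb refusal to allow the enclave to be adequately resupplied. Owing to a lack of fuel, patrols had to be carried out on foot. Since last May, the Bosnian Serbs also blocked the rotation of Dutchbat personnel, which reduced the force from 630 to 430 blue helmets. Hostilities gradually increased, so that on 3 June an observation post in the south-eastern part of the enclave had to be abandoned
• Historical terms: blokkade (blockade), val (fall), opgave (abandonment), overgave (surrender)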
[Diagram: processing pipeline for the historical archive. Data: Semi-structured Data, Free Text, Structured Data model, Terms & Relations, Event Ont., Historic Ont., Ontology, Lexicon, Objects. Steps: Data Conversion, Term Extraction, Alignment, Ontolization, Lexicalization, Lexical mapping, Validation, Smart Indexing, Smart Retrieval. Facets: Events, Locations, People. Example event terms: conflict, struggle, ethnic cleansing, …, killing, expelling, gain control]
AIO at FdL
• Lexical framing of events in news reporting and historical descriptions.
• Use a historical thesaurus to group all the words and expressions in a lexicon relative to the same events
• Differentiate implications of the lexical variation: packaging of events
• Classification of news