advances in wsd
TRANSCRIPT
-
7/29/2019 Advances in WSD
1/208
Advances in Word Sense Disambiguation
Tutorial at ACL 2005
June 25, 2005
Ted Pedersen
University of Minnesota, Duluth
http://www.d.umn.edu/~tpederse
Rada Mihalcea
University of North Texas
http://www.cs.unt.edu/~rada
Goal of the Tutorial
Introduce the problem of word sense disambiguation
(WSD), focusing on the range of formulations and
approaches currently practiced.
Accessible to anyone with an interest in NLP.
Persuade you to work on word sense disambiguation
It's an interesting problem
Lots of good work already done, still more to do
There is infrastructure to help you get started
Persuade you to use word sense disambiguation in your
text applications.
Outline of Tutorial
Introduction (Ted)
Methodology (Rada)
Knowledge Intensive Methods (Rada)
Supervised Approaches (Ted)
Minimally Supervised Approaches (Rada) / BREAK
Unsupervised Learning (Ted)
How to Get Started (Rada)
Conclusion (Ted)
Part 1:
Introduction
Outline
Definitions
Ambiguity for Humans and Computers
Very Brief Historical Overview
Theoretical Connections
Practical Applications
Definitions
Word sense disambiguation is the problem of selecting a
sense for a word from a set of predefined possibilities.
Sense Inventory usually comes from a dictionary or thesaurus.
Knowledge intensive methods, supervised learning, and
(sometimes) bootstrapping approaches
Word sense discrimination is the problem of dividing the
usages of a word into different meanings, without regard to
any particular existing sense inventory. Unsupervised techniques
Outline
Definitions
Ambiguity for Humans and Computers
Very Brief Historical Overview
Theoretical Connections
Practical Applications
Computers versus Humans
Polysemy: most words have many possible meanings.
A computer program has no basis for knowing which one
is appropriate, even if it is obvious to a human
Ambiguity is rarely a problem for humans in their day-to-day communication, except in extreme cases
Ambiguity for Humans - Newspaper Headlines!
DRUNK GETS NINE YEARS IN VIOLIN CASE
FARMER BILL DIES IN HOUSE
PROSTITUTES APPEAL TO POPE
STOLEN PAINTING FOUND BY TREE
RED TAPE HOLDS UP NEW BRIDGE
DEER KILL 300,000
RESIDENTS CAN DROP OFF TREES
INCLUDE CHILDREN WHEN BAKING COOKIES
MINERS REFUSE TO WORK AFTER DEATH
Ambiguity for a Computer
The fisherman jumped off the bank and into the water.
The bank down the street was robbed!
Back in the day, we had an entire bank of computers devoted to this problem.
The bank in that road is entirely too steep and is really dangerous.
The plane took a bank to the left, and then headed off
towards the mountains.
Outline
Definitions
Ambiguity for Humans and Computers
Very Brief Historical Overview
Theoretical Connections
Practical Applications
Early Days of WSD
Noted as problem for Machine Translation (Weaver, 1949)
A word can often only be translated if you know the specific sense
intended (A bill in English could be a pico or a cuenta in Spanish)
Bar-Hillel (1960) posed the following:
Little John was looking for his toy box. Finally, he found it. The
box was in the pen. John was very happy.
Is pen a writing instrument or an enclosure where children play?
declared it unsolvable, left the field of MT!
Since then
1970s - 1980s
Rule based systems
Rely on hand crafted knowledge sources
1990s
Corpus based approaches
Dependence on sense tagged text
(Ide and Veronis, 1998) overview history from early days to 1998.
2000s
Hybrid Systems
Minimizing or eliminating use of sense tagged text
Taking advantage of the Web
Outline
Definitions
Ambiguity for Humans and Computers
Very Brief Historical Overview
Interdisciplinary Connections
Practical Applications
Interdisciplinary Connections
Cognitive Science & Psychology
Quillian (1968), Collins and Loftus (1975) : spreading activation
Hirst (1987) developed marker passing model
Linguistics
Fodor & Katz (1963): selectional preferences
Resnik (1993) pursued statistically
Philosophy of Language
Wittgenstein (1958): meaning as use
For a large class of cases-though not for all-in which we employ
the word "meaning" it can be defined thus: the meaning of a word
is its use in the language.
Outline
Definitions
Ambiguity for Humans and Computers
Very Brief Historical Overview
Theoretical Connections
Practical Applications
Practical Applications
Machine Translation
Translate bill from English to Spanish
Is it a pico or a cuenta?
Is it a bird jaw or an invoice?
Information Retrieval
Find all Web pages about cricket
The sport or the insect?
Question Answering
What is George Miller's position on gun control?
The psychologist or US congressman?
Knowledge Acquisition
Add to KB: Herb Bergson is the mayor of Duluth.
Minnesota or Georgia?
References
(Bar-Hillel, 1960) The Present Status of Automatic Translation of Languages. In Advances in Computers, Volume 1. Alt, F. (editor). Academic Press, New York, NY. pp. 91-163.
(Collins and Loftus, 1975) A Spreading Activation Theory of Semantic Memory. Psychological Review, (82) pp. 407-428.
(Fodor and Katz, 1963) The structure of semantic theory. Language (39). pp. 170-210.
(Hirst, 1987) Semantic Interpretation and the Resolution of Ambiguity. Cambridge University Press.
(Ide and Véronis, 1998) Word Sense Disambiguation: The State of the Art. Computational Linguistics (24) pp. 1-40.
(Quillian, 1968) Semantic Memory. In Semantic Information Processing. Minsky, M. (editor). The MIT Press, Cambridge, MA. pp. 227-270.
(Resnik, 1993) Selection and Information: A Class-Based Approach to Lexical Relationships. Ph.D. Dissertation. University of Pennsylvania.
(Weaver, 1949) Translation. In Machine Translation of Languages: fourteen essays. Locke, W.N. and Booth, A.D. (editors). The MIT Press, Cambridge, Mass. pp. 15-23.
(Wittgenstein, 1958) Philosophical Investigations, 3rd edition. Translated by G.E.M. Anscombe. Macmillan Publishing Co., New York.
http://www.cs.vassar.edu/~ide/papers/idevero.ps
Part 2:
Methodology
Outline
General considerations
All-words disambiguation
Targeted-words disambiguation
Word sense discrimination, sense discovery
Evaluation (granularity, scoring)
Overview of the Problem
Many words have several meanings (homonymy / polysemy)
Ex: chair - furniture or person
Ex: child - young person or human offspring
Determine which sense of a word is used in a specific sentence
Note:
often, the different senses of a word are closely related
Ex: title - right of legal ownership
- document that is evidence of the legal ownership
sometimes, several senses can be activated in a single context (co-activation)
Ex: This could bring competition to the trade
competition: - the act of competing
- the people who are competing
Word Senses
The meaning of a word in a given context
Word sense representations
With respect to a dictionary:
chair = a seat for one person, with a support for the back; "he put his coat over the back of the chair and sat down"
chair = the position of professor; "he was awarded an endowed chair in economics"
With respect to the translation in a second language:
chair = chaise
chair = directeur
With respect to the context where it occurs (discrimination):
Sit on a chair / Take a seat on this chair
The chair of the Math Department / The chair of the meeting
Approaches to Word Sense Disambiguation
Knowledge-Based Disambiguation
use of external lexical resources such as dictionaries and thesauri
discourse properties
Supervised Disambiguation
based on a labeled training set
the learning system has: a training set of feature-encoded inputs AND their appropriate sense label (category)
Unsupervised Disambiguation
based on unlabeled corpora
the learning system has: a training set of feature-encoded inputs BUT NOT their appropriate sense label (category)
All Words Word Sense Disambiguation
Attempt to disambiguate all open-class words in a text
He put his suit over the back of the chair
Knowledge-based approaches
Use information from dictionaries
Definitions / Examples for each meaning
Find similarity between definitions and current context
Position in a semantic network
Find that table is closer to chair/furniture than to chair/person
Use discourse properties
A word exhibits the same sense in a discourse / in a collocation
All Words Word Sense Disambiguation
Minimally supervised approaches
Learn to disambiguate words using small annotated corpora
E.g. SemCor corpus, where all open class words are disambiguated
200,000 running words
Most frequent sense
Targeted Word Sense Disambiguation
Disambiguate one target word
Take a seat on this chair
The chair of the Math Department
WSD is viewed as a typical classification problem
use machine learning techniques to train a system
Training:
Corpus of occurrences of the target word, each occurrence
annotated with appropriate sense
Build feature vectors: a vector of relevant linguistic features that represents the context (ex:
a window of words around the target word)
Disambiguation:
Disambiguate the target word in new unseen text
Targeted Word Sense Disambiguation
Take a window of n words around the target word
Encode information about the words around the target word
typical features include: words, root forms, POS tags, frequency, etc.
An electric guitar and bass player stand off to one side, not really part of
the scene, just as a sort of nod to gringo expectations perhaps.
Surrounding context (local features)
[ (guitar, NN1), (and, CJC), (player, NN1), (stand, VVB) ]
Frequent co-occurring words (topical features)
[fishing, big, sound, player, fly, rod, pound, double, runs, playing, guitar, band]
[0,0,0,1,0,0,0,0,0,0,1,0]
Other features:
[followed by "player", contains "show" in the sentence,]
[yes, no, ]
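The topical-feature encoding above can be sketched as a small Python function; the sentence and vocabulary are the slide's example, while the function name is ours.

```python
# Sketch: encoding the context of a target word as a binary bag-of-words
# vector over a fixed topical vocabulary (the slide's example data).

def topical_features(context_words, vocabulary):
    """1 if the vocabulary word occurs in the context, else 0."""
    context = set(context_words)
    return [1 if w in context else 0 for w in vocabulary]

sentence = ("an electric guitar and bass player stand off to one side "
            "not really part of the scene").split()
vocab = ["fishing", "big", "sound", "player", "fly", "rod",
         "pound", "double", "runs", "playing", "guitar", "band"]

vector = topical_features(sentence, vocab)
print(vector)  # 1s at the positions of "player" and "guitar"
```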
Unsupervised Disambiguation
Disambiguate word senses:
without supporting tools such as dictionaries and thesauri
without a labeled training text
Without such resources, word senses are not labeled
We cannot say chair/furniture or chair/person
We can:
Cluster/group the contexts of an ambiguous word into a number
of groups
Discriminate between these groups without actually labeling
them
Unsupervised Disambiguation
Hypothesis: same senses of words will have similar neighboring
words
Disambiguation algorithm
Identify context vectors corresponding to all occurrences of a particular word
Partition them into regions of high density
Assign a sense to each such region
Sit on a chair
Take a seat on this chair
The chair of the Math Department
The chair of the meeting
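As a toy illustration of these steps, the sketch below builds bag-of-words context vectors for the four chair contexts and assigns each to the nearer of two seed contexts. The seeds and the one-step assignment are simplifying assumptions; real discrimination systems use richer features and iterative clustering.

```python
# Sketch: grouping occurrences of "chair" by context-vector similarity.
from collections import Counter
import math

def context_vector(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

contexts = ["sit on a chair", "take a seat on this chair",
            "the chair of the math department", "the chair of the meeting"]
# Two seed contexts (an assumption), one per hypothesized sense region.
seeds = [context_vector(contexts[0]), context_vector(contexts[2])]

# Assign each context to the nearest seed (a one-step, k=2 clustering).
clusters = [max(range(2), key=lambda k: cosine(context_vector(c), seeds[k]))
            for c in contexts]
print(clusters)  # furniture contexts vs. person/role contexts
```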
Evaluating Word Sense Disambiguation
Metrics:
Precision = percentage of words that are tagged correctly, out of the words addressed by the system
Recall = percentage of words that are tagged correctly, out of all words in the test set
Example:
Test set of 100 words
System attempts 75 words
Words correctly disambiguated: 50
Precision = 50 / 75 = 0.66
Recall = 50 / 100 = 0.50
Special tags are possible:
Unknown
Proper noun
Multiple senses
Compare to a gold standard
SEMCOR corpus, SENSEVAL corpus,
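The worked example above is a two-line computation (note the slide truncates 0.666… to 0.66):

```python
# The slide's precision/recall example.
test_set_size = 100  # all words in the test set
attempted = 75       # words the system addressed
correct = 50         # words tagged correctly

precision = correct / attempted    # out of words addressed
recall = correct / test_set_size   # out of all words

print(f"P = {precision:.2f}, R = {recall:.2f}")
```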
Evaluating Word Sense Disambiguation
Difficulty in evaluation:
The nature of the senses to distinguish has a huge impact on results
Coarse versus fine-grained sense distinctions:
chair = a seat for one person, with a support for the back; "he put his coat over the back of the chair and sat down"
chair = the position of professor; "he was awarded an endowed chair in economics"
bank = a financial institution that accepts deposits and channels the money into lending activities; "he cashed a check at the bank"; "that bank holds the mortgage on my home"
bank = a building in which commercial banking is transacted; "the bank is on the corner of Nassau and Witherspoon"
Sense maps
Cluster similar senses
Allow for both fine-grained and coarse-grained evaluation
Bounds on Performance
Upper and Lower Bounds on Performance:
Measure of how well an algorithm performs relative to the difficulty of the task.
Upper Bound:
Human performance
Around 97%-99% with few and clearly distinct senses
Inter-judge agreement:
With words with clear & distinct senses: 95% and up
With polysemous words with related senses: 65%-70%
Lower Bound (or baseline):
The assignment of a random sense / the most frequent sense
90% is excellent for a word with 2 equiprobable senses
90% is trivial for a word with 2 senses with probability ratios of 9 to 1
References
(Gale, Church and Yarowsky 1992) Gale, W., Church, K., and Yarowsky, D. Estimating
upper and lower bounds on the performance of word-sense disambiguation programs
ACL 1992.
(Miller et al., 1994) Miller, G., Chodorow, M., Landes, S., Leacock, C., and Thomas, R.
Using a semantic concordance for sense identification. ARPA Workshop 1994.
(Miller, 1995) Miller, G. Wordnet: A lexical database. ACM, 38(11) 1995.
(Senseval) Senseval evaluation exercises http://www.senseval.org
Part 3:
Knowledge-based Methods for
Word Sense Disambiguation
Outline
Task definition
Machine Readable Dictionaries
Algorithms based on Machine Readable Dictionaries
Selectional Restrictions
Measures of Semantic Similarity
Heuristic-based Methods
Task Definition
Knowledge-based WSD = class of WSD methods relying
(mainly) on knowledge drawn from dictionaries and/or raw
text
Resources
Yes:
Machine Readable Dictionaries
Raw corpora
No:
Manually annotated corpora
Scope
All open-class words
Machine Readable Dictionaries
In recent years, most dictionaries made available in
Machine Readable format (MRD)
Oxford English Dictionary
Collins
Longman Dictionary of Contemporary English (LDOCE)
Thesauruses add synonymy information
Roget's Thesaurus
Semantic networks add more semantic relations
WordNet
EuroWordNet
MRD: A Resource for Knowledge-based WSD
For each word in the language vocabulary, an MRD provides:
A list of meanings
Definitions (for all word meanings)
Typical usage examples (for most word meanings)
WordNet definitions/examples for the noun plant
1. buildings for carrying on industrial labor; "they built a large plant to manufacture automobiles"
2. a living organism lacking the power of locomotion
3. something planted secretly for discovery by another; "the police used a plant to trick the thieves"; "he claimed that the evidence against him was a plant"
4. an actor situated in the audience whose acting is rehearsed but seems spontaneous to the audience
MRD: A Resource for Knowledge-based WSD
A thesaurus adds:
An explicit synonymy relation between word meanings
A semantic network adds:
Hypernymy/hyponymy (IS-A), meronymy/holonymy (PART-OF), antonymy, entailment, etc.
WordNet synsets for the noun plant
1. plant, works, industrial plant
2. plant, flora, plant life
WordNet related concepts for the meaning plant life
{plant, flora, plant life}
hypernym: {organism, being}
hyponym: {house plant}, {fungus}, ...
meronym: {plant tissue}, {plant part}
holonym: {Plantae, kingdom Plantae, plant kingdom}
Outline
Task definition
Machine Readable Dictionaries
Algorithms based on Machine Readable Dictionaries
Selectional Restrictions
Measures of Semantic Similarity
Heuristic-based Methods
Lesk Algorithm
(Michael Lesk 1986): Identify senses of words in context using definition overlap
Algorithm:
1. Retrieve from MRD all sense definitions of the words to be
disambiguated
2. Determine the definition overlap for all possible sense combinations
3. Choose senses that lead to highest overlap
Example: disambiguate PINE CONE
PINE
1. kinds of evergreen tree with needle-shaped leaves
2. waste away through sorrow or illness
CONE
1. solid body which narrows to a point
2. something of this shape whether solid or hollow
3. fruit of certain evergreen trees
Pine#1 ∩ Cone#1 = 0
Pine#2 ∩ Cone#1 = 0
Pine#1 ∩ Cone#2 = 1
Pine#2 ∩ Cone#2 = 0
Pine#1 ∩ Cone#3 = 2
Pine#2 ∩ Cone#3 = 0
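The pairwise overlap computation can be sketched as below. The stopword list is our simplifying assumption; without the stemming the slide implicitly uses (tree/trees), the raw counts differ slightly, but the winning sense pair is the same.

```python
# Minimal sketch of the original Lesk overlap for the PINE / CONE example.
PINE = ["kinds of evergreen tree with needle-shaped leaves",
        "waste away through sorrow or illness"]
CONE = ["solid body which narrows to a point",
        "something of this shape whether solid or hollow",
        "fruit of certain evergreen trees"]

STOP = {"of", "with", "a", "to", "this", "or", "the", "which", "whether"}

def content_words(definition):
    return set(definition.lower().split()) - STOP

def overlap(def1, def2):
    return len(content_words(def1) & content_words(def2))

# Score every sense combination and keep the best pair (0-indexed).
best = max(((i, j) for i in range(len(PINE)) for j in range(len(CONE))),
           key=lambda ij: overlap(PINE[ij[0]], CONE[ij[1]]))
print(best)  # Pine#1 / Cone#3, i.e. (0, 2)
```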
Lesk Algorithm for More than Two Words?
I saw a man who is 98 years old and can still walk and tell jokes
nine open class words: see(26), man(11), year(4), old(8), can(5), still(4), walk(10), tell(8), joke(3)
43,929,600 sense combinations! How to find the optimal sense combination?
Simulated annealing (Cowie, Guthrie, Guthrie 1992)
Define a function E that scores a combination of word senses in a given text.
Find the combination of senses that leads to highest definition overlap (redundancy)
1. Start with the most frequent sense for each word
2. At each iteration, replace the sense of a random word in the set with a different sense, and measure E
3. Stop iterating when there is no change in the configuration of senses
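The search loop can be sketched as follows. This is a greedy, hill-climbing simplification (true simulated annealing also accepts some worse moves early on), and the two-word sense inventory with toy glosses is entirely hypothetical.

```python
# Hedged sketch of the iterative sense-configuration search.
import random

# Toy sense inventory: word -> list of gloss strings (hypothetical data).
SENSES = {
    "bank": ["financial institution money deposit",
             "sloping land beside river water"],
    "water": ["clear liquid river sea", "supply with liquid"],
}

def E(assignment):
    """Redundancy: total pairwise word overlap among the chosen glosses."""
    glosses = [set(SENSES[w][s].split()) for w, s in assignment.items()]
    return sum(len(glosses[i] & glosses[j])
               for i in range(len(glosses))
               for j in range(i + 1, len(glosses)))

random.seed(0)
config = {w: 0 for w in SENSES}   # start: first (most frequent) sense
for _ in range(50):               # perturb one word at a time, keep gains
    w = random.choice(list(SENSES))
    candidate = dict(config)
    candidate[w] = random.randrange(len(SENSES[w]))
    if E(candidate) >= E(config):
        config = candidate
print(config, E(config))
```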
Lesk Algorithm: A Simplified Version
Original Lesk definition: measure overlap between sense definitions for all words in context
Identify simultaneously the correct senses for all words in context
Simplified Lesk (Kilgarriff & Rosensweig 2000): measure overlap
between sense definitions of a word and current context
Identify the correct sense for one word at a time
Search space significantly reduced
Lesk Algorithm: A Simplified Version
Example: disambiguate PINE in
Pine cones hanging in a tree
PINE
1. kinds of evergreen tree with needle-shaped leaves
2. waste away through sorrow or illness
Pine#1 ∩ Sentence = 1
Pine#2 ∩ Sentence = 0
Algorithm for simplified Lesk:
1.Retrieve from MRD all sense definitions of the word to be
disambiguated
2.Determine the overlap between each sense definition and the
current context
3.Choose the sense that leads to highest overlap
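The three steps above fit in a few lines; the definitions and sentence are the slide's PINE example.

```python
# Simplified Lesk: overlap between each sense definition of the target
# word and the current sentence context.
PINE = {1: "kinds of evergreen tree with needle-shaped leaves",
        2: "waste away through sorrow or illness"}

def simplified_lesk(definitions, context):
    ctx = set(context.lower().split())
    overlaps = {s: len(set(d.lower().split()) & ctx)
                for s, d in definitions.items()}
    return max(overlaps, key=overlaps.get), overlaps

sense, overlaps = simplified_lesk(PINE, "pine cones hanging in a tree")
print(sense, overlaps)  # sense 1, via the shared word "tree"
```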
Evaluations of Lesk Algorithm
Initial evaluation by M. Lesk:
50-70% on short samples of text, manually annotated with respect to the Oxford Advanced Learner's Dictionary
Simulated annealing
47% on 50 manually annotated sentences
Evaluation on Senseval-2 all-words data, with back-off to
random sense (Mihalcea & Tarau 2004)
Original Lesk: 35%
Simplified Lesk: 47%
Evaluation on Senseval-2 all-words data, with back-off to
most frequent sense (Vasilescu, Langlais, Lapalme 2004)
Original Lesk: 42%
Simplified Lesk: 58%
Outline
Task definition
Machine Readable Dictionaries
Algorithms based on Machine Readable Dictionaries
Selectional Preferences
Measures of Semantic Similarity
Heuristic-based Methods
Selectional Preferences
A way to constrain the possible meanings of words in a given context
E.g. Wash a dish vs. Cook a dish
WASH-OBJECT vs. COOK-FOOD
Capture information about possible relations between semantic classes
Common sense knowledge
Alternative terminology:
Selectional Restrictions
Selectional Preferences
Selectional Constraints
Acquiring Selectional Preferences
From annotated corpora
Circular relationship with the WSD problem
Need WSD to build the annotated corpus
Need selectional preferences to derive WSD
From raw corpora
Frequency counts
Information theory measures
Class-to-class relations
Preliminaries: Learning Word-to-Word Relations
An indication of the semantic fit between two words
1. Frequency counts
Pairs of words connected by a syntactic relation R:
Count(W1, W2, R)
2. Conditional probabilities
Condition on one of the words:
P(W1 | W2, R) = Count(W1, W2, R) / Count(W2, R)
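The counting behind the conditional probability can be sketched directly; the (word, relation, word) triples below are toy data.

```python
# Sketch: estimating P(W1 | W2, R) from (W1, R, W2) dependency triples.
from collections import Counter

triples = [("drink", "obj", "tea"), ("drink", "obj", "beer"),
           ("drink", "obj", "tea"), ("brew", "obj", "tea")]

pair_counts = Counter(triples)                        # Count(W1, W2, R)
w2_counts = Counter((w2, r) for _, r, w2 in triples)  # Count(W2, R)

def p_w1_given_w2(w1, w2, r):
    return pair_counts[(w1, r, w2)] / w2_counts[(w2, r)]

print(p_w1_given_w2("drink", "tea", "obj"))  # 2 of the 3 "tea" objects
```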
Learning Selectional Preferences (1)
Word-to-class relations (Resnik 1993)
Quantify the contribution of a semantic class using all the concepts subsumed by that class:

A(W1, C2, R) = [ P(C2 | W1, R) · log( P(C2 | W1, R) / P(C2) ) ] / [ Σ_C P(C | W1, R) · log( P(C | W1, R) / P(C) ) ]

where

P(C2 | W1, R) = Count(W1, C2, R) / Count(W1, R)
Count(W1, C2, R) = Σ_{W2 ∈ C2} Count(W1, W2, R)
Learning Selectional Preferences (2)
Determine the contribution of a word sense based on the assumption of equal sense distributions:
e.g. plant has two senses -> 50% of occurrences are sense 1, 50% are sense 2
Example: learning restrictions for the verb to drink
Find high-scoring verb-object pairs
Find prototypical object classes (high association score)

Co-occ score   Verb    Object
11.75          drink   tea
11.75          drink   Pepsi
11.75          drink   champagne
10.53          drink   liquid
10.2           drink   beer
9.34           drink   wine

A(v,c)   Object class
3.58     (beverage, [drink, ...])
2.05     (alcoholic_beverage, [intoxicant, ...])
Learning Selectional Preferences (3)
Other algorithms
Learn class-to-class relations (Agirre and Martinez, 2002)
E.g.: ingest - food is a class-to-class relation for eat - chicken
Bayesian networks (Ciaramita and Johnson, 2000)
Tree cut model (Li and Abe, 1998)
Using Selectional Preferences for WSD
Algorithm:
1. Learn a large set of selectional preferences for a given syntactic relation R
2. Given a pair of words W1 - W2 connected by a relation R
3. Find all selectional preferences W1 - C2 (word-to-class) or C1 - C2 (class-to-class) that apply
4. Select the meanings of W1 and W2 based on the selected semantic class
Example: disambiguate coffee in drink coffee
1. (beverage) a beverage consisting of an infusion of ground coffee beans
2. (tree) any of several small trees native to the tropical Old World
3. (color) a medium to dark brown color
Given the selectional preference DRINK -> BEVERAGE : coffee#1
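The final selection step can be sketched as a lookup; the preference table and the class assigned to each coffee sense are toy assumptions standing in for learned data.

```python
# Sketch: applying a learned selectional preference to pick a sense.
PREFERENCES = {("drink", "obj"): "beverage"}   # learned: DRINK -> BEVERAGE

# Semantic class of each sense of "coffee" (per the slide's gloss labels).
COFFEE_SENSES = {1: "beverage", 2: "tree", 3: "color"}

def disambiguate(verb, relation, senses):
    wanted = PREFERENCES.get((verb, relation))
    for sense, sem_class in senses.items():
        if sem_class == wanted:
            return sense
    return None  # no preference applies

print(disambiguate("drink", "obj", COFFEE_SENSES))  # coffee#1
```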
Evaluation of Selectional Preferences for WSD
Data set
mainly on verb-object, subject-verb relations extracted from
SemCor
Compare against random baseline
Results (Agirre and Martinez, 2000) - average results on 8 nouns
Similar figures reported in (Resnik 1997)

                  Object                 Subject
                  Precision   Recall    Precision   Recall
Random            19.2        19.2      19.2        19.2
Word-to-word      95.9        24.9      74.2        18.0
Word-to-class     66.9        58.0      56.2        46.8
Class-to-class    66.6        64.8      54.0        53.7
Outline
Task definition
Machine Readable Dictionaries
Algorithms based on Machine Readable Dictionaries
Selectional Restrictions
Measures of Semantic Similarity
Heuristic-based Methods
Semantic Similarity
Words in a discourse must be related in meaning, for the discourse to be coherent (Halliday and Hasan, 1976)
Use this property for WSD: identify related meanings for words that share a common context
Context span:
1. Local context: semantic similarity between pairs of words
2. Global context: lexical chains
Semantic Similarity in a Local Context
Similarity determined between pairs of concepts, or
between a word and its surrounding context
Relies on similarity metrics on semantic networks
(Rada et al. 1989)
[Taxonomy fragment rooted at carnivore, including: fissiped mammal / fissiped; canine, canid; feline, felid; bear; dog; wolf; wild dog; hyena; hyena dog; dingo; hunting dog; dachshund; terrier]
Semantic Similarity Metrics (1)
Input: two concepts (same part of speech)
Output: similarity measure
(Leacock and Chodorow 1998)
Similarity(C1, C2) = -log( Path(C1, C2) / (2 · D) ), where D is the taxonomy depth
E.g. Similarity(wolf, dog) = 0.60; Similarity(wolf, bear) = 0.42
(Resnik 1995)
Define information content: IC(C) = -log( P(C) ), where P(C) = probability of seeing a concept of type C in a large corpus
Probability of seeing a concept = probability of seeing instances of that concept
Determine the contribution of a word sense based on the assumption of equal sense distributions:
e.g. plant has two senses -> 50% of occurrences are sense 1, 50% are sense 2
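The path-based formula is a one-liner; the path length and taxonomy depth below are illustrative numbers, not actual WordNet values.

```python
# Leacock-Chodorow similarity: -log( path / (2 * D) ).
import math

def leacock_chodorow(path_length, depth):
    """Shorter paths in a taxonomy of depth `depth` give higher scores."""
    return -math.log(path_length / (2 * depth))

# Toy values: a path of 2 edges in a taxonomy of depth 16.
print(round(leacock_chodorow(2, 16), 2))
```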
Semantic Similarity Metrics (2)
Similarity using information content (Resnik 1995)
Define similarity between two concepts (LCS = Least Common Subsumer):
Similarity(C1, C2) = IC( LCS(C1, C2) )
Alternative (Jiang and Conrath 1997), a distance over the same quantities:
Distance(C1, C2) = IC(C1) + IC(C2) - 2 · IC( LCS(C1, C2) )
Other metrics:
Similarity using information content (Lin 1998):
Similarity(C1, C2) = 2 · IC( LCS(C1, C2) ) / ( IC(C1) + IC(C2) )
Similarity using gloss-based paths across different hierarchies (Mihalcea and Moldovan 1999)
Conceptual density measure between noun semantic hierarchies and current context (Agirre and Rigau 1995)
Adapted Lesk algorithm (Banerjee and Pedersen 2002)
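The three information-content measures share the same inputs, so they can be compared side by side; the IC values below are illustrative, not corpus-derived.

```python
# Resnik similarity, Jiang-Conrath distance, and Lin similarity,
# all computed from information-content (IC) values.
def resnik(ic_lcs):
    return ic_lcs

def jiang_conrath_distance(ic1, ic2, ic_lcs):
    return ic1 + ic2 - 2 * ic_lcs

def lin(ic1, ic2, ic_lcs):
    return 2 * ic_lcs / (ic1 + ic2)

# Toy values: IC of two concepts and of their least common subsumer.
ic_c1, ic_c2, ic_lcs = 8.0, 6.0, 5.0
print(resnik(ic_lcs),
      jiang_conrath_distance(ic_c1, ic_c2, ic_lcs),
      lin(ic_c1, ic_c2, ic_lcs))
```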
Semantic Similarity Metrics for WSD
Disambiguate target words based on similarity with one
word to the left and one word to the right (Patwardhan, Banerjee, Pedersen 2002)
Evaluation:
1,723 ambiguous nouns from Senseval-2
Among 5 similarity metrics, (Jiang and Conrath 1997) provide the
best precision (39%)
Example: disambiguate PLANT in plant with flowers
PLANT
1. plant, works, industrial plant
2. plant, flora, plant life
Similarity (plant#1, flower) = 0.2
Similarity (plant#2, flower) = 1.5 -> plant#2
Semantic Similarity in a Global Context
Lexical chains (Hirst and St-Onge 1998), (Halliday and Hasan 1976)
A lexical chain is a sequence of semantically related words, which creates a context and contributes to the continuity of meaning and the coherence of a discourse
Algorithm for finding lexical chains:
1. Select the candidate words from the text. These are words for which we can compute similarity measures, and therefore most of the time they have the same part of speech.
2. For each such candidate word, and for each meaning of this word, find a chain to receive the candidate word sense, based on a semantic relatedness measure between the concepts that are already in the chain and the candidate word meaning.
3. If such a chain is found, insert the word in this chain; otherwise, create a new chain.
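The three steps above can be sketched as a greedy chainer. The toy relatedness function (shared semantic class) and the class table stand in for a real WordNet-based relatedness measure.

```python
# Greedy sketch of lexical chaining with a toy relatedness function.
CLASSES = {"train": "vehicle", "rail": "vehicle",
           "travel": "motion", "velocity": "motion"}

def related(w1, w2):
    return CLASSES.get(w1) == CLASSES.get(w2)

def build_chains(words):
    chains = []
    for w in words:                       # step 1: candidate words
        for chain in chains:              # step 2: find a receiving chain
            if any(related(w, member) for member in chain):
                chain.append(w)           # step 3: insert into the chain
                break
        else:
            chains.append([w])            # step 3: otherwise start a new one
    return chains

print(build_chains(["train", "travel", "rail", "velocity"]))
```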
Semantic Similarity of a Global Context
A very long train traveling along the rails with a constant velocity v in a certain direction
train: #1: public transport
#2: ordered set of things
#3: piece of cloth
travel: #1: change location
#2: undergo transportation
rail: #1: a barrier
#2: a bar of steel for trains
#3: a small bird
Lexical Chains for WSD
Identify lexical chains in a text
Usually target one part of speech at a time
Identify the meaning of words based on their membership in a lexical chain
Evaluation:
(Galley and McKeown 2003) lexical chains on 74 SemCor texts give 62.09%
(Mihalcea and Moldovan 2000) on five SemCor texts give 90% precision with 60% recall
lexical chains anchored on monosemous words
(Okumura and Honda 1994) lexical chains on five Japanese texts give 63.4%
Outline
Task definition
Machine Readable Dictionaries
Algorithms based on Machine Readable Dictionaries
Selectional Restrictions
Measures of Semantic Similarity
Heuristic-based Methods
Most Frequent Sense (1)
Identify the most often used meaning and use this meaning by default
Example: plant/flora is used more often than plant/factory
-> annotate any instance of PLANT as plant/flora
Word meanings exhibit a Zipfian distribution
E.g. distribution of word senses in SemCor
[Chart: relative frequency of each sense (y-axis, 0 to 0.9) by sense number (x-axis, 1 to 10), with separate curves for Noun, Verb, Adj, Adv.]
Most Frequent Sense (2)
Method 1: Find the most frequent sense in an annotated corpus
Method 2: Find the most frequent sense using a method based on distributional similarity (McCarthy et al. 2004)
1. Given a word w, find the top k distributionally similar words Nw = {n1, n2, ..., nk}, with associated similarity scores {dss(w,n1), dss(w,n2), ..., dss(w,nk)}
2. For each sense wsi of w, identify the similarity with the words nj, using the sense of nj that maximizes this score
3. Rank the senses wsi of w based on the total similarity score:

Score(wsi) = Σ_{nj ∈ Nw} dss(w, nj) × wnss(wsi, nj) / Σ_{wsi' ∈ senses(w)} wnss(wsi', nj)

where wnss(wsi, nj) = max_{nsx ∈ senses(nj)} sim(wsi, nsx), for a semantic similarity measure sim
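The scoring formula can be sketched directly; the dss and wnss values below are toy numbers for a word with two senses and three distributional neighbors, not values computed from a corpus.

```python
# Sketch of the predominant-sense score (distributional-similarity method).
dss = {"tube": 0.3, "cable": 0.2, "tank": 0.1}   # dss(w, nj), toy values
wnss = {                                          # wnss(ws_i, nj), toy values
    ("s1", "tube"): 0.3, ("s2", "tube"): 0.6,
    ("s1", "cable"): 0.1, ("s2", "cable"): 0.5,
    ("s1", "tank"): 0.2, ("s2", "tank"): 0.4,
}
senses = ["s1", "s2"]

def score(ws):
    total = 0.0
    for n, d in dss.items():
        norm = sum(wnss[(s, n)] for s in senses)  # normalize per neighbor
        total += d * wnss[(ws, n)] / norm
    return total

ranking = sorted(senses, key=score, reverse=True)
print(ranking[0])  # the predicted predominant sense: "s2"
```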
Most Frequent Sense (3)
Word senses:
pipe #1 = tobacco pipe
pipe #2 = tube of metal or plastic
Distributionally similar words:
N = {tube, cable, wire, tank, hole, cylinder, fitting, tap, ...}
For each word in N, find similarity with pipe#i (using the sense that maximizes the similarity):
pipe#1 - tube (#3) = 0.3
pipe#2 - tube (#1) = 0.6
Compute score for each sense pipe#i:
score (pipe#1) = 0.25
score (pipe#2) = 0.73
Note: results depend on the corpus used to find distributionally similar words => can find domain-specific predominant senses
One Sense Per Discourse
A word tends to preserve its meaning across all its occurrences in a given discourse (Gale, Church, Yarowsky 1992)
What does this mean?
E.g. the ambiguous word PLANT occurs 10 times in a discourse: all instances of plant carry the same meaning
Evaluation:
8 words with two-way ambiguity, e.g. plant, crane, etc.
98% of the two-word occurrences in the same discourse carry the same meaning
The grain of salt: performance depends on granularity
(Krovetz 1998) experiments with words with more than two senses
Performance of one sense per discourse measured on SemCor is approx. 70%
One Sense per Collocation
A word tends to preserve its meaning when used in the same collocation (Yarowsky 1993)
Strong for adjacent collocations
Weaker as the distance between words increases
An example: the ambiguous word PLANT preserves its meaning in all its occurrences within the collocation industrial plant, regardless of the context where this collocation occurs
Evaluation:
97% precision on words with two-way ambiguity
Finer granularity:
(Martinez and Agirre 2000) tested the one sense per collocation hypothesis on text annotated with WordNet senses
70% precision on SemCor words
References
(Agirre and Rigau, 1995) Agirre, E. and Rigau, G. A proposal for word sense disambiguation using conceptual distance. RANLP 1995.
(Agirre and Martinez 2001) Agirre, E. and Martinez, D. Learning class-to-class selectional preferences. CONLL 2001.
(Banerjee and Pedersen 2002) Banerjee, S. and Pedersen, T. An adapted Lesk algorithm for word sense disambiguation using WordNet. CICLING 2002.
(Cowie, Guthrie and Guthrie 1992) Cowie, L., Guthrie, J. A., and Guthrie, L. Lexical disambiguation using simulated annealing. COLING 1992.
(Gale, Church and Yarowsky 1992) Gale, W., Church, K., and Yarowsky, D. One sense per discourse. DARPA workshop 1992.
(Halliday and Hasan 1976) Halliday, M. and Hasan, R. Cohesion in English. Longman, 1976.
(Galley and McKeown 2003) Galley, M. and McKeown, K. Improving word sense disambiguation in lexical chaining. IJCAI 2003.
(Hirst and St-Onge 1998) Hirst, G. and St-Onge, D. Lexical chains as representations of context in the detection and correction of malapropisms. WordNet: An Electronic Lexical Database, MIT Press.
(Jiang and Conrath 1997) Jiang, J. and Conrath, D. Semantic similarity based on corpus statistics and lexical taxonomy. COLING 1997.
(Krovetz, 1998) Krovetz, R. More than one sense per discourse. ACL-SIGLEX 1998.
(Lesk, 1986) Lesk, M. Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone. SIGDOC 1986.
(Lin 1998) Lin, D. An information-theoretic definition of similarity. ICML 1998.
References
-
7/29/2019 Advances in WSD
71/208
71
(Martinez and Agirre 2000) Martinez, D. and Agirre, E. One sense per collocation andgenre/topic variations. EMNLP 2000.
(Miller et. al., 1994) Miller, G., Chodorow, M., Landes, S., Leacock, C., and Thomas, R.Using a semantic concordance for sense identification. ARPA Workshop 1994.
(Miller, 1995) Miller, G. Wordnet: A lexical database. ACM, 38(11) 1995.
(Mihalcea and Moldovan, 1999) Mihalcea, R. and Moldovan, D. A method for wordsense disambiguation of unrestricted text. ACL 1999.
(Mihalcea and Moldovan 2000) Mihalcea, R. and Moldovan, D. An iterative approachto word sense disambiguation. FLAIRS 2000.
(Mihalcea, Tarau, Figa 2004) R. Mihalcea, P. Tarau, E. FigaPageRank on Semantic
Networks with Application to Word Sense Disambiguation, COLING 2004. (Patwardhan, Banerjee, and Pedersen 2003) Patwardhan, S. and Banerjee, S. and
Pedersen, T. Using Measures of Semantic Relatedeness for Word Sense Disambiguation.CICLING 2003.
(Rada et al 1989) Rada, R. and Mili, H. and Bicknell, E. and Blettner, M. Developmentand application of a metric on semantic nets. IEEE Transactions on Systems, Man, andCybernetics, 19(1) 1989.
(Resnik 1993) Resnik, P. Selection and Information: A Class-Based Approach to LexicalRelationships. University of Pennsylvania 1993.
(Resnik 1995) Resnik, P. Using information content to evaluate semantic similarity.IJCAI 1995.
(Vasilescu, Langlais, Lapalme 2004) F. Vasilescu, P. Langlais, G. Lapalme "Evaluatingvariants of the Lesk approach for disambiguating words, LREC 2004.
(Yarowsky, 1993) Yarowsky, D. One sense per collocation. ARPA Workshop 1993.
Part 4:
Supervised Methods of Word
Sense Disambiguation
Outline
What is Supervised Learning?
Task Definition
Single Classifiers
Naïve Bayesian Classifiers
Decision Lists and Trees
Ensembles of Classifiers
What is Supervised Learning?
Collect a set of examples that illustrate the various possible classifications or outcomes of an event.
Identify patterns in the examples associated with each particular class of the event.
Generalize those patterns into rules.
Apply the rules to classify a new event.
Learn from these examples: when do I go to the store?

Day  CLASS: Go to Store?  F1: Hot Outside?  F2: Slept Well?  F3: Ate Well?
1    YES                  YES               NO               NO
2    NO                   YES               NO               YES
3    YES                  NO                NO               NO
4    NO                   NO                NO               YES
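A learner's job on this toy table is to find the regularity mechanically. A minimal sketch (mine, not from the tutorial) that searches for a single feature whose value, or its negation, reproduces the class on every example:

```python
# Toy training data from the table: (hot, slept_well, ate_well) -> go_to_store
examples = [
    ({"hot": True,  "slept": False, "ate": False}, True),
    ({"hot": True,  "slept": False, "ate": True},  False),
    ({"hot": False, "slept": False, "ate": False}, True),
    ({"hot": False, "slept": False, "ate": True},  False),
]

def best_single_feature(data):
    """Find a feature whose value (or its negation) reproduces the class on every example."""
    for f in data[0][0]:
        if all(feats[f] == label for feats, label in data):
            return f, True       # feature agrees with the class
        if all(feats[f] != label for feats, label in data):
            return f, False      # feature is the negation of the class
    return None

print(best_single_feature(examples))  # ('ate', False)
```

On this table the learner discovers that "go to the store" holds exactly when "ate well" is NO, which is the pattern the next slides ask you to spot by eye.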
Outline
What is Supervised Learning?
Task Definition
Single Classifiers
Naïve Bayesian Classifiers
Decision Lists and Trees
Ensembles of Classifiers
Task Definition
Supervised WSD: class of methods that induces a classifier from manually sense-tagged text using machine learning techniques.
Resources
Sense Tagged Text
Dictionary (implicit source of sense inventory)
Syntactic Analysis (POS tagger, Chunker, Parser, ...)
Scope
Typically one target word per context
Part of speech of target word resolved
Lends itself to targeted word formulation
Reduces WSD to a classification problem where a target word is
assigned the most appropriate sense from a given set of possibilities
based on the context in which it occurs
Sense Tagged Text
Bonnie and Clyde are two really famous criminals, I think they were bank/1 robbers.
My bank/1 charges too much for an overdraft.
I went to the bank/1 to deposit my check and get a new ATM card.
The University of Minnesota has an East and a West Bank/2 campus right on the Mississippi River.
My grandfather planted his pole in the bank/2 and got a great big catfish!
The bank/2 is pretty muddy, I can't walk there.
Two Bags of Words
(Co-occurrences in the window of context)
FINANCIAL_BANK_BAG:
a an and are ATM Bonnie card charges check Clyde
criminals deposit famous for get I much My new overdraft
really robbers the they think to too two went were
RIVER_BANK_BAG:
a an and big campus cant catfish East got grandfather great
has his I in is Minnesota Mississippi muddy My of on planted
pole pretty right River The the there University walk West
Simple Supervised Approach
Given a sentence S containing "bank":
   For each word Wi in S:
      If Wi is in FINANCIAL_BANK_BAG then Sense_1 = Sense_1 + 1
      If Wi is in RIVER_BANK_BAG then Sense_2 = Sense_2 + 1
   If Sense_1 > Sense_2 then print "Financial"
   else if Sense_2 > Sense_1 then print "River"
   else print "Can't Decide"
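The pseudocode above runs directly as Python; a minimal sketch, with the two bags abridged and lowercased for matching:

```python
# Bag-of-words sense vote for "bank" (bags abridged from the slide, lowercased)
FINANCIAL_BANK_BAG = {"atm", "card", "charges", "check", "deposit", "overdraft", "robbers"}
RIVER_BANK_BAG = {"campus", "catfish", "muddy", "pole", "river", "mississippi", "walk"}

def disambiguate_bank(sentence):
    words = sentence.lower().split()
    sense_1 = sum(w in FINANCIAL_BANK_BAG for w in words)   # votes for financial
    sense_2 = sum(w in RIVER_BANK_BAG for w in words)       # votes for river
    if sense_1 > sense_2:
        return "Financial"
    if sense_2 > sense_1:
        return "River"
    return "Can't Decide"

print(disambiguate_bank("I went to the bank to deposit my check"))  # Financial
print(disambiguate_bank("the bank is muddy near the river"))        # River
```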
Supervised Methodology
Create a sample of training data where a given target word is manually annotated with a sense from a predetermined set of possibilities.
One tagged word per instance / lexical sample disambiguation
Select a set of features with which to represent context.
co-occurrences, collocations, POS tags, verb-object relations, etc.
Convert sense-tagged training instances to feature vectors.
Apply a machine learning algorithm to induce a classifier.
Form: structure or relation among features
Parameters: strength of feature interactions
Convert a held-out sample of test data into feature vectors.
correct sense tags are known but not used
Apply classifier to test instances to assign a sense tag.
From Text to Feature Vectors
My/pronoun grandfather/noun used/verb to/prep fish/verb along/adv the/det banks/SHORE of/prep the/det Mississippi/noun River/noun. (S1)
The/det bank/FINANCE issued/verb a/det check/noun for/prep the/det amount/noun of/prep interest/noun. (S2)

     P-2  P-1  P+1   P+2  fish  check  river  interest  SENSE TAG
S1   adv  det  prep  det  Y     N      Y      N         SHORE
S2   -    det  verb  det  N     Y      N      Y         FINANCE
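The conversion from tagged text to a feature vector can be sketched as follows; the keyword list and POS feature names follow the table, while the function itself is illustrative:

```python
# Turning a POS-tagged context into a feature vector (keywords chosen by hand)
KEYWORDS = ["fish", "check", "river", "interest"]

def to_feature_vector(tagged, target_index):
    """tagged: list of (word, pos) pairs; returns local POS features plus keyword flags."""
    def pos_at(i):
        return tagged[i][1] if 0 <= i < len(tagged) else None
    words = {w.lower() for w, _ in tagged}
    vec = {
        "P-2": pos_at(target_index - 2),
        "P-1": pos_at(target_index - 1),
        "P+1": pos_at(target_index + 1),
        "P+2": pos_at(target_index + 2),
    }
    vec.update({k: k in words for k in KEYWORDS})
    return vec

s2 = [("The", "det"), ("bank", "noun"), ("issued", "verb"), ("a", "det"),
      ("check", "noun"), ("for", "prep"), ("the", "det"), ("amount", "noun"),
      ("of", "prep"), ("interest", "noun")]
print(to_feature_vector(s2, 1))  # matches the S2 row of the table
```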
Supervised Learning Algorithms
Once data is converted to feature vector form, any supervised learning algorithm can be used. Many have been applied to WSD with good results:
Support Vector Machines
Nearest Neighbor Classifiers
Decision Trees
Decision Lists
Naïve Bayesian Classifiers
Perceptrons
Neural Networks
Graphical Models
Log Linear Models
Outline
What is Supervised Learning?
Task Definition
Naïve Bayesian Classifier
Decision Lists and Trees
Ensembles of Classifiers
Naïve Bayesian Classifier
The Naïve Bayesian Classifier is well known in the Machine Learning community for good performance across a range of tasks (e.g., Domingos and Pazzani, 1997)
Word Sense Disambiguation is no exception
Assumes conditional independence among features, given the sense of a word.
The form of the model is assumed, but parameters are estimated from training instances
When applied to WSD, features are often a bag of words that come from the training data
Usually thousands of binary features that indicate if a word is present in the context of the target word (or not)
Bayesian Inference
Given observed features, what is most likely sense?
Estimate probability of observed features given sense
Estimate unconditional probability of sense
Unconditional probability of features is a normalizing term, doesn't affect sense classification
p(S | F1, F2, ..., Fn) = p(F1, F2, ..., Fn | S) * p(S) / p(F1, F2, ..., Fn)
Naïve Bayesian Model
[Graphical model: the sense S is the parent of the features F1, F2, F3, F4, ..., Fn]
P(F1, F2, ..., Fn | S) = p(F1 | S) * p(F2 | S) * ... * p(Fn | S)
The Naïve Bayesian Classifier
Given 2,000 instances of "bank", 1,500 for bank/1 (financial sense) and 500 for bank/2 (river sense):
P(S=1) = 1,500/2,000 = .75
P(S=2) = 500/2,000 = .25
Given "credit" occurs 200 times with bank/1 and 4 times with bank/2:
P(F1=credit) = 204/2,000 = .102
P(F1=credit|S=1) = 200/1,500 = .133
P(F1=credit|S=2) = 4/500 = .008
Given a test instance that has one feature "credit":
P(S=1|F1=credit) = .133*.75/.102 = .978
P(S=2|F1=credit) = .008*.25/.102 = .020

sense = argmax over S in senses of p(F1|S) * ... * p(Fn|S) * p(S)
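The same arithmetic checked in code (exact fractions give .980 rather than the slide's .978, which used the rounded .133):

```python
# The slide's worked example as arithmetic; counts taken directly from the slide.
n_total, n_s1, n_s2 = 2000, 1500, 500        # instances of "bank": total, financial, river
credit_s1, credit_s2 = 200, 4                # occurrences of "credit" with each sense

p_s1, p_s2 = n_s1 / n_total, n_s2 / n_total              # priors: .75 and .25
p_credit = (credit_s1 + credit_s2) / n_total             # .102
p_credit_given_s1 = credit_s1 / n_s1                     # .1333...
p_credit_given_s2 = credit_s2 / n_s2                     # .008

p_s1_credit = p_credit_given_s1 * p_s1 / p_credit        # .980 (slide: .978, from rounded .133)
p_s2_credit = p_credit_given_s2 * p_s2 / p_credit        # .020
print(round(p_s1_credit, 2), round(p_s2_credit, 2))      # 0.98 0.02
```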
Comparative Results
(Leacock, et al. 1993) compared Naïve Bayes with a Neural Network and a Context Vector approach when disambiguating six senses of "line"
(Mooney, 1996) compared Naïve Bayes with a Neural Network, Decision Tree/List Learners, Disjunctive and Conjunctive Normal Form learners, and a perceptron when disambiguating six senses of "line"
(Pedersen, 1998) compared Naïve Bayes with Decision Tree, Rule Based Learner, Probabilistic Model, etc. when disambiguating "line" and 12 other words
All found that the Naïve Bayesian Classifier performed as well as any of the other methods!
Outline
What is Supervised Learning?
Task Definition
Naïve Bayesian Classifiers
Decision Lists and Trees
Ensembles of Classifiers
Decision Lists and Trees
Very widely used in Machine Learning.
Decision trees used very early for WSD research (e.g.,
Kelly and Stone, 1975; Black, 1988).
Represent disambiguation problem as a series of questions (presence of feature) that reveal the sense of a word.
List decides between two senses after one positive answer
Tree allows for decision among multiple senses after a series of answers
Uses a smaller, more refined set of features than bag of words and Naïve Bayes.
More descriptive and easier to interpret.
Decision List for WSD (Yarowsky, 1994)
Identify collocational features from sense tagged data.
Word immediately to the left or right of target:
I have my bank/1 statement.
The river bank/2 is muddy.
Pair of words to immediate left or right of target:
The world's richest bank/1 is here in New York.
The river bank/2 is muddy.
Words found within k positions to left or right of target, where k is often 10-50:
My credit is just horrible because my bank/1 has made several mistakes with my account and the balance is very low.
Building the Decision List
Sort order of collocation tests using log of conditional
probabilities.
Words most indicative of one sense (and not the other)
will be ranked highly.
Abs( log( p(S=1 | Fi = Collocation_i) / p(S=2 | Fi = Collocation_i) ) )
Computing DL score
Given 2,000 instances of "bank", 1,500 for bank/1 (financial sense) and 500 for bank/2 (river sense):
P(S=1) = 1,500/2,000 = .75
P(S=2) = 500/2,000 = .25
Given "credit" occurs 200 times with bank/1 and 4 times with bank/2:
P(F1=credit) = 204/2,000 = .102
P(F1=credit|S=1) = 200/1,500 = .133
P(F1=credit|S=2) = 4/500 = .008
From Bayes' Rule:
P(S=1|F1=credit) = .133*.75/.102 = .978
P(S=2|F1=credit) = .008*.25/.102 = .020
DL Score = abs(log(.978/.020)) = 3.89
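The score is one line of arithmetic; checking it with the slide's rounded probabilities:

```python
from math import log

# Decision-list score for the "credit" collocation, using the slide's numbers
p_s1_credit, p_s2_credit = 0.978, 0.020
dl_score = abs(log(p_s1_credit / p_s2_credit))   # natural log of the odds ratio
print(round(dl_score, 2))  # 3.89
```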
Using the Decision List
Sort by DL-score, go through test instance looking for a matching feature. First match reveals sense.

DL-score  Feature             Sense
3.89      credit within bank  Bank/1 financial
2.20      bank is muddy       Bank/2 river
1.09      pole within bank    Bank/2 river
0.00      of the bank         N/A
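The "first match wins" rule is simple to sketch; here the collocational tests are simplified to single-word presence checks, which is my simplification, not the full feature set:

```python
# Applying a sorted decision list: the first matching test decides the sense
decision_list = [
    (3.89, "credit", "Bank/1 financial"),
    (2.20, "muddy",  "Bank/2 river"),
    (1.09, "pole",   "Bank/2 river"),
]

def classify(sentence):
    words = set(sentence.lower().split())
    for score, feature, sense in decision_list:   # list is already sorted by score
        if feature in words:
            return sense
    return None                                   # no rule fired: can't decide

print(classify("my bank offered me credit"))            # Bank/1 financial
print(classify("he planted a pole in the muddy bank"))  # Bank/2 river
```

In the second sentence both "muddy" and "pole" match, but "muddy" outranks "pole", so only the higher-scoring rule fires.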
Using the Decision List
[Flowchart: CREDIT? yes -> BANK/1 FINANCIAL; no -> IS MUDDY? yes -> BANK/2 RIVER; no -> POLE? yes -> BANK/2 RIVER]
Learning a Decision Tree
Identify the feature that most cleanly divides the training data into the known senses.
"Cleanly" measured by information gain or gain ratio.
Create subsets of training data according to feature values.
Find another feature that most cleanly divides a subset of the training data.
Continue until each subset of training data is pure or as clean as possible.
Well known decision tree learning algorithms include ID3 and C4.5 (Quinlan, 1986, 1993)
In Senseval-1, a modified decision list (which supported some conditional branching) was most accurate for the English Lexical Sample task (Yarowsky, 2000)
Supervised WSD with Individual Classifiers
Many supervised Machine Learning algorithms have been applied to Word Sense Disambiguation, most work reasonably well.
(Witten and Frank, 2000) is a great intro to supervised learning.
Features tend to differentiate among methods more than the learning algorithms.
Good sets of features tend to include:
Co-occurrences or keywords (global)
Collocations (local)
Bigrams (local and global)
Part of speech (local)
Predicate-argument relations
Verb-object, subject-verb, ...
Heads of Noun and Verb Phrases
Convergence of Results
Accuracy of different systems applied to the same data tends to converge on a particular value, no one system shockingly better than another.
Senseval-1, a number of systems in range of 74-78% accuracy for English Lexical Sample task.
Senseval-2, a number of systems in range of 61-64% accuracy for English Lexical Sample task.
Senseval-3, a number of systems in range of 70-73% accuracy for English Lexical Sample task.
What to do next?
Outline
What is Supervised Learning?
Task Definition
Naïve Bayesian Classifiers
Decision Lists and Trees
Ensembles of Classifiers
Ensembles of Classifiers
Classifier error has two components (Bias and Variance)
Some algorithms (e.g., decision trees) try and build a representation of the training data: Low Bias / High Variance
Others (e.g., Naïve Bayes) assume a parametric form and don't represent the training data: High Bias / Low Variance
Combining classifiers with different bias/variance characteristics can lead to improved overall accuracy
"Bagging" a decision tree can smooth out the effect of small variations in the training data (Breiman, 1996)
Sample with replacement from the training data to learn multiple decision trees.
Outliers in training data will tend to be obscured/eliminated.
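Bagging itself is a few lines. A minimal sketch, where the "learner" is a stand-in (it just predicts the majority class) for a real tree inducer:

```python
import random

# Bagging sketch: train classifiers on bootstrap resamples and combine by majority vote.
def bootstrap_sample(data, rng):
    return [rng.choice(data) for _ in data]      # sample with replacement, same size

def majority_class_learner(data):
    labels = [label for _, label in data]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority                    # constant classifier (stand-in learner)

def bag(data, learner, n=5, seed=0):
    rng = random.Random(seed)
    return [learner(bootstrap_sample(data, rng)) for _ in range(n)]

def vote(classifiers, x):
    votes = [c(x) for c in classifiers]
    return max(set(votes), key=votes.count)      # simple majority vote

data = [("ctx", "financial")] * 8 + [("ctx", "river")] * 2
print(vote(bag(data, majority_class_learner), "new context"))
```

Each resample sees a slightly different training set, so outliers appear in only some of the classifiers and are voted down in the ensemble.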
Ensemble Considerations
Must choose different learning algorithms with significantly different bias/variance characteristics.
Naïve Bayesian Classifier versus Decision Tree
Must choose feature representations that yield significantly different (independent?) views of the training data.
Lexical versus syntactic features
Must choose how to combine classifiers.
Simple Majority Voting
Averaging of probabilities across multiple classifier output
Maximum Entropy combination (e.g., Klein, et al., 2002)
Ensemble Results
(Pedersen, 2000) achieved state of the art for "interest" and "line" data using an ensemble of Naïve Bayesian Classifiers.
Many Naïve Bayesian Classifiers trained on varying sized windows of context / bags of words.
Classifiers combined by a weighted vote.
(Florian and Yarowsky, 2002) achieved state of the art for Senseval-1 and Senseval-2 data using a combination of six classifiers.
Rich set of collocational and syntactic features.
Combined via linear combination of top three classifiers.
Many Senseval-2 and Senseval-3 systems employed ensemble methods.
References
(Black, 1988) An experiment in computational discrimination of English word senses. IBM Journal of Research and Development (32), pg. 185-194.
(Breiman, 1996) The heuristics of instability in model selection. Annals of Statistics (24), pg. 2350-2383.
(Domingos and Pazzani, 1997) On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning (29), pg. 103-130.
(Domingos, 2000) A unified bias variance decomposition for zero-one and squared loss. In Proceedings of AAAI, pg. 564-569.
(Florian and Yarowsky, 2002) Modeling consensus: Classifier combination for word sense disambiguation. In Proceedings of EMNLP, pg. 25-32.
(Kelly and Stone, 1975) Computer Recognition of English Word Senses. North Holland Publishing Co., Amsterdam.
(Klein, et al., 2002) Combining heterogeneous classifiers for word-sense disambiguation. Proceedings of Senseval-2, pg. 87-89.
(Leacock, et al., 1993) Corpus based statistical sense resolution. In Proceedings of the ARPA Workshop on Human Language Technology, pg. 260-265.
(Mooney, 1996) Comparative experiments on disambiguating word senses: An illustration of the role of bias in machine learning. Proceedings of EMNLP, pg. 82-91.
(Pedersen, 1998) Learning Probabilistic Models of Word Sense Disambiguation. Ph.D. Dissertation, Southern Methodist University.
(Pedersen, 2000) A simple approach to building ensembles of Naive Bayesian classifiers for word sense disambiguation. In Proceedings of NAACL.
(Quinlan, 1986) Induction of decision trees. Machine Learning (1), pg. 81-106.
(Quinlan, 1993) C4.5: Programs for Machine Learning. Morgan Kaufmann, San Francisco.
(Witten and Frank, 2000) Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, San Francisco.
(Yarowsky, 1994) Decision lists for lexical ambiguity resolution: Application to accent restoration in Spanish and French. In Proceedings of ACL, pg. 88-95.
(Yarowsky, 2000) Hierarchical decision lists for word sense disambiguation. Computers and the Humanities, 34.
Part 5:
Minimally Supervised Methods
for Word Sense Disambiguation
Outline
Task definition: What does "minimally supervised" mean?
Bootstrapping algorithms
Co-training
Self-training
Yarowsky algorithm
Using the Web for Word Sense Disambiguation
Web as a corpus
Web as collective mind
Task Definition
Supervised WSD = learning sense classifiers starting with annotated data
Minimally supervised WSD = learning sense classifiers from annotated data, with minimal human supervision
Examples:
Automatically bootstrap a corpus starting with a few human annotated examples
Use monosemous relatives / dictionary definitions to automatically construct sense tagged data
Rely on Web users + active learning for corpus annotation
Outline
Task definition: What does "minimally supervised" mean?
Bootstrapping algorithms
Co-training
Self-training
Yarowsky algorithm
Using the Web for Word Sense Disambiguation
Web as a corpus
Web as collective mind
Bootstrapping WSD Classifiers
Build sense classifiers with little training data
Expand applicability of supervised WSD
Bootstrapping approaches:
Co-training
Self-training
Yarowsky algorithm
Bootstrapping Recipe
Ingredients:
(Some) labeled data
(Large amounts of) unlabeled data
(One or more) basic classifiers
Output:
Classifier that improves over the basic classifiers
[Illustration: occurrences of "plant", mostly unlabeled, fed to two classifiers]
plants#1 and animals
industry plant#2
building the only atomic plant
plant growth is retarded
a herb or flowering plant
a nuclear power plant
building a new vehicle plant
the animal and plant life
the passion-fruit plant
Classifier 1 / Classifier 2 newly labeled output:
plant#1 growth is retarded
a nuclear power plant#2
Co-training / Self-training
Input:
A set L of labeled training examples
A set U of unlabeled examples
Classifiers Ci
1. Create a pool of examples U': choose P random examples from U
2. Loop for I iterations:
Train Ci on L and label U'
Select G most confident examples and add to L (maintain distribution in L)
Refill U' with examples from U (keep U' at constant size P)
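The loop can be sketched in a few lines for the single-classifier (self-training) case. This is a minimal sketch: "train" and the (label, confidence) interface are hypothetical stand-ins, and the "maintain distribution in L" step is omitted for brevity:

```python
import random

# Self-training sketch: one classifier repeatedly retrained on its own most
# confident labels over a pool U' of size P, growing L by G examples per iteration.
def self_train(labeled, unlabeled, train, pool_size, growth, iterations, seed=0):
    rng = random.Random(seed)
    unlabeled = list(unlabeled)
    rng.shuffle(unlabeled)
    pool = [unlabeled.pop() for _ in range(min(pool_size, len(unlabeled)))]
    classifier = train(labeled)
    for _ in range(iterations):
        guesses = [(x,) + classifier(x) for x in pool]        # (example, label, confidence)
        guesses.sort(key=lambda g: g[2], reverse=True)
        confident, rest = guesses[:growth], guesses[growth:]
        labeled += [(x, label) for x, label, _ in confident]  # grow L with confident labels
        pool = [x for x, _, _ in rest]
        while unlabeled and len(pool) < pool_size:            # refill U' from U
            pool.append(unlabeled.pop())
        classifier = train(labeled)                           # retrain on the enlarged L
    return classifier
```

Co-training follows the same skeleton with two classifiers, each labeling examples for the other from its own view of the data.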
Co-training
(Blum and Mitchell 1998)
Two classifiers with independent views [independence condition can be relaxed]
Co-training in Natural Language Learning:
Statistical parsing (Sarkar 2001)
Co-reference resolution (Ng and Cardie 2003)
Part of speech tagging (Clark, Curran and Osborne 2003)
...
Self-training
(Nigam and Ghani 2000)
One single classifier
Retrain on its own output
Self-training for Natural Language Learning:
Part of speech tagging (Clark, Curran and Osborne 2003)
Co-reference resolution (Ng and Cardie 2003): several classifiers through bagging
Parameter Setting for Co-training/Self-training
1. Create a pool of examples U': choose P random examples from U (P = pool size)
2. Loop for I iterations (I = iterations):
Train Ci on L and label U'
Select G most confident examples and add to L (G = growth size; maintain distribution in L)
Refill U' with examples from U (keep U' at constant size P)
A major drawback of bootstrapping:
No principled method for selecting optimal values for these parameters (Ng and Cardie 2003)
Experiments with Co-training / Self-training for WSD
Training / Test data: Senseval-2 nouns (29 ambiguous nouns)
Average corpus size: 95 training examples, 48 test examples
Raw data: British National Corpus
Average corpus size: 7,085 examples
Co-training: two classifiers, local and topical
Self-training: one classifier, global
(Mihalcea 2004)
Parameter Settings
Parameter ranges:
P = {1, 100, 500, 1000, 1500, 2000, 5000}
G = {1, 10, 20, 30, 40, 50, 100, 150, 200}
I = {1, ..., 40}
29 nouns, 120,000 runs
Upper bound in co-training/self-training performance, optimised on the test set:
Basic classifier: 53.84%
Optimal self-training: 65.61%
Optimal co-training: 65.75%
~25% error reduction
Per-word parameter setting:
Co-training = 51.73%
Self-training = 52.88%
Global parameter setting:
Co-training = 55.67%
Self-training = 54.16%
Example: "lady"
basic = 61.53%
self-training = 84.61% [20/100/39]
co-training = 82.05% [1/1000/3]
Yarowsky Algorithm
(Yarowsky 1995)
Similar to co-training
Differs in the basic assumption (Abney 2002): view independence (co-training) vs. precision independence (Yarowsky algorithm)
Relies on two heuristics and a decision list
One sense per collocation: nearby words provide strong and consistent clues as to the sense of a target word
One sense per discourse: the sense of a target word is highly consistent within a single document
Learning Algorithm
A decision list is used to classify instances of the target word:
"... the loss of animal and plant species through extinction ..."
Classification is based on the highest ranking rule that matches the target context

LogL  Collocation                  Sense
9.31  flower (within +/- k words)  A (living)
9.24  job (within +/- k words)     B (factory)
9.03  fruit (within +/- k words)   A (living)
9.02  plant species                A (living)
...   ...
Bootstrapping Algorithm
All occurrences of the target word are identified
A small training set of seed data is tagged with word sense
Sense-B: factory
Sense-A: life
Bootstrapping Algorithm
Seed set grows and residual set shrinks.
Bootstrapping Algorithm
Convergence: Stop when residual set stabilizes
Bootstrapping Algorithm
Iterative procedure:
Train decision list algorithm on seed set
Classify residual data with decision list
Create new seed set by identifying samples that are tagged with a probability above a certain threshold
Retrain classifier on new seed set
Selecting training seeds:
Initial training set should accurately distinguish among possible senses
Strategies:
Select a single, defining seed collocation for each possible sense. Ex: "life" and "manufacturing" for target "plant"
Use words from dictionary definitions
Hand-label most frequent collocates
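The whole loop fits in a short sketch. This is a toy version, not Yarowsky's implementation: contexts are bare word sets, rules test single words, and the smoothing constant and acceptance threshold are my choices:

```python
from math import log

# Toy Yarowsky-style bootstrapping for two senses of "plant"
SEEDS = {"life": "A", "manufacturing": "B"}     # one defining collocation per sense
THRESHOLD = 1.0                                 # minimum score to move a context to the seed set

def train_decision_list(labeled, alpha=0.1):
    counts = {}                                 # word -> per-sense counts
    for words, sense in labeled:
        for w in words:
            counts.setdefault(w, {"A": 0, "B": 0})[sense] += 1
    rules = [(abs(log((c["A"] + alpha) / (c["B"] + alpha))),
              w, "A" if c["A"] > c["B"] else "B") for w, c in counts.items()]
    return sorted(rules, reverse=True)          # highest log-likelihood first

def classify(rules, words):
    for score, w, sense in rules:               # highest ranking matching rule wins
        if w in words:
            return sense, score
    return None, 0.0

def bootstrap(contexts, iterations=5):
    labeled = [(c, SEEDS[next(w for w in c if w in SEEDS)])
               for c in contexts if any(w in SEEDS for w in c)]
    residual = [c for c in contexts if not any(w in SEEDS for w in c)]
    for _ in range(iterations):
        rules = train_decision_list(labeled)
        keep = []
        for c in residual:
            sense, score = classify(rules, c)
            if sense and score >= THRESHOLD:
                labeled.append((c, sense))      # seed set grows
            else:
                keep.append(c)                  # residual set shrinks
        residual = keep
    return train_decision_list(labeled), residual
```

Each round, collocates that co-occur with the seeds (e.g. "species" with "life") acquire high scores of their own and pull further residual contexts into the seed set.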
Evaluation
Test corpus: extracted from a 460 million word corpus of multiple sources (news articles, transcripts, novels, etc.)
Performance of multiple models compared with:
supervised decision lists
unsupervised learning algorithm of Schütze (1992), based on alignment of clusters with word senses

Word    Senses             Supervised  Unsupervised (Schütze)  Unsupervised (Bootstrapping)
plant   living/factory     97.7        92                      98.6
space   volume/outer       93.9        90                      93.6
tank    vehicle/container  97.1        95                      96.5
motion  legal/physical     98.0        92                      97.9
Avg.    -                  96.1        92.2                    96.5
Outline
Task definition: What does "minimally supervised" mean?
Bootstrapping algorithms
Co-training
Self-training
Yarowsky algorithm
Using the Web for Word Sense Disambiguation
Web as a corpus
Web as collective mind
The Web as a Corpus
Use the Web as a large textual corpus
Build annotated corpora using monosemous relatives
Bootstrap annotated corpora starting with few seeds, similar to (Yarowsky 1995)
Use the (semi)automatically tagged data to train WSD classifiers
Monosemous Relatives
Idea: determine a phrase (SP) which uniquely identifies the sense of a word (W#i)
1. Determine one or more Search Phrases from a machine readable dictionary using several heuristics
2. Search the Web using the Search Phrases from step 1
3. Replace the Search Phrases in the examples gathered at step 2 with W#i
Output: sense annotated corpus for the word sense W#i
Example:
"As a pastime, she enjoyed reading." -> "As an interest, she enjoyed reading."
"Evaluate the interestingness of the website." -> "Evaluate the interest of the website."
Heuristics to Identify Monosemous Relatives
Synonyms: determine a monosemous synonym
remember#1 has "recollect" as a monosemous synonym: SP = recollect
Dictionary definitions (1):
Parse the gloss and determine the set of single phrase definitions
produce#5 has the definition "bring onto the market or release": 2 definitions, "bring onto the market" and "release"
eliminate "release" as being ambiguous: SP = bring onto the market
Dictionary definitions (2):
Parse the gloss and determine the set of single phrase definitions
Replace the stop words with the NEAR operator
Strengthen the query: concatenate the words from the current synset using the AND operator
produce#6 has the synset {grow, raise, farm, produce} and the definition "cultivate by growing"
SP = cultivate NEAR growing AND (grow OR raise OR farm OR produce)
Dictionary definitions (3):
Parse the gloss and determine the set of single phrase definitions
Keep only the head phrase
Strengthen the query: concatenate the words from the current synset using the AND operator
company#5 has the synset {party, company} and the definition "band of people associated in some activity"
SP = band of people AND (company OR party)
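The query construction in this heuristic is plain string assembly; a minimal sketch (function name mine, AND/OR being the boolean operators of a search engine of that era):

```python
# Build a strengthened Web query from a gloss head phrase and a synset
def search_phrase(head_phrase, synset):
    return f"{head_phrase} AND ({' OR '.join(synset)})"

q = search_phrase("band of people", ["company", "party"])
print(q)  # band of people AND (company OR party)
```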
Example: building annotated corpora for the noun "interest".

#  Synset                         Definition
1  {interest#1, involvement}      sense of concern with and curiosity about someone or something
2  {interest#2, interestingness}  the power of attracting or holding one's interest
3  {sake, interest#3}             reason for wanting something done
4  {interest#4}                   fixed charge for borrowing money; usually a percentage of the amount borrowed
5  {pastime, interest#5}          a subject or pursuit that occupies one's time and thoughts
6  {interest#6, stake}            a right or legal share of something; financial involvement with something
7  {interest#7, interest group}   a social group whose members control some field of activity and who have common aims

Sense #  Search phrase
1        sense of concern AND (interest OR involvement)
2        interestingness
3        reason for wanting AND (interest OR sake)
4        fixed charge AND interest
         percentage of amount AND interest
5        pastime
6        right share AND (interest OR stake)
         legal share AND (interest OR stake)
         financial involvement AND (interest OR stake)
7        interest group
Gather 5,404 examples
Check the first 70 examples: 67 correct; 95.7% accuracy
1. I appreciate the genuine interest#1 which motivated you to write your message.
2. The webmaster of this site warrants neither accuracy, nor interest#2.
3. He forgives us not only for our interest#3, but for his own.
4. Interest#4 coverage, including rents, was 3.6x.
5. As an interest#5, she enjoyed gardening and taking part in church activities.
6. Voted on issues, they should have abstained because of direct and indirect personal interests#6 in the matters at hand.
7. The Adam Smith Society is a new interest#7 organized within the APA.
Experimental Evaluation
Tests on 20 words: 7 nouns, 7 verbs, 3 adjectives, 3 adverbs (120 word meanings)
Manually check the first 10 examples of each sense of a word => 91% accuracy (Mihalcea 1999)

Word      Polysemy count  Examples in SemCor  Total examples acquired  Examples manually checked  Correct examples
interest  7               139                 5404                     70                         67
report    7               71                  4196                     70                         63
company   9               90                  6292                     80                         77
school    7               146                 2490                     59                         54
produce   7               148                 4982                     67                         60
remember  8               166                 3573                     67                         57
write     8               285                 2914                     69                         67
speak     4               147                 4279                     40                         39
small     14              192                 10954                    107                        92
clearly   4               48                  4031                     29                         28
TOTAL (20 words)  120     2582                80741                    1080                       978
Web-based Bootstrapping
Similar to the Yarowsky algorithm; relies on data gathered from the Web
1. Create a set of seeds (phrases) consisting of:
Sense tagged examples in SemCor
Sense tagged examples from WordNet
Additional sense tagged examples, if available
Phrase? At least two open class words; words involved in a semantic relation (e.g. noun phrase, verb-object, verb-subject, etc.)
2. Search the Web using queries formed with the seed expressions found at Step 1
Add to the generated corpus a maximum of N text passages
Results competitive with manually tagged corpora (Mihalcea 2002)
The Web as Collective Mind
Two different views of the Web:
collection of Web pages
very large group of Web users
Millions of Web users can contribute their knowledge to a
data repository
Open Mind Word Expert (Chklovski and Mihalcea, 2002)
Fast growing rate:
Started in April 2002
Currently more than 100,000 examples of noun senses in several
languages
OMWE online: http://teach-computers.org
Open Mind Word Expert: Quantity and Quality
Data: a mix of different corpora: Treebank, Open Mind Common Sense, Los Angeles Times, British National Corpus
Word senses: based on WordNet definitions
Active learning to select the most informative examples for learning:
Use two classifiers trained on existing annotated data
Select items where the two classifiers disagree for human annotation
Quality:
Two tags per item
One tag per item per contributor
Evaluations:
Agreement rates of about 65%: comparable to the agreement rates obtained when collecting data for Senseval-2 with trained lexicographers
Replicability: tests on 1,600 examples of "interest" led to 90%+ replicability
References
(Abney 2002) Abney, S. Bootstrapping. Proceedings of ACL 2002.
(Blum and Mitchell 1998) Blum, A. and Mitchell, T. Combining labeled and unlabeled data with co-training. Proceedings of COLT 1998.
(Chklovski and Mihalcea 2002) Chklovski, T. and Mihalcea, R. Building a sense tagged corpus with Open Mind Word Expert. Proceedings of the ACL 2002 workshop on WSD.
(Clark, Curran and Osborne 2003) Clark, S., Curran, J.R. and Osborne, M. Bootstrapping POS taggers using unlabelled data. Proceedings of CoNLL 2003.
(Mihalcea 1999) Mihalcea, R. An automatic method for generating sense tagged corpora. Proceedings of AAAI 1999.
(Mihalcea 2002) Mihalcea, R. Bootstrapping large sense tagged corpora. Proceedings of LREC 2002.
(Mihalcea 2004) Mihalcea, R. Co-training and self-training for word sense disambiguation. Proceedings of CoNLL 2004.
(Ng and Cardie 2003) Ng, V. and Cardie, C. Weakly supervised natural language learning without redundant views. Proceedings of HLT-NAACL 2003.
(Nigam and Ghani 2000) Nigam, K. and Ghani, R. Analyzing the effectiveness and applicability of co-training. Proceedings of CIKM 2000.
(Sarkar 2001) Sarkar, A. Applying co-training methods to statistical parsing. Proceedings of NAACL 2001.
(Yarowsky 1995) Yarowsky, D. Unsupervised word sense disambiguation rivaling supervised methods. Proceedings of ACL 1995.
Part 6:
Unsupervised Methods of Word Sense Disambiguation
Outline
What is Unsupervised Learning?
Task Definition
Agglomerative Clustering
LSI/LSA
Sense Discrimination Using Parallel Texts
What is Unsupervised Learning?
Unsupervised learning identifies patterns in a large sample of data, without the benefit of any manually labeled
examples or external knowledge sources
These patterns are used to divide the data into clusters,
where each member of a cluster has more in common with the other members of its own cluster than with the members of any other cluster
Note! If you remove manual labels from supervised data
and cluster, you may not discover the same classes as in
supervised learning
Supervised Classification identifies features that trigger a sense tag
Unsupervised Clustering finds similarity between contexts
Cluster this Data!
Facts about my day
Day   F1: Hot Outside?   F2: Slept Well?   F3: Ate Well?
1     YES                NO                NO
2     YES                NO                YES
3     NO                 NO                NO
4     NO                 NO                YES
Outline
What is Unsupervised Learning?
Task Definition
Agglomerative Clustering
LSI/LSA
Sense Discrimination Using Parallel Texts
Task Definition
Unsupervised Word Sense Discrimination: A class of methods that cluster words based on similarity of context
Strong Contextual Hypothesis
(Miller and Charles, 1991): Words with similar meanings tend to
occur in similar contexts
(Firth, 1957): You shall know a word by the company it keeps.
words that keep the same company tend to have similar meanings
Only use the information available in raw text, do not use
outside knowledge sources or manual annotations
No knowledge of existing sense inventories, so clusters are
not labeled with senses
Task Definition
Resources:
Large Corpora
Scope:
Typically one targeted word per context to be discriminated
Equivalently, measure similarity among contexts
Features may be identified in separate training data, or in the
data to be clustered
Does not assign senses or labels to clusters
Word Sense Discrimination reduces to the problem of
finding the targeted words that occur in the most similar contexts and placing them in a cluster
Outline
What is Unsupervised Learning?
Task Definition
Agglomerative Clustering
LSI/LSA
Sense Discrimination Using Parallel Texts
Agglomerative Clustering
Create a similarity matrix of the instances to be discriminated
Results in a symmetric instance-by-instance matrix, where each cell contains the similarity score between a pair of instances
Typically a first-order representation, where similarity is based on the features observed in the pair of instances
Apply Agglomerative Clustering algorithm to matrix
To start, each instance is its own cluster
Form a cluster from the most similar pair of instances
Repeat until the desired number of clusters is obtained
Advantages: high-quality clustering
Disadvantages: computationally expensive, must carry out exhaustive pairwise comparisons
Measuring Similarity
Integer Values
Matching Coefficient: |X ∩ Y|
Jaccard Coefficient: |X ∩ Y| / |X ∪ Y|
Dice Coefficient: 2|X ∩ Y| / (|X| + |Y|)
Real Values
Cosine: (X · Y) / (|X| |Y|)
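These measures can be sketched directly. The feature sets X and Y below are hypothetical contexts: binary features as Python sets for the integer-valued coefficients, and real-valued vectors for the cosine:

```python
import math

def matching(x, y):
    """Matching coefficient: |X ∩ Y|."""
    return len(x & y)

def jaccard(x, y):
    """Jaccard coefficient: |X ∩ Y| / |X ∪ Y|."""
    return len(x & y) / len(x | y)

def dice(x, y):
    """Dice coefficient: 2|X ∩ Y| / (|X| + |Y|)."""
    return 2 * len(x & y) / (len(x) + len(y))

def cosine(x, y):
    """Cosine between real-valued feature vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm

# hypothetical binary features present in two contexts
X = {"fish", "river", "P-1=det"}
Y = {"fish", "bank", "P-1=det"}
print(matching(X, Y))  # 2
print(jaccard(X, Y))   # 0.5
print(dice(X, Y))      # 0.666...
```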
Instances to be Clustered
     P-2   P-1   P+1   P+2    fish  check  river  interest
S1   adv   det   prep  det    Y     N      Y      N
S2   -     det   prep  det    N     Y      N      Y
S3   det   adj   verb  det    Y     N      N      N
S4   det   noun  noun  noun   N     N      N      N

     S1   S2   S3   S4
S1   -    3    4    2
S2   3    -    2    0
S3   4    2    -    1
S4   2    0    1    -
Average Link Clustering
aka McQuitty's Similarity Analysis
     S1   S2   S3   S4
S1   -    3    4    2
S2   3    -    2    0
S3   4    2    -    1
S4   2    0    1    -

Merge the most similar pair, S1 and S3 (similarity 4). The merged cluster's similarity to each remaining cluster is the average of its members' similarities:
sim(S1S3, S2) = (3 + 2)/2 = 2.5
sim(S1S3, S4) = (2 + 1)/2 = 1.5

       S1S3   S2    S4
S1S3   -      2.5   1.5
S2     2.5    -     0
S4     1.5    0     -

Merge S1S3 and S2 (similarity 2.5):
sim(S1S3S2, S4) = (1.5 + 0)/2 = 0.75

         S1S3S2   S4
S1S3S2   -        0.75
S4       0.75     -
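A minimal sketch of this average-link procedure, assuming McQuitty's update sim(A+B, C) = (sim(A, C) + sim(B, C)) / 2 and the similarity matrix from the example:

```python
def mcquitty(sim, n_clusters=2):
    """Agglomerative (average link / McQuitty, i.e. WPGMA) clustering:
    sim(A+B, C) = (sim(A, C) + sim(B, C)) / 2."""
    clusters = {name: [name] for name in sim}
    s = {a: dict(row) for a, row in sim.items()}
    while len(clusters) > n_clusters:
        # find the most similar pair of current clusters
        a, b = max(((p, q) for p in s for q in s if p < q),
                   key=lambda pair: s[pair[0]][pair[1]])
        merged = a + b
        clusters[merged] = clusters.pop(a) + clusters.pop(b)
        # similarity of the merged cluster to every other cluster
        # is the average of its two members' similarities
        s[merged] = {c: (s[a][c] + s[b][c]) / 2
                     for c in s if c not in (a, b)}
        for c in s[merged]:
            s[c][merged] = s[merged][c]
        for c in s:
            s[c].pop(a, None)
            s[c].pop(b, None)
        s.pop(a)
        s.pop(b)
    return clusters, s

# similarity matrix from the example (S1..S4)
sim = {"S1": {"S2": 3, "S3": 4, "S4": 2},
       "S2": {"S1": 3, "S3": 2, "S4": 0},
       "S3": {"S1": 4, "S2": 2, "S4": 1},
       "S4": {"S1": 2, "S2": 0, "S3": 1}}
clusters, s = mcquitty(sim, n_clusters=2)
print(clusters)  # S1 and S3 merge first (sim 4), then S2 joins at 2.5
```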
Evaluation of Unsupervised Methods
If sense-tagged text is available, it can be used for evaluation
But don't use the sense tags for clustering or feature selection!
Assume that the sense tags represent the true clusters, and
compare these to the discovered clusters
Find the mapping of clusters to senses that attains maximum accuracy
Pseudo-words are especially useful, since it is hard to find data whose true discrimination is known
Pick two words or names from a corpus, and conflate them into
one name. Then see how well you can discriminate.
http://www.d.umn.edu/~kulka020/kanaghaName.html
Baseline algorithm: group all instances into one cluster;
this will reach accuracy equal to the majority classifier
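The pseudo-word construction can be sketched as follows; "banana" and "door" are arbitrary example words, and the toy corpus is hypothetical:

```python
import re

def make_pseudoword(texts, w1, w2):
    """Conflate every occurrence of w1 or w2 into the pseudo-word
    'w1_w2', keeping the original word as the gold label."""
    pseudo = f"{w1}_{w2}"
    pattern = re.compile(rf"\b({w1}|{w2})\b")
    instances = []
    for text in texts:
        for m in pattern.finditer(text):
            gold = m.group(1)
            context = text[:m.start()] + pseudo + text[m.end():]
            instances.append((context, gold))
    return instances

# toy corpus: conflate "banana" and "door" into "banana_door"
corpus = ["the banana was ripe", "she opened the door slowly"]
for context, gold in make_pseudoword(corpus, "banana", "door"):
    print(gold, "->", context)
# banana -> the banana_door was ripe
# door -> she opened the banana_door slowly
```

A discrimination system then clusters the `banana_door` contexts, and the hidden gold labels give an exact measure of how well the two underlying "senses" were separated.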
Baseline Performance
        S1   S2   S3   Totals
C1      0    0    0    0
C2      0    0    0    0
C3      80   35   55   170
Totals  80   35   55   170

        S3   S2   S1   Totals
C1      0    0    0    0
C2      0    0    0    0
C3      55   35   80   170
Totals  55   35   80   170

if C3 is S3: (0+0+55)/170 = .32        if C3 is S1: (0+0+80)/170 = .47
Evaluation
Suppose that C1 is labeled S1, C2 as S2, and C3 as S3
Accuracy = (10 + 0 + 10) / 170 = 12%
The diagonal shows how many members of the cluster actually belong to
the sense given on the column
Can the columns be rearranged to improve the overall accuracy?
Optimally assign clusters to senses

        S1   S2   S3   Totals
C1      10   30   5    45
C2      20   0    40   60
C3      50   5    10   65
Totals  80   35   55   170
Evaluation
The assignment of C1 to S2, C2 to S3, and C3 to S1
results in 120/170 = 71%
Find the ordering of the
columns in the matrix that
maximizes the sum of the diagonal.
This is an instance of the
Assignment Problem from
Operations Research, or
finding the Maximal
Matching of a Bipartite
Graph from Graph Theory.

        S2   S3   S1   Totals
C1      30   5    10   45
C2      0    40   20   60
C3      5    10   50   65
Totals  35   55   80   170
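A brute-force sketch of this optimal assignment, using the confusion matrix from the example; with only a handful of clusters, trying every permutation is fine (at scale one would use the Hungarian algorithm instead):

```python
from itertools import permutations

def best_mapping(confusion, senses):
    """One-to-one assignment of clusters to senses that maximizes
    the number of correctly grouped instances."""
    total = sum(sum(row.values()) for row in confusion.values())
    best = max(permutations(senses),
               key=lambda p: sum(confusion[c][s]
                                 for c, s in zip(confusion, p)))
    correct = sum(confusion[c][s] for c, s in zip(confusion, best))
    return dict(zip(confusion, best)), correct / total

# confusion matrix from the slide (rows: clusters, columns: senses)
confusion = {"C1": {"S1": 10, "S2": 30, "S3": 5},
             "C2": {"S1": 20, "S2": 0,  "S3": 40},
             "C3": {"S1": 50, "S2": 5,  "S3": 10}}
mapping, acc = best_mapping(confusion, ["S1", "S2", "S3"])
print(mapping, round(acc, 2))  # {'C1': 'S2', 'C2': 'S3', 'C3': 'S1'} 0.71
```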
Agglomerative Approach
(Pedersen and Bruce, 1997) explore discrimination with a small number (approx. 30) of features near the target word.
Morphological form of target word (1)
Part of speech of the two words to the left and right of the target word (4)
Co-occurrences (3): most frequent content words in context
Unrestricted collocations (19): most frequent words located one position to the left or right of target, OR
Content collocations (19): most frequent content words located one position to the left or right of target
Features identified in the instances to be clustered
Similarity measured by matching coefficient
Clustered with McQuitty's Similarity Analysis, Ward's Method, and the EM Algorithm
Found that McQuitty's method was the most accurate
Experimental Evaluation
Word (instances)    Majority sense    Discrimination accuracy
Adjectives
Chief (1048)        86%               86%
Common (1060)       84%               80%
Last (3004)         94%               79%
Public (715)        68%               63%
Nouns
Bill (1341)         68%               75%
Concern (1235)      64%               68%
Drug (1127)         57%               65%
Interest (2113)     59%               65%
Line (1149)         37%               42%
Verbs
Agree (1109)        74%               69%
Close (1354)        77%               72%
Help (1267)         78%               70%
Include (1526)      91%               77%
Analysis
Unsupervised methods may not discover clusters equivalent to the classes learned in supervised learning
Evaluations that assume the sense tags represent the true clusters are likely a bit harsh. Alternatives?
Humans could look at the members of each cluster and determine the nature of the relationship or meaning that they all share
Use the contents of the cluster to generate a descriptive label that could be inspected by a human
First-order feature sets may be problematic with smaller amounts of data, since these features must occur exactly in the test instances in order to be matched
Outline
What is Unsupervised Learning?
Task Definition
Agglomerative Clustering
LSI/LSA
Sense Discrimination Using Parallel Texts
Latent Semantic Indexing/Analysis
Adapted by (Schütze, 1998) to word sense discrimination
Represent training data as a word co-occurrence matrix
Reduce the dimensionality of the co-occurrence matrix via
Singular Value Decomposition (SVD)
Significant dimensions are associated with concepts
Represent the instances of a target word to be clustered by
taking the average of all the vectors associated with all the
words in that context
Context represented by an averaged vector
Measure the similarity amongst instances via cosine and
record in a similarity matrix, or cluster the vectors directly
Co-occurrence matrix
        apple  blood  cells  ibm  data  box  tissue  graphics  memory  organ  plasma
pc      2      0      0      1    3     1    0       0         0       0      0
body    0      3      0      0    0     0    2       0         0       2      1
disk    1      0      0      2    0     3    0       1         2       0      0
petri   0      2      1      0    0     0    2       0         1       0      1
lab     0      0      3      0    2     0    2       0         2       1      3
sales   0      0      0      2    3     0    0       1         2       0      0
linux   2      0      0      1    3     2    0       1         1       0      0
debt    0      0      0      2    3     4    0       2         0       0      0
Singular Value Decomposition
A = UDV^T
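A minimal numpy sketch of this decomposition applied to the co-occurrence matrix above; the row/column order is taken from the slide, and the choice of k = 2 retained dimensions is illustrative:

```python
import numpy as np

# word-by-word co-occurrence matrix from the slide
words = ["pc", "body", "disk", "petri", "lab", "sales", "linux", "debt"]
A = np.array([
    [2, 0, 0, 1, 3, 1, 0, 0, 0, 0, 0],   # pc
    [0, 3, 0, 0, 0, 0, 2, 0, 0, 2, 1],   # body
    [1, 0, 0, 2, 0, 3, 0, 1, 2, 0, 0],   # disk
    [0, 2, 1, 0, 0, 0, 2, 0, 1, 0, 1],   # petri
    [0, 0, 3, 0, 2, 0, 2, 0, 2, 1, 3],   # lab
    [0, 0, 0, 2, 3, 0, 0, 1, 2, 0, 0],   # sales
    [2, 0, 0, 1, 3, 2, 0, 1, 1, 0, 0],   # linux
    [0, 0, 0, 2, 3, 4, 0, 2, 0, 0, 0],   # debt
], dtype=float)

# SVD: A = U @ diag(d) @ Vt; keep only the k largest singular values
U, d, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
reduced = U[:, :k] * d[:k]   # each row: one word in the reduced concept space

def cos(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# computer-related words should land closer together than unrelated ones
i = dict(zip(words, range(len(words))))
print(cos(reduced[i["pc"]], reduced[i["linux"]]))
print(cos(reduced[i["pc"]], reduced[i["body"]]))
```

Since "pc" and "linux" have nearly identical co-occurrence rows while "pc" and "body" share none, the first cosine comes out much higher than the second, which is the discrimination signal Schütze's method exploits.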
U
.35 .09 -.2 .52 -.09 .40 .02 .63 .20 -.00 -.02
.05 -.49 .59 .44 .08 -.09 -.44 -.04 -.6 -.02 -.01
.35 .13 .39 -.60 .31 .41 -.22 .20 -.39 .00 .03
.08 -.45 .25 -.02 .17