lecture: word sense disambiguation
TRANSCRIPT
Semantic Analysis in Language Technology http://stp.lingfil.uu.se/~santinim/sais/2016/sais_2016.htm
Word Sense Disambiguation
Marina Santini
Department of Linguistics and Philology
Uppsala University, Uppsala, Sweden
Spring 2016
Previous Lecture: Word Senses • Homonymy, polysemy, synonymy, metonymy, etc.
Practical activities: 1) SELECTIONAL RESTRICTIONS 2) MANUAL DISAMBIGUATION OF EXAMPLES USING SENSEVAL SENSES
AIMS OF PRACTICAL ACTIVITIES: • STUDENTS SHOULD GET ACQUAINTED WITH REAL DATA • EXPLORATION OF APPLICATIONS, RESOURCES AND METHODS.
No preset solutions (this slide is to tell you that you are doing well ☺)
• Whatever your experience with data, it is a valuable experience: • Disappointment • Frustration • Feeling lost • Happiness • Power • Excitement • … • All the students so far (also in previous courses) have presented their own solutions… many different solutions, and that is OK…
J&M's own solutions: Selectional Restrictions (just for your records; it does not mean they are necessarily better than yours…)
Other possible solutions…
• Kiss → concrete sense: touching with lips/mouth
• animate kiss [using lips/mouth] animate/inanimate
• Ex: he kissed her; the dolphin kissed the kid
• Why does the pope kiss the ground after he disembarks…
• Kiss → figurative sense: touching
• animate kiss inanimate
• Ex: "Walk as if you are kissing the Earth with your feet."
pursed lips?
NO solution or comments provided for Senseval
• All your impressions and feelings are plausible and acceptable ☺
Remember that in both activities…
• You have experienced cases of POLYSEMY!
• YOU HAVE TRIED TO DISAMBIGUATE THE SENSES MANUALLY, I.E., WITH YOUR HUMAN SKILLS…
Previous lecture: end
Today: Word Sense Disambiguation (WSD) • Given:
• A word in context; • A fixed inventory of potential word senses; • Create a system that automatically decides which sense of the word is correct in that context.
Word Sense Disambiguation: Definition • Word Sense Disambiguation (WSD) is the TASK of determining the
correct sense of a word in context. • It is an automatic task: we create a system that automatically
disambiguates the senses for us.
• Useful for many NLP tasks: information retrieval (apple the fruit or Apple the company?), question answering (does United serve Philadelphia?), machine translation (Eng. "bat" → It. "pipistrello" or "mazza"?)
Anecdote: the poison apple
• In 1954, Alan Turing died after biting into an apple laced with cyanide
• It was said that this half-bitten apple inspired the Apple logo… but apparently it is a legend ☺
• http://mentalfloss.com/article/64049/did-alan-turing-inspire-apple-logo
Be alert…
• Word sense ambiguity is pervasive!!!
Acknowledgements Most slides borrowed or adapted from:
Dan Jurafsky and James H. Martin
Dan Jurafsky and Christopher Manning, Coursera
J&M (2015, draft): https://web.stanford.edu/~jurafsky/slp3/
Outline: WSD Methods
• Thesaurus/Dictionary Methods • Supervised Machine Learning • Semi-Supervised Learning (self-reading)
Word Sense Disambiguation
Dictionary and Thesaurus Methods
The Simplified Lesk algorithm
• Let's disambiguate "bank" in this sentence: The bank can guarantee deposits will eventually cover future tuition costs because it invests in adjustable-rate mortgage securities.
• given the following two WordNet senses:
16.6 • WSD: DICTIONARY AND THESAURUS METHODS 13
function SIMPLIFIED LESK(word, sentence) returns best sense of word
  best-sense ← most frequent sense for word
  max-overlap ← 0
  context ← set of words in sentence
  for each sense in senses of word do
    signature ← set of words in the gloss and examples of sense
    overlap ← COMPUTEOVERLAP(signature, context)
    if overlap > max-overlap then
      max-overlap ← overlap
      best-sense ← sense
  end
  return(best-sense)
Figure 16.6 The Simplified Lesk algorithm. The COMPUTEOVERLAP function returns the number of words in common between two sets, ignoring function words or other words on a stop list. The original Lesk algorithm defines the context in a more complex way. The Corpus Lesk algorithm weights each overlapping word w by its −log P(w) and includes labeled training corpus data in the signature.
bank1 Gloss: a financial institution that accepts deposits and channels the money into lending activities
Examples: "he cashed a check at the bank", "that bank holds the mortgage on my home"
bank2 Gloss: sloping land (especially the slope beside a body of water)
Examples: "they pulled the canoe up on the bank", "he sat on the bank of the river and watched the currents"
Sense bank1 has two non-stopwords overlapping with the context in (16.19): deposits and mortgage, while sense bank2 has zero words, so sense bank1 is chosen.
There are many obvious extensions to Simplified Lesk. The original Lesk algorithm (Lesk, 1986) is slightly more indirect. Instead of comparing a target word's signature with the context words, the target signature is compared with the signatures of each of the context words. For example, consider Lesk's example of selecting the appropriate sense of cone in the phrase pine cone given the following definitions for pine and cone.
pine 1 kinds of evergreen tree with needle-shaped leaves 2 waste away through sorrow or illness
cone 1 solid body which narrows to a point 2 something of this shape whether solid or hollow 3 fruit of certain evergreen trees
In this example, Lesk's method would select cone3 as the correct sense since two of the words in its entry, evergreen and tree, overlap with words in the entry for pine, whereas neither of the other entries has any overlap with words in the definition of pine. In general Simplified Lesk seems to work better than original Lesk.
The primary problem with either the original or simplified approaches, however, is that the dictionary entries for the target words are short and may not provide enough chance of overlap with the context.³ One remedy is to expand the list of words used in the classifier to include words related to, but not contained in, their…
³ Indeed, Lesk (1986) notes that the performance of his system seems to roughly correlate with the length of the dictionary entries.
The Simplified Lesk algorithm
The bank can guarantee deposits will eventually cover future tuition costs because it invests in adjustable-rate mortgage securities.
Choose the sense with most word overlap between gloss and context (not counting function words)
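As a sketch, the slide's procedure can be written out in a few lines of Python. The stop list and the sense inventory below are hand-made stand-ins for illustration (a real system would read glosses and examples from WordNet), and the target word itself is excluded from the overlap count:

```python
# Hand-made sense inventory and stop list for illustration only; a real
# system would read glosses and examples from WordNet.
STOPWORDS = {"a", "an", "the", "of", "in", "on", "at", "that", "he",
             "she", "it", "they", "will", "can", "and", "up", "my"}

SENSES = {
    "bank1": ("a financial institution that accepts deposits and channels "
              "the money into lending activities "
              "he cashed a check at the bank "
              "that bank holds the mortgage on my home"),
    "bank2": ("sloping land especially the slope beside a body of water "
              "they pulled the canoe up on the bank "
              "he sat on the bank of the river and watched the currents"),
}

def content_words(text):
    return {w for w in text.lower().split() if w not in STOPWORDS}

def simplified_lesk(word, senses, sentence):
    """Pick the sense whose gloss+examples share most content words
    with the sentence (the target word itself is not counted)."""
    context = content_words(sentence)
    context.discard(word)
    best_sense, max_overlap = None, -1
    for sense, signature_text in senses.items():
        overlap = len(content_words(signature_text) & context)
        if overlap > max_overlap:
            best_sense, max_overlap = sense, overlap
    return best_sense, max_overlap

sentence = ("the bank can guarantee deposits will eventually cover future "
            "tuition costs because it invests in adjustable-rate mortgage "
            "securities")
print(simplified_lesk("bank", SENSES, sentence))  # ('bank1', 2)
```

As on the slide, bank1 wins with the two overlapping content words deposits and mortgage.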
Drawback
• Glosses and examples might be too short and may not provide enough chance to overlap with the context of the word to be disambiguated.
The Corpus(-based) Lesk algorithm
• Assumes we have some sense-labeled data (like SemCor) • Take all the sentences with the relevant word sense:
These short, "streamlined" meetings usually are sponsored by local banks1, Chambers of Commerce, trade associations, or other civic organizations.
• Now add these to the gloss + examples for each sense, and call it the "signature" of a sense. Basically, it is an expansion of the dictionary entry.
• Choose the sense with most word overlap between context and signature (i.e., the context words provided by the resources).
Corpus Lesk: IDF weighting • Instead of just removing function words
• Weigh each word by its 'promiscuity' across documents • Down-weights words that occur in every 'document' (gloss, example, etc.) • These are generally function words, but this is a more fine-grained measure
• Weigh each overlapping word by inverse document frequency (IDF).
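The IDF idea can be sketched as follows; the three-"document" mini-corpus is invented purely to show that a word occurring in every document gets weight 0, while a rare content word gets a high weight:

```python
import math

# 'Documents' here are glosses/examples; the mini-corpus is invented.
docs = [
    "the bank holds the mortgage on the home",
    "the canoe on the bank of the river",
    "the deposits in the bank",
]

def idf(word, docs):
    """Inverse document frequency: low for 'promiscuous' words that
    occur in every document, high for words confined to a few."""
    df = sum(1 for d in docs if word in d.split())
    return math.log(len(docs) / df) if df else 0.0

print(idf("the", docs))       # occurs in every document -> 0.0
print(idf("mortgage", docs))  # occurs in one document -> log 3, about 1.10
```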
Graph-based methods • First, WordNet can be viewed as a graph
• senses are nodes • relations (hypernymy, meronymy) are edges • Also add edges between words and unambiguous gloss words
[Figure: a WordNet subgraph centred on drink_v1, with sense nodes such as toast_n4, drinker_n1, drinking_n1, potation_n1, sip_n1, sip_v1, beverage_n1, milk_n1, liquid_n1, food_n1, drink_n1, helping_n1, sup_v1, consumption_n1, consumer_n1 and consume_v1, connected by relation edges.]
An undirected graph is a set of nodes that are connected by bidirectional edges (lines).
How to use the graph for WSD
“She drank some milk” • choose the most central sense (several algorithms have been proposed recently)
[Figure: a subgraph linking the senses of "drink" (drink_v1–drink_v5, drink_n1) and "milk" (milk_n1–milk_n4) through nodes such as drinker_n1, beverage_n1, boozing_n1, food_n1 and nutriment_n1.]
Word Meaning and Similarity
Word Similarity: Thesaurus Methods
Word Similarity
• Synonymy: a binary relation • Two words are either synonymous or not
• Similarity (or distance): a looser metric • Two words are more similar if they share more features of meaning
• Similarity is properly a relation between senses • We do not say "the word bank is similar to the word slope"; rather, we say:
• Bank1 is similar to fund3
• Bank2 is similar to slope5
• But we'll compute similarity over both words and senses
Why word similarity
• Information retrieval • Question answering • Machine translation • Natural language generation • Language modeling • Automatic essay grading • Plagiarism detection • Document clustering
Word similarity and word relatedness
• We often distinguish word similarity from word relatedness • Similar words: near-synonyms • car, bicycle: similar
• Related words: can be related in any way • car, gasoline: related, not similar
Cf. Synonyms: car & automobile
Two classes of similarity algorithms
• Thesaurus-based algorithms • Are words "nearby" in the hypernym hierarchy? • Do words have similar glosses (definitions)?
• Distributional algorithms: next time! • Do words have similar distributional contexts?
Path-based similarity
• Two concepts (senses/synsets) are similar if they are near each other in the thesaurus hierarchy • = have a short path between them • concepts have path 1 to themselves
Refinements to path-based similarity
• pathlen(c1,c2) = (distance metric) = 1 + number of edges in the shortest path in the hypernym graph between sense nodes c1 and c2
• simpath(c1,c2) = 1 / pathlen(c1,c2)
• wordsim(w1,w2) = max sim(c1,c2) over c1 ∈ senses(w1), c2 ∈ senses(w2)
Sense similarity metric: 1 over the distance!
Word similarity metric: max similarity among pairs of senses.
For all senses of w1 and all senses of w2, take the similarity between each of the senses of w1 and each of the senses of w2 and then take the maximum similarity between those pairs.
Example: path-based similarity simpath(c1,c2) = 1/pathlen(c1,c2)
simpath(nickel,coin) = 1/2 = .5
simpath(fund,budget) = 1/2 = .5
simpath(nickel,currency) = 1/4 = .25
simpath(nickel,money) = 1/6 = .17
simpath(coinage,Richter scale) = 1/6 = .17
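These numbers can be reproduced on a toy hypernym graph. The chain of edges below is an assumption, reconstructing just enough of the WordNet fragment behind the slide's examples:

```python
from collections import deque

# Assumed hypernym chain: money > medium_of_exchange > currency > coinage > coin > nickel
edges = [("money", "medium_of_exchange"), ("medium_of_exchange", "currency"),
         ("currency", "coinage"), ("coinage", "coin"), ("coin", "nickel")]

graph = {}
for a, b in edges:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)

def pathlen(c1, c2):
    """1 + number of edges on the shortest path (BFS, undirected)."""
    seen, frontier = {c1}, deque([(c1, 0)])
    while frontier:
        node, d = frontier.popleft()
        if node == c2:
            return d + 1
        for nb in graph.get(node, ()):
            if nb not in seen:
                seen.add(nb)
                frontier.append((nb, d + 1))
    return None

def simpath(c1, c2):
    return 1 / pathlen(c1, c2)

print(simpath("nickel", "coin"))      # 1/2 = 0.5
print(simpath("nickel", "currency"))  # 1/4 = 0.25
print(simpath("nickel", "money"))     # 1/6, about 0.17
```

Note that `pathlen(c, c)` returns 1, matching the slide's "concepts have path 1 to themselves".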
Problem with basic path-based similarity
• Assumes each link represents a uniform distance • But nickel-to-money seems to us to be closer than nickel-to-standard
• Nodes high in the hierarchy are very abstract • We instead want a metric that
• Represents the cost of each edge independently • Words connected only through abstract nodes • are less similar
Information content similarity metrics
• In simple words: • We define the probability of a concept C as the probability that a randomly selected word in a corpus is an instance of that concept.
• Basically, for each random word in a corpus we compute how probable it is that it belongs to a certain concept.
Resnik 1995. Using information content to evaluate semantic similarity in a taxonomy. IJCAI
Formally: Information content similarity metrics
• Let's define P(c) as: • The probability that a randomly selected word in a corpus is an instance of concept c
• Formally: there is a distinct random variable, ranging over words, associated with each concept in the hierarchy • for a given concept, each observed noun is either
• a member of that concept with probability P(c) • not a member of that concept with probability 1−P(c)
• All words are members of the root node (Entity) • P(root) = 1
• The lower a node is in the hierarchy, the lower its probability
Resnik 1995. Using information content to evaluate semantic similarity in a taxonomy. IJCAI
Information content similarity
• For every concept (e.g., "natural elevation"), we count all the words in that concept, and then we normalize by the total number of words in the corpus.
• We get a probability value that tells us how probable it is that a random word is an instance of that concept.
P(c) = ( Σ_{w ∈ words(c)} count(w) ) / N
[Figure: WordNet fragment — entity > … > geological-formation, whose descendants include shore (> coast), natural elevation (> hill, ridge), and cave (> grotto).]
In order to compute the probability of the concept "natural elevation", we count ridge, hill, and natural elevation itself.
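The counting just described can be sketched in code; the word counts below are invented for illustration, and words(c) is spelled out by hand as the concept plus everything it subsumes:

```python
# Invented counts for words under geological-formation; N plays the
# role of the corpus size.
counts = {"ridge": 2, "hill": 5, "natural_elevation": 1,
          "shore": 3, "coast": 4, "cave": 2, "grotto": 1}
N = sum(counts.values())

def p(members):
    """P(c): normalized count of every word in words(c)."""
    return sum(counts[w] for w in members) / N

# words("natural elevation") = ridge + hill + natural elevation itself
p_nat_elev = p(["ridge", "hill", "natural_elevation"])
print(p_nat_elev)  # (2 + 5 + 1) / 18, about 0.44
```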
Information content similarity • WordNet hierarchy augmented with probabilities P(c)
D. Lin. 1998. An Information-Theoretic Definition of Similarity. ICML 1998
Information content: definitions
1. Information content: IC(c) = −log P(c)
2. Most informative subsumer (lowest common subsumer): LCS(c1,c2) = the most informative (lowest) node in the hierarchy subsuming both c1 and c2
IC aka…
• A lot of people prefer the term surprisal to information or to information content.
−log p(x) measures the amount of surprise generated by the event x. The smaller the probability of x, the bigger the surprisal. It's helpful to think about it this way, particularly for linguistics examples.
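A two-line illustration of IC(c) = −log P(c), with made-up probabilities:

```python
import math

# IC(c) = -log P(c): the rarer the concept, the higher the surprisal.
def ic(p):
    return -math.log(p)

print(ic(1.0))    # the root ("entity") has P = 1 -> IC 0: no surprise
print(ic(0.001))  # a low-probability concept -> high IC (about 6.9)
```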
Using information content for similarity: the Resnik method
• The similarity between two words is related to their common information
• The more two words have in common, the more similar they are
• Resnik: measure common information as: • The information content of the most informative (lowest) subsumer (MIS/LCS) of the two nodes
• simresnik(c1,c2) = −log P(LCS(c1,c2))
Philip Resnik. 1995. Using Information Content to Evaluate Semantic Similarity in a Taxonomy. IJCAI 1995. Philip Resnik. 1999. Semantic Similarity in a Taxonomy: An Information-Based Measure and its Application to Problems of Ambiguity in Natural Language. JAIR 11, 95-130.
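For a concrete number: geological-formation is the LCS of hill and coast, and the Lin example slide gives P(geological-formation) = 0.00176, so Resnik similarity works out as:

```python
import math

# P(LCS) = 0.00176 is taken from the hill/coast example in the Lin slide.
def sim_resnik(p_lcs):
    return -math.log(p_lcs)

print(sim_resnik(0.00176))  # about 6.34
```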
Dekang Lin method
• Intuition: similarity between A and B is not just what they have in common
• The more differences between A and B, the less similar they are: • Commonality: the more A and B have in common, the more similar they are • Difference: the more differences between A and B, the less similar they are
• Commonality: IC(common(A,B)) • Difference: IC(description(A,B)) − IC(common(A,B))
Dekang Lin. 1998. An Information-Theoretic Definition of Similarity. ICML
Dekang Lin similarity theorem • The similarity between A and B is measured by the ratio between the amount of information needed to state the commonality of A and B and the information needed to fully describe what A and B are
simLin(A,B) ∝ IC(common(A,B)) / IC(description(A,B))
• Lin (altering Resnik) defines IC(common(A,B)) as 2 × the information of the LCS
simLin(c1,c2) = 2 log P(LCS(c1,c2)) / (log P(c1) + log P(c2))
Lin similarity function
simLin(c1,c2) = 2 log P(LCS(c1,c2)) / (log P(c1) + log P(c2))
simLin(hill, coast) = 2 log P(geological-formation) / (log P(hill) + log P(coast))
= 2 ln 0.00176 / (ln 0.0000189 + ln 0.0000216) = .59
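The slide's arithmetic can be checked directly:

```python
import math

def sim_lin(p_lcs, p1, p2):
    return 2 * math.log(p_lcs) / (math.log(p1) + math.log(p2))

s = sim_lin(0.00176, 0.0000189, 0.0000216)  # hill, coast
print(round(s, 2))  # 0.59
```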
The (extended) Lesk Algorithm
• A thesaurus-based measure that looks at glosses • Two concepts are similar if their glosses contain similar words
• Drawing paper: paper that is specially prepared for use in drafting • Decal: the art of transferring designs from specially prepared paper to a wood or glass or metal surface
• For each n-word phrase that's in both glosses • Add a score of n² • "paper" and "specially prepared": 1 + 2² = 5 • Compute overlap also for other relations • glosses of hypernyms and hyponyms
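A sketch of the overlap scoring on the two glosses above. The phrase matching keeps only maximal common phrases, so "specially" and "prepared" are not double-counted inside "specially prepared":

```python
def ngrams(words, n):
    return {tuple(words[i:i+n]) for i in range(len(words) - n + 1)}

def extended_lesk_score(gloss1, gloss2):
    """Sum n**2 over the maximal phrases shared by the two glosses."""
    w1, w2 = gloss1.lower().split(), gloss2.lower().split()
    common = set()
    for n in range(1, min(len(w1), len(w2)) + 1):
        common |= ngrams(w1, n) & ngrams(w2, n)

    def inside(short, long_):
        return any(long_[i:i+len(short)] == short
                   for i in range(len(long_) - len(short) + 1))

    # keep only phrases not contained in a longer common phrase
    maximal = [p for p in common
               if not any(p != q and inside(p, q) for q in common)]
    return sum(len(p) ** 2 for p in maximal)

drawing_paper = "paper that is specially prepared for use in drafting"
decal = ("the art of transferring designs from specially prepared paper "
         "to a wood or glass or metal surface")
print(extended_lesk_score(drawing_paper, decal))  # 1 + 2**2 = 5
```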
Summary: thesaurus-based similarity
Libraries for computing thesaurus-based similarity
• NLTK • http://nltk.github.com/api/nltk.corpus.reader.html?highlight=similarity#nltk.corpus.reader.WordNetCorpusReader.res_similarity
• WordNet::Similarity • http://wn-similarity.sourceforge.net/ • Web-based interface:
• http://marimba.d.umn.edu/cgi-bin/similarity/similarity.cgi
Machine Learning based approach
Basic idea
• If we have data that has been hand-labelled with correct word senses, we can use a supervised learning approach and learn from it! • We need to extract features and train a classifier • The output of training is an automatic system capable of assigning sense labels to unlabelled words in context.
Two variants of WSD task • Lexical Sample task
• (we need labelled corpora for individual senses) • Small pre-selected set of target words (e.g., difficulty) • An inventory of senses for each word • Supervised machine learning: train a classifier for each word
• All-words task • (each word in each sentence is labelled with a sense) • Every word in an entire text • A lexicon with senses for each word
SENSEVAL 1-2-3
Supervised Machine Learning Approaches
• Summary of what we need: • the tag set ("sense inventory") • the training corpus • a set of features extracted from the training corpus • a classifier
Supervised WSD 1: WSD Tags
• What's a tag? A dictionary sense?
• For example, for WordNet an instance of "bass" in a text has 8 possible tags or labels (bass1 through bass8).
8 senses of "bass" in WordNet
1. bass - (the lowest part of the musical range)
2. bass, bass part - (the lowest part in polyphonic music)
3. bass, basso - (an adult male singer with the lowest voice)
4. sea bass, bass - (flesh of lean-fleshed saltwater fish of the family Serranidae)
5. freshwater bass, bass - (any of various North American lean-fleshed freshwater fishes especially of the genus Micropterus)
6. bass, bass voice, basso - (the lowest adult male singing voice)
7. bass - (the member with the lowest range of a family of musical instruments)
8. bass - (nontechnical name for any of numerous edible marine and freshwater spiny-finned fishes)
SemCor
<wf pos=PRP>He</wf> <wf pos=VB lemma=recognize wnsn=4 lexsn=2:31:00::>recognized</wf> <wf pos=DT>the</wf> <wf pos=NN lemma=gesture wnsn=1 lexsn=1:04:00::>gesture</wf> <punc>.</punc>
SemCor: 234,000 words from Brown Corpus, manually tagged with WordNet senses
Supervised WSD: Extract feature vectors Intuition from Warren Weaver (1955):
"If one examines the words in a book, one at a time as through an opaque mask with a hole in it one word wide, then it is obviously impossible to determine, one at a time, the meaning of the words… But if one lengthens the slit in the opaque mask, until one can see not only the central word in question but also say N words on either side, then if N is large enough one can unambiguously decide the meaning of the central word… The practical question is: 'What minimum value of N will, at least in a tolerable fraction of cases, lead to the correct choice of meaning for the central word?'"
the window
Feature vectors
• Vectors of sets of feature/value pairs
Two kinds of features in the vectors
• Collocational features and bag-of-words features
• Collocational/Paradigmatic • Features about words at specific positions near the target word • Often limited to just word identity and POS
• Bag-of-words • Features about words that occur anywhere in the window (regardless of position) • Typically limited to frequency counts
Generally speaking, a collocation is a sequence of words or terms that co-occur more often than would be expected by chance. But here the meaning is not exactly this…
Examples
• Example text (WSJ):
An electric guitar and bass player stand off to one side not really part of the scene
• Assume a window of ±2 from the target
Collocational features
• Position-specific information about the words and collocations in the window
• guitar and bass player stand
• word 1-, 2-, 3-grams in a window of ±3 is common
10 CHAPTER 16 • COMPUTING WITH WORD SENSES
…manually tagged with WordNet senses (Miller et al. 1993, Landes et al. 1998). In addition, sense-tagged corpora have been built for the SENSEVAL all-word tasks. The SENSEVAL-3 English all-words test data consisted of 2081 tagged content word tokens, from 5,000 total running words of English from the WSJ and Brown corpora (Palmer et al., 2001).
The first step in supervised training is to extract features that are predictive of word senses. The insight that underlies all modern algorithms for word sense disambiguation was famously first articulated by Weaver (1955) in the context of machine translation:
If one examines the words in a book, one at a time as through an opaque mask with a hole in it one word wide, then it is obviously impossible to determine, one at a time, the meaning of the words. [...] But if one lengthens the slit in the opaque mask, until one can see not only the central word in question but also say N words on either side, then if N is large enough one can unambiguously decide the meaning of the central word. [...] The practical question is: "What minimum value of N will, at least in a tolerable fraction of cases, lead to the correct choice of meaning for the central word?"
We first perform some processing on the sentence containing the window, typically including part-of-speech tagging, lemmatization, and, in some cases, syntactic parsing to reveal headwords and dependency relations. Context features relevant to the target word can then be extracted from this enriched input. A feature vector consisting of numeric or nominal values encodes this linguistic information as an input to most machine learning algorithms.
Two classes of features are generally extracted from these neighboring contexts, both of which we have seen previously in part-of-speech tagging: collocational features and bag-of-words features. A collocation is a word or series of words in a position-specific relationship to a target word (i.e., exactly one word to the right, or the two words starting 3 words to the left, and so on). Thus, collocational features encode information about specific positions located to the left or right of the target word. Typical features extracted for these context words include the word itself, the root form of the word, and the word's part-of-speech. Such features are effective at encoding local lexical and grammatical information that can often accurately isolate a given sense.
For example consider the ambiguous word bass in the following WSJ sentence:
(16.17) An electric guitar and bass player stand off to one side, not really part of the scene, just as a sort of nod to gringo expectations perhaps.
A collocational feature vector, extracted from a window of two words to the right and left of the target word, made up of the words themselves, their respective parts-of-speech, and pairs of words, that is,
[w_{i−2}, POS_{i−2}, w_{i−1}, POS_{i−1}, w_{i+1}, POS_{i+1}, w_{i+2}, POS_{i+2}, w^{i−1}_{i−2}, w^{i+1}_{i}]   (16.18)
would yield the following vector:
[guitar, NN, and, CC, player, NN, stand, VB, and guitar, player stand]
High performing systems generally use POS tags and word collocations of length 1, 2, and 3 from a window of words 3 to the left and 3 to the right (Zhong and Ng, 2010).
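A minimal sketch of this feature extraction for the bass example. The POS tags are hard-coded for the example sentence rather than produced by a tagger, and the word-pair features are emitted in surface order (left pair "guitar and"):

```python
# Tokens of "an electric guitar and bass player stand off", with
# hard-coded POS tags (a real system would run a tagger first).
tokens = [("an", "DT"), ("electric", "JJ"), ("guitar", "NN"), ("and", "CC"),
          ("bass", "NN"), ("player", "NN"), ("stand", "VB"), ("off", "RP")]

def collocational_features(tokens, i, window=2):
    """Words and POS tags in a +/-window around position i, plus the
    word pairs immediately to the left and right of the target."""
    feats = []
    for j in range(i - window, i + window + 1):
        if j == i:
            continue
        w, pos = tokens[j]
        feats += [w, pos]
    feats.append(tokens[i-2][0] + " " + tokens[i-1][0])  # left word pair
    feats.append(tokens[i+1][0] + " " + tokens[i+2][0])  # right word pair
    return feats

print(collocational_features(tokens, tokens.index(("bass", "NN"))))
# ['guitar', 'NN', 'and', 'CC', 'player', 'NN', 'stand', 'VB',
#  'guitar and', 'player stand']
```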
The second type of feature consists of bag-of-words information about neighboring words. A bag-of-words means an unordered set of words, with their exact position ignored.
Bag-‐of-‐words features
• "an unordered set of words" – position ignored • Choose a vocabulary: a useful subset of words in a training corpus
• Either: the count of how often each of those terms occurs in a given window OR just a binary "indicator" 1 or 0
Co-‐Occurrence Example
• Assume we've settled on a possible vocabulary of 12 words in "bass" sentences:
[fishing, big, sound, player, fly, rod, pound, double, runs, playing, guitar, band]
• The vector for: guitar and bass player stand [0,0,0,1,0,0,0,0,0,0,1,0]
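The slide's vector can be reproduced with a binary bag-of-words encoder over the fixed 12-word vocabulary:

```python
vocab = ["fishing", "big", "sound", "player", "fly", "rod", "pound",
         "double", "runs", "playing", "guitar", "band"]

def bow_vector(context, vocab):
    """Binary indicator vector: 1 if the vocabulary word occurs
    anywhere in the context window, 0 otherwise."""
    words = set(context.lower().split())
    return [1 if v in words else 0 for v in vocab]

print(bow_vector("guitar and bass player stand", vocab))
# [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]
```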
Word Sense Disambiguation
Classification
Classification • Input: • a word w and some features f • a fixed set of classes C = {c1, c2,…, cJ}
• Output: a predicted class c ∈ C
Any kind of classifier • Naive Bayes • Logistic regression • Neural networks • Support-vector machines • k-Nearest Neighbors • etc.
The end