Measuring Similarity Between Concepts and Contexts
Ted Pedersen
Department of Computer Science
University of Minnesota, Duluth
http://www.d.umn.edu/~tpederse
The problems...
- Recognize similar (or related) concepts
  frog : amphibian
  Duluth : snow
- Recognize similar contexts
  I bought some food at the store :
  I purchased something to eat at the market
Similarity and Relatedness
- Two concepts are similar if they are connected by is-a relationships.
  A frog is-a-kind-of amphibian
  An illness is-a health_condition
- Two concepts can be related in many ways...
  A human has-a-part liver
  Duluth receives-a-lot-of snow
- ...similarity is one way to be related
The approaches...
- Measure conceptual similarity using a structured repository of knowledge
  Lexical database WordNet
- Measure contextual similarity using knowledge-lean methods that are based on co-occurrence information from large corpora
Why measure conceptual similarity?
- A word will take the sense that is most related to the surrounding context
  I love Java, especially the beaches and the weather.
  I love Java, especially the support for concurrent programming.
  I love java, especially first thing in the morning with a bagel.
Word Sense Disambiguation
- ...can be performed by finding the sense of a word most related to its neighbors
- Here, we define similarity and relatedness with respect to WordNet
  WordNet::Similarity - http://wn-similarity.sourceforge.net
  WordNet::SenseRelate - http://senserelate.sourceforge.net
    AllWords - assign a sense to every content word
    TargetWord - assign a sense to a given word
SenseRelate
- For each sense of a target word in context
  - For each content word in the context
    - For each sense of that content word
      - Measure the similarity/relatedness between the sense of the target word and the sense of the content word with WordNet::Similarity
      - Keep a running sum for the score of each sense of the target
- Pick the sense of the target word with the highest score with the words in context
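The nested loops above can be sketched in Python. Here `senses` and `similarity` are hypothetical, pluggable stand-ins for WordNet sense lookup and a WordNet::Similarity measure, not the actual Perl interface:

```python
def sense_relate(target, context_words, senses, similarity):
    """Pick the sense of `target` most related to its context.

    `senses(word)` returns candidate senses for a word, and
    `similarity(s1, s2)` scores a pair of senses (both assumed here).
    """
    best_sense, best_score = None, float("-inf")
    for target_sense in senses(target):
        score = 0.0  # running sum over all senses of all context words
        for word in context_words:
            for context_sense in senses(word):
                score += similarity(target_sense, context_sense)
        if score > best_score:
            best_sense, best_score = target_sense, score
    return best_sense
```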
WordNet::Similarity
- Path based measures
  Shortest path (path), Wu & Palmer (wup), Leacock & Chodorow (lch), Hirst & St-Onge (hso)
- Information content measures
  Resnik (res), Jiang & Conrath (jcn), Lin (lin)
- Gloss based measures
  Banerjee and Pedersen (lesk), Patwardhan and Pedersen (vector, vector_pairs)
[Figure: a fragment of the WordNet is-a hierarchy rooted at object, containing artifact, instrumentality, conveyance, vehicle, motor-vehicle, car, watercraft, boat, ark, article, ware, table-ware, cutlery, and fork; from Jiang and Conrath (1997)]
Path Finding
- Find the shortest is-a path between two concepts
  Rada, et al. (1989)
- Scaled by depth of hierarchy
  Leacock & Chodorow (1998)
- Depth of subsuming concept scaled by the sum of the depths of the individual concepts
  Wu and Palmer (1994)
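These path-based ideas can be sketched on a toy fragment of the hierarchy. The child-to-parent table, the edge-counting convention, and the fixed maximum depth are illustrative assumptions, not the WordNet::Similarity implementation (which differs in details such as node vs. edge counting):

```python
import math

# Toy is-a hierarchy, child -> parent (an assumed fragment).
parent = {"car": "motor-vehicle", "motor-vehicle": "vehicle",
          "boat": "watercraft", "watercraft": "vehicle",
          "vehicle": "object"}

def depth(c):
    """Depth of a concept; the root has depth 1."""
    return 1 if c not in parent else 1 + depth(parent[c])

def ancestors(c):
    chain = [c]
    while c in parent:
        c = parent[c]
        chain.append(c)
    return chain

def lcs(a, b):
    """Least common subsumer: deepest shared ancestor."""
    seen = set(ancestors(a))
    return next(c for c in ancestors(b) if c in seen)

def path_length(a, b):
    """Number of is-a edges between a and b, via their LCS."""
    s = lcs(a, b)
    return (depth(a) - depth(s)) + (depth(b) - depth(s))

def lch(a, b, max_depth=4):
    """Leacock & Chodorow: -log(path / 2D), path counted in nodes."""
    return -math.log((path_length(a, b) + 1) / (2.0 * max_depth))

def wup(a, b):
    """Wu & Palmer: LCS depth scaled by the two concepts' own depths."""
    return 2.0 * depth(lcs(a, b)) / (depth(a) + depth(b))
```

For car and boat the LCS is vehicle, so wup gives 2*2 / (4+4) = 0.5 on this toy tree.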
Information Content
- Measure of specificity in an is-a hierarchy (Resnik, 1995)
  -log(probability of concept)
  High information content values mean very specific concepts (like pitch-fork and basketball shoe)
- Count how often a concept occurs in a corpus
  Increment the count associated with that concept, and propagate the count up!
  If based on word forms, increment all concepts associated with that form
Observed "car"...
[Figure: counts after observing "car" - *root* (32783 + 1), motor vehicle (327 + 1), car (73 + 1); bus (17), cab (23), minicab (6), and stock car (12) unchanged]
Observed "stock car"...
[Figure: counts after also observing "stock car" - *root* (32784 + 1), motor vehicle (328 + 1), car (74 + 1), stock car (12 + 1); bus (17), cab (23), and minicab (6) unchanged]
After Counting Concepts...
[Figure: final counts - *root* (32785), motor vehicle (329, IC = 1.998), car (75), bus (17), cab (23), minicab (6), stock car (13, IC = 3.042)]
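The counting scheme in these figures can be sketched as follows. The hierarchy and starting counts mirror the toy example above; the base-10 logarithm is an assumption of mine that happens to reproduce the IC value shown for motor vehicle:

```python
import math

# Toy hierarchy and starting counts from the figures, child -> parent.
parent = {"motor vehicle": "*root*", "car": "motor vehicle",
          "cab": "car", "minicab": "cab",
          "bus": "motor vehicle", "stock car": "car"}

counts = {"*root*": 32783, "motor vehicle": 327, "car": 73,
          "cab": 23, "minicab": 6, "bus": 17, "stock car": 12}

def observe(concept):
    """Increment a concept's count and propagate the increment upward."""
    while True:
        counts[concept] += 1
        if concept not in parent:
            break
        concept = parent[concept]

observe("car")        # bumps car, motor vehicle, *root*
observe("stock car")  # bumps stock car, car, motor vehicle, *root*

def information_content(concept):
    """IC(c) = -log P(c), with P estimated from the propagated counts."""
    return -math.log10(counts[concept] / counts["*root*"])
```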
Similarity and Information Content
- Resnik (1995) uses the information content of the least common subsumer to express the similarity between two concepts
- Lin (1998) scales the information content of the least common subsumer by the sum of the information content of the two concepts
- Jiang & Conrath (1997) find the difference between the least common subsumer's information content and the sum of the two individual concepts'
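Written out, with `ic` and `lcs` as hypothetical stand-ins for an information content function and a least-common-subsumer lookup; the factor of 2 in the Jiang & Conrath distance follows the usual statement of that measure:

```python
def res(ic, lcs, a, b):
    """Resnik: IC of the least common subsumer."""
    return ic(lcs(a, b))

def lin(ic, lcs, a, b):
    """Lin: LCS information content scaled by the concepts' own IC."""
    return 2.0 * ic(lcs(a, b)) / (ic(a) + ic(b))

def jcn_distance(ic, lcs, a, b):
    """Jiang & Conrath distance; relatedness is commonly its inverse."""
    return ic(a) + ic(b) - 2.0 * ic(lcs(a, b))
```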
Why doesn't this solve the problem?
- Concepts must be organized in a hierarchy, and connected in that hierarchy
- Limited to comparing nouns with nouns, or maybe verbs with verbs
- Limited to similarity measures (is-a)
- What about mixed parts of speech?
  Murder (noun) and horrible (adjective)
  Tobacco (noun) and drinking (verb)
Using Dictionary Glosses to Measure Relatedness
- Lesk (1985) Algorithm - measure the relatedness of two concepts by counting the number of shared words in their definitions
  Cold - a mild viral infection involving the nose and respiratory passages (but not the lungs)
  Flu - an acute febrile highly contagious viral disease
- Adapted Lesk (Banerjee & Pedersen, 2003) - expand glosses to include those of directly related concepts
  Cold - a common cold affecting the nasal passages and resulting in congestion and sneezing and headache; mild viral infection involving the nose and respiratory passages (but not the lungs); a disease affecting the respiratory system
  Flu - an acute and highly contagious respiratory disease of swine caused by the orthomyxovirus thought to be the same virus that caused the 1918 influenza pandemic; an acute febrile highly contagious viral disease; a disease that can be communicated from one person to another
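The original overlap count can be sketched directly. Lowercasing, naive whitespace tokenization, and the absence of a stop list are simplifications of my own:

```python
def lesk_overlap(gloss1, gloss2):
    """Relatedness as the number of word types shared by two glosses."""
    return len(set(gloss1.lower().split()) & set(gloss2.lower().split()))

cold = "a mild viral infection involving the nose and respiratory passages"
flu = "an acute febrile highly contagious viral disease"
overlap = lesk_overlap(cold, flu)  # the one shared word here is "viral"
```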
Context/Gloss Vectors
- Leskian approaches require exact matches in glosses
  Glosses are short, and may use related but not identical words
- Solution? Expand glosses by replacing each content word with a co-occurrence vector derived from corpora
  Rows are words found in glosses, columns represent their co-occurring words in a corpus, and cell values are their log-likelihood
- Average the word vectors to create a single vector that represents the gloss/sense (Patwardhan & Pedersen, 2003)
- Measure relatedness using cosine rather than exact match!
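A sketch of the vector version: replace each content word with a co-occurrence vector, average them into a gloss vector, and compare gloss vectors with cosine. The three-dimensional vectors and their values are invented for illustration:

```python
import math

# Hypothetical co-occurrence vectors over three corpus dimensions.
cooc = {
    "viral":     [2.0, 0.5, 0.0],
    "infection": [1.5, 1.0, 0.0],
    "disease":   [1.0, 1.2, 0.2],
}

def gloss_vector(content_words):
    """Average the co-occurrence vectors of a gloss's content words."""
    vecs = [cooc[w] for w in content_words if w in cooc]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Glosses that share only "viral" still come out highly related.
relatedness = cosine(gloss_vector(["viral", "infection"]),
                     gloss_vector(["viral", "disease"]))
```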
Gloss/Context Vectors
[Figure]
Experiment
- Senseval-2 data consists of 73 nouns, verbs, and adjectives, with approximately 8,600 "training" examples and 4,300 "test" examples.
- Best supervised system: 64%
- SenseRelate: 53% (lesk, vector)
- Most frequent sense: 48%
Results
- SenseRelate achieves disambiguation accuracy better than the most frequent sense!
  This is more unusual than you would think.
- The window of context is defined by position; it includes 2 content words to both the left and right, which are measured against the word being disambiguated.
- Positional proximity is not always associated with semantic similarity.
Why this doesn't solve the problem...
- WordNet
  Nouns - 80,000 concepts
  Verbs - 13,000 concepts
  Adjectives - 18,000 concepts
  Adverbs - 4,000 concepts
- Words not found in WordNet can't be disambiguated by SenseRelate
Knowledge Lean Methods
- Can measure the similarity between two words by comparing co-occurrence vectors created for each.
- Can measure the similarity of two contexts by representing them as 2nd order co-occurrence vectors and comparing.
Word Sense Discrimination
- Cluster different senses of words like line or interest based on contextual similarity.
  Pedersen & Bruce, 1997; Schutze, 1998; Purandare & Pedersen, 2004
- Hard to evaluate; senses of words are somewhat ill defined, and the distinctions made by clustering methods may or may not correspond with human intuitions
- http://senseclusters.sourceforge.net
Name Discrimination
- Names that occur in similar contexts may refer to the same person.
  George Miller is an eminent psychologist.
  George Miller is one of the founders of modern cognitive science.
  George Miller is a member of the US House of Representatives.
Objective
- Given some number of contexts containing "John Smith", identify those that are similar to each other
- Group similar contexts together, and assume they are associated with a single individual
- Generate an identifying label from the content of the different clusters
Similarity of Context? Second Order Co-occurrences
- He drives his car fast / Jim speeds in his auto
- Car -> motor, garage, gasoline, insurance
  Auto -> motor, insurance, gasoline, accident
- Car and Auto occur with many of the same words. They are therefore similar!
- Less direct relationship, more resistant to sparsity!
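The car/auto example can be made concrete. Scoring the overlap of the co-occurrence sets with Jaccard is my simplification; the talk's methods compare full co-occurrence vectors:

```python
# The two words never co-occur directly, but their co-occurrence
# sets (taken from the slide) overlap heavily.
car_cooc = {"motor", "garage", "gasoline", "insurance"}
auto_cooc = {"motor", "insurance", "gasoline", "accident"}

def jaccard(a, b):
    """Shared co-occurrences over all co-occurrences."""
    return len(a & b) / len(a | b)

overlap = jaccard(car_cooc, auto_cooc)  # 3 shared of 5 total -> 0.6
```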
Feature Selection
- Bigrams - two word sequences that may have one intervening word between them
  Frequency > 1
  Log-likelihood ratio > 3.841
  OR a stop list
- Must occur within Ft positions of the target, where Ft is typically set to 5 or 20
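The log-likelihood test can be sketched from a 2x2 contingency table of bigram counts; 3.841 is the chi-square critical value at p = 0.05 with one degree of freedom, which is where the cutoff above comes from. The G^2 form here is the standard one, written out as an assumption rather than taken from the talk's code:

```python
import math

def log_likelihood_ratio(n11, n12, n21, n22):
    """G^2 statistic for a 2x2 table of observed bigram counts.

    n11 = count(w1, w2); the other cells count w1 or w2 with
    anything else, and neither word, respectively.
    """
    n = n11 + n12 + n21 + n22
    row1, row2 = n11 + n12, n21 + n22
    col1, col2 = n11 + n21, n12 + n22
    g2 = 0.0
    for obs, exp in [(n11, row1 * col1 / n), (n12, row1 * col2 / n),
                     (n21, row2 * col1 / n), (n22, row2 * col2 / n)]:
        if obs > 0:  # a zero cell contributes nothing to the sum
            g2 += 2.0 * obs * math.log(obs / exp)
    return g2
```

A table of perfectly independent counts scores 0; a strongly associated pair clears the 3.841 threshold easily.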
Second Order Context Representation
- Bigrams are used to create a matrix
  Cell values = log-likelihood of the word pair
- Rows are the co-occurrence vector for a word
- Represent a context by averaging the vectors of the words in that context
- The context includes the Cxt positions around the target, where Cxt is typically 5 or 20.
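Averaging rows of the bigram matrix into a single context vector can be sketched as follows; the two rows and their values are invented for illustration:

```python
# Hypothetical rows of the bigram matrix, one per context word,
# over seven co-occurrence dimensions.
word_vectors = {
    "won": [18.5, 24.9, 30.5, 51.7, 8.7, 0.0, 0.0],
    "guy": [0.0, 0.0, 13.6, 20.5, 0.0, 18.8, 0.0],
}

def context_vector(context_words, word_vectors):
    """Represent a context as the average of its words' row vectors."""
    rows = [word_vectors[w] for w in context_words if w in word_vectors]
    return [sum(col) / len(rows) for col in zip(*rows)]

o2 = context_vector(["won", "guy"], word_vectors)
```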
2nd Order Context Vectors
- He won an Oscar, but Tom Hanks is still a nice guy.
[Table: co-occurrence vectors for the context words "won" and "guy" over the dimensions baseball, football, actor, movie, war, family, and needle, averaged into a single O2 context vector]
Limitations of 2nd Order
[Tables: co-occurrence vectors for "weapon" and "missile" over dimensions including shoot, fire, destroy, murder, kill, execute, command, and bomb, and for contexts involving pipe, fire, CD, and burn]
Singular Value Decomposition
- What it does (for sure):
  Smoothes out zeroes
  Finds principal components
- What it might do:
  Capture polysemy
  Map word space to semantic space
After context representation...
- The second order vector is an average of the word vectors that make up the context, and captures indirect relationships
- Reduced by SVD to principal components
- Now, cluster the vectors!
  We use the method of repeated bisections (CLUTO)
Evaluation (before mapping)
[Table: counts of contexts by cluster (C1-C4) and underlying name, before clusters are mapped to names]
Evaluation (after mapping)
[Table: counts of contexts by cluster (C1-C4) and underlying name, after clusters are mapped to names]
Majority Sense Classifier
Experimental Data
- Created from the AFE GigaWord corpus
  170,969,000 words
  May 1994 - May 1997
  December 2001 - June 2002
- Created name-conflated pseudo-words
- 25 words to the left and right of the target
Name Conflated Data

Name           Count    Name                Count    New      Total    Maj.
Japan          118,712  France              112,357  JapAnce  231,069  51.4%
Jordan         25,539   Egyptian            21,762   JorGypt  46,431   53.9%
Shimon Peres   7,846    Slobodan Milosovic  6,176    MonSlo   13,734   56.0%
Microsoft      3,401    IBM                 2,406    MSIIBM   5,807    58.6%
Tajik          3,002    Rolf Ekeus          1,071    JikRol   4,073    73.7%
Ronaldo        1,652    David Beckham       740      RoBeck   2,452    69.3%
                            Cxt 5          Cxt 20
          #        Maj.     Ft 5   Ft 20   Ft 5   Ft 20
RoBeck    2,452    69.3     57.3   72.7    85.9   54.7
JikRol    4,073    73.7     94.7   96.2    91.0   90.4
MSIIBM    5,807    58.6     47.7   51.3    68.0   60.0
MonSLo    13,734   56.0     62.8   96.6    54.6   91.4
JorGypt   46,431   53.9     56.6   59.1    57.0   53.0
JapAnce   231,069  51.4     51.1   51.1    50.3   50.3
Conclusions
- Tradeoff between the size of the context and the feature selection space
  Context small / features large: a narrow window around the target word where many possible features are represented
  Context large / features small: a large window around the target word where a selective set of features is represented
- SVD didn't help or hurt
  Results shown are without SVD
Ongoing Work
- Creating path finding measures of relatedness
- Stopping clustering automatically
- Cluster labeling
- ...bringing together finding conceptual similarity and contextual similarity
Thanks to...
- WordNet::Similarity and SenseRelate
  http://wn-similarity.sourceforge.net
  http://senserelate.sourceforge.net
  Siddharth Patwardhan, Satanjeev Banerjee, Jason Michelizzi
- SenseClusters
  http://senseclusters.sourceforge.net
  Anagha Kulkarni, Amruta Purandare