TRANSCRIPT
Qualitative Differences Between Human Data and Co-occurrence Models of Semantic Similarity
Gabriel Recchia
• Corpus-based methods
• Pointwise Mutual Information (Church & Hanks, 1989)
• Spreading Activation (Anderson & Pirolli, 1984; Farahat, Pirolli, & Markova, 2004)
• Latent Semantic Analysis (Landauer & Dumais, 1997)
• Generalized Latent Semantic Analysis (Matveeva et al., 2005)
• WordNet-based methods
• Resnik (1995), Jiang & Conrath (1997), Hirst & St-Onge (1998), Leacock & Chodorow (1998), Lin (1998), Pedersen et al. (2004)
Background
Compares the probability of observing word x and word y together (the joint probability) with the probabilities of observing x and y independently (chance)
1. Pointwise Mutual Information (PMI)
I(x,y) = log₂ [ P(x,y) / (P(x) P(y)) ]
1. Pointwise Mutual Information (PMI)
I(x,y) ≈ log₂ [ N · (# of docs containing x and y) / ((# of docs containing x) · (# of docs containing y)) ]
(N = total number of documents, so each document-count ratio estimates the corresponding probability)
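The document-count estimate above can be sketched in a few lines of Python (the corpus and word choices here are toy placeholders, not the talk's data):

```python
import math

def pmi(docs, x, y):
    """Estimate PMI(x, y) from document counts.

    P(x) is estimated as (# docs containing x) / N and P(x, y) as
    (# docs containing both) / N, so the ratio reduces to
    N * n_xy / (n_x * n_y).
    """
    n = len(docs)
    nx = sum(1 for d in docs if x in d)
    ny = sum(1 for d in docs if y in d)
    nxy = sum(1 for d in docs if x in d and y in d)
    if nxy == 0 or nx == 0 or ny == 0:
        # No co-occurrence evidence; this is the "similarity of 0" case
        # the talk highlights for many word pairs.
        return 0.0
    return math.log2(n * nxy / (nx * ny))

docs = [{"cat", "dog", "pet"}, {"cat", "pet"}, {"dog", "bone"}, {"stock", "market"}]
print(pmi(docs, "cat", "pet"))     # → 1.0 (co-occur more than chance)
print(pmi(docs, "cat", "market"))  # → 0.0 (never co-occur)
```

Note that pairs that never co-occur all collapse to the same value, which is the distributional quirk examined later in the talk.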
Compares the probability of observing word x and word y together (the joint probability) with the probabilities of observing x and y independently (chance)
• PMI is a metric, not a process model
• Like PMI, simple vector space models
– reward co-occurrences
– penalize highly frequent words
• Unlike PMI, they also store latent/indirect similarity information
2. Vector Addition Model
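The talk does not spell out the vector addition model's implementation details; as a generic illustration of the vector-space idea (word-by-document count vectors compared by cosine; all names and data here are assumptions):

```python
import math

def word_vectors(docs):
    """Map each word to its vector of per-document counts (one dimension per document)."""
    vocab = {w for d in docs for w in d}
    return {w: [d.count(w) for d in docs] for w in vocab}

def cosine(u, v):
    """Cosine similarity between two count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [["cat", "dog", "pet"], ["cat", "pet"], ["dog", "bone"]]
vecs = word_vectors(docs)
print(cosine(vecs["cat"], vecs["pet"]))   # high: appear in the same documents
print(cosine(vecs["cat"], vecs["bone"]))  # 0.0: no shared documents
```

Unlike PMI, two words that never co-occur can still come out similar here if they occur in overlapping sets of documents with other words.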
• Build a term-document matrix where element (i,j) describes the frequency of term i in document j
• Apply log-entropy weighting scheme to decrease the weight of high-frequency words
• Use singular value decomposition to find an approximation to the term-document matrix with lower rank k
• Optimize k for the task at hand
3. Latent Semantic Analysis (LSA)
(Landauer & Dumais, 1997)
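The four steps above can be sketched with NumPy. The counts are a toy example, and the weighting formula is the common log-entropy variant (local log(1 + tf) times an entropy-based global weight), an assumption since the talk does not spell it out:

```python
import numpy as np

def log_entropy(tdm):
    """Log-entropy weighting: down-weights words spread evenly across documents."""
    tf = np.asarray(tdm, dtype=float)
    n_docs = tf.shape[1]
    p = tf / np.maximum(tf.sum(axis=1, keepdims=True), 1e-12)
    with np.errstate(divide="ignore", invalid="ignore"):
        ent = np.where(p > 0, p * np.log(p), 0.0).sum(axis=1)
    g = 1.0 + ent / np.log(n_docs)  # ~1 for focused words, ~0 for uniform ones
    return np.log1p(tf) * g[:, None]

def lsa(tdm, k):
    """Rank-k SVD approximation; rows are k-dimensional word vectors."""
    u, s, _ = np.linalg.svd(log_entropy(tdm), full_matrices=False)
    return u[:, :k] * s[:k]

# Toy term-document counts: rows = words, columns = documents.
tdm = np.array([[2, 1, 0, 0],
                [1, 2, 0, 0],
                [0, 0, 3, 1],
                [0, 0, 1, 3]])
vecs = lsa(tdm, k=2)
sim = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(sim(vecs[0], vecs[1]))  # words 0 and 1 share documents: high
print(sim(vecs[0], vecs[2]))  # disjoint documents: near zero
```

In practice k would be optimized per task, as the slide notes, rather than fixed at 2.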
• Forced-choice synonymy tests
• TOEFL synonymy test (Landauer & Dumais, 1997) (TOEFL)
• ESL synonymy test (Turney, 2001) (ESL)
• Semantic similarity judgments
• Rubenstein & Goodenough, 1965 (RG)
• Miller & Charles, 1991 (MC)
• Resnik, 1995 (R)
• Finkelstein et al.’s “WordSimilarity 353,” 2002 (WS353)
Task PMI Vectors
ESL .35 .28
TOEFL .41 .33
RG .47 .14
R .46 .17
MC .46 .12
WS353 .54 .57
Trained on small Wikipedia subset:
PMI vs. LSA
Task PMI (Wiki subset) LSA (Wiki subset)
ESL .35 .36
TOEFL .41 .44
RG .47 .62
R .46 .60
MC .46 .46
WS353 .54 .57
PMI using full Wikipedia
Task PMI WN NSS.F NSS.T SA.N SA.W LSA.T
ESL .62 .70 .44 .56 .39 .51 .44
TOEFL .64 .87 .59 .50 .61 .59 .55
RG .78 .88 .62 .53 .49 .39 .69
R .86 .90 .56 .54 .49 .52 .74
MC .76 .77 .61 .56 .45 .45 .61
WS353 .73 .46 .60 .59 .40 .38 .60
• WN: Wordnet::Similarity vector measure
• NSS.F: Normalized Search Similarity, using Factiva business news corpus
• NSS.T: Normalized Search Similarity, using TASA corpus
• SA.N: Spreading Activation, using Google counts restricted to nytimes.com
• SA.W: Spreading Activation, using Google counts restricted to wikipedia.org
• LSA.T: LSA, using TASA corpus
PMI vs. Vector Addition vs. LSA
Task PMI (Wikipedia) Vectors (Wikipedia) LSA (Wiki subset)
ESL .62 .44 .36
TOEFL .64 .54 .44
RG .78 .62 .62
R .86 .73 .60
MC .76 .68 .46
WS353 .73 .45 .57
Task cor(pmi, hum) cor(vec, hum) cor(pmi, vec)
ESL -- -- --
TOEFL -- -- --
RG .78 .62 .52
R .86 .73 .64
MC .76 .68 .57
WS353 .73 .45 .54
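Correlations like those in the table can be computed as Spearman rank correlations; a sketch using SciPy's `spearmanr` (the scores below are made-up placeholders, not the talk's data):

```python
from scipy.stats import spearmanr

# Hypothetical similarity scores for the same five word pairs (illustrative only).
human = [3.9, 3.5, 2.8, 1.1, 0.4]
pmi   = [2.1, 2.6, 1.9, 0.0, 0.0]  # note PMI's pile-up at 0 for non-co-occurring pairs

rho, _ = spearmanr(human, pmi)
print(round(rho, 2))  # → 0.87
```

The tied zeros get averaged ranks, so a high rank correlation can coexist with a very different distribution of scores, which is exactly the point the scatter plots below make.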
[Scatter plots, Rubenstein & Goodenough word pairs: Rank (PMI) and Rank (Vectors) vs. Rank (Humans), from most similar to least similar]
[Scatter plots, Resnik word pairs: Rank (PMI) and Rank (Vectors) vs. Rank (Humans), from most similar to least similar]
[Scatter plots, Miller & Charles word pairs: Rank (PMI) and Rank (Vectors) vs. Rank (Humans), from most similar to least similar]
[Scatter plots, WordSim353 word pairs: Rank (PMI) and Rank (Vectors) vs. Rank (Humans), from most similar to least similar]
Proportion of Word Pairs to Which PMI Assigns a Similarity Value of 0, by Task
ESL TOEFL RG R MC WS353
.26 .34 .25 .21 .20 .04
[Similarity distributions for Humans, LSA, Vectors, and PMI: Rubenstein & Goodenough word pairs]
[Similarity distributions for Humans, LSA, Vectors, and PMI: Resnik word pairs]
[Similarity distributions for Humans, LSA, Vectors, and PMI: Miller & Charles word pairs]
[Similarity distributions for Humans, LSA, Vectors, and PMI: WordSim353 word pairs]
[Similarity distributions for Humans (forward strengths), LSA, Vectors, and PMI: USF Free Association Norms]
• Looking at rank correlations alone obscures important distributional properties that should not be ignored if the goal is to emulate human semantic representations
• Closer attention to qualitative trends should guide model design