TRANSCRIPT
Qualitative Differences Between Human Data and Co-occurrence Models of Semantic Similarity
Gabriel Recchia
• Corpus-based methods
• Pointwise Mutual Information (Church & Hanks, 1989)
• Spreading Activation (Anderson & Pirolli, 1984; Farahat, Pirolli, & Markova, 2004)
• Latent Semantic Analysis (Landauer & Dumais, 1997)
• Generalized Latent Semantic Analysis (Matveeva et al., 2005)
• WordNet-based methods
• Resnik (1995), Jiang & Conrath (1997), Hirst & St-Onge (1998), Leacock & Chodorow (1998), Lin (1998), Pedersen et al. (2004)
Background
Compares the probability of observing word x and word y together (the joint probability) with the probabilities of observing x and y independently (chance)
1. Pointwise Mutual Information (PMI)
I(x,y) = log₂ [ P(x,y) / (P(x) P(y)) ]
1. Pointwise Mutual Information (PMI)
I(x,y) ≈ log₂ [ N · (# of docs containing x and y) / ((# of docs containing x) · (# of docs containing y)) ]
(N = total number of documents, so each document-count ratio estimates the corresponding probability)
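The document-count estimate above can be sketched in a few lines of Python (the corpus and word choices here are toy placeholders, not the talk's data):

```python
import math

def pmi(docs, x, y):
    """Estimate PMI(x, y) from document counts.

    P(x) is estimated as (# docs containing x) / N and P(x, y) as
    (# docs containing both) / N, so the ratio reduces to
    N * n_xy / (n_x * n_y).
    """
    n = len(docs)
    nx = sum(1 for d in docs if x in d)
    ny = sum(1 for d in docs if y in d)
    nxy = sum(1 for d in docs if x in d and y in d)
    if nxy == 0 or nx == 0 or ny == 0:
        # No co-occurrence evidence; this is the "similarity of 0" case
        # the talk highlights for many word pairs.
        return 0.0
    return math.log2(n * nxy / (nx * ny))

docs = [{"cat", "dog", "pet"}, {"cat", "pet"}, {"dog", "bone"}, {"stock", "market"}]
print(pmi(docs, "cat", "pet"))     # → 1.0 (co-occur more than chance)
print(pmi(docs, "cat", "market"))  # → 0.0 (never co-occur)
```

Note that pairs that never co-occur all collapse to the same value, which is the distributional quirk examined later in the talk.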
Compares the probability of observing word x and word y together (the joint probability) with the probabilities of observing x and y independently (chance)
• PMI is a metric, not a process model
• Like PMI, simple vector space models
– reward co-occurrences
– penalize highly frequent words
• Unlike PMI, they also store latent/indirect similarity information
2. Vector Addition Model
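The talk does not spell out the vector addition model's implementation details; as a generic illustration of the vector-space idea (word-by-document count vectors compared by cosine; all names and data here are assumptions):

```python
import math

def word_vectors(docs):
    """Map each word to its vector of per-document counts (one dimension per document)."""
    vocab = {w for d in docs for w in d}
    return {w: [d.count(w) for d in docs] for w in vocab}

def cosine(u, v):
    """Cosine similarity between two count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [["cat", "dog", "pet"], ["cat", "pet"], ["dog", "bone"]]
vecs = word_vectors(docs)
print(cosine(vecs["cat"], vecs["pet"]))   # high: appear in the same documents
print(cosine(vecs["cat"], vecs["bone"]))  # 0.0: no shared documents
```

Unlike PMI, two words that never co-occur can still come out similar here if they occur in overlapping sets of documents with other words.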
• Build a term-document matrix where element (i,j) describes the frequency of term i in document j
• Apply log-entropy weighting scheme to decrease the weight of high-frequency words
• Use singular value decomposition to find an approximation to the term-document matrix with lower rank k
• Optimize k for the task at hand
3. Latent Semantic Analysis (LSA)
(Landauer & Dumais, 1997)
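The four steps above can be sketched with NumPy. The counts are a toy example, and the weighting formula is the common log-entropy variant (local log(1 + tf) times an entropy-based global weight), an assumption since the talk does not spell it out:

```python
import numpy as np

def log_entropy(tdm):
    """Log-entropy weighting: down-weights words spread evenly across documents."""
    tf = np.asarray(tdm, dtype=float)
    n_docs = tf.shape[1]
    p = tf / np.maximum(tf.sum(axis=1, keepdims=True), 1e-12)
    with np.errstate(divide="ignore", invalid="ignore"):
        ent = np.where(p > 0, p * np.log(p), 0.0).sum(axis=1)
    g = 1.0 + ent / np.log(n_docs)  # ~1 for focused words, ~0 for uniform ones
    return np.log1p(tf) * g[:, None]

def lsa(tdm, k):
    """Rank-k SVD approximation; rows are k-dimensional word vectors."""
    u, s, _ = np.linalg.svd(log_entropy(tdm), full_matrices=False)
    return u[:, :k] * s[:k]

# Toy term-document counts: rows = words, columns = documents.
tdm = np.array([[2, 1, 0, 0],
                [1, 2, 0, 0],
                [0, 0, 3, 1],
                [0, 0, 1, 3]])
vecs = lsa(tdm, k=2)
sim = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(sim(vecs[0], vecs[1]))  # words 0 and 1 share documents: high
print(sim(vecs[0], vecs[2]))  # disjoint documents: near zero
```

In practice k would be optimized per task, as the slide notes, rather than fixed at 2.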
• Forced-choice synonymy tests
• TOEFL synonymy test (Landauer & Dumais, 1997) (TOEFL)
• ESL synonymy test (Turney, 2001) (ESL)
• Semantic similarity judgments
• Rubenstein & Goodenough, 1965 (RG)
• Miller & Charles, 1991 (MC)
• Resnik, 1995 (R)
• Finkelstein et al.’s “WordSimilarity 353,” 2002 (WS353)
Task PMI Vectors
ESL .35 .28
TOEFL .41 .33
RG .47 .14
R .46 .17
MC .46 .12
WS353 .54 .57
Trained on small Wikipedia subset:
PMI vs. LSA
Task PMI (Wiki subset) LSA (Wiki subset)
ESL .35 .36
TOEFL .41 .44
RG .47 .62
R .46 .60
MC .46 .46
WS353 .54 .57
PMI using full Wikipedia
Task PMI WN NSS.F NSS.T SA.N SA.W LSA.T
ESL .62 .70 .44 .56 .39 .51 .44
TOEFL .64 .87 .59 .50 .61 .59 .55
RG .78 .88 .62 .53 .49 .39 .69
R .86 .90 .56 .54 .49 .52 .74
MC .76 .77 .61 .56 .45 .45 .61
WS353 .73 .46 .60 .59 .40 .38 .60
• WN: Wordnet::Similarity vector measure
• NSS.F: Normalized Search Similarity, using Factiva business news corpus
• NSS.T: Normalized Search Similarity, using TASA corpus
• SA.N: Spreading Activation, using Google counts restricted to nytimes.com
• SA.W: Spreading Activation, using Google counts restricted to wikipedia.org
• LSA.T: LSA, using TASA corpus
PMI vs. Vector Addition vs. LSA
Task PMI (Wikipedia) Vectors (Wikipedia) LSA (Wiki subset)
ESL .62 .44 .36
TOEFL .64 .54 .44
RG .78 .62 .62
R .86 .73 .60
MC .76 .68 .46
WS353 .73 .45 .57
Task cor(pmi, hum) cor(vec, hum) cor(pmi, vec)
ESL -- -- --
TOEFL -- -- --
RG .78 .62 .52
R .86 .73 .64
MC .76 .68 .57
WS353 .73 .45 .54
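Correlations like those in the table can be computed as Spearman rank correlations; a sketch using SciPy's `spearmanr` (the scores below are made-up placeholders, not the talk's data):

```python
from scipy.stats import spearmanr

# Hypothetical similarity scores for the same five word pairs (illustrative only).
human = [3.9, 3.5, 2.8, 1.1, 0.4]
pmi   = [2.1, 2.6, 1.9, 0.0, 0.0]  # note PMI's pile-up at 0 for non-co-occurring pairs

rho, _ = spearmanr(human, pmi)
print(round(rho, 2))  # → 0.87
```

The tied zeros get averaged ranks, so a high rank correlation can coexist with a very different distribution of scores, which is exactly the point the scatter plots below make.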
[Scatter plots, Rubenstein & Goodenough word pairs: Rank (PMI) and Rank (Vectors) vs. Rank (Humans), from most similar to least similar]
[Scatter plots, Resnik word pairs: Rank (PMI) and Rank (Vectors) vs. Rank (Humans), from most similar to least similar]
[Scatter plots, Miller & Charles word pairs: Rank (PMI) and Rank (Vectors) vs. Rank (Humans), from most similar to least similar]
[Scatter plots, WordSim353 word pairs: Rank (PMI) and Rank (Vectors) vs. Rank (Humans), from most similar to least similar]
Proportion of Word Pairs to Which PMI Assigns a Similarity Value of 0, by Task
ESL TOEFL RG R MC WS353
.26 .34 .25 .21 .20 .04
[Similarity distributions for Humans, LSA, Vectors, and PMI: Rubenstein & Goodenough word pairs]
[Similarity distributions for Humans, LSA, Vectors, and PMI: Resnik word pairs]
[Similarity distributions for Humans, LSA, Vectors, and PMI: Miller & Charles word pairs]
[Similarity distributions for Humans, LSA, Vectors, and PMI: WordSim353 word pairs]
[Similarity distributions for Humans (forward strengths), LSA, Vectors, and PMI: USF Free Association Norms]
• Looking at rank correlations alone obscures important distributional properties that should not be ignored if the goal is to emulate human semantic representations
• Closer attention to qualitative trends should guide model design