More Data Trumps Smarter Algorithms: Training Computational Models of Semantics on Large Corpora
DESCRIPTION
Gabriel Recchia, Cognitive Science
Michael N. Jones, Psychology
Indiana University, Bloomington
Presented at
TRANSCRIPT
More Data Trumps Smarter Algorithms: Training Computational Models of Semantics on Large Corpora
Gabriel Recchia, Cognitive Science; Michael N. Jones, Psychology
Indiana University, Bloomington
Example word pairs and similarity scores:
car / automobile     0.975
lad / wizard         0.175
gem / jewel          0.875
glass / magician     0.025
journey / voyage     0.875
asylum / madhouse    0.900
Compares the probability of observing word x and word y together (the joint probability) with the probabilities of observing x and y independently (chance)
Pointwise Mutual Information (PMI)
I(x, y) = log₂ [ P(x, y) / (P(x) P(y)) ]
(Church & Hanks, 1990)
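As a concrete illustration, PMI over document-level co-occurrence takes only a few lines. This is a minimal sketch, not the authors' implementation: the corpus format (tokenized documents), the use of document-level co-occurrence counts, and the toy documents are all assumptions.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_scores(documents):
    """Compute PMI for every word pair that co-occurs in some document.

    P(x) is estimated as the fraction of documents containing x, and
    P(x, y) as the fraction containing both x and y (an assumption;
    other window definitions are possible).
    """
    n_docs = len(documents)
    word_counts = Counter()   # number of documents containing word x
    pair_counts = Counter()   # number of documents containing both x and y
    for doc in documents:
        vocab = set(doc)
        word_counts.update(vocab)
        pair_counts.update(combinations(sorted(vocab), 2))
    scores = {}
    for (x, y), joint in pair_counts.items():
        p_xy = joint / n_docs
        p_x = word_counts[x] / n_docs
        p_y = word_counts[y] / n_docs
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores

# Toy corpus (hypothetical): "car"/"road" co-occur more than chance,
# so their PMI is positive; "car"/"fuel" less than chance, so negative.
docs = [["car", "road"], ["car", "road", "fuel"], ["fuel", "fire"]]
scores = pmi_scores(docs)
```

Note that a pair observed exactly as often as chance predicts gets a PMI of 0; pairs observed less often than chance go negative.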
• Build a term-document matrix where element (i,j) describes the frequency of term i in document j
• Apply a log-entropy weighting scheme to decrease the weight of high-frequency words
• Use singular value decomposition to find an approximation to the term-document matrix with lower rank k
• Optimize k for the task at hand
Latent Semantic Analysis (LSA)
(Landauer & Dumais, 1997)
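The four steps above can be sketched as follows. This is an illustrative sketch under stated assumptions: the toy term-document matrix, the choice of k, and the exact form of the log-entropy weighting are mine, and real corpora call for sparse SVD solvers rather than a dense decomposition.

```python
import numpy as np

def lsa_vectors(tdm, k):
    """Rank-k LSA term vectors from a term-document count matrix.

    Steps: (1) tdm holds frequency of term i in document j;
    (2) log-entropy weighting; (3) truncated SVD; (4) k is a
    free parameter to be tuned for the task at hand.
    """
    # Local weight: log of term frequency.
    log_tf = np.log2(tdm + 1.0)
    # Global weight: 1 + (entropy of the term's distribution / log n_docs);
    # high-frequency, evenly spread terms get weights near 0.
    n_docs = tdm.shape[1]
    p = tdm / np.maximum(tdm.sum(axis=1, keepdims=True), 1e-12)
    with np.errstate(divide="ignore", invalid="ignore"):
        ent = np.where(p > 0, p * np.log2(p), 0.0).sum(axis=1)
    g = 1.0 + ent / np.log2(n_docs)
    weighted = log_tf * g[:, None]
    # Truncated SVD: keep the top-k singular dimensions.
    u, s, vt = np.linalg.svd(weighted, full_matrices=False)
    return u[:, :k] * s[:k]  # term vectors in the latent space

# Toy matrix (hypothetical): terms 0-1 share documents, as do terms 2-3.
tdm = np.array([[2., 1., 0., 0.],
                [1., 2., 0., 0.],
                [0., 0., 3., 1.],
                [0., 0., 1., 2.]])
vecs = lsa_vectors(tdm, k=2)
```

After the reduction, terms 0 and 1 end up with nearly identical latent vectors, while terms 0 and 2 remain roughly orthogonal.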
• Criticisms
• Scalability
• Incrementality
• Lessons from computational linguistics: simple models that can be trained on more data often outperform complex models that are restricted to less
Latent Semantic Analysis (LSA)
• Forced-choice tests
• TOEFL synonymy test (Landauer & Dumais, 1997) (TOEFL)
• ESL synonymy test (Turney, 2001) (ESL)
• Semantic similarity judgments
• Rubenstein & Goodenough, 1965 (RG)
• Miller & Charles, 1991 (MC)
• Resnik, 1995 (R)
• Finkelstein et al.’s “WordSimilarity-353,” 2002 (WS353)
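Both kinds of benchmark reduce to simple operations over a model's similarity scores: forced-choice items are scored by picking the most similar candidate, and judgment datasets by correlating model scores with human ratings. A hypothetical sketch (the function names and the choice of rank correlation are assumptions; individual benchmarks may use product-moment correlation instead):

```python
def answer_forced_choice(probe, candidates, sim):
    """Score a TOEFL/ESL-style item: pick the candidate most similar to the probe."""
    return max(candidates, key=lambda c: sim(probe, c))

def spearman_rho(xs, ys):
    """Spearman rank correlation between model scores and human ratings.

    Sketch only: ties are not rank-averaged, which suffices for
    illustration with distinct values.
    """
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0.0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Toy similarity scores (hypothetical values, not model output).
sim_table = {("car", "automobile"): 0.9,
             ("car", "wizard"): 0.1,
             ("car", "glass"): 0.2}
best = answer_forced_choice("car", ["automobile", "wizard", "glass"],
                            lambda a, b: sim_table[(a, b)])
```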
PMI vs. LSA (Budiu et al., 2007)
Task PMI (TASA) LSA (TASA)
ESL .22 .44
TOEFL .22 .60
RG .61 .64
R .61 .71
MC .65 .75
WS353 .58 .60
PMI vs. LSA (Budiu et al., 2007)
Task PMI (Stanford) LSA (TASA)
ESL .52 .44
TOEFL .51 .60
RG .75 .64
R .83 .71
MC .79 .75
WS353 .71 .60
• Budiu et al. (2007) concluded that PMI performs better when given more data
• But there was a confound: corpus size varied together with document size and type of text (web documents vs. carefully constructed sentences in textbooks)
PMI vs. LSA
Experiment 1
Task PMI (Wiki subset) LSA (Wiki subset)
ESL .35 .36
TOEFL .41 .44
RG .47 .62
R .46 .60
MC .46 .46
WS353 .54 .57
PMI vs. LSA
Experiment 1
Task PMI (full Wikipedia) LSA (Wiki subset)
ESL .62 .36
TOEFL .64 .44
RG .78 .62
R .86 .60
MC .76 .46
WS353 .73 .57
Experiment 2
PMI trained on lots of data outperforms LSA trained on less. How does it compare with other measures of semantic relatedness?
To find out: compare with other publicly available measures at the Rensselaer Measures of Semantic Relatedness website, cwl-projects.cogsci.rpi.edu/msr (Veksler et al., 2008)
• Latent Semantic Analysis (Landauer & Dumais, 1997)
• Spreading Activation (Anderson & Pirolli, 1984; Farahat, Pirolli, & Markova, 2004)
• Normalized Search Similarity (Cilibrasi and Vitányi, 2007; Veksler et al., 2008)
• WordNet::Similarity vector measure (Pedersen et al., 2004)
Experiment 2
Task   PMI (full Wikipedia)  WN   NSS.F  NSS.T  SA.N  SA.W  LSA.T
ESL    .62                   .70  .44    .56    .39   .51   .44
TOEFL  .64                   .87  .59    .50    .61   .59   .55
RG     .78                   .88  .62    .53    .49   .39   .69
R      .86                   .90  .56    .54    .49   .52   .74
MC     .76                   .77  .61    .56    .45   .45   .61
WS353  .73                   .46  .60    .59    .40   .38   .60
• WN: WordNet::Similarity vector measure
• NSS.F: Normalized Search Similarity, using Factiva business news corpus
• NSS.T: Normalized Search Similarity, using TASA corpus
• SA.N: Spreading Activation, using Google counts restricted to nytimes.com
• SA.W: Spreading Activation, using Google counts restricted to wikipedia.org
• LSA.T: LSA, using TASA corpus
Discussion
Released a tool for calculating PMI scores: http://www.indiana.edu/~clcl/lmoss/
Discussion
• PMI does not take latent information into account; unmodified, it is not plausible as a complete model of semantics
• However, its success when scaled to data on the order of human experience favors models built on simple co-occurrence statistics (e.g., models based on vector addition)
Discussion
• Simple, scalable, incremental models of semantic similarity show promise
• Suggests complexity should be left in the data rather than added to the model
• Publicly available tool allows non-programmers to retrieve corpus-specific semantic similarity scores
Conclusions