Take and Took, Gaggle and Goose, Book and Read: Evaluating the Utility of Vector Differences for Lexical Relation Learning
TRANSCRIPT
Evaluating the Utility of Vector Differences for Lexical Relation Learning
Ekaterina Vylomova, Laura Rimell, Trevor Cohn, and Tim Baldwin
August 9, 2016
The utility of difference vectors

DIFFVEC = word2 − word1

Vector difference, or offset (Mikolov et al., 2013):
  king − man + woman ≈ queen
  CAPITAL-CITY: Paris − France + Poland ≈ Warsaw
  PLURALISATION: cars − car + apple ≈ apples
Can DIFFVEC(w1, w2) be clustered or classified into a broad-coverage set of lexical relations?
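As a concrete illustration of the offset method, here is a minimal sketch using gensim, assuming the pretrained GoogleNews word2vec vectors fetched via gensim's downloader; the example pairs are illustrative:

```python
# A minimal sketch of the vector-offset method; assumes gensim and the
# pretrained GoogleNews vectors (large download on first use).
import gensim.downloader as api

model = api.load("word2vec-google-news-300")

# king - man + woman ~= queen
print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# A DIFFVEC is simply the offset between the two words of a pair:
diffvec = model["took"] - model["take"]
```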
Types of relations
Lexical semantic relations
  LEXSEM_Hyper: hypernymy (animal, dog)
  LEXSEM_Mero: meronymy (bird, wing)
  LEXSEM_Event: object's action (zip, coat)

Morphosyntactic relations
  VERB_Past: present, 1st person → past (know, knew)
  VERB_3: present, 1st person → present, 3rd person (know, knows)
  VERB_3Past: present, 3rd person → past (knows, knew)
  NOUN_SP: singular → plural (year, years)

Morphosemantic relations
  VERBNOUN: nominalisation of a verb (drive, drift)
  PREFIX: prefixing with the re- morpheme (vote, revote)
Word Embeddings
Name         Dimensions   Training data (tokens)
w2v          300          100 × 10^9
GloVe        200          6 × 10^9
SENNA        100          37 × 10^6
HLBL         200          37 × 10^6
w2v_wiki     300          50 × 10^6
GloVe_wiki   300          50 × 10^6
SVD_wiki     300          50 × 10^6
The models used:
  w2v (Mikolov et al., 2013)
  GloVe (Pennington et al., 2014)
  SENNA (Collobert et al., 2011)
  HLBL (Mnih and Hinton, 2009)
  PPMI+SVD (Levy and Goldberg, 2015)
Closed-World Experiments
Closed-world setting: multi-class classifier
Let {(w_i, w_j)} be a set of word pairs and R = {r_k} a set of binary lexical relations:
  (w_i, w_j) ↦ r_k ∈ R
i.e. every word pair can be uniquely classified according to a relation in R.
Spectral clustering: t-SNE projection for 10 samples per class
[Figure: t-SNE projection of DIFFVEC clusters; legend: LEXSEM_Attr, LEXSEM_Cause, NOUN_Coll, LEXSEM_Event, LEXSEM_Hyper, LVC, LEXSEM_Mero, NOUN_SP, PREFIX, LEXSEM_Ref, LEXSEM_Space, VERB_3, VERB_3Past, VERB_Past, VERBNOUN]
Methods
Clustering algorithm: spectral clustering (von Luxburg, 2007).
Two hyperparameters: (1) the number of clusters; and (2) the pairwise similarity measure for comparing DIFFVECs.

Clustering evaluation: V-Measure (Rosenberg and Hirschberg, 2007).
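A minimal sketch of this setup with scikit-learn, using placeholder arrays in place of real DIFFVECs and gold relation labels; the RBF affinity stands in for one plausible choice of pairwise similarity measure:

```python
# Spectral clustering of DIFFVECs, evaluated with V-Measure.
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics import v_measure_score

rng = np.random.default_rng(0)
diffvecs = rng.normal(size=(200, 50))   # placeholder DIFFVECs
gold = rng.integers(0, 10, size=200)    # placeholder gold relation labels

# The two hyperparameters: number of clusters and the similarity measure.
clusterer = SpectralClustering(n_clusters=10, affinity="rbf", random_state=0)
predicted = clusterer.fit_predict(diffvecs)

print("V-Measure:", v_measure_score(gold, predicted))
```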
Clustering results
[Figure: V-Measure (0.15–0.40) vs. number of clusters (10–80) for w2v, w2v_wiki, GloVe, GloVe_wiki, SVD_wiki, HLBL, and SENNA]
Clustering results
Pairs incorrectly clustered due to ambiguity, or one word overwhelming the other:
  studies − study ⇒ VERBNOUN
  saw − utensil ⇒ VERB_Past
  tigers − ambush ⇒ NOUN_Coll

Single "hypernym"-specific clusters:
  necklace − unit, wristband − unit, hairpin − unit

Semantic sub-clusters:
  movement verb − animal noun
  food verb − food noun
  action verb − profession noun
From Clustering to Classification
Encouraged by the results of the clustering experiment, we next move to classification experiments: we train a multi-class linear classifier to differentiate between the relation types.
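A minimal sketch of such a classifier with scikit-learn; the arrays and the train/test split are placeholders for real DIFFVECs and relation labels:

```python
# Multi-class linear SVM over DIFFVECs (closed-world setting).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 50))          # placeholder DIFFVECs
y = rng.integers(0, 9, size=500)        # placeholder relation labels

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LinearSVC().fit(X_train, y_train)

print("micro-averaged F1:", f1_score(y_test, clf.predict(X_test), average="micro"))
```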
Classification: Multi-class linear SVM, F-scores
[Figure: per-relation F-scores (0.0–1.0) for Hyper, Event, Mero, Noun_SP, Verb_3, Verb_Past, Verb_3Past, Prefix_Re, Noun_Coll, and the micro-average; bars: Baseline, w2v, w2v_wiki, SVD_wiki]
Open-World Experiments
Open-world setting: binary classifier
Let {(w_i, w_j)} be a set of word pairs and R = {r_k} a set of binary lexical relations:
  (w_i, w_j) ↦ r_k ∈ R ∪ {φ}
where φ signifies that none of the relations in R apply to the word pair.
Binary classification
Add random pairs, i.e. randomly linked word pairs.

Generating random pairs (see the sketch below):
  (1) sample seed words proportional to their frequency in Wikipedia ⇒
  (2) take the Cartesian product over pairs of words from the seed lexicon ⇒
  (3) sample word pairs uniformly from this set

Training the classifiers: train 9 binary SVM classifiers with an RBF kernel and evaluate on a test set augmented with random samples.
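A minimal sketch of the three-step random-pair generation; the vocabulary, frequency counts, and sample sizes are illustrative stand-ins for Wikipedia statistics:

```python
# Random word-pair generation for the open-world setting.
import numpy as np

rng = np.random.default_rng(0)
vocab = np.array(["cat", "dog", "run", "blue", "paris"])   # placeholder words
freq = np.array([50.0, 40.0, 30.0, 20.0, 10.0])            # placeholder counts

# (1) sample a seed lexicon proportional to corpus frequency
seeds = rng.choice(vocab, size=4, replace=False, p=freq / freq.sum())

# (2) take the Cartesian product over the seed lexicon
candidates = [(w1, w2) for w1 in seeds for w2 in seeds if w1 != w2]

# (3) sample word pairs uniformly from this set
idx = rng.choice(len(candidates), size=5)
random_pairs = [candidates[i] for i in idx]
```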
Open-World Results
[Figure: open-world precision and recall per relation (Pr: No NS, Re: No NS) for Hyper, Event, Mero, Noun_SP, Verb_3, Verb_Past, Verb_3Past, Prefix_Re, Noun_Coll]
Binary classification
Results: the classifiers correctly captured many true instances of the relations (high recall), but also accepted many of the random samples as related (low precision):

  (have, works), (turn, took), (works, started) ⇒ VERB_3, VERB_Past and VERB_3Past
  NOUN_Coll: everything related to animals
  LEXSEM_Mero: mainly pairs consisting of nouns

Relational similarity ≈ a combination of attributional similarities. The classifier captures some of these (e.g. syntactic), while others (e.g. semantic) may be missing.
Binary classification
The classifier over-generalizes → add extra negative samples to the training data.

Negative samples
  opposite pairs: switch the order of the words, Oppos(w1, w2) = word1 − word2
  shuffled pairs: replace w2 with a random word w2′ from the same relation, Shuff(w1, w2) = word2′ − word1
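A minimal sketch of both negative-sampling schemes; the embedding table and relation pairs are illustrative placeholders:

```python
# Opposite and shuffled negative samples for DIFFVEC training.
import numpy as np

rng = np.random.default_rng(0)
emb = {w: rng.normal(size=50) for w in ["know", "knew", "take", "took"]}
pairs = [("know", "knew"), ("take", "took")]    # placeholder relation pairs

# opposite pairs: switch the order of the words, word1 - word2
opposite = [emb[w1] - emb[w2] for w1, w2 in pairs]

# shuffled pairs: replace w2 with a random second word from the same
# relation (in practice the original w2 would be excluded)
w2s = [w2 for _, w2 in pairs]
shuffled = [emb[rng.choice(w2s)] - emb[w1] for w1, _ in pairs]
```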
Binary classification with negative samples
[Figure: precision and recall per relation (0.0–1.0) with and without negative samples (Pr: No NS, Pr: With NS, Re: No NS, Re: With NS) for Hyper, Event, Mero, Noun_SP, Verb_3, Verb_Past, Verb_3Past, Prefix_Re, Noun_Coll]
Lexical memorization
A short example. Train:
  LEXSEM_Hyper: hypernymy (animal, dog)
  LEXSEM_Hyper: hypernymy (animal, cat)
  LEXSEM_Hyper: hypernymy (animal, monkey)

Then in test:
  LEXSEM_Hyper: hypernymy (animal, banana)

The classifier can simply memorise that pairs whose first word is "animal" are hypernyms, rather than learning the relation itself.
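The next slide evaluates a split with no lexical overlap between train and test. A minimal sketch of such a lexically disjoint split, with illustrative pairs:

```python
# Lexically disjoint train/test split: no word in a training pair
# may appear in a test pair.
import numpy as np

rng = np.random.default_rng(0)
pairs = [("animal", "dog"), ("animal", "cat"), ("fruit", "banana"),
         ("fruit", "apple"), ("vehicle", "car")]    # placeholder pairs

# split the *vocabulary*, then keep only pairs fully inside one side
vocab = sorted({w for pair in pairs for w in pair})
train_words = set(rng.choice(vocab, size=len(vocab) // 2, replace=False))

train = [p for p in pairs if all(w in train_words for w in p)]
test = [p for p in pairs if all(w not in train_words for w in p)]
# pairs straddling the split are discarded
```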
Lexical memorization: No lexical overlap between test and train
[Figure: precision/recall/F-score (0.0–1.0) vs. volume of random word pairs (0–5), with and without negative samples: P, P+neg, R, R+neg, F, F+neg]
Conclusion
Many types of morphosyntactic difference are captured well by DIFFVECs; morphosemantic relations are somewhat harder, and lexical semantic relations are captured less well.
Classification over DIFFVECs works extremely well in a closed-world setup, but less well over open data.
With the introduction of automatically generated negative samples, however, the results improve substantially.
Open Questions
Could some examples be more representative of the relation type than others?
How much data do we need for the best generalisation?
Some morphosemantic relations (derivation: prefixing) and lexical relations (meronymy) are still hard to capture.
Thanks
Thank you for your time and attention! Questions?
See more details here: http://arxiv.org/abs/1509.01692