Overview of Peter D. Turney’s Work on Similarity
From 2001 to 2008

Similarity
Attributional similarity (2001 - 2003): the degree to which two words are synonymous; also known as semantic relatedness and semantic association.
Relational similarity (2005 - 2008): the degree to which two relations are analogous.
Objective evaluation of the approaches:
Attributional similarity: 80 TOEFL synonym questions
Relational similarity: 374 SAT analogy questions
2001: Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL
In Proceedings of the 12th European Conference on Machine Learning, pages 491–502, Springer, Berlin, 2001.
1 Introduction
Synonym recognition: given a problem word and a set of candidate words, pick the candidate whose meaning is closest to the given word.
Core idea: based on co-occurrence, following the maxim
"a word is characterized by the company it keeps"
1 Introduction: idea
Given a problem word and a set of candidates {choice_1, choice_2, …, choice_n}, compute score(choice_i) for each candidate; the one with the highest score is taken as the synonym.
PMI-IR uses Pointwise Mutual Information (PMI) to analyze statistical data collected by Information Retrieval (IR):

$\mathrm{score}(\mathrm{choice}_i) = \log_2 \frac{p(\mathrm{problem} \,\&\, \mathrm{choice}_i)}{p(\mathrm{problem})\, p(\mathrm{choice}_i)}$
2 Formulas
Score 1:
score_1(choice_i) = hits(problem AND choice_i) / hits(choice_i)
Score 2 (NEAR means within ten words):
score_2(choice_i) = hits(problem NEAR choice_i) / hits(choice_i)
2 Formulas
Score 3 (avoid antonyms such as big vs. small):
score_3(choice_i) = hits((problem NEAR choice_i) AND NOT ((problem OR choice_i) NEAR "not")) / hits(choice_i AND NOT (choice_i NEAR "not"))
Score 4 (introduce context from the question; only one context word is chosen, to keep enough hits):
score_4(choice_i) = hits((problem NEAR choice_i) AND context AND NOT ((problem OR choice_i) NEAR "not")) / hits(choice_i AND context AND NOT (choice_i NEAR "not"))
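A minimal sketch of the four scores, assuming a hypothetical hits(query) helper that returns the number of matches a search engine (or a local corpus index) reports for an AltaVista-style query; the function name and the query syntax here are illustrative, not a real API.

```python
# Hedged sketch of the PMI-IR scores; hits() is a hypothetical stand-in
# for an AltaVista-style document counter, not a real library call.

def hits(query: str) -> int:
    raise NotImplementedError("plug in a search engine or corpus index here")

def score_1(problem, choice):
    # co-occurrence anywhere in the same document
    return hits(f"{problem} AND {choice}") / hits(choice)

def score_2(problem, choice):
    # NEAR: co-occurrence within a ten-word window
    return hits(f"{problem} NEAR {choice}") / hits(choice)

def score_3(problem, choice):
    # discount antonym-like choices that co-occur with negation ("not")
    num = hits(f'({problem} NEAR {choice}) AND NOT (({problem} OR {choice}) NEAR "not")')
    den = hits(f'{choice} AND NOT ({choice} NEAR "not")')
    return num / den

def score_4(problem, choice, context):
    # additionally require a context word taken from the question
    num = hits(f'({problem} NEAR {choice}) AND {context} '
               f'AND NOT (({problem} OR {choice}) NEAR "not")')
    den = hits(f'{choice} AND {context} AND NOT ({choice} NEAR "not")')
    return num / den

def best_choice(problem, choices, score=score_2):
    # the highest-scoring candidate is taken as the synonym
    return max(choices, key=lambda c: score(problem, c))
```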
3 Experiments
Compare with LSA: Latent Semantic Analysis
LSA builds an initial matrix X from an encyclopedia: 61,000 × 30,473; the document chunks are whole articles; dimensionality is reduced with SVD; each element is a tf-idf weight; similarity is measured by the cosine.
Human baseline: students' TOEFL scores.
Datasets: 80 TOEFL questions and 50 ESL test questions.
3 Experiments: PMI-IR Vs. LSA
Time efficiency:
PMI-IR: a simple program that takes little time; about 2 s per query × 8 queries, almost all of it spent on network interaction; issued in parallel, about 2 s in total.
LSA: time-consuming; compressing the 61,000 × 30,473 matrix to 61,000 × 300 takes roughly three hours on a UNIX workstation.
3 Experiments
On the 80 TOEFL questions and 50 ESL questions: PMI-IR 73.75% (59/80) and 74% (37/50); foreign students 64.5% (51.6/80); LSA 64.4% (51.5/80).
Performance: PMI-IR wins by about 10%. Reasons: the use of NEAR and a smaller chunk size. LSA 64.4%; PMI-IR with AND 62.5%; PMI-IR with NEAR 72.5%.
4 Conclusion
Combines PMI and IR: co-occurrence is used to measure the degree of relatedness between words.
PMI is estimated by sending queries to a search engine, which alleviates the data sparseness problem.
2003: Combining Independent Modules in Lexical Multiple-Choice Problems
In RANLP-03, pages 482–489, Borovets, Bulgaria (RANLP: Recent Advances in Natural Language Processing).
1 Introduction
There are several approaches to natural language problems, and no single one will be the best for all problem instances.
How about combining them?
1 Introduction
Two main contributions:
introduces and evaluates several new modules for answering multiple-choice synonym questions and analogy questions;
presents a novel product rule for merging module outputs and compares it with two other similar merging rules (three merging rules in total).
2 Merging rules: the parameters
$p^h_{ij} \ge 0$ is the probability assigned by module $i$ ($1 \le i \le n$) to choice $j$ ($1 \le j \le k$) of instance $h$ ($1 \le h \le m$); $w$ is the vector of module weights.
$D^{h,w}_j$ is the probability assigned by the merging rule to choice $j$ of training instance $h$ when the weights are set to $w$.
Let $1 \le a(h) \le k$ be the correct answer for instance $h$. The weights are chosen to maximize the probability assigned to the correct answers:

$\hat{w} = \arg\max_{w'} \sum_h D^{h,w'}_{a(h)}$
2 Merging rules: old
Mixture rule (very common), normalized over the $k$ choices:

$M^{h,w}_j = \sum_{i=1}^{n} w_i\, p^h_{ij}, \qquad D^{h,w}_j = M^{h,w}_j \Big/ \sum_{j'=1}^{k} M^{h,w}_{j'}$

Logarithmic rule:

$L^{h,w}_j = \exp\!\Big(\sum_{i=1}^{n} w_i \ln p^h_{ij}\Big) = \prod_{i=1}^{n} \big(p^h_{ij}\big)^{w_i}, \qquad D^{h,w}_j = L^{h,w}_j \Big/ \sum_{j'=1}^{k} L^{h,w}_{j'}$
2 Merging rules: novel
Product rule:

$P^{h,w}_j = \prod_{i=1}^{n} \big(w_i\, p^h_{ij} + (1 - w_i)/k\big), \qquad D^{h,w}_j = P^{h,w}_j \Big/ \sum_{j'=1}^{k} P^{h,w}_{j'}$
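A minimal numpy sketch of the three merging rules for a single instance, assuming p is an n-modules × k-choices array of module probabilities and w a length-n weight vector; the names and the toy numbers are illustrative only.

```python
import numpy as np

def mixture_rule(p, w):
    # M_j = sum_i w_i * p_ij, then normalize over the k choices
    m = (w[:, None] * p).sum(axis=0)
    return m / m.sum()

def logarithmic_rule(p, w):
    # L_j = exp(sum_i w_i * ln p_ij) = prod_i p_ij ** w_i, then normalize
    l = np.exp((w[:, None] * np.log(p)).sum(axis=0))
    return l / l.sum()

def product_rule(p, w):
    # P_j = prod_i (w_i * p_ij + (1 - w_i) / k), then normalize
    k = p.shape[1]
    prod = np.prod(w[:, None] * p + (1.0 - w[:, None]) / k, axis=0)
    return prod / prod.sum()

# toy example: 2 modules, 4 choices, equal weights
p = np.array([[0.6, 0.2, 0.1, 0.1],
              [0.4, 0.3, 0.2, 0.1]])
w = np.array([0.5, 0.5])
print(mixture_rule(p, w), logarithmic_rule(p, w), product_rule(p, w))
```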
3 Synonym: dataset
A collection of 431 4-choice synonym questions, randomly divided into 331 training questions and 100 testing questions; the weights w are optimized on the training set.
3 Synonym: modules
LSA, PMI-IR, Thesaurus, Connector.
Thesaurus: queries Wordsmyth (www.wordsmyth.net), creates synonym lists for both the stem and the choices, and scores them by their overlap.
Connector: uses summary pages from querying Google with a pair of words; the score is a weighted sum of the number of times the words appear separated by one of the symbols [, ", :, ,, =, /, (, ] or by "means", "defined", "equals", "synonym", or whitespace, plus the number of times "dictionary" or "thesaurus" appears.
3 Synonym: combined results
The three rules' accuracies are nearly identical, but the product and logarithmic rules assign higher probabilities to correct answers, as evidenced by the mean likelihood.
3 Synonym: compare with other approaches
4 Analogies: dataset
374 5-choice instances, randomly split into 274 training instances and 100 testing instances.
E.g. cat:meow :: (a) mouse:scamper, (b) bird:peck, (c) dog:bark, (d) horse:groom, (e) lion:scratch
4 Analogies: modules
Phrase vectors: create a vector r to represent the relationship between X and Y, using phrases built from 128 patterns, e.g. "X for Y", "Y with X", "X in the Y", "Y on X"; query a search engine, record the number of hits, and measure similarity by the cosine.
Thesaurus paths (WordNet): degree of similarity between the paths connecting the two words.
4 Analogies: combine results
Lexical relation modules: a set of more specific modules using WordNet; 9 modules, each checking one relationship: Synonym, Antonym, Hypernym, Hyponym, Meronym:substance, Meronym:part, Meronym:member, Holonym:substance, Holonym:member. Each checks the stem first, then the choices.
Similarity modules: make use of dictionary definitions; Similarity:dict uses dictionary.com and Similarity:wordsmyth uses wordsmyth.net.
Given A:B::C:D, similarity = sim(A, C) + sim(B, D).
5 Conclusion
Applied three trained merging rules to TOEFL questions; accuracy: 97.5%.
Provided first results on a challenging analogy task with a set of novel modules that use both lexical databases and statistical information; accuracy: 45%.
The popular mixture rule was consistently weaker than the logarithmic and product rules at assigning high probabilities to correct answers.
State of the art (accuracy)

                    LSA      HUMAN    PMI-IR (2001)   HYBRID (2003)
Synonym questions   64.4%    64.5%    73.75%          97.5%

                    HYBRID (2003)   HUMAN
Analogies           45%             57%
2005: Corpus-based Learning of Analogies and Semantic Relations
IJCAI 2005: Proceedings of the Nineteenth International Joint Conference on Artificial Intelligence, Edinburgh, Scotland, UK, July 30 - August 5, 2005.
1 Introduction
Verbal analogy: VSM, A:B :: C:D. The novelty of the paper is the application of the VSM to measure the similarity between relationships.
Noun-modifier pair relations: a supervised nearest neighbour algorithm. Dataset: Nastase and Szpakowicz (2003), 600 noun-modifier pairs.
1 Introduction: examples
Analogy
Noun-modifier pair relations: "laser printer", relation: instrument
2 Solving Analogy Problems
Assign scores to candidate analogies A:B::C:D; for multiple-choice questions, guess the highest-scoring choice.
The difficulty with Sim(R1, R2) is that R1 and R2 are implicit; the approach attempts to learn R1 and R2 using unsupervised learning from a very large corpus.
2 Solving Analogy Problems: Vector Space Model
create vectors, r1 and r2, that represent features of R1 and R2
measure the similarity of R1 and R2 by the cosine of the angle θ between r1 and r2
2 Solving Analogy Problems: simplified diagram
Generate a vector for each word pair A:B.
Joining terms (64 in total), e.g. "X for Y", "Y with X", "X in the Y", "Y on X".
Build phrases from the pair and the joining terms, send them as queries to a search engine, and record the hit counts.
Vector: [log(hits_1), log(hits_2), …, log(hits_128)]
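A minimal sketch of the relation-vector construction and the cosine comparison, again assuming a hypothetical hits(phrase) counter; only a handful of the 64 joining terms are listed, and log(hits + 1) is used to avoid log(0).

```python
import math

# a small illustrative subset of the 64 joining terms
JOINING_TERMS = ["for", "with", "in the", "on"]

def hits(phrase: str) -> int:
    raise NotImplementedError("plug in a search engine or corpus index here")

def pair_vector(x: str, y: str):
    # two elements per joining term: "X <term> Y" and "Y <term> X"
    vec = []
    for t in JOINING_TERMS:
        vec.append(math.log(hits(f'"{x} {t} {y}"') + 1))
        vec.append(math.log(hits(f'"{y} {t} {x}"') + 1))
    return vec

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def analogy_score(stem, choice):
    # sim(R1, R2) = cosine of the two relation vectors, e.g.
    # analogy_score(("cat", "meow"), ("dog", "bark"))
    return cosine(pair_vector(*stem), pair_vector(*choice))
```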
2 Solving Analogy Problems: experiment
3 Noun-Modifier Semantic Relations
First attempt to classify semantic relations without a lexicon.
30 semantic relations in the training data
3 Noun-Modifier Semantic Relations: algorithm
Nearest neighbour supervised learning, where nearest neighbour = highest cosine.
Cosine(training pair, testing pair) over vectors of 128 elements, with the same joining terms as before.
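A minimal sketch of the nearest-neighbour step, assuming each noun-modifier pair has already been mapped to its 128-element vector as described above; the label of the most cosine-similar training pair is predicted.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest_neighbour_label(test_vec, training):
    # training: list of (128-element vector, relation label) pairs;
    # return the label of the most cosine-similar training pair
    return max(training, key=lambda pair: cosine(test_vec, pair[0]))[1]
```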
3 Noun-Modifier Semantic Relations: Experiment for the 30 Classes
30 Semantic Relations
F when precision and recall are balanced: 26.5%
F for random guessing: 3.3%
Much better than random guessing, but still much room for improvement.
30 classes is hard: too many possibilities for confusing classes; try 5 classes instead, grouping related classes together.
5 Semantic Relations
F for the 5 Classes
5 Semantic Relations
F when precision and recall are balanced: 43.2%
F for random guessing: 20.0%
Better than random guessing and better than the 30-class result (26.5%), but still room for improvement.
Execution Time
The experiments presented here required 76,800 queries to AltaVista: 600 word pairs × 128 queries per word pair = 76,800 queries.
As a courtesy to AltaVista, a five-second delay was inserted between queries; processing the 76,800 queries took about five days.
Conclusion
The cosine metric in the VSM is used to solve analogies and to classify semantic relations.
It performs much better than random guessing, but below human levels.
State of the art

accuracy     HYBRID (2003)   VSM (2005)   HUMAN
Analogies    45%             47%          57%

F-measure                   VSM (2005)
Noun-Modifier (5 classes)   43.2%
2006a: Similarity of Semantic Relations
Computational Linguistics, 32(3):379–416.

1 Introduction
Latent Relational Analysis (LRA) extends the VSM approach of Turney and Littman (2005) in three ways:
the connecting patterns are derived automatically from the corpus, instead of using a fixed set of patterns;
Singular Value Decomposition (SVD) is used to smooth the frequency data;
automatically generated synonyms are used to explore variations of the word pairs.
2 A short description of LRA: simplified diagram
Generate a vector for each word pair A:B.
64 joining terms → phrases → search hits → log.
Synonym expansion: also use the alternate pairs A':B and A:B', where A' and B' are synonyms of A and B.
Weight each element by entropy × log(hits), assemble the vectors into a matrix, and smooth it with SVD.
The patterns are obtained automatically from the corpus.
Calculate avg(cosine) over the original and alternate pairs.
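A rough numpy sketch of the weighting and SVD-smoothing stage, assuming freq is a (pairs × patterns) count matrix; the log-entropy weighting shown is the standard LSA-style scheme and may differ in detail from the exact formula used in the paper.

```python
import numpy as np

def lra_smooth(freq, k=300):
    # freq: (pairs x patterns) raw co-occurrence counts
    logf = np.log(freq + 1.0)
    # column-wise entropy weight: patterns whose frequency varies a lot
    # across pairs (low entropy) get more weight
    p = freq / (freq.sum(axis=0, keepdims=True) + 1e-12)
    with np.errstate(divide="ignore", invalid="ignore"):
        plogp = np.where(p > 0, p * np.log(p), 0.0)
    entropy_weight = 1.0 + plogp.sum(axis=0) / np.log(freq.shape[0])
    x = logf * entropy_weight
    # SVD smoothing: keep only the top-k singular values
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    k = min(k, len(s))
    return (u[:, :k] * s[:k]) @ vt[:k, :]
```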
3 Experiment: Word Analogy Questions (Baseline LRA)
Matrix: 17,232 × 8,000, density of 5.8%. Time required: 209:49:36 (about 9 days). Performance:
Experiment: Word Analogy Questions (LRA vs. VSM)
Corpus size: AltaVista, about 5 × 10^11 English words; WMTS, about 5 × 10^10 English words.
Experiment: Word Analogy Questions Varying the Parameters
Experiment: Word Analogy Questions Ablation Experiments
No SVD: the drop is not significant, but it might become significant with more word pairs.
No synonyms: recall drops. Neither SVD nor synonyms: recall drops. VSM: the drop is significant.
Experiments with Noun-Modifier Relations
Dataset: 600 noun-modifier pairs, hand-labeled with 30 classes of semantic relations.
Algorithm: baseline LRA with a single nearest neighbour; LRA supplies the distance (nearness) measure.
Discussion
For word analogy questions: performance is not yet adequate for practical application; speed is also an issue.
For noun-modifier classification: more hand-labeled data would help, but it is expensive; the choice of classification scheme for the semantic relations also matters.
Hybrid approach: combine the corpus-based approach of LRA with the lexicon-based approach of Veale (2004).
Conclusion of 2006a
LRA extends the VSM (2005): patterns are derived automatically; SVD is used to smooth and compress the data; automatically generated synonyms are used to explore variations of the word pairs.
State of the art

accuracy     HYBRID (2003)   VSM (2005)   LRA (2006a)   HUMAN
Analogies    45%             47%          56.8%         57%

F-measure                   VSM (2005)   LRA (2006a)
Noun-Modifier (5 classes)   43.2%        54.6%
2006b: Expressing Implicit Semantic Relations without Supervision
Coling/ACL-06
Introduction
Hearst (1992): pattern → X:Y. The pattern "Y such as the X" can be used to mine large text corpora for hypernym-hyponym pairs: if we search with the pattern "Y such as the X" and find the string "bird such as the ostrich", we can infer that "ostrich" is a hyponym of "bird".
Here we consider the inverse of this problem: X:Y → pattern. Can we mine a large text corpus for patterns that express the implicit relations between X and Y?
Introduction
Discovering high-quality patterns. Pertinence is a measure of pattern quality: pertinent patterns are reliable for mining further word pairs with the same semantic relations.
2 Pertinence: the first formal measure of quality for text mining patterns.
Given a set of word pairs $W = \{X_1{:}Y_1, \ldots, X_n{:}Y_n\}$ and a set of patterns $P = \{P_1, \ldots, P_m\}$, a pattern $P_i$ is pertinent to $X_j{:}Y_j$ if word pairs $X_k{:}Y_k$ that are highly typical for $P_i$ tend to be relationally similar to $X_j{:}Y_j$. Pertinence tends to be highest with unambiguous patterns.

$\mathrm{pertinence}(X_j{:}Y_j,\, P_i) = \sum_{k=1}^{n} p(X_k{:}Y_k \mid P_i)\; \mathrm{sim}_r(X_j{:}Y_j,\, X_k{:}Y_k)$
2 Pertinence: calculation
$f_{k,i}$ is the number of occurrences in a corpus of the word pair $X_k{:}Y_k$ with the pattern $P_i$ (smoothing is applied to the raw counts).

$p(P_i \mid X_k{:}Y_k) = f_{k,i} \Big/ \sum_{j=1}^{m} f_{k,j}$

$p(X_k{:}Y_k \mid P_i) = p(X_k{:}Y_k,\, P_i) \big/ p(P_i) = f_{k,i} \Big/ \sum_{j=1}^{n} f_{j,i}$

By Bayes' theorem,

$p(X_k{:}Y_k \mid P_i) = \frac{p(X_k{:}Y_k)\, p(P_i \mid X_k{:}Y_k)}{\sum_{j=1}^{n} p(X_j{:}Y_j)\, p(P_i \mid X_j{:}Y_j)}$

Assuming a uniform prior $p(X_j{:}Y_j) = 1/n$, this simplifies to

$p(X_k{:}Y_k \mid P_i) = \frac{p(P_i \mid X_k{:}Y_k)}{\sum_{j=1}^{n} p(P_i \mid X_j{:}Y_j)}$
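A minimal numpy sketch of pertinence under the uniform prior, assuming f is the (pairs × patterns) co-occurrence count matrix and sim_r a precomputed (pairs × pairs) relational-similarity matrix (e.g. LRA cosines); the smoothing of the raw counts described in the paper is omitted here.

```python
import numpy as np

def pertinence(f, sim_r):
    # f[k, i]: occurrences of pair X_k:Y_k with pattern P_i (n pairs x m patterns)
    # sim_r[j, k]: relational similarity between pairs j and k (n x n)
    # p(P_i | X_k:Y_k) = f[k, i] / sum_j f[k, j]
    p_pattern_given_pair = f / f.sum(axis=1, keepdims=True)
    # with a uniform prior, p(X_k:Y_k | P_i) is the column-normalized version
    p_pair_given_pattern = p_pattern_given_pair / p_pattern_given_pair.sum(axis=0, keepdims=True)
    # pertinence(X_j:Y_j, P_i) = sum_k p(X_k:Y_k | P_i) * sim_r(X_j:Y_j, X_k:Y_k)
    return sim_r @ p_pair_given_pattern   # (n pairs x m patterns)
```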
3 Related Work Hearst (1992)
describes a method for finding patterns like "Y such as the X", but her method requires human judgment.
Riloff and Jones (1999) use a mutual bootstrapping technique that can find patterns automatically, but the bootstrapping requires an initial seed of manually chosen examples.
Other works all require training examples or initial seed patterns for each relation.
3 Related Work
Turney (2006a): LRA maps each pair X:Y to a high-dimensional vector v, then calculates the cosine. Pertinence builds on this measure.
A limitation: the semantic content of the vectors is difficult to interpret.
The Algorithm
1. Find phrases.
2. Generate patterns; note the pattern frequency (TF), a local frequency count.
3. Count the pair frequency: a global frequency count (DF).
4. Map pairs to rows: rows for both X_j:Y_j and Y_j:X_j.
5. Map patterns to columns: drop all patterns with a pair frequency less than 10, reducing 1,706,845 distinct patterns to 42,032 patterns.

The Algorithm
6. Build a sparse matrix: each element is a frequency.
7. Calculate entropy: log and entropy weighting gives more weight to patterns that vary substantially in frequency for each pair.
8. Apply SVD.
9. Calculate cosines.
10. Calculate conditional probabilities, for every word pair and every pattern:

$p(X_k{:}Y_k \mid P_i) = \frac{p(P_i \mid X_k{:}Y_k)}{\sum_{j=1}^{n} p(P_i \mid X_j{:}Y_j)}$

11. Calculate pertinence.
The Algorithm: simplified diagram
Semantic similarity = similarity of the pattern lists.
{word pairs} → matrix: word pair 1 with pattern list 1, …, word pair n with pattern list n.
Search the corpus and count patterns, then compute and rank.
5 Experiments with Word Analogies
Dataset: 374 college-level multiple-choice word analogies, taken from the SAT test.
6 × 374 = 2,244 pairs; 4,194 rows × 84,064 columns; the sparse matrix density is 0.91%.
Score = (rank_stem + rank_choice) / 2
The four highest-ranking patterns for the stem and solution of the first example.
The top five pairs match the pattern "Y such as the X".
Comparing with other measures
Experiments with Noun-Modifiers
Method and Result
Method: a single nearest neighbour algorithm with leave-one-out cross-validation. The distance between two noun-modifier pairs is measured by the average rank of their best shared pattern.
Result:
More
For the 5 general classes
Comparing with other measures
Discussion
Time: word analogies take 5 hours, vs. 5 days (2005) and 9 days (2006a); noun-modifiers take 9 hours. The majority of the time is spent searching the corpus.
Performance: near the level of the average senior high school student (54.6% vs. 57%). For applications such as building a thesaurus, lexicon, or ontology, this level of performance suggests that the algorithm could assist, but not replace, a human expert.
Conclusion
LRA is a black box. The main contribution of this paper is the idea of pertinence, which is used to find patterns that express the implicit semantic relations between two words.
State of the art

accuracy     HYBRID (2003)   VSM (2005)   LRA (2006a)   pertinence (2006b)   HUMAN
Analogies    45%             47%          56.8%         55.7%                57%

F-measure                   VSM (2005)   LRA (2006a)   pertinence (2006b)
Noun-Modifier (5 classes)   43.2%        54.6%         50.2%
2008: A Uniform Approach to Analogies, Synonyms, Antonyms, and Associations
Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), August 2008, Manchester, UK, pages 905-912.
1 Introduction
There are too many kinds of semantic relations to provide a special-purpose algorithm for each, so we restrict our attention to four: analogous, synonymous, antonymous, and associated.
As far as we know, the algorithm proposed here is the first attempt to deal with all four tasks using a uniform approach.
1 Introduction: idea
Analogous (the base relation)
Synonymous: X:Y is analogous to the pair levied:imposed
Antonymous: X:Y is analogous to the pair black:white
Associated: X:Y is analogous to the pair doctor:hospital
1 Introduction: Why not WordNet?
WordNet contains all of the needed relations, but a corpus-based algorithm is BETTER than a lexicon-based one: on the 374 multiple-choice SAT analogy questions, WordNet (Veale, 2004) reaches 43%, while the corpus-based approach (Turney, 2006a) reaches 56%.
A corpus-based approach also needs less human labor and is easy to extend to other languages.
1 Introduction: experiments
SAT college entrance test; TOEFL; ESL; and a set of word pairs labeled similar, associated, or both, developed for experiments in cognitive psychology.
2 Algorithm: PairClass
View the task of recognizing word analogies as a problem of classifying word pairs: a standard classification problem for supervised machine learning.
2 Algorithm: Resources
Corpus: 5 × 10^10 words, consisting of web pages gathered by a web crawler (Clarke, Charles L. A., 2003).
Wumpus: an efficient search engine for passage retrieval from large corpora (http://www.wumpus-search.org/), built to study issues that arise in the context of indexing dynamic text collections in multi-user environments.
2 Algorithm: PairClass (training set & testing set)
Step 1: generate morphological variations, e.g. mason:stone → masons:stones.
Step 2: search a large corpus for all phrases containing the pair, of the form [0 to 1 words] X [0 to 3 words] Y [0 to 1 words], e.g. "the mason cut the stone with".
Step 3: generate patterns from each phrase by replacing context words with wildcards, e.g. "the X cut * Y with", "* X * the Y *"; an n-word phrase yields 2^(n−2) patterns (a sketch of this step follows below).
Step 4: reduce the number of patterns: keep the top kN patterns, with k = 20.
Step 5: generate feature vectors, one per word pair.
Step 6: apply a standard supervised learning algorithm: the SMO SVM with an RBF kernel, as implemented in Weka.
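A minimal sketch of the pattern-generation step (Step 3), assuming the two pair words each occur exactly once in the phrase; every other word may be kept or replaced by a wildcard, which yields the 2^(n−2) patterns mentioned above.

```python
from itertools import product

def generate_patterns(phrase_words, x, y):
    # replace X and Y by variables; every other word is either kept or
    # replaced by "*", giving 2^(n-2) patterns for an n-word phrase
    slots = []
    for w in phrase_words:
        if w == x:
            slots.append(["X"])
        elif w == y:
            slots.append(["Y"])
        else:
            slots.append([w, "*"])
    return {" ".join(combo) for combo in product(*slots)}

# toy example from the slides: "the mason cut the stone with"
patterns = generate_patterns("the mason cut the stone with".split(), "mason", "stone")
print(len(patterns))                     # 16 = 2^(6-2)
print("the X cut * Y with" in patterns)  # True
```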
PairClass vs. LRA (Turney, 2006a)
PairClass does not use a lexicon to find synonyms for the input word pairs: a pure corpus-based algorithm can handle synonyms without a lexicon.
PairClass uses a support vector machine (SVM) instead of a nearest neighbour (NN) learning algorithm.
PairClass does not use SVD to smooth the feature vectors; in our experience SVD is not necessary with SVMs.
Measure of similarity: PairClass produces probability estimates, which are more useful; Turney (2006) uses the cosine.
The automatically generated patterns are slightly more general: PairClass uses [0 to 1 words] X [0 to 3 words] Y [0 to 1 words], whereas Turney (2006) uses X [0 to 3 words] Y.
The morphological processing in PairClass (Minnen et al., 2001) is more sophisticated than in Turney (2006).
3 Experiment: SAT Analogies
use a set of 374 multiple-choice questions from the SAT college entrance exam.
Eg.
The task is cast as a binary classification problem.
3 Experiment: SAT Analogies
1st difficulty: no negative examples; the training set consists of one positive example (the stem pair) and the testing set consists of five unlabeled examples (the five choice pairs).
Solution: randomly choose the stem pair of one of the other 373 questions to be a negative example; use PairClass to estimate the probability that each testing example is positive, and guess the testing example with the highest probability.
3 Experiment: SAT Analogies
2nd difficulty: the algorithm is very unstable, for lack of examples.
Solution: to increase stability, repeat the learning process 10 times, using a different randomly chosen negative training example each time, and average the 10 probability estimates.
PairClass: accuracy of 52.1%.
3 Experiment: TOEFL Synonyms
Recognizing synonyms: a set of 80 multiple-choice synonym questions from the TOEFL, viewed as a binary classification problem.
3 Experiment: TOEFL Synonyms
The 80 questions yield 80 positive and 240 negative pairs; apply PairClass using ten-fold cross-validation.
In each random fold, 90% of the pairs are used for training and 10% for testing. For each fold, the model learned from the training set is used to assign probabilities to the pairs in the testing set. The folds are non-overlapping, so together they cover the whole dataset.
Choice: the candidate with the highest probability.
PairClass: accuracy of 76.1%.
3 Experiment: Synonyms and Antonyms
a set of 136 ESL practice questions
3 Experiment: Synonyms and Antonyms
Patterns hand-coded by Lin et al. (2003): two patterns, "from X to Y" and "either X or Y".
Antonyms occasionally appear in a large corpus in one of these two patterns; synonyms appear in them very rarely.
PairClass discovers such patterns automatically.
3 Experiment: Synonyms and Antonyms
Result: PairClass with ten-fold cross-validation achieves an accuracy of 75.0%.
Baseline: always guessing the majority class gives 65.4%.
No direct comparison with other systems is available.
3 Experiment: Similar, Associated, and Both
Lund et al. (1995) evaluated their corpus-based algorithm for measuring word similarity with word pairs that were labeled similar, associated, or both.
These 144 labeled pairs were originally created for cognitive psychology experiments with human subjects
3 Experiment: Similar, Associated, and Both
Lund et al. (1995) did not measure accuracy; they showed that their algorithm's similarity scores were correlated with the response times of human subjects in priming tests.
PairClass with ten-fold cross-validation: accuracy of 77.1%.
Baseline: guessing the majority class, like random guessing, gives 33.3%, since the three classes are of equal size.
3 Experiment: summary
For the first two experiments, PairClass is not the best, but it performs competitively.
For the last two experiments, PairClass performs significantly above the baselines.
State of the art

YEAR    Algorithm   Type           synonym   analogy
2001    PMI-IR      Corpus-based   73.75%
2003    PR          Hybrid         97.50%
2005    VSM         Corpus-based             47.1%
2006a   LRA         Corpus-based             56.1%
2006b   PERT        Corpus-based             53.5%
2008    PairClass   Corpus-based   76.1%     52.1%
        HUMAN                      64.5%     57.0%
Finally finished o_0
Any Questions?