Chapter 3
Similarity Measures
Data Mining Technology
Written by Kevin E. Heinrich
Presented by Zhao Xinyou
2007.6.7
Some materials (examples) are taken from websites.
Searching Process
Input Text → Process (Index/Query) → Sorting → Show Text Result

Example
Input key words → Search → Results:
1. XXXXXXX
2. YYYYYYY
3. ZZZZZZZZ
...
Similarity Measures
- A similarity measure can represent the similarity between two documents, two queries, or one document and one query.
- It makes it possible to rank retrieved documents in order of presumed relevance.
- A similarity measure is a function that computes the degree of similarity between a pair of text objects.
- A large number of similarity measures have been proposed in the literature, because the best similarity measure doesn't exist (yet!).
Classic Similarity Measures
- All similarity measures should map to the range [-1, 1] or [0, 1].
- 0 or -1 indicates minimum similarity (no similarity at all).
- 1 indicates maximum similarity (identical objects).
Conversion
Suppose a score of 1 indicates minimum similarity and 10 indicates maximum similarity. To convert the range [1, 10] to [0, 1]:
s' = (s - 1) / 9
In general, we may use:
s' = (s - min_s) / (max_s - min_s)
(The conversion may be linear, as above, or non-linear.)
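The conversion above can be sketched in Python (a minimal sketch; the function name is illustrative):

```python
def normalize(s, min_s, max_s):
    """Linearly map a similarity score s from [min_s, max_s] to [0, 1]."""
    return (s - min_s) / (max_s - min_s)

# Map scores from [1, 10] to [0, 1]: s' = (s - 1) / 9
print(normalize(1, 1, 10))    # minimum similarity -> 0.0
print(normalize(10, 1, 10))   # maximum similarity -> 1.0
print(normalize(5.5, 1, 10))  # midpoint -> 0.5
```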
Vector-Space Model (VSM)
In the 1960s, Salton et al. proposed the VSM, which was successfully applied in SMART (a text retrieval system).
Example
D is a set of m Web documents: D = {d1, d2, ..., di, ..., dm}, i = 1, 2, ..., m
There are n distinct terms among the m documents, so each document is a term-weight vector: di = {wi1, wi2, ..., wij, ..., win}, i = 1, ..., m; j = 1, ..., n
A query is likewise a weight vector: Q = {q1, q2, ..., qi, ..., qn}, i = 1, 2, ..., n
If similarity(q, di) > similarity(q, dj), we may conclude that di is more relevant than dj.
Simple Measure Technology
Within the document set:
- Retrieved documents: A
- Relevant documents: B
- Retrieved and relevant documents: A∩B
Precision = returned relevant documents / total returned documents: P(A, B) = |A∩B| / |A|
Recall = returned relevant documents / total relevant documents: R(A, B) = |A∩B| / |B|
Example: Simple Measure Technology
Within the document set:
- Relevant only: A, C, E, G, H, I, J
- Retrieved and relevant: B, D, F
- Retrieved only: W, Y
|B| = |{relevant}| = |{A, B, C, D, E, F, G, H, I, J}| = 10
|A| = |{retrieved}| = |{B, D, F, W, Y}| = 5
|A∩B| = |{relevant} ∩ {retrieved}| = |{B, D, F}| = 3
P = precision = 3/5 = 60%
R = recall = 3/10 = 30%
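The example above can be reproduced with Python sets (a minimal sketch):

```python
# Precision and recall for the example above.
relevant = {"A", "B", "C", "D", "E", "F", "G", "H", "I", "J"}  # B
retrieved = {"B", "D", "F", "W", "Y"}                          # A
both = relevant & retrieved                                    # A∩B = {B, D, F}

precision = len(both) / len(retrieved)  # |A∩B| / |A| = 3/5
recall = len(both) / len(relevant)      # |A∩B| / |B| = 3/10
print(precision, recall)  # 0.6 0.3
```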
Precision-Recall Curves
There is a tradeoff between precision and recall, so precision is measured at different levels of recall.
(Figure: precision-recall curves for one query and for two queries; it is difficult to determine which of these two hypothetical results is better.)
Similarity Measures Based on VSM
- Dice coefficient
- Overlap coefficient
- Jaccard coefficient
- Cosine coefficient
- Asymmetric
- Dissimilarity
- Other measures
Dice Coefficient (Cont'd)
Definition of harmonic mean: for x1, x2, ..., xn, the harmonic mean E equals n divided by (1/x1 + 1/x2 + ... + 1/xn):
E = n / Σ_{i=1..n} (1/xi)
The harmonic mean E of precision P and recall R:
E = 2 / (1/P + 1/R) = 2 / (|A|/|A∩B| + |B|/|A∩B|) = 2|A∩B| / (|A| + |B|)
Dice Coefficient (Cont'd)
Definition of the Dice coefficient:
sim(q, dj) = D(A, B) = |A∩B| / (α|A| + (1 - α)|B|)
           = Σ_{k=1..n} w_kq·w_kj / (α Σ_{k=1..n} w_kq² + (1 - α) Σ_{k=1..n} w_kj²),  α ∈ (0, 1)
If α = 1/2, then D(A, B) = |A∩B| / ((|A| + |B|)/2) = 2|A∩B| / (|A| + |B|) = E
- α > 0.5: precision is more important
- α < 0.5: recall is more important
- Usually α = 0.5
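The weighted Dice coefficient can be sketched as follows (a minimal sketch over term-weight vectors; the function name is illustrative):

```python
def dice(q, d, alpha=0.5):
    """Weighted Dice coefficient between query and document weight vectors.

    sim(q, d) = sum(q_k * d_k) / (alpha * sum(q_k^2) + (1 - alpha) * sum(d_k^2))
    alpha > 0.5 favors precision; alpha < 0.5 favors recall.
    """
    num = sum(qk * dk for qk, dk in zip(q, d))
    den = alpha * sum(qk * qk for qk in q) + (1 - alpha) * sum(dk * dk for dk in d)
    return num / den

# With the default alpha = 0.5 this reduces to 2*|A∩B| / (|A| + |B|):
print(dice([0, 0, 2], [2, 3, 5]))  # 10 / (0.5*4 + 0.5*38) = 10/21
```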
Overlap Coefficient
sim(q, dj) = O(A, B) = |A∩B| / min(|A|, |B|)
           = Σ_{k=1..n} w_kq·w_kj / min(Σ_{k=1..n} w_kq², Σ_{k=1..n} w_kj²)
(A: the query; B: a document, within the document set.)
Jaccard Coefficient (Cont'd)
sim(q, dj) = J(A, B) = |A∩B| / |A∪B| = |A∩B| / (|A| + |B| - |A∩B|)
           = Σ_{k=1..n} w_kq·w_kj / (Σ_{k=1..n} w_kq² + Σ_{k=1..n} w_kj² - Σ_{k=1..n} w_kq·w_kj)
(A: the query; B: a document, within the document set.)
Example: Jaccard Coefficient
D1 = 2T1 + 3T2 + 5T3 → (2, 3, 5)
D2 = 3T1 + 7T2 + T3 → (3, 7, 1)
Q = 0T1 + 0T2 + 2T3 → (0, 0, 2)
J(D1, Q) = 10 / (38 + 4 - 10) = 10/32 ≈ 0.31
J(D2, Q) = 2 / (59 + 4 - 2) = 2/61 ≈ 0.03
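The Jaccard example above can be reproduced in Python (a minimal sketch):

```python
def jaccard(q, d):
    """Jaccard coefficient over weight vectors:
    sum(q_k*d_k) / (sum(q_k^2) + sum(d_k^2) - sum(q_k*d_k))."""
    qd = sum(qk * dk for qk, dk in zip(q, d))
    return qd / (sum(qk * qk for qk in q) + sum(dk * dk for dk in d) - qd)

Q = [0, 0, 2]
D1 = [2, 3, 5]
D2 = [3, 7, 1]
print(round(jaccard(D1, Q), 2))  # 10/32 -> 0.31
print(round(jaccard(D2, Q), 2))  # 2/61  -> 0.03
```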
Cosine Coefficient (Cont'd)
sim(q, dj) = C(A, B) = |A∩B| / √(|A|·|B|) = √(P·R)
           = (q · dj) / (‖q‖·‖dj‖)
           = Σ_{k=1..n} w_kq·w_kj / √(Σ_{k=1..n} w_kq² · Σ_{k=1..n} w_kj²)
(Figure: the query vector (q1, q2, ..., qn) and document vectors (d11, d12, ..., d1n), (d21, d22, ..., d2n) in term space; the cosine measures the angle between them.)
Example: Cosine Coefficient
Q = 0T1 + 0T2 + 2T3 → (0, 0, 2)
D1 = 2T1 + 3T2 + 5T3 → (2, 3, 5)
D2 = 3T1 + 7T2 + T3 → (3, 7, 1)
C(D1, Q) = (2·0 + 3·0 + 5·2) / √((2² + 3² + 5²)·(0² + 0² + 2²)) = 10 / √(38·4) ≈ 0.81
C(D2, Q) = 2 / √(59·4) ≈ 0.13
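The cosine example can likewise be reproduced (a minimal sketch):

```python
import math

def cosine(q, d):
    """Cosine coefficient: dot(q, d) / (|q| * |d|)."""
    dot = sum(qk * dk for qk, dk in zip(q, d))
    return dot / math.sqrt(sum(qk * qk for qk in q) * sum(dk * dk for dk in d))

Q = [0, 0, 2]
print(round(cosine(Q, [2, 3, 5]), 2))  # 10 / sqrt(38*4) -> 0.81
print(round(cosine(Q, [3, 7, 1]), 2))  # 2 / sqrt(59*4)  -> 0.13
```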
Asymmetric
sim(q, dj) = A(q, dj) = Σ_{k=1..n} min(w_kq, w_kj) / Σ_{k=1..n} w_kq
(Unlike the measures above, this is not symmetric: swapping q and dj changes the value.)
Euclidean Distance
dis(q, dj) = d_E(q, dj) = √( Σ_{k=1..n} (w_kq - w_kj)² )
Manhattan (Block) Distance
dis(q, dj) = d_M(q, dj) = Σ_{k=1..n} |w_kq - w_kj|
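Both distance measures above can be sketched together (a minimal sketch; function names are illustrative):

```python
import math

def euclidean(q, d):
    """Euclidean distance: sqrt(sum((q_k - d_k)^2))."""
    return math.sqrt(sum((qk - dk) ** 2 for qk, dk in zip(q, d)))

def manhattan(q, d):
    """Manhattan (block) distance: sum(|q_k - d_k|)."""
    return sum(abs(qk - dk) for qk, dk in zip(q, d))

Q = [0, 0, 2]
D1 = [2, 3, 5]
print(euclidean(Q, D1))  # sqrt(4 + 9 + 9) = sqrt(22)
print(manhattan(Q, D1))  # 2 + 3 + 3 = 8
```

Note these are dissimilarity measures: larger values mean the vectors are farther apart, the opposite sense of the coefficients above.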
Other Measures
We may use a priori or context knowledge, for example:
sim(q, dj) = [content identifier similarity] + [objective term similarity] + [citation similarity]
Comparison
- Simple matching: |A∩B|
- Dice's coefficient: D = 2|A∩B| / (|A| + |B|)
- Cosine coefficient: C = |A∩B| / √(|A|·|B|)
- Overlap coefficient: O = |A∩B| / min(|A|, |B|)
- Jaccard's coefficient: J = |A∩B| / (|A| + |B| - |A∩B|)

Since |A| ≥ |A∩B| and |B| ≥ |A∩B|, the denominators satisfy:
|A| + |B| - |A∩B| ≥ (|A| + |B|)/2
(|A| + |B|)/2 ≥ √(|A|·|B|)
√(|A|·|B|) ≥ min(|A|, |B|)
Therefore O ≥ C ≥ D ≥ J.
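The ordering O ≥ C ≥ D ≥ J can be checked numerically with the set-based forms (a minimal sketch; the sizes reuse the earlier precision/recall example, |A| = 5, |B| = 10, |A∩B| = 3):

```python
import math

# Set-based forms of the four coefficients, in terms of |A|, |B|, |A∩B|.
def overlap(a, b, ab): return ab / min(a, b)
def cosine(a, b, ab):  return ab / math.sqrt(a * b)
def dice(a, b, ab):    return 2 * ab / (a + b)
def jaccard(a, b, ab): return ab / (a + b - ab)

a, b, ab = 5, 10, 3
O, C, D, J = overlap(a, b, ab), cosine(a, b, ab), dice(a, b, ab), jaccard(a, b, ab)
print(O >= C >= D >= J)  # True
```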
Example: Documents, Terms, and Query (Cont'd)
D1: A search engine for 3D models
D2: Design and implementation of a string database query language
D3: Ranking of documents by measures considering conceptual dependence between terms
D4: Exploiting hierarchical domain structure to compute similarity
D5: An approach for measuring semantic similarity between words using multiple information sources
D6: Determining semantic similarity among entity classes from different ontologies
D7: Strong similarity measures for ordered sets of documents in information retrieval

Terms: T1: search(ing); T2: engine(s); T3: models; T4: database; T5: query; T6: language; T7: documents; T8: measur(es, ing); T9: conceptual; T10: dependence; T11: domain; T12: structure; T13: similarity; T14: semantic; T15: ontologies; T16: information; T17: retrieval

Query: semantic similarity measures used by search engines and other information searching mechanisms
Example: Term-Document Matrix (Cont'd)

     T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11 T12 T13 T14 T15 T16 T17
D1    1  1  1  0  0  0  0  0  0  0   0   0   0   0   0   0   0
D2    0  0  0  1  1  1  0  0  0  0   0   0   0   0   0   0   0
D3    0  0  0  0  0  0  1  1  1  1   0   0   0   0   0   0   0
D4    0  0  0  0  0  0  0  0  0  0   1   1   1   0   0   0   0
D5    0  0  0  0  0  0  0  1  0  0   0   0   1   1   0   1   0
D6    0  0  0  0  0  0  0  0  0  0   0   0   1   1   1   0   0
D7    0  0  0  0  0  0  1  1  0  0   0   0   1   0   0   1   1
Q     2  1  0  0  0  0  0  1  0  0   0   0   1   1   0   1   0

(Entries are term counts; e.g., search/searching appears twice in the query, so Q has weight 2 for T1.)
Dice Coefficient Example
With α = 1/2:
D(D1, Q) = 2 Σ_{k=1..n} w_kq·w_k1 / (Σ_{k=1..n} w_kq² + Σ_{k=1..n} w_k1²)
         = 2·(2·1 + 1·1 + 0·0 + ... + 0·0) / ((2² + 1² + 1² + 1² + 1² + 1²) + (1² + 1² + 1²))
         = 2·3 / (9 + 3) = 6/12 = 0.5
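The computation above can be reproduced in Python (a minimal sketch; the vectors transcribe D1 and Q from the term-document matrix):

```python
# Term weight vectors over T1..T17 from the term-document matrix.
D1 = [1, 1, 1] + [0] * 14  # search, engine, models
Q = [2, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0]  # query weights

def dice(q, d):
    """Dice coefficient with alpha = 1/2: 2*sum(q_k*d_k) / (sum(q_k^2) + sum(d_k^2))."""
    num = 2 * sum(qk * dk for qk, dk in zip(q, d))
    return num / (sum(qk * qk for qk in q) + sum(dk * dk for dk in d))

print(dice(Q, D1))  # 2*(2*1 + 1*1) / (9 + 3) = 6/12 = 0.5
```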
Final Results
O ≥ C ≥ D ≥ J
Current Applications
- Multi-dimensional modeling
- Hierarchical clustering
- Bioinformatics
Discussion