Efficient Phrase-Based Document Similarity for Clustering
IEEE Transactions on Knowledge and Data Engineering, Vol. 20, No. 9, pp. 1217-1229, 2008
Speaker: Wei-Cheng Wu
Date: 2008/10/23
Outline
1. Introduction
2. The Phrase-Based Document Similarity
3. Experimental Results
4. Conclusions
1. Introduction
Clustering techniques are based on four concepts: data representation model, similarity measure, clustering model, and clustering algorithm.
- Vector Space Document (VSD) model
- Suffix Tree Document (STD) model
2. The Phrase-Based Document Similarity
Standard Suffix Tree Document Model and STC Algorithm
EX: $B_b = \{1, 2, 3\}$, $B_c = \{1, 2\}$: $\frac{|B_b \cap B_c|}{|B_b|} = \frac{|\{1, 2, 3\} \cap \{1, 2\}|}{|\{1, 2, 3\}|} = \frac{2}{3} > 0.5$
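The overlap test in the example can be sketched in a few lines of Python. This is a minimal illustration, not the paper's code: the function names `overlap_ratio` and `should_merge` are hypothetical, and the rule that both directions must exceed 0.5 is assumed from the usual description of STC base-cluster merging.

```python
def overlap_ratio(base_a, base_b):
    """Fraction of the documents in base_a that also occur in base_b.

    Assumes base_a is non-empty, as a base cluster always holds a document.
    """
    a, b = set(base_a), set(base_b)
    return len(a & b) / len(a)


def should_merge(base_a, base_b, threshold=0.5):
    # Assumed rule: the overlap must exceed the threshold in both directions.
    return (overlap_ratio(base_a, base_b) > threshold
            and overlap_ratio(base_b, base_a) > threshold)


# Document sets of nodes b and c from the example.
B_b = {1, 2, 3}
B_c = {1, 2}

ratio_b = overlap_ratio(B_b, B_c)  # |B_b ∩ B_c| / |B_b| = 2/3
ratio_c = overlap_ratio(B_c, B_b)  # |B_b ∩ B_c| / |B_c| = 2/2
print(ratio_b, ratio_c, should_merge(B_b, B_c))
```

Both ratios exceed 0.5 here, so under the assumed rule the two base clusters would be connected.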
The Phrase-Based Document Similarity Based on the STD Model
Vector Space Document (VSD) model
$\vec{d} = \{w(1,d), w(2,d), \ldots, w(M,d)\}$
$w(i,d) = (1 + \log tf(i,d)) \cdot \log(1 + N/df(i))$    (1)
Example of Fig. 1: $df(b) = 3$, $tf(b,1) = 1$
$w(b,1) = (1 + \log 1) \cdot \log(1 + 3/3) = 0.693$
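The weight in formula (1) and the Fig. 1 example can be checked with a short Python sketch. The function name `tfidf_weight` is hypothetical, the natural logarithm is assumed (it reproduces the 0.693 on the slide), and returning 0 for a term that does not occur in the document is an assumption, since log(0) is undefined.

```python
import math


def tfidf_weight(tf, df, n_docs):
    """w(i,d) = (1 + log tf(i,d)) * log(1 + N / df(i)), natural logarithm."""
    if tf == 0:
        # Assumption: a term absent from the document gets weight 0.
        return 0.0
    return (1 + math.log(tf)) * math.log(1 + n_docs / df)


# Example of Fig. 1: df(b) = 3, tf(b, 1) = 1, N = 3 documents.
w_b1 = tfidf_weight(tf=1, df=3, n_docs=3)
print(round(w_b1, 3))  # 0.693
```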
(2)
Let vectors $d_x = \{x_1, x_2, \ldots, x_M\}$ and $d_y = \{y_1, y_2, \ldots, y_M\}$
(3)
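For two weight vectors of this form, the cosine measure named later in the experiments (tf-idf cosine similarity) can be sketched as follows. This is the standard cosine formula, assumed here for illustration; the function name `cosine_similarity` and the choice to return 0 for an all-zero vector are not taken from the slides.

```python
import math


def cosine_similarity(d_x, d_y):
    """Cosine of the angle between two equal-length weight vectors."""
    if len(d_x) != len(d_y):
        raise ValueError("vectors must have the same length M")
    dot = sum(x * y for x, y in zip(d_x, d_y))
    norm_x = math.sqrt(sum(x * x for x in d_x))
    norm_y = math.sqrt(sum(y * y for y in d_y))
    if norm_x == 0 or norm_y == 0:
        # Assumption: a document with no weighted terms is similar to nothing.
        return 0.0
    return dot / (norm_x * norm_y)


sim_same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
sim_orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(sim_same_direction, sim_orthogonal)
```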
Properties of the STD Model
Property 1. Each internal node of the suffix tree T represents an LCP of the document data set D, and each leaf node represents a suffix substring of a document in the data set D.
Property 2. Each first-level node in suffix tree T is labeled by a distinct phrase that appears at least once in the documents of data set D. The number of the first-level nodes is equal to the number of keywords (distinct single-word terms in the VSD model) in the data set D.
Property 3. Each phrase $P_v$ denoted by an internal node v at a higher level ($L_v \ge 2$) in suffix tree T contains at least two words. The length of the phrase $|P_v| \ge 2$ (by words).
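Properties 2 and 3 can be illustrated with a small word-level structure. The sketch below builds an uncompressed suffix trie from nested dictionaries, which is a simplification of the compact suffix tree used by the STD model; the three sample documents and the name `build_suffix_trie` are hypothetical.

```python
def build_suffix_trie(documents):
    """Insert every word-level suffix of every document into a nested-dict trie."""
    root = {}
    for doc in documents:
        words = doc.split()
        for start in range(len(words)):
            node = root
            for word in words[start:]:
                node = node.setdefault(word, {})
    return root


# Hypothetical sample documents, used only for illustration.
docs = ["cat ate cheese", "mouse ate cheese too", "cat ate mouse too"]
trie = build_suffix_trie(docs)

vocabulary = {word for doc in docs for word in doc.split()}

# Property 2: one first-level node per distinct word.
print(len(trie), len(vocabulary))

# Property 3: a node at level 2 spells a phrase of two words, e.g. "ate cheese".
print("cheese" in trie["ate"])
```

In a compact suffix tree, chains of single-child nodes are merged into one edge, so a node at level 2 can carry a phrase longer than two words, which is why Property 3 is stated as a lower bound.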
3. Experimental Results
- OHSUMED Document Collection
- RCV1 Document Collection
- 20-Newsgroups Document Collection
- Original STC algorithm
- GHAC (group-average HAC algorithm) with the phrase-based document similarity
- GHAC with the traditional single-word tf-idf cosine similarity
- K-NN clustering algorithm with the phrase-based document similarity
$C = \{C_1, C_2, \ldots, C_k\}$ is a clustering of data set D of N documents.
$C^* = \{C^*_1, C^*_2, \ldots, C^*_l\}$ designates the "correct" class set of D.
The recall of cluster j with respect to class i:
$rec(i,j) = |C_j \cap C^*_i| / |C^*_i|$
The precision of cluster j with respect to class i:
$prec(i,j) = |C_j \cap C^*_i| / |C_j|$
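With clusters and classes represented as sets of document ids, the two ratios reduce to set intersections. The following Python sketch uses hypothetical function names and a made-up cluster and class purely to show the arithmetic.

```python
def recall(cluster, true_class):
    """rec(i, j): share of class i's documents that landed in cluster j."""
    return len(cluster & true_class) / len(true_class)


def precision(cluster, true_class):
    """prec(i, j): share of cluster j's documents that belong to class i."""
    return len(cluster & true_class) / len(cluster)


# Hypothetical data: cluster C_j and correct class C*_i as sets of document ids.
C_j = {1, 2, 3, 4}
C_i_star = {3, 4, 5}

rec_ij = recall(C_j, C_i_star)      # 2 shared documents out of 3 in the class
prec_ij = precision(C_j, C_i_star)  # 2 shared documents out of 4 in the cluster
print(rec_ij, prec_ij)
```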
$F(i,j) = \frac{2 \cdot prec(i,j) \cdot rec(i,j)}{prec(i,j) + rec(i,j)}$
$F = \sum_{i=1}^{l} \frac{|C^*_i|}{N} \cdot \max_{j=1,\ldots,k} \{F(i,j)\}$
$Purity = \sum_{j=1}^{k} \frac{|C_j|}{N} \cdot \max_{i=1,\ldots,l} \{prec(i,j)\}$
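The overall F-measure and purity can be computed directly from these definitions. The Python sketch below is a minimal illustration: the function names and the toy clustering are hypothetical, and it assumes that both the clusters and the classes partition the data set, so that N equals the sum of their sizes.

```python
def f_score(prec, rec):
    """Harmonic mean of precision and recall; 0 when both are 0."""
    if prec + rec == 0:
        return 0.0
    return 2 * prec * rec / (prec + rec)


def overall_f_measure(clusters, classes):
    """Sum over classes of |C*_i| / N times the best F(i, j) over all clusters."""
    n = sum(len(cls) for cls in classes)
    total = 0.0
    for cls in classes:
        best = max(
            f_score(len(cl & cls) / len(cl), len(cl & cls) / len(cls))
            for cl in clusters
        )
        total += len(cls) / n * best
    return total


def purity(clusters, classes):
    """Sum over clusters of |C_j| / N times the best prec(i, j) over all classes."""
    n = sum(len(cl) for cl in clusters)
    return sum(
        len(cl) / n * max(len(cl & cls) / len(cl) for cls in classes)
        for cl in clusters
    )


# Hypothetical toy data: six documents, two correct classes, two clusters.
classes = [{1, 2, 3}, {4, 5, 6}]
clusters = [{1, 2}, {3, 4, 5, 6}]

f_value = overall_f_measure(clusters, classes)
purity_value = purity(clusters, classes)
print(f_value, purity_value)
```

A perfect clustering, where every cluster coincides with one class, gives 1.0 for both measures.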
The Performance Evaluation on Large Document Data Sets
We conducted a set of experiments on a large data set (DS8) generated from the RCV1 document collection. DS8 contains 500 documents each from categories GSPO and M11, and all documents of the other eight categories. The total number of documents is 4,759.
4. Conclusions
The new phrase-based document similarity successfully connects the two document models and inherits their advantages.