Web Page Clustering based on
Web Community Extraction
Chikayama-Taura Lab., M2, Shim Wonbo
Background
Directory = Category
  Open Directory Project: used by Google, Lycos, etc.
  Web pages are categorized by hand.
    Accurate
    Slow to update
    Not scalable
World Wide Web
  Rapid growth (= the number of clusters changes)
  Updated daily (= cluster centers move)
Because of these two properties of the Web, a Web page clustering system that needs no human effort is required.
Purpose
Constructing a Web page clustering system which
  finds clusters without human help
  is scalable
  clusters Web pages at high speed
  clusters Web pages accurately
Brief System View
(a) Web pages
DBG Extraction
(b) Web Communities (c) Web Page Clustering
Partitioning of remaining pages based on TF-IDF
Contribution
Web Community
  A new Web community topology is defined.
  The extracted Web communities show higher precision than existing work.
Web Page Clustering
  Web communities are used as the centroids of clusters in TF-IDF space.
  Experimental results show meaningful clusters.
Agenda Introduction Related Work Proposal Evaluation Conclusion
Existing Work
Text-based clustering
  Uses terms as features
  Commonly used algorithms
    ex) k-means, hierarchical algorithms, density-based clustering
Link-based clustering
  Also called Web community extraction
  Extracts dense subgraphs from the Web graph
Combination of text and link information
  ex) Contents-Link Coupled Web Page Clustering [Yitong et al., DEWS2004]
Text-based Clustering
Merit
  Accurate (because the text is taken into account)
Problem
  Unsupervised clustering: the number of clusters is hard to decide
  Supervised learning and clustering: labeling each training datum is difficult
Contents-Link Coupled Web Page Clustering [Yitong et al., DEWS2004]
Features: term frequency (p_term), out-links (p_out), in-links (p_in)
Similarity
Clustering algorithm: an extension of the k-means algorithm
Extraction of Web Community based on Link Analysis
An Approach to Find Related Communities Based on Bipartite Graphs [P. Krishna Reddy et al., 2001]
PlusDBG: Web Community Extraction Scheme Improving Both Precision and Pseudo-Recall [Saida et al., 2005]
Terminology
Fan and Center
Bipartite Graph (BG), Complete BG (CBG), Dense BG (DBG)
[Figure: fans and centers in (a) a CBG and (b) a DBG, labelled with p, q and p_t, q_t]
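The slide names CBG and DBG without spelling out their conditions. The sketch below assumes the usual reading from Reddy et al.: in a CBG every fan cites every center, while in a DBG(p, q) every fan cites at least p centers and every center is cited by at least q fans. The function names and the edge-set representation are hypothetical.

```python
from typing import Set, Tuple

Edge = Tuple[str, str]  # (fan, center): the fan cites the center


def is_cbg(edges: Set[Edge], fans: Set[str], centers: Set[str]) -> bool:
    """Complete bipartite graph: every fan cites every center."""
    return all((f, c) in edges for f in fans for c in centers)


def is_dbg(edges: Set[Edge], fans: Set[str], centers: Set[str],
           p: int, q: int) -> bool:
    """Assumed DBG(p, q) condition: each fan cites at least p centers
    and each center is cited by at least q fans."""
    fans_ok = all(sum((f, c) in edges for c in centers) >= p for f in fans)
    centers_ok = all(sum((f, c) in edges for f in fans) >= q for c in centers)
    return fans_ok and centers_ok


# Toy graph: three fans, three centers, one edge missing from the complete graph.
fans = {"f1", "f2", "f3"}
centers = {"c1", "c2", "c3"}
complete = {(f, c) for f in fans for c in centers}
dense = complete - {("f1", "c1")}
```

Under this reading, `dense` is no longer a CBG and fails DBG(3, 3), but still satisfies DBG(2, 2).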
Algorithm for Extracting DBG [Reddy et al., 2001]
  Finds a bipartite graph using co-citing and co-cited Web pages
  Extracts a DBG from that graph
[Figure: DBG(3, 3) extracted around a seed page]
PlusDBG
  Uses a distance defined by the rate of co-citing pages between two pages
  Finds co-citing pages which are within a distance threshold
  Extracts a DBG from that graph
PlusDBG shows higher precision than DBG does.
Web Community Extraction
O High speed
O Finds topics across the Web
X May extract unrelated Web pages as a community
Problem of DBG
Improvement of PlusDBG
Proposal
1. Extracts Web communities using link structure.
2. Assigns remainders to the closest Web community in TF-IDF space.
Connecter
  A fan which cites two centers.
Connectable
  Two centers are connectable if they have more than two connecters.
Web Community
  A Web community C is a DBG composed of connectable centers and their connecters.
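A minimal sketch of these definitions, assuming the Web graph is stored as a mapping from each page to the set of pages it cites. The names `connecters` and `connectable` and the toy edges are hypothetical and chosen only for illustration.

```python
from typing import Dict, Set

# Hypothetical layout: page -> set of pages it cites.
Graph = Dict[str, Set[str]]


def connecters(out_links: Graph, c1: str, c2: str) -> Set[str]:
    """Fans that cite both centers c1 and c2."""
    return {fan for fan, cited in out_links.items()
            if c1 in cited and c2 in cited}


def connectable(out_links: Graph, c1: str, c2: str,
                min_connecters: int = 3) -> bool:
    """Two centers are connectable when they have more than two connecters."""
    return len(connecters(out_links, c1, c2)) >= min_connecters


# Toy graph: b, c and d cite both g and i; only a cites j.
out_links = {
    "a": {"g", "j"},
    "b": {"g", "i"},
    "c": {"g", "i"},
    "d": {"g", "i"},
}
```

Here g and i have three connecters and are connectable, while g and j have one and are not.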
Connectable centers, Connecter
Proposed Web Community
Every center is connectable to another one.
Proposed Web Community Extraction Algorithm
[Figure: example Web graph with pages a to j]
S = {}, T = {g}
S’ = {a, b, c, d}, T’ = {e, f, h, i, j}; t’ = j, # connecters = 1
T’ = {e, f, h, i}; t’ = i, # connecters = 3
S = {b, c, d}, T = {g, i}
Output community = {a, b, c, d, e, f, g, h, i}
Labeling Remainders
Remainder: a Web page which is not extracted as a member of any community.
1. Calculate the centroids of the Web communities.
2. Label each remainder with the ID of the closest Web community,
   where v_i is the TF-IDF vector of a page i.
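The two labeling steps can be sketched as a nearest-centroid assignment. The slides do not show which closeness measure is used, so cosine similarity is an assumption here, as are the sparse dictionary vectors and all function names.

```python
import math
from collections import defaultdict
from typing import Dict, List

Vector = Dict[str, float]  # term -> TF-IDF weight (hypothetical sparse layout)


def centroid(vectors: List[Vector]) -> Vector:
    """Step 1: component-wise mean of the member vectors of one community."""
    total: Dict[str, float] = defaultdict(float)
    for vec in vectors:
        for term, weight in vec.items():
            total[term] += weight
    return {term: weight / len(vectors) for term, weight in total.items()}


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity (assumed measure; the slides do not name one)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


def label_remainders(communities: Dict[int, List[Vector]],
                     remainders: Dict[str, Vector]) -> Dict[str, int]:
    """Step 2: give each remainder the ID of the most similar centroid."""
    centroids = {cid: centroid(vecs) for cid, vecs in communities.items()}
    return {page: max(centroids, key=lambda cid: cosine(vec, centroids[cid]))
            for page, vec in remainders.items()}


# Toy data: community 0 is about bikes, community 1 about movies.
communities = {
    0: [{"bike": 1.0, "engine": 0.5}, {"bike": 0.8}],
    1: [{"movie": 1.0}, {"movie": 0.7, "actor": 0.4}],
}
remainders = {"p": {"bike": 0.9}, "q": {"actor": 1.0}}
labels = label_remainders(communities, remainders)
```

With this toy data, page p is labeled with community 0 and page q with community 1.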
Agenda Introduction Related Work Proposal Evaluation (Preprocess, Web community extraction, Labeling result) Conclusion
Preprocess
Data set
  2.34 M pages, 20 M links
  Almost 80% of the data set is Japanese pages.
Create a link-only file
  Links pointing outside the data set are deleted.
  Duplicate pages which share 90% of their links are deleted.
  Pages including 50 links are deleted.
  Remaining data set: 1.45 M pages, 5.09 M links
Create a TF-IDF file
  Used TF-IDF:
  Parser: MeCab
  Terms which appear in less than 0.1% or more than 90% of all documents are removed.
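A small sketch of the TF-IDF step with the document-frequency filter. The exact weighting formula is not visible on the slide, so the common tf × log(N / df) form is an assumption. Documents are assumed to be already tokenized (MeCab is not called here), and the function name `tfidf` is hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf(docs: List[List[str]], min_df: float = 0.001,
          max_df: float = 0.9) -> List[Dict[str, float]]:
    """Build sparse TF-IDF vectors, dropping terms that appear in less than
    min_df or more than max_df of all documents."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vocab = {t for t, count in df.items() if min_df <= count / n <= max_df}
    vectors = []
    for doc in docs:
        tf = Counter(t for t in doc if t in vocab)
        # Assumed weighting: raw term frequency times log inverse document frequency.
        vectors.append({t: f * math.log(n / df[t]) for t, f in tf.items()})
    return vectors


# Toy corpus: "web" occurs in every document, so the 90% cut-off removes it.
docs = [["bike", "web"], ["bike", "web", "movie"], ["web", "engine"]]
vectors = tfidf(docs)
```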
Distribution of Web Community Size
Method             # communities    # extracted pages
PlusDBG 0.8        22,902           865,945
PlusDBG 1.0        8,077            922,053
PlusDBG 1.2        7,527            923,100
Proposed method    50,065           648,626
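As a rough reading of the figures above, the mean community size can be taken as extracted pages divided by communities. This treats each extracted page as counted once, which is an assumption; the slides report only the totals.

```python
# (number of communities, number of extracted pages) as reported above.
results = {
    "PlusDBG 0.8": (22_902, 865_945),
    "PlusDBG 1.0": (8_077, 922_053),
    "PlusDBG 1.2": (7_527, 923_100),
    "Proposed method": (50_065, 648_626),
}

# Mean pages per community for each method.
mean_size = {name: pages / comms for name, (comms, pages) in results.items()}

for name, size in mean_size.items():
    print(f"{name}: {size:.1f} pages per community")
```

Under that assumption the proposed method averages about 13 pages per community, against roughly 38 to 123 for the PlusDBG settings.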
Distance from centroids to term vectors
Variance of distance
Example of Web communities
About motorbike manufacturers and links.
http://bike.ak-m.jp/
http://www.bike-cube.jp/
http://bike.ak-m.jp/2006/01/post_32.html
http://www.bike-cube.jp/index.php
http://bike.ak-m.jp/2006/11/post_20.html
http://www.kymco.co.jp/
http://www1.suzuki.co.jp/motor/
http://www.yamaha-motor.jp/mc/
http://bike.ak-m.jp/
http://www.peugeot-moto.com/
http://www.apriliajapan.co.jp/index.html
http://www.buell.jp/
http://www.cagiva.co.jp/
http://www.mitsuoka-motor.com/
http://www.ducati.com/od/ducatijapan/jp/index.jhtml
http://www.triumphmotorcycles.com/japan/
http://www.harley-davidson.co.jp/index.html
http://www.ktm-japan.co.jp/
Comparing to ODP
Definition of precision
1. For a Web community C, let O_C be the subset of its pages that exist in ODP.
2. If |O_C| < 3, the precision of C is undefined.
3. For r in O_C, the Pscore of r is:
4. With Pscore, the precision of C is:
Communities are compared to the 4th and 5th levels of ODP directories (Top/Regional/Japan/Arts/Movie).
The number of ODP pages included in the data set: 47,093
score(p, q) = 1 if p and q are in the same directory
score(p, q) = 0 otherwise
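The score function can be sketched as a comparison of ODP paths truncated to a given level. This assumes that "Top" counts as the first level, so that Top/Regional/Japan/Arts/Movie is a 5th-level directory. The helper names and the second path are hypothetical, and the Pscore and precision formulas themselves are not reproduced here.

```python
def directory_at_level(path: str, level: int) -> str:
    """Keep the first `level` components of an ODP path
    (assumption: 'Top' is level 1)."""
    return "/".join(path.split("/")[:level])


def score(p_dir: str, q_dir: str, level: int) -> int:
    """1 if both pages fall in the same directory at the given level, else 0."""
    same = directory_at_level(p_dir, level) == directory_at_level(q_dir, level)
    return 1 if same else 0


movie = "Top/Regional/Japan/Arts/Movie"
music = "Top/Regional/Japan/Arts/Music"  # hypothetical sibling directory
```

With these two paths, the pages score 1 at the 4th level (both under Top/Regional/Japan/Arts) and 0 at the 5th.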
Comparing to ODP
Method             # pages of ODP    # communities including ODP pages    # directories which the pages belong to
PlusDBG 0.8        23,287            459                                  426
PlusDBG 1.0        25,016            156                                  430
PlusDBG 1.2        25,405            81                                   435
Proposed method    12,406            4,811                                337
Precision of Web Communities (4th level)
Precision of Web Communities (5th level)
Summary of Web Community Extraction
  The proposed method extracted smaller Web communities than PlusDBG did.
  Members of each community were closer to the centroid in TF-IDF space than members of PlusDBG communities were.
  The proposed communities showed higher precision than PlusDBG's when compared to ODP.
Labeling Result
  Pages containing fewer than 10 terms are ignored.
  The result is compared to ODP.
    ODP pages: 29,153
    ODP directories: 1,862
Labeling Result (the 4th level)
Labeling Result (the 5th level)
Labeling example
Summary and Conclusion
A DBG structure is defined as the Web community topology.
  Any two centers should be connectable.
  Every fan is a connecter of centers.
  The proposed DBG structure extracts more compact and more precise Web communities than existing work does.
Clustering based on Web community extraction is proposed.
  The centroids of communities in TF-IDF space are used in labeling the remainders.
  The clustering result showed meaningful page groups.
Future Work
  Coupling feature selection to improve the labeling result.
  Clustering the extracted centroids.
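Publications (scheduled)
  "Improvement of a Web Community Extraction Algorithm" (ウェブコミュニティ抽出アルゴリズムの改良), Shim Wonbo, Kenjiro Taura, Takashi Chikayama, Data Engineering Workshop (DEWS), 2007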
Thank you for your attention
Extraction Algorithm
1. Select a seed page t and set T = {t}, S = {}.
2. Find S’, whose members cite any page in T.
3. Find T’, whose members are cited by any page in S’ and are not in T.
4. Determine whether t’ ∈ T’ is connectable to all pages in T.
   1. If t’ is connectable, set T = T ∪ {t’} and S = {connecters}, and go to 2.
   2. If not, select another t’ ∈ T’ and go to 4.
5. If |S| > 3 and |T| > 3, extract the page set as a Web community and delete it from the Web graph.
6. If any seed page t remains, go to 1.
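The extraction steps can be sketched as follows. This is not the author's implementation, and several points are assumptions:
- Step 3 is read as collecting the pages cited by S’, which matches the example where T’ holds pages co-cited with g.
- S is taken to be the fans citing at least two centers in T.
- Candidates and seeds are tried in sorted order.
- The graph is a hypothetical mapping from each page to the pages it cites.

```python
from typing import Dict, List, Set, Tuple

Graph = Dict[str, Set[str]]  # page -> pages it cites (hypothetical layout)


def shared_fans(graph: Graph, c1: str, c2: str) -> Set[str]:
    """Connecters of c1 and c2: pages that cite both."""
    return {f for f, cited in graph.items() if c1 in cited and c2 in cited}


def grow_community(graph: Graph, seed: str,
                   min_connecters: int = 3) -> Tuple[Set[str], Set[str]]:
    """Steps 1 to 4: grow the center set T from a seed and return (S, T)."""
    centers: Set[str] = {seed}   # T
    fans: Set[str] = set()       # S
    grew = True
    while grew:
        grew = False
        # Step 2: S' = pages citing any current center.
        citing = {f for f, cited in graph.items() if cited & centers}
        # Step 3 (assumed reading): T' = pages cited by S' that are not in T.
        candidates = set().union(*(graph[f] for f in citing)) - centers
        # Step 4: accept the first candidate connectable to every center in T.
        for cand in sorted(candidates):
            if all(len(shared_fans(graph, cand, c)) >= min_connecters
                   for c in centers):
                centers.add(cand)
                # Assumption: S = fans citing at least two centers of T.
                fans = {f for f, cited in graph.items()
                        if len(cited & centers) >= 2}
                grew = True
                break
    return fans, centers


def extract_communities(graph: Graph,
                        min_size: int = 3) -> List[Tuple[Set[str], Set[str]]]:
    """Steps 5 and 6: keep communities with |S| > min_size and |T| > min_size,
    remove them from a working copy of the graph, and continue with the
    remaining seeds."""
    graph = {p: set(cited) for p, cited in graph.items()}
    found = []
    for seed in sorted({c for cited in graph.values() for c in cited}):
        if not any(seed in cited for cited in graph.values()):
            continue  # seed was removed together with an earlier community
        fans, centers = grow_community(graph, seed)
        if len(fans) > min_size and len(centers) > min_size:
            found.append((fans, centers))
            members = fans | centers
            graph = {p: cited - members for p, cited in graph.items()
                     if p not in members}
    return found


# Toy graph with hypothetical edges: b, c, d cite g and i; a cites g and j.
toy = {"a": {"g", "j"}, "b": {"g", "i"}, "c": {"g", "i"}, "d": {"g", "i"}}
S, T = grow_community(toy, "g")  # S == {"b", "c", "d"}, T == {"g", "i"}
```

On the toy graph, i is accepted with three connecters and j is rejected with one. This small graph is below the size threshold of step 5, so `extract_communities(toy)` returns no community.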