![Page 1: 1 Extending Link-based Algorithms for Similar Web Pages with Neighborhood Structure Allen, Zhenjiang LIN CSE, CUHK 13 Dec 2006](https://reader030.vdocuments.us/reader030/viewer/2022032521/56649d575503460f94a369c3/html5/thumbnails/1.jpg)
Extending Link-based Algorithms for Similar Web Pages
with Neighborhood Structure
Allen, Zhenjiang LIN CSE, CUHK
13 Dec 2006
Outline
1. Introduction
2. Extended Neighborhood Structure Model
3. Extending Link-based Similarity Measures
4. Experimental Results
5. Conclusion and Future Work
1. Introduction
Background
Similarity measures are required in many web applications to evaluate the similarity between web pages, for example:
the “similar pages” service of Web search engines;
Web document classification;
Web community identification.
Problem
Many link-based similarity measures are not very accurate because they consider only part of the structural information.
1. Introduction
Motivation
How can the accuracy of link-based similarity measures be improved by making full use of the structural information?
Contributions
Propose the Extended Neighborhood Structure (ENS) model: bi-direction, multi-hop.
Construct extended link-based similarity measures based on the ENS model: more flexible and accurate.
1. Introduction
Searching the Web
Keyword searching: KEYWORDS: news → Search Engine → http://news.bbc.co.uk/, http://www.cnn.com/, …
Similarity searching: URL: www.cnn.com → Search Engine (similarity measure) → http://news.bbc.co.uk/, http://usnews.com/, …
1. Introduction
Similarity measures
Evaluate how similar or related two objects are.
Approaches to measuring similarity
Text-based: Cosine TFIDF [Joachims97]
Link-based (focus of this talk): Bibliographic coupling [Kessler63], Co-citation [Small73], SimRank [Jeh et al 02], PageSim [Lin et al 06]
Hybrid
2. Extended Neighborhood Structure Model
Extended Neighborhood Structure (ENS) model
Question: what is hidden in hyperlinks?
A similarity relationship between pages.
The similarity relationship decreases along hyperlinks.
2. Extended Neighborhood Structure Model
Extended Neighborhood Structure (ENS) model
The ENS model is:
bi-direction: in-link, out-link;
multi-hop: direct (1-hop), indirect (2-hop, 3-hop, etc).
Purpose
Improve the accuracy of link-based similarity measures by helping them make full use of the structural information of the Web.
3. Extending Link-based Similarity Measures
Intuition of similarity
Similar web pages have similar neighbors (to compare two web pages, look at their neighbors).
Notations
G = (V, E), |V| = n: the web graph.
I(a) / O(a): in-link / out-link neighbors of web page a.
path(a_1, a_s): a sequence of vertices a_1, a_2, …, a_s such that (a_i, a_{i+1}) ∈ E (i = 1, …, s-1) and the a_i are distinct.
PATH(a,b): the set of all possible paths from page a to page b.
Sim(a,b): similarity score of web pages a and b.
3. Extending Link-based Similarity Measures
Two classical methods
Co-citation: the more common in-link neighbors, the more similar.
Sim(a,b) = |I(a)∩I(b)|
Bibliographic coupling: the more common out-link neighbors, the more similar.
Sim(a,b) = |O(a)∩O(b)|
Extended Co-citation and Bibliographic Coupling (ECBC)
ECBC: the more common neighbors, the more similar.
Sim(a,b) = α|I(a)∩I(b)| + (1-α)|O(a)∩O(b)|, where 0≤α≤1 is a constant.
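The three counting measures on this slide can be sketched in a few lines of Python. This is only an illustration: the helper names (`build_neighbors`, `co_citation`, `bibliographic_coupling`, `ecbc`) and the toy edge list are assumptions made for the example, not part of the talk.

```python
from collections import defaultdict

def build_neighbors(edges):
    """Return (I, O): in-link and out-link neighbor sets of every page."""
    I, O = defaultdict(set), defaultdict(set)
    for src, dst in edges:
        O[src].add(dst)   # src links to dst
        I[dst].add(src)   # dst is linked to by src
    return I, O

def co_citation(I, a, b):
    """Sim(a,b) = |I(a) ∩ I(b)|: number of common in-link neighbors."""
    return len(I[a] & I[b])

def bibliographic_coupling(O, a, b):
    """Sim(a,b) = |O(a) ∩ O(b)|: number of common out-link neighbors."""
    return len(O[a] & O[b])

def ecbc(I, O, a, b, alpha=0.5):
    """Sim(a,b) = alpha*|I(a) ∩ I(b)| + (1-alpha)*|O(a) ∩ O(b)|, 0 <= alpha <= 1."""
    return alpha * co_citation(I, a, b) + (1 - alpha) * bibliographic_coupling(O, a, b)

# Hypothetical toy graph: p and q both link to a and b; a and b both link to x.
edges = [("p", "a"), ("p", "b"), ("q", "a"), ("q", "b"),
         ("a", "x"), ("b", "x"), ("b", "y")]
I, O = build_neighbors(edges)
print(co_citation(I, "a", "b"), bibliographic_coupling(O, "a", "b"), ecbc(I, O, "a", "b"))
```

Setting `alpha=1` reduces ECBC to Co-citation and `alpha=0` reduces it to Bibliographic coupling, which is what makes the extended measure the more flexible of the three.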
3. Extending Link-based Similarity Measures
SimRank
“two pages are similar if they are linked to by similar pages”
Base cases: (1) Sim(u,u) = 1; (2) Sim(u,v) = 0 if |I(u)| |I(v)| = 0.
Recursive definition:
$$\mathrm{Sim}(u,v) = \frac{C}{|I(u)|\,|I(v)|} \sum_{a \in I(u)} \sum_{b \in I(v)} \mathrm{Sim}(a,b)$$
C is a constant between 0 and 1.
The iteration starts with Sim(u,u) = 1 and Sim(u,v) = 0 if u ≠ v.
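A minimal sketch of the SimRank iteration, assuming the standard recursion Sim(u,v) = C / (|I(u)||I(v)|) · Σ Sim(a,b) over a ∈ I(u), b ∈ I(v), with the base cases listed on this slide. The function name `simrank`, the default `C=0.8`, the fixed iteration count and the toy graph are choices made for the example only.

```python
from collections import defaultdict
from itertools import product

def simrank(edges, C=0.8, iterations=10):
    """Naive all-pairs SimRank on a small directed graph given as (src, dst) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    I = defaultdict(set)
    for src, dst in edges:
        I[dst].add(src)
    # Start: Sim(u,u) = 1, Sim(u,v) = 0 for u != v.
    sim = {(u, v): 1.0 if u == v else 0.0 for u, v in product(nodes, repeat=2)}
    for _ in range(iterations):
        new = {}
        for u, v in product(nodes, repeat=2):
            if u == v:
                new[(u, v)] = 1.0
            elif not I[u] or not I[v]:
                new[(u, v)] = 0.0          # base case: |I(u)||I(v)| = 0
            else:
                total = sum(sim[(a, b)] for a in I[u] for b in I[v])
                new[(u, v)] = C * total / (len(I[u]) * len(I[v]))
        sim = new
    return sim

# Hypothetical example: page p links to both a and b.
sim = simrank([("p", "a"), ("p", "b")])
print(sim[("a", "b")])
```

This all-pairs form is quadratic in the number of pages per iteration, so it is only meant to make the recursion concrete on small graphs.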
3. Extending Link-based Similarity Measures
Extended SimRank
“two pages are similar if they have similar neighbors”
Base cases: (1) Sim(u,u) = 1; (2) Sim(u,v) = 0 if |I(u)| |I(v)| = 0.
Recursive definition:
$$\mathrm{Sim}(u,v) = C \left[ \frac{\sum_{a \in I(u)} \sum_{b \in I(v)} \mathrm{Sim}(a,b)}{|I(u)|\,|I(v)|} + \frac{\sum_{a \in O(u)} \sum_{b \in O(v)} \mathrm{Sim}(a,b)}{|O(u)|\,|O(v)|} \right]$$
C is a constant between 0 and 1.
The iteration starts with Sim(u,u) = 1 and Sim(u,v) = 0 if u ≠ v.
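The bi-directional idea can be sketched as follows, assuming the extended score simply adds an out-link term of the same form to SimRank's in-link term, both scaled by C. Two further assumptions of this sketch, not statements from the talk: a term whose denominator is zero contributes 0, and `C=0.5` is used so that scores stay at or below 1. The name `extended_simrank` is hypothetical.

```python
from collections import defaultdict
from itertools import product

def extended_simrank(edges, C=0.5, iterations=10):
    """Sketch of a bi-directional SimRank: in-link term plus out-link term."""
    nodes = sorted({n for edge in edges for n in edge})
    I, O = defaultdict(set), defaultdict(set)
    for src, dst in edges:
        O[src].add(dst)
        I[dst].add(src)

    def avg_sim(sim, A, B):
        # Average pairwise similarity of two neighbor sets (assumed 0 if either is empty).
        if not A or not B:
            return 0.0
        return sum(sim[(a, b)] for a in A for b in B) / (len(A) * len(B))

    sim = {(u, v): 1.0 if u == v else 0.0 for u, v in product(nodes, repeat=2)}
    for _ in range(iterations):
        new = {}
        for u, v in product(nodes, repeat=2):
            if u == v:
                new[(u, v)] = 1.0
            else:
                new[(u, v)] = C * (avg_sim(sim, I[u], I[v]) + avg_sim(sim, O[u], O[v]))
        sim = new
    return sim

# Hypothetical example: a and b have no in-links but both link to x.
esr = extended_simrank([("a", "x"), ("b", "x")])
print(esr[("a", "b")])
```

In this toy graph an in-link-only measure would score the pair (a, b) as 0 because neither page has in-links, while the out-link term lets the sketch pick up their shared out-link neighbor x.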
3. Extending Link-based Similarity Measures
PageSim
A “weighted multi-hop” version of the Co-citation algorithm. It takes into account:
(a) multi-hop in-link information, and
(b) the importance of web pages, which can be represented by any global scoring system, e.g. PageRank scores or the authority scores of HITS.
3. Extending Link-based Similarity Measures
PageSim (phase 1: feature propagation)
Initially, each web page contains unique feature information, which is represented by its PageRank (PR) score.
The feature information of a web page is propagated along out-link hyperlinks at decay rate d.
The PR score of u propagated to v is defined by [formula not preserved in this transcript].
3. Extending Link-based Similarity Measures
PageSim (phase 2: similarity computation)
A web page v stores its own feature information and that of other pages in its Feature Vector FV(v).
The similarity between web pages u and v is computed by the Jaccard measure [Jain et al 88].
Intuition: the more common feature information two web pages contain, the more similar they are.
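The two PageSim phases can be sketched roughly as below. The exact propagation and similarity formulas are not preserved in this transcript, so everything specific here is an assumption of the sketch: at each hop a page's score is multiplied by the decay rate d and split equally among its out-links, propagation follows paths with distinct vertices up to `max_hops`, and the Jaccard measure is taken in its weighted form (sum of minima over sum of maxima). The page scores are made-up stand-ins for PageRank values.

```python
def propagate(O, score, d=0.5, max_hops=2):
    """Phase 1 (assumed form): FV[v][u] is the amount of u's score that reaches v."""
    FV = {v: {v: s} for v, s in score.items()}   # each page starts with its own score

    def walk(source, node, amount, visited, hops):
        if hops == max_hops or not O.get(node):
            return
        share = d * amount / len(O[node])         # decay, then split among out-links
        for nxt in O[node]:
            if nxt in visited:                    # keep vertices on a path distinct
                continue
            FV[nxt][source] = FV[nxt].get(source, 0.0) + share
            walk(source, nxt, share, visited | {nxt}, hops + 1)

    for u in score:
        walk(u, u, score[u], {u}, 0)
    return FV

def jaccard(fu, fv):
    """Phase 2 (assumed form): weighted Jaccard of two sparse feature vectors."""
    keys = set(fu) | set(fv)
    num = sum(min(fu.get(k, 0.0), fv.get(k, 0.0)) for k in keys)
    den = sum(max(fu.get(k, 0.0), fv.get(k, 0.0)) for k in keys)
    return num / den if den else 0.0

# Hypothetical graph and scores: p links to a and b; every page must appear in `score`.
O = {"p": ["a", "b"], "a": [], "b": []}
score = {"p": 0.6, "a": 0.2, "b": 0.2}
FV = propagate(O, score)
sim_ab = jaccard(FV["a"], FV["b"])
print(FV["a"], sim_ab)
```

Here a and b each receive 0.15 of p's score, so the only feature they share is the one that originated at p, which matches the slide's intuition that shared feature information drives similarity.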
3. Extending Link-based Similarity Measures
Extended PageSim (EPS)
Propagate the feature information of web pages along in-link hyperlinks at decay rate 1 - d.
Compute the in-link PS scores.
EPS(u,v) = in-link PS(u,v) + out-link PS(u,v).
3. Extending Link-based Similarity Measures
Properties
CC: Co-citation, BC: Bibliographic Coupling, ECBC: Extended Co-citation and Bibliographic Coupling, SR: SimRank, ESR: Extended SimRank, PS: PageSim, EPS: Extended PageSim.
Summary
The extended versions consider more structural information.
ESR and EPS are bi-directional and multi-hop.
In ESR, two web pages are not similar unless there are intermediate pages between them, even if they link to each other (see Figure 1(2)).
3. Extending Link-based Similarity Measures
Case study: Sim(a,b)
Summary
The extended algorithms are more flexible.
EPS is able to handle more cases.
4. Experimental Results
Datasets
CSE Web (CW) dataset:
A set of web pages crawled from http://cse.cuhk.edu.hk.
22,000 pages, 180,000 hyperlinks.
The average numbers of in-links and out-links are 8.6 and 7.7, respectively.
Google Scholar (GS) dataset:
A set of articles crawled from the Google Scholar search engine.
Crawling starts by submitting the keywords “web mining” to GS and then follows the “Cited by” hyperlinks.
20,000 articles, 154,000 citations.
4. Experimental Results
Evaluation Methods
Cosine TFIDF similarity (for the CW dataset):
A commonly used text-based similarity measure.
“Related Articles” (for the GS dataset):
A list of articles related to a query article, provided by GS. Can be used as ground truth.
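Both evaluation ideas can be illustrated with a short sketch. The talk does not say which TFIDF weighting or which precision definition was used, so the choices here are assumptions: raw term frequency times log(N/df) for TFIDF, and precision at N as the fraction of the top N results that appear in the ground-truth list. The documents and rankings are invented for the example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map each tokenized document to a sparse {term: tf * log(N / df)} vector."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [{t: tf * math.log(n / df[t]) for t, tf in Counter(doc).items()}
            for doc in docs]

def cosine(x, y):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * y.get(t, 0.0) for t, w in x.items())
    norm_x = math.sqrt(sum(w * w for w in x.values()))
    norm_y = math.sqrt(sum(w * w for w in y.values()))
    return dot / (norm_x * norm_y) if norm_x and norm_y else 0.0

def precision_at_n(ranked, relevant, n):
    """Fraction of the top-n ranked items that are in the ground-truth set."""
    if n == 0:
        return 0.0
    return sum(1 for item in ranked[:n] if item in relevant) / n

# Hypothetical tokenized pages.
docs = [["web", "mining", "link"], ["web", "mining", "graph"], ["cooking", "recipes"]]
v0, v1, v2 = tfidf_vectors(docs)
print(cosine(v0, v1), cosine(v0, v2))

# Hypothetical ranking returned by a similarity measure, scored against a related-articles list.
print(precision_at_n(["a", "b", "c", "d"], {"a", "c"}, 2))
```

For a link-based measure, the top N pages it returns for a query page would be scored with one of these two functions and then averaged over all query pages.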
Parameter Settings
4. Experimental Results
CC, BC vs ECBC
CW data (left): x-axis: top N results; y-axis: average cosine TFIDF of all pages.
GS data (right): x-axis: top N results; y-axis: average precision of all pages.
4. Experimental Results
SimRank vs Extended SimRank
CW data (left): x-axis: top N results; y-axis: average cosine TFIDF of all pages.
GS data (right): x-axis: top N results; y-axis: average precision of all pages.
4. Experimental Results
PageSim vs Extended PageSim
CW data (left): x-axis: top N results; y-axis: average cosine TFIDF of all pages.
GS data (right): x-axis: top N results; y-axis: average precision of all pages.
4. Experimental Results
Overall Accuracy of Algorithms
5. Conclusion and Future Work
Conclusion
Extended Neighborhood Structure model: bi-direction and multi-hop.
Extended existing link-based similarity measures: Co-citation, Bibliographic coupling, SimRank, PageSim.
Experiments.
Future Work
Extend link-based algorithms based on the ENS model.
Prove the convergence of the Extended SimRank.
Integrate link-based measures with text-based ones.
Publications
Z. Lin, M. R. Lyu, and I. King. PageSim: A novel link-based measure of web page similarity. In WWW '06: Proceedings of the 15th international conference on World Wide Web. Pages 1019-1020, Edinburgh, Scotland, 2006.
Z. Lin, I. King, and M. R. Lyu. PageSim: A novel link-based similarity measure for the World Wide Web. In WI ’06: Proceedings of the 5th International Conference on Web Intelligence. ACM Press. To appear, 2006.
Z. Lin, M. R. Lyu, and I. King. Extending Link-based Algorithms for Similar Web Pages with Neighborhood Structure. Submitted to WWW’07.