Using Topology to Identify Spam (SIGIR 2007)
TRANSCRIPT
Web Spam Detection
C. Castillo, D. Donato, A. Gionis,
V. Murdock, F. Silvestri
Spam on the Web
Detecting Web Spam
Link-Based Detection
Content-Based Detection
Using Links and Contents
Using the Web Topology
Conclusions
Know your Neighbors: Web Spam Detection Using the Web Topology
Carlos Castillo (1), Debora Donato (1), Aristides Gionis (1), Vanessa Murdock (1), Fabrizio Silvestri (2)
1. Yahoo! Research Barcelona – Catalunya, Spain
2. ISTI-CNR – Pisa, Italy
ACM SIGIR, 25 July 2007, Amsterdam
1 Spam on the Web
2 Detecting Web Spam
3 Link-Based Detection
4 Content-Based Detection
5 Using Links and Contents
6 Using the Web Topology
7 Conclusions
What is on the Web?
What is on the Web [2.0]?
What else is on the Web?
Source: www.milliondollarhomepage.com
What’s happening on the Web?
There is fierce competition for your attention
What’s happening on the Web?
Search engines are, to some extent, arbiters of this competition,
and they must watch it closely, otherwise ...
Some cheating occurs
1986 FIFA World Cup, Argentina vs England
Simple web spam
Hidden text
Made for advertising
Search engine?
Fake search engine
“Normal” content in link farms
There are many attempts at cheating on the Web
Most of these are spam:
1,630,000 results for “free mp3 hilton viagra” in SE1
1,760,000 results for “credit vicodin loan” in SE2
1,320,000 results for “porn mortgage” in SE3
Costs
- Costs for users: lower precision for some queries
- Costs for search engines: wasted storage space, network resources, and processing cycles
- Costs for the publishers: resources invested in cheating and not in improving their contents
Cheating on the Web
Link spam
Content spam
Spam-oriented blogging
Comment/forum/Wiki spam
Malicious cloaking
Click fraud ×2
Malicious tagging
. . . more?
Adversarial relationship
Every undeserved gain in ranking for a spammer is a loss of precision for the search engine.
Research on Web spam detection
Web spam detection techniques
Spam, damn spam and statistics
[Fetterly et al., 2004] propose to study statistical distributions: “in a number of these distributions, outlier values are associated with web spam”
Machine learning training
Machine learning
Challenges
Scalability
Machine Learning Challenges:
- Instances are not really independent (graph)
- Training set is relatively small
Information Retrieval Challenges:
- It is hard to find out which features are relevant
- Features can be aggregated in content units: page/host/domain
- Features can be propagated through the graph
Training data
- It is hard for search engines to provide labeled data
- Even if they do, it will not reflect a consensus on what is Web Spam
- Public Web Spam collection built by a group of volunteers: http://www.yr-bcn.es/webspam/
“Link farms”
[Figure: a link farm inside the Web graph, with its links pointing to a spam page.]
Single-level farms can be detected by searching for groups of nodes sharing their out-links [Gibson et al., 2005]
Handling large graphs
Memory size enough to hold some data per-node
Disk size enough to hold some data per-edge
A small number of passes over the data
Semi-streaming model
for node : 1 . . . N do
    INITIALIZE-MEM(node)
end for
for distance : 1 . . . d do {Iteration step}
    for src : 1 . . . N do {Follow links in the graph}
        for all links from src to dest do
            COMPUTE(src, dest)
        end for
    end for
    NORMALIZE
end for
POST-PROCESS
return Something
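The loop structure above can be sketched in Python with a PageRank-style update as the COMPUTE step. This is only an illustration under stated assumptions: the list `edges` stands in for the edge file that would be read sequentially from disk, and the function name, the parameters `iterations` and `alpha`, and the handling of nodes without out-links are choices made for this example rather than details from the slides.

```python
def semi_streaming_pagerank(num_nodes, edges, iterations=20, alpha=0.85):
    """Keep one value per node in memory; re-read the edges once per iteration."""
    out_degree = [0] * num_nodes
    for src, _ in edges:                      # one pass to count out-links
        out_degree[src] += 1

    score = [1.0 / num_nodes] * num_nodes     # INITIALIZE-MEM for every node
    for _ in range(iterations):               # iteration step
        nxt = [0.0] * num_nodes
        for src, dest in edges:               # sequential pass over the edges
            nxt[dest] += alpha * score[src] / out_degree[src]   # COMPUTE
        # Assumption: mass held by nodes without out-links is spread uniformly.
        dangling = sum(score[n] for n in range(num_nodes) if out_degree[n] == 0)
        base = (1.0 - alpha + alpha * dangling) / num_nodes
        score = [value + base for value in nxt]                 # NORMALIZE
    return score                               # no POST-PROCESS in this sketch


if __name__ == "__main__":
    print(semi_streaming_pagerank(3, [(0, 1), (1, 2), (2, 0)]))
```

Memory use is proportional to the number of nodes, while the edges are only ever scanned in order, which matches the constraints listed for handling large graphs.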
Link-based features
Degree-related measures
PageRank
TrustRank [Gyongyi et al., 2004]
Truncated PageRank [Becchetti et al., 2006]
Estimation of supporters [Becchetti et al., 2006]
140 features per host (2 pages per host)
Degree-Based
[Figure: two histograms of degree-based features, comparing normal and spam hosts.]
TrustRank
[Gyongyi et al., 2004]
TrustRank / PageRank
[Figure: histogram of TrustRank / PageRank, comparing normal and spam hosts.]
Truncated PageRank
Proposed in [Becchetti et al., 2006]. Idea: reduce the direct contribution of the first levels of links:
\[ \mathrm{damping}(t) = \begin{cases} 0 & t \le T \\ C\alpha^{t} & t > T \end{cases} \]
- No extra reading of the graph after PageRank
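A small sketch may help show how the damping function changes the ranking. It assumes the usual view of PageRank as a sum over path lengths t, where the term for length t is weighted by damping(t); contributions from paths of length t ≤ T are simply skipped. The function name, the `max_t` cutoff, dropping the mass of nodes without out-links, and choosing C so that the scores sum to 1 are all assumptions of this example, not details from the slides.

```python
def truncated_pagerank(num_nodes, edges, T=2, alpha=0.85, max_t=50):
    """Accumulate alpha**t-weighted walk mass, ignoring path lengths t <= T."""
    out_degree = [0] * num_nodes
    for src, _ in edges:
        out_degree[src] += 1

    # walk[v] is the mass arriving at v over paths of length exactly t,
    # starting from a uniform distribution over all nodes.
    walk = [1.0 / num_nodes] * num_nodes
    score = [0.0] * num_nodes
    for t in range(max_t + 1):
        if t > T:                               # damping(t) = 0 for t <= T
            weight = alpha ** t                 # damping(t) = C * alpha**t for t > T
            score = [s + weight * w for s, w in zip(score, walk)]
        nxt = [0.0] * num_nodes
        for src, dest in edges:                 # same edge pass PageRank would use
            nxt[dest] += walk[src] / out_degree[src]
        walk = nxt

    total = sum(score)                          # assumption: C normalizes the sum to 1
    return [s / total for s in score] if total > 0 else score


if __name__ == "__main__":
    star = [(1, 0), (2, 0), (3, 0)]             # node 0 only has direct in-links
    print(truncated_pagerank(4, star, T=0))     # direct links still count
    print(truncated_pagerank(4, star, T=1))     # direct links are ignored
```

In the star example, node 0 is supported only by direct in-links, so its score vanishes once the first level of links is truncated; this is the kind of support a link farm typically provides.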
Hop-plot
High and low-ranked pages are different
[Figure: number of nodes (×10^4) at each distance from 1 to 20, for pages ranked in the top 0%–10%, top 40%–50%, and top 60%–70%.]
Areas below the curves are equal if we are in the same strongly-connected component
Probabilistic counting
[Figure: bit vectors attached to pages are propagated along links toward the target page using the “OR” operation; the bits set at the target page are counted to estimate its supporters.]
[Becchetti et al., 2006] shows an improvement of the ANF algorithm [Palmer et al., 2002] based on probabilistic counting [Flajolet and Martin, 1985]
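The propagation idea can be illustrated with a short sketch. It uses the classic Flajolet-Martin estimator with a single bit mask per node, which is a simplification: it is not the improved estimator of [Becchetti et al., 2006], and a single mask gives a very noisy estimate. The function names and the 32-bit mask width are assumptions of this example.

```python
import random


def random_mask(rng, num_bits=32):
    """Flajolet-Martin style mask: bit i is selected with probability 2**-(i+1)."""
    i = 0
    while i < num_bits - 1 and rng.random() < 0.5:
        i += 1
    return 1 << i


def propagate(masks, edges, distance):
    """OR masks along links; after `distance` passes, masks[x] combines the
    masks of x and of every node that reaches x within `distance` links."""
    for _ in range(distance):
        nxt = list(masks)
        for src, dest in edges:
            nxt[dest] |= masks[src]
        masks = nxt
    return masks


def estimate_count(mask):
    """Classic estimate from the position of the lowest unset bit."""
    r = 0
    while mask & (1 << r):
        r += 1
    return (2 ** r) / 0.77351


if __name__ == "__main__":
    rng = random.Random(0)
    edges = [(0, 1), (1, 2)]
    masks = [random_mask(rng) for _ in range(3)]
    print([estimate_count(m) for m in propagate(masks, edges, 2)])
```

Because OR is idempotent, a supporter reached through several paths is only counted once, which is what makes one pass over the links per distance step sufficient.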
Bottleneck number
\[ b_d(x) = \min_{j \le d} \left\{ \frac{|N_j(x)|}{|N_{j-1}(x)|} \right\} \]
Minimum rate of growth of the neighbors of x up to a certain distance.
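As a worked illustration, the bottleneck number can be computed exactly on a small graph with a breadth-first search over in-links. The sketch assumes that N_j(x) is the set of nodes that reach x in at most j links and that it includes x itself, so that |N_0(x)| = 1; both points are assumptions of this example, as are the function names and the `in_links` dictionary layout.

```python
from collections import deque


def neighborhood_sizes(in_links, x, d):
    """Return [|N_0(x)|, ..., |N_d(x)|], counting nodes within j links of x."""
    dist = {x: 0}
    queue = deque([x])
    while queue:
        node = queue.popleft()
        if dist[node] == d:
            continue                            # do not explore beyond distance d
        for pred in in_links.get(node, ()):
            if pred not in dist:
                dist[pred] = dist[node] + 1
                queue.append(pred)
    sizes = [0] * (d + 1)
    for j in dist.values():
        sizes[j] += 1
    for j in range(1, d + 1):                   # turn per-distance counts into cumulative ones
        sizes[j] += sizes[j - 1]
    return sizes


def bottleneck_number(sizes):
    """Minimum growth ratio |N_j| / |N_{j-1}| for j = 1..d (requires d >= 1)."""
    return min(sizes[j] / sizes[j - 1] for j in range(1, len(sizes)))


if __name__ == "__main__":
    in_links = {"x": ["a", "b"], "a": ["c"], "b": ["c"]}
    sizes = neighborhood_sizes(in_links, "x", 3)
    print(sizes, bottleneck_number(sizes))
```

On a real Web graph the exact search would be replaced by an estimate of the neighborhood sizes, since a breadth-first search per node does not scale.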
Bottleneck number: spam
Bottleneck number: normal
Bottleneck number
\[ b_d(x) = \min_{j \le d} \left\{ \frac{|N_j(x)|}{|N_{j-1}(x)|} \right\} \]
[Figure: histogram of the bottleneck number, comparing normal and spam hosts.]
Content-Based Features
Most of the features reported in [Ntoulas et al., 2006]
Number of words in the page and title
Average word length
Fraction of anchor text
Fraction of visible text
Compression rate
Corpus precision and corpus recall
Query precision and query recall
Independent trigram likelihood
Entropy of trigrams
96 features per host
Content-based features (entropy related)
$T = \{(w_1, p_1), \ldots, (w_k, p_k)\}$: the set of trigrams in a page, where trigram $w_i$ has frequency $p_i$
Features:
Entropy of trigrams:
\[ H = -\sum_{w_i \in T} p_i \log p_i \]
Also, compression rate, as measured by bzip
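Both features are easy to sketch with the Python standard library. The example below assumes word-level trigrams split on whitespace, the natural logarithm, and a compression rate defined as compressed size divided by original size using the `bz2` module; the slides do not fix any of these details, so they are assumptions of this illustration.

```python
import bz2
import math
from collections import Counter


def trigram_entropy(text):
    """Entropy H = -sum(p_i * log p_i) over the word trigrams of a page."""
    words = text.split()
    trigrams = [tuple(words[i:i + 3]) for i in range(len(words) - 2)]
    if not trigrams:
        return 0.0
    counts = Counter(trigrams)
    total = len(trigrams)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def compression_rate(text):
    """Compressed size over raw size; lower values mean more redundant text."""
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    return len(bz2.compress(raw)) / len(raw)


if __name__ == "__main__":
    repetitive = "buy cheap pills " * 200
    print(trigram_entropy(repetitive), compression_rate(repetitive))
```

Highly repetitive pages, such as keyword-stuffed ones, tend to show both a low trigram entropy and a low compression rate under these definitions.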
Content-based features (related to popular keywords)
F: set of most frequent terms in the collection
Q: set of most frequent terms in a query log
P: set of terms in a page
Features:
Corpus “precision”: |P ∩ F| / |P|
Corpus “recall”: |P ∩ F| / |F|
Query “precision”: |P ∩ Q| / |P|
Query “recall”: |P ∩ Q| / |Q|
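The four ratios translate directly into set operations. The following sketch treats P, F, and Q as plain Python sets, as in the definitions above; the function name and the convention of returning 0.0 for an empty denominator are assumptions of this example.

```python
def keyword_features(page_terms, frequent_terms, query_terms):
    """Compute corpus/query 'precision' and 'recall' for one page."""
    P, F, Q = set(page_terms), set(frequent_terms), set(query_terms)

    def ratio(numerator, denominator):
        # Assumption: an empty denominator yields 0.0 instead of an error.
        return len(numerator) / len(denominator) if denominator else 0.0

    return {
        "corpus_precision": ratio(P & F, P),    # |P ∩ F| / |P|
        "corpus_recall": ratio(P & F, F),       # |P ∩ F| / |F|
        "query_precision": ratio(P & Q, P),     # |P ∩ Q| / |P|
        "query_recall": ratio(P & Q, Q),        # |P ∩ Q| / |Q|
    }


if __name__ == "__main__":
    print(keyword_features(["a", "b", "c", "d"], ["a", "b", "x"], ["d"]))
```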
Average word length
Figure: Histogram of the average word length in non-spam vs. spam pages for k = 500.
Corpus precision
Figure: Histogram of the corpus precision in non-spam vs. spam pages.
Query precision
Figure: Histogram of the query precision in non-spam vs. spam pages for k = 500.
Cost-sensitive decision tree with bagging
Bagging of 10 decision trees, asymmetrical costs.
Cost ratio            1      10     20     30     50
True positive rate    65.8%  66.7%  71.1%  78.7%  84.1%
False positive rate   2.8%   3.4%   4.5%   5.7%   8.6%
F-Measure             0.712  0.703  0.704  0.723  0.692
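One way to see why a larger cost ratio raises both the true positive and the false positive rate is a minimum-expected-cost decision rule applied to the votes of a bagged ensemble. The sketch below is not the authors' implementation: it assumes that the cost ratio R is the cost of labeling a spam host as normal relative to the opposite error, that the fraction of ensemble votes approximates the spam probability, and the function names are made up for this illustration.

```python
def bagged_spam_probability(votes):
    """Fraction of ensemble members (1 = spam, 0 = normal) voting for spam."""
    return sum(votes) / len(votes)


def cost_sensitive_label(p_spam, cost_ratio):
    """Pick the label with the lower expected cost.

    Expected cost of answering "normal" is cost_ratio * p_spam (a missed spam
    host); expected cost of answering "spam" is 1 - p_spam (a false alarm).
    """
    return "spam" if cost_ratio * p_spam > (1.0 - p_spam) else "normal"


if __name__ == "__main__":
    votes = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]       # 10 trees, one votes spam
    p = bagged_spam_probability(votes)
    for ratio in (1, 20):
        print(ratio, cost_sensitive_label(p, ratio))
```

With R = 1 the rule is a plain majority vote; with larger R, a host is flagged even when only a few trees vote spam, which trades more false positives for fewer missed spam hosts.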
Link- and content-based features
Link-based and content-based
                      Both   Link-only  Content-only
True positive rate    78.7%  79.4%      64.9%
False positive rate   5.7%   9.0%       3.7%
F-Measure             0.723  0.659      0.683
Hypothesis
Pages topologically close to each other are more likely to have the same label (spam/nonspam) than random pairs of pages.
Pages linked together are more likely to be on the same topic than random pairs of pages [Davison, 2000]
Topological dependencies: in-links
Histogram of fraction of spam hosts in the in-links
0 = no in-link comes from spam hosts
1 = all of the in-links come from spam hosts
[Figure: histogram of the fraction of spam hosts in the in-links, for non-spam hosts and for spam hosts.]
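The quantity plotted in this histogram can be computed per host as follows. The sketch assumes a dictionary `in_links` mapping each host to the hosts that link to it and a dictionary `is_spam` with known labels; both structures, the function name, and the choice to return `None` for hosts without in-links are assumptions of this example.

```python
def spam_fraction_in_links(host, in_links, is_spam):
    """Fraction of the hosts linking to `host` that are labeled spam."""
    sources = in_links.get(host, [])
    if not sources:
        return None                     # undefined when there are no in-links
    return sum(1 for s in sources if is_spam[s]) / len(sources)


if __name__ == "__main__":
    in_links = {"shop.example": ["farm1.example", "farm2.example", "news.example"]}
    is_spam = {"farm1.example": True, "farm2.example": True, "news.example": False}
    print(spam_fraction_in_links("shop.example", in_links, is_spam))
```

The same function applied to a dictionary of out-links gives the fraction of spam hosts among the hosts a given host points to.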
Topological dependencies: out-links
Histogram of the fraction of spam hosts in the out-links
0 = none of the out-links points to spam hosts
1 = all of the out-links point to spam hosts
[Figure: histogram over the fraction of spam hosts among the out-links (x-axis, 0.0 to 1.0), with one series for out-links of non-spam hosts and one for out-links of spam hosts]
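The quantity plotted in the two histograms can be computed per host from a labeled host graph. The following is a minimal sketch, assuming the graph is given as a list of (source, destination) host pairs and the labels as a dictionary; the function name, the toy hosts, and the choice to ignore unlabeled neighbors are illustrative assumptions, not taken from the slides.

```python
from collections import defaultdict

def spam_fraction_by_host(edges, labels, direction="in"):
    """Return {host: fraction of its labeled in- or out-neighbors that are spam}."""
    neighbors = defaultdict(set)
    for src, dst in edges:
        if direction == "in":
            neighbors[dst].add(src)   # src links to dst, so src is an in-neighbor of dst
        else:
            neighbors[src].add(dst)   # dst is an out-neighbor of src
    fractions = {}
    for host, others in neighbors.items():
        labeled = [n for n in others if n in labels]  # assumption: skip unlabeled hosts
        if labeled:
            spam = sum(1 for n in labeled if labels[n] == "spam")
            fractions[host] = spam / len(labeled)
    return fractions

# Toy host graph (hypothetical data)
edges = [("a", "b"), ("c", "b"), ("s1", "s2"), ("s3", "s2"), ("a", "s2"), ("s1", "s3")]
labels = {"a": "nonspam", "b": "nonspam", "c": "nonspam",
          "s1": "spam", "s2": "spam", "s3": "spam"}

in_frac = spam_fraction_by_host(edges, labels, "in")
out_frac = spam_fraction_by_host(edges, labels, "out")
```

Binning `in_frac` and `out_frac` separately for spam and non-spam hosts would give histograms of the kind described above.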
Idea 1: Clustering
Classify, then cluster hosts, then assign the same label to all hosts in the same cluster by majority voting
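The relabeling step can be sketched as follows, assuming the initial predictions and the cluster assignment are already available as dictionaries. The function name, the toy data, and the tie rule (keep the initial predictions when the vote is tied) are assumptions made for illustration; the slides do not say how the clusters are built or how ties are handled.

```python
from collections import Counter, defaultdict

def relabel_by_cluster_majority(predictions, cluster_of):
    """Give every host in a cluster the label predicted for most of its members."""
    members = defaultdict(list)
    for host, cluster in cluster_of.items():
        members[cluster].append(host)

    final = dict(predictions)  # hosts outside any cluster keep their initial label
    for hosts in members.values():
        votes = Counter(predictions[h] for h in hosts).most_common()
        top_label, top_count = votes[0]
        if len(votes) > 1 and votes[1][1] == top_count:
            continue  # assumption: on a tie, leave the initial predictions unchanged
        for h in hosts:
            final[h] = top_label
    return final

# Toy example (hypothetical hosts and clusters)
predictions = {"h1": "spam", "h2": "spam", "h3": "nonspam",
               "h4": "nonspam", "h5": "nonspam", "h6": "spam"}
cluster_of = {"h1": 0, "h2": 0, "h3": 0, "h4": 1, "h5": 1, "h6": 1}

final = relabel_by_cluster_majority(predictions, cluster_of)
```

In this toy case `h3` is flipped to spam and `h6` to nonspam, because each disagrees with the majority of its cluster.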
Idea 1: Clustering (cont.)
[Figures: initial prediction; clustering; final prediction]
Idea 1: Clustering – Results
Measure               Baseline  Clustering
Without bagging
True positive rate    75.6%     74.5%
False positive rate   8.5%      6.8%
F-Measure             0.646     0.673
With bagging
True positive rate    78.7%     76.9%
False positive rate   5.7%      5.0%
F-Measure             0.723     0.728
✓ Reduces error rate
Idea 2: Propagate the label
Classify, then interpret “spamicity” as a probability, then do a random walk with restart from those nodes
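A random walk with restart can be approximated by power iteration, with the restart distribution proportional to the predicted spamicity. The sketch below is one way to do this under stated assumptions: the damping factor, the iteration count, the handling of hosts without out-links (their mass is sent back to the restart distribution), and the final threshold are all illustrative choices that the slides do not specify.

```python
def random_walk_with_restart(edges, spamicity, damping=0.85, iterations=50):
    """Propagate spamicity along edges; returns scores that sum to 1."""
    nodes = sorted(set(spamicity) | {n for edge in edges for n in edge})
    total = sum(spamicity.get(n, 0.0) for n in nodes)
    if total <= 0:
        raise ValueError("at least one host needs a positive spamicity")
    # Restart distribution: proportional to the classifier's spamicity
    restart = {n: spamicity.get(n, 0.0) / total for n in nodes}

    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    score = dict(restart)
    for _ in range(iterations):
        nxt = {n: 0.0 for n in nodes}
        dangling = 0.0
        for n in nodes:
            if out_links[n]:
                share = damping * score[n] / len(out_links[n])
                for dst in out_links[n]:
                    nxt[dst] += share
            else:
                dangling += score[n]  # assumption: dangling mass returns via restart
        for n in nodes:
            nxt[n] += (1.0 - damping + damping * dangling) * restart[n]
        score = nxt
    return score

# Toy graph (hypothetical): two spam hosts linking to each other and to x
edges = [("s1", "s2"), ("s2", "s1"), ("s1", "x"), ("g1", "g2"), ("g2", "g1")]
spamicity = {"s1": 0.9, "s2": 0.8, "x": 0.1, "g1": 0.05, "g2": 0.05}

scores = random_walk_with_restart(edges, spamicity)
threshold = 1.0 / len(scores)  # hypothetical cut-off: above the uniform share
flagged = {n for n, s in scores.items() if s > threshold}
```

Passing the edges as given propagates forwards; reversing each pair propagates backwards, and using both sets of pairs corresponds to propagating in both directions.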
Idea 2: Propagate the label (cont.)
[Figures: initial prediction; propagation; final prediction, applying a threshold]
Idea 2: Propagate the label – Results
Measure               Baseline  Forwards  Backwards  Both
Classifier without bagging
True positive rate    75.6%     70.9%     69.4%      71.4%
False positive rate   8.5%      6.1%      5.8%       5.8%
F-Measure             0.646     0.665     0.664      0.676
Classifier with bagging
True positive rate    78.7%     76.5%     75.0%      75.2%
False positive rate   5.7%      5.4%      4.3%       4.7%
F-Measure             0.723     0.716     0.733      0.724
Idea 3: Stacked graphical learning
Meta-learning scheme [Cohen and Kou, 2006]
Derive initial predictions
Generate an additional attribute for each object by combining predictions on neighbors in the graph
Append the additional attribute to the data and retrain
Idea 3: Stacked graphical learning (cont.)
Let $p(x) \in [0, 1]$ be the prediction of a classification algorithm for a host $x$ using $k$ features
Let $N(x)$ be the set of pages related to $x$ (in some way)
Compute
$$f(x) = \frac{\sum_{g \in N(x)} p(g)}{|N(x)|}$$
Add $f(x)$ as an extra feature for instance $x$ and learn a new model with $k + 1$ features
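The extra feature f(x) is the mean of p over the neighbors of x. A minimal sketch of computing it and appending it to each feature vector is shown below; the function names, the toy data, and the fallback value for hosts with no neighbors (where the formula is undefined) are assumptions for illustration. Retraining on the extended vectors is left to whatever classifier produced p.

```python
def stacked_feature(p, neighbors, default=0.0):
    """f(x) = sum of p(g) over g in N(x), divided by |N(x)|."""
    f = {}
    for x in p:
        related = [g for g in neighbors.get(x, ()) if g in p]
        if related:
            f[x] = sum(p[g] for g in related) / len(related)
        else:
            f[x] = default  # assumption: value used when N(x) is empty
    return f

def append_feature(features, f):
    """Turn each k-dimensional vector into a (k + 1)-dimensional one."""
    return {x: list(vec) + [f[x]] for x, vec in features.items()}

# Toy data (hypothetical): initial predictions, related hosts, k = 2 features
p = {"a": 0.9, "b": 0.75, "c": 0.25}
neighbors = {"a": ["b", "c"], "b": ["a"], "c": []}
features = {"a": [1.0, 2.0], "b": [0.5, 1.5], "c": [0.0, 0.1]}

f = stacked_feature(p, neighbors)
extended = append_feature(features, f)
```

Here N(x) is passed in explicitly, so the same function covers in-links, out-links, or both, depending on how `neighbors` is built.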
Idea 3: Stacked graphical learning (cont.)
[Figures: initial prediction; computation of the new feature; new prediction with k + 1 features]
Idea 3: Stacked graphical learning - Results
Measure               Baseline  Avg. of in  Avg. of out  Avg. of both
True positive rate    78.7%     84.4%       78.3%        85.2%
False positive rate   5.7%      6.7%        4.8%         6.1%
F-Measure             0.723     0.733       0.742        0.750
✓ Increases detection rate
Idea 3: Stacked graphical learning x2
And repeat ...
Measure               Baseline  First pass  Second pass
True positive rate    78.7%     85.2%       88.4%
False positive rate   5.7%      6.1%        6.3%
F-Measure             0.723     0.750       0.763
✓ Significant improvement over the baseline
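Repeating the procedure amounts to a loop: predict, average the predictions over neighbors, extend the features, and predict again. The sketch below assumes the classifier is supplied by the caller as a function `fit_predict` (a hypothetical interface that trains on the given vectors and returns a score in [0, 1] per host). Appending a new column on every pass, rather than overwriting the previous one, is also an assumption, and the stand-in classifier at the end exists only to make the example run.

```python
def stacked_passes(features, neighbors, fit_predict, passes=2, default=0.0):
    """Run the initial prediction, then `passes` rounds of stacked learning."""
    current = {x: list(vec) for x, vec in features.items()}
    p = fit_predict(current)  # baseline prediction with k features
    for _ in range(passes):
        extended = {}
        for x, vec in current.items():
            related = [g for g in neighbors.get(x, ()) if g in p]
            # Mean prediction over N(x); `default` is an assumed fallback
            f_x = sum(p[g] for g in related) / len(related) if related else default
            extended[x] = vec + [f_x]
        current = extended
        p = fit_predict(current)  # retrain and predict with one more feature
    return p, current

def toy_fit_predict(vectors):
    """Stand-in for a real classifier: the mean of each feature vector."""
    return {x: sum(vec) / len(vec) for x, vec in vectors.items()}

features = {"a": [1.0], "b": [0.0]}
neighbors = {"a": ["b"], "b": ["a"]}

p_final, final_features = stacked_passes(features, neighbors, toy_fit_predict, passes=2)
```

With `passes=1` this corresponds to the first-pass column of the table above, and with `passes=2` to the second-pass column; the numbers in the table come from the authors' classifier, not from this toy stand-in.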
Concluding remarks
✓ Considering content-based and link-based attributes improves the accuracy of the classifier
✓ Considering the links among pages improves the accuracy
Web Spam Dataset: http://www.yr-bcn.es/webspam/
Web Spam Challenge I & II: http://webspam.lip6.fr/
AIRWeb Workshop: http://airweb.cse.lehigh.edu/
GraphLab at ECML/PKDD: http://graphlab.lip6.fr/
Newsletter: [email protected]
Thank you!
Becchetti, L., Castillo, C., Donato, D., Leonardi, S., and Baeza-Yates, R. (2006). Using rank propagation and probabilistic counting for link-based spam detection. In Proceedings of the Workshop on Web Mining and Web Usage Analysis (WebKDD), Pennsylvania, USA. ACM Press.
Cohen, W. W. and Kou, Z. (2006). Stacked graphical learning: approximating learning in Markov random fields using very short inhomogeneous Markov chains. Technical report.
Davison, B. D. (2000). Topical locality in the web. In Proceedings of the 23rd annual international ACM SIGIR conference on research and development in information retrieval, pages 272–279, Athens, Greece. ACM Press.
Fetterly, D., Manasse, M., and Najork, M. (2004). Spam, damn spam, and statistics: Using statistical analysis to locate spam web pages. In Proceedings of the seventh workshop on the Web and databases (WebDB), pages 1–6, Paris, France.
Flajolet, P. and Martin, N. G. (1985). Probabilistic counting algorithms for data base applications. Journal of Computer and System Sciences, 31(2):182–209.
Gibson, D., Kumar, R., and Tomkins, A. (2005). Discovering large dense subgraphs in massive graphs. In VLDB ’05: Proceedings of the 31st international conference on Very large data bases, pages 721–732. VLDB Endowment.
Gyongyi, Z., Garcia-Molina, H., and Pedersen, J. (2004). Combating Web spam with TrustRank. In Proceedings of the 30th International Conference on Very Large Data Bases (VLDB), pages 576–587, Toronto, Canada. Morgan Kaufmann.
Ntoulas, A., Najork, M., Manasse, M., and Fetterly, D. (2006). Detecting spam web pages through content analysis. In Proceedings of the World Wide Web conference, pages 83–92, Edinburgh, Scotland.
Palmer, C. R., Gibbons, P. B., and Faloutsos, C. (2002). ANF: a fast and scalable tool for data mining in massive graphs. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 81–90, New York, NY, USA. ACM Press.