Using Topology to Identify Spam (SIGIR 2007)

Know your Neighbors: Web Spam Detection Using the Web Topology. Carlos Castillo (1), Debora Donato (1), Aristides Gionis (1), Vanessa Murdock (1), Fabrizio Silvestri (2). (1) Yahoo! Research Barcelona, Catalunya, Spain; (2) ISTI-CNR, Pisa, Italy. ACM SIGIR, 25 July 2007, Amsterdam.

TRANSCRIPT

Page 1: Using Topology to Identify Spam (SIGIR 2007)

Web Spam Detection

C. Castillo, D. Donato, A. Gionis, V. Murdock, F. Silvestri

Spam on the Web

Detecting Web Spam

Link-Based Detection

Content-Based Detection

Using Links and Contents

Using the Web Topology

Conclusions

Know your Neighbors: Web Spam Detection Using the Web Topology

Carlos Castillo (1), Debora Donato (1), Aristides Gionis (1), Vanessa Murdock (1), Fabrizio Silvestri (2)

1. Yahoo! Research Barcelona, Catalunya, Spain
2. ISTI-CNR, Pisa, Italy

ACM SIGIR, 25 July 2007, Amsterdam

Page 2: Using Topology to Identify Spam (SIGIR 2007)

1 Spam on the Web

2 Detecting Web Spam

3 Link-Based Detection

4 Content-Based Detection

5 Using Links and Contents

6 Using the Web Topology

7 Conclusions

Page 4: Using Topology to Identify Spam (SIGIR 2007)

What is on the Web?

Page 5: Using Topology to Identify Spam (SIGIR 2007)

What is on the Web [2.0]?

Page 6: Using Topology to Identify Spam (SIGIR 2007)

What else is on the Web?

Source: www.milliondollarhomepage.com

Page 7: Using Topology to Identify Spam (SIGIR 2007)

What’s happening on the Web?

There is fierce competition for your attention

Page 8: Using Topology to Identify Spam (SIGIR 2007)

What’s happening on the Web?

Search engines are, to some extent, arbiters of this competition, and they must watch it closely, otherwise ...

Page 9: Using Topology to Identify Spam (SIGIR 2007)

Some cheating occurs

1986 FIFA World Cup, Argentina vs England

Page 10: Using Topology to Identify Spam (SIGIR 2007)

Simple web spam

Page 11: Using Topology to Identify Spam (SIGIR 2007)

Hidden text

Page 12: Using Topology to Identify Spam (SIGIR 2007)

Made for advertising

Page 13: Using Topology to Identify Spam (SIGIR 2007)

Search engine?

Page 14: Using Topology to Identify Spam (SIGIR 2007)

Fake search engine

Page 15: Using Topology to Identify Spam (SIGIR 2007)

“Normal” content in link farms

Page 16: Using Topology to Identify Spam (SIGIR 2007)

There are many attempts at cheating on the Web

Most of these are spam:

1,630,000 results for “free mp3 hilton viagra” in SE1

1,760,000 results for “credit vicodin loan” in SE2

1,320,000 results for “porn mortgage” in SE3

Page 17: Using Topology to Identify Spam (SIGIR 2007)

Costs

✗ Costs for users: lower precision for some queries

✗ Costs for search engines: wasted storage space, network resources, and processing cycles

✗ Costs for the publishers: resources invested in cheating and not in improving their contents

Page 18: Using Topology to Identify Spam (SIGIR 2007)

Cheating on the Web

Link spam

Content spam

Spam-oriented blogging

Comment/forum/Wiki spam

Malicious cloaking

Click fraud ×2

Malicious tagging

. . . more?

Adversarial relationship

Every undeserved gain in ranking for a spammer is a loss of precision for the search engine.

Page 29: Using Topology to Identify Spam (SIGIR 2007)

Research on Web spam detection

Web spam detection techniques

Page 30: Using Topology to Identify Spam (SIGIR 2007)

Spam, damn spam and statistics

[Fetterly et al., 2004] propose to study statistical distributions: “in a number of these distributions, outlier values are associated with web spam”

Page 31: Using Topology to Identify Spam (SIGIR 2007)

Machine learning training

Page 32: Using Topology to Identify Spam (SIGIR 2007)

Machine learning

Page 33: Using Topology to Identify Spam (SIGIR 2007)

Challenges

Scalability

+ Machine Learning Challenges:

Instances are not really independent (graph)

Training set is relatively small

+ Information Retrieval Challenges:

It is hard to find out which features are relevant

Features can be aggregated in content units: page/host/domain

Features can be propagated through the graph

Page 42: Using Topology to Identify Spam (SIGIR 2007)

Training data

✗ It is hard for search engines to provide labeled data

✗ Even if they do, it will not reflect a consensus on what is Web Spam

✓ Public Web Spam collection built by a group of volunteers: http://www.yr-bcn.es/webspam/

Page 46: Using Topology to Identify Spam (SIGIR 2007)

“Link farms”

[Figure: diagram with the labels Web, Link farm, and Spam page]

Single-level farms can be detected by searching for groups of nodes sharing their out-links [Gibson et al., 2005]

Page 47: Using Topology to Identify Spam (SIGIR 2007)

Handling large graphs

Memory is large enough to hold some data per node

Disk is large enough to hold some data per edge

Only a small number of passes over the data

Page 48: Using Topology to Identify Spam (SIGIR 2007)

Semi-streaming model

    for node : 1 . . . N do
        INITIALIZE-MEM(node)
    end for
    for distance : 1 . . . d do {Iteration step}
        for src : 1 . . . N do {Follow links in the graph}
            for all links from src to dest do
                COMPUTE(src, dest)
            end for
        end for
        NORMALIZE
    end for
    POST-PROCESS
    return Something
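
The loop structure above can be illustrated with a small Python sketch that runs plain PageRank in this pattern. This is only an illustration under stated assumptions, not the authors' code: `pagerank_semi_streaming` and `edge_stream` are hypothetical names, the edge list stands in for a sequential scan of a file on disk, the handling of dangling nodes is one common choice among several, and an extra initial pass is used to count out-degrees.

```python
def pagerank_semi_streaming(num_nodes, edge_stream, d=20, alpha=0.85):
    """PageRank written in the semi-streaming pattern.

    edge_stream is a hypothetical zero-argument callable that returns a fresh
    iterator over (src, dest) pairs, standing in for a sequential disk scan.
    """
    # Per-node data kept in memory: out-degrees and the current scores.
    out_degree = [0] * num_nodes
    for src, _dest in edge_stream():
        out_degree[src] += 1

    rank = [1.0 / num_nodes] * num_nodes              # INITIALIZE-MEM

    for _ in range(d):                                # iteration step
        received = [0.0] * num_nodes
        for src, dest in edge_stream():               # one pass over the edges
            received[dest] += rank[src] / out_degree[src]   # COMPUTE(src, dest)

        # NORMALIZE: spread the mass of nodes without out-links uniformly
        # (an assumption of this sketch) and apply the damping factor.
        dangling = sum(r for r, deg in zip(rank, out_degree) if deg == 0)
        rank = [(1.0 - alpha) / num_nodes + alpha * (r + dangling / num_nodes)
                for r in received]

    return rank                                       # POST-PROCESS omitted


# Toy example: a directed 3-cycle, where every node ends up with the same score.
edges = [(0, 1), (1, 2), (2, 0)]
scores = pagerank_semi_streaming(3, lambda: iter(edges))
```

Only the `out_degree`, `rank`, and `received` arrays are held in memory, and each iteration reads the edges once, in order, which is the property the semi-streaming model relies on.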

Page 51: Using Topology to Identify Spam (SIGIR 2007)

Link-based features

Degree-related measures

PageRank

TrustRank [Gyongyi et al., 2004]

Truncated PageRank [Becchetti et al., 2006]

Estimation of supporters [Becchetti et al., 2006]

140 features per host (2 pages per host)

Page 52: Using Topology to Identify Spam (SIGIR 2007)

Degree-Based

[Figure: two histograms of degree-related measures, each comparing Normal and Spam hosts]

Page 53: Using Topology to Identify Spam (SIGIR 2007)

TrustRank

[Gyongyi et al., 2004]

Page 54: Using Topology to Identify Spam (SIGIR 2007)

TrustRank / PageRank

[Figure: histogram of the TrustRank / PageRank ratio for Normal vs. Spam hosts]

Page 55: Using Topology to Identify Spam (SIGIR 2007)

Truncated PageRank

Proposed in [Becchetti et al., 2006]. Idea: reduce the direct contribution of the first levels of links:

$$\text{damping}(t) = \begin{cases} 0 & t \le T \\ C\alpha^{t} & t > T \end{cases}$$

✓ No extra reading of the graph after PageRank
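
A minimal sketch of how such a damping function could be applied, assuming the score of a page is the sum over path lengths t of damping(t) times the walk mass arriving after t steps. The function name `truncated_pagerank`, the adjacency-list input, the default parameters, and the choice to let pages without out-links simply drop their mass are all assumptions of this sketch, not details taken from [Becchetti et al., 2006].

```python
def truncated_pagerank(out_links, alpha=0.85, T=2, max_dist=30, C=1.0):
    """Sum the contributions of paths longer than T links.

    out_links[i] lists the nodes that node i links to. The contribution of
    paths of length t is weighted by C * alpha**t when t > T and by 0 otherwise.
    """
    n = len(out_links)
    contrib = [C / n] * n          # mass at distance t = 0 (never counted, since 0 <= T)
    score = [0.0] * n

    for t in range(1, max_dist + 1):
        nxt = [0.0] * n
        for src, dests in enumerate(out_links):
            if dests:              # nodes without out-links drop their mass here
                share = alpha * contrib[src] / len(dests)
                for dest in dests:
                    nxt[dest] += share
        contrib = nxt              # now holds the C * alpha**t weighted mass
        if t > T:                  # damping(t) = 0 for t <= T
            for i in range(n):
                score[i] += contrib[i]
    return score


# Chain 0 -> 1 -> 2: with T = 1 only the two-link path 0 -> 1 -> 2 counts.
chain = [[1], [2], []]
print(truncated_pagerank(chain, alpha=0.5, T=1))
```

The per-distance contributions are the same quantities a PageRank iteration already produces, so the truncated score can be accumulated in the same passes over the graph.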

Page 57: Using Topology to Identify Spam (SIGIR 2007)

Hop-plot

Page 58: Using Topology to Identify Spam (SIGIR 2007)

High and low-ranked pages are different

[Figure: number of nodes (×10^4, from 0 to 12) versus distance (1 to 20), with one curve each for pages ranked in the top 0%-10%, top 40%-50%, and top 60%-70%]

Areas below the curves are equal if we are in the same strongly-connected component

Page 60: Using Topology to Identify Spam (SIGIR 2007)

Probabilistic counting

[Figure: bit vectors such as 100010 are propagated along links toward the target page using the “OR” operation; the bits set at the target page are counted to estimate its supporters]

[Becchetti et al., 2006] shows an improvement of the ANF algorithm [Palmer et al., 2002] based on probabilistic counting [Flajolet and Martin, 1985]
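
The propagation idea can be sketched in Python as follows. This is a simplified illustration, not the algorithm of [Becchetti et al., 2006]: it swaps in the classic Flajolet-Martin estimator (position of the lowest unset bit) instead of counting the bits set, it uses a single bit vector per node, and the names `init_sketches`, `propagate`, and `estimate` are hypothetical.

```python
import random


def init_sketches(n, bits=32, seed=0):
    """Give each node a bit vector with one bit set; bit i is chosen with probability 2**-(i+1)."""
    rng = random.Random(seed)
    sketches = []
    for _ in range(n):
        i = 0
        while i < bits - 1 and rng.random() < 0.5:
            i += 1
        sketches.append(1 << i)
    return sketches


def propagate(in_links, sketches, d):
    """After d rounds, each node holds the OR of the vectors of all nodes
    that can reach it in at most d links (itself included)."""
    cur = list(sketches)
    for _ in range(d):
        nxt = list(cur)
        for node, sources in enumerate(in_links):
            for src in sources:
                nxt[node] |= cur[src]
        cur = nxt
    return cur


def estimate(sketch):
    """Flajolet-Martin estimate from the position R of the lowest unset bit."""
    r = 0
    while sketch & (1 << r):
        r += 1
    return (2 ** r) / 0.77351


# Chain 0 -> 1 -> 2, given as in-link lists: node 2 is reached by 1 and, at distance 2, by 0.
in_links = [[], [0], [1]]
initial = init_sketches(3, seed=1)
after_two_hops = propagate(in_links, initial, 2)
```

A single bit vector gives a very noisy estimate; in practice several independent vectors per node would be combined, which this sketch leaves out.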

Page 62: Using Topology to Identify Spam (SIGIR 2007)

Bottleneck number

$$b_d(x) = \min_{j \le d} \left\{ \frac{|N_j(x)|}{|N_{j-1}(x)|} \right\}$$

Minimum rate of growth of the neighbors of x up to a certain distance.
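
As an illustration, the bottleneck number can be computed exactly on a small graph with a breadth-first search over in-links. The sketch assumes that N_j(x) is the set of nodes that can reach x in at most j links, with N_0(x) = {x}; this reading, the function names, and the dictionary input format are assumptions, and an exact search like this would not scale to a Web-sized graph.

```python
from collections import deque


def supporter_counts(in_links, x, d):
    """Return [|N_0(x)|, ..., |N_d(x)|], where N_j(x) holds the nodes that
    reach x in at most j links (x itself included)."""
    dist = {x: 0}
    queue = deque([x])
    while queue:
        node = queue.popleft()
        if dist[node] == d:
            continue                       # do not explore beyond distance d
        for src in in_links.get(node, []):
            if src not in dist:
                dist[src] = dist[node] + 1
                queue.append(src)
    counts = [0] * (d + 1)
    for distance in dist.values():
        counts[distance] += 1
    for j in range(1, d + 1):              # make the counts cumulative
        counts[j] += counts[j - 1]
    return counts


def bottleneck_number(in_links, x, d):
    """Minimum growth ratio |N_j(x)| / |N_{j-1}(x)| over 1 <= j <= d."""
    sizes = supporter_counts(in_links, x, d)
    return min(sizes[j] / sizes[j - 1] for j in range(1, d + 1))


# Node 0 is linked from 1 and 2; node 1 from 3; node 2 from 3 and 4.
graph = {0: [1, 2], 1: [3], 2: [3, 4], 3: [], 4: []}
print(bottleneck_number(graph, 0, 2))
```

Here the neighborhood sizes are 1, 3, and 5, so the growth ratios are 3 and 5/3 and the bottleneck number at distance 2 is 5/3.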

Page 63: Using Topology to Identify Spam (SIGIR 2007)

Bottleneck number: spam

Page 64: Using Topology to Identify Spam (SIGIR 2007)

Bottleneck number: normal

Page 65: Using Topology to Identify Spam (SIGIR 2007)

Bottleneck number

$$b_d(x) = \min_{j \le d} \left\{ \frac{|N_j(x)|}{|N_{j-1}(x)|} \right\}$$

[Figure: histogram of the bottleneck number for Normal vs. Spam hosts; horizontal axis labeled from 4.52 down to 1.11]

Page 67: Using Topology to Identify Spam (SIGIR 2007)

Content-Based Features

Most of the features reported in [Ntoulas et al., 2006]

Number of words in the page and title

Average word length

Fraction of anchor text

Fraction of visible text

Compression rate

Corpus precision and corpus recall

Query precision and query recall

Independent trigram likelihood

Entropy of trigrams

96 features per host

Page 68: Using Topology to Identify Spam (SIGIR 2007)

Content-based features (entropy related)

T = {(w_1, p_1), ..., (w_k, p_k)}: the set of trigrams in a page, where trigram w_i has frequency p_i

Features:

Entropy of trigrams: $H = -\sum_{w_i \in T} p_i \log p_i$

Also, compression rate, as measured by bzip
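A minimal sketch of both features, under stated assumptions: trigrams are taken over whitespace-separated words, p_i is the relative frequency of trigram w_i in the page, the logarithm is natural (the slide does not fix a base), and the compression rate is taken as uncompressed size divided by compressed size. The function names are illustrative; `bz2` is Python's standard-library interface to bzip2.

```python
import bz2
import math
from collections import Counter


def trigram_entropy(words):
    """Entropy H = -sum(p_i * log p_i) over the word trigrams of a page."""
    trigrams = [tuple(words[i:i + 3]) for i in range(len(words) - 2)]
    if not trigrams:
        return 0.0
    total = len(trigrams)
    counts = Counter(trigrams)
    # p_i is the relative frequency of trigram w_i (an assumption of this sketch).
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def compression_ratio(text):
    """Uncompressed size divided by bzip2-compressed size (assumed definition)."""
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    return len(raw) / len(bz2.compress(raw))
```

Highly repetitive text yields low trigram entropy and a high compression ratio, which is why both are plausible spam signals.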

Page 69: Using Topology to Identify Spam (SIGIR 2007)

Content-based features (related to popular keywords)

F: set of most frequent terms in the collection

Q: set of most frequent terms in a query log

P: set of terms in a page

Features:

Corpus “precision”: |P ∩ F| / |P|

Corpus “recall”: |P ∩ F| / |F|

Query “precision”: |P ∩ Q| / |P|

Query “recall”: |P ∩ Q| / |Q|
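The four ratios share one shape, so a single helper covers them. The sketch below assumes terms are plain strings and returns 0.0 when a set is empty, which is a choice of this example rather than something the slide specifies; the name `overlap_features` is illustrative.

```python
def overlap_features(page_terms, frequent_terms):
    """Return ("precision", "recall") of a page against a set of frequent terms.

    precision = |P ∩ F| / |P|, recall = |P ∩ F| / |F|.
    Pass the corpus term set for the corpus features, or the
    query-log term set for the query features.
    """
    p = set(page_terms)
    f = set(frequent_terms)
    common = len(p & f)
    precision = common / len(p) if p else 0.0  # empty page: assumed 0.0
    recall = common / len(f) if f else 0.0     # empty term set: assumed 0.0
    return precision, recall
```

For example, a page with terms {cheap, pills, online, the} against frequent terms {the, of, online} has precision 2/4 and recall 2/3.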

Page 70: Using Topology to Identify Spam (SIGIR 2007)

Average word length

Figure: Histogram of the average word length in non-spam vs. spam pages for k = 500.

Page 71: Using Topology to Identify Spam (SIGIR 2007)

Corpus precision

Figure: Histogram of the corpus precision in non-spam vs. spam pages.

Page 72: Using Topology to Identify Spam (SIGIR 2007)

Query precision

Figure: Histogram of the query precision in non-spam vs. spam pages for k = 500.

Page 74: Using Topology to Identify Spam (SIGIR 2007)

Cost-sensitive decision tree with bagging

Bagging of 10 decision trees, asymmetrical costs.

Cost ratio            1       10      20      30      50
True positive rate    65.8%   66.7%   71.1%   78.7%   84.1%
False positive rate   2.8%    3.4%    4.5%    5.7%    8.6%
F-Measure             0.712   0.703   0.704   0.723   0.692
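To see why a higher cost ratio raises both the true positive and the false positive rate, here is a small sketch of one common way to combine bagging with asymmetric costs. It is not necessarily how the learner used for the table applies costs: it assumes each of the bagged trees casts a 0/1 spam vote, the votes are averaged into a probability, and a missed spam host costs `cost_ratio` times as much as a false alarm. Both function names are hypothetical.

```python
def bagged_spam_probability(tree_votes):
    """Average the 0/1 spam votes of the bagged trees into a probability."""
    return sum(tree_votes) / len(tree_votes)


def cost_sensitive_label(p_spam, cost_ratio):
    """Pick the label with the lower expected misclassification cost.

    Predicting "normal" costs p_spam * cost_ratio in expectation (missed spam);
    predicting "spam" costs (1 - p_spam) * 1 (false alarm). This is equivalent
    to thresholding p_spam at 1 / (1 + cost_ratio).
    """
    if p_spam * cost_ratio > (1.0 - p_spam):
        return "spam"
    return "normal"
```

With a cost ratio of 1 a host needs a majority of spam votes to be flagged; with a ratio of 30 a single vote out of ten is enough, so more spam is caught at the price of more false positives.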

Page 75: Using Topology to Identify Spam (SIGIR 2007)

Link- and content-based features

Link-based and content-based

                      Both    Link-only   Content-only
True positive rate    78.7%   79.4%       64.9%
False positive rate   5.7%    9.0%        3.7%
F-Measure             0.723   0.659       0.683

Page 77: Using Topology to Identify Spam (SIGIR 2007)

Hypothesis

Pages topologically close to each other are more likely to have the same label (spam/nonspam) than random pairs of pages.

Pages linked together are more likely to be on the same topic than random pairs of pages [Davison, 2000]

Page 80: Using Topology to Identify Spam (SIGIR 2007)

Topological dependencies: in-links

Histogram of fraction of spam hosts in the in-links

0 = no in-link comes from spam hosts

1 = all of the in-links come from spam hosts

Figure: x-axis, fraction of spam hosts in the in-links (0.0 to 1.0); series: in-links of non-spam, in-links of spam.

Page 81: Using Topology to Identify Spam (SIGIR 2007)

Topological dependencies: out-links

Histogram of fraction of spam hosts in the out-links

0 = none of the out-links points to spam hosts

1 = all of the out-links point to spam hosts

Figure: x-axis, fraction of spam hosts in the out-links (0.0 to 1.0); series: out-links of non-spam, out-links of spam.
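The quantity on the x-axis of these histograms can be computed per host as sketched below. The sketch assumes the neighbor list (in-links or out-links) and a dictionary of known labels are available, ignores unlabeled neighbors, and returns None when no labeled neighbor exists; these choices and the name `spam_fraction` are assumptions of the example.

```python
def spam_fraction(neighbors, is_spam):
    """Fraction of labeled neighbor hosts that are spam.

    neighbors: hosts linking to the host (in-links) or linked from it (out-links).
    is_spam:   dict mapping a labeled host to True (spam) or False (non-spam).
    Returns a value in [0, 1], or None if no neighbor has a label.
    """
    labeled = [h for h in neighbors if h in is_spam]
    if not labeled:
        return None
    return sum(1 for h in labeled if is_spam[h]) / len(labeled)
```

A result of 0 means none of the labeled neighbors are spam and 1 means all of them are, matching the endpoints described above.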

Page 82: Using Topology to Identify Spam (SIGIR 2007)

Idea 1: Clustering

Classify, then cluster hosts, then assign the same label to all hosts in the same cluster by majority voting
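The relabeling step can be sketched as follows. The clustering algorithm is not named on this slide, so the sketch takes the cluster assignment as given input; the function name and the tie-breaking behavior (the label seen first in the cluster wins) are assumptions of this example.

```python
from collections import Counter, defaultdict


def smooth_by_cluster(predictions, cluster_of):
    """Give every host the majority predicted label of its cluster.

    predictions: dict host -> "spam" or "normal" (initial classifier output).
    cluster_of:  dict host -> cluster id (from any graph clustering method).
    """
    votes = defaultdict(Counter)
    for host, label in predictions.items():
        votes[cluster_of[host]][label] += 1
    # most_common(1) returns the label with the most votes in the cluster.
    return {
        host: votes[cluster_of[host]].most_common(1)[0][0]
        for host in predictions
    }
```

A host predicted normal inside a cluster that is mostly predicted spam is flipped to spam, and vice versa.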

Page 83: Using Topology to Identify Spam (SIGIR 2007)

Idea 1: Clustering (cont.)

Initial prediction:

Page 84: Using Topology to Identify Spam (SIGIR 2007)

Idea 1: Clustering (cont.)

Clustering:

Page 85: Using Topology to Identify Spam (SIGIR 2007)

Idea 1: Clustering (cont.)

Final prediction:

Page 86: Using Topology to Identify Spam (SIGIR 2007)

Idea 1: Clustering – Results

                      Baseline   Clustering
Without bagging
True positive rate    75.6%      74.5%
False positive rate   8.5%       6.8%
F-Measure             0.646      0.673
With bagging
True positive rate    78.7%      76.9%
False positive rate   5.7%       5.0%
F-Measure             0.723      0.728

Reduces error rate

Page 87: Using Topology to Identify Spam (SIGIR 2007)

Idea 2: Propagate the label

Classify, then interpret “spamicity” as a probability, then do a random walk with restart from those nodes
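A random walk with restart can be approximated by power iteration, as in the sketch below. It assumes the restart distribution is the classifier's spamicity normalized to sum to 1, a damping factor of 0.85 and a fixed number of iterations (all illustrative choices, not values from the talk), and it sends the mass of hosts without out-links back to the restart distribution. Passing the reversed graph gives backward propagation.

```python
def random_walk_with_restart(out_links, spamicity, damping=0.85, iterations=50):
    """Propagate spamicity over the host graph by power iteration.

    out_links: dict host -> list of hosts it links to (all must be keys of spamicity).
    spamicity: dict host -> classifier score in [0, 1]; must not sum to zero.
    Returns dict host -> stationary score (scores sum to 1).
    """
    hosts = list(spamicity)
    total = sum(spamicity.values())
    restart = {h: spamicity[h] / total for h in hosts}
    score = dict(restart)
    for _ in range(iterations):
        # With probability (1 - damping) the walker jumps back to a restart node.
        nxt = {h: (1.0 - damping) * restart[h] for h in hosts}
        for h in hosts:
            targets = out_links.get(h, [])
            if targets:
                share = damping * score[h] / len(targets)
                for t in targets:
                    nxt[t] += share
            else:
                # Dangling host: redistribute its mass via the restart distribution.
                for m in hosts:
                    nxt[m] += damping * score[h] * restart[m]
        score = nxt
    return score
```

The final prediction would then apply a threshold to the returned scores; the threshold value is left to the caller.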

Page 88: Using Topology to Identify Spam (SIGIR 2007)

Idea 2: Propagate the label (cont.)

Initial prediction:

Page 89: Using Topology to Identify Spam (SIGIR 2007)

Idea 2: Propagate the label (cont.)

Propagation:

Page 90: Using Topology to Identify Spam (SIGIR 2007)

Idea 2: Propagate the label (cont.)

Final prediction, applying a threshold:

Page 91: Using Topology to Identify Spam (SIGIR 2007)

Idea 2: Propagate the label – Results

                      Baseline   Fwds.   Backwds.   Both
Classifier without bagging
True positive rate    75.6%      70.9%   69.4%      71.4%
False positive rate   8.5%       6.1%    5.8%       5.8%
F-Measure             0.646      0.665   0.664      0.676
Classifier with bagging
True positive rate    78.7%      76.5%   75.0%      75.2%
False positive rate   5.7%       5.4%    4.3%       4.7%
F-Measure             0.723      0.716   0.733      0.724

Page 92: Using Topology to Identify Spam (SIGIR 2007)

Idea 3: Stacked graphical learning

Meta-learning scheme [Cohen and Kou, 2006]

Derive initial predictions

Generate an additional attribute for each object by combining predictions on neighbors in the graph

Append additional attribute in the data and retrain

Page 93: Using Topology to Identify Spam (SIGIR 2007)

Idea 3: Stacked graphical learning (cont.)

Let p(x) ∈ [0, 1] be the prediction of a classification algorithm for a host x using k features

Let N(x) be the set of pages related to x (in some way)

Compute

$f(x) = \frac{\sum_{g \in N(x)} p(g)}{|N(x)|}$

Add f(x) as an extra feature for instance x and learn a new model with k + 1 features
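The extra feature f(x) is simply the mean prediction over the related hosts. The sketch below assumes predictions and neighbor lists are held in dictionaries, skips neighbors without a prediction, and uses 0.0 when N(x) is empty; the formula leaves that case undefined, so the default is an assumption of this example, as is the function name.

```python
def stacked_feature(predictions, neighbors):
    """Compute f(x) = sum of p(g) for g in N(x), divided by |N(x)|.

    predictions: dict host -> p(host) in [0, 1] from the first-pass classifier.
    neighbors:   dict host -> list of related hosts (in-links, out-links, or both).
    """
    feature = {}
    for host in predictions:
        related = [g for g in neighbors.get(host, []) if g in predictions]
        if related:
            feature[host] = sum(predictions[g] for g in related) / len(related)
        else:
            feature[host] = 0.0  # assumed default for hosts with no neighbors
    return feature
```

Appending this value to each host's k original features and retraining gives the second model; repeating the procedure on the new predictions gives a further pass.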

Page 94: Using Topology to Identify Spam (SIGIR 2007)

Idea 3: Stacked graphical learning (cont.)

Initial prediction:

Page 95: Using Topology to Identify Spam (SIGIR 2007)

Idea 3: Stacked graphical learning (cont.)

Computation of new feature:

Page 96: Using Topology to Identify Spam (SIGIR 2007)

Idea 3: Stacked graphical learning (cont.)

New prediction with k + 1 features:

Page 97: Using Topology to Identify Spam (SIGIR 2007)

Idea 3: Stacked graphical learning - Results

                      Baseline   Avg. of in   Avg. of out   Avg. of both
True positive rate    78.7%      84.4%        78.3%         85.2%
False positive rate   5.7%       6.7%         4.8%          6.1%
F-Measure             0.723      0.733        0.742         0.750

Increases detection rate

Page 98: Using Topology to Identify Spam (SIGIR 2007)

Idea 3: Stacked graphical learning x2

And repeat ...

                      Baseline   First pass   Second pass
True positive rate    78.7%      85.2%        88.4%
False positive rate   5.7%       6.1%         6.3%
F-Measure             0.723      0.750        0.763

Significant improvement over the baseline

Page 100: Using Topology to Identify Spam (SIGIR 2007)

Concluding remarks

Considering content-based and link-based attributes improves the accuracy of the classifier

Considering the links among pages improves the accuracy

Page 102: Using Topology to Identify Spam (SIGIR 2007)

Web Spam Dataset: http://www.yr-bcn.es/webspam/

Web Spam Challenge I & II: http://webspam.lip6.fr/

AIRWeb Workshop: http://airweb.cse.lehigh.edu/

GraphLab at ECML/PKDD: http://graphlab.lip6.fr/

Newsletter: [email protected]

Thank you!

Page 104: Using Topology to Identify Spam (SIGIR 2007)

Becchetti, L., Castillo, C., Donato, D., Leonardi, S., and Baeza-Yates, R. (2006). Using rank propagation and probabilistic counting for link-based spam detection. In Proceedings of the Workshop on Web Mining and Web Usage Analysis (WebKDD), Pennsylvania, USA. ACM Press.

Cohen, W. W. and Kou, Z. (2006). Stacked graphical learning: approximating learning in Markov random fields using very short inhomogeneous Markov chains. Technical report.

Davison, B. D. (2000). Topical locality in the web. In Proceedings of the 23rd annual international ACM SIGIR conference on research and development in information retrieval, pages 272–279, Athens, Greece. ACM Press.

Fetterly, D., Manasse, M., and Najork, M. (2004). Spam, damn spam, and statistics: Using statistical analysis to locate spam web pages. In Proceedings of the seventh workshop on the Web and databases (WebDB), pages 1–6, Paris, France.

Flajolet, P. and Martin, N. G. (1985). Probabilistic counting algorithms for data base applications. Journal of Computer and System Sciences, 31(2):182–209.

Page 105: Using Topology to Identify Spam (SIGIR 2007)

Gibson, D., Kumar, R., and Tomkins, A. (2005). Discovering large dense subgraphs in massive graphs. In VLDB ’05: Proceedings of the 31st international conference on Very large data bases, pages 721–732. VLDB Endowment.

Gyongyi, Z., Garcia-Molina, H., and Pedersen, J. (2004). Combating Web spam with TrustRank. In Proceedings of the 30th International Conference on Very Large Data Bases (VLDB), pages 576–587, Toronto, Canada. Morgan Kaufmann.

Ntoulas, A., Najork, M., Manasse, M., and Fetterly, D. (2006). Detecting spam web pages through content analysis. In Proceedings of the World Wide Web conference, pages 83–92, Edinburgh, Scotland.

Palmer, C. R., Gibbons, P. B., and Faloutsos, C. (2002). ANF: a fast and scalable tool for data mining in massive graphs. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 81–90, New York, NY, USA. ACM Press.