Download - NTU Natural Language Processing Lab. 1 Blog Track Open Task: Spam Blog Classification Advisor: Hsin-Hsi Chen Speaker: Sheng-Chung Yen Date: 2007/01/08

NTU Natural Language Processing Lab.

Blog Track Open Task: Spam Blog Classification

Advisor: Hsin-Hsi Chen
Speaker: Sheng-Chung Yen

Date: 2007/01/08

P. Kolari, T. Finin, A. Java and J. Mayfield
University of Maryland, Baltimore County and Johns Hopkins University Applied Physics Laboratory

Outline

Introduction
Splog Detection Problem
Detecting Splogs
TREC Blog Track 2006
Splog Task Assessment
Conclusion

Conclusion

This paper:
– proposes a spam blog classification task for TREC Blog Track 2007
– argues why it forms an important part of blog analytics
– surveys existing techniques for eliminating splogs
– shows how splogs impacted the primary task of TREC Blog Track 2006
– puts forward assessment and evaluation for such a task, to be adopted in TREC Blog Track 2007

Introduction

Spam blogs, or splogs, are blogs created for the sole purpose of hosting ads, promoting the page rank of affiliates, and getting new content indexed.

This open task submission:
– details how splogs impact opinion identification
– proposes an approach to assessment and evaluation for a Spam Blog Classification task in 2007

Splog Detection Problem

A post from a splog:

1. Displays ads in high-paying contexts.
2. Features content plagiarized from other blogs.
3. Hosts hyperlinks that create link farms.

Splog Detection is a classification problem within the blogosphere subset B.

BA: represents all authentic content

BS: represents content from splogs

BU: represents those blog pages for which a judgment of authenticity or spam has not yet been made
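
The three subsets can be sketched as a simple labeling of B. This is only an illustration: it assumes BA, BS and BU are disjoint and together cover B, and the `Judgment` enum, the `partition` helper and the URLs are hypothetical names that do not come from the paper.

```python
from enum import Enum


class Judgment(Enum):
    AUTHENTIC = "BA"  # judged authentic
    SPLOG = "BS"      # judged to be a splog
    UNJUDGED = "BU"   # no judgment made yet


def partition(blogs):
    """Split a {url: Judgment} mapping into the three subsets of B."""
    subsets = {judgment: set() for judgment in Judgment}
    for url, judgment in blogs.items():
        subsets[judgment].add(url)
    return subsets


# Hypothetical URLs, for illustration only.
B = {
    "http://a.example/blog": Judgment.AUTHENTIC,
    "http://b.example/blog": Judgment.SPLOG,
    "http://c.example/blog": Judgment.UNJUDGED,
    "http://d.example/blog": Judgment.UNJUDGED,
}
parts = partition(B)
```

Under this reading, a splog detector is a classifier that moves pages out of BU and into BA or BS.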

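[Figure: a post from a splog, with callouts 1, 2 and 3]

[Figure: diagram of the blogosphere subset B containing BA, BS and BU]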

Detecting Splogs

All models are based on SVMs.

Words (bag-of-words)
– Ex: “I”, “We”, “my”, “what” (indicative of an authentic blog)

Word N-Gram
– Ex: “comments-off”, “in-uncategorized” (indicative of a splog)
– Ex: “2-comments”, “1-comments”, “I have”, “to my” (indicative of an authentic blog)

Tokenized Anchors
– Anchor text: <a>anchor text</a>
– Ex: “comment”, “flickr” (indicative of an authentic blog)
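
A minimal sketch of how the three text feature types above might be extracted from a post, using only the Python standard library. The tokenization rule, the function names and the sample post are assumptions made for illustration; the slides do not specify them, and training the SVM on the resulting counts is left out.

```python
import re
from collections import Counter
from html.parser import HTMLParser

TOKEN = re.compile(r"[a-z0-9]+")  # assumed tokenization: lowercase alphanumeric runs


def tokenize(text):
    return TOKEN.findall(text.lower())


def bag_of_words(text):
    """Count single words."""
    return Counter(tokenize(text))


def word_ngrams(text, n=2):
    """Count word n-grams, joined with '-' as in the slide examples."""
    tokens = tokenize(text)
    return Counter("-".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


class AnchorTextParser(HTMLParser):
    """Collect the text that appears inside <a>...</a> elements."""

    def __init__(self):
        super().__init__()
        self._depth = 0
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.texts.append(data)


def tokenized_anchors(html):
    """Count the tokens found in anchor text."""
    parser = AnchorTextParser()
    parser.feed(html)
    parser.close()
    return Counter(tokenize(" ".join(parser.texts)))


# Hypothetical post, for illustration only.
post_text = "I have 2 comments"
post_html = '<p>See <a href="http://www.flickr.com/photos/alice">my flickr photos</a></p>'
words = bag_of_words(post_text)
bigrams = word_ngrams(post_text)
anchors = tokenized_anchors(post_html)
```

Each `Counter` can be turned into a sparse feature vector for a classifier; whether the original models used raw counts or binary features is not stated in the slides.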

Tokenized URLs
– Ex: links pointing to a “.info” domain (indicative of a splog)
– Ex: links pointing to “flickr”, “technorati” and “feedster” (indicative of an authentic blog)

Global Models
– Authentic blogs are very unlikely to link to splogs.
– Splogs frequently link to other splogs.

Other Techniques
– Ping server
– URL/IP blacklists
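
The URL-based ideas above can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' implementation: the splitting rule, the function names, the host-only blacklist (IP blacklists are not covered) and the sample hosts are all hypothetical.

```python
import re
from urllib.parse import urlsplit


def tokenize_url(url):
    """Split a URL's host and path into lowercase alphanumeric tokens."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    return [t for t in re.split(r"[^a-z0-9]+", (host + " " + parts.path).lower()) if t]


def is_blacklisted(url, blacklist):
    """True if the URL's host equals, or is a subdomain of, a blacklisted host."""
    host = urlsplit(url).hostname or ""
    return any(host == b or host.endswith("." + b) for b in blacklist)


def splog_outlink_ratio(outlinks, splog_hosts):
    """Fraction of a blog's outlinks that point to known splog hosts.

    A crude stand-in for the global (link-based) signal: splogs tend to
    link to other splogs, authentic blogs rarely do.
    """
    if not outlinks:
        return 0.0
    hits = sum(1 for url in outlinks if is_blacklisted(url, splog_hosts))
    return hits / len(outlinks)


# Hypothetical hosts, for illustration only.
known_splog_hosts = {"splog-host.info"}
outlinks = [
    "http://splog-host.info/a",
    "http://www.flickr.com/photos/alice",
]
url_tokens = tokenize_url(outlinks[1])
ratio = splog_outlink_ratio(outlinks, known_splog_hosts)
```

Tokens such as `info` or `flickr` then become features in the same way as words, while the outlink ratio is one possible way to expose link structure to a classifier.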

TREC Blog Track 2006

17969 feeds from splogs, contributing 15.8% of the documents.

The number of splogs present varies since splogs are query dependent.
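
One way to make the query dependence concrete is to measure, per query, what share of the returned feeds are known splogs. The sketch below assumes a ranked list of feed IDs per query and a set of splog feed IDs; the function name, the cutoff `k` and all IDs are made up for illustration and are not TREC data.

```python
def splog_share(ranked_feeds, splog_feeds, k=None):
    """Share of the top-k results (all results if k is None) that are splogs."""
    top = ranked_feeds if k is None else ranked_feeds[:k]
    if not top:
        return 0.0
    return sum(1 for feed in top if feed in splog_feeds) / len(top)


# Made-up feed IDs and rankings, for illustration only.
splog_feeds = {"f2", "f4", "f7"}
results_by_query = {
    "query-a": ["f1", "f2", "f4", "f7"],
    "query-b": ["f1", "f3", "f5", "f6"],
}
shares = {q: splog_share(r, splog_feeds) for q, r in results_by_query.items()}
```

Comparing the shares across queries shows how unevenly splogs can be distributed over topics.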

[Figure: panels for the queries “Cholesterol” and “Hybrid cars”]

Splog Task Assessment

The classification of splogs:
– Non-blog
– Keyword-stuffing
– Post-stitching
– Post-plagiarism
– Post-weaving
– Link-spam
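
For assessment, the six categories could be recorded as a fixed label set. The sketch below is an assumption about how assessor judgments might be stored: the `SplogType` enum, the `tally` helper, the feed IDs and the choice to allow several labels per feed are hypothetical and not taken from the paper.

```python
from collections import Counter
from enum import Enum


class SplogType(Enum):
    NON_BLOG = "non-blog"
    KEYWORD_STUFFING = "keyword-stuffing"
    POST_STITCHING = "post-stitching"
    POST_PLAGIARISM = "post-plagiarism"
    POST_WEAVING = "post-weaving"
    LINK_SPAM = "link-spam"


def tally(assessments):
    """Count how often each category was assigned across all assessed feeds."""
    counts = Counter()
    for labels in assessments.values():
        counts.update(labels)
    return counts


# Made-up assessments, for illustration only. Allowing more than one
# label per feed is an assumption.
assessments = {
    "feed-1": {SplogType.KEYWORD_STUFFING, SplogType.LINK_SPAM},
    "feed-2": {SplogType.POST_PLAGIARISM},
    "feed-3": {SplogType.LINK_SPAM},
}
counts = tally(assessments)
```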

[Figures, one slide per category: Non-blog, Keyword-stuffing, Post-stitching, Post-plagiarism, Post-weaving, Link-spam]

Conclusion

This paper:
– proposes a spam blog classification task for TREC Blog Track 2007
– argues why it forms an important part of blog analytics
– surveys existing techniques for eliminating splogs
– shows how splogs impacted the primary task of TREC Blog Track 2006
– puts forward assessment and evaluation for such a task, to be adopted in TREC Blog Track 2007
