NTU Natural Language Processing Lab.
Blog Track Open Task: Spam Blog Classification
Advisor: Hsin-Hsi Chen
Speaker: Sheng-Chung Yen
Date: 2007/01/08
P. Kolari, T. Finin, A. Java and J. Mayfield
University of Maryland, Baltimore County and
Johns Hopkins University Applied Physics Laboratory
Outline
Introduction
Splog Detection Problem
Detecting Splogs
TREC Blog Track 2006
Splog Task Assessment
Conclusion
Introduction
Spam blogs, or splogs, are blogs created for the sole purpose of hosting ads, promoting the page rank of affiliates, and getting new content indexed.
This open task submission:
– details how splogs impact opinion identification
– proposes an approach to assessment and evaluation for a Spam Blog Classification task in 2007
Splog Detection Problem
A post from a splog:
1. Displays ads in high-paying contexts.
2. Features content plagiarized from other blogs.
3. Hosts hyperlinks that create link farms.
Splog detection is a classification problem within the blogosphere subset B:
– BA: all authentic content
– BS: content from splogs
– BU: blog pages for which a judgment of authenticity or spam has not yet been made
[Figure: a post from a splog, with callouts 1, 2 and 3 marking the three features listed above]
[Figure: the blogosphere subset B divided into BA, BS and BU]
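The split of B into BA, BS and BU can be sketched in a few lines of Python. This is only an illustration: the `Label` enum, the `partition` function and the `*.example` blog names are hypothetical and do not come from the paper.

```python
from enum import Enum


class Label(Enum):
    AUTHENTIC = "BA"   # judged authentic
    SPLOG = "BS"       # judged spam
    UNJUDGED = "BU"    # no judgment made yet


def partition(blogs, judgments):
    """Split the set of blogs B into (BA, BS, BU).

    Any blog without an entry in `judgments` falls into BU.
    """
    parts = {label: set() for label in Label}
    for blog in blogs:
        parts[judgments.get(blog, Label.UNJUDGED)].add(blog)
    return parts[Label.AUTHENTIC], parts[Label.SPLOG], parts[Label.UNJUDGED]


# Hypothetical data: four blogs, two of them already judged.
blogs = {"a.example", "b.example", "c.example", "d.example"}
judgments = {"a.example": Label.AUTHENTIC, "b.example": Label.SPLOG}
ba, bs, bu = partition(blogs, judgments)
```

Under this view, a splog classifier is whatever moves pages out of BU and into BA or BS.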
Detecting Splogs
All models are based on SVMs.
Words (bag-of-words)
– Ex: “I”, “We”, “my”, “what” → authentic blog
Word N-Gram
– Ex: “comments-off”, “in-uncategorized” → splog
– Ex: “2-comments”, “1-comments”, “I have”, “to my” → authentic blog
Tokenized Anchors
– Anchor text: <a>anchor text</a>
– “comment”, “flickr” → authentic blog
Tokenized URLs
– Point to “.info” domain → splog
– Point to “flickr”, “technorati” and “feedster” → authentic blog
Global Models
– Authentic blogs are very unlikely to link to splogs.
– Splogs frequently link to other splogs.
Other Techniques
– Ping server
– URL/IP blacklists
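The four local feature types (words, word n-grams, tokenized anchors, tokenized URLs) could be extracted roughly as follows. This is a minimal sketch using only the Python standard library, not the authors' implementation: the tokenizer, the restriction to bigrams, and the names `AnchorCollector` and `extract_features` are all assumptions. The resulting binary feature sets would then be passed to an SVM, which is not shown here.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlparse

TOKEN = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return TOKEN.findall(text.lower())


class AnchorCollector(HTMLParser):
    """Collects page text, anchor text, and link targets from HTML."""

    def __init__(self):
        super().__init__()
        self.in_anchor = False
        self.text, self.anchor_text, self.urls = [], [], []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_anchor = True
            href = dict(attrs).get("href")
            if href:
                self.urls.append(href)

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_anchor = False

    def handle_data(self, data):
        self.text.append(data)
        if self.in_anchor:
            self.anchor_text.append(data)


def extract_features(html):
    """Return one set of binary features per feature type."""
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    words = tokenize(" ".join(parser.text))
    # Word n-grams, limited to bigrams here, joined as in "2-comments".
    bigrams = [f"{a}-{b}" for a, b in zip(words, words[1:])]
    url_tokens = []
    for url in parser.urls:
        parsed = urlparse(url)
        url_tokens.extend(tokenize(parsed.netloc + " " + parsed.path))
    return {
        "words": set(words),
        "bigrams": set(bigrams),
        "anchors": set(tokenize(" ".join(parser.anchor_text))),
        "urls": set(url_tokens),
    }


# Hypothetical post fragment.
html = '<p>I have 2 comments</p><a href="http://www.flickr.com/photos/me">my flickr</a>'
feats = extract_features(html)
```

For the fragment above, `feats["bigrams"]` contains “2-comments” and both `feats["anchors"]` and `feats["urls"]` contain “flickr”, matching the kinds of authentic-blog cues listed on the slides.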
TREC Blog Track 2006
17,969 feeds from splogs, contributing 15.8% of the documents.
The number of splogs present varies from query to query, since splogs are query dependent.
[Figure: results for two example queries, “Cholesterol” and “Hybrid cars”]
Splog Task Assessment
The classification of splogs:
– Non-blog
– Keyword-stuffing
– Post-stitching
– Post-plagiarism
– Post-weaving
– Link-spam
[Figures: one example slide for each category: Non-blog, Keyword-stuffing, Post-stitching, Post-plagiarism, Post-weaving, Link-spam]
Conclusion
This paper:
– proposes a spam blog classification task for TREC Blog Track 2007
– argues why the task forms an important part of blog analytics
– surveys existing techniques for eliminating splogs
– shows how splogs affected the primary task of TREC Blog Track 2006
– puts forward assessment and evaluation for such a task to be adopted in TREC Blog Track 2007