Blog Track Open Task: Spam Blog Detection
Tim Finin
http://ebiquity.umbc.edu/paper/html/id/318/
Pranam Kolari, Akshay Java, Tim Finin, Anupam Joshi, Justin Martineau
University of Maryland, Baltimore County
NIST Blog Pre-Track, 14 Nov 2006
James Mayfield, Johns Hopkins University Applied Physics Laboratory
Blogosphere Reputation at Stake!
Spam in the Blogosphere
• Types: comment spam, ping spam, spam blogs
• Akismet: “87% of all comments are spam”
• 75% of update pings are spam (ebiquity 2005)
• 20% of blogs indexed by popular blog search engines are spam (Umbria 2006, ebiquity 2005)
• “Spam blogs, sometimes referred to by the neologism splogs, are weblog sites which the author uses only for promoting affiliated websites” 1
• “Spings, or ping spam, are pings that are sent from spam blogs” 1

1 Wikipedia
Why a problem?
• Blogosphere is an increasingly important segment of the Web; ~12 hours from post to Google index
• Splog content provides no additional value
• Splog content is often plagiarized
• Splogs demote the value of authentic content
• Splogs steal advertising (referral) revenue from authentic content producers
• Splogs stress the blogosphere infrastructure
• Splogs can skew blog analytics, as was observed in TREC Blog Track 2006
Nature of Splogs in TREC 2006
• Around 83K identifiable blog home-pages in the collection, with 3.2M permalinks
• 81K blogs could be processed
• We use splog detection models developed on blog home-pages; 87% accuracy (a classifier sketch follows this slide)
• We identified 13,542 splogs
• Blacklisted 543K permalinks from these splogs
• ~16% of the entire collection
• ~17% splog posts injected into the TREC dataset 1

1 The TREC Blog06 Collection: Creating and Analyzing a Blog Test Collection – C. Macdonald, I. Ounis
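The slides do not spell out the model internals; as a rough illustration of the kind of home-page text classifier described above, here is a minimal sketch. The TF-IDF features, the logistic-regression learner and the 80/20 split are illustrative assumptions, not the authors' actual setup.

    # Hedged sketch: a bag-of-words splog classifier over blog home-page text.
    # Feature choice and learner are assumptions; the slides only report that
    # home-page models reached ~87% accuracy.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    def train_splog_classifier(pages, labels):
        """pages: list of home-page texts; labels: 1 = splog, 0 = authentic blog."""
        X_tr, X_te, y_tr, y_te = train_test_split(pages, labels,
                                                  test_size=0.2, random_state=0)
        vec = TfidfVectorizer(max_features=50_000, sublinear_tf=True)
        clf = LogisticRegression(max_iter=1000)
        clf.fit(vec.fit_transform(X_tr), y_tr)
        acc = accuracy_score(y_te, clf.predict(vec.transform(X_te)))
        return vec, clf, acc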
Impact of Splogs in TREC Queries

[Figure: distribution of splogs appearing in the top 100 search results (ranked using TF-IDF scoring) for TREC queries 851–898. Y-axis: number of splogs (0–120); x-axis: rank cutoff (5–100). Highlighted queries: "American Idol", "Cholesterol", "Hybrid Cars".]
Splog Detection Task Proposal
• Motivation
  – Detecting and eliminating spam is an essential requirement for any blog analysis
  – Splog detection has characteristics that set it apart from e-mail and web spam detection
• Constraint
  – Simulate how blog search systems operate
• Task Statement
  – Is an input permalink (post) spam?
Relation to E-mail Spam Detection
• TREC has an e-mail spam classification task
• Similar in
  – Fast online spam detection
• Different in
  – Nature of spamming: links, RSS feeds, web graph, metadata
  – Users targeted indirectly through search engines, e.g. “N1ST” is not relevant for a “NIST” query (a normalization sketch follows)
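Keyword obfuscation like “N1ST” for “NIST” targets search indexes rather than readers. A minimal de-obfuscation sketch; the substitution map and the watch list are illustrative assumptions:

    # Hedged sketch: normalize common character substitutions so obfuscated
    # spam keywords ("N1ST") match a watch list. Map and list are assumptions.
    LEET_MAP = str.maketrans({"0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "7": "t"})
    WATCH_LIST = {"nist", "casino", "mortgage"}

    def obfuscated_hits(text):
        """Return watch-list words appearing in text after de-obfuscation."""
        tokens = (t.translate(LEET_MAP).lower() for t in text.split())
        return {t for t in tokens if t in WATCH_LIST}

    # obfuscated_hits("win big at the ca5ino near N1ST") -> {"casino", "nist"}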
Relation to Web Spam Detection
• TREC does not have a web spam track
• Similar in
  – Spamming web link structure
• Different in
  – Coverage: blog analytics engines don’t look beyond the blogosphere
  – Speed of detection is important: 150K posts/hour
  – Presence of structured text through RSS feeds presents new opportunities and challenges
Difficulty
• We have been experimenting with multiple approaches starting in mid-2005
• Data: http://ebiquity.umbc.edu/resource/html/id/212
Difficulty
• Evolving spamming techniques and splog creation genres
• Most basic spam techniques
  – Generate content by stuffing key dictionary words (a detection heuristic sketch follows this list)
  – Generate links to affiliates, through link dumps on blogrolls, linkrolls or after post content
• Evolving spam techniques
  – Scrape contextually similar content to generate posts
  – RSS hijacking
  – Aggregation software, e.g. Planet X
  – Intersperse links randomly
  – Make link placement meaningful
  – Add spam comments and then ping. Repeat.
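Keyword stuffing tends to leave a statistical fingerprint: a handful of words covers an unnaturally large share of the post. A rough heuristic sketch; the thresholds are illustrative assumptions, not a method from the slides:

    # Hedged heuristic: flag a post if its top-k words account for an
    # unusually large share of all tokens. Thresholds are assumptions.
    from collections import Counter

    def looks_keyword_stuffed(text, top_k=5, threshold=0.4):
        words = [w.lower() for w in text.split() if w.isalpha()]
        if len(words) < 50:        # too short to judge reliably
            return False
        counts = Counter(words)
        top_share = sum(c for _, c in counts.most_common(top_k)) / len(words)
        return top_share > threshold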
Task Details - Dataset Creation
• Similar to TREC Blog 2006: a collection of feeds, blog home-pages and permalinks
• View dataset D as two sets: Dbase, Dtest
• Dbase to span (n - x) days, and Dtest to span the remaining x days, for x ≤ 1 (a split sketch follows)
• D could be collected as a combination of
  – D as collected in 2006
  – A sample of pings from a ping server over the period that D is collected
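The Dbase/Dtest division is simply a split on post time. A minimal sketch, assuming each post carries a "when" timestamp (as in the <when> element of the test format later in these slides):

    # Hedged sketch of the temporal Dbase/Dtest split described above.
    # Posts are assumed to be dicts with a datetime under the "when" key.
    from datetime import timedelta

    def split_collection(posts, test_days=1):
        """Dbase = first (n - x) days; Dtest = last x days (x = test_days)."""
        cutoff = max(p["when"] for p in posts) - timedelta(days=test_days)
        d_base = [p for p in posts if p["when"] <= cutoff]
        d_test = [p for p in posts if p["when"] > cutoff]
        return d_base, d_test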
Task Details - Assessment
• Assessors classify each spam post into one or more classes, based on the kind of spam exhibited by the post or by the blog hosting it
  – Non-blog
  – Keyword-stuffed
  – Post-stitching
  – Post-plagiarism
  – Post-weaving
  – Blog/link-roll spam
• Each assessment typically takes 1-2 minutes
• Detailed assessment will enable participants to identify classes they handle well and where they can improve
Non-Blog
• Ping at weblogs.com
• No RSS feeds
• No dated entries, no comments
• Possibly plagiarized content

Keyword-Stuffed Blog
• e.g. ‘coupon codes’, ‘casino’

Post Stitching
• Excerpts scraped from other sources

Post Weaving
• Spam links contextually placed in posts

Link-roll Spam
• With fully plagiarized text
Evaluation
• Dbase distributed first, Dtest subsequently with 50 independent sets of permalinks
• Dbase/Dtest division will mimic how blog search engines operate
  – Build models to detect splogs – using individual posts, feeds or blog home-pages of what has been seen
  – Detect spam in an incoming stream of new blog postings
• Teams will be judged by how well they detect “spamminess” of new posts (a scoring sketch follows)
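The slides leave the scoring metric open. One plausible choice for judging graded "spamminess" predictions, borrowing the ROC-area convention of the TREC spam track (an assumption here, not something the slides specify), is:

    # Hedged sketch: score per-permalink spam probabilities against gold
    # labels with ROC AUC. The metric choice itself is an assumption.
    from sklearn.metrics import roc_auc_score

    def score_run(gold_labels, predicted_probs):
        """gold_labels: 1 = splog post, 0 = authentic; probs in [0, 1]."""
        return roc_auc_score(gold_labels, predicted_probs)

    # score_run([1, 0, 1, 0], [0.9, 0.2, 0.7, 0.4]) -> 1.0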
Input/Output

Input format (each permalink is to be judged by participants; each <set> is an individual set of test input, and one or more such sets can be used, each biased to a specific splog genre, blog publishing host or TLD):

<set>
  <num>...</num>
  <test>
    <permalink>
      <url>...</url>
      <homepage>...</homepage>
      <feed>...</feed>
      <when>...</when>
    </permalink>
    <permalink>
      ...
    </permalink>
    ...
  </test>
</set>

Output format:

{set Q0 docno rank prob runtag}
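A minimal end-to-end sketch of consuming one test set and emitting run lines in the output format above; score_permalink is a placeholder, and treating the <url> as the docno is an assumption:

    # Hedged sketch: parse a test <set> and print lines in the
    # "{set Q0 docno rank prob runtag}" format. The scorer is a stub and
    # the docno = <url> mapping is an assumption.
    import xml.etree.ElementTree as ET

    def score_permalink(permalink_elem):
        return 0.5  # placeholder spam probability

    def run_on_set(xml_path, runtag="example-run"):
        root = ET.parse(xml_path).getroot()            # the <set> element
        set_num = root.findtext("num").strip()
        scored = [(p.findtext("url").strip(), score_permalink(p))
                  for p in root.find("test").iter("permalink")]
        scored.sort(key=lambda x: x[1], reverse=True)  # rank by spam probability
        for rank, (docno, prob) in enumerate(scored, start=1):
            print(set_num, "Q0", docno, rank, f"{prob:.4f}", runtag)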
Summary
• Spam blogs present a major challenge to the quality of blog mining/analytics
• Splog detection is different from spam detection on other communication platforms
• Development of a TREC task will help advance the state of the art
• Task requirements can be easily aligned with the existing task of opinion identification