TRANSCRIPT
Comment Spam Identification
Eric Cheng & Eric Steinlauf
What is comment spam?
Total spam: 1,226,026,178
Total ham: 62,723,306
95% are spam!
Source: http://akismet.com/stats/ Retrieved 4/22/2007
Countermeasures
Blacklisting
5yx.org, 9kx.com, aakl.com, aaql.com, aazl.com, abcwaynet.com, abgv.com, abjg.com, ablazeglass.com, abseilextreme.net, actionbenevole.com, acvt.com, adbx.com, adhouseaz.com, advantechmicro.com, aeur.com, aeza.com, agentcom.com, ailh.org, akbu.com, alaskafibre.com, alkm.com, alqx.com, alumcasting-eng-inc.co!, americanasb.com, amwayau.com, amwaynz.com, amwaysa.com, amysudderjoy.com, anfb.com, anlusa.net, aobr.com, aoeb.com, apoctech.com, apqf.com, areagent.com, artstonehalloweencostumes.com
globalplasticscrap.com, gowest-veritas.com, greenlightgo.org, hadjimitsis.com, healthcarefx.com, herctrade.com, hobbyhighway.com, hominginc.com, hongkongdivas.com, hpspyacademy.com, hzlr.com, idlemindsonline.com, internetmarketingserve.com, jesh.org, jfcp.com, jfss.com, jittersjapan.com, jkjf.com, jkmrw.com, jknr.com, jksp.com, jkys.com, jtjk.com, justfareed.com, justyourbag.com, kimsanghee.org, kiosksusa.com, knivesnstuff.com, knoxvillevideo.com, ksj!, kwscuolashop.com, lancashiremcs.com, lnjk.com, localmediaaccess.com, lrgww.com, marketing-in-china.com
rockymountainair.org
rstechresources.com
samsung-integer.com
sandiegonhs.org
screwpile.org
scvend.org
sell-in-china.com
sensationalwraps.com
sevierdesign.com
starbikeshop.com
struthersinc.com
swarangeet.com
thecorporategroup.net
thehawleyco.com
thehumancrystal.com
thinkaids.org
thisandthatgiftshop.net
thomsungroup.com
ti0.org
timeby.net
tradewindswf.com
tradingb2c.com
turkeycogroup.net
vassagospalace.com
vyoung.net
web-toggery.com
webedgewars.com
webshoponsalead.com
webtoggery.com
willman-paris.com
worldwidegoans.com
Captchas
• "Completely Automated Public Turing test to tell Computers and Humans Apart"
Other ad-hoc/weak methods
• Authentication / registration
• Comment throttling
• Disallowing links in comments
• Moderation
Our Approach – Naïve Bayes
• Statistical
• Adaptive
• Automatic
• Scalable and extensible
• Works well for spam e-mail
Naïve Bayes
P(A|B) ∙ P(B) = P(AB) = P(B|A) ∙ P(A)

P(A|B) = P(B|A) ∙ P(A) / P(B)

P(spam|comment) = P(comment|spam) ∙ P(spam) / P(comment)

P(spam|comment) = P(w1|spam) ∙ P(w2|spam) ∙ … ∙ P(wn|spam) ∙ P(spam) / P(comment)

(naïve assumption: the words w1 … wn are conditionally independent given spam)
Probability of w1 occurring given a spam comment:

P(w1|spam) = 1 – (1 – x/y)^n

where x is the number of times w1 appears in all spam messages, y is the total number of words in all spam messages, and n is the length of the given comment.
Corpus (spam): "Texas casino", "Online Texas hold'em"
Incoming comment: "Texas gambling site" (n = 3)

P(Texas|spam) = 1 – (1 – 2/5)^3 = 0.784
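The slide's estimator drops straight into code. This is a minimal sketch: the whitespace tokenization and the function name `p_word_given_spam` are our own choices for the example, not from the deck.

```python
def p_word_given_spam(word, spam_comments, n):
    """Estimate P(word|spam) = 1 - (1 - x/y)**n, where x is the number of
    times `word` occurs across all spam comments, y is the total number of
    words in all spam comments, and n is the length of the incoming comment."""
    words = [w for comment in spam_comments for w in comment.split()]
    x = words.count(word)   # occurrences of the word in spam
    y = len(words)          # total words in spam
    return 1 - (1 - x / y) ** n

# The slide's worked example: x = 2, y = 5, n = 3, giving ~0.784.
p = p_word_given_spam("Texas", ["Texas casino", "Online Texas hold'em"], 3)
```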
P(spam|comment) = P(w1|spam) ∙ P(w2|spam) ∙ … ∙ P(wn|spam) ∙ P(spam) / P(comment)

Here P(w1|spam) is the probability of w1 occurring given a spam comment, and P(spam) is the probability of something being spam. But P(comment)?????? It is hard to estimate directly.

The same expansion holds for ham:

P(ham|comment) = P(w1|ham) ∙ P(w2|ham) ∙ … ∙ P(wn|ham) ∙ P(ham) / P(comment)

Taking the ratio cancels the troublesome P(comment):

P(spam|comment) / P(ham|comment) = [P(w1|spam) ∙ P(w2|spam) ∙ … ∙ P(wn|spam) ∙ P(spam)] / [P(w1|ham) ∙ P(w2|ham) ∙ … ∙ P(wn|ham) ∙ P(ham)]

Taking logs of each side (the shared log(P(comment)) term cancels in the comparison):

log(P(spam|comment)) ∝ log(P(w1|spam)) + log(P(w2|spam)) + … + log(P(wn|spam)) + log(P(spam))
log(P(ham|comment)) ∝ log(P(w1|ham)) + log(P(w2|ham)) + … + log(P(wn|ham)) + log(P(ham))
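The log-sum comparison can be sketched in a few lines. This is an assumption-laden sketch: `log_score` is our own helper, the fallback floor for words missing from the probability table is our choice (the slides do not say how unseen words are handled), and the priors 0.67/0.33 come from the deck's corpus statistics.

```python
import math

def log_score(words, word_probs, prior):
    """Sum of log word probabilities plus the log prior, matching the
    slides' log-sum form. Words absent from the table fall back to a
    small floor (an assumption) so log() never sees zero."""
    floor = 1e-6
    return math.log(prior) + sum(math.log(word_probs.get(w, floor)) for w in words)

# Toy tables using two scores from the deck's spam-word list.
spam_probs = {"casino": 0.999918, "online": 0.999455}
ham_probs = {"casino": 0.000082, "online": 0.000545}

words = ["casino", "online"]
is_spam = log_score(words, spam_probs, 0.67) > log_score(words, ham_probs, 0.33)
# is_spam is True for this comment
```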
Fact: P(spam|comment) = 1 – P(ham|comment)

Abuse of notation: write P(s) = P(spam|comment) and P(h) = P(ham|comment), so P(s) = 1 – P(h).

Let m = log(P(s)) – log(P(h)) = log(P(s)/P(h))

Then e^m = e^log(P(s)/P(h)) = P(s)/P(h), so e^m ∙ P(h) = P(s).

Substituting P(s) = 1 – P(h):

e^m ∙ P(h) = 1 – P(h)
(e^m + 1) ∙ P(h) = 1
P(h) = 1/(e^m + 1)
P(s) = 1 – P(h)

In full notation, with m = log(P(spam|comment)) – log(P(ham|comment)):

P(ham|comment) = 1/(e^m + 1)
P(spam|comment) = 1 – P(ham|comment)

In practice, just compare log(P(spam|comment)) and log(P(ham|comment)).
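The closing identity, recovering a probability from the two log scores, checks out numerically. A minimal sketch; `spam_probability` is a hypothetical helper name, not from the deck.

```python
import math

def spam_probability(log_p_spam, log_p_ham):
    """Turn the two unnormalized log scores into a probability via
    m = log P(spam|c) - log P(ham|c) and P(ham|c) = 1/(e^m + 1)."""
    m = log_p_spam - log_p_ham
    p_ham = 1.0 / (math.exp(m) + 1.0)
    return 1.0 - p_ham  # P(spam|comment)

spam_probability(0.0, 0.0)          # equal scores give exactly 0.5
spam_probability(math.log(9), 0.0)  # a score gap of log(9) gives ~0.9
```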
Implementation
Corpus
• A collection of 50 blog pages with 1024 comments
• Manually tagged as spam/non-spam
• 67% are spam
• Provided by the Informatics Institute at the University of Amsterdam
Blocking Blog Spam with Language Model Disagreement, G. Mishne, D. Carmel, and R. Lempel. In: AIRWeb '05 - First International Workshop on Adversarial Information Retrieval on the Web, at the 14th International World Wide Web Conference (WWW2005), 2005.
Most popular spam words
casino 0.999918 0.000082076
betting 0.999879 0.000120513
texas 0.999813 0.000187148
biz 0.999776 0.000223708
holdem 0.999738 0.000262111
poker 0.999551 0.000448675
pills 0.999527 0.000473407
pokerabc 0.999506 0.000493821
teen 0.999455 0.000544715
online 0.999455 0.000544715
bowl 0.999437 0.000562555
gambling 0.999437 0.000562555
sonneries 0.999353 0.000647359
blackjack 0.999346 0.000653516
pharmacy 0.999254 0.000745723
“Clean” words
edu 0.00287339 0.997127
projects 0.00270528 0.997295
week 0.00270528 0.997295
etc 0.00270528 0.997295
went 0.00270528 0.997295
inbox 0.00270528 0.997295
bit 0.00270528 0.997295
someone 0.00255576 0.997444
bike 0.00230136 0.997699
already 0.00230136 0.997699
selling 0.00219225 0.997808
making 0.00209302 0.997907
squad 0.00184278 0.998157
left 0.00177216 0.998228
important 0.0013973 0.998603
pimps 0.000427782 0.999572
Implementation
• Corpus parsing and processing
• Naïve Bayes algorithm
• Randomly select 70% for training, 30% for testing
• Stand-alone web service
• Written entirely in Python
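The 70/30 split might look like the following. A minimal sketch: the slides do not show the splitting code, so `split_corpus`, the fixed seed, and the (text, is_spam) pair shape are assumptions for illustration.

```python
import random

def split_corpus(comments, train_frac=0.7, seed=0):
    """Randomly split tagged comments into training and test sets.

    `comments` is a list of (text, is_spam) pairs; 70% go to training
    and the remaining 30% to testing, as in the slides."""
    rng = random.Random(seed)       # fixed seed for reproducibility
    shuffled = comments[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

# 1024 comments, as in the corpus described above.
train, test = split_corpus([("comment %d" % i, i % 3 == 0) for i in range(1024)])
```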
It’s showtime!
Configurations
• Separator used to tokenize comment
• Inclusion of words from header
• Classify based only on most significant words
• Double count non-spam comments
• Include article body as non-spam example
• Boosting
Minimum Error Configuration
• Separator: [^a-z<>]+
• Header: Both
• Significant words: All
• Double count: No
• Include body: No
• Boosting: No
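The winning separator can be tried directly. A sketch: `tokenize` is an assumed helper name, and the reading that angle brackets are kept so HTML tag fragments stay visible as tokens is our interpretation of the regex.

```python
import re

# The minimum-error configuration splits on [^a-z<>]+: lowercase the
# comment, then treat any run of characters that is neither a lowercase
# letter nor an angle bracket as a separator.
SEPARATOR = re.compile(r"[^a-z<>]+")

def tokenize(comment):
    tokens = SEPARATOR.split(comment.lower())
    return [t for t in tokens if t]  # drop empty edge tokens

tokenize("Free CASINO!!! Win BIG $$$")  # ['free', 'casino', 'win', 'big']
```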
Varying Configuration Parameters
[Chart: Header Inclusion vs. Test Set Error. X-axis: Header Inclusion Method (Both, Include, Tag); y-axis: Test Set Error (0.06 to 0.16).]
[Chart: Word Separator vs. Test Set Error. X-axis: Separator RegEx ([^a-z]+, [^a-z<>]+, \W+); y-axis: Test Set Error (0.0590 to 0.0620).]
Varying Configuration Parameters
[Chart: Double Counting Non Spam. X-axis: Double Counting (False, True); y-axis: Test Set Error (0.0590 to 0.0620).]
[Chart: Top Word Filtering Method vs. Test Set Error. X-axis: Top Word Filtering Method (2 to 14); y-axis: Test Set Error (0.10 to 0.25).]
Boosting
• Naïve Bayes is applied repeatedly to the data.
• Produces a weighted-majority model.
bayesModels = []                     # list of (model, error) pairs
weights = [1.0] * len(examples)      # start with uniform example weights
for i in range(M):
    model = naiveBayes(examples, weights)              # train a weighted model
    error = computeError(model, examples)              # weighted training error
    weights = adjustWeights(examples, weights, error)  # emphasize the mistakes
    bayesModels.append((model, error))                 # error sets the model's vote
    if error == 0:
        break
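The slides leave `adjustWeights` undefined. One standard AdaBoost-style choice (an assumption on our part, not necessarily what the authors used) is to shrink the weights of correctly classified examples by beta = error/(1 - error) and renormalize, so the next round concentrates on the mistakes.

```python
def adjust_weights(predictions, labels, weights, error):
    """AdaBoost-style reweighting (an assumed realization of the slides'
    adjustWeights): down-weight correct examples by error/(1 - error),
    then renormalize so the weights sum to 1."""
    beta = error / (1.0 - error)
    new = [w * beta if p == y else w
           for w, p, y in zip(weights, predictions, labels)]
    total = sum(new)
    return [w / total for w in new]

# One of four equally weighted examples misclassified (error 0.25):
# after the update, the mistake carries half the total weight.
w = adjust_weights([1, 0, 1, 0], [1, 1, 1, 0], [0.25] * 4, 0.25)
```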
Boosting
[Chart: Boosting Level vs. Training Set Error. X-axis: Boosting Level (5 to 25); y-axis: error (0.016 to 0.024).]
[Chart: Boosting Level vs. Test Set Error. X-axis: Boosting Level (5 to 25); y-axis: Test Set Error (0.105 to 0.140).]
Future Work (or what we did not do)
Data Processing
• Follow links in comment and include words in target web page
• More sophisticated tokenization and URL handling (handling $100,000...)
• Word stemming
Features
• Ability to incorporate incoming comments into corpus
• Ability to mark comment as spam/non-spam
• Assign more weight to page content
• Adjust probability table based on page content, providing content-sensitive filtering
Comments?
No spam, please.