CEAS 2008
Battista Biggio, Ignazio Pillai, Giorgio Fumera, Fabio Roli
Pattern Recognition and Applications Group
University of Cagliari, Italy
Department of Electrical and Electronic Engineering
R AP G
5th Conference on Email and Anti-Spam (CEAS) 2008, Mountain View, California, USA, August 21st - 22nd
Improving Image Spam Filtering Using Image Text Features
21-08-2008 Image Spam Filtering 2CEAS 2008
About me
• Pattern Recognition and Applications Group
  http://prag.diee.unica.it
  – DIEE, University of Cagliari, Italy.
• Contact
  – Battista Biggio, Ph.D. student
Pattern Recognition and Applications Group
• Research interests
  – Methodological issues
    • Multiple classifier systems
    • Adversarial learning
    • Classification reliability
  – Main applications
    • Intrusion detection in computer networks
    • Multimedia document categorization, spam filtering
    • Biometric authentication (fingerprint, face)
    • Content-based image retrieval
• Faculty members
  – F. Roli (group head)
  – G. Giacinto
  – G. Fumera
  – L. Didaci
  – G.L. Marcialis
– 7 PhD students
– 3 post docs
– 2 consultants
Outline
• Introduction
  – What is image spam?
• Image spam filtering
  – Image spam SoA
  – Our work
• Experiments
• A plug-in for SpamAssassin: Image Cerberus
Image spam
• Since about 2005: image spam
  – Embedding spam messages into images to evade spam filter modules based on machine learning approaches (e.g. Bayesian filters)
  – Adding adversarial noise to prevent OCR from reading embedded text (obfuscated spam images)
Image spam: state of the art (SoA)
• Commercial / open source anti-spam filters:
  – OCR + keyword search
  – Image low-level feature analysis
• Research:
  – OCR + text categorization (TC)
    • Fumera et al., JMLR 2006
      – BayesOCR plug-in for SpamAssassin
  – Image classifiers (ham/spam) based on low-level image features (text areas, color distribution, etc.)
    • Wu et al., ICIP 2005
    • Aradhye et al., ICDAR 2005
    • Dredze et al., CEAS 2007
Our past work
• OCR is not effective against obfuscated images
  – Spammers learned from CAPTCHAs / HIPs!
• Our idea: the presence of adversarial obfuscated text can be a spamminess hint (Biggio et al., CEAS 2007)
  – How did we detect the presence of adv. obfuscated text?
    • Four features based on:
      – Text localisation
      – Perimetric complexity
      – Edge detection
• However, these features did not work as we thought for detecting only adversarial obfuscated text…
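As an illustration of one of the cues named above, perimetric complexity is commonly defined as the squared perimeter of a binarised character divided by its area. The sketch below assumes that common definition and a simple pixel-edge perimeter count; the function name and the toy images are hypothetical, and this is not necessarily the exact measure used in Biggio et al., CEAS 2007.

```python
def perimetric_complexity(binary):
    """Squared perimeter over area of the foreground (1) pixels.

    `binary` is a list of equal-length rows of 0/1 values. The perimeter is
    approximated (an assumption of this sketch) by counting pixel edges
    between a foreground pixel and a background pixel or the image border.
    """
    height = len(binary)
    width = len(binary[0]) if height else 0
    area = 0
    perimeter = 0
    for y in range(height):
        for x in range(width):
            if not binary[y][x]:
                continue
            area += 1
            # Check the four neighbours of each foreground pixel.
            for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ny, nx = y + dy, x + dx
                outside = not (0 <= ny < height and 0 <= nx < width)
                if outside or not binary[ny][nx]:
                    perimeter += 1
    if area == 0:
        return 0.0
    return perimeter ** 2 / area


# Toy shapes: a solid 2x2 blob and the same four pixels scattered apart.
solid = [[1, 1],
         [1, 1]]
broken = [[1, 0, 1],
          [0, 0, 0],
          [1, 0, 1]]
```

With the same foreground area, the fragmented shape yields a higher value than the solid one, which is the kind of behaviour that makes the measure a plausible cue for broken or noisy characters.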
This work
• Our image text defect measures nonetheless seemed to provide some discriminant information about low-level text characteristics that differ between ham and spam images
• We exploit the proposed image text defect measures as additional features in approaches based on image classification techniques, to improve their discriminant capability
Experiments
• Data sets (1)
  – A: 2006 ham images, 3297 spam images
  – B: 2006 ham images, 8549 spam images
• Image feature sets
  – Aradhye et al., ICDAR 2005
    • Color heterogeneity, color saturation, text area
  – Dredze et al., CEAS 2007
    • Image meta-data, visual features
  – Four other visual features, for comparison (generic)
    • Number of colors (log), number of pixels (log), relative area occupied by the most common color, text area
  – Features used in this work (text)
(1) Data sets are publicly available at: http://prag.diee.unica.it/n3ws1t0/eng/spamRepository
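Three of the four generic features listed above can be computed directly from pixel values; the fourth (text area) needs a text localisation step and is left out here. The following is a minimal sketch under stated assumptions: the image is given as rows of (r, g, b) tuples, the logarithm base is 10 (the slides do not specify it), and the function and key names are hypothetical.

```python
import math
from collections import Counter


def generic_features(pixels):
    """Compute three simple visual features from an RGB image.

    `pixels` is a list of rows, each row a list of (r, g, b) tuples.
    The base-10 logarithm is an assumption of this sketch.
    """
    flat = [p for row in pixels for p in row]
    if not flat:
        raise ValueError("empty image")
    counts = Counter(flat)
    # Pixel count of the single most frequent colour.
    most_common_count = counts.most_common(1)[0][1]
    return {
        "log_num_colors": math.log10(len(counts)),
        "log_num_pixels": math.log10(len(flat)),
        "dominant_color_area": most_common_count / len(flat),
    }


# Toy 10x10 image: white background with one row of black pixels.
white, black = (255, 255, 255), (0, 0, 0)
image = [[black] * 10] + [[white] * 10 for _ in range(9)]
features = generic_features(image)
```

For the toy image, 90 of the 100 pixels share the background colour, so the relative area of the most common colour is 0.9.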
Experiments (cont’d)
• We evaluated the performance of image ham/spam classifiers based on individual feature sets (aradhye, dredze, generic) and on their fusion (either at feature or score level) with our features (text).
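[Figure: the two fusion schemes. Feature level fusion: a single classifier C(x1∪x2) on the union of the two feature sets. Score level fusion: classifiers C(x1) and C(x2), whose output scores s are fed to a classifier C(s).]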
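The two fusion schemes can be sketched as follows. This is an illustrative toy, not the classifiers used in the experiments: `CentroidClassifier` is a hypothetical nearest-centroid scorer standing in for whatever classifier C is chosen, and the training data are made up.

```python
import math


class CentroidClassifier:
    """Toy ham/spam scorer (hypothetical): positive scores lean towards spam."""

    def fit(self, X, y):
        self.spam = self._mean([x for x, label in zip(X, y) if label == 1])
        self.ham = self._mean([x for x, label in zip(X, y) if label == 0])
        return self

    def score(self, x):
        # Distance to the ham centroid minus distance to the spam centroid.
        return math.dist(x, self.ham) - math.dist(x, self.spam)

    @staticmethod
    def _mean(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]


def feature_level_fusion(X1, X2, y):
    """Train one classifier C(x1 ∪ x2) on the concatenated feature vectors."""
    clf = CentroidClassifier().fit([a + b for a, b in zip(X1, X2)], y)
    return lambda x1, x2: clf.score(x1 + x2)


def score_level_fusion(X1, X2, y):
    """Train C(x1) and C(x2), then a combiner C(s) on their output scores."""
    c1 = CentroidClassifier().fit(X1, y)
    c2 = CentroidClassifier().fit(X2, y)
    S = [[c1.score(a), c2.score(b)] for a, b in zip(X1, X2)]
    cs = CentroidClassifier().fit(S, y)
    return lambda x1, x2: cs.score([c1.score(x1), c2.score(x2)])


# Made-up data: x1 plays the role of an existing feature set, x2 of the text features.
X1 = [[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]]
X2 = [[0.0], [0.1], [1.0], [0.9]]
y = [0, 0, 1, 1]  # 0 = ham, 1 = spam

by_feature = feature_level_fusion(X1, X2, y)
by_score = score_level_fusion(X1, X2, y)
```

Both functions return a scorer taking the two feature vectors of a new image. For brevity the combiner C(s) is trained on scores computed on the same training images; in practice held-out scores would be a safer choice.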
Results
A plug-in for SpamAssassin: Image Cerberus
• We implemented a SpamAssassin plug-in based on our approach
  – generic + text fused at feature level
• Publicly available
  – http://prag.diee.unica.it/n3ws1t0/imageCerberus
• We will release source code (C++) soon
We need your feedback!
Some examples
score = 1.06 score = 0.98 score = 0.28
Some examples (cont’d)
score = 0.82 score = 1.00 score = 0.63
Spam or ham?
• Ham images from the TREC 2007 spam corpus!
score = 0.20
score = -1.4
score = 0.27
Thank you!
• See you at the poster session!
• Contacts
  – [email protected]
• Web
  – http://prag.diee.unica.it