Improving Image Spam Filtering Using Image Text Features

CEAS 2008

Battista Biggio, Ignazio Pillai, Giorgio Fumera, Fabio Roli

Pattern Recognition and Applications Group, Department of Electrical and Electronic Engineering, University of Cagliari, Italy

5th Conference on Email and Anti-Spam (CEAS) 2008, Mountain View, California, USA, August 21st - 22nd

About me

• Pattern Recognition and Applications Group
  http://prag.diee.unica.it
  – DIEE, University of Cagliari, Italy.
• Contact
  – Battista Biggio, Ph.D. student
    battista.biggio@diee.unica.it

Pattern Recognition and Applications Group

• Research interests
  – Methodological issues
    • Multiple classifier systems
    • Adversarial learning
    • Classification reliability
  – Main applications
    • Intrusion detection in computer networks
    • Multimedia document categorization, spam filtering
    • Biometric authentication (fingerprint, face)
    • Content-based image retrieval
• Faculty members
  – F. Roli (group head)
  – G. Giacinto
  – G. Fumera
  – L. Didaci
  – G.L. Marcialis
• 7 PhD students, 3 post docs, 2 consultants

Outline

• Introduction
  – What is image spam?
• Image spam filtering
  – Image spam SoA
  – Our work
• Experiments
• A plug-in for SpamAssassin: Image Cerberus

Image spam

• Since about 2005: image spam
  – Embedding spam messages into images to evade filter modules based on machine learning approaches (e.g. Bayesian filters)
  – Adding adversarial noise to prevent OCR from reading embedded text (obfuscated spam images)

Image spam SoA

• Commercial / open source anti-spam filters:
  – OCR + keyword search
  – Image low-level feature analysis
• Research:
  – OCR + TC (text categorization)
    • Fumera et al., JMLR 2006
      – BayesOCR plug-in for SpamAssassin
  – Image classifiers (ham/spam) based on low-level image features (text areas, color distribution, etc.)
    • Wu et al., ICIP 2005
    • Aradhye et al., ICDAR 2005
    • Dredze et al., CEAS 2007
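To make the "OCR + keyword search" approach concrete, here is a minimal Python sketch. It assumes the text has already been extracted from the image by some OCR engine; the keyword list and the function name `keyword_hits` are hypothetical and chosen only for illustration. The second input mimics the kind of character misreads that obfuscated images tend to cause, which is why this approach is fragile.

```python
import re

# Hypothetical keyword list, for illustration only.
SPAM_KEYWORDS = {"mortgage", "pills", "stock"}

def keyword_hits(ocr_text, keywords=SPAM_KEYWORDS):
    """Return the sorted spam keywords found in OCR-extracted text."""
    # Lower-case the text and keep alphabetic tokens only.
    tokens = set(re.findall(r"[a-z]+", ocr_text.lower()))
    return sorted(tokens & set(keywords))

# Text as a clean OCR pass might return it.
clean = "Buy cheap PILLS now, hot stock tip inside"
# The same text with misread characters, as may happen on obfuscated images.
noisy = "Buy cheap P1LLS now, hot st0ck tip inside"

print(keyword_hits(clean))
print(keyword_hits(noisy))
```

The clean text matches two keywords, while the misread text matches none: a couple of wrong characters are enough to break exact keyword matching.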

Our past work

• OCR is not effective against obfuscated images
  – Spammers learned from CAPTCHAs / HIPs!
• Our idea: the presence of adversarial obfuscated text can be a spamminess hint (Biggio et al., CEAS 2007)
  – How did we detect the presence of adversarial obfuscated text?
    • Four features based on:
      – Text localisation
      – Perimetric complexity
      – Edge detection
• However, these features did not work as we thought for detecting only adversarial obfuscated text…
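As an illustration of one ingredient named above, the following Python sketch computes perimetric complexity on a binarised image. The slides do not give the formula, so this assumes a common definition: squared perimeter divided by ink area (some variants also divide by 4π). The function name and the toy bitmaps are hypothetical.

```python
def perimetric_complexity(bitmap):
    """Squared perimeter over ink area for a binary image.

    bitmap: non-empty list of rows; 1 = ink (foreground), 0 = background.
    """
    rows, cols = len(bitmap), len(bitmap[0])
    area = 0
    perimeter = 0
    for r in range(rows):
        for c in range(cols):
            if not bitmap[r][c]:
                continue
            area += 1
            # Every pixel side facing background or the image border
            # contributes one unit of perimeter.
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                outside = not (0 <= rr < rows and 0 <= cc < cols)
                if outside or not bitmap[rr][cc]:
                    perimeter += 1
    if area == 0:
        return 0.0
    return perimeter ** 2 / area

# A compact blob of ink.
solid = [[0, 0, 0, 0],
         [0, 1, 1, 0],
         [0, 1, 1, 0],
         [0, 0, 0, 0]]
# The same amount of ink broken into isolated specks.
speckled = [[1, 0, 1, 0],
            [0, 0, 0, 0],
            [1, 0, 1, 0],
            [0, 0, 0, 0]]

print(perimetric_complexity(solid), perimetric_complexity(speckled))
```

With the same ink area, the fragmented shape has a much larger value than the compact one, which is the intuition behind using this measure as a hint of broken or noisy characters.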

This work

• Our image text defect measures seemed able to provide some discriminant information about low-level text characteristics that differ between ham and spam images
• We exploit the proposed image text defect measures as additional features in approaches based on image classification techniques, to improve their discriminant capability

Experiments

• Data sets (1)
  – A: 2006 ham images, 3297 spam images
  – B: 2006 ham images, 8549 spam images
• Image feature sets
  – Aradhye et al., ICDAR 2005
    • Color heterogeneity, color saturation, text area
  – Dredze et al., CEAS 2007
    • Image meta-data, visual features
  – Four other visual features, for comparison (generic)
    • Number of colors (log), number of pixels (log), relative area occupied by the most common color, text area
  – Features used in this work (text)

(1) Data sets are publicly available at: http://prag.diee.unica.it/n3ws1t0/eng/spamRepository
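Three of the four "generic" features listed above are simple enough to sketch in plain Python. This is only an illustration under stated assumptions: the image is a non-empty list of rows of RGB tuples, the natural logarithm is used (the slides do not specify the base), and the function and key names are hypothetical. The fourth feature, text area, needs a text localisation step and is left out.

```python
import math
from collections import Counter

def generic_features(pixels):
    """pixels: non-empty list of rows, each a list of (r, g, b) tuples."""
    flat = [p for row in pixels for p in row]
    counts = Counter(flat)
    n_pixels = len(flat)
    return {
        # Number of distinct colors, log-scaled (natural log assumed).
        "log_num_colors": math.log(len(counts)),
        # Image size in pixels, log-scaled.
        "log_num_pixels": math.log(n_pixels),
        # Fraction of the image covered by the most common color.
        "top_color_area": counts.most_common(1)[0][1] / n_pixels,
    }

WHITE, RED, BLUE = (255, 255, 255), (255, 0, 0), (0, 0, 255)
# Toy 2x4 image: mostly white with two colored pixels.
image = [[WHITE, WHITE, WHITE, RED],
         [WHITE, WHITE, WHITE, BLUE]]

features = generic_features(image)
print(features)
```

For the toy image there are 3 colors and 8 pixels, and white covers 6 of the 8 pixels, so `top_color_area` is 0.75.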

Experiments (cont’d)

• We evaluated the performance of image ham/spam classifiers based on individual feature sets (aradhye, dredze, generic) and their fusion (either at feature or score level) with our features (text).

[Figure: feature-level fusion, classifier C(x1∪x2); score-level fusion, classifier C(x2)]
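The difference between the two fusion schemes can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the nearest-centroid scorer stands in for whatever classifier was actually used, the score-level combination rule is taken to be a plain average, and the toy data and names are hypothetical.

```python
import math

def centroid(rows):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def train(X, y):
    """Nearest-centroid scorer: positive score = closer to spam (label 1)."""
    spam = centroid([x for x, label in zip(X, y) if label == 1])
    ham = centroid([x for x, label in zip(X, y) if label == 0])
    return lambda x: math.dist(x, ham) - math.dist(x, spam)

# Toy training data: x1 = an existing feature set, x2 = text features.
X1 = [[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]]
X2 = [[0.0], [0.1], [1.0], [0.9]]
y = [0, 0, 1, 1]  # 0 = ham, 1 = spam

# Feature-level fusion: one classifier C(x1 ∪ x2) on concatenated vectors.
feature_level = train([a + b for a, b in zip(X1, X2)], y)

# Score-level fusion: separate classifiers C(x1) and C(x2),
# whose scores are then combined (mean assumed here).
c1, c2 = train(X1, y), train(X2, y)

def score_level(x1, x2):
    return (c1(x1) + c2(x2)) / 2

new_x1, new_x2 = [0.85, 0.9], [0.95]
print(feature_level(new_x1 + new_x2) > 0, score_level(new_x1, new_x2) > 0)
```

Feature-level fusion lets a single classifier see interactions between the two feature sets, while score-level fusion keeps the existing classifier untouched and only adds a second score to combine with it.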

Results

A plug-in for SpamAssassin: Image Cerberus

• We implemented a SpamAssassin plug-in based on our approach
  – generic + text fused at feature level
• Publicly available
  – http://prag.diee.unica.it/n3ws1t0/imageCerberus
• We will release source code (C++) soon

We need your feedback!

Some examples

score = 1.06 score = 0.98 score = 0.28

Some examples (cont’d)

score = 0.82 score = 1.00 score = 0.63

• Ham images from the TREC 2007 spam corpus!

Spam or ham?

score = 0.20

score = -1.4

score = 0.27

Thank you!

• See you at the poster session!

• Contacts
  – roli@diee.unica.it
  – fumera@diee.unica.it
  – pillai@diee.unica.it
  – battista.biggio@diee.unica.it
• Web
  – http://prag.diee.unica.it
