Improving Image Spam Filtering Using Image Text Features

CEAS 2008

Battista Biggio, Ignazio Pillai, Giorgio Fumera, Fabio Roli

Pattern Recognition and Applications Group
University of Cagliari, Italy
Department of Electrical and Electronic Engineering

5th Conference on Email and Anti-Spam (CEAS) 2008, Mountain View, California, USA, August 21st - 22nd

Improving Image Spam Filtering Using Image Text Features

21-08-2008, Image Spam Filtering, CEAS 2008

About me

• Pattern Recognition and Applications Group
  – http://prag.diee.unica.it
  – DIEE, University of Cagliari, Italy

• Contact
  – Battista Biggio, Ph.D. student
  – battista.biggio@diee.unica.it

Pattern Recognition and Applications Group

• Research interests
  – Methodological issues
    • Multiple classifier systems
    • Adversarial learning
    • Classification reliability
  – Main applications
    • Intrusion detection in computer networks
    • Multimedia document categorization, spam filtering
    • Biometric authentication (fingerprint, face)
    • Content-based image retrieval

• Faculty members
  – F. Roli (group head)
  – G. Giacinto
  – G. Fumera
  – L. Didaci
  – G.L. Marcialis

• 7 PhD students
• 3 post docs
• 2 consultants

Outline

• Introduction
  – What is image spam?

• Image spam filtering
  – Image spam SoA
  – Our work

• Experiments

• A plug-in for SpamAssassin: Image Cerberus

Image spam

• Since about 2005: image spam
  – Embedding spam messages into images to evade modules based on machine learning approaches (e.g. Bayesian filters)
  – Adding adversarial noise to prevent OCR from reading embedded text (obfuscated spam images)

Image spam: state of the art (SoA)

• Commercial / open source anti-spam filters:
  – OCR + keyword search
  – Image low-level feature analysis

• Research:
  – OCR + text categorization (TC)
    • Fumera et al., JMLR 2006
    • BayesOCR plug-in for SpamAssassin
  – Image classifiers (ham/spam) based on low-level image features (text areas, color distribution, etc.)
    • Wu et al., ICIP 2005
    • Aradhye et al., ICDAR 2005
    • Dredze et al., CEAS 2007

Our past work

• OCR is not effective against obfuscated images
  – Spammers learned from CAPTCHAs / HIPs!

• Our idea: the presence of adversarially obfuscated text can be a spamminess hint (Biggio et al., CEAS 2007)
  – How did we detect the presence of adversarially obfuscated text?
    • Four features based on:
      – Text localisation
      – Perimetric complexity
      – Edge detection

• However, these features did not work as we expected for detecting only adversarially obfuscated text…
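As an illustration of one of the measures named above, here is a minimal Python sketch of perimetric complexity for a binarised text region. The slides do not give the formula; the sketch assumes the commonly used definition, squared perimeter divided by ink area (P²/A), and counts the perimeter as the number of pixel sides that separate ink from background. The function name and the toy images are hypothetical and are not taken from the authors' implementation.

```python
from typing import List


def perimetric_complexity(img: List[List[int]]) -> float:
    """Return P^2 / A for a binary image (1 = ink, 0 = background).

    Assumption: the perimeter P is the number of pixel sides where an ink
    pixel touches background or the image border (4-connectivity).
    """
    rows, cols = len(img), len(img[0])
    area = 0
    perimeter = 0
    for r in range(rows):
        for c in range(cols):
            if not img[r][c]:
                continue
            area += 1
            # Check the four sides of this ink pixel.
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                outside = not (0 <= nr < rows and 0 <= nc < cols)
                if outside or not img[nr][nc]:
                    perimeter += 1
    if area == 0:
        return 0.0  # no ink: define the complexity as zero
    return perimeter ** 2 / area


# A clean 3x3 blob of ink inside a 5x5 image.
clean = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]

# The same blob with one pixel knocked out, as a stand-in for noise.
noisy = [row[:] for row in clean]
noisy[2][2] = 0

clean_value = perimetric_complexity(clean)
noisy_value = perimetric_complexity(noisy)
```

In this toy case the damaged blob has a longer boundary and less ink, so its value is higher than that of the clean blob. That is the kind of text "defect" such a measure reacts to.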

This work

• Our image text defect measures seemed to be able to provide some discriminant information about low-level text characteristics between ham and spam images

• We exploit the proposed image text defect measures as additional features in approaches based on image classification techniques, to improve their discriminant capability

Experiments

• Data sets (1)
  – A: 2006 ham images, 3297 spam images
  – B: 2006 ham images, 8549 spam images

• Image feature sets
  – Aradhye et al., ICDAR 2005
    • Color heterogeneity, color saturation, text area
  – Dredze et al., CEAS 2007
    • Image meta-data, visual features
  – Four other visual features, for comparison (generic)
    • Number of colors (log), number of pixels (log), relative area occupied by the most common color, text area
  – Features used in this work (text)

(1) Data sets are publicly available at: http://prag.diee.unica.it/n3ws1t0/eng/spamRepository
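Three of the four "generic" features listed above are simple enough to sketch in a few lines of Python. This is an illustrative sketch, not the authors' code: the function name, the dictionary keys and the use of the natural logarithm are assumptions (the slides only say "log"). The fourth feature, text area, is left out because it depends on a text localisation step that the slides do not describe.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B)


def generic_features(pixels: List[List[Pixel]]) -> Dict[str, float]:
    """Compute three of the 'generic' visual features for an RGB image.

    Assumption: natural logarithm; the slides do not state the base.
    """
    flat = [p for row in pixels for p in row]
    if not flat:
        raise ValueError("image has no pixels")
    counts = Counter(flat)
    n_pixels = len(flat)
    most_common_count = counts.most_common(1)[0][1]
    return {
        "log_num_colors": math.log(len(counts)),
        "log_num_pixels": math.log(n_pixels),
        "most_common_color_area": most_common_count / n_pixels,
    }


# Toy 2x4 image: six white pixels and two red ones.
WHITE: Pixel = (255, 255, 255)
RED: Pixel = (255, 0, 0)
toy_image = [
    [WHITE, WHITE, RED, WHITE],
    [WHITE, RED, WHITE, WHITE],
]

features = generic_features(toy_image)
```

For a real image, the nested list could be built with any image library that exposes pixel values. The sketch takes plain tuples so that it does not depend on one.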

Experiments (cont’d)

• We evaluated the performance of image ham/spam classifiers based on individual feature sets (aradhye, dredze, generic) and their fusion (either at feature or score level) with our features (text).

[Figure: feature-level fusion, a single classifier C(x1∪x2); score-level fusion, classifiers C(x1) and C(x2) whose scores are combined by C(s)]
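The two fusion schemes can be sketched in Python as follows. This is a toy illustration under stated assumptions. The nearest-centroid scorer stands in for whatever classifier was actually used, which this transcript does not name. All function names and data are hypothetical. For brevity the score-level combiner is trained on scores from the same training set, whereas held-out scores would normally be preferable.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]
Scorer = Callable[[Vector], float]
FusedScorer = Callable[[Vector, Vector], float]


def train_centroid_scorer(X: List[Vector], y: List[int]) -> Scorer:
    """Toy stand-in classifier: score > 0 means closer to the spam centroid."""
    def centroid(label: int) -> List[float]:
        rows = [x for x, t in zip(X, y) if t == label]
        return [sum(col) / len(rows) for col in zip(*rows)]

    ham, spam = centroid(0), centroid(1)

    def dist2(a: Vector, b: Vector) -> float:
        return sum((u - v) ** 2 for u, v in zip(a, b))

    return lambda x: dist2(x, ham) - dist2(x, spam)


def feature_level_fusion(X1, X2, y) -> FusedScorer:
    """C(x1 ∪ x2): one classifier trained on the concatenated features."""
    clf = train_centroid_scorer([list(a) + list(b) for a, b in zip(X1, X2)], y)
    return lambda x1, x2: clf(list(x1) + list(x2))


def score_level_fusion(X1, X2, y) -> FusedScorer:
    """C(s): a combiner trained on the scores of C(x1) and C(x2)."""
    c1 = train_centroid_scorer(X1, y)
    c2 = train_centroid_scorer(X2, y)
    # Simplification: scores come from the training data itself.
    S = [[c1(a), c2(b)] for a, b in zip(X1, X2)]
    combiner = train_centroid_scorer(S, y)
    return lambda x1, x2: combiner([c1(x1), c2(x2)])


# Hypothetical data: x1 = one base feature, x2 = two text features.
X1 = [[0.1], [0.2], [0.9], [0.8]]
X2 = [[0.0, 0.1], [0.1, 0.0], [1.0, 0.9], [0.9, 1.0]]
y = [0, 0, 1, 1]  # 0 = ham, 1 = spam

feat_fused = feature_level_fusion(X1, X2, y)
score_fused = score_level_fusion(X1, X2, y)

spam_like = ([0.85], [0.95, 0.9])
ham_like = ([0.1], [0.05, 0.05])
feat_scores = (feat_fused(*spam_like), feat_fused(*ham_like))
score_scores = (score_fused(*spam_like), score_fused(*ham_like))
```

Feature-level fusion lets one model see interactions between the two feature sets. Score-level fusion keeps the base classifiers independent, so an existing classifier can be reused unchanged.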

Results

A plug-in for SpamAssassin: Image Cerberus

• We implemented a SpamAssassin plug-in based on our approach
  – generic + text fused at feature level

• Publicly available
  – http://prag.diee.unica.it/n3ws1t0/imageCerberus

• We will release source code (C++) soon

We need your feedback!

Some examples

score = 1.06 score = 0.98 score = 0.28

Some examples (cont’d)

score = 0.82 score = 1.00 score = 0.63

Spam or ham?

• Ham images from the TREC 2007 spam corpus!

score = 0.20 score = -1.4 score = 0.27

Thank you!

• See you at the poster session!

• Contacts
  – roli@diee.unica.it
  – fumera@diee.unica.it
  – pillai@diee.unica.it
  – battista.biggio@diee.unica.it

• Web
  – http://prag.diee.unica.it
