TRANSCRIPT
Exploiting Spammer’s Tactics of Obfuscations for Better Corporate Level Spam Filtering
Vipul Sharma, Steve Lewis
Proofpoint, Inc.
Agenda
• Machine Learning 101
• Problem
  – Definition
  – Strategies
• Solution
  – Existing
  – Proposal
• Validation
  – Why our system works better
  – Overall improvement in blocking spam
• Conclusions
How does MLX work?
• Machine learning is the study of making computers learn; the goal is to make computers improve their performance through experience.
[Diagram: the Environment (Problem Space, Uniform Data) supplies a Feature Set and a Class of task; a Learning Algorithm (Logistic Regression) uses Experience E (Weights file) to improve Performance P (Effectiveness).]
Spam Adversarial
• Spam is a special problem of ML
[Diagram: the same learning framework applied to spam — the Environment (Email) supplies Email Messages classed as Spam/Ham; a Learning Algorithm (Logistic Regression) uses Experience E (Weights file) to improve Performance P (Effectiveness). Adversarial tactics such as hashbusting and obfuscation attack this loop.]
Spam Features
Header Information: Date, Subject, MTA, MUA, content types, etc.
Body: Words, phrases, URLs, etc.
Meta: Boolean combinations of other features
Surrounding Technologies: Image features, IP, obfuscations, etc.
Deceiving Content-based Spam Filters using Text Obfuscation
• Come play your favorite ca$in0 games online right now.
• Come play your favorite ca#ino games online right now.
• Come play your favorite caniso games online right now.
• Come play your favorite c$in0 games online right now.
• Come play your favorite cassiino games online right now.
• Come play your favorite ca $ ino games online right now.
Types of Text Obfuscation
• Substitution: Come play your favorite ca$in0 games online right now.
• Deletion: Come play your favorite casno games online right now.
• Addition: Come play your favorite cassiino games online right now.
• Segmentation: Come play your favorite ca s ino games online right now.
• Combination: Come play your favorite ($iino0 games online right now.
• Shuffling: Come play your favorite caniso games online right now.
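To make the taxonomy above concrete, here is a toy sketch that generates one variant of each obfuscation type. It is purely illustrative: the function name, the substitution map, and the choice of a random interior position are my own, not from the deck.

```python
import random

def obfuscate(word, mode, seed=0):
    """Toy generators for the obfuscation types above (illustrative only)."""
    rng = random.Random(seed)
    i = rng.randrange(1, len(word) - 1)   # pick an interior position
    if mode == "substitution":            # casino -> ca$ino
        subs = {"a": "@", "i": "1", "o": "0", "s": "$"}
        return word[:i] + subs.get(word[i], "#") + word[i + 1:]
    if mode == "deletion":                # casino -> casno
        return word[:i] + word[i + 1:]
    if mode == "addition":                # casino -> cassino (duplicate a letter)
        return word[:i] + word[i] + word[i:]
    if mode == "segmentation":            # casino -> ca sino
        return word[:i] + " " + word[i:]
    if mode == "shuffling":               # casino -> caniso (swap two letters)
        j = rng.randrange(1, len(word) - 1)
        chars = list(word)
        chars[i], chars[j] = chars[j], chars[i]
        return "".join(chars)
    return word
```

A "combination" variant would simply chain two or more of these modes.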
How to Counter V!@gr@@?
• Deobfuscation: "Come play your favorite c$in0 games online right now" is rewritten as "Come play your favorite casino games online right now"
• Detection: "Come play your favorite c$in0 games online right now" is left as-is, with c$in0 flagged as obfuscated
Advantages / Disadvantages
• Deobfuscation (Lee et al., CEAS 2005)
  – HMM
    • Accurate (97%)
    • Very slow (240 letters/sec on English letters); bad for corporate level spam filters
• Identification
  – Regular expressions
    • Inaccurate
    • Expensive to maintain
  – Edit distance (Oliver et al., Spam Conference 2005)
    • Less accurate (75%); bad for corporate level spam filters
    • Cheap / faster
What is a Good Solution?
• Accurate (~95%)
• Fast (near real time)
• Computationally inexpensive (minimal overhead)
• Easy to maintain
Obfuscation Detection Model
• A machine learning based detection system
• Benchmarked several supervised multivariate classification techniques
• Uses domain knowledge of ~800 hand-collected frequently obfuscated words (FOW)
• Auxiliary classifier that can be easily integrated with the base classifier
• Fast, accurate and easy to maintain
Frequently Obfuscated Words (FOW)
• Come play your favorite ca$$iino games online right now
• we offer real, genuine degrees, that include bachelo-rs's, ma|ster's, mba, and do,ctorate degrees. they are fully verifiable
• Buy cheapest Vi@gr@, Ci@#lis, mbian on!!ine
• Re|^ian@nce your m0rtg@g3 today. Click here
WHAT WOULD SPAMMERS WANT TO HIDE?
Environment
• Problem Space: to detect variations of FOW
• Dataset
  – 67,907 hand-collected obfuscated words
  – 250,000 valid words, parsed from ham messages
  – 12,000 commonly used valid words as dictionary
  – 727 frequently obfuscated words (FOWs)
• Class of Task: Obfuscated | Valid
Feature Set
• A: Count of non-alphanumeric characters (~!@#$ ..)
• B: Count of numeric characters not on the word boundary (01234 ..)
  – DEC009988
  – M0r1gage
• C: Length of the word
• D: Dictionary presence of the word {0,1}
• E: Similarity between FOW and the word (0-1)
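A minimal sketch of how features A–D might be computed. The exact boundary rule for B is my interpretation of the two examples (trailing digits in DEC009988 don't count; interior digits in M0r1gage do), and E, the similarity score, is taken as an input here since it is defined on a later slide.

```python
def extract_features(word, dictionary, similarity_score):
    """Return the 5-element feature vector [A, B, C, D, E] for a word."""
    # A: count of non-alphanumeric characters (~!@#$ ..)
    a = sum(1 for ch in word if not ch.isalnum())
    # B: count of digits not on the word boundary (interpretation:
    #    leading/trailing digit runs don't count; interior digits do)
    b = sum(ch.isdigit() for ch in word.strip("0123456789"))
    # C: length of the word
    c = len(word)
    # D: dictionary presence {0, 1}
    d = 1 if word.lower() in dictionary else 0
    # E: similarity between the closest FOW and the word, computed elsewhere
    return [a, b, c, d, similarity_score]
```

For example, M0r1gage yields A=0 (no punctuation), B=2 (two interior digits), C=8 and D=0 (not a dictionary word).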
Similarity Metric
• A common technique of obfuscation is shuffling, e.g. mtograge
• Levenshtein distance, Jaro-Winkler metric:
  – L(mortgage, mortal) = 4
  – L(mortgage, mrogtgae) = 6
• These metrics are sensitive toward ordered variations
• We need a metric that neglects the order of letters
Similarity Metric
• L is the list of FOW; l_i ∈ L (Viagra, Mortgage ..)
• b_i = length(l_i)
• Let m be any test word, e.g. Vi!@gra
  – Filtered word m' = Vigra
• b_m' = length(m')
• b_im = common letters(l_i, m')
• Sim = max_i( b_im / (b_i + b_m' - b_im) )
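The metric above can be sketched directly in Python (function names are mine, not from the deck). Common letters are counted with multiplicity via a multiset intersection, which is what makes the metric insensitive to letter order.

```python
from collections import Counter

def filter_word(word):
    """Keep only alphabetic characters, lowercased (Vi!@gra -> vigra)."""
    return "".join(ch for ch in word.lower() if ch.isalpha())

def pair_similarity(fow, word):
    """b_im / (b_i + b_m' - b_im), with b_im the common letters (with multiplicity)."""
    f, w = fow.lower(), filter_word(word)
    common = sum((Counter(f) & Counter(w)).values())
    return common / (len(f) + len(w) - common)

def similarity(word, fow_list):
    """Feature E: maximum similarity of the word to any FOW in the list."""
    return max(pair_similarity(f, word) for f in fow_list)
```

Because only letter counts matter, the shuffled mrogtgae scores a perfect 1.0 against mortgage even though its edit distance is large.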
Similarity Metric Example
• L = {Viagra, mortgage}; m = mrogtgae
• b_viagra = 6; b_mortgage = 8; b_mrogtgae = 8
• b_{viagra, mrogtgae} = 3; b_{mortgage, mrogtgae} = 8
• S_{viagra, mrogtgae} = 3/(6+8-3) = 0.27
• S_{mortgage, mrogtgae} = 8/(8+8-8) = 1
• S = max(1, 0.27) = 1
• S_{mortgage, mortal} = 5/(8+6-5) = 0.55
Preprocessing
• Discretization (Fayyad & Irani's MDL method)
  – Converted numeric features to nominal features
  – Increases classification accuracy for certain classifiers
  – Certain classifiers work only with nominal features
• Cut-off bins are calculated such that the entropy of the model is minimized
Entropy/Information Gain
Entropy is a measure of randomness ~ predictiveness
Low entropy: more predictive. High entropy: less predictive.
Images taken from eb.cecs.pdx.edu/~york/cs510w04/infogain09.pdf
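The entropy being minimized here is the usual Shannon entropy of the class distribution inside a bin; a minimal sketch (the function name is mine):

```python
import math

def entropy(class_counts):
    """Shannon entropy (bits) of a class distribution, e.g. [obfuscated, true].
    Low entropy means one class dominates the bin (more predictive);
    high entropy means an even mix (less predictive)."""
    total = sum(class_counts)
    return -sum((n / total) * math.log2(n / total) for n in class_counts if n)
```

A bin holding 5 obfuscated and 5 true words has entropy 1.0 bit (useless); a pure bin has entropy 0 (maximally predictive), which is what the MDL cut points aim for.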
Discretization
[Figure: "Discretized Similarity Index (A)" — histogram of instance counts (y-axis 0–36,000) of obfuscated words vs. true words over the nominal cut points produced by discretization, with bins ranging from roughly (0–0.226496] up to (0.919872–0.925824].]
Feature Generation using Constructor Functions
• General constructor functions generated 13 features
  – Set of operators used: +, >, <, =, !=, &, |
  – Cut-points generated via discretization used as ranges
  – Uses beam search
  – Heuristic used: maximize information gain
• E > 0.710084 and A > 1 (V!a-gra)
  – If similarity index > 0.710084 and the number of non-alphanumeric characters > 1: a strong representation of the obfuscated class
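The constructed feature on this slide reads directly as a boolean meta-feature over the base features; a sketch (the cut-point 0.710084 is the slide's own example, the function name is mine):

```python
def high_obfuscation_feature(e_similarity, a_nonalnum):
    """Constructor-function meta-feature: E > 0.710084 AND A > 1.
    Fires for words very similar to a FOW that also carry more than one
    non-alphanumeric character, e.g. V!a-gra."""
    return int(e_similarity > 0.710084 and a_nonalnum > 1)
```

Beam search over such operator combinations, scored by information gain, is what produced the 13 generated features.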
Learning Model
• Various multivariate classification techniques were compared using Weka
• 10-fold cross-validation was used for accuracy estimation
• Accuracy was compared on both feature sets
  – 5 basic features before preprocessing
  – 13 generated features after preprocessing
Obfuscation Detection Accuracy Comparison
[Figure: "Accuracy Comparison" — average 10-fold cross-validation accuracy (y-axis 70–100%) for Naïve Bayes, C4.5, Decision Table, Random Forest, Logistic Regression and Bayesian Networks, using the 13 generated features vs. the 5 basic features. The recoverable values are 81.32, 82.3, 85.56, 85.68, 90.17, 94.16, 94.19, 94.19 and 96.84; the per-classifier mapping is not recoverable from the extracted text, but the 13-feature results are uniformly higher, peaking at 96.84.]
Detection
• Training produces weights for all 13 features
• Any given word is converted into a feature vector
• A score for each word is calculated using the weights and the logistic function
• If score > 0.5: obfuscated
• If score < 0.5: true word
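The scoring step above amounts to applying the logistic (sigmoid) function to the weighted feature vector; a sketch with placeholder weights (in the actual system the 13 weights come from training):

```python
import math

def logistic_score(features, weights, bias=0.0):
    """Sigmoid of the weighted feature sum: a score in (0, 1)."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def classify_word(features, weights, bias=0.0):
    """Score > 0.5 -> 'obfuscated', otherwise 'true'."""
    return "obfuscated" if logistic_score(features, weights, bias) > 0.5 else "true"
```
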
Integration with Base Classifier
• Used to generate features for the base classifier
• The weight of the spam-filter feature reflects the confidence of obfuscation (the score of the term)
• Logistic regression scores each term between 0 and 1
  – Score(term) > 0.5: obfuscated
  – Score(term) > 0.9: HIGHOBFS
  – 0.7 < Score(term) < 0.9: LOWOBFS
• The weights of HIGHOBFS and LOWOBFS are determined during base classifier training
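The thresholding described above maps the detector's per-term score onto spam-filter feature tokens; a direct sketch of the slide's rules (returning None below 0.7, i.e. emitting no feature there, is my assumption):

```python
def obfuscation_token(score):
    """Map a term's obfuscation score to a base-classifier feature token.
    Thresholds follow the slide: > 0.9 -> HIGHOBFS, 0.7-0.9 -> LOWOBFS."""
    if score > 0.9:
        return "HIGHOBFS"
    if score > 0.7:
        return "LOWOBFS"
    return None  # assumption: no obfuscation feature emitted below 0.7
```
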
Integration
[Diagram: spam filter pipeline — a message's text and image content enter the feature extractor; the image analyzer and the text obfuscation detector feed an image feature extractor and an obfuscation feature extractor, whose HIGHOBFS/LOWOBFS outputs join the base classifier's features to produce the Spam/Ham decision.]
Overall Spam Detection Accuracy
• Test corpora
  – 400,000 spam messages
  – 112,000 ham messages
• Accuracy improvements
  – FNs decreased by 50%
  – A negligible increase in FPs (~0%)
  – Overall accuracy: average increase of 0.3%
Conclusions
• Obfuscation can be detected with high accuracy
  – Concentrate on FOW
  – Use preprocessing techniques for feature generation
• Very low overhead to the spam engine
• Logistic regression achieved the highest detection accuracy with the lowest false positives
• The similarity metric should not be weighted around ordered similarity
• We noticed a significant improvement in spam detection accuracy with almost no false positives
Discussions
• Biased towards the FOW list
• Works for all languages
• The FOW list does not contain words of length 4 or less
• The FP rate can be decreased by adding the errors to the dictionary
• An interesting method of using supervised classification techniques for feature generation
Thank You for Your Time! Q&A
For information about Proofpoint, contact us at:
www.proofpoint.com
For a FREE 45-day trial of Proofpoint, visit:www.proofpoint.com/vb2006