Yahoo! Mail antispam - Bay Area Hadoop User Group

Post on 20-Jan-2015

Page 1: Yahoo! Mail antispam - Bay area Hadoop user group

YOKAI VERSUS THE ELEPHANT

HADOOP AND THE FIGHT AGAINST SHAPE-SHIFTING SPAM

Vishwanath Ramarao & Mark Risher

Yahoo! Mail

Page 2: Yahoo! Mail antispam - Bay area Hadoop user group

© SHMorgan - www.obakemono.com

Page 3: Yahoo! Mail antispam - Bay area Hadoop user group

AGENDA

Shape-shifting spam

Antispam Origins

Hadoop Algorithms

Applications to Security

Resources for Implementers

Page 4: Yahoo! Mail antispam - Bay area Hadoop user group
Page 5: Yahoo! Mail antispam - Bay area Hadoop user group

Page 6: Yahoo! Mail antispam - Bay area Hadoop user group

http:/<!--gmail.com-->/f915fde2cf53df18<!--uc22wddprm-->.li<!--cf997b28e-->gh<!--PdNKLr-->tt<!---kxnd2itipuvd.yahoo.com-->o<!--ju1j8V-->p<!--vrgxetdcnubslgacvc-->b<!--OsLaWIv-->o<[email protected]>dy<!--in7oouvxfrg7ax-->.com]*!}v}]along especially consecutive important dmvfu

<!--gmail.com-->

Page 7: Yahoo! Mail antispam - Bay area Hadoop user group

Page 8: Yahoo! Mail antispam - Bay area Hadoop user group

1,300,925,111,156,286,160,896

Viagorea ViagDrHa V l a g r a  VyAGRA via---gra viagrga 

via-gra 'V 1 @ G' Ra Viagzra viagdra via_gra ViaZUgra

Viargvra ViagrYa Vii-agra ViagWra vi(@)gr@ Viagvra

V-I-A-G-R-A Vi-ag.ra vigra Vkiagra via.gra v-ii-a=g-ra

V l A G R A VIA7GRA V/i/a/g/r/a VIxAGRA  Viaggra vi@gr|@|

ViaTagra ViaVErga Viagr(a Viagr^a Viágrá Viagara

Viag@ra Viag&ra vi@g*r@ V-i.a-g*r-a V1@grA ViaaPrga

Vi$agra ViaJ1gra Viag$ra via---gra Vi.ag.ra  Viaoygra

Vi/agra Viag%ra Viarga V|i|a|g|r|a Viag)ra vi@|g|r@

Viag&ra vi**agra vi@gr*@ vi-@gr@ V iagr a V&iagra

(http://bit.ly/cpOyLi)

Page 9: Yahoo! Mail antispam - Bay area Hadoop user group
Page 10: Yahoo! Mail antispam - Bay area Hadoop user group

Page 11: Yahoo! Mail antispam - Bay area Hadoop user group

TYPICAL ATTACK/RESPONSE PROFILE

[Chart: daily connections from IP 64.21.48.67, 1/17/06 through 1/30/06; y-axis from 0 to 35,000 connections; annotation: rule change (1/23 @ 01:15)]

Page 12: Yahoo! Mail antispam - Bay area Hadoop user group

MORE YOKAI - TARGETED ATTACKS

<style>mechanic CC0066 getimage 3A00 lectroniques repertoires spiel proscribing ammonoid 10110 radiobutton telefoons Jermaine ie saporito roshan 3026 janata trennung palillos toughest n capitole calzado 20200 Omnimedia collective saudade dizaines 205px hardener elongating Invasionofyourprivacy Personnal ftsbedingungen Montaner prozac Serpell fcard bvh capacitate 12502 courtship kiranji utroligt transducer tyee Delhaize clueless toffee nnio Zoa pochino sterns 622 Verordnung carbons waterresistant assessing footerText perrine url0 potatoes 999933 Rightmove positively thmb closer secures Amarillo suffer 314992 32599 8849 GJ initialling cockleshell JTA Justi aguardo jibes Chubb inflammatory iteration gran fald asseoir considerations 692px treasured Allotransplantation twoyears appx Bowers doorgeven 1487 bigpicture repeatedly Popp MPEG4 webbsida liefde Voeding Elena Kernighan sternway laggardly Zwischendurch commons equis sewing f17 apadrina sarei niques lugo quotedbl bayr 3500 CI addressee optatively gazzetta 616px mingus 23238 PhotoLink desuetude tofu keychains molding redevelopment stucco deltage astrology2 thumbscrews probablemente 700g rns fuseaction repris taires restraint manchettes trendlines effectue despatch Minsky estadual doses danbrown Muenster jind7n7 smashes gourmandes ashanti sentants rows kyk coated Incontournables coinciden jspa stalker CDS contienen expletives s8 eof replenishing puyallup prato sondra validar orientale sonnets steamer Niwango acrocentric dozens elr tempting poing jails ingredi Sep3 misdirection vested tecnici conciertos dear martini 3D35 MBR DNAME 2650 violation Egyptiin NCR sposoriss hl 12450 connectors circumcision transform CFA employeur 153 comunicazioni miner 19905 citronella Plissier Hellmich Randall Caradonna springa registrada haupt Entran 3060 Rochin capacitor sotol 3413 smirk interdite ServicePoint capabilities bouncefee Linkov 3Dg auntie OSP Caecilia Platzierung wrangler pisos banlieue Daniella enderle israel 
professionnelles susto 39800 Espana plena radian antic!...........................200KB……….

</style>

<center><a href="http://ivywhere.info/52210088504303.hrmj.1/285/1000/1006/1000/1237976a102c0176c7b3fb3164f83590.html">Please Click Here if You Can't See Images<br><img src="http://ivywhere.info/images/usacpm1.jpg" border="0"></a><br><a href="http://ivywhere.info/52210088504303.hrmj.1/40106/1000/1000/1000/a.html"><img src="http://ivywhere.info/images/usacpm2.jpg" border="0"></a><br><a href="http://ivywhere.info/gp.html"><img src="http://ivywhere.info/images/please2.jpg" border="0"></a><br>

[400kb…]<center><a href="http://corfair.info/52210088504303.hrmj.1/129286/1000/1006/1000/d1c7b1fa06980b08bf9b3a9c14844623.html">Please Click Here if You Can't See Images<br><img src="http://corfair.info/images/ivblg1.jpg" border="0"></a><br><a href="http://corfair.info/52210088504303.hrmj.1/40126/1000/1000/1000/a.html"><img src="http://corfair.info/images/ivblg2.jpg" border="0"></a><br><a href="http://corfair.info/gp.html"><img src="http://corfair.info/images/please2.jpg" border="0"></a><br>

Page 13: Yahoo! Mail antispam - Bay area Hadoop user group
Page 14: Yahoo! Mail antispam - Bay area Hadoop user group

Page 15: Yahoo! Mail antispam - Bay area Hadoop user group

WHY IS THE ANTISPAM PROBLEM HARD?

• Scale of the problem: 25B connections, 5B deliveries, 450M mailboxes

• User feedback is often late, noisy and not always actionable

• Large, diverse stream of legitimate traffic that looks like spam

• Slow adoption of authentication technologies like DKIM and SPF

• Spammers are clever; they target and specialize their attacks

• Rapidly changing spam campaigns with a large, bot-controlled IP base; large variations even within a single campaign

• A significant percentage of spam comes from large ESPs like Hotmail, Google and Yahoo

Page 16: Yahoo! Mail antispam - Bay area Hadoop user group

GENERATION 1: MANUAL MANAGEMENT LAYER

• Heuristics, blocks, blacklists
– Provide attack mitigation and operational flexibility; highly explainable
– Not durable; expensive to keep pace with fast-morphing spam

• Ad hoc queries
– Proprietary implementations, not very scalable, steep learning curve
– Reactive and usually late

Page 17: Yahoo! Mail antispam - Bay area Hadoop user group

GENERATION 2: MACHINE MANAGEMENT LAYER

• Online reputation models
– Simple, mostly scoring/counter/ratio-based models
– Highly scalable due to the absence of any state/memory
– Generalize too broadly, lack expressive power

• Batch-trained reputation models
– Typically digested, memory-based hashing or machine learning models
– Difficult to implement and, due to the need for labeled examples, scale only moderately well
– Slow to update and learn, lack explainability, limited operational control
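
As a rough illustration of the "scoring/counter/ratio" style of online model described on this slide, the sketch below keeps only two counters per sender (no per-message history) and scores the sender by its spam ratio. The class name, the add-one smoothing and the 0.9 blocking threshold are assumptions made for the example, not details from the talk.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a counter/ratio reputation model; not Yahoo!'s implementation.
class RatioReputationSketch {
    // Per-sender counters: index 0 = total messages, index 1 = messages voted spam.
    private final Map<String, long[]> counters = new HashMap<>();

    // Record one delivered message and whether it drew a spam vote.
    void observe(String senderIp, boolean votedSpam) {
        long[] c = counters.computeIfAbsent(senderIp, k -> new long[2]);
        c[0]++;
        if (votedSpam) {
            c[1]++;
        }
    }

    // Smoothed spam ratio; unseen senders score 0.5 (assumed add-one smoothing).
    double spamRatio(String senderIp) {
        long[] c = counters.getOrDefault(senderIp, new long[2]);
        return (c[1] + 1.0) / (c[0] + 2.0);
    }

    // Assumed threshold: block when the smoothed ratio exceeds 0.9.
    boolean shouldBlock(String senderIp) {
        return spamRatio(senderIp) > 0.9;
    }

    public static void main(String[] args) {
        RatioReputationSketch model = new RatioReputationSketch();
        for (int i = 0; i < 98; i++) {
            model.observe("192.0.2.1", true);    // documentation-range IP, mostly spam
        }
        model.observe("192.0.2.1", false);
        model.observe("198.51.100.7", false);    // a single clean message
        System.out.println("192.0.2.1 blocked: " + model.shouldBlock("192.0.2.1"));
        System.out.println("198.51.100.7 blocked: " + model.shouldBlock("198.51.100.7"));
    }
}
```

The model updates in constant time per message, which is why this family scales so well; it also shows the weakness the slide names, since a single ratio per sender cannot express anything about campaigns or co-occurring features.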

Page 18: Yahoo! Mail antispam - Bay area Hadoop user group
Page 19: Yahoo! Mail antispam - Bay area Hadoop user group

DISTRIBUTED COMPUTING PARADIGM

Map:Reduce + distributed storage:

• Simplicity of online, stateless models

• Expressiveness of offline analysis

• Ease of management

Page 20: Yahoo! Mail antispam - Bay area Hadoop user group

THE MAP:REDUCE PARADIGM

• Input data format is application-specific, specified by the user

• Output is a set of <key,value> pairs

• User expresses algorithm using two functions
– Map is applied on the input data and produces a list of intermediate <key,value> pairs
– Reduce is applied to all intermediate pairs with the same key. It typically performs some kind of merging operation and produces zero or more output pairs

• Finally, output pairs are sorted by their key value
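
The flow described above (map, group by key, reduce, sorted output) can be sketched in plain Java without any Hadoop classes. This is an in-memory illustration only: the class and method names are hypothetical, word counting is used as the task, and a real job would distribute each phase across machines.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory illustration of the map -> group-by-key -> reduce flow; no Hadoop classes are used.
class MapReduceSketch {
    // Map: turn one input line into a list of intermediate <word, 1> pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(new SimpleEntry<>(token, 1));
            }
        }
        return pairs;
    }

    // Reduce: merge all values that share a key into a single output value.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Driver: apply map to every record, group by key, then reduce each group.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>(); // TreeMap keeps keys sorted
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        Map<String, Integer> output = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            output.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return output;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("Hello World Bye World", "Hello Hadoop Goodbye Hadoop");
        System.out.println(run(input));
    }
}
```

Running `main` prints `{Bye=1, Goodbye=1, Hadoop=2, Hello=2, World=2}`. The `TreeMap` stands in for the shuffle/sort step that delivers all values for one key to the same reducer.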

Page 21: Yahoo! Mail antispam - Bay area Hadoop user group

THE MAP:REDUCE PARADIGM

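[Diagram: three Mappers emit the intermediate pairs <k1,v1>, <k2,v2> and <k1,v3>; the pairs are grouped by key into <k1,{v1,v3}> and <k2,v2>; the Reducer merges a group into an output pair such as <k1,W1>]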

Page 22: Yahoo! Mail antispam - Bay area Hadoop user group

A SIMPLE MAP:REDUCE EXAMPLE

$ bin/hadoop dfs -cat /usr/joe/wordcount/input/file01

Hello World Bye World

$ bin/hadoop dfs -cat /usr/joe/wordcount/input/file02

Hello Hadoop Goodbye Hadoop

// Split up input files (MAP), iterate over chunks, reassemble results (REDUCE)

$ bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount /usr/joe/wordcount/input /usr/joe/wordcount/output

$ bin/hadoop dfs -cat /usr/joe/wordcount/output/part-00000

Bye 1

Goodbye 1

Hadoop 2

Hello 2

World 2

Page 23: Yahoo! Mail antispam - Bay area Hadoop user group

A SIMPLE MAP:REDUCE EXAMPLE (bit.ly/bdyi0l)

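// Excerpt from the Map class of the WordCount example (old org.apache.hadoop.mapred API).
// Needs java.io.IOException, java.util.StringTokenizer, org.apache.hadoop.io.* and org.apache.hadoop.mapred.*.
// The two fields below are not on the slide; they are assumed from the standard WordCount example.
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();

// Emit <word, 1> for every whitespace-separated token of the input line.
public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
    }
}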

Page 24: Yahoo! Mail antispam - Bay area Hadoop user group

A SIMPLE MAP:REDUCE EXAMPLE (bit.ly/bdyi0l)

// Sum the counts emitted for each word (old org.apache.hadoop.mapred API).
public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}

Page 25: Yahoo! Mail antispam - Bay area Hadoop user group

Applications & Outcomes

Page 26: Yahoo! Mail antispam - Bay area Hadoop user group

LET'S REVIEW OUR DESIGN GOALS AGAIN

• Classifiers are notorious for lack of explainability
– Engineers and analysts need to know what the classifier is missing
– Engineers and analysts need to know about emerging threats
– Analysts need “canned” reports along interesting dimensions
– Machines need smart feature engineering

• Develop a scalable system to provide deep insight into spammer campaigns
– Double up as a platform for standard reporting
– Also double up as a platform for ad hoc analysis and data probing
– Signal amplification and smart feature extraction platform

Page 27: Yahoo! Mail antispam - Bay area Hadoop user group

OUR ANTISPAM ANALYTIC PLATFORM

• Hadoop: Implements map reduce, written in Java but supports many other languages including Perl and C++ using the streaming interface

• Feature engineering with small simple Perl programs for data extraction and transformation

• SQL-like “Pig” programming language for data analysis and management

• Mahout: data mining libraries that provide shrink-wrapped, scalable, sophisticated algorithms

• Other proprietary algorithms and frameworks for specialized tasks

Page 28: Yahoo! Mail antispam - Bay area Hadoop user group

VARIOUS ASPECTS OF A GRID DRIVEN SOLUTION

• Standard reporting

• Ad hoc querying

• Campaign discovery from spam feedback using frequent item set mining

• “Gaming” detection in notspam feedback using connected components

Page 29: Yahoo! Mail antispam - Bay area Hadoop user group

TOP SPAMMY DOMAINS REPORT FOR 01/15/2010

key:noreply.amateurmatch.com|value:1164
key:goodmere.info|value:896
key:marketing.meredith.com|value:1078
key:verizon.net|value:822
key:reply.mb00.net|value:980
key:insideapple.apple.com|value:1094
key:facebookappmail.com|value:882
key:mydailymoment.com|value:849
key:thetwilightsaga.com|value:4671
key:adknowledgemailer6.com|value:859
key:freedollarspro.info|value:1164
key:smartreachmedia.com|value:1074
key:yahoo.es|value:877
key:ecomasher.com|value:1197
key:leasetrade-statusupdates.com|value:951

Page 30: Yahoo! Mail antispam - Bay area Hadoop user group

AD HOC QUERIES FOR ANTISPAM RESEARCH

• Identify domains that had few spam votes in the previous time window but have a high number of spam votes today

• All IPs in the last hour that sent a particular URL pattern…or that sent any unknown URL >500 times

• Which domains/IPs suddenly increased their sending volume after a positive reputation change

• Which FROM addresses exhibit low message size entropy

• All messages that had nothing but a URL and the domain of the URL had low page rank
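
To make the "low message size entropy" query above concrete, here is a minimal sketch that computes the Shannon entropy, in bits, of the message sizes seen from one FROM address. The class name and sample sizes are hypothetical, and counting exact byte sizes (rather than bucketing them) is an assumption. A sender whose messages are all the same size scores near 0, which is the pattern such a query would look for.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: Shannon entropy (in bits) of the message sizes seen from one FROM address.
class SizeEntropySketch {
    static double entropyBits(int[] messageSizes) {
        if (messageSizes.length == 0) {
            return 0.0;
        }
        // Count how often each exact size occurs (bucketing sizes would be a reasonable variation).
        Map<Integer, Integer> counts = new HashMap<>();
        for (int size : messageSizes) {
            counts.merge(size, 1, Integer::sum);
        }
        // H = -sum(p * log2(p)) over the observed size frequencies.
        double entropy = 0.0;
        for (int count : counts.values()) {
            double p = (double) count / messageSizes.length;
            entropy -= p * (Math.log(p) / Math.log(2));
        }
        return entropy;
    }

    public static void main(String[] args) {
        int[] templated = {2048, 2048, 2048, 2048}; // identical sizes: no uncertainty
        int[] varied = {512, 1300, 2048, 9000};     // four distinct sizes: about 2 bits
        System.out.println("templated sender: " + entropyBits(templated));
        System.out.println("varied sender:    " + entropyBits(varied));
    }
}
```

On a grid, the same calculation would run per FROM address after grouping messages by sender.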

Page 31: Yahoo! Mail antispam - Bay area Hadoop user group

AD HOC QUERIES - ANATOMY OF A PIG QUERY

--- This includes some basic string functions, including splitting a string on the '@' character

register /homes/jpujara/pig_scripts/string.jar;

define splitEmail string.Tokenize('2','@');

--- Load up some data - incoming messages at a date and time, and our trusted user database

MESSAGES = load '/projects/antispam/mta_feature_logs/$date*/*/*-$time*' using com.yahoo.ymail.pigfunctions.AsStorage('__record_key__,firstrcpt,mailfrom') as (mid:chararray,to:chararray,from:chararray);

USERS = load '/projects/antispam/TrustedUser.bz2' using com.yahoo.ymail.pigfunctions.AsStorage('user,t') as (user:chararray,trusted:int);

--- Split the e-mail addresses into user+domain and generate the appropriate user-id for yahoo users and partners

EXPLODED_MESSAGES = FOREACH MESSAGES GENERATE to,FLATTEN(splitEmail(to)) as (user,udomain),FLATTEN(splitEmail(from)) as (sender,sdomain);

YAHOO_MESSAGES = FOREACH EXPLODED_MESSAGES GENERATE (udomain MATCHES '.*yahoo.*' ? user : to ) as yuser,sdomain;

--- Combine the message and sender domains with the trusted user data and select only trusted messages

YAHOO_MESSAGES_TRUST = JOIN YAHOO_MESSAGES by yuser, USERS by user;

TRUSTED_MESSAGES = FILTER YAHOO_MESSAGES_TRUST by trusted > 0;

--- Group by domain, and generate a count, order by descending count

DOMAIN_GROUPS = GROUP TRUSTED_MESSAGES by sdomain;

DOMAIN_GROUPS_COUNT = FOREACH DOMAIN_GROUPS GENERATE group,COUNT(TRUSTED_MESSAGES) as count;

DOMAIN_GROUPS_ORDER = ORDER DOMAIN_GROUPS_COUNT by count DESC;

--- Output the results

STORE DOMAIN_GROUPS_ORDER into '$targetdir/topDomains';

Page 32: Yahoo! Mail antispam - Bay area Hadoop user group

CAMPAIGN DISCOVERY IN SPAM FEEDBACK

• Frequent Itemset Mining
– Classical method
– Research interesting relationships between variables in a large database
– Primarily applied for market basket analysis

• Many good implementations
– APRIORI
    • Easy to implement
    • Parallelizes moderately well but bottlenecks for extremely large data sets
    • Not very efficient with the number of scans
– ECLAT
    • Parallelizes easily
    • Amenable to a good grid implementation
    • Fewer scans of the dataset
– Parallel FP GROWTH
    • Designed explicitly for systems like Hadoop
    • Implemented in Mahout 0.2

Page 33: Yahoo! Mail antispam - Bay area Hadoop user group

FREQUENT ITEM SET – EXAMPLE DATASET

Item sets database - D

I1, I2, I5

I2, I4

I2, I4

I1, I2, I4

I1, I3

I2, I3

I1, I3

I1, I2, I3, I5

I1, I2, I3
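
To show what mining this example database involves, the sketch below runs a small level-wise (Apriori-style) search over the nine item sets listed above. It is a single-machine teaching sketch, not the Parallel FP-Growth implementation the talk mentions; the class name, helper names and the minimum support of 2 are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Minimal level-wise (Apriori-style) frequent itemset miner; a teaching sketch only.
class AprioriSketch {
    static Set<String> items(String... names) {
        return new TreeSet<>(Arrays.asList(names));
    }

    // Support = number of transactions that contain every item of the candidate.
    static int support(List<Set<String>> db, Set<String> candidate) {
        int count = 0;
        for (Set<String> transaction : db) {
            if (transaction.containsAll(candidate)) {
                count++;
            }
        }
        return count;
    }

    // Returns every itemset whose support is at least minSupport, mapped to its support.
    static Map<Set<String>, Integer> mine(List<Set<String>> db, int minSupport) {
        Map<Set<String>, Integer> frequent = new LinkedHashMap<>();
        // Level 1 candidates: every distinct single item.
        Set<Set<String>> candidates = new LinkedHashSet<>();
        for (Set<String> transaction : db) {
            for (String item : transaction) {
                candidates.add(items(item));
            }
        }
        while (!candidates.isEmpty()) {
            // One scan of the database per level: keep candidates that meet the threshold.
            List<Set<String>> level = new ArrayList<>();
            for (Set<String> candidate : candidates) {
                int s = support(db, candidate);
                if (s >= minSupport) {
                    frequent.put(candidate, s);
                    level.add(candidate);
                }
            }
            // Join step: union pairs of frequent k-itemsets that differ in exactly one item.
            candidates = new LinkedHashSet<>();
            for (int i = 0; i < level.size(); i++) {
                for (int j = i + 1; j < level.size(); j++) {
                    Set<String> union = new TreeSet<>(level.get(i));
                    union.addAll(level.get(j));
                    if (union.size() == level.get(i).size() + 1) {
                        candidates.add(union);
                    }
                }
            }
        }
        return frequent;
    }

    public static void main(String[] args) {
        List<Set<String>> db = Arrays.asList(
            items("I1", "I2", "I5"), items("I2", "I4"), items("I2", "I4"),
            items("I1", "I2", "I4"), items("I1", "I3"), items("I2", "I3"),
            items("I1", "I3"), items("I1", "I2", "I3", "I5"), items("I1", "I2", "I3"));
        for (Map.Entry<Set<String>, Integer> e : mine(db, 2).entrySet()) {
            System.out.println(e.getKey() + " support=" + e.getValue());
        }
    }
}
```

With the assumed threshold of 2, this reports 13 frequent itemsets on database D, the largest being {I1, I2, I3} and {I1, I2, I5}, each with support 2. The repeated full scan in `support` is the cost the APRIORI bullet refers to.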

Page 34: Yahoo! Mail antispam - Bay area Hadoop user group

FREQUENT ITEMSET MINING

Slide courtesy: dortmund.de

Page 35: Yahoo! Mail antispam - Bay area Hadoop user group

FREQUENT ITEMSET MINING ON ONE DAY’S SPAM REPORTS

9 2595 (IPTYPE:none,FROMUSER:sales,SUBJ:It's Important You Know,FROMDOM:dappercom.info,URL:dappercom.info,ip_D:66.206.14.77,)

9 2457 (IPTYPE:none,FROMUSER:sales,SUBJ:Save On Costly Repairs,FROMDOM:aftermoon.info,URL:aftermoon.info,ip_D:66.206.14.78,)

9 2447 (IPTYPE:none,FROMUSER:sales,SUBJ:Car-Dealers-Compete-On-New-Vehicles,FROMDOM:sherge.info,URL:sherge.info,ip_D:66.206.25.227,)

9 2432 (IPTYPE:none,FROMUSER:sales,SUBJ:January 18th: CreditReport Update,FROMDOM:zaninte.info,URL:zaninte.info,ip_D:66.206.25.227,)

9 2376 (IPTYPE:none,FROMUSER:health,SUBJ:Finally. Coverage for the whole family,FROMDOM:fiatchimera.com,URL:articulatedispirit.com,ip_D:216.218.201.149,)

9 2184 (IPTYPE:none,FROMUSER:health,SUBJ:Finally. Coverage for the whole family,FROMDOM:fiatchimera.com,URL:stratagemnepheligenous.com,ip_D:216.218.201.149,)

9 1990 (IPTYPE:none,FROMUSER:sales,SUBJ:Closeout 2008-2009-2010 New Cars,FROMDOM:sastlg.info,URL:sastlg.info,ip_D:66.206.25.227,)

9 1899 (IPTYPE:none,FROMUSER:sales,FROMDOM:brunhil.info,SUBJ:700-CreditScore-What-Is-Yours?,URL:brunhil.info,ip_D:66.206.25.227,)

9 1743 (IPTYPE:none,FROMUSER:sales,SUBJ:Now exercise can be fun,FROMDOM:accordpac.info,URL:accordpac.info,ip_D:66.206.14.78,)

9 1706 (IPTYPE:none,FROMUSER:sales,SUBJ:Closeout 2008-2009-2010 New Cars,FROMDOM:rionel.info,URL:rionel.info,ip_D:66.206.25.227,)

9 1693 (IPTYPE:none,FROMUSER:sales,SUBJ:January 18th: CreditReport Update,FROMDOM:astroom.info,URL:astroom.info,ip_D:66.206.25.227,)

9 1689 (IPTYPE:none,FROMUSER:sales,SUBJ:eBay: Work@Home w/Solid-Income-Strategies,FROMDOM:stamine.info,URL:stamine.info,ip_D:66.165.232.203,)

Page 36: Yahoo! Mail antispam - Bay area Hadoop user group

GAMING DETECTION IN NOTSPAM FEEDBACK

• Spammers instrument accounts to vote “not spam” on emails that they send
– Delays classification of spamming IP addresses
– Throws off the classifiers if the feedback is not filtered well

• Model the problem as a bipartite graph
– Well-known model for matching algorithms
– Broadly applied in various fields like coding theory
– A graph whose vertices form two disjoint sets, U and V
– Every edge connects a vertex in U to a vertex in V

Page 37: Yahoo! Mail antispam - Bay area Hadoop user group

CONNECTED COMPONENTS - EXPLAINED

Y1 = Yahoo user 1, Y2 = Yahoo user 2

IP1 = IP address of the host Y1 “voted” notspam from

[Diagram: Yahoo user nodes connected to IP1 and IP2; the SQUARING step produces an edge with weight = 2]
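
One plausible reading of the "squaring" step is a projection of the bipartite user-IP graph onto the IPs: two IPs become linked, and the weight of that link is the number of users who voted notspam from both. The sketch below follows that reading; the class name, the `ipA|ipB` key format and the sample votes are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch of "squaring" a bipartite user-IP graph into a weighted IP-IP graph.
class SquaringSketch {
    // votes: each element is {userId, ipAddress}, meaning the user voted notspam from that IP.
    // Result key "ipA|ipB" (ipA sorts before ipB) maps to the number of users who voted from both IPs.
    static Map<String, Integer> square(List<String[]> votes) {
        Map<String, TreeSet<String>> ipsByUser = new TreeMap<>();
        for (String[] vote : votes) {
            ipsByUser.computeIfAbsent(vote[0], k -> new TreeSet<>()).add(vote[1]);
        }
        Map<String, Integer> weights = new TreeMap<>();
        for (TreeSet<String> ips : ipsByUser.values()) {
            List<String> sorted = new ArrayList<>(ips);
            // Every pair of IPs used by the same user gains one unit of weight.
            for (int i = 0; i < sorted.size(); i++) {
                for (int j = i + 1; j < sorted.size(); j++) {
                    weights.merge(sorted.get(i) + "|" + sorted.get(j), 1, Integer::sum);
                }
            }
        }
        return weights;
    }

    public static void main(String[] args) {
        List<String[]> votes = Arrays.asList(
            new String[] {"y1", "IP1"}, new String[] {"y1", "IP2"},
            new String[] {"y2", "IP1"}, new String[] {"y2", "IP2"});
        System.out.println(square(votes)); // y1 and y2 both voted from IP1 and IP2
    }
}
```

With two users that each voted from both IP1 and IP2, the projected IP1-IP2 edge gets weight 2, matching the weight shown on the slide.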

Page 38: Yahoo! Mail antispam - Bay area Hadoop user group

CONNECTED COMPONENTS FOR “GAMING” DETECTION

[Diagram: Yahoo IDs y1, y2, y3 and IPs IP1, IP2, IP3, IP4, grouped into the sets labeled below]

• Set of “voted from” IPs

• Set of Yahoo IDs voting notspam

• Set of “voted on” IPs

• Set of IPs/YIDs used exclusively for voting notspam

• Set of (likely new) spamming IPs which are “worth” voting for
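
A minimal single-machine sketch of the connected components idea follows, using union-find over a mixed set of Yahoo IDs and IPs. The talk ran this analysis on a grid, so this only illustrates the grouping itself; the class name and the sample edges are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-machine sketch of connected components via union-find.
class ComponentsSketch {
    private final Map<String, String> parent = new HashMap<>();

    // Find the representative of a node's component, compressing the path afterwards.
    String find(String node) {
        parent.putIfAbsent(node, node);
        String root = node;
        while (!parent.get(root).equals(root)) {
            root = parent.get(root);
        }
        String current = node;
        while (!current.equals(root)) {
            String next = parent.get(current);
            parent.put(current, root);
            current = next;
        }
        return root;
    }

    // Record an edge, e.g. a Yahoo ID and an IP it voted notspam from.
    void union(String a, String b) {
        String rootA = find(a);
        String rootB = find(b);
        if (!rootA.equals(rootB)) {
            parent.put(rootA, rootB);
        }
    }

    // Group every known node under its component representative.
    Map<String, List<String>> components() {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String node : new ArrayList<>(parent.keySet())) {
            groups.computeIfAbsent(find(node), k -> new ArrayList<>()).add(node);
        }
        return groups;
    }

    public static void main(String[] args) {
        ComponentsSketch graph = new ComponentsSketch();
        graph.union("y1", "IP1");
        graph.union("y2", "IP1");
        graph.union("y2", "IP2");
        graph.union("y3", "IP9"); // unrelated user and IP
        System.out.println(graph.components().values());
    }
}
```

Here y1, y2, IP1 and IP2 end up in one component and y3 with IP9 in another. A component containing many accounts and IPs that appear only in notspam votes is the kind of weak, collective signal the slide describes.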

Page 39: Yahoo! Mail antispam - Bay area Hadoop user group

CONNECTED COMPONENTS - RESULTS

- Connected components for IPs notspam was voted from

Page 40: Yahoo! Mail antispam - Bay area Hadoop user group

CONNECTED COMPONENTS - RESULTS

- Connected components for IPs notspam was voted on

Page 41: Yahoo! Mail antispam - Bay area Hadoop user group

CONCLUSIONS

• We have had success leveraging parallel, stateful algorithms on grid systems to keep pace with polymorphic spam that evades traditional analysis and algorithms

• Frequent Itemset Mining rapidly identifies cohesive campaigns in ISSPAM feedback

• Connected Components amplifies weak signals in gamed NOTSPAM feedback and helps separate signal from noise in the feedback

• Grid system based analysis platforms may be broadly applicable across the security domain

Page 42: Yahoo! Mail antispam - Bay area Hadoop user group

APPLY SLIDE

• Download Hadoop distribution
– http://hadoop.apache.org
– Try out Pig on a standalone, single Linux box

• Identify source data to aggregate
– Start simple: IP patterns across web access logs
– Begin with offline aggregation; yesterday’s attacks are still interesting

• Read Connected Components and Frequent Itemset Mining papers
– Stop looking for a single, invariant “tell”; it is far too costly
– Start thinking about co-occurrence of innocuous features

Page 43: Yahoo! Mail antispam - Bay area Hadoop user group

RESOURCES FOR IMPLEMENTERS

• Hadoop setup, documentation and resources
– http://hadoop.apache.org/

• Pig documentation and resources
– http://hadoop.apache.org/pig/

• Mahout documentation and resources
– http://lucene.apache.org/mahout/

• Frequent itemset mining implementation repository
– http://fimi.cs.helsinki.fi/src/

• Connected components description
– [link not yet live]

• Ranger, Raghuraman, Penmetsa, Bradski, and Kozyrakis. Evaluating MapReduce for Multi-core and Multiprocessor Systems. In HPCA 2007

Page 44: Yahoo! Mail antispam - Bay area Hadoop user group
Page 45: Yahoo! Mail antispam - Bay area Hadoop user group

CONNECTED COMPONENTS

[Diagram: seven account nodes, each labeled with the same four attributes]

• Reg IP
• Cookie
• Username
• Birthday