A False Positive Safe Neural Network for Spam Detection
Alexandru Catalin Cosoi
Does this look familiar?
Anatrim
Oh boy, it’s getting worse!!!
Bad Bad Spammer!!!
• Databases:
• D: Random legitimate text
• D1: Different rephrases of a certain spam phrase
• D2: Different rephrases of another spam phrase
• …………………
• Dn: Different rephrases of another spam phrase
– Create spam message script:
– Choose a random phrase from D1
– Choose random text from D
– Choose a random phrase from D2
– Choose random text from D
– …………….
– Choose a random phrase from Dn
• Send message.
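The message-building script above can be sketched roughly as follows. This is only an illustration: the function name `build_spam_message` and the toy databases are hypothetical, and the slides do not say how the real spammer tool was written.

```python
import random

def build_spam_message(legit_texts, phrase_dbs, rng=random):
    """Interleave one random rephrase from each Di with random text from D.

    legit_texts: list of strings (database D, random legitimate text)
    phrase_dbs:  list of lists of strings (databases D1..Dn)
    """
    parts = []
    for i, variants in enumerate(phrase_dbs):
        if i > 0:
            # Random legitimate filler between two consecutive spam phrases.
            parts.append(rng.choice(legit_texts))
        # One random rephrase of the i-th spam phrase.
        parts.append(rng.choice(variants))
    return "\n".join(parts)

# Toy data, made up for the example.
D = ["The weather was lovely last week.", "Please find the minutes attached."]
D1 = ["Lose weight fast", "Drop pounds quickly"]
D2 = ["Order today", "Buy now"]
message = build_spam_message(D, [D1, D2], rng=random.Random(0))
```

Because every slot is drawn independently, two messages from the same run rarely share the same text, which is what makes exact-match signatures ineffective here.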
• 40 samples of different subjects
• 50 samples of different titles
• 30 samples of different titles (part II)
• 60000 different combinations
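One reading of these numbers, assuming one sample is picked independently from each of the three pools, is that the total is simply their product:

```python
from math import prod

# Pool sizes taken from the slide; independence of the picks is an assumption.
pool_sizes = [40, 50, 30]
combinations = prod(pool_sizes)  # 40 * 50 * 30
```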
Appeared as a consequence of botnets
Features
• Larger time frame – KeyWord!!!!
• Weak features
– Words like “Anatrim”, “Viagra”, “Xanax”, “Stock”
– Simple word combinations like “Stock alert”, “Strong buy”
– Simple header heuristics (for both spam and ham) like: valid reply, weird message ID, forged headers
• Example:
– Top 500 spammy words from a Bayesian dictionary
– Some simple header heuristics from SpamAssassin’s SARE Ninjas
– Trainer’s personal flavour
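A minimal sketch of how such weak features could be turned into a binary input vector. The word and phrase lists reuse the examples from the slide; the two header checks and the name `extract_features` are hypothetical stand-ins, not the actual SARE rules or the author's feature set.

```python
import re

SPAMMY_WORDS = ["anatrim", "viagra", "xanax", "stock"]
SPAMMY_PHRASES = ["stock alert", "strong buy"]

def extract_features(subject, body, headers):
    """Return a list of 0/1 features: words, phrases, then header heuristics."""
    tokens = re.findall(r"[a-z0-9']+", f"{subject} {body}".lower())
    token_set = set(tokens)
    # Pad with spaces so phrases only match on whole-word boundaries.
    normalized = " " + " ".join(tokens) + " "

    features = [int(word in token_set) for word in SPAMMY_WORDS]
    features += [int(f" {phrase} " in normalized) for phrase in SPAMMY_PHRASES]

    # Hypothetical header heuristics (assumed for illustration only).
    reply_to = headers.get("Reply-To", "")
    message_id = headers.get("Message-ID", "")
    valid_reply = re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", reply_to) is not None
    weird_message_id = re.fullmatch(r"<[^@\s<>]+@[^@\s<>]+>", message_id) is None
    features += [int(valid_reply), int(weird_message_id)]
    return features

vector = extract_features(
    "Stock alert",
    "Strong buy on Anatrim now",
    {"Reply-To": "a@b.com", "Message-ID": "<123@mail.example.com>"},
)
```

Each feature on its own says little about a message; the idea on the slide is that the network combines many such weak signals.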
Why ART?
• Training occurs by modifying the weights of each neuron
• For large amounts of data, forgetting important details might actually happen
• Solves the stability-plasticity dilemma
• Based on template detection
• An unlimited number of templates allows an unlimited number of patterns
• 2 self-organizing neural networks + a mapping module = a supervised self-organizing neural network
Adaptive Resonance Theory
• Similar to a clustering algorithm (as many clusters as needed)
• ARTMAP = ARTa + ARTb + MapField
ART Vigilance
Small value - Imprecise
Big value - Fragmented
• A big value: accepts only small errors; many small clusters; high precision
• A small value: accepts large errors; a few big clusters; errors can appear
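The effect of vigilance can be illustrated with a simplified ART1-style clusterer for binary inputs. This is a sketch of the unsupervised part only, under the usual ART1 choice and match rules with fast learning; it is not ARTMAP and not the ART++ system from this talk, and the name `art1_cluster` and the parameter `beta` are assumptions made for the example.

```python
def art1_cluster(samples, vigilance, beta=1.0):
    """Cluster binary samples, each given as a set of active feature indices.

    Returns (labels, templates): the category index chosen for each sample
    and the learned template (a set of feature indices) of each category.
    """
    templates = []
    labels = []
    for sample in samples:
        sample = set(sample)
        # Rank categories by the choice function |I and w| / (beta + |w|).
        order = sorted(
            range(len(templates)),
            key=lambda j: len(sample & templates[j]) / (beta + len(templates[j])),
            reverse=True,
        )
        chosen = None
        for j in order:
            # Vigilance test: fraction of the input explained by the template.
            match = len(sample & templates[j]) / len(sample) if sample else 1.0
            if match >= vigilance:
                templates[j] &= sample  # fast learning: keep shared features
                chosen = j
                break
        if chosen is None:
            # No template is close enough, so a new category is created.
            templates.append(set(sample))
            chosen = len(templates) - 1
        labels.append(chosen)
    return labels, templates

samples = [{0, 1, 2, 3}, {0, 1, 2, 4}, {5, 6, 7, 8}, {0, 1, 2, 3}]
loose_labels, loose_templates = art1_cluster(samples, vigilance=0.5)
strict_labels, strict_templates = art1_cluster(samples, vigilance=0.9)
```

With the small vigilance the first two samples share one broad template; with the big vigilance they are split into separate, more precise categories.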
ART++
Algorithm
Corpus
• 2.5 million spam messages (sampled in waves with a high degree of variation) and around 1000 simple low-relevance text heuristics (not counting the standard header heuristics).
• The first 1000 words (ordered by discrimination, but with a minimum of 10-30 hundred occurrences) from a Bayesian dictionary trained on this corpus, and also standard header heuristics.
• Almost 1 million legitimate email messages
• 75% of the message corpus was used for training the neural network, and
• 25% was used for testing the neural network.
• 1.5 days to train!!!!
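A rough sketch of the two preparation steps described above: a 75/25 split and picking the most discriminative words above a minimum occurrence count. The slides do not define "discrimination", so the score used here (distance of a Bayesian-style spam probability from 0.5) is an assumption, as are the function names and the random shuffle before splitting.

```python
import random
from collections import Counter

def split_corpus(messages, train_fraction=0.75, seed=0):
    """Shuffle and split messages into (train, test)."""
    shuffled = list(messages)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def top_discriminative_words(spam_docs, ham_docs, top_n=1000, min_count=1000):
    """Rank words by |P(spam | word) - 0.5|, ignoring rare words."""
    if not spam_docs or not ham_docs:
        raise ValueError("both corpora must be non-empty")
    # Document frequencies: each word counts once per message.
    spam_counts = Counter(w for doc in spam_docs for w in set(doc.lower().split()))
    ham_counts = Counter(w for doc in ham_docs for w in set(doc.lower().split()))
    scores = {}
    for word in set(spam_counts) | set(ham_counts):
        s, h = spam_counts[word], ham_counts[word]
        if s + h < min_count:
            continue
        spam_rate = s / len(spam_docs)
        ham_rate = h / len(ham_docs)
        scores[word] = abs(spam_rate / (spam_rate + ham_rate) - 0.5)
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]

train, test = split_corpus(range(100))
words = top_discriminative_words(
    ["buy viagra now", "viagra cheap", "hello viagra"],
    ["hello friend", "lunch now", "hello again"],
    top_n=2,
    min_count=2,
)
```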
Results
• FP: 1% → 0.0001%
• FN: 4% → 20%
• On some corpora (TREC 2006) we had … not so great results (but current heuristics)
• FN: 35%
• FP: 2 email messages!
• At least, just a few false positives!
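For reference, a small sketch of how FP and FN rates like these are commonly computed. It assumes FP is measured against the legitimate messages and FN against the spam messages; the slides do not state the exact definitions, and `error_rates` is a hypothetical helper.

```python
def error_rates(true_labels, predicted_labels):
    """Return (fp_rate, fn_rate) for labels that are "ham" or "spam"."""
    pairs = list(zip(true_labels, predicted_labels))
    ham_total = sum(1 for truth, _ in pairs if truth == "ham")
    spam_total = sum(1 for truth, _ in pairs if truth == "spam")
    # False positive: legitimate mail flagged as spam.
    fp = sum(1 for truth, pred in pairs if truth == "ham" and pred == "spam")
    # False negative: spam that slipped through as legitimate.
    fn = sum(1 for truth, pred in pairs if truth == "spam" and pred == "ham")
    fp_rate = fp / ham_total if ham_total else 0.0
    fn_rate = fn / spam_total if spam_total else 0.0
    return fp_rate, fn_rate

truth = ["ham"] * 4 + ["spam"] * 5
predicted = ["ham", "ham", "ham", "spam", "spam", "spam", "spam", "ham", "ham"]
fp_rate, fn_rate = error_rates(truth, predicted)
```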
Conclusions
• ART + Simple Features + Spam = Love
• ART + False Positives + Spam = OMG!!!
• (ART++) = Heuristic Filter + ARTMAP
• Must use a lot of email messages. It is highly difficult to find representative samples for individual waves.
• Can also be applied to other neural networks
• Interesting PowerPoint template…
Thanks
QUESTIONS?