network-level spam defenses
DESCRIPTION
Network-Level Spam Defenses. Nick Feamster Georgia Tech. with Anirudh Ramachandran, Shuang Hao, Alex Gray, Santosh Vempala. Spam: More than Just a Nuisance. 95% of all email traffic Image and PDF Spam (PDF spam ~12%) - PowerPoint PPT PresentationTRANSCRIPT
Network-Level Spam Defenses
Nick FeamsterGeorgia Tech
with Anirudh Ramachandran, Shuang Hao, Alex Gray, Santosh Vempala
2
Spam: More than Just a Nuisance
• 95% of all email traffic– Image and PDF Spam
(PDF spam ~12%)
• As of August 2007, one in every 87 emails constituted a phishing attack
• Targeted attacks on the rise– 20k-30k unique phishing attacks per month
Source: CNET (January 2008), APWG
3
Approach: Filter
• Prevent unwanted traffic from reaching a user’s inbox by distinguishing spam from ham
• Question: What features best differentiate spam from legitimate mail?– Content-based filtering: What is in the mail?– IP address of sender: Who is the sender?– Behavioral features: How the mail is sent?
Conventional: Content Filters• Trying to hit a moving target...
...and even mp3s!
PDFs Excel sheets Images
5
Problems with Content Filtering• Customized emails are easy to generate: Content-
based filters need fuzzy hashes over content, etc.
• Low cost to evasion: Spammers can easily alter features of an email’s content can be easily adjusted and changed
• High cost to filter maintainers: Filters must be continually updated as content-changing techniques become more sophisticated
6
Another Approach: IP Addresses
• Problem: IP addresses are ephemeral
• Every day, 10% of senders are from previously unseen IP addresses
• Possible causes– Dynamic addressing– New infections
7
Our Idea: Network-Based Filtering
• Filter email based on how it is sent, in addition to simply what is sent.
• Network-level properties are less malleable– Network/geographic location of sender and receiver– Set of target recipients– Hosting or upstream ISP (AS number)– Membership in a botnet (spammer, hosting
infrastructure)
8
Why Network-Level Features?
• Lightweight: Don’t require inspecting details of packet streams– Can be done at high speeds– Can be done in the middle of the network
• Robust: Perhaps more difficult to change some network-level features than message contents
9
Challenges (Talk Outline)• Understanding network-level behavior
– What network-level behaviors do spammers have?– How well do existing techniques work?
• Building classifiers using network-level features– Key challenge: Which features to use?– Two Algorithms: SNARE and SpamTracker
• Building the system – Dynamism: Behavior itself can change– Scale: Lots of email messages (and spam!) out there
10
Some Questions of Study
• Where (in IP space, in geography) does spam originate from?
• What OSes are used to send spam?
• What techniques are used to send spam?
11
Data: Spam and BGP• Spam Traps: Domains that receive only spam• BGP Monitors: Watch network-level reachability
Domain 1
Domain 2
17-Month Study: August 2004 to December 2005
12
Data Collection: MailAvenger• Configurable SMTP server• Collects many useful statistics
13
Finding: BGP “Spectrum Agility”• Hijack IP address space using BGP• Send spam• Withdraw IP address
A small club of persistent players appears to be using
this technique.
Common short-lived prefixes and ASes
61.0.0.0/8 4678 66.0.0.0/8 2156282.0.0.0/8 8717
~ 10 minutes
Somewhere between 1-10% of all spam (some clearly intentional,
others might be flapping)
14
Spectrum Agility: Big Prefixes?
• Flexibility: Client IPs can be scattered throughout dark space within a large /8– Same sender usually returns with different IP
addresses
• Visibility: Route typically won’t be filtered (nice and short)
15
Other Findings
• Top senders: Korea, China, Japan– Still about 40% of spam coming from U.S.
• More than half of sender IP addresses appear less than twice
• ~90% of spam sent to traps from Windows
16
How Well do IP Blacklists Work?
• Completeness: The fraction of spamming IP addresses that are listed in the blacklist
• Responsiveness: The time for the blacklist to list the IP address after the first occurrence of spam
17
Completeness and Responsiveness
• 10-35% of spam is unlisted at the time of receipt• 8.5-20% of these IP addresses remain unlisted
even after one month
Data: Trap data from March 2007, Spamhaus from March and April 2007
18
What’s Wrong with IP Blacklists?• Based on ephemeral identifier (IP address)
– More than 10% of all spam comes from IP addresses not seen within the past two months
• Dynamic renumbering of IP addresses• Stealing of IP addresses and IP address space• Compromised machines
• IP addresses of senders have considerable churn
• Often require a human to notice/validate the behavior– Spamming is compartmentalized by domain and not analyzed
across domains
19
Are There Other Approaches?
• Option 1: Stronger sender identity [AIP, Pedigree]
– Stronger sender identity/authentication may make reputation systems more effective
– May require changes to hosts, routers, etc.
• Option 2: Behavior-based filtering [SNARE, SpamTracker]
– Can be done on today’s network– Identifying features may be tricky, and some may
require network-wide monitoring capabilities
20
Outline
• Understanding the network-level behavior– What behaviors do spammers have?– How well do existing techniques work?
• Classifiers using network-level features– Key challenge: Which features to use?– Two algorithms: SNARE and SpamTracker
• The System: SpamSpotter – Dynamism: Behavior itself can change– Scale: Lots of email messages (and spam!) out there
21
Finding the Right Features
• Goal: Sender reputation from a single packet?– Low overhead– Fast classification– In-network– Perhaps more evasion resistant
• Key challenge– What features satisfy these properties and can
distinguish spammers from legitimate senders?
22
Set of Network-Level Features• Single-Packet
– AS of sender’s IP– Distance to k nearest senders– Status of email service ports– Geodesic distance– Time of day
• Single-Message– Number of recipients– Length of message
• Aggregate (Multiple Message/Recipient)
23
Sender-Receiver Geodesic Distance
90% of legitimate messages travel 2,200 miles or less
24
Density of Senders in IP Space
For spammers, k nearest senders are much closer in IP space
25
Local Time of Day at Sender
Spammers “peak” at different local times of day
26
Combining Features: RuleFit• Put features into the RuleFit classifier• 10-fold cross validation on one day of query logs
from a large spam filtering appliance provider
• Comparable performance to SpamHaus– Incorporating into the system can further reduce FPs
• Using only network-level features• Completely automated
27
SNARE: Putting it Together
• Email arrival• Whitelisting
– Top 10 ASes responsible for 43% of misclassified IP addresses• Greylisting• Retraining
28
Benefits of Whitelisting
Whitelisting top 50 ASes:False positives reduced to 0.14%
29
Outline• Understanding the network-level behavior
– What behaviors do spammers have?– How well do existing techniques work?
• Classifiers using network-level features– Key challenge: Which features to use?– Algorithms: SNARE and SpamTracker
• Building the system (SpamSpotter)– Dynamism: Behavior itself can change– Scale: Lots of email messages (and spam!) out there
30
SpamTracker
• Idea: Blacklist sending behavior (“Behavioral Blacklisting”)– Identify sending patterns commonly used by
spammers
• Intuition: Much more difficult for a spammer to change the technique by which mail is sent than it is to change the content
31
SpamTracker: Clustering Approach
• Construct a behavioral fingerprint for each sender
• Cluster senders with similar fingerprints• Filter new senders that map to existing
clusters
32
SpamTracker: Identify Invariant
domain1.com domain2.com domain3.com
spam spam spam
IP Address: 76.17.114.xxxKnown Spammer
DHCPReassignment
Behavioral fingerprint
domain1.com domain2.com domain3.com
spam spam spam
IP Address: 24.99.146.xxxUnknown sender
Cluster on sending behavior
Similar fingerprint!
Cluster on sending behavior
Infection
33
Building the Classifier: Clustering
• Feature: Distribution of email sending volumes across recipient domains
• Clustering Approach– Build initial seed list of bad IP addresses– For each IP address, compute feature vector:
volume per domain per time interval– Collapse into a single IP x domain matrix:– Compute clusters
34
Clustering: Output and Fingerprint
• For each cluster, compute fingerprint vector:
• New IPs will be compared to this “fingerprint”
IP x IP Matrix: Intensity indicates pairwise similarity
35
Evaluation
• Emulate the performance of a system that could observe sending patterns across many domains– Build clusters/train on given time interval
• Evaluate classification– Relative to labeled logs– Relative to IP addresses that were eventually listed
36
Data• 30 days of Postfix logs from email hosting service
– Time, remote IP, receiving domain, accept/reject– Allows us to observe sending behavior over a large
number of domains– Problem: About 15% of accepted mail is also spam
• Creates problems with validating SpamTracker
• 30 days of SpamHaus database in the month following the Postfix logs– Allows us to determine whether SpamTracker detects
some sending IPs earlier than SpamHaus
37
Clustering ResultsHam
Spam
SpamTracker Score
Separation may not be sufficient alone, but could be a useful feature
38
Outline• Understanding the network-level behavior
– What behaviors do spammers have?– How well do existing techniques work?
• Building classifiers using network-level features– Key challenge: Which features to use?– Algorithms: SpamTracker and SNARE
• Building the system (SpamSpotter)– Dynamism: Behavior itself can change– Scale: Lots of email messages (and spam!) out there
39
Deployment: Real-Time Blacklist
• As mail arrives, lookups received at BL
• Queries provide proxy for sending behavior
• Train based on received data
• Return score
Approach
40
Challenges• Scalability: How to collect and aggregate data, and form
the signatures without imposing too much overhead?
• Dynamism: When to retrain the classifier, given that sender behavior changes?
• Reliability: How should the system be replicated to better defend against attack or failure?
• Evasion resistance: Can the system still detect spammers when they are actively trying to evade?
41
Design Choice: Augment DNSBL• Expressive queries
– SpamHaus: $ dig 55.102.90.62.zen.spamhaus.org• Ans: 127.0.0.3 (=> listed in exploits block list)
– SpamSpotter: $ dig \ receiver_ip.receiver_domain.sender_ip.rbl.gtnoise.net
• e.g., dig 120.1.2.3.gmail.com.-.1.1.207.130.rbl.gtnoise.net
• Ans: 127.1.3.97 (SpamSpotter score = -3.97)
• Also a source of data– Unsupervised algorithms work with unlabeled
data
42
Latency
Performance overhead is negligible.
43
Design Choice: Sampling
Relatively small samples can achieve low false positive rates
44
Possible Improvements• Accuracy
– Synthesizing multiple classifiers– Incorporating user feedback– Learning algorithms with bounded false positives
• Performance– Caching/Sharing– Streaming
• Security– Learning in adversarial environments
45
Summary: Network-Based Behavioral Reputation
• Spam increasing, spammers becoming agile– Content filters are falling behind– IP-Based blacklists are evadable
• Up to 30% of spam not listed in common blacklists at receipt. ~20% remains unlisted after a month
• Complementary approach: behavioral blacklisting based on network-level features– Key idea: Blacklist based on how messages are sent– SNARE: Automated sender reputation
• ~90% accuracy of existing with lightweight features– SpamTracker: Spectral clustering
• catches significant amounts faster than existing blacklists– SpamSpotter: Putting it together in an RBL system
46
Next Steps: Phishing and Scams
• Scammers host Web sites on dynamic scam hosting infrastructure– Use DNS to redirect users to different sites
when the location of the sites move• State of the art: Blacklist URL• Our approach: Blacklist based on
network-level fingerprints
Konte et al., “Dynamics of Online Scam Hosting Infrastructure”, PAM 2009
47
References• Anirudh Ramachandran and Nick Feamster, “Understanding
the Network-Level Behavior of Spammers”, ACM SIGCOMM, 2006
• Anirudh Ramachandran, Nick Feamster, and Santosh Vempala, “Filtering Spam with Behavioral Blacklisting”, ACM CCS, 2007
• Nadeem Syed, Shuang Hao, Nick Feamster, Alex Gray and Sven Krasser, “SNARE: Spatio-temporal Network-level Automatic Reputation Engine”, GT-CSE-08-02
• Anirudh Ramachandran, Shuang Hao, Hitesh Khandelwal, Nick Feamster, Santosh Vempala, “A Dynamic Reputation Service for Spotting Spammers”, GT-CS-08-09 (In submission)
48
Time Between Record ChangesFast-flux Domains tend to change much more frequently than legitimately hosted sites
49
50
Classifying IP Addresses
• Given “new” IP address, build a feature vector based on its sending pattern across domains
• Compute the similarity of this sending pattern to that of each known spam cluster– Normalized dot product of the two feature vectors– Spam score is maximum similarity to any cluster
51
Sampling: Training Time
52
Additional History: Message Size Variance
Senders of legitimate mail have a much higher variance in sizes of messages they send
Message Size Range
Certain Spam
Likely Spam
Likely Ham
Certain Ham
Surprising: Including this feature (and others with more history) can actually decrease the accuracy of the classifier
53
Completeness of IP Blacklists
~80% listed on average
~95% of bots listed in one or more blacklists
Number of DNSBLs listing this spammer
Only about half of the IPs spamming from short-lived BGP are listed in any blacklistFr
actio
n of
all
spam
rece
ived
Spam from IP-agile senders tend to be listed in fewer blacklists
54
Low Volume to Each Domain
Lifetime (seconds)
Am
ount
of S
pam Most spammers send very little spam, regardless
of how long they have been spamming.
55
Some Patterns of Sending are Invariant
domain1.com domain2.com domain3.com
spam spam spam
IP Address: 76.17.114.xxx
DHCPReassignment
domain1.com domain2.com domain3.com
spam spam spam
IP Address: 24.99.146.xxx
• Spammer's sending pattern has not changed• IP Blacklists cannot make this connection
56
Characteristics of Agile Senders
• IP addresses are widely distributed across the /8 space• IP addresses typically appear only once at our sinkhole• Depending on which /8, 60-80% of these IP addresses
were not reachable by traceroute when we spot-checked
• Some IP addresses were in allocated, albeit unannounced space
• Some AS paths associated with the routes contained reserved AS numbers
57
Early Detection Results
• Compare SpamTracker scores on “accepted” mail to the SpamHaus database– About 15% of accepted mail was later determined to
be spam– Can SpamTracker catch this?
• Of 620 emails that were accepted, but sent from IPs that were blacklisted within one month– 65 emails had a score larger than 5 (85th percentile)
58
Evasion
• Problem: Malicious senders could add noise– Solution: Use smaller number of trusted domains
• Problem: Malicious senders could change sending behavior to emulate “normal” senders– Need a more robust set of features…