Approximate Object Location and Spam Filtering on Peer-to-Peer Systems
Feng Zhou, Li Zhuang, Ben Y. Zhao, Ling Huang, Anthony D. Joseph and John D. Kubiatowicz
University of California, Berkeley
ACM Middleware 2003
The Problem of Spam
• Spam: unsolicited, automated emails
  • Radicati Group: $20B cost in 2003, $198B in 2007
• Proposed solutions
  • Economic model for spam prevention
    • Attach cost to mass email distribution
    • Weakness: needs widespread deployment; prevents but does not filter
  • Bayesian network / machine learning (independent)
    • "Train" mailer with spam, rely on recognizing words / patterns
    • Weakness: key words can be masked (images, invisible characters)
  • Collaborative filtering
    • Store / query for spam signatures on central repository
    • Other users query signatures to filter out incoming spam
    • Weakness: central repository limited in bandwidth, computation
Our Contribution
• Can signatures effectively detect modified spam?
  • Goals: minimize false positives (marking good email as spam); recognize modified/customized spam as same as original
  • Present signature scheme based on approximate fingerprints
  • Evaluate against random text and real email messages
• Can we build a scalable, resilient signature repository?
  • Leverage structured peer-to-peer networks
  • Constrain query latency and limit bandwidth usage
• Orthogonal issues we do not address:
  • Preprocessing emails to extract content
  • Interpreting collective votes via reputation systems
Outline
• Introduction
• An Approximate Signature Scheme
  • Evaluation using random text and real emails
• Approximate object location
  • Similarity search on P2P systems
  • Constraining latency and bandwidth usage
• Conclusion
Collaborative Spam Filtering
1. Spam sent to user A
2. Signatures stored
3. Spam sent to user B
4. B queries repository
5. B filters out spam
[Diagram labels: store signature; query / response]
An Approximate Signature Scheme
• How to match documents with very similar content
  • Calculate checksums of all substrings of length L
  • Select deterministic set of N checksums
  • A matches B iff |sig(A) ∩ sig(B)| > Threshold
  • Computation throughput: 13 MByte/s on P-III 1 GHz
[Figure labels: Email Message; checksum; A B C D E F; Sort by value; Choose N; Randomize; Vector]
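The scheme on this slide (checksum every length-L substring, keep a deterministic set of N checksums, compare the overlap against a threshold) can be sketched as follows. This is only an illustration under stated assumptions: CRC32 as the checksum, L = 50, N = 10, "keep the N smallest values" as the deterministic selection, and an "at least min_shared" cutoff are all choices made here, not details confirmed by the slides.

```python
import zlib


def signature(text: bytes, L: int = 50, N: int = 10) -> set[int]:
    """Return up to N checksums chosen deterministically from all length-L substrings."""
    if len(text) <= L:
        windows = [text]
    else:
        windows = [text[i:i + L] for i in range(len(text) - L + 1)]
    # Assumed checksum: CRC32 (the slides do not name the function used).
    checksums = {zlib.crc32(w) for w in windows}
    # Assumed selection rule: keep the N smallest distinct checksum values.
    return set(sorted(checksums)[:N])


def shared(a: bytes, b: bytes, L: int = 50, N: int = 10) -> int:
    """Size of sig(A) intersected with sig(B)."""
    return len(signature(a, L, N) & signature(b, L, N))


def matches(a: bytes, b: bytes, min_shared: int = 3) -> bool:
    """Declare a match when at least min_shared signatures coincide (assumed cutoff)."""
    return shared(a, b) >= min_shared


original = b"Buy cheap widgets now! " * 200
customized = b"Dear Alice, " + original
print("shared signatures:", shared(original, customized))
```

Because the selection depends only on checksum values, a small edit disturbs only the checksums of substrings that overlap the edit, which is why modified copies can still share most of their signature set.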
Accuracy of Signature Vectors
[Figure: Matching Accuracy vs Changes. x-axis: Minimum Matches out of 10 (0 to 10); y-axis: Probability of Match (0 to 1); series: 10/5K, 50/5K, 5x5/5K]
• 10000 random text documents, size = 5KB, calculate 10 signatures
• Compare signatures of before and after modifications
• Analytical results match experimental results
Eliminating False Positives
[Figure: False Positive Rate. x-axis: Size of Documents (Bytes), 1000 to 100000; y-axis: False Positive Rate, 1.00E-09 to 1.00E-03; series: 1/10 Signatures Match, 2/10 Signatures Match]
• Compare pair-wise signatures between 10000 random docs
• None matched 3 of 10 signatures (100,000,000 pairs)
Evaluation on Real Messages
• 29631 spam emails from www.spamarchive.org
  • Processed visually by project members
  • 14925 (unique), 86% of spam ≤ 5K
• Robustness to modification test
  • Most popular 39 msgs have 3440 modified copies
  • Examine recognition between copies and originals

THRES   Detected   Failed   Detected (%)
3/10    3356       84       97.56
4/10    3172       268      92.21
5/10    2967       473      86.25
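The percentages in this table follow directly from the 3440 modified copies. A short check of the arithmetic, using only the counts given on the slide (the function name is hypothetical):

```python
TOTAL_COPIES = 3440  # modified copies of the 39 most popular messages


def recognition(detected: int, total: int = TOTAL_COPIES) -> tuple[int, float]:
    """Return (failed count, detection rate in percent)."""
    return total - detected, 100.0 * detected / total


for thres, detected in [("3/10", 3356), ("4/10", 3172), ("5/10", 2967)]:
    failed, rate = recognition(detected)
    print(f"{thres}: failed={failed}, detected={rate:.2f}%")
```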
False Positive Test
• Non-spam emails
  • 9589 messages: 50% newsgroup posts + 50% personal emails
  • Compare against 14925 unique spam messages
• Sweet spot, using threshold of 3/10 signatures
  • Recognition rate > 97.5%
  • False positive rate < 1 in 140 million pairs

THRES   # of matching pairs   Probability
1/10    270                   1.89e-6
2/10    4                     2.79e-8
3/10    0                     0
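The probabilities in this table are consistent with dividing the number of matching pairs by the total number of (non-spam, spam) pairs. The check below assumes that is how they were computed; the counts are from the slide and the function name is hypothetical.

```python
NON_SPAM = 9589       # newsgroup posts and personal emails
UNIQUE_SPAM = 14925   # unique spam messages

# Every non-spam message is compared against every unique spam message.
pairs = NON_SPAM * UNIQUE_SPAM


def false_positive_probability(matching_pairs: int) -> float:
    """Fraction of all (non-spam, spam) pairs that matched at a given threshold."""
    return matching_pairs / pairs


print(f"total pairs: {pairs:,}")
for thres, matching in [("1/10", 270), ("2/10", 4), ("3/10", 0)]:
    print(f"{thres}: {false_positive_probability(matching):.2e}")
```

The total comes to roughly 143 million pairs, which is where the "1 in 140 million" bound for the 3/10 threshold comes from.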
A Distributed Signature Repository?
[Diagram labels: store signature; query / response]
• How do we build a scalable distributed repository?
• How do we limit bandwidth consumption and latency?
Structured Peer-to-Peer Overlays
• Storage / query via structured P2P overlay networks
  • Large sparse ID space N (160 bits: 0 to 2^160)
  • Nodes in overlay network have nodeIDs ∈ N
  • Given k ∈ N, overlay deterministically maps k to its root node (a live node in the network)
  • E.g. Chord, Pastry, Tapestry, Kademlia, Skipnet, etc.
• Decentralized Object Location and Routing (DOLR)
  • Objects identified by Globally Unique IDs (GUIDs) ∈ N
  • Decentralized directory service for endpoints/objects
  • Route messages to nearest available endpoint
  • Object location with locality: routing stretch (overlay location / shortest distance) is O(1)
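To make the "key k maps deterministically to a live root node" idea concrete, here is a minimal sketch using a Chord-style successor rule over a 160-bit ID space. It is an assumption chosen for brevity: Tapestry itself uses prefix-based routing with a different root-selection rule, and the helper names below are hypothetical.

```python
import hashlib
from bisect import bisect_left

ID_BITS = 160  # SHA-1 output size, matching the 160-bit ID space on the slide


def make_id(name: str) -> int:
    """Hash an arbitrary name into the 160-bit identifier space."""
    return int.from_bytes(hashlib.sha1(name.encode()).digest(), "big")


def root_node(key: int, node_ids: list[int]) -> int:
    """Return the root for key: the first live nodeID >= key, wrapping around.

    node_ids must be sorted in ascending order and non-empty.
    """
    i = bisect_left(node_ids, key)
    return node_ids[i % len(node_ids)]


live_nodes = sorted(make_id(f"node-{n}") for n in range(8))
guid = make_id("some object")
print(hex(root_node(guid, live_nodes)))
```

Any node can evaluate the same rule, so every participant agrees on which live node is responsible for a given key without consulting a central directory.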
DOLR on Tapestry Routing Mesh
[Diagram labels: Object O; Server; Client; Root(O); Client]
More Than Just Unique Identifiers
• Objects named by Globally Unique ID (GUID)
• Application maps secondary characteristics to ID: versioning, modified replicas, app-specific info
• Simplify the search problem: out of m search fields, or "features," find objects matching at least n exactly
[Diagram labels: Simpler Queries; More Powerful QLs; DHT/DOLRs; Search on Features; Databases]
A Layered Perspective
• ADOLR layer
  • Introduce naming mapping from feature vector to GUIDs
  • Rely on overlay infrastructure for storage
  • Abstraction of feature vectors as approximate names for object(s)

Layer                                    Interface
Approximate DOLR/DHT                     approxPublish(FV, GUID), approxSearch(FV), route(FV, msg)
Structured P2P Overlay Network (DOLR)    Publish(GUID), route(GUID, msg)
IP Network Layer                         route(IPAddr, msg)
Marking a New Spam Message
• Signatures stored as inverted index (feature object) inside overlay
• User on C gets spam E2, calculates signatures S: {S1, S2, S3}
• For each feature in S, if feature object exists, add E2
• If no feature object exists, create one locally and publish
[Diagram: node C (E2) issues put(S1, E2), put(S2, E2), put(S3, E2) into the overlay; node A (E3) already holds S1 → E3. Resulting feature objects: S1 → E3, E2; S2 → E2; S3 → E2]
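The marking step amounts to maintaining an inverted index from each signature to the GUIDs of messages that contain it. The sketch below replays the slide's example with an in-memory dictionary standing in for the overlay. ToyOverlay and approx_publish are hypothetical names, and a real deployment would route each put to the feature's root node instead of a local dict.

```python
from collections import defaultdict


class ToyOverlay:
    """In-memory stand-in for the overlay: feature -> set of message GUIDs."""

    def __init__(self) -> None:
        self.feature_objects: dict[str, set[str]] = defaultdict(set)

    def put(self, feature: str, guid: str) -> None:
        # Creates the feature object if it is missing, otherwise extends it.
        self.feature_objects[feature].add(guid)

    def get(self, feature: str) -> set[str]:
        # Read without creating an empty feature object as a side effect.
        return set(self.feature_objects.get(feature, ()))


def approx_publish(overlay: ToyOverlay, feature_vector: list[str], guid: str) -> None:
    """Register guid under every feature (signature) in its feature vector."""
    for feature in feature_vector:
        overlay.put(feature, guid)


overlay = ToyOverlay()
approx_publish(overlay, ["S1"], "E3")              # node A marked E3 earlier
approx_publish(overlay, ["S1", "S2", "S3"], "E2")  # node C marks E2
print({f: sorted(g) for f, g in overlay.feature_objects.items()})
```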
Filtering New Emails for Spam
• User at node D receives new email E2' with signatures {S1, S2, S4}
• Queries overlay for signatures, retrieves matching GUIDs for each
• Threshold = 2/3: contact GUIDs that occur in 2 of 3 result sets
• Contact E2 via overlay for any additional info (votes etc.)
[Diagram: node D queries the overlay for S1, S2, S4; feature objects S1 → E3, E2; S2 → E2; S3 → E2; no feature object for S4; E2' is matched to E2]
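The filtering step is a threshold count over the result sets returned for each signature. A minimal sketch, assuming the index is available as a plain dictionary and that "2 of 3" means "at least 2"; approx_search is a hypothetical name echoing the approxSearch call named in the layered view.

```python
from collections import Counter


def approx_search(index: dict[str, set[str]], feature_vector: list[str],
                  min_matches: int) -> set[str]:
    """Return GUIDs that appear in at least min_matches of the per-feature result sets."""
    counts: Counter[str] = Counter()
    for feature in set(feature_vector):
        for guid in index.get(feature, ()):
            counts[guid] += 1
    return {guid for guid, seen in counts.items() if seen >= min_matches}


# State of the inverted index after E3 and E2 have been marked.
index = {"S1": {"E3", "E2"}, "S2": {"E2"}, "S3": {"E2"}}

# New email E2' carries signatures S1, S2 and S4; threshold is 2 of 3.
print(approx_search(index, ["S1", "S2", "S4"], min_matches=2))
```

Here E2 appears in the result sets for S1 and S2, so it passes the threshold, while E3 appears only under S1 and is ignored.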
Constraining Bandwidth and Latency
• Need to constrain bandwidth and latency
  • Limit signature query to h overlay hops
  • Return null set if h hops reached without result
• Simulation on transit-stub topologies
  • 5K nodes, 4K overlay nodes, diameter = 400ms
  • Each spam message only reaches small group
  • For each message: % of users who have seen and marked it = mark rate
  • Measure tradeoff between latency and success rate of locating known spam, for different mark rates
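The hop limit can be pictured as a query that walks toward the feature's root and gives up after h hops. The sketch below is a simplified model under stated assumptions: the route is given as a precomputed list of node names, each node may hold a pointer to the feature object, and the function and variable names are hypothetical.

```python
def bounded_lookup(path: list[str], pointers: dict[str, dict[str, set[str]]],
                   feature: str, max_hops: int) -> set[str]:
    """Walk the overlay path toward the root, stopping after max_hops hops.

    path[0] is the querying node (0 hops). Returns the GUID set found at the
    first node holding a pointer for the feature, or an empty set if none is
    found within the hop budget.
    """
    for node in path[:max_hops + 1]:
        found = pointers.get(node, {}).get(feature)
        if found:
            return set(found)
    return set()  # "null set": hop limit reached without a result


path = ["D", "n1", "n2", "root"]
pointers = {"n2": {"S1": {"E2"}}}
print(bounded_lookup(path, pointers, "S1", max_hops=1))  # budget too small
print(bounded_lookup(path, pointers, "S1", max_hops=2))  # pointer reached
```

A smaller h bounds both latency and the number of messages per query, at the cost of missing signatures whose pointers lie further along the route.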
Simulation Result
[Figure: Trading Accuracy for Latency and Bandwidth. x-axis: Latency to Locate Object (0 to 500); y-axis: Prob. of Finding Match (0 to 1); series: 10%, 4%, 2%, 1% and 0.5% Mark Rate; annotation: network diameter]
Feature-based Queries
• Approximate Text Addressing
  • Text objects: features are hashed signatures
  • Applications: plagiarism detection, replica management
• Other feature-based searching
  • Music similarity search
    • Extract musical characteristics
    • Signatures: {hash(field1=value1), hash(field2=value2), …}
    • E.g. Fourier transform values, # of wavetable entries
  • Image similarity search
    • Locate files with similar geometric properties, etc.
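The {hash(field1=value1), hash(field2=value2), …} form can be illustrated by hashing each field/value pair into one feature, so that two objects share a feature exactly when they agree on that field. The sketch assumes SHA-1 as the hash, and the field names are made up for the example.

```python
import hashlib


def feature_signatures(fields: dict[str, str]) -> set[str]:
    """Hash each 'field=value' pair into one feature of the signature set."""
    return {
        hashlib.sha1(f"{name}={value}".encode()).hexdigest()
        for name, value in fields.items()
    }


# Hypothetical musical characteristics for two recordings.
track_a = {"tempo": "120", "key": "C", "wavetable_entries": "64"}
track_b = {"tempo": "120", "key": "C", "wavetable_entries": "128"}

common = feature_signatures(track_a) & feature_signatures(track_b)
print(f"{len(common)} of {len(track_a)} features match")
```

The resulting sets can be published and searched in the same way as text signatures, with a threshold on the number of shared features.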
Finally…
• Status
  • ADOLR infrastructure implemented on Tapestry
  • SpamWatch: P2P spam filter implemented, including Microsoft Outlook plug-in
  • Available for download: http://www.cs.berkeley.edu/~zf/spamwatch
• Tapestry: http://www.cs.berkeley.edu/~ravenben/tapestry
Thank you… Questions?
Backup Slides Follow