An Overview of Search Engine Spam
Zoltán Gyöngyi
Stanford Security Workshop ● Stanford, March 19, 2007 2
Roadmap
• What is search engine spam?
• What techniques do spammers use?
• Work at Stanford
• Challenges ahead
Example
kaiser pharmacy online
Example
Save today on Viagra, Lipitor, Zoloft, …
Phentermine 90 Pills/$119
mp3
Example
• Lawyers
• Loans
• Mortgage
• Ringtones
• Viagra
Pharmacy is the profession of compounding and dispensing medication. More recently, the term has come to include other services…
Definition
Spamming: deliberate human action meant to trigger unjustifiably high ranking
So… who does what?
Monetization
Better ranking = higher click-through rate
• Search engine optimization
• Affiliate spam
• Advertisement spam
Why?
Techniques How?
Techniques / Boosting / Term
[Diagram: taxonomy of techniques, split into boosting (term, link) and hiding]
Techniques / Boosting / Term
• repetition repetition repetition repetition repetition repetition
• dumortierite dumose dumous dump dumpage dumper dumpily dumpiness dumping dumpish dumpishly
• work in weaving three-women teams is an ancient textile art on looms
• please refrain from using the phrase stitching wounds located on the lower limbs
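The first example above is unnatural in a way that is easy to quantify: one word accounts for the entire text. The sketch below is an illustration only, not a method from the talk. It measures the share of tokens taken up by the most frequent token; the function name `top_term_share` and the whitespace tokenization are assumptions made for the example.

```python
from collections import Counter

def top_term_share(text: str) -> float:
    """Fraction of whitespace-separated tokens equal to the most frequent token."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    # most_common(1) returns a list with one (token, count) pair
    most_common_count = Counter(tokens).most_common(1)[0][1]
    return most_common_count / len(tokens)

# Hypothetical inputs: the repeated word from the slide and an ordinary sentence.
repeated = "repetition " * 6
ordinary = "pharmacy is the profession of compounding and dispensing medication"
```

A high share only hints at repetition. It says nothing about the other examples on the slide, which use many distinct words.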
Techniques / Boosting / Link
Directory clones
• Duplicate (parts of) DMOZ or Yahoo! Directory
Comment spam
• Post messages (containing links) to
  – Blogs
  – (Unmoderated) forums
  – Wikis
Link spam farms
• Create colluding spam pages
• See later
Techniques / Hiding
Content hiding
<style type="text/css">
body { background-color: white; color: white; }
</style>
<div style="visibility: hidden">You can’t see me!</div>
<a href="…"><img src="1x1.gif"></a>
Cloaking
• Identify web crawlers
• Serve a different version of the page
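The content-hiding markup on this slide relies on inline styles that a program can read even though a visitor sees nothing. As a minimal sketch, assuming only inline `style` attributes matter, the code below collects text nested inside elements styled `visibility: hidden` or `display: none`. The class and function names are hypothetical, and the check ignores stylesheets, white-on-white text, and 1x1 image links, all of which a real detector would also have to handle.

```python
from html.parser import HTMLParser

class HiddenTextFinder(HTMLParser):
    """Collects text nested inside elements whose inline style hides them."""

    VOID_TAGS = {"img", "br", "hr", "input", "meta", "link"}

    def __init__(self):
        super().__init__()
        self.stack = []        # one bool per open element: is it hidden?
        self.hidden_text = []

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID_TAGS:
            return  # void elements have no end tag, so they are not tracked
        style = (dict(attrs).get("style") or "").replace(" ", "").lower()
        hides = "visibility:hidden" in style or "display:none" in style
        parent_hidden = bool(self.stack) and self.stack[-1]
        self.stack.append(hides or parent_hidden)

    def handle_endtag(self, tag):
        if tag not in self.VOID_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        if self.stack and self.stack[-1] and data.strip():
            self.hidden_text.append(data.strip())

def find_hidden_text(html: str) -> list[str]:
    """Return the text fragments found inside hidden elements."""
    finder = HiddenTextFinder()
    finder.feed(html)
    return finder.hidden_text
```

Calling `find_hidden_text` on a page that contains the slide's `<div style="visibility: hidden">` element returns that element's text, while ordinary visible paragraphs yield an empty list.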
Roadmap
• What is search engine spam?
• What techniques do spammers use?
• Work at Stanford
• Challenges ahead
Work at Stanford
Analysis
• Link spam farms and alliances
Demotion
• TrustRank
Detection
• Spam mass estimation
• See publications
Link Spam Farms & Alliances
Spammer’s goal: increase PageRank
Farm model
[Figure: farm model diagrams with nodes labeled 0, 1, 2, …, k]
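To make the spammer's goal concrete, the sketch below computes PageRank by power iteration on a small hypothetical graph. It assumes that node 0 is the target page, that nodes 1..k are boosting pages that link to it, and that the target links back to them; the surrounding "web" ring, the damping value of 0.85, and the function names are likewise assumptions for illustration, not details taken from the slides.

```python
def pagerank(out_links, damping=0.85, iterations=100):
    """Power-iteration PageRank; scores sum to 1.

    Pages without out-links spread their score uniformly over all pages.
    """
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            for w in out_links[v]:
                new_rank[w] += damping * rank[v] / len(out_links[v])
        rank = new_rank
    return rank

def farm_in_web(k, web_size=1000):
    """Hypothetical graph: `web_size` unrelated pages arranged in a ring, plus a
    farm where boosting pages 1..k link to target 0 and the target links back."""
    graph = {f"web{i}": [f"web{(i + 1) % web_size}"] for i in range(web_size)}
    graph[0] = list(range(1, k + 1))
    for i in range(1, k + 1):
        graph[i] = [0]
    return graph

if __name__ == "__main__":
    for k in (1, 10, 100):
        print(k, pagerank(farm_in_web(k))[0])
```

In this toy setting the target's score rises as k grows, which is the effect a farm is built to produce.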
Link Spam Farms & Alliances
Optimal farms
• Short loops including target
Alliances
• Interconnected farms
• 2 always better than 1
• Larger alliances often benefit all
TrustRank / Observation
[Figure: good pages and spam pages]
Approximate isolation of good pages: good pages seldom point to spam
TrustRank / Objective
Separate good pages from spam pages
What? Assign high scores to very good pages
How? Propagate scores from known good pages (seed set)
When? Use results in ranking
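The "How?" step can be sketched as a PageRank-style iteration whose restart mass goes only to the seed set. This is one common way to express score propagation, written here as a minimal illustration; the function name, the damping value of 0.85, and the five-page graph are assumptions for the example and not the exact setup of the talk.

```python
def trust_scores(out_links, seeds, damping=0.85, iterations=50):
    """Propagate trust from seed pages along out-links.

    Each round, every page hands `damping` times its trust to the pages it
    links to, in equal shares, and the seeds receive the remaining
    (1 - damping) share, split evenly among them.
    """
    seed_share = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in out_links}
    trust = dict(seed_share)
    for _ in range(iterations):
        new_trust = {v: (1.0 - damping) * seed_share[v] for v in out_links}
        for v, targets in out_links.items():
            for w in targets:
                new_trust[w] += damping * trust[v] / len(targets)
        trust = new_trust
    return trust

# Hypothetical graph: good pages g1..g3 link only to each other,
# spam pages s1 and s2 link to each other and to a good page.
example_graph = {
    "g1": ["g2"],
    "g2": ["g3"],
    "g3": ["g1"],
    "s1": ["s2", "g1"],
    "s2": ["s1"],
}
example_trust = trust_scores(example_graph, seeds={"g1"})
```

Because no good page in this graph links to spam, `s1` and `s2` never receive any trust, while all three good pages end up with positive scores. A single link from a good page to spam would leak trust, which is why the isolation is only approximate.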
TrustRank / Example
[Figure: example graph. Labels: 1; .50 (damping); .25 and .25 (splitting)]
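The numbers on this example are consistent with a simple per-link rule: damp the source page's score, then split it evenly across its out-links. The sketch below reproduces them under two assumptions that the transcript does not state outright, namely a damping factor of 0.5 and a splitting case where the page with score 1 has two out-links. The function name is hypothetical.

```python
def trust_passed_per_link(score, out_degree, damping):
    """Trust handed to each linked page: damp the score, then split it evenly."""
    return damping * score / out_degree

# Damping: a page with score 1 and a single out-link passes on 0.5.
damped = trust_passed_per_link(1.0, out_degree=1, damping=0.5)

# Splitting: with two out-links, each linked page receives 0.25.
split = trust_passed_per_link(1.0, out_degree=2, damping=0.5)
```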
TrustRank / Experiments
Data
• Site-level AltaVista web graph: 31M sites
• Seed set of 178 sites
Evaluation sample
• 1000 manually tagged sites
Results
• Log scores: 20 buckets
• Top 5 PageRank buckets: 15-20% spam
• Top 5 TrustRank buckets: almost no spam
Roadmap
• What is search engine spam?
• What techniques do spammers use?
• Work at Stanford
• Challenges ahead
Challenges
Remove economic incentive
• Why not just charge for high ranking?
• Revenue based on transactions generated, not click-through rate
Mechanism design
• Spam-proof algorithms/services
Spam on community-driven sites
• Flickr, MySpace, del.icio.us…
Thank You!
Stanford InfoLab
http://infolab.stanford.edu
Publications
http://dbpubs.stanford.edu
TrustRank / Experiments
Web data
• Entire AltaVista index (June 2003)
• Site-level web graph
  – 31M nodes
  – 13M without inlinks
Seed set
• 2500 candidates
• 178 selected high-quality sites
Evaluation sample
• 1000 manually tagged sites
TrustRank / Experiments
[Figure: chart with x-axis ticks 1 to 20 and y-axis ticks 2 to 6]
TrustRank / Experiments
[Figure: Average Demotion (# of Buckets) by PageRank Bucket, buckets 1 to 17+]
Spam from PageRank bucket #3 moved to TrustRank bucket #7