
An Overview of Search Engine Spam

Zoltán Gyöngyi

Stanford Security Workshop ● Stanford, March 19, 2007

Roadmap

• What is search engine spam?
• What techniques do spammers use?
• Work at Stanford
• Challenges ahead


Example

kaiser pharmacy online


Example

Save today on Viagra, Lipitor, Zoloft, …

Phentermine 90 Pills/$119

mp3


Example

Lawyers
Loans
Mortgage
Ringtones
Viagra

Pharmacy is the profession of compounding and dispensing medication. More recently, the term has come to include other services…


Definition

So… who does what?


Definition

Spamming: deliberate human action meant to trigger unjustifiably high ranking



Monetization

Better ranking = higher click-through rate
• Search engine optimization
• Affiliate spam
• Advertisement spam

Why?


Techniques How?


Techniques / Boosting / Term

[Diagram: taxonomy of spamming techniques (boosting and hiding), with boosting split into term and link]


Techniques / Boosting / Term

• repetition repetition repetition repetition repetition repetition
• dumortierite dumose dumous dump dumpage dumper dumpily dumpiness dumping dumpish dumpishly
• work in weaving three-women teams is an ancient textile art on looms
• please refrain from using the phrase stitching wounds located on the lower limbs
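As a rough illustration of why the repetition and dumping samples above stand out statistically, the sketch below computes two simple numbers for a block of text: the share of tokens taken by the most frequent term, and the ratio of distinct terms to total terms. The helper name `term_stats` and the choice of statistics are assumptions made for illustration; they are not a method from the talk.

```python
from collections import Counter
import re


def term_stats(text):
    """Return (top_term_share, distinct_ratio) for a block of text.

    Hypothetical helper: both statistics are illustrative heuristics,
    not a detection method described in the slides.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if not tokens:
        return 0.0, 0.0
    counts = Counter(tokens)
    # Share of all tokens taken by the single most frequent term.
    top_share = counts.most_common(1)[0][1] / len(tokens)
    # Number of distinct terms relative to the total token count.
    distinct_ratio = len(counts) / len(tokens)
    return top_share, distinct_ratio


repeated = "repetition repetition repetition repetition repetition repetition"
dumped = "dumortierite dumose dumous dump dumpage dumper dumpily dumpiness"
print(term_stats(repeated))  # one term takes every token
print(term_stats(dumped))    # every token is a distinct term
```

Ordinary prose tends to fall between these two extremes, which is why weaving and phrase stitching, which reuse natural text, are harder to flag with statistics this simple.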


Techniques / Boosting / Link

[Diagram: taxonomy of spamming techniques (boosting and hiding), with boosting split into term and link]


Techniques / Boosting / Link

Directory clones
• Duplicate (parts of) DMOZ or Yahoo! Directory

Comment spam
• Post messages (containing links) to
  – Blogs
  – (Unmoderated) forums
  – Wikis

Link spam farms
• Create colluding spam pages
• See later


Techniques / Hiding

[Diagram: taxonomy of spamming techniques, highlighting hiding]


Techniques / Hiding

Content hiding

<style type="text/css">
body {
  background-color: white;
  color: white;
}
</style>

<div style="visibility: hidden">You can’t see me!</div>

<a href="…"><img src="1x1.gif"></a>

Cloaking
• Identify web crawlers
• Serve a different version of the page
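The content-hiding snippets on this slide follow recognizable patterns, so a parser can flag at least the crudest cases. The sketch below uses Python's standard `html.parser` module to report elements hidden by an inline style and images whose file name suggests a 1x1 pixel. The class and function names are hypothetical, and the two checks are illustrative assumptions rather than a complete detector; the white-on-white stylesheet trick, for example, would need CSS parsing and is not covered.

```python
from html.parser import HTMLParser


class HiddenContentFinder(HTMLParser):
    """Collect elements matching two of the hiding tricks on the slide.

    Hypothetical helper; the checks are heuristics, not exhaustive.
    """

    def __init__(self):
        super().__init__()
        self.findings = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        # Normalize the inline style so "visibility: hidden" and
        # "visibility:hidden" are treated the same way.
        style = (attr.get("style") or "").replace(" ", "").lower()
        if "visibility:hidden" in style:
            self.findings.append((tag, "hidden by inline style"))
        src = attr.get("src") or ""
        if tag == "img" and "1x1" in src:
            self.findings.append((tag, "tiny image, likely an invisible link"))


def find_hidden(html):
    finder = HiddenContentFinder()
    finder.feed(html)
    return finder.findings


page = (
    '<div style="visibility: hidden">You can\'t see me!</div>'
    '<a href="#"><img src="1x1.gif"></a>'
)
for tag, reason in find_hidden(page):
    print(tag, "->", reason)
```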


Roadmap

• What is search engine spam?
• What techniques do spammers use?
• Work at Stanford
• Challenges ahead


Work at Stanford

Analysis
• Link spam farms and alliances

Demotion
• TrustRank

Detection
• Spam mass estimation
• See publications


Link Spam Farms & Alliances

Spammer’s goal: increase PageRank

Farm model

[Figure: farm model with pages labeled 0, 1, 2, …, k]


Link Spam Farms & Alliances

Optimal farms
• Short loops including target

Alliances
• Interconnected farms
• 2 always better than 1
• Larger alliances often benefit all
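To make the "short loops including target" point concrete, here is a minimal sketch that runs a simplified PageRank on a toy farm, once with the target linking back to its boosting pages and once without. Several things are assumptions of the sketch rather than details from the slides: the numbering (0 for the target, 1..k for boosting pages), the damping factor of 0.85, and the simplification that score held by pages without outlinks is not redistributed.

```python
def pagerank(out_links, damping=0.85, iterations=200):
    """Power-iteration PageRank over a dict {node: [targets]}.

    Simplification (an assumption of this sketch): score held by pages
    without outlinks is dropped rather than redistributed.
    """
    nodes = list(out_links)
    n = len(nodes)
    score = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        nxt = {v: (1.0 - damping) / n for v in nodes}
        for v, targets in out_links.items():
            for t in targets:
                nxt[t] += damping * score[v] / len(targets)
        score = nxt
    return score


def farm(k, loop_back):
    """Target page 0 plus k boosting pages 1..k that all link to it."""
    graph = {i: [0] for i in range(1, k + 1)}
    # With loop_back, the target links to every boosting page,
    # closing a two-step loop through each of them.
    graph[0] = list(range(1, k + 1)) if loop_back else []
    return graph


k = 10
no_loop = pagerank(farm(k, loop_back=False))[0]
with_loop = pagerank(farm(k, loop_back=True))[0]
print(f"target without loops: {no_loop:.4f}")
print(f"target with loops:    {with_loop:.4f}")
```

Under this simplified model the target scores higher when the loops are present, because score that would otherwise leave the farm is fed back to the boosting pages and returned to the target.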


TrustRank / Observation

[Figure: link graph of good pages and spam pages]

Approximate isolation of good pages: good pages seldom point to spam


TrustRank / Objective

Separate good pages from spam pages

What? Assign high scores to very good pages

How? Propagate scores from known good pages (seed set)

When? Use results in ranking
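The propagation step can be sketched as a PageRank-style iteration whose starting scores are concentrated on the seed set instead of being spread over all pages. The code below is a minimal sketch under stated assumptions: the function name, the damping value of 0.85, the iteration count, and the toy graph are all hypothetical, and the talk's actual implementation may differ.

```python
def trustrank(out_links, seeds, damping=0.85, iterations=50):
    """Propagate trust from seed pages along outlinks.

    Sketch of a seed-biased PageRank; parameter values are assumptions.
    """
    nodes = list(out_links)
    # Only seed pages receive a base score.
    base = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in nodes}
    trust = dict(base)
    for _ in range(iterations):
        nxt = {v: (1.0 - damping) * base[v] for v in nodes}
        for v, targets in out_links.items():
            for t in targets:
                # Each outlink carries an equal, damped share of v's trust.
                nxt[t] += damping * trust[v] / len(targets)
        trust = nxt
    return trust


# Hypothetical toy graph: spam pages may link to good pages,
# but good pages never link to spam.
graph = {
    "seed": ["good1", "good2"],
    "good1": ["good2"],
    "good2": [],
    "spam1": ["spam2", "good1"],
    "spam2": ["spam1"],
}
scores = trustrank(graph, seeds={"seed"})
for page, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{page:6s} {s:.4f}")
```

Because no good page links to the spam pages in this toy graph, they receive no trust at all. On real data the isolation is only approximate, so spam pages would receive small rather than zero scores.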


TrustRank / Example

[Figure: trust propagating from a seed page with score 1]
• Damping: .50
• Splitting: .25 and .25
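One reading of the numbers in this example, offered as an assumption since the figure itself is not reproduced here: a seed page with score 1 passes on half of its score (damping factor .50), and when it has two outlinks that amount is split equally between them. The helper below is hypothetical and only reproduces this arithmetic.

```python
def propagate_one_hop(score, out_degree, damping):
    """Trust carried by each outlink after damping and equal splitting.

    Hypothetical helper; assumes the damped score is divided evenly
    among the page's outlinks.
    """
    return damping * score / out_degree


# Damping only: a single outlink from a seed with score 1.
print(propagate_one_hop(1.0, out_degree=1, damping=0.5))  # 0.5
# Damping and splitting: two outlinks from the same seed.
print(propagate_one_hop(1.0, out_degree=2, damping=0.5))  # 0.25
```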


TrustRank / Experiments

Data
• Site-level AltaVista web graph: 31M sites
• Seed set of 178 sites

Evaluation sample
• 1000 manually tagged sites

Results
• Log scores, 20 buckets
• Top 5 PageRank buckets: 15-20% spam
• Top 5 TrustRank buckets: almost no spam
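The evaluation can be pictured as two steps: assign each site to a bucket by its score, then measure the share of manually tagged spam in the top buckets. The sketch below assumes equal-width buckets over log10 of the score, which is only one possible reading of "log scores, 20 buckets". The function names, toy scores, and labels are hypothetical and are not data from the experiments.

```python
import math


def log_buckets(scores, num_buckets=20):
    """Map each site to a bucket 1..num_buckets by log score (1 = highest).

    Assumption of this sketch: equal-width buckets over log10(score).
    """
    logs = {site: math.log10(s) for site, s in scores.items()}
    hi, lo = max(logs.values()), min(logs.values())
    # Fall back to width 1.0 when all scores are equal.
    width = (hi - lo) / num_buckets or 1.0
    return {
        site: min(num_buckets, int((hi - v) / width) + 1)
        for site, v in logs.items()
    }


def spam_share(buckets, labels, top=5):
    """Fraction of labeled sites in buckets 1..top that are tagged spam."""
    tagged = [labels[s] for s, b in buckets.items() if b <= top and s in labels]
    return sum(tagged) / len(tagged) if tagged else 0.0


# Hypothetical toy data: four sites, three buckets.
scores = {"a": 0.5, "b": 0.2, "c": 0.01, "d": 0.0005}
labels = {"a": False, "b": True, "c": False, "d": True}
buckets = log_buckets(scores, num_buckets=3)
print(buckets)
print(spam_share(buckets, labels, top=1))
```

Running the same measurement once with PageRank scores and once with TrustRank scores would give the kind of per-bucket comparison summarized in the results above.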


Roadmap

• What is search engine spam?
• What techniques do spammers use?
• Work at Stanford
• Challenges ahead


Challenges

Remove economic incentive
• Why not just charge for high ranking?
• Revenue based on transactions generated, not click-through rate

Mechanism design
• Spam-proof algorithms/services

Spam on community-driven sites
• Flickr, MySpace, del.icio.us…


Thank You!

Stanford InfoLab
http://infolab.stanford.edu

Publications
http://dbpubs.stanford.edu



TrustRank / Experiments

Web data
• Entire AltaVista index (June 2003)
• Site-level web graph
  – 31M nodes
  – 13M without inlinks

Seed set
• 2500 candidates
• 178 selected high-quality sites

Evaluation sample
• 1000 manually tagged sites


TrustRank / Experiments

[Figure: average demotion (# of buckets) by PageRank bucket]

Spam from PageRank bucket #3 moved to TrustRank bucket #7
