An Overview of Search Engine Spam Zoltán Gyöngyi

Stanford Security Workshop ● Stanford, March 19, 2007

Roadmap
• What is search engine spam?
• What techniques do spammers use?
• Work at Stanford
• Challenges ahead

Example

kaiser pharmacy online

Example

Save today on Viagra, Lipitor, Zoloft, …

Phentermine 90 Pills/$119

mp3

Example

Lawyers
Loans
Mortgage
Ringtones
Viagra

Pharmacy is the profession of compounding and dispensing medication. More recently, the term has come to include other services…

Definition

Spamming: deliberate human action meant to trigger unjustifiably high ranking

So… who does what?

Monetization
Better ranking = higher click-through rate
• Search engine optimization
• Affiliate spam
• Advertisement spam

Why?

Techniques How?

Techniques / Boosting / Term

[Diagram: taxonomy of techniques (how?): boosting, with term and link branches, and hiding]

Techniques / Boosting / Term
• Repetition: repetition repetition repetition repetition repetition repetition
• Dumping: dumortierite dumose dumous dump dumpage dumper dumpily dumpiness dumping dumpish dumpishly
• Weaving: work in weaving three-women teams is an ancient textile art on looms
• Phrase stitching: please refrain from using the phrase stitching wounds located on the lower limbs
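To make the contrast between repetition and dumping concrete, here is a minimal sketch of two crude text statistics. The function name `term_stats` and the choice of statistics are illustrative assumptions, not something presented in the talk, and real detectors would need far more than this.

```python
from collections import Counter


def term_stats(text):
    """Return (top_share, distinct_ratio) for a whitespace-tokenized text.

    top_share:      fraction of tokens taken by the most frequent term
                    (close to 1 when a single term is repeated).
    distinct_ratio: distinct terms divided by total tokens
                    (close to 1 when nearly every word is different,
                    as in a dictionary dump).
    """
    tokens = text.lower().split()
    if not tokens:
        return 0.0, 0.0
    counts = Counter(tokens)
    top_share = counts.most_common(1)[0][1] / len(tokens)
    distinct_ratio = len(counts) / len(tokens)
    return top_share, distinct_ratio


# The two example strings from the slide.
repetition = "repetition repetition repetition repetition repetition repetition"
dumping = ("dumortierite dumose dumous dump dumpage dumper dumpily "
           "dumpiness dumping dumpish dumpishly")

print("repetition:", term_stats(repetition))
print("dumping:   ", term_stats(dumping))
```

On the slide's own examples, the repeated text has a single term covering every token, while the dumped text has no repeated term at all. Ordinary prose tends to fall between these extremes, which is why weaving and phrase stitching are harder to spot with counts alone.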

Techniques / Boosting / Link
Directory clones
• Duplicate (parts of) DMOZ or Yahoo! Directory
Comment spam
• Post messages (containing links) to
  – Blogs
  – (Unmoderated) forums
  – Wikis
Link spam farms
• Create colluding spam pages
• See later

Techniques / Hiding
Content hiding

<style type="text/css">
body {
  background-color: white;
  color: white;
}
</style>

<div style="visibility: hidden">You can’t see me!</div>

<a href="…"><img src="1x1.gif"></a>

Cloaking
• Identify web crawlers
• Serve a different version of the page
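One of the content-hiding tricks on this slide, an inline style that makes an element invisible, is easy to look for mechanically. The following is a minimal sketch using Python's standard `html.parser`; the class and function names are hypothetical, and the check covers only inline `style` attributes, not stylesheets, scripts, or cloaking.

```python
from html.parser import HTMLParser


class HiddenContentFinder(HTMLParser):
    """Record start tags whose inline style makes the element invisible."""

    def __init__(self):
        super().__init__()
        self.findings = []  # list of (tag, reason) tuples

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        # Normalize so "visibility: hidden" and "visibility:hidden" match.
        style = (attr.get("style") or "").replace(" ", "").lower()
        if "visibility:hidden" in style or "display:none" in style:
            self.findings.append((tag, "invisible inline style"))


def find_hidden(html):
    """Return the findings for one HTML string."""
    finder = HiddenContentFinder()
    finder.feed(html)
    finder.close()
    return finder.findings


page = '<div style="visibility: hidden">You can\'t see me!</div><p>Hello</p>'
print(find_hidden(page))
```

A check like this is trivially evaded (for example by moving the rule into an external stylesheet), which is part of why hiding techniques are hard to police.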

Work at Stanford
Analysis
• Link spam farms and alliances
Demotion
• TrustRank
Detection
• Spam mass estimation
• See publications

Link Spam Farms & Alliances
Spammer’s goal: increase PageRank
Farm model

[Figure: two farm diagrams, each with nodes labeled 0, 1, 2 and k]

Link Spam Farms & Alliances
Optimal farms
• Short loops including target
Alliances
• Interconnected farms
• 2 always better than 1
• Larger alliances often benefit all
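The claim that two allied farms do better than two separate ones can be checked on a toy graph. The sketch below is an illustration under stated assumptions, not the model from the talk: each farm is assumed to be one target page plus k boosting pages that link to it, a lone target links back to its boosting pages (a short loop), and allied targets link to each other instead. The names `pagerank`, `separate_farms` and `allied_farms` are hypothetical.

```python
def pagerank(out_links, c=0.85, iterations=200):
    """Plain PageRank by fixed-point iteration.

    out_links maps each node to the list of nodes it links to. Every node
    must have at least one out-link (dangling nodes are not handled).
    """
    nodes = list(out_links)
    n = len(nodes)
    score = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        nxt = {v: (1.0 - c) / n for v in nodes}  # random-jump share
        for v, targets in out_links.items():
            share = c * score[v] / len(targets)
            for t in targets:
                nxt[t] += share
        score = nxt
    return score


def separate_farms(k):
    """Two farms that ignore each other; each target links back to its boosters."""
    graph = {}
    for name in ("a", "b"):
        target = f"{name}0"
        boosters = [f"{name}{i}" for i in range(1, k + 1)]
        graph[target] = list(boosters)
        for b in boosters:
            graph[b] = [target]
    return graph


def allied_farms(k):
    """Same pages, but the two targets link to each other instead."""
    graph = separate_farms(k)
    graph["a0"] = ["b0"]
    graph["b0"] = ["a0"]
    return graph


k = 3
alone = pagerank(separate_farms(k))["a0"]
allied = pagerank(allied_farms(k))["a0"]
print(f"target score, separate farms: {alone:.4f}")
print(f"target score, allied farms:   {allied:.4f}")
```

With k = 3 and a damping factor of 0.85, the target's score comes out at roughly 0.24 for separate farms and roughly 0.44 for the alliance. The absolute numbers mean little on an eight-page graph; the point is only the direction of the change.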

TrustRank / Observation

[Figure: link graph of good pages and spam pages]

Approximate isolation of good pages: good pages seldom point to spam

TrustRank / Objective
Separate good pages from spam pages

What? Assign high scores to very good pages
How? Propagate scores from known good pages (seed set)
When? Use results in ranking
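The idea of propagating scores from a seed set can be sketched as follows. This is a minimal illustration that assumes trust spreads like a PageRank whose random jump lands only on seed pages; the transcript does not give the exact update rule, and the function name, the damping value and the toy graph are all hypothetical.

```python
def trust_rank(out_links, seeds, damping=0.85, iterations=100):
    """Propagate trust from a seed set along out-links.

    Each round, a page receives a damped, evenly split share of the trust
    of every page linking to it; seed pages also receive a fixed share.
    """
    nodes = list(out_links)
    seed_share = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in nodes}
    trust = dict(seed_share)
    for _ in range(iterations):
        nxt = {v: (1.0 - damping) * seed_share[v] for v in nodes}
        for v, targets in out_links.items():
            if not targets:
                continue  # trust reaching a page without out-links is dropped
            share = damping * trust[v] / len(targets)
            for t in targets:
                nxt[t] += share
        trust = nxt
    return trust


# Hypothetical toy graph: good pages link among themselves; spam pages link
# to each other and to a good page, but no good page links to spam.
web = {
    "good1": ["good2", "good3"],
    "good2": ["good3"],
    "good3": ["good1"],
    "spam1": ["spam2", "good1"],
    "spam2": ["spam1"],
}
scores = trust_rank(web, seeds={"good1"})
for page in sorted(scores, key=scores.get, reverse=True):
    print(f"{page}: {scores[page]:.3f}")
```

Because no good page in this toy graph links to spam, the spam pages end with a score of exactly zero, however many links they exchange with each other. On real data the isolation is only approximate, so spam pages can still pick up some trust.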

TrustRank / Example

[Figure: example graph. Damping: a score of 1 becomes .50 one link away. Splitting: a score of 1 yields .25 and .25]
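The numbers on the example slides are consistent with a single propagation step that first damps a page's score and then splits it evenly over its out-links. That reading, a damping factor of .5 and a page with two out-links on the splitting slide, is an assumption; the figure itself did not survive in the transcript. A tiny sketch with a hypothetical helper:

```python
def pass_trust(score, out_degree, damping=0.5):
    """Trust handed to each linked page in one propagation step.

    Damping shrinks the score by a constant factor per hop; splitting
    divides what is left evenly among the out-links.
    """
    return damping * score / out_degree


print(pass_trust(1.0, out_degree=1))  # 0.5
print(pass_trust(1.0, out_degree=2))  # 0.25
```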

TrustRank / Experiments
Data
• Site-level AltaVista web graph: 31M sites
• Seed set of 178 sites
Evaluation sample
• 1000 manually tagged sites
Results
• Log scores 20 buckets
• Top 5 PageRank buckets: 15-20% spam
• Top 5 TrustRank buckets: almost no spam

Challenges
Remove economic incentive
• Why not just charge for high ranking?
• Revenue based on transactions generated, not click-through rate
Mechanism design
• Spam-proof algorithms/services
Spam on community-driven sites
• Flickr, MySpace, del.icio.us…

Thank You!

Stanford InfoLab
http://infolab.stanford.edu

Publications
http://dbpubs.stanford.edu

[email protected]

TrustRank / Experiments
Web data
• Entire AltaVista index (June 2003)
• Site-level web graph
  – 31M nodes
  – 13M without inlinks
Seed set
• 2500 candidates
• 178 selected high-quality sites
Evaluation sample
• 1000 manually tagged sites

TrustRank / Experiments

[Figure: average demotion (# of buckets) per PageRank bucket, buckets 1 to 17+]

Spam from PageRank bucket #3 moved to TrustRank bucket #7