Do Not Crawl In The DUST: Different URLs Similar Text. Uri Schonfeld, Department of Electrical Engineering, Technion. Joint work with Dr. Ziv Bar-Yossef and Dr. Idit Keidar.


Page 1

Do Not Crawl In The DUST: Different URLs Similar Text

Uri Schonfeld, Department of Electrical Engineering, Technion

Joint work with Dr. Ziv Bar-Yossef and Dr. Idit Keidar

Page 2

Problem statement and motivation
Related work
Our contribution
The DustBuster algorithm
Experimental results
Concluding remarks

Talk Outline

Page 3

DUST – Different URLs Similar Text. Examples:

Standard canonization: "http://domain.name/index.html" → "http://domain.name"

Domain names and virtual hosts: "http://news.google.com" → "http://google.com/news"

Aliases and symbolic links: "http://domain.name/~shuri" → "http://domain.name/people/shuri"

Parameters with little effect on content: Print=1

URL transformations: "http://domain.name/story_" → "http://domain.name/story?id="

Even the WWW Gets Dusty

Page 4

DUST rule: transforms one URL to another. Example: "index.html" → ""

Valid DUST rule: r is a valid DUST rule w.r.t. site S if for every URL u ∈ S:
r(u) is a valid URL
r(u) and u have "similar" contents

Why similar and not identical? Comments, news, text ads, counters

DUST Rules!

Page 5

Expensive to crawl: we access the same document via multiple URLs

Forces us to shingle: an expensive technique used to discover similar documents

Ranking algorithms suffer: references to a document are split among its aliases

Multiple identical results: the same document is returned several times in the search results

Any algorithm based on URLs suffers

DUST is Bad

Page 6

Given: a list of URLs from a site S (crawl log or web server log)

Want: to find valid DUST rules w.r.t. S
As many as possible
Including site-specific ones
Minimize the number of fetches

Applications: site-specific canonization, more efficient crawling

We Want To

Page 7

Domain name aliases
Standard extensions
Default file names: index.html, default.htm
File path canonizations: "dirname/../" → "", "//" → "/"
Escape sequences: "%7E" → "~"

How do we Fight DUST Today? (1) Standard Canonization
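To make these standard canonization rules concrete, here is a minimal Python sketch; the exact rule set (which default file names to drop, which escapes to decode) is an illustrative assumption, not the canonizer used in the work.

    import posixpath
    from urllib.parse import urlsplit, urlunsplit, unquote

    DEFAULT_FILES = {"index.html", "index.htm", "default.htm"}  # assumed list

    def standard_canonize(url):
        """Apply standard, site-independent canonization rules to a URL."""
        scheme, netloc, path, query, _ = urlsplit(url)
        netloc = netloc.lower()                      # domain names are case-insensitive
        path = unquote(path)                         # escape sequences: "%7E" -> "~"
        path = posixpath.normpath(path) if path else "/"  # "dirname/../" -> "", "a//b" -> "a/b"
        if posixpath.basename(path) in DEFAULT_FILES:     # drop default file names
            path = posixpath.dirname(path)
        return urlunsplit((scheme, netloc, path, query, ""))

    # Both calls print "http://domain.name/people/~shuri"
    print(standard_canonize("HTTP://Domain.Name/people/../people/%7Eshuri/index.html"))
    print(standard_canonize("http://domain.name/people/~shuri"))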

Page 8

Site-specific DUST:
"story_" → "story?id="
"news.google.com" → "google.com/news"
"labs" → "laboratories"

This DUST is harder to find

Standard Canonization is not Enough

Page 9

Shingles are document sketches [Broder, Glassman, Manasse 97]
Used to compare documents for similarity
Pr(shingles are equal) = document similarity
Compare documents by comparing their shingles

Calculating a shingle:
Take all m-word sequences
Hash each with h_i
Choose the minimum – that's your shingle

How do we Fight DUST Today? (2) Shingles
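A minimal sketch of this min-hash shingling, assuming m-word shingles and MD5 salted with the hash index as the family h_i; the concrete hash family and parameters are illustrative assumptions, not those of the cited papers.

    import hashlib

    def sketch(text, m=4, num_hashes=8):
        """Min-hash sketch: for each hash function h_i, keep the minimum hash
        over all m-word sequences (shingles) of the text."""
        words = text.split()
        grams = [" ".join(words[i:i + m]) for i in range(max(1, len(words) - m + 1))]
        def h(i, g):  # h_i: salt the shingle with the hash index before hashing
            return int(hashlib.md5(f"{i}:{g}".encode()).hexdigest(), 16)
        return [min(h(i, g) for g in grams) for i in range(num_hashes)]

    def similarity(doc_a, doc_b):
        """Estimated similarity = fraction of hash functions whose minima agree."""
        sa, sb = sketch(doc_a), sketch(doc_b)
        return sum(x == y for x, y in zip(sa, sb)) / len(sa)

The fraction of agreeing coordinates estimates how similar the two documents are, which is the kind of comparison used later when validating DUST rules.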

Page 10

Shingles are expensive: they require a fetch, parsing, and hashing

Shingles do not find rules; therefore, they are not applicable to new pages

Shingles are Not Perfect

Page 11

Mirror detection [Bharat, Broder 99], [Bharat, Broder, Dean, Henzinger 00], [Cho, Shivakumar, Garcia-Molina 00], [Liang 01]

Identifying plagiarized documents [Hoad, Zobel 03]

Finding near-replicas [Shivakumar, Garcia-Molina 98], [Di Iorio, Diligenti, Gori, Maggini, Pucci 03]

Copy detection [Brin, Davis, Garcia-Molina 95], [Garcia-Molina, Gravano, Shivakumar 96], [Shivakumar, Garcia-Molina 96]

More Related Work

Page 12

An algorithm that finds site-specific valid DUST rules and requires a minimal number of fetches

Convincing results in experiments

Benefits to crawling

Our Contributions

Page 13

Alias DUST: simple substring substitutions
"story_1259" → "story?id=1259"
"news.google.com" → "google.com/news"
"/index.html" → ""

Parameter DUST:
Standard URL structure: protocol://domain.name/path/name?para=val&pa=va
Some parameters do not affect content:
They can be removed
They can be changed to a default value

Types of DUST
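To make the parameter DUST rule types concrete, here is a minimal Python sketch; the parameter names (sessionid, Print, lang) and default values are hypothetical examples, not rules learned by DustBuster.

    from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

    DROP_PARAMS = {"sessionid", "Print"}   # hypothetical: parameters that can be removed
    DEFAULT_PARAMS = {"lang": "en"}        # hypothetical: parameters pinned to a default

    def apply_parameter_dust(url):
        """Sketch: remove content-irrelevant parameters, pin others to defaults."""
        scheme, netloc, path, query, frag = urlsplit(url)
        kept = []
        for name, value in parse_qsl(query, keep_blank_values=True):
            if name in DROP_PARAMS:
                continue                                          # rule: remove parameter
            kept.append((name, DEFAULT_PARAMS.get(name, value)))  # rule: default value
        return urlunsplit((scheme, netloc, path, urlencode(kept), frag))

    # "?id=1259&Print=1&lang=fr" and "?id=1259&lang=en" now canonize to the same URL
    print(apply_parameter_dust("http://domain.name/story?id=1259&Print=1&lang=fr"))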

Page 14

Input: URL list
Detect likely DUST rules
Eliminate redundant rules
Validate DUST rules using samples:
Eliminate DUST rules that are "wrong"
Further eliminate duplicate DUST rules

No Fetch Required

Our Basic Framework

Page 15

Large support principle: likely DUST rules have lots of "evidence" supporting them

Small buckets principle: ignore evidence that supports many different rules

How to detect likely DUST rules?

Page 16

A pair of URLs (u,v) is an instance of rule r if r(u) = v

Support(r) = the set of all instances (u,v) of r

Large Support Principle: the support of a valid DUST rule is large

Page 17

Rule Support: An Equivalent View

α: a string. Ex: α = "story_"

u: a URL that contains α as a substring. Ex: u = "http://www.sitename.com/story_2659"

Envelope of α in u: a pair of strings (p,s)
p: prefix of u preceding α
s: suffix of u succeeding α
Example: p = "http://www.sitename.com/", s = "2659"

E(α): all envelopes of α in URLs that appear in the input URL list
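A small sketch of computing E(α) from a URL list, assuming every occurrence of α in every URL contributes one envelope:

    def envelopes(alpha, urls):
        """E(alpha): all (prefix, suffix) envelopes of substring alpha in the URL list."""
        env = set()
        for u in urls:
            start = u.find(alpha)
            while start != -1:
                env.add((u[:start], u[start + len(alpha):]))
                start = u.find(alpha, start + 1)
        return env

    urls = ["http://www.sitename.com/story_2659",
            "http://www.sitename.com/story?id=2659"]
    print(envelopes("story_", urls))     # {('http://www.sitename.com/', '2659')}
    print(envelopes("story?id=", urls))  # {('http://www.sitename.com/', '2659')}

The two envelope sets intersect, which is exactly the evidence counted by the lemma on the next slide.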

Page 18

Envelopes Example

Page 19

Rule Support: An Equivalent View

α → β: an alias DUST rule. Ex: α = "story_", β = "story?id="

Lemma: |Support(α → β)| = |E(α) ∩ E(β)|

Proof:
bucket(p,s) = { γ | (p,s) ∈ E(γ) }
Observation: (u,v) is an instance of α → β if and only if u = p◦α◦s and v = p◦β◦s for some envelope (p,s)
Hence, (u,v) is an instance of α → β iff (p,s) ∈ E(α) ∩ E(β)

Page 20

Large Buckets

Often there is a large set of substrings that are interchangeable within a given URL while not being DUST:
page=1, page=2, …
lecture-1.pdf, lecture-2.pdf, …

This gives rise to large buckets.

Page 21

Big buckets: a popular prefix and suffix
Often do not contain similar content
Big buckets are expensive to process

"I am a DUCK, not a DUST"

Small Buckets Principle: most of the support of valid alias DUST rules is likely to belong to small buckets

Page 22

Scan the log and form buckets
Ignore big buckets
For each small bucket:
For every two substrings α, β in the bucket, print (α, β)
Sort by (α, β)
For every pair (α, β):
Count its occurrences
If (count > threshold), print α → β

Algorithm – Detecting Likely DUST Rules. No fetch here!
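A hedged Python sketch of this detection step, assuming the candidate substrings are given and using made-up thresholds (max_bucket, min_support); the real DustBuster enumerates short, tokenized substrings of the URLs and works by sorting rather than in-memory counting.

    from collections import defaultdict, Counter

    def detect_likely_rules(urls, substrings, max_bucket=6, min_support=3):
        """Bucket substrings by shared envelopes, skip big buckets, and count
        how often a pair of substrings co-occurs in a small bucket."""
        buckets = defaultdict(set)   # (prefix, suffix) -> substrings seen inside it
        for u in urls:
            for alpha in substrings:
                start = u.find(alpha)
                while start != -1:
                    buckets[(u[:start], u[start + len(alpha):])].add(alpha)
                    start = u.find(alpha, start + 1)
        support = Counter()
        for members in buckets.values():
            if len(members) > max_bucket:        # small buckets principle
                continue
            for a in members:
                for b in members:
                    if a != b:
                        support[(a, b)] += 1     # one piece of evidence for a -> b
        return [(a, b, c) for (a, b), c in support.items() if c >= min_support]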

Page 23

Size and Comments

Consider only instances of rules whose recorded sizes "match"
Use ranges of sizes
Running time: O(L log L)
Process only short substrings
Tokenize URLs

Page 24

Input: URL list
Detect likely DUST rules
Eliminate redundant rules
Validate DUST rules using samples:
Eliminate DUST rules that are "wrong"
Further eliminate duplicate DUST rules

No Fetch Required

Our Basic Framework

Page 25

Eliminating Redundant Rules

Rule φ refines rule ψ if SUPPORT(φ) ⊆ SUPPORT(ψ)
Example: "/vlsi/" → "/labs/vlsi/" refines "/vlsi" → "/labs/vlsi"

Lemma: a substitution rule α' → β' refines rule α → β if and only if there exists an envelope (γ,δ) such that α' = γ◦α◦δ and β' = γ◦β◦δ
(In the example, take γ = "" and δ = "/")

The lemma helps us identify refinements easily: does φ refine ψ? Then remove ψ if their supports match

No Fetch here!
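The lemma makes the refinement test purely syntactic; a minimal sketch, assuming rules are (α, β) string pairs and ◦ is plain concatenation:

    def refines(rule1, rule2):
        """(a1 -> b1) refines (a2 -> b2) iff there is an envelope (gamma, delta)
        with a1 = gamma + a2 + delta and b1 = gamma + b2 + delta."""
        (a1, b1), (a2, b2) = rule1, rule2
        start = a1.find(a2)
        while start != -1:
            gamma, delta = a1[:start], a1[start + len(a2):]
            if gamma + b2 + delta == b1:
                return True
            start = a1.find(a2, start + 1)
        return False

    # "/vlsi/" -> "/labs/vlsi/" refines "/vlsi" -> "/labs/vlsi" (gamma = "", delta = "/")
    print(refines(("/vlsi/", "/labs/vlsi/"), ("/vlsi", "/labs/vlsi")))  # True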

Page 26

Validating Likely Rules

For each likely rule r, in both directions:
Find sample URLs from the list to which r is applicable
For each URL u in the sample:
v = r(u)
Fetch u and v
Check whether content(u) is similar to content(v)
If the fraction of similar pairs > threshold, declare rule r valid
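A sketch of this validation loop in Python; fetch and similar are assumed helpers (an HTTP client and, e.g., the shingle comparison above), and sample_size / threshold are illustrative parameters rather than the values used in the experiments.

    import random

    def validate_rule(alpha, beta, urls, fetch, similar,
                      sample_size=100, threshold=0.9):
        """Sample URLs to which the rule alpha -> beta applies, fetch both sides,
        and declare the rule valid if enough pairs have similar content."""
        candidates = [u for u in urls if alpha in u]
        sample = random.sample(candidates, min(sample_size, len(candidates)))
        ok = 0
        for u in sample:
            v = u.replace(alpha, beta, 1)        # apply the rule once
            content_u, content_v = fetch(u), fetch(v)
            if content_u is not None and content_v is not None and similar(content_u, content_v):
                ok += 1
        return bool(sample) and ok >= threshold * len(sample)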

Page 27

Assumption: if a rule passes the validation threshold on a sample of 100 URLs, it will keep passing on any larger sample

Why isn't the threshold 100%?
A 95%-valid rule may still be worth it
Dynamic pages change often

Comments About Validation

Page 28

We experiment on the logs of two web sites: a dynamic forum and an academic site

Rules were detected from a log of about 20,000 unique URLs

On each site we used four logs from different time periods

Experimental Setup

Page 29

Precision at k

Page 30

Precision vs. Validation

Page 31

How much of the DUST do we find? What other duplicates are there?
Soft errors
True copies: last semester's course, all authors of a paper
Frames
Image galleries

Recall

Page 32

In the crawl we examined, 18% of the crawl was reduced.

DUST Distribution:
47.1% DUST
25.7% Images
7.6% Soft Errors
17.9% Exact Copy
1.8% Misc

Page 33

DustBuster is an efficient algorithm:
It finds DUST rules
It can reduce a crawl
It can benefit ranking algorithms

Conclusions

Page 34

THE END

Page 35

= => --> all rules with “” Fix drawing urls crossing alpha not

all p and all s

Things to fix

Page 36

So far, rules are non-directional
Prefer shrinking rules
Prefer lexicographically lowering rules
Check those directions first

Page 37

Parameter name and possible values
What rules:
Remove the parameter
Substitute one value with another
Substitute all values with a single value

Rules are validated the same way the alias rules are

Will not discuss further

Parametric DUST

Page 38

Unfortunately we see a lot of "wrong" rules:
Substituting 1 with 2
Just wrong: one domain name with another running similar software

Examples of false rules:
/YoninaEldar/ != /DavidMalah/
/labs/vlsi/oldsite != /labs/vlsi
-2. != -3.

False Rules

Page 39

Filtering out False Rules

Getting rid of the big buckets
Using the size field:
False DUST rules may give valid URLs, but the content is not similar and the size is probably different
Size ranges are used

Tokenization helps

Page 40

DustBuster – cleaning up the rules

Go over the list with a window
If rule a refines rule b and their support sizes are close, leave only rule a

Page 41

DustBuster – Validation

Validation is per rule:
Get sample URLs to which the rule can be applied
Apply the rule: URL => applied URL
Get the content
Compare using shingles

Page 42

DustBuster – Validation

Stop fetching when: #failures > 100 * (1 - threshold)

A page that doesn't exist is not similar to anything else

Why use a threshold < 100%?
Shingles are not perfect
Dynamic pages may change a lot, and fast

Page 43

Detect Alias DUST – take 2

Tokenize, of course
Form buckets
Ignore big buckets
Count support only if the sizes match
Don't count long substrings
Results are cleaner

Page 44

Eliminate Redundancies

1:  EliminateRedundancies(pairs_list R)
2:  for i = 1 to |R| do
3:    if (already eliminated R[i]) continue
4:    to_eliminate_current := false
      /* Go over a window */
5:    for j = 1 to min(MW, |R| - i) do
        /* Support not close? Stop checking */
6:      if (R[i].size - R[i+j].size > max(MRD*R[i].size, MAD)) break
        /* a refines b? remove b */
7:      if (R[i] refines R[i+j])
8:        eliminate R[i+j]
9:      else if (R[i+j] refines R[i]) then
10:       to_eliminate_current := true
11:       break
12:   if (to_eliminate_current)
13:     eliminate R[i]
14: return R

No Fetch here!

Page 45

Validate a Single Rule

1:  ValidateRule(R, L)
2:  positive := 0
3:  negative := 0
    /* Stop when you are sure you either succeeded or failed */
4:  while (positive < (1 - ε)N) AND (negative < εN) do
5:    u := a random URL from L to which R is applicable
6:    v := outcome of application of R to u
7:    fetch u and v
8:    if (fetch u failed) continue
      /* Something went wrong, negative sample */
9:    if (fetch v failed) OR (shingling(u) ≠ shingling(v))
10:     negative := negative + 1
      /* Another positive sample */
11:   else
12:     positive := positive + 1
13: if (negative ≥ εN)
14:   return FALSE
15: return TRUE

Page 46

Validate Rules

1:  Validate(rules_list R, test_log L)
2:  create list of rules LR
3:  for i = 1 to |R| do
      /* Go over rules that survived = valid rules */
4:    for j = 1 to i - 1 do
5:      if (R[j] was not eliminated AND R[i] refines R[j])
6:        eliminate R[i] from the list
7:        break
8:    if (R[i] was eliminated)
9:      continue
      /* Test one direction */
10:   if (ValidateRule(R[i].alpha → R[i].beta, L))
11:     add R[i].alpha → R[i].beta to LR
      /* Test the other direction only if the first direction failed */
12:   else if (ValidateRule(R[i].beta → R[i].alpha, L))
13:     add R[i].beta → R[i].alpha to LR
14:   else
15:     eliminate R[i] from the list
16: return LR