
Page 1:

Crawlers

Padmini Srinivasan

Computer Science Department Department of Management Sciences

http://cs.uiowa.edu/~psriniva
padmini-srinivasan@uiowa.edu

Page 2:

Basics

• What is a crawler?
  – HTTP client software that sends out an HTTP request for a page and reads the response (see the sketch below).
• Timeouts
• How much to download?
• Exception handling
• Error handling
• Collect statistics: time-outs, etc.
• Follows the Robots Exclusion Protocol (de facto standard, 1994 onwards)
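A minimal sketch of this fetch step using only the Python standard library; the size cap and user agent handling are illustrative assumptions, not part of the slides:

import urllib.request
import urllib.error

MAX_BYTES = 1_000_000  # illustrative cap on how much to download per page

def fetch(url, timeout=10):
    """Fetch one page, handling timeouts and errors; return (html, error)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read(MAX_BYTES)              # limit the download size
            return body.decode("utf-8", errors="replace"), None
    except urllib.error.HTTPError as e:              # 4xx/5xx responses
        return None, f"http {e.code}"
    except (urllib.error.URLError, OSError) as e:    # DNS failures, timeouts, etc.
        return None, f"network: {e}"

A real crawler would tally the error strings returned here to collect the time-out and failure statistics mentioned above.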

Page 3:

Tippie web site

# robots.txt for http://www.tippie.uiowa.edu/ or http://tippie.uiowa.edu/
# Rules for all robots accessing the site.
User-agent: *
Disallow: /error-pages/
Disallow: /includes/
Disallow: /Redirects/
Disallow: /scripts/
Disallow: /CFIDE/

# Individual folders that should not be indexed
Disallow: /vaughan/Board/
Disallow: /economics/mwieg/
Disallow: /economics/midwesttheory/
Disallow: /undergraduate/scholars/

Sitemap: http://tippie.uiowa.edu/sitemap.xml

Page 4:

Robots.txt

Allow everything:
User-agent: *
Disallow:

Ban one robot:
User-agent: BadBot
Disallow: /

Allow only Google:
User-agent: Google
Disallow:
User-agent: *
Disallow: /

Per-page alternative via a meta tag:
<html><head><meta name="googlebot" content="noindex">

http://www.robotstxt.org/
Legally binding? No. But it has been used in legal cases.
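Checking these rules before fetching is straightforward with Python's standard urllib.robotparser; the user-agent string and test paths here are illustrative:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser("http://www.tippie.uiowa.edu/robots.txt")
rp.read()   # fetches and parses the file

# True for an allowed path, False for a disallowed one
print(rp.can_fetch("MyCrawler", "http://www.tippie.uiowa.edu/about/"))
print(rp.can_fetch("MyCrawler", "http://www.tippie.uiowa.edu/scripts/x.cfm"))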

Page 5:

Types of crawlers

• Get everything?
  – Broad…
• Get everything on a topic?
  – Preferential, topical, focused, thematic
  – What are your objectives behind the crawl?
• Keep it fresh
  – When does one run it? Get new versus check old?
• How does one evaluate performance?
  – Sometimes? Continuously? What's the Gold Standard?

Page 6:

Design

Page 7:

Crawler Parts

• Frontier (see the sketch below)
  – List of "to be visited" URLs
  – FIFO (first in, first out)
  – Priority queue (preferential)
  – When the Frontier is full
    • Does this happen? What to do?
  – When the Frontier is empty
    • Does this happen?
    • 10,000 pages crawled at an average of 7 links/page leaves about 60,000 URLs in the frontier; how so?
    • Unique URLs?
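A sketch of a bounded priority-queue frontier; the size limit and the policy of dropping new URLs when full are illustrative choices (a real crawler might instead evict its worst entry):

import heapq

class Frontier:
    """Bounded priority queue of URLs; the best-scored URL comes out first."""
    def __init__(self, max_size=60_000):
        self.heap, self.seen, self.max_size = [], set(), max_size

    def add(self, url, score):
        if url in self.seen or len(self.heap) >= self.max_size:
            return                                   # skip duplicates; drop when full
        self.seen.add(url)
        heapq.heappush(self.heap, (-score, url))     # negate: heapq is a min-heap

    def next(self):
        if not self.heap:
            return None                              # frontier empty: the crawl ends
        neg, url = heapq.heappop(self.heap)
        return url, -neg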

Page 8:

Crawler Parts

• History
  – Time-stamped listing of visited URLs, recorded as they are taken out of the frontier
  – Can keep other information too: quality estimate, update frequency, rate of errors (in accessing the page), last update date, anything you want to track related to fetching the page.
  – Fast lookup
    • Hashing scheme on the URL itself
    • Canonicalize (see the sketch below):
      – Lowercasing
      – Remove anchor reference parts:
        » http://www……./faq.html#time
        » http://www……./faq.html
      – Remove tildes
      – Add or subtract a trailing /
      – Remove default pages: index.html
      – Normalize paths: remove parent pointers in the URL
        » http://www…./data/../mydata.html
      – Normalize port numbers: drop default numbers (80)
• Spider traps: long URLs; limit length.
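A sketch of several of these canonicalization steps with urllib.parse; exactly which rules to apply (e.g., dropping index.html) is a policy choice, and this rule set and the length limit are illustrative:

from urllib.parse import urlparse, urlunparse
import posixpath

MAX_URL_LEN = 256   # guard against spider traps built from very long URLs

def canonicalize(url):
    p = urlparse(url)
    host = p.netloc.lower()                          # lowercase the host
    if p.scheme == "http" and host.endswith(":80"):
        host = host[:-3]                             # drop the default port
    path = posixpath.normpath(p.path or "/")         # resolve /data/../mydata.html
    if path.endswith("/index.html"):
        path = path[: -len("index.html")]            # remove the default page
    if not path.endswith("/") and "." not in path.rsplit("/", 1)[-1]:
        path += "/"                                  # normalize the trailing slash
    out = urlunparse((p.scheme.lower(), host, path, "", p.query, ""))  # strip #fragment
    return out if len(out) <= MAX_URL_LEN else None  # spider-trap length limit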

Page 9:

Crawler Parts

• Page Repository
  – Keep all of it? Some of it? Just the anchor texts?
  – You decide
• Parse the web page
  – Index and store information (if creating a search engine of some kind)
  – What to index? How to index? How to store?
    • Stopwords, stemming, phrases, tag-tree evidence (DOM)
    • NOISE!
  – Extract URLs (see the sketch below)
    • Google's initial approach: to be shown next time.
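A sketch of URL extraction with the standard-library HTMLParser, keeping anchor text alongside each link since the repository might store only that:

from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect (absolute URL, anchor text) pairs from one page."""
    def __init__(self, base_url):
        super().__init__()
        self.base, self.links, self._href, self._text = base_url, [], None, []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")     # start of an anchor
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)                  # accumulate anchor text

    def handle_endtag(self, tag):
        if tag == "a" and self._href:
            self.links.append((urljoin(self.base, self._href),
                               "".join(self._text).strip()))
            self._href = None

Usage: parser = LinkExtractor(page_url); parser.feed(html); then parser.links holds the extracted URLs.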

Page 10:
Page 11:

And…

• But why do crawlers actually work?
  – Topical locality hypothesis
    • An on-topic page tends to link to other on-topic pages.
    • Empirical test: two pages that are topically similar have a higher probability of linking to each other than two random pages on the web (Davison, 2000).
    • And browsing works too!
  – Status locality?
    • High-status web pages are more likely to link to other high-status pages than to low-status pages.
    • Rationale from social theories: relationship asymmetry in social groups and the spontaneous development of social hierarchies.

Page 12:

Crawler Algorithms

• Naïve best-first crawler
  – Best-N-first crawler
• SharkSearch crawler
  – FishSearch
• Focused crawler
• Context Focused crawler
• InfoSpiders
• Utility-biased web crawlers

Page 13:

Naïve Best First Crawler

• Compute the cosine between the page and the query/description; use it as the URL score
• Term frequency (TF) and inverse document frequency (IDF) weights
• Multi-threaded: Best-N-first crawler (e.g., N = 256)
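A sketch of the scoring step: TF term vectors and cosine similarity against the topic description (IDF would multiply into the weights when collection statistics are available); the tokenizer is deliberately crude:

import math
import re
from collections import Counter

def tf_vector(text):
    """Term-frequency vector from a crude alphanumeric tokenization."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(v1, v2):
    dot = sum(v1[t] * v2[t] for t in v1 if t in v2)
    norm = math.sqrt(sum(c * c for c in v1.values())) * \
           math.sqrt(sum(c * c for c in v2.values()))
    return dot / norm if norm else 0.0

def score_links(page_text, query, urls):
    """Naïve best-first: every URL on a page inherits the page's score."""
    s = cosine(tf_vector(page_text), tf_vector(query))
    return [(url, s) for url in urls]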

Page 14:

Naïve best-first crawler

• Bayesian classifier to score URLs (Chakrabarti et al., 1999)
• An SVM does better (Pant and Srinivasan, 2005); naïve Bayes tends to produce skewed scores.
• Use PageRank to score URLs?
  – How to compute it? Only partial data. Computed on the crawled data – poor results.
  – Later: the utility-biased crawler

Page 15:

Shark Search Crawler

• From the earlier FishSearch (de Bra et al.)
  – Depth bound; anchor text; link context; inherited scores

score(u) = g * inherited(u) + (1 - g) * neighbourhood(u)
inherited(u) = x * sim(p, q) if sim(p, q) > 0, else x * inherited(p), with x < 1
neighbourhood(u) = b * anchor(u) + (1 - b) * context(u), with b < 1
context(u) = 1 if anchor(u) > 0, else sim(aug_context, q)

Depth bound: controls travel within a subspace once no more 'relevant' information is found.
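A sketch of the SharkSearch score for one URL u extracted from page p, following the formulas above; the dictionary layout of p, the default constants, and the sim argument (e.g., the cosine sketch from the earlier slide) are illustrative assumptions:

def shark_score(u, p, q, sim, g=0.5, x=0.9, b=0.8):
    """u: a URL found on page p; q: the query; sim: a text-similarity
    function. p carries sim(p, q), inherited(p), and per-link anchor
    scores and augmented link contexts."""
    # inherited(u): decay the parent's relevance, or its own inherited score
    inherited = x * p["sim_q"] if p["sim_q"] > 0 else x * p["inherited"]

    anchor = p["anchor_sim"][u]                      # sim(anchor text of u, q)
    context = 1.0 if anchor > 0 else sim(p["aug_context"][u], q)

    neighbourhood = b * anchor + (1 - b) * context
    return g * inherited + (1 - g) * neighbourhood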

Page 16:

Focused Crawler

• Chakrabarti et al. (Stanford/IIT)
  – Topic taxonomy
  – User-provided sample URLs
  – Classify these onto the taxonomy
    • Prob(c|url), where Prob(root|url) = 1
  – User iterates, selecting and deselecting categories
  – Mark the 'good' categories
  – When a page is crawled: relevance(page) = sum of Prob(c|page) over the good categories; use it to score URLs
  – When crawling (see the sketch below):
    • Soft mode: use this relevance score to rank URLs
    • Hard mode: find the leaf node with the highest score; if it or any of its ancestors is marked relevant, add the page's URLs to the frontier, else not
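A sketch of the two modes, assuming a classifier that returns Prob(c|page) for every taxonomy node, a set of user-marked good categories, and a precomputed ancestor table; all names are illustrative:

def relevance(probs, good):
    """Soft mode: relevance(page) = sum of Prob(c|page) over good categories."""
    return sum(probs[c] for c in good)

def hard_mode_admit(probs, leaves, ancestors, good):
    """Hard mode: take the leaf with the highest Prob(c|page); admit the
    page's out-links only if that leaf or one of its ancestors is good."""
    best = max(leaves, key=lambda c: probs[c])
    return best in good or any(a in good for a in ancestors[best])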

Page 17:

Context Focused Crawler

• A rather different strategy
  – The topical locality hypothesis is used somewhat explicitly here
  – Classifiers estimate the distance from a crawled page to a relevant page; this estimate scores the URLs.

Page 18:

Context Graph

Page 19:

Context Graph

Levels: L
Prob(page in class, i.e., level x), for x = 1, 2, 3 (other)

Bayes' theorem:

Prob(L1 | page) = Prob(page | L1) * Prob(L1) / Prob(page)

Prob(L1) = 1/L (uniform prior over the L levels)
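A sketch of the layer assignment via Bayes' theorem with the uniform prior above; the likelihoods Prob(page | level x) would come from per-level classifiers trained on the context graph, and the numbers here are made up:

def level_posteriors(likelihoods):
    """likelihoods[x] = Prob(page | level x); prior is uniform, 1/L.
    Prob(Lx | page) = Prob(page | Lx) * Prob(Lx) / Prob(page)."""
    L = len(likelihoods)
    prior = 1.0 / L
    evidence = sum(lik * prior for lik in likelihoods)   # Prob(page)
    return [lik * prior / evidence for lik in likelihoods]

# URLs from the page are queued by the most probable (closest) level
posts = level_posteriors([0.02, 0.05, 0.01])   # illustrative likelihoods
closest = posts.index(max(posts))              # here level 2 (index 1)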

Page 20:

Utility-Biased Crawler

• Considers both topical and status locality.
• Estimates status via local properties.
• Combines the two using one of several functions (see the sketch below).
  – One: the Cobb-Douglas function

Utility(URL) = topicality^a * status^b, with a + b = 1

  – If a page is twice as high in topicality and twice as high in status, then its utility is twice as high as well.
  – Increases in topicality (or status) cause smaller and smaller increases in utility as the topicality (or status) increases.
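A sketch of the Cobb-Douglas combination; the doubling property follows directly from the exponents summing to 1 (2^a * 2^b = 2^(a+b) = 2), and the input values are made up:

def utility(topicality, status, a=0.7):
    """Cobb-Douglas: topicality^a * status^b with a + b = 1."""
    b = 1 - a
    return (topicality ** a) * (status ** b)

u1 = utility(0.4, 0.2)    # baseline
u2 = utility(0.8, 0.4)    # both inputs doubled -> utility exactly doubles
assert abs(u2 / u1 - 2.0) < 1e-9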

Page 21:

Estimating Status ~ cool part

• Local properties fed to the M5' decision-tree algorithm:
  – Information volume
  – Information location
    » http://www.somedomain.com/products/news/daily.html
  – Information specificity
  – Information brokerage
  – Link ratio: # links / # words
  – Quantitative ratio: # numbers / # words
  – Domain traffic: 'reach' data for the domain, obtained from the Alexa Web Information Service
• Pant & Srinivasan, 2010, ISR

Page 22:

Utility-Biased Crawler

• Cobb-Douglas function
  – Utility(URL) = topicality^a * status^b, with a + b = 1
• Should a be fixed ("one size fits all"), or should it vary with the subspace?
• Target topicality level d
• Update rule: a = a + delta * (d - t), keeping 0 <= a <= 1
  – t: average estimated topicality of the last 25 pages fetched
  – delta is a step size (0.01)
• Assume a = 0.7, delta = 0.01, d = 0.7, and t = 0.9:
  » a = 0.7 + 0.01 * (0.7 - 0.9) = 0.7 - 0.002
• Assume a = 0.7, delta = 0.01, d = 0.7, and t = 0.4:
  » a = 0.7 + 0.01 * (0.7 - 0.4) = 0.7 + 0.003

Page 23:

It’s a matter of balance

Page 24:

Crawler Evaluation

• What are good pages?
• Web scale is daunting
• User-based crawls are short, but web agents?
• Page importance assessed via:
  – Presence of query keywords
  – Similarity of the page to the query/description
  – Similarity to seed pages (held-out sample)
  – Use a classifier – not the same one as used in the crawler
  – Link-based popularity (but within the topic?)

Page 25:

Summarizing Performance

• Precision
  – Relevance is Boolean (yes/no)
    • Harvest rate: # of good pages / total # of pages
  – Relevance is continuous
    • Average relevance over the crawled set
• Recall
  – Target recall against held-out seed pages H:
    • |H ∩ pages crawled| / |H|
• Robustness
  – Start the same crawler on disjoint seed sets; examine the overlap of the fetched pages (see the sketch below)
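A sketch of the three summary measures over sets of page identifiers; crawled, good, H, and the two crawl sets are illustrative inputs:

def harvest_rate(crawled, good):
    """Boolean relevance: fraction of fetched pages judged good."""
    return len(good & crawled) / len(crawled)

def target_recall(crawled, H):
    """Fraction of the held-out target pages H that the crawl found."""
    return len(H & crawled) / len(H)

def robustness(crawl_a, crawl_b):
    """Overlap of pages fetched by the same crawler from disjoint seeds."""
    return len(crawl_a & crawl_b) / len(crawl_a | crawl_b)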

Page 26:

Sample Performance Graph

Page 27:

Summary

• Crawler architecture
• Crawler algorithms
• Crawler evaluation
• Assignment 1
  – Run two crawlers for 5,000 pages each.
  – Start with the same set of seed pages for a topic.
  – Look at the overlap and report it over time (robustness).