Crawlers
Padmini Srinivasan
Computer Science Department & Department of Management Sciences
http://cs.uiowa.edu/~psriniva
padmini-srinivasan@uiowa.edu
Basics
• What is a crawler?
• HTTP client software that sends out an HTTP request for a page and reads the response (a minimal sketch follows this list).
• Timeouts
• How much to download?
• Exception handling
• Error handling
• Collect statistics: time-outs, etc.
• Follows the Robot Exclusion Protocol (de facto standard, 1994 onwards)
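A minimal sketch of the fetch step just described, using only the Python standard library; the 10-second timeout and the 64 KB download cap are illustrative choices, not values from the lecture.

```python
import urllib.request
import urllib.error

def fetch(url, timeout=10, max_bytes=64 * 1024):
    """Fetch one page, honoring a timeout and a download-size limit."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read(max_bytes)              # "how much to download?"
    except urllib.error.HTTPError as e:
        print(f"HTTP error {e.code} for {url}")      # error handling
    except urllib.error.URLError as e:
        print(f"Failed to reach {url}: {e.reason}")  # exception handling
    except TimeoutError:
        print(f"Timed out fetching {url}")           # collect time-out statistics here
    return None
```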
Tippie web site
  # robots.txt for http://www.tippie.uiowa.edu/ or http://tippie.uiowa.edu/
  # Rules for all robots accessing the site.
  User-agent: *
  Disallow: /error-pages/
  Disallow: /includes/
  Disallow: /Redirects/
  Disallow: /scripts/
  Disallow: /CFIDE/

  # Individual folders that should not be indexed
  Disallow: /vaughan/Board/
  Disallow: /economics/mwieg/
  Disallow: /economics/midwesttheory/
  Disallow: /undergraduate/scholars/

  Sitemap: http://tippie.uiowa.edu/sitemap.xml
Robots.txt
  # Allow all robots everywhere:
  User-agent: *
  Disallow:

  # Ban one robot from the entire site:
  User-agent: BadBot
  Disallow: /

  # Allow only Google; ban everyone else:
  User-agent: Google
  Disallow:
  User-agent: *
  Disallow: /

A per-page alternative, via a meta tag:
  <html><head><meta name="googlebot" content="noindex">

http://www.robotstxt.org/
Legally binding? No. But robots.txt has been used in legal cases.
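Before fetching, a polite crawler consults robots.txt. A sketch using the standard library's parser against the Tippie rules shown above; "MyCrawler" and the example paths are made up.

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("http://tippie.uiowa.edu/robots.txt")
rp.read()   # fetch and parse the rules

# /scripts/ is disallowed for all agents, so this returns False
print(rp.can_fetch("MyCrawler", "http://tippie.uiowa.edu/scripts/form.cfm"))
# a path outside the disallowed folders returns True
print(rp.can_fetch("MyCrawler", "http://tippie.uiowa.edu/about/"))
```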
Types of crawlers
• Get everything?
  – Broad…
• Get everything within a topic?
  – Preferential, topical, focused, thematic
  – What are your objectives behind the crawl?
• Keep it fresh
  – When does one run it? Get new versus check old?
• How does one evaluate performance?
  – Sometimes? Continuously? What's the Gold Standard?
Design
Crawler Parts
• Frontier– List of “to be visited” URLS– FIFO (first in first out)– Priority queue (preferential)– When the Frontier is full
• Does this happen? What to do?
– When the Frontier is empty• Does this happen?• 10,000 pages crawled, average 7 links / page: 60,000 URLS in
the frontier, how so?• Unique URLS?
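A sketch of a frontier as a priority queue with duplicate filtering, assuming each URL arrives with a score; the capacity and the drop-when-full policy are illustrative choices.

```python
import heapq

class Frontier:
    def __init__(self, max_size=50_000):
        self.heap = []        # entries are (-score, url); heapq pops the smallest
        self.seen = set()     # enforce unique URLs
        self.max_size = max_size

    def add(self, url, score):
        if url in self.seen or len(self.heap) >= self.max_size:
            return            # one answer to "when the frontier is full": drop it
        self.seen.add(url)
        heapq.heappush(self.heap, (-score, url))

    def next(self):
        if not self.heap:     # "when the frontier is empty"
            return None
        return heapq.heappop(self.heap)[1]
```

Replacing the heap with a plain FIFO queue turns the same class into the frontier of a broad, breadth-first crawler.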
Crawler Parts
• History
  – Time-stamped listing of visited URLs; remove from the frontier first
  – Can keep other information too: quality estimate, update frequency, rate of errors (in accessing the page), last update date, anything you want to track related to the fetching of the page.
  – Fast lookup
    • Hashing scheme on the URL itself
    • Canonicalize first (a sketch follows this list):
      – Lowercasing
      – Remove anchor reference parts:
        » http://www……./faq.html#time
        » http://www……./faq.html
      – Remove tildes
      – Add or subtract a trailing /
      – Remove default pages: index.html
      – Normalize paths: remove parent pointers in the URL
        » http://www…./data/../mydata.html
      – Normalize port numbers: drop default numbers (80)
• Spider traps: long URLs; limit URL length.
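A sketch of some of the canonicalization steps above. A production canonicalizer needs care: for example, stripping index.html or a tilde is only safe for some servers.

```python
import posixpath
from urllib.parse import urlparse, urlunparse

def canonicalize(url):
    p = urlparse(url)
    host = p.netloc.lower().removesuffix(":80")  # lowercase host; drop default port 80
    path = posixpath.normpath(p.path)            # /data/../mydata.html -> /mydata.html
    if path == ".":
        path = "/"
    if path.endswith("/index.html"):             # remove default pages
        path = path[: -len("index.html")]
    return urlunparse((p.scheme, host, path, "", p.query, ""))  # drop the #fragment

print(canonicalize("HTTP://Example.com:80/data/../faq.html#time"))
# -> http://example.com/faq.html
```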
Crawler Parts
• Page Repository
  – Keep all of it? Some of it? Just the anchor texts?
  – You decide
• Parse the web page
  – Index and store information (if creating a search engine of some kind)
  – What to index? How to index? How to store?
    • Stopwords, stemming, phrases, tag tree evidence (DOM)
    • NOISE!
  – Extract URLs (a sketch follows this list)
    • Google initially: show you next time.
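A sketch of the URL-extraction step using the standard-library HTML parser; anchor text could be captured the same way for the page repository.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))  # resolve relative links

extractor = LinkExtractor("http://example.com/a/")
extractor.feed('<a href="../b.html">next page</a>')
print(extractor.links)   # ['http://example.com/b.html']
```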
And…
• But why do crawlers actually work?
  – Topical locality hypothesis
    • An on-topic page tends to link to other on-topic pages.
    • Empirical test: two pages that are topically similar have a higher probability of linking to each other than two random pages on the web (Davison, 2000).
    • And this is why browsing works, too!
  – Status locality?
    • High-status web pages are more likely to link to other high-status pages than to low-status pages.
    • Rationale from social theories: relationship asymmetry in social groups and the spontaneous development of social hierarchies.
Crawler Algorithms
• Naïve best-first crawler
  – Best-N-first crawler
• SharkSearch crawler
  – FishSearch
• Focused crawler
• Context Focused crawler
• InfoSpiders
• Utility-biased web crawlers
Naïve Best First Crawler
• Compute the cosine between the fetched page and the query/description; use it as the score for the page's URLs (a sketch follows this list)
• Term frequency (TF) and Inverse Document Frequency (IDF) weights
• Multi-threaded: Best-N-first crawler (e.g., N = 256)
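A sketch of the scoring step: cosine similarity between a page and the query/description, here with plain term frequencies (a real run would multiply in IDF weights from a background corpus). Each URL extracted from the page inherits this score.

```python
import math
from collections import Counter

def cosine(text_a, text_b):
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

url_score = cosine("sports news football scores", "latest football news and scores today")
```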
Naïve best-first crawler
• Bayesian classifier to score URLs (Chakrabarti et al., 1999)
• SVM is better (Pant and Srinivasan, 2005); Naïve Bayes tends to produce skewed scores.
• Use PageRank to score URLs?
  – How to compute? Partial data. Based on crawled data: poor results
  – Later: utility-biased crawler
Shark Search Crawler
• From the earlier FishSearch (de Bra et al.)
  – Depth bound; anchor text; link context; inherited scores

score(u) = g * inherited(u) + (1 - g) * neighbourhood(u)
inherited(u) = x * sim(p, q) if sim(p, q) > 0, else x * inherited(p)   (x < 1)
neighbourhood(u) = b * anchor(u) + (1 - b) * context(u)   (b < 1)
context(u) = 1 if anchor(u) > 0, else sim(aug_context, q)

Depth bound: controls travel within a subspace once no more 'relevant' information is found.
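A direct transcription of these formulas into code; the parameter settings are placeholders, and sim(p, q) stands for any page-to-query similarity, such as the cosine sketched earlier.

```python
def shark_score(sim_pq, inherited_p, anchor_u, context_sim, g=0.5, b=0.5, x=0.9):
    """Score URL u found on page p, for query q."""
    inherited_u = x * sim_pq if sim_pq > 0 else x * inherited_p
    context_u = 1.0 if anchor_u > 0 else context_sim      # an anchor hit trumps context
    neighbourhood_u = b * anchor_u + (1 - b) * context_u
    return g * inherited_u + (1 - g) * neighbourhood_u
```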
Focused Crawler
• Chakrabarti et al., Stanford/IIT
  – Topic taxonomy
  – User-provided sample URLs
  – Classify these onto the taxonomy
    • Prob(c | url), where Prob(root | url) = 1
  – User iterates, selecting and deselecting categories
  – Mark the 'good' categories
  – When a page is crawled: relevance(page) = sum of Prob(c | page) over the good categories; use it to score the page's URLs
  – When crawling (a sketch of both modes follows this list):
    • Soft mode: use this relevance score to rank URLs
    • Hard mode: find the leaf node with the highest score; if any ancestor is marked relevant, then add to the frontier, else do not
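A sketch of the two modes, assuming a classifier that returns Prob(c | page) for every taxonomy category. good (the user's marked categories), leaves, and the ancestors() helper (assumed to yield the leaf itself plus its ancestors up to the root) are hypothetical stand-ins.

```python
def relevance(probs, good):
    # soft mode: rank frontier URLs by summed probability over good categories
    return sum(p for c, p in probs.items() if c in good)

def hard_mode_accept(probs, good, leaves, ancestors):
    # hard mode: take the best-scoring leaf; keep the page's URLs only if
    # one of its ancestors (taken here to include the leaf itself) is good
    best_leaf = max(leaves, key=lambda c: probs.get(c, 0.0))
    return any(c in good for c in ancestors(best_leaf))
```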
Context Focused Crawler
• A rather different strategy
  – The topic locality hypothesis is used somewhat explicitly here
  – Classifiers estimate the distance from a crawled page to a relevant page. This estimate scores the page's URLs.
Context Graph
Levels: L
Prob(page in class, i.e., level x), for x = 1, 2, 3 (other)

Bayes theorem:
Prob(L1 | page) = Prob(page | L1) * Prob(L1) / Prob(page)
Prob(L1) = 1/L (L = number of levels)
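A worked instance of this computation for L = 3 levels; the likelihoods Prob(page | level) are made-up numbers that a trained classifier would supply.

```python
likelihood = {1: 0.6, 2: 0.3, 3: 0.1}         # Prob(page | level), hypothetical
prior = {lvl: 1 / 3 for lvl in likelihood}    # Prob(level) = 1/L
evidence = sum(likelihood[l] * prior[l] for l in likelihood)   # Prob(page)
posterior = {l: likelihood[l] * prior[l] / evidence for l in likelihood}
print(posterior[1])   # Prob(L1 | page) ≈ 0.6; URLs scored by estimated distance
```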
Utility-Biased Crawler
• Considers both topical and status locality.• Estimates status via local properties• Combines using several functions.– One: Cobb-Douglas function
• Utility(URL) = topicalitya * statusb (a + b = 1)
– if a page is twice as high in topicality and twice as high in status then twice as high utility as well.
– Increases in topicality (or status) cause smaller increases in utility as the topicality (or status) increases.
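A sketch of the Cobb-Douglas combination, illustrating both properties just listed; a = 0.7 is an arbitrary weighting.

```python
def utility(topicality, status, a=0.7):
    return (topicality ** a) * (status ** (1 - a))

base = utility(0.4, 0.5)
print(utility(0.8, 1.0) / base)   # ≈ 2.0: doubling both inputs doubles utility
print(utility(0.8, 0.5) / base)   # ≈ 1.62: doubling topicality alone gives less than 2x
```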
Estimating Status ~ the cool part
• Local properties
  – M5' decision tree algorithm
    • Information volume
    • Information location
      – http://www.somedomain.com/products/news/daily.html
    • Information specificity
    • Information brokerage
    • Link ratio: # links / # words
    • Quantitative ratio: # numbers / # words
    • Domain traffic: 'reach' data for the domain, obtained from the Alexa Web Information Service
• Pant & Srinivasan, 2010, ISR
Utility-Biased Crawler
• Cobb-Douglas function
  – Utility(URL) = topicality^a * status^b   (a + b = 1)
• Should a be fixed ("one size fits all"), or should it vary based on the subspace?
• Target topicality level (d)
• Update rule: a = a + delta * (d - t), with 0 <= a <= 1 (a sketch follows this list)
  – t: average estimated topicality of the last 25 pages fetched
  – delta: a step size (0.01)
• Assume d = 0.7, a = 0.7, delta = 0.01 and t = 0.9
  » a = 0.7 + 0.01 * (0.7 - 0.9) = 0.7 - 0.002 = 0.698
• Assume d = 0.7, a = 0.7, delta = 0.01 and t = 0.4
  » a = 0.7 + 0.01 * (0.7 - 0.4) = 0.7 + 0.003 = 0.703
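The same update in code, reproducing the two worked examples above (d = 0.7, delta = 0.01):

```python
def update_a(a, t, d=0.7, delta=0.01):
    a = a + delta * (d - t)          # nudge a toward the target topicality
    return min(max(a, 0.0), 1.0)     # clamp to 0 <= a <= 1

print(update_a(0.7, t=0.9))   # ≈ 0.698: crawl is "too topical", shift weight to status
print(update_a(0.7, t=0.4))   # ≈ 0.703: crawl drifting off-topic, weight topicality more
```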
It’s a matter of balance
Crawler Evaluation
• What are good pages?
• Web scale is daunting
• User-based crawls are short, but web agents?
• Page importance assessed by:
  – Presence of query keywords
  – Similarity of the page to the query/description
  – Similarity to seed pages (a held-out sample)
  – A classifier (not the same one used in the crawler)
  – Link-based popularity (but within the topic?)
Summarizing Performance
• Precision (metric sketches follow this list)
  – Relevance is Boolean: yes/no
    • Harvest rate: # of good pages / total # of pages
  – Relevance is continuous
    • Average relevance over the crawled set
• Recall
  – Target recall, using held-out seed pages (H):
    • |H ∩ pages crawled| / |H|
• Robustness
  – Start the same crawler on disjoint seed sets. Examine the overlap of the fetched pages.
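A sketch of these three summary metrics over sets of URLs; relevant and held_out stand for the judged-good pages and the held-out target set H.

```python
def harvest_rate(crawled, relevant):
    return len(crawled & relevant) / len(crawled)

def target_recall(crawled, held_out):
    return len(crawled & held_out) / len(held_out)

def robustness(crawl_a, crawl_b):
    # same crawler started from disjoint seed sets; overlap of fetched pages
    return len(crawl_a & crawl_b) / len(crawl_a | crawl_b)
```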
Sample Performance Graph
Summary
• Crawler architecture
• Crawler algorithms
• Crawler evaluation
• Assignment 1
  – Run two crawlers for 5,000 pages.
  – Start with the same set of seed pages for a topic.
  – Look at the overlap and report it over time (robustness).