TRANSCRIPT
Web Search – Summer Term 2006
IV. Web Search -Crawling (part 2)
(c) Wolfgang Hürst, Albert-Ludwigs-University
Crawling - Recap from last time
General procedure: Continuously process a list of URLs and collect the respective web pages and the links they contain
Two problems: Size and frequent changes
Page selection:
Based on metrics, i.e.
- Importance Metric (goal)
- Ordering Metric (selection)
- Quality Metric (evaluation)
Experimental verification with a representative test collection
Page refresh:
Estimating rate of change: see last lecture (Note: other studies exist, e.g. [5])
Observations:
- Frequent changes
- Significant differences, e.g. among domains
Hence: Update rule necessary
3. Page Refresh (Update Rules)
Problem: The web is continuously changing
Goal: Index and update pages in a way that keeps the index as fresh and as young as possible (given the limited resources)
Distinguish between
Periodic crawlers: Download K pages and stop; after some time t, repeat and replace the old collection with the new one
Incremental crawlers: Continuously crawl the web and incrementally update your collection
3.2 Incremental Crawlers
Main goal: Keep the local collection up-to-date
Two measures: Freshness and Age (definitions from [6])

Freshness of a page pi at time t:
F(pi; t) = 1 if pi is up-to-date at time t, 0 otherwise
Freshness of a local collection P = {p1, ..., pN} at time t:
F(P; t) = (1/N) * sum over i of F(pi; t)

Age of a page pi at time t:
A(pi; t) = 0 if pi is up-to-date at time t, otherwise t minus the time of the first modification of pi that is not yet reflected in the local copy
Age of a local collection P at time t:
A(P; t) = (1/N) * sum over i of A(pi; t)

Time average of the freshness of a page pi:
F(pi) = lim (t -> infinity) of (1/t) * integral from 0 to t of F(pi; t) dt
Time average of the freshness of a local collection P:
F(P) = lim (t -> infinity) of (1/t) * integral from 0 to t of F(P; t) dt
(Time averages of age: analogous)
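The definitions above can be made concrete in a few lines of code. The following sketch (in Python; the function names and the list-based representation are illustrative choices, not from the slides) computes freshness and age of a page from its sorted lists of change times and synchronization times, assuming at least one sync at or before time t:

```python
import bisect

def freshness(t, changes, syncs):
    """F(p; t): 1 if no change happened since the last sync before t, else 0."""
    last_sync = syncs[bisect.bisect_right(syncs, t) - 1]
    i = bisect.bisect_right(changes, last_sync)  # first change after the sync
    return 1.0 if i == len(changes) or changes[i] > t else 0.0

def age(t, changes, syncs):
    """A(p; t): 0 if fresh, otherwise t minus the first change not yet seen."""
    last_sync = syncs[bisect.bisect_right(syncs, t) - 1]
    i = bisect.bisect_right(changes, last_sync)
    if i == len(changes) or changes[i] > t:
        return 0.0
    return t - changes[i]

def collection_freshness(t, pages):
    """F(P; t): average freshness over a list of (changes, syncs) pairs."""
    return sum(freshness(t, c, s) for c, s in pages) / len(pages)
```

For example, a page that changed at times 2 and 5 and was downloaded at times 0 and 3 is fresh at t = 4 (age 0), but stale at t = 6 with age 6 - 5 = 1.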
Example for Freshness and Age
[Figure: Freshness and age of a single element over time. Freshness drops from 1 to 0 when the element is changed and returns to 1 when it is synchronized; age is 0 while the element is fresh and grows linearly from the change until the next synchronization. (Source: [6])]
Design alternative 1: Batch mode vs. steady crawler
Batch mode crawler: Periodic update of all pages of a collection
Steady crawler: Continuous update
[Figure: Freshness over time (in months) for a batch mode crawler vs. a steady crawler]
Note: Assuming that page changes follow a Poisson process, one can prove that the time-averaged freshness is identical in both cases (given the same average crawling speed!)
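This can be checked numerically. Under the Poisson assumption the time until the first change after a download is exponentially distributed, and both designs revisit each page once per period (the batch crawler in a burst, the steady crawler spread out), so they share the same per-page average freshness. A small simulation sketch (parameter values are illustrative):

```python
import math
import random

def simulate_avg_freshness(lam, period, intervals=200000, seed=1):
    """Average freshness of a page that changes as a Poisson process with
    rate `lam` and is re-downloaded once every `period` time units."""
    rng = random.Random(seed)
    fresh = 0.0
    for _ in range(intervals):
        # The page stays fresh from the download until the first change.
        time_to_change = rng.expovariate(lam)
        fresh += min(time_to_change, period)
    return fresh / (intervals * period)

# Closed form under the Poisson assumption: (1 - e^(-lam*T)) / (lam*T)
lam, period = 2.0, 1.0
exact = (1.0 - math.exp(-lam * period)) / (lam * period)
approx = simulate_avg_freshness(lam, period)
```

The simulated value agrees with the closed form to within Monte-Carlo noise, independent of when within the period the downloads happen.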
Design alternative 2: In-place vs. shadowing
Replace the old version of a page with the new one either in place, or via shadowing, i.e. only after all pages of one crawl have been downloaded
Shadowing keeps two collections: the crawler's collection and the current collection
Design alternative 3: Fixed vs. variable frequency
Fixed frequency / uniform refresh policy: Same access rate to all pages (independent of their actual rate of change)
Variable frequency: Access pages depending on their rate of change
Example: Proportional refresh policy
Variable frequency update
Obvious assumption for a good strategy: visit frequently changing pages more often
Wrong!!!
The optimum update strategy (assuming Poisson-distributed changes) looks like this:
[Figure: Optimum update time plotted against the rate of change of a page]
Variable frequency update (cont.)
Why is this a better strategy?
Illustration with a simple example:
[Figure: Change and update timelines of two example pages p1 and p2]
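The intuition can be worked out numerically with the closed-form time-averaged freshness under the Poisson assumption (cf. [6]). The concrete rates and budget below are illustrative choices, not taken from the slide: suppose p1 changes nine times per day, p2 once per day, and the crawler can afford ten downloads per day.

```python
import math

def avg_freshness(lam, f):
    """Time-averaged freshness of a page with Poisson change rate `lam`,
    re-downloaded `f` times per unit time at regular intervals."""
    return (f / lam) * (1.0 - math.exp(-lam / f))

# p1 changes 9 times/day, p2 once/day; budget: 10 downloads/day.
uniform      = (avg_freshness(9, 5) + avg_freshness(1, 5)) / 2  # 5 + 5 visits
proportional = (avg_freshness(9, 9) + avg_freshness(1, 1)) / 2  # 9 + 1 visits
```

The uniform policy reaches an average freshness of about 0.685, the proportional one only about 0.632: the many extra visits to p1 buy very little freshness, while taking visits away from p2 costs a lot.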
Summary of different design alternatives
Steady vs. Batch-mode
In-place update vs. Shadowing
Variable frequency vs. Fixed frequency
3.3 Example for an Incremental Crawler
Two main goals:
- Keep the local collection fresh: regular, best-possible updates of the pages in the index
- Continuously improve the quality of the collection: replace existing pages of low quality with new pages of higher quality
The basic loop of the incremental crawler:

WHILE (TRUE)
  URL = SELECT_TO_CRAWL (ALL_URLS);
  PAGE = CRAWL (URL);
  IF (URL IN COLL_URLS) THEN
    UPDATE (URL, PAGE)
  ELSE
    TMP_URL = SELECT_TO_DISCARD (COLL_URLS);
    DISCARD (TMP_URL);
    SAVE (URL, PAGE);
    COLL_URLS = (COLL_URLS - {TMP_URL}) U {URL};
  NEW_URLS = EXTRACT_URLS (PAGE);
  ALL_URLS = ALL_URLS U NEW_URLS;
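A minimal executable reading of this loop, with in-memory stand-ins for the crawler's data structures; `fetch` and `rank` are caller-supplied placeholders (not part of the original pseudocode), and the link extraction is deliberately naive:

```python
import re

def extract_urls(page):
    """Very naive stand-in for EXTRACT_URLS."""
    return set(re.findall(r"https?://\S+", page))

def incremental_crawl(all_urls, coll_urls, collection, max_size, steps,
                      fetch, rank):
    for _ in range(steps):
        url = max(all_urls, key=rank)              # SELECT_TO_CRAWL
        page = fetch(url)                          # CRAWL
        if url in coll_urls:
            collection[url] = page                 # UPDATE in place
        else:
            if len(coll_urls) >= max_size:         # make room first
                victim = min(coll_urls, key=rank)  # SELECT_TO_DISCARD
                coll_urls.discard(victim)          # DISCARD
                collection.pop(victim, None)
            coll_urls.add(url)                     # SAVE
            collection[url] = page
        all_urls |= extract_urls(page)             # ADD_URLS
    return collection
```

Note that `rank` plays a double role here, mirroring the two goals above: picking the most promising URL to crawl next, and (as a quality metric) picking the least valuable page to discard when the collection is full.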
[Figure: Architecture of the incremental crawler. Two URL structures (ALL_URLS and COLL_URLS) and the page Collection are maintained by three modules: a Ranking Module scans both structures and adds/removes URLs from the collection; an Update Module pops URLs from COLL_URLS to be refreshed (discard, push back); a Crawl Module crawls a URL, detects changes via checksum, updates/saves pages in the Collection, and adds extracted URLs to ALL_URLS]
References - Web Crawler
[1] A. Arasu, J. Cho, H. Garcia-Molina, A. Paepcke, S. Raghavan: "Searching the Web", ACM Transactions on Internet Technology, Vol. 1/1, Aug. 2001, Chapter 2 (Crawling web pages)
[2] S. Brin, L. Page: "The Anatomy of a Large-Scale Hypertextual Web Search Engine", WWW 1998, Chapter 4.3 (Crawling the web)
[3] Cho, Garcia-Molina, Page: "Efficient Crawling Through URL Ordering", WWW 1998
[4] Cho, Garcia-Molina: "The Evolution of the Web and Implications for an Incremental Crawler", Proceedings of the 26th Intl. Conf. on Very Large Data Bases (VLDB 2000)
[5] Fetterly, Manasse, Najork, Wiener: "A Large-Scale Study of the Evolution of Web Pages", WWW 2003
[6] Cho, Garcia-Molina: "Synchronizing a Database to Improve Freshness", ACM SIGMOD 2000