
Web Crawlers




Web crawler

• A Web crawler is a computer program that browses the World Wide Web in a methodical, automated manner.

• Other terms for a Web crawler include ants, automatic indexers, bots, worms, Web spider, Web robot, and Web scutter.

• The process is called Web crawling or spidering.


Uses of Web Crawlers

• To create a copy of all the visited pages for later processing by a search engine that will index the downloaded pages to provide fast searches.

• For automating maintenance tasks on a Web site, such as checking links or validating HTML code.

• To gather specific types of information from Web pages, such as harvesting e-mail addresses.


• A Web crawler is one type of bot, or software agent.

• It starts with a list of URLs to visit, called the seeds.

• As the crawler visits each URL, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, called the crawl frontier.

• URLs from the frontier are recursively visited according to a set of policies (a minimal version of this loop is sketched below).
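The overall loop can be sketched in a few lines of Python. This is a minimal illustration only: fetch_page and extract_links are hypothetical helpers standing in for the fetching and parsing steps described later, the frontier is a plain FIFO queue, and the politeness, re-visit, and parallelization policies are ignored.

    from collections import deque

    def crawl(seeds, max_pages=100):
        # Frontier as a FIFO queue seeded with the start URLs (breadth-first).
        frontier = deque(seeds)
        visited = set()
        while frontier and len(visited) < max_pages:
            url = frontier.popleft()
            if url in visited:
                continue
            status, page = fetch_page(url)            # hypothetical HTTP fetch helper
            visited.add(url)
            if page is None:
                continue
            for link in extract_links(url, page):     # hypothetical link extractor
                if link not in visited:
                    frontier.append(link)             # grow the crawl frontier
        return visited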


Characteristics of Web Crawling

• Its large volume – a crawler can only download a fraction of the Web pages within a given time, so it needs to prioritize its downloads.

• Its fast rate of change – it is very likely that new pages have been added to a site, or that pages have already been updated or even deleted, while the crawler is still downloading its pages.

• Dynamic page generation – pages generated by server-side scripting languages also make crawling more difficult.


Crawling policies

• a selection policy that states which pages to download,

• a re-visit policy that states when to check for changes to the pages,

• a politeness policy that states how to avoid overloading Web sites,

• a parallelization policy that states how to coordinate distributed Web crawlers.


Selection policy

– PageRank
– Path-ascending crawling
– Focused crawling


Re-visit policy

• Freshness: This is a binary measure that indicates whether the local copy is accurate or not.

• Age: This is a measure that indicates how outdated the local copy is (see the sketch below).
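A rough sketch of how these two measures could be computed for a page, assuming we know when we last downloaded it and when it last changed on the server (the function and parameter names are illustrative):

    def freshness(last_download_time, last_remote_change_time):
        # Binary: 1 if our local copy is still up to date, 0 otherwise.
        return 1 if last_download_time >= last_remote_change_time else 0

    def age(now, last_download_time, last_remote_change_time):
        # 0 while the copy is current; otherwise the time elapsed since the
        # page changed on the server without us re-downloading it.
        if last_download_time >= last_remote_change_time:
            return 0
        return now - last_remote_change_time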


Re-visit policy

• Uniform policy: This involves re-visiting all pages in the collection with the same frequency, regardless of their rates of change.

• Proportional policy: This involves re-visiting more often the pages that change more frequently. The visiting frequency is directly proportional to the (estimated) change frequency.


Politeness Policy

• Costs of using Web crawlers include:

– network resources: crawlers require considerable bandwidth and operate with a high degree of parallelism during a long period of time;

– server overload: especially if the frequency of accesses to a given server is too high (see the sketch after this list);

– poorly-written crawlers: these can crash servers or routers, or download pages they cannot handle; and

– personal crawlers: if deployed by too many users, these can disrupt networks and Web servers.
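A common way to avoid overloading a server is to enforce a minimum delay between successive requests to the same host. A minimal sketch (the 10-second delay is an arbitrary example, not a prescribed value):

    import time
    from urllib.parse import urlsplit

    MIN_DELAY = 10.0          # seconds between hits to the same host (example value)
    last_access = {}          # host -> time of the last request to it

    def polite_wait(url):
        # Sleep until at least MIN_DELAY seconds have passed since the last
        # request to this URL's host, then record the new access time.
        host = urlsplit(url).netloc
        elapsed = time.time() - last_access.get(host, 0.0)
        if elapsed < MIN_DELAY:
            time.sleep(MIN_DELAY - elapsed)
        last_access[host] = time.time()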


Parallelization Policy

• A parallel crawler runs multiple crawling processes in parallel.

• The goal is to maximize the download rate while minimizing the overhead from parallelization, and to avoid repeated downloads of the same page.

• To avoid downloading the same page more than once, the crawling system requires a policy for assigning the new URLs discovered during the crawling process, since the same URL can be found by two different crawling processes (one such policy is sketched below).
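One simple assignment policy, shown here as an illustrative sketch rather than the only option, is to partition URLs among crawling processes by hashing the host name; every process that discovers a given URL then hands it to the same owner, so no page is fetched twice:

    import hashlib
    from urllib.parse import urlsplit

    def assign_to_process(url, num_processes):
        # Hash the host so all URLs from one site go to one crawling process,
        # which also keeps per-host politeness bookkeeping in a single place.
        host = urlsplit(url).netloc
        digest = hashlib.md5(host.encode("utf-8")).hexdigest()
        return int(digest, 16) % num_processes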


Components

• Search engine – responsible for deciding which new documents to explore, and for initiating the process of their retrieval.

• Database – used to store the document metadata, the full-text index, and the hyperlinks between documents.

• Agents – responsible for retrieving the documents from the Web under the control of the search engine.


Components (contd.)

• Query server – responsible for handling the query processing service.

• libWWW – the CERN WWW library, used by agents to access several different kinds of content using different protocols.


Web Crawler Architecture


Crawling Infrastructure

• The crawler maintains a list of unvisited URLs called the frontier; the list is initialized with seed URLs, which may be provided by a user or another program.

• Each crawling loop involves picking the next URL to crawl from the frontier and fetching the page corresponding to that URL through HTTP.


Crawling Infrastructure (contd.)

• Before the URLs are added to the frontier they may be assigned a score that represents the estimated benefit of visiting the page corresponding to the URL.

• The crawling process may be terminated when a certain number of pages have been crawled.

• If the crawler is ready to crawl another page and the frontier is empty, the situation signals a dead-end for the crawler.


Graph Search Problem

• Crawling can be viewed as a graph search problem.

• The Web is seen as a large graph with pages as its nodes and hyperlinks as its edges.

• A crawler starts at a few of the nodes (the seeds) and then follows the edges to reach other nodes.

• The process of fetching a page and extracting the links within it is analogous to expanding a node in graph search.


Frontier

• The frontier is the to-do list of a crawler that contains the URLs of unvisited pages.

• In graph search terminology the frontier is an open list of unexpanded (unvisited) nodes.

• The frontier can fill up rather quickly as pages are crawled.

• The frontier may be implemented as a FIFO queue, in which case we have a breadth-first crawler that can be used to blindly crawl the Web.


History and Page Repository

• The crawl history is a time-stamped list of URLs that were fetched by the crawler.

• It shows the path of the crawler through the Web starting from the seed pages.

• A URL entry is made into the history only after fetching the corresponding page.

• The history may be used for post-crawl analysis and evaluations.


History and Page Repository (contd.)

• In its simplest form, a page repository may store the crawled pages as separate files.

• Each page must map to a unique file name; one way is to derive a compact string from the URL using some form of hashing function with a low probability of collisions.

• MD5 is a one-way hashing function that provides a 128-bit hash code for each URL (see the sketch below).
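A minimal sketch of such a URL-to-file-name mapping, using the MD5 implementation in Python's standard library (the .html extension is just an illustrative choice):

    import hashlib

    def page_filename(url):
        # 128-bit MD5 hash of the URL, hex-encoded into a 32-character name.
        return hashlib.md5(url.encode("utf-8")).hexdigest() + ".html"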


Fetching

• The fetcher is an HTTP client that sends an HTTP request for a page and reads the response.

• The client needs timeouts to make sure that an unnecessary amount of time is not spent on slow servers or in reading large pages.

• The client also needs to parse the response headers for status codes and redirections.
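A minimal fetcher along these lines, using only the standard library; the timeout, size cap, and user-agent string are illustrative values, and urlopen follows redirects on its own:

    import urllib.request

    def fetch_page(url, timeout=10, max_bytes=1_000_000):
        # Return (status, body) for the page, or (None, None) on any error,
        # reading at most max_bytes so huge pages cannot stall the crawler.
        req = urllib.request.Request(url, headers={"User-Agent": "ExampleCrawler/0.1"})
        try:
            with urllib.request.urlopen(req, timeout=timeout) as resp:
                return resp.status, resp.read(max_bytes)
        except OSError:       # covers URLError, HTTP errors, timeouts
            return None, None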


Fetching (contd.)

• Error checking and exception handling are important during the page fetching process.

• It is also useful to collect statistics on timeouts and status codes, for identifying problems or automatically changing timeout values.


Robot Exclusion Protocol

• The Robot Exclusion Protocol provides a mechanism for Web server administrators to communicate their file access policies.

• In particular, it identifies files that may not be accessed by a crawler.

• This is done by keeping a file named robots.txt under the root directory of the Web server.
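Python's standard library includes a parser for this file; a minimal check might look like the following (the site and user-agent name are illustrative):

    import urllib.robotparser

    # Check whether a given URL may be fetched, using the standard-library
    # robots.txt parser.
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("http://example.com/robots.txt")
    rp.read()                     # download and parse the file
    allowed = rp.can_fetch("ExampleCrawler", "http://example.com/private/page.html")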


Parsing

• Once a page has been fetched, its content must be parsed to extract information that will feed and possibly guide the future path of the crawler.

• Parsing may imply simple hyperlink/URL extraction, or it may involve the more complex process of tidying up the HTML content in order to analyze the HTML tag tree.

• Parsing might also involve steps to convert the extracted URLs to a canonical form, remove stopwords from the page's content, and stem the remaining words.


URL Extraction and Canonicalization

• To extract hyperlink URLs from a Web page, we can use an HTML parser to find anchor tags and grab the values of the associated href attributes.

• Any relative URLs must be converted to absolute URLs using the base URL of the page.

• Different URLs that correspond to the same Web page can be mapped onto a single canonical form.
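A minimal sketch of href extraction with the standard-library HTML parser, resolving relative links against the page's base URL (the class name and example URL are illustrative):

    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class LinkExtractor(HTMLParser):
        """Collects absolute URLs from the href attributes of anchor tags."""

        def __init__(self, base_url):
            super().__init__()
            self.base_url = base_url
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        # Resolve relative URLs against the page's base URL.
                        self.links.append(urljoin(self.base_url, value))

    extractor = LinkExtractor("http://example.com/index.html")
    extractor.feed("<a href='/about.html'>About</a>")
    # extractor.links == ['http://example.com/about.html']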


Canonicalization Procedures

• convert the protocol and hostname to lowercase.

• remove the `anchor' or `reference' part of the URL.

• perform URL encoding for some commonly used characters such as `~'.

• for some URLs, add a trailing `/'.

• use heuristics to recognize default Web pages.


Canonicalization Procedures (contd.)

• remove `..' and its parent directory from the URL path.

• leave the port numbers in the URL unless it is port 80; alternatively, keep port numbers and add port 80 when no port number is specified.
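A simplified canonicalization routine covering a few of the rules above (lowercasing, fragment removal, default-port handling, and a `/' for an empty path); the remaining heuristics, such as `..' removal, are deliberately left out of this sketch:

    from urllib.parse import urlsplit, urlunsplit

    def canonicalize(url):
        # Lowercase the scheme and host, drop the fragment, strip an explicit
        # port 80, and use "/" for an empty path.
        parts = urlsplit(url)
        scheme = parts.scheme.lower()
        host = (parts.hostname or "").lower()
        if parts.port and parts.port != 80:
            host = "%s:%d" % (host, parts.port)
        path = parts.path or "/"
        return urlunsplit((scheme, host, path, parts.query, ""))

    # canonicalize("HTTP://Example.COM:80/a/b.html#sec2")
    #   -> "http://example.com/a/b.html"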


Stoplisting

• Remove commonly used words, or stopwords, such as "it" and "can".

• The process of removing stopwords from text is called stoplisting.

• A system may recognize no more than nine words ("an", "and", "by", "for", "from", "of", "the", "to", and "with") as the stopwords (see the sketch below).
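A minimal stoplisting pass over a list of tokens, using the nine stopwords listed above:

    STOPWORDS = {"an", "and", "by", "for", "from", "of", "the", "to", "with"}

    def stoplist(words):
        # Drop any token that appears in the stopword set (case-insensitive).
        return [w for w in words if w.lower() not in STOPWORDS]

    # stoplist(["links", "from", "the", "page"]) -> ["links", "page"]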


Stemming

• The stemming process normalizes words by conflating a number of morphologically similar words to a single root form or stem.

• Example: "connect", "connected", and "connection" are all reduced to "connect".

• Stemming has been found to reduce the precision of the crawling results.
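A toy suffix-stripping function that reproduces the example above; a real system would use a full algorithm such as the Porter stemmer rather than this illustration:

    def naive_stem(word):
        # Strip one common suffix, keeping at least four leading characters;
        # this only illustrates the idea and is not a real stemmer.
        for suffix in ("ing", "ed", "ion", "s"):
            if word.endswith(suffix) and len(word) - len(suffix) >= 4:
                return word[: -len(suffix)]
        return word

    # naive_stem("connected")  -> "connect"
    # naive_stem("connection") -> "connect"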


HTML tag tree

• Crawlers may assess the value of a URL or a content word by examining the HTML tag context in which it resides.

• Often the crawler only needs the links within a page and the text, or portions of the text, which can be obtained using HTML parsers.


Example


URL Normalization

• Needed to avoid crawling the same resource more than once.

• Also called URL canonicalization, it refers to the process of modifying and standardizing a URL in a consistent manner.

• Several types of normalization may be performed, including conversion of URLs to lowercase, removal of "." and ".." segments, and adding trailing slashes to the non-empty path component.


Crawler identification

• Crawlers typically identify themselves to a Web server in the User-agent field of an HTTP request. This identification is useful for administrators, who can then know when to expect their Web pages to be indexed by a particular search engine.

• Spambots and other malicious Web crawlers are unlikely to place identifying information in the user agent field, or they may mask their identity as a browser or other well-known crawler.


Multi-threaded Crawlers

• A sequential crawling loop spends a large amount of time in which either the CPU or the network is idle.

• In a multi-threaded crawler, each thread follows its own crawling loop, which can provide a reasonable speed-up and efficient use of the available bandwidth (see the sketch below).
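A minimal sketch of such a multi-threaded loop; fetch_page and extract_links stand in for the fetching and parsing steps sketched earlier, and the seed URL and thread count are arbitrary examples:

    import queue
    import threading

    frontier = queue.Queue()          # thread-safe shared frontier
    seen = set()                      # URLs already queued
    seen_lock = threading.Lock()

    def crawl_worker():
        # Each thread runs its own fetch-parse loop over the shared frontier.
        while True:
            url = frontier.get()
            try:
                status, body = fetch_page(url)            # hypothetical helper
                if body is None:
                    continue
                for link in extract_links(url, body):     # hypothetical helper
                    with seen_lock:
                        if link not in seen:
                            seen.add(link)
                            frontier.put(link)
            finally:
                frontier.task_done()

    seed = "http://example.com/"      # illustrative seed URL
    seen.add(seed)
    frontier.put(seed)
    for _ in range(8):                # example thread count
        threading.Thread(target=crawl_worker, daemon=True).start()
    frontier.join()                   # block until the frontier is exhausted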


Page Importance

• Keywords in document
• Similarity to a query
• Similarity to seed pages
• Classifier score
• Retrieval system rank
• Link-based popularity


Summary Analysis

• Acquisition rate
• Average relevance
• Target recall
• Robustness


Nutch

• Nutch is an open-source Web crawler.

• Nutch Web Search Application:

– Maintains a DB of pages and links
– Pages have scores, assigned by analysis
– Fetches high-scoring, out-of-date pages
– Distributed search front end
– Based on Lucene


Examples

• Yahoo Crawler (Slurp) is the name of the Yahoo! Search crawler.

• Google Crawler has been described, but the reference covers only an early version of its architecture, which was written in C++ and Python.


Open-source crawlers

• Aspseek is a crawler, indexer and search engine written in C++ and licensed under the GPL.

• DataparkSearch is a crawler and search engine released under the GNU General Public License.

• YaCy, a free distributed search engine, built on principles of peer-to-peer networks (licensed under GPL).