
Y.M.C.A UNIVERSITY OF ENGINEERING, FARIDABAD.

PROJECT-SRS

ON

WEB CRAWLER

MAYUR GARG

ROLL NO: IT-2337-2K7

MENTOR: Mrs. DEEPIKA

Objective/ Aim:

This project aims at developing a highly efficient WEB CRAWLER that browses the World Wide Web in a methodical, automated manner.

Introduction:

A Web crawler is a computer program that browses the World Wide Web in a methodical, automated manner. Other terms for Web crawlers are ants, automatic indexers, bots, worms, Web spiders, Web robots, or (especially in the FOAF community) Web scutters.

The process is called Web crawling or spidering. Many sites, in particular search engines, use spidering as a means of providing up-to-date data. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine that will index the downloaded pages to provide fast searches. Crawlers can also be used for automating maintenance tasks on a Web site, such as checking links or validating HTML code. Also, crawlers can be used to gather specific types of information from Web pages, such as harvesting e-mail addresses (usually for spam).

A Web crawler is one type of bot, or software agent. In general, it starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the frontier are recursively visited according to a set of policies.
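The seed and crawl-frontier idea described above can be sketched as follows. This is only a minimal illustration, not the project's actual code; the extractLinks helper is hypothetical and stands in for the page download and link extraction steps.

import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

// Minimal sketch of crawling from a list of seed URLs using a crawl frontier.
public class FrontierSketch {

    public static void crawl(List<String> seeds) {
        Queue<String> frontier = new ArrayDeque<>(seeds); // URLs still to visit
        Set<String> visited = new HashSet<>();            // URLs already visited

        while (!frontier.isEmpty()) {
            String url = frontier.remove();
            if (!visited.add(url)) {
                continue; // already visited, skip it
            }
            for (String link : extractLinks(url)) {
                if (!visited.contains(link)) {
                    frontier.add(link); // grow the crawl frontier
                }
            }
        }
    }

    // Hypothetical placeholder: a real crawler would download the page
    // and parse its hyperlinks.
    private static List<String> extractLinks(String url) {
        return List.of();
    }
}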

WEB CRAWLER ARCHITECTURE

Web crawlers typically identify themselves to a Web server by using the User-agent field of an HTTP request. Web site administrators typically examine their Web servers' logs and use the user agent field to determine which crawlers have visited the Web server and how often. The user agent field may include a URL where the Web site administrator may find out more information about the crawler. Spambots and other malicious Web crawlers are unlikely to place identifying information in the user agent field, or they may mask their identity as a browser or other well-known crawler.
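A minimal sketch of setting the User-agent field on an HTTP request is shown below; the crawler name and contact URL are placeholders, not values used by this project.

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Sketch of a crawler identifying itself via the User-agent field.
public class UserAgentSketch {
    public static void main(String[] args) throws IOException {
        URL url = new URL("http://example.com/");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        // The User-agent string names the crawler and points the Web site
        // administrator to a page describing it.
        conn.setRequestProperty("User-Agent",
                "ExampleCrawler/1.0 (+http://example.com/crawler-info)");
        System.out.println("Response code: " + conn.getResponseCode());
        conn.disconnect();
    }
}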

It is important for Web crawlers to identify themselves so that Web site administrators can contact the owner if needed. In some cases, crawlers may be accidentally trapped in a crawler trap or they may be overloading a Web server with requests, and the owner needs to stop the crawler. Identification is also useful for administrators that are interested in knowing when they may expect their Web pages to be indexed by a particular search engine.

Specifications:

The server node is responsible for managing a database, the URL DB, of URLs that have been visited and those to visit in the future. The controller process on the server maintains the URL DB and responds to client messages requesting a list of URLs to retrieve. The retriever process opens many connections to Web servers simultaneously and downloads their contents. Each client keeps a cache, the Robots DB, of the robots.txt files of the Web servers it contacts. Retrieved contents are stored on the client's local disk. The retriever returns two lists: retrieved URLs and found URLs. The found URLs are the links that were found in the retrieved pages. The controller process on the server receives these lists and registers them in the URL DB. It also extracts the new URLs that have not been retrieved yet and enqueues them.
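The controller's bookkeeping around the URL DB can be sketched as below, assuming an in-memory representation; the class and method names are illustrative only and do not come from the project's source.

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

// Sketch of the controller process managing the URL DB.
public class ControllerSketch {
    private final Set<String> visited = new HashSet<>();      // URLs already retrieved
    private final Queue<String> toVisit = new ArrayDeque<>(); // URLs to retrieve in future

    // Respond to a client's request for a batch of URLs to retrieve.
    public List<String> nextBatch(int size) {
        List<String> batch = new ArrayList<>();
        while (batch.size() < size && !toVisit.isEmpty()) {
            batch.add(toVisit.remove());
        }
        return batch;
    }

    // Register the two lists returned by a retriever: the URLs it has
    // retrieved and the URLs found as links inside the retrieved pages.
    public void register(List<String> retrievedUrls, List<String> foundUrls) {
        visited.addAll(retrievedUrls);
        for (String url : foundUrls) {
            // Enqueue only URLs that have not been retrieved yet.
            if (!visited.contains(url) && !toVisit.contains(url)) {
                toVisit.add(url);
            }
        }
    }
}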

Forms:

A form named 'Web Frame' is used here. It contains an entry field at the top of the application window where the user can type a valid URL (web address), including the "http://" portion. Only one content type is handled here, i.e. text/html.
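One way to enforce the text/html restriction is to check the Content-Type response header before parsing a page, as in the sketch below; this is a minimal illustration rather than the project's actual code.

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Sketch of filtering pages by content type so only text/html is processed.
public class ContentTypeSketch {
    public static boolean isHtml(String address) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(address).openConnection();
        conn.setRequestMethod("HEAD");              // ask for headers only
        String contentType = conn.getContentType(); // e.g. "text/html; charset=UTF-8"
        conn.disconnect();
        return contentType != null && contentType.startsWith("text/html");
    }

    public static void main(String[] args) throws IOException {
        System.out.println(isHtml("http://example.com/"));
    }
}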

The form also provides a Search/Stop button, with which the user can either start retrieving search results or stop the crawler whenever required. To search from a URL, the Search button is clicked.

The status area below the scrolling list reports which page is currently being searched. As the application encounters links on a page, it adds any new URLs to the scrolling list. The application remembers which pages it has already visited, so it will not search any Web page twice; this prevents infinite loops. As you inspect the list of URLs, you can see that the application performs a breadth-first search. In other words, it accumulates a list of all the links on the current page before it follows any of them to a new page. If you let the crawl run without stopping, it will eventually stop on its own once it has found 50 files, at which point it reports "reached search limit of 50." (You can increase the limit by changing the SEARCH_LIMIT constant in the source code.) The application will also stop automatically if it encounters a dead end, meaning that it has traversed all the files that are directly or indirectly reachable from the starting position you specified. If this happens, the application reports "done."
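The two stop conditions described above can be sketched as follows, assuming a SEARCH_LIMIT of 50 as stated in the text; the class and method names are illustrative, not taken from the project's source.

// Sketch of the crawler's stop conditions and status messages.
public class StopConditionSketch {
    static final int SEARCH_LIMIT = 50;

    // Called after each page is processed.
    static String checkStatus(int filesFound, boolean frontierEmpty) {
        if (filesFound >= SEARCH_LIMIT) {
            return "reached search limit of " + SEARCH_LIMIT;
        }
        if (frontierEmpty) {
            return "done"; // dead end: no more reachable pages
        }
        return "searching...";
    }
}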

The next time you click Search, the list of files gets cleared, and the search process starts over again.

Data Flow Diagrams:

User Operations

Admin Operations

Modules:-

Following are the modules used in the web crawler:-

Administrator Side:-

Page Setting:-

Log Setting:-

Search:-

Database Management:-