
CRAWL THE WEB USING APACHE NUTCH AND LUCENE

Nibras Othman Abdulwahid
Graduate School of Natural and Applied Sciences, Dept. of Mathematics and Computer Science
Ministry of Higher Education and Scientific Research, Iraq
[email protected]

Abstract: The availability of information in large quantities on the Web makes it difficult for users to select resources that match their information needs. A search engine collects data from the Web using a software program called a crawler, bot, or spider. In this study I focus on making the web crawler fetch only pages on related topics and reject topics that are not relevant. I use Apache Nutch and Apache Lucene to explain how web crawling works. Both Nutch and Lucene are released by the Apache Software Foundation. Nutch is a web search engine that searches and indexes web pages from the World Wide Web (WWW); it is free and open source and is built on top of Lucene, a free, open-source information retrieval library that provides the software needed to fetch, or crawl, the Web. The benefit of using Lucene in this study is that the format of the information on the Web, such as PDF, plain text or MS Word, does not matter: when this information is indexed, Lucene converts it into documents that are useful to the user. Apache Nutch and Lucene are written in Java.

Keywords: Search Engine, Web Crawling, Apache Nutch, Apache Lucene, Java open source.

1. Introduction

The World Wide Web has become the largest source of information for Internet users around the world. The considerable increase in the amount of information on the Web makes it difficult for users to select resources that match their information needs. Users obtain this information from the Web through "search engines". At the heart of every search engine there is a program called a "web crawler", which fetches URLs from the Web in large quantities. A web crawler typically fetches many pages on topics that are not relevant, while the user only needs a few pages on relevant topics. Search engines are essential services for finding content on the Web; Google and Yahoo! are popular search engines [1]. Commercial search engines such as Google and Yahoo! have over a hundred billion documents indexed [2], [3]. We searched for the keyword "tubitak" using Google and got a very large number of results (approximately 1,500,000), as shown in Figure 1 below.

Figure 1: Google results for the "tubitak" keyword

Search engines are therefore becoming more and more necessary and popular for surfing the Internet [4]. However, how search engines such as Google and Yahoo! work is unknown to many people. We used the Apache Nutch web crawler because it is open source, which means we can modify and extend it. The aim of this paper is, first, to introduce how a common search engine works through a study of the open-source search engine Nutch and the Lucene library, and second, to show how, using Nutch, a search engine can retrieve topics related to the user's query and exclude unrelated topics.

2. BACKGROUND

2.1 Web Crawling

A web crawler is a program or automated script that browses the World Wide Web in a systematic, automated manner. The structure of the WWW is a graph: the links present in a web page can be used to open other web pages. The Internet can be viewed as a directed graph in which each web page is a node and each hyperlink is an edge, so the search operation can be summarized as the traversal of a directed graph. By following the link structure of the Web, a web crawler can reach many new web pages starting from a single page; it moves from page to page by using this graph structure. Such programs are also known as robots, spiders, and worms. Web crawlers are designed to retrieve web pages and insert them into a local repository. Crawlers are basically used to create a replica of all the visited pages, which are later processed by a search engine that indexes the downloaded pages to support quick searches. A search engine's job is to store information about many web pages, which it retrieves from the WWW. These pages are retrieved by a web crawler, an automated web browser that follows each link it sees [6].
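To make the directed-graph view above concrete, the short Python sketch below represents a toy web as an adjacency dictionary and traverses it the way a crawler follows hyperlinks. The page names and links are invented for illustration; this is not Nutch code.

from collections import deque

# A toy web: each page (node) maps to the pages it links to (edges).
toy_web = {
    "page_a": ["page_b", "page_c"],
    "page_b": ["page_c"],
    "page_c": ["page_a", "page_d"],
    "page_d": [],
}

def reachable_pages(start):
    """Return every page reachable from 'start' by following hyperlinks."""
    visited = set()
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if page in visited:
            continue
        visited.add(page)
        for link in toy_web.get(page, []):
            queue.append(link)
    return visited

print(reachable_pages("page_a"))   # {'page_a', 'page_b', 'page_c', 'page_d'}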

2.2 Working of a Web Crawler

A web crawler begins with an initial set of URLs known as seed URLs. It downloads the web pages for the seed URLs and extracts the new links present in the downloaded pages. The retrieved web pages are stored and indexed in the storage area so that, with the help of these indexes, they can later be retrieved as and when required. The URLs extracted from a downloaded page are checked to determine whether their corresponding documents have already been downloaded. If they have not, the URLs are assigned to the crawler for further downloading. This process is repeated until no more URLs remain to be downloaded. A crawler downloads millions of pages per day to complete its target. Figure 2 shows the crawling process [6].

Figure 2: Flow of a crawling process

The working of a web crawler may be described as follows [6], [7]:

1. Select a starting seed URL or URLs.
2. Add it to the frontier.
3. Pick a URL from the frontier.
4. Fetch the web page corresponding to that URL.
5. Parse that web page to find new URL links.
6. Add all the newly found URLs to the frontier.
7. Go to step 2 and repeat until the frontier is empty.

Thus a web crawler recursively keeps inserting newer URLs into the database repository of the search engine. The major functions of a web crawler are therefore to insert new links into the frontier and to choose a fresh URL from the frontier for further processing after every recursive step [6].
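A minimal sketch of this seed-and-frontier loop is given below. It only illustrates the steps listed above and is not Nutch's implementation; the seed URL, the page limit, and the use of Python's standard urllib and html.parser modules are assumptions made for the sketch.

from urllib.request import urlopen
from urllib.parse import urljoin
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag found while parsing."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_urls, max_pages=10):
    frontier = list(seed_urls)            # steps 1-2: seed URLs form the frontier
    downloaded = set()
    while frontier and len(downloaded) < max_pages:
        url = frontier.pop(0)             # step 3: pick a URL from the frontier
        if url in downloaded:
            continue                      # already fetched, skip it
        try:
            html = urlopen(url).read().decode("utf-8", errors="ignore")
        except Exception:
            continue                      # unreachable pages are simply skipped
        downloaded.add(url)               # step 4: fetch the page
        parser = LinkExtractor()
        parser.feed(html)                 # step 5: parse it to find new links
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute not in downloaded:
                frontier.append(absolute) # step 6: add new URLs to the frontier
    return downloaded                     # step 7: repeat until the frontier is empty

# Example: crawl a handful of pages starting from the seed used in this study.
print(crawl(["http://tubitak.gov.tr/"], max_pages=5))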

3. MATERIALS AND IMPLEMENTATION OF WORK

3.1 MATERIALS

This part gives a brief overview of Apache Nutch and Apache Lucene and recalls the function of each of them. We then display the results.

3.1.1 Apache Nutch

Nutch is an open-source search engine based on Lucene Java, an open-source information retrieval library supported by the Apache Software Foundation. Lucene supplies the search and index components, while Nutch provides a crawler program, an index engine, and a query engine. Nutch consists of the following three parts [4], [11]:

1. Page collection (fetch): the page-collection program, using timed or incremental collection, chooses the URLs whose pages are to be visited; these pages are then fetched to the local disk by the crawler.
2. Index creation: the indexing program converts the pages or other files into text documents, divides them into segments, filters out useless information, and then creates and maintains indexes, which are composed of smaller indexes based on keywords or inverted documents.
3. Searching: the searcher program accepts the user's query words, segments and filters them, and divides them into groups of keywords, according to which the corresponding pages are matched in the index repository. It then ranks the matches and returns the results to the users.

The overall framework of Nutch is shown in Figure 3.

Figure 3: Framework of Nutch
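The division of labour between these three parts can be pictured with a small, self-contained sketch. The pages, the index structure, and the ranking by match count below are simplifications chosen for illustration; Nutch's real fetcher, index engine, and query engine are far more elaborate.

# 1. Pages collected by the fetcher (here just hard-coded text).
fetched_pages = {
    "http://example.org/a": "open source search engine based on lucene",
    "http://example.org/b": "the lucene library provides indexing and search",
}

STOP_WORDS = {"a", "an", "the", "in", "on", "and", "of", "is"}

# 2. Index creation: map every useful keyword to the pages containing it.
index = {}
for url, text in fetched_pages.items():
    for word in text.split():
        if word not in STOP_WORDS:
            index.setdefault(word, set()).add(url)

# 3. Searcher: split the query into keywords, drop stop words, match pages
#    in the index, and rank them by the number of matching keywords.
def search(query):
    keywords = [w for w in query.lower().split() if w not in STOP_WORDS]
    scores = {}
    for word in keywords:
        for url in index.get(word, set()):
            scores[url] = scores.get(url, 0) + 1
    return sorted(scores, key=scores.get, reverse=True)

print(search("lucene search engine"))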

3.1.2 Apache Lucene

Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform applications. Figure 4 below displays the framework of Lucene [8].


Figure 4: Framework of Lucene

Lucene follows three basic steps. The first is conversion: Lucene does not care about the source of the data, its format, or even its language, as long as you can convert it to text. This means you can use Lucene to index and search data stored in files, web pages on remote web servers, documents stored in local file systems, simple text files, Microsoft Word documents, HTML or PDF files, or any other format from which you can extract textual information. Figure 5 below illustrates document conversion.

Figure 5: Document Converting via Lucene
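As a rough illustration of this first step, the fragment below converts two different kinds of sources, a plain-text string and a small piece of HTML, into a uniform in-memory "document" (plain text plus metadata fields). The document structure and helper name are invented for this sketch; Lucene's own Document and Field classes live in its Java API.

import re

def to_document(source_type, content, title, author):
    """Reduce any source to plain text plus metadata fields."""
    if source_type == "html":
        text = re.sub(r"<[^>]+>", " ", content)   # strip the markup, keep the text
    else:
        text = content                            # already plain text
    return {"title": title, "author": author, "text": " ".join(text.split())}

docs = [
    to_document("text", "Nutch is built on top of Lucene.", "About Nutch", "author one"),
    to_document("html", "<p>Lucene is a <b>text search</b> library.</p>", "About Lucene", "author two"),
]

for d in docs:
    print(d["title"], "->", d["text"])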

The second step is analysis. After the data have been converted and Lucene Documents have been created, the indexed data must be prepared and organized; Lucene makes the data more convenient for indexing. To do this, Lucene segments the textual data into parts and removes frequent but meaningless tokens from the input, such as stop words (a, an, the, in, on, and so on) in English text. An important point is that documents may contain metadata such as the author, the title, the last modified date, and potentially much more. Metadata means "data about data"; this information must be separated out and indexed as separate fields. The third step is storing the index: Lucene organizes the indexed documents so that words or numbers can be accessed quickly. Lucene breaks the indexed documents into terms; for example, the sentence "This is a red car" is separated into the tokens This, is, a, red, car, and Lucene then filters out the stop words, as shown in Figure 6 below.

Figure 6: Indexing Process
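The analysis and index-storage steps can be sketched in a few lines. The example reuses the sentence "This is a red car" from the text; the tokenizer, the stop-word list, and the inverted-index dictionary below are simplifications for illustration, not Lucene's actual analyzers or index format.

STOP_WORDS = {"a", "an", "the", "in", "on", "is", "this"}

def analyze(text):
    """Tokenize, lowercase, and drop stop words, as the analysis step does."""
    tokens = text.lower().split()
    return [t for t in tokens if t not in STOP_WORDS]

documents = {
    "doc1": "This is a red car",
    "doc2": "The red car is fast",
}

# Storing the index: an inverted index mapping each term to the documents
# that contain it, so terms can be looked up quickly at search time.
inverted_index = {}
for doc_id, text in documents.items():
    for term in analyze(text):
        inverted_index.setdefault(term, set()).add(doc_id)

print(analyze("This is a red car"))   # ['red', 'car']
print(inverted_index)                 # {'red': {'doc1', 'doc2'}, 'car': ..., 'fast': ...}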

3.2 IMPLEMENTATION OF NUTCH WEB CRAWLING

In this part we run our web crawler. It must start from a seed URL; in this study we chose the URL http://tubitak.gov.tr/. There are four factors affecting the crawler's behaviour:

1. Depth: the depth of the download, i.e. how many link levels are followed from the seed.
2. topN: the maximum number of page hyperlinks selected for download in each round.
3. Threads: the number of threads the download program uses.
4. Delay: the delay time between visits to the same host.

The work process of the Nutch crawler includes four steps, as follows:

1. Create the initial collection of URLs.
2. Begin fetching based on the pre-defined Depth, topN, Threads and Delay.
3. Create the new URL waiting list and start the new round of fetching.
4. Combine the resources downloaded onto the local disk.

Figure 7 below shows the crawl life cycle [12].

Figure 7: Crawl Life cycle
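The way the four factors interact can be sketched as nested rounds: each depth level produces a fetch list, the fetch list is capped at topN, and a delay is observed between requests. The sketch below is only a schematic of this cycle with a placeholder fetch function; Nutch's real crawl additionally distributes the fetching over several threads.

import time

def fetch(url):
    """Placeholder fetcher: pretend to download a page and return its outlinks."""
    print("fetching", url)
    return []   # a real fetcher would parse the page and return new URLs

def crawl(seed_urls, depth=3, top_n=50, delay=1.0):
    fetch_list = list(seed_urls)                 # round 0: the seed URLs
    crawled = set()
    for round_number in range(depth):            # Depth: number of fetch rounds
        fetch_list = fetch_list[:top_n]          # topN: cap the pages per round
        next_round = []
        for url in fetch_list:
            if url in crawled:
                continue
            next_round.extend(fetch(url))        # collect outlinks for the next round
            crawled.add(url)
            time.sleep(delay)                    # Delay: politeness between requests
        fetch_list = [u for u in next_round if u not in crawled]
        if not fetch_list:
            break                                # nothing left to fetch
    return crawled

crawl(["http://tubitak.gov.tr/"], depth=2, top_n=10, delay=0.5)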


4. RESULTS

4.1 Crawl Testing

First, the test starts by crawling the TUBITAK website, http://tubitak.gov.tr/. Figures 7 and 8 show the start and finish of the crawling process.

Figure 7: Start Crawling

Figure 8: Finish Crawling

4.2 Search Testing

We deploy the project into Tomcat, start Tomcat, and visit http://localhost:8080/nutch-1.1/en/; Figure 9 then appears to the user.

Figure 9: Nutch Interface

The user can then enter a query or keyword in the search box, and the results can be seen in Figure 10.

Figure 10: Result

4.3 USE LUKE TO ANALYZE THE LUCENE INDEXES

Luke is a handy development and diagnostic tool that accesses existing Lucene indexes and allows you to display and modify their content in several ways [9]:

- browse by document number or by term
- view documents and copy them to the clipboard
- retrieve a ranked list of the most frequent terms
- execute a search and browse the results
- analyze search results
- selectively delete documents from the index
- reconstruct the original document fields, edit them, and re-insert them into the index
- optimize indexes

Luke allows you to look at individual documents in an index, as well as perform ad hoc queries. Figure 11 shows the merged index for our example, found in the index directory.

Figure 11: Luke Lucene Results

We can search for any word that exists on http://tubitak.gov.tr/; for example, we searched for "sonucunda" and got results, as shown in Figure 12.


Figure 12: Search result about "sonucunda" keyword

4.4 WORD FREQUENCY

In this study we took a sample example to explain word frequency, that is, how many times each word appears in a web page. We used the Python programming language to count the frequency of each word in the web page. Our work is divided into two parts: the first part reads the web page and extracts all the words that appear in it; the second part splits the words into two groups, meaningful words and stop words. A meaningful word is, for example, "University", while stop words are words such as "the", "to", "an", and so on. The meaningful words are the important ones; they are the words that are counted after all stop words have been removed. In this study we took the Cankaya University web page to count all the words. Figure 13 below shows the Cankaya University history web page with its words.

Figure 13: Cankaya University Web page

We took a small example to test that our code reads a few sentences correctly; Figure 14 shows the resulting list of words.

Figure 14: List of words

Our code therefore works correctly when reading a list of sentences. We then read the Cankaya history web page and indexed all the words in it. The words of the Cankaya history text lie between the tags <p align=justify> and </p>; Figure 15 shows the Python code used to index the words of the Cankaya history web page.

Figure 15: Python code for indexing words

Now we can put the URL (http://www.cankaya.edu.tr/universite/tarihce_en.php) into our code, and it returns all the words of the web page with their frequencies, sorted in descending order of frequency, and prints the list of words. Figure 16 below shows the word frequencies.

Figure 16: Words-frequency

After running the code above, we remove all stop words and keep only the meaningful words; Figure 17 shows the stop words removed from the list.

Figure 17: stop words list


def stripTags(pageContents):
    # Keep only the part of the page starting at the history paragraph tag.
    startLoc = pageContents.find("<p align=justify>")
    pageContents = pageContents[startLoc:]

    inside = 0
    text = ''

    # Walk through the markup character by character, copying only the
    # characters that lie outside of HTML tags.
    for char in pageContents:
        if char == '<':
            inside = 1
        elif inside == 1 and char == '>':
            inside = 0
        elif inside == 1:
            continue
        else:
            text += char

    return text
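The counting and filtering described in this section, which is not shown above, can be reconstructed along the following lines. This is a sketch consistent with the description rather than the exact code used in the study; the stop-word list is abbreviated, and the page is fetched with Python's standard urllib.

from collections import Counter
from urllib.request import urlopen

STOP_WORDS = {"the", "to", "an", "a", "and", "of", "in", "is", "on", "for"}  # abbreviated list

def word_frequencies(url):
    page = urlopen(url).read().decode("utf-8", errors="ignore")
    text = stripTags(page)                        # reuse the tag-stripping function above
    words = [w.strip('.,()"').lower() for w in text.split()]
    counts = Counter(w for w in words if w)
    # Sort in descending order of frequency, as in Figure 16.
    return counts.most_common()

def meaningful_words(frequencies):
    # Remove the stop words and keep only the meaningful words, as in Figure 18.
    return [(word, count) for word, count in frequencies if word not in STOP_WORDS]

freqs = word_frequencies("http://www.cankaya.edu.tr/universite/tarihce_en.php")
for word, count in meaningful_words(freqs)[:10]:
    print(word, count)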


Figure 18 and Table 1 below show the list of meaningful words after removing all stop words. After this step we draw a chart of the meaningful words with the number of repetitions of each word in the web page.

Figure 18: meaningful words and chart

Table 1: List of word frequencies

4.5 TAG CLOUD AND WORD COUNTER

After the completion of the indexing process, we used a tag cloud and word counter to view the most important words in each web page that had been indexed via Lucene. The advantage of using a tag cloud is that it displays the words in different shapes and colors, and we can control the font size, the colors, and the places where the words are shown. Figures 13 and 14 show the tag clouds (text clouds); in the figures below we can see that important words such as University and Research appear in a large size. We used two URLs to view the important words:

1. http://www.cankaya.edu.tr/universite/tarihce_en.php
2. http://tubitak.gov.tr/tr/kurumsal/icerik-hakkimizda

Figure 13: Cankaya University Tag Cloud

Figure 14: TUBITAK Web Page Tag Cloud
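The core idea of a tag cloud, that a word's display size grows with its frequency, can be shown in a few lines. The HTML output and the size range below are arbitrary choices for the sketch; the actual tag cloud and word counter tools used in the study offer many more options for shape and colour.

def tag_cloud_html(frequencies, min_size=10, max_size=40):
    """Map each word's frequency to a font size between min_size and max_size."""
    if not frequencies:
        return ""
    largest = max(frequencies.values())
    spans = []
    for word, count in sorted(frequencies.items()):
        size = min_size + (max_size - min_size) * count / largest
        spans.append('<span style="font-size:%dpx">%s</span>' % (size, word))
    return " ".join(spans)

# Example frequencies for a few of the important words mentioned above.
print(tag_cloud_html({"University": 12, "Research": 9, "Ankara": 3}))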

5. CONCLUSIONS

In this study we used Apache Nutch, an open-source search engine. It contains many rich libraries that can be used for information retrieval from the Web, and it supports multiple languages. An important reason to use Nutch is that it works on top of Lucene. It is open source and written in Java, and it does not care about the type of the data, such as PDF, MS Word or HTML: after Nutch fetches the web pages, Lucene indexes and analyzes these data and converts them to text. The biggest advantage of using the Nutch search engine is its high transparency; because it is open source, we can extend and develop it.

ACKNOWLEDGMENT

I would like to thank the Minister and the employees of the Ministry of Higher Education and Scientific Research, who have helped enrich my knowledge, for all their support and encouragement. I would also like to thank my husband and my family for all their support and belief in me, and for boosting my morale during rough times.

I would like to thank my advisor, Dr. Abdül Kadir, for providing me with the valuable resources and insights needed for my project. He has been a guide for me throughout the project, and I really appreciate the valuable time he spent on the betterment of this project.


REFERENCES

[1] http://www.ebizmba.com/articles/search-engines
[2] B. Barla Cambazoglu, Flavio P. Junqueira and Vassilis Plachouras, "A Refreshing Perspective of Search Engine Caching", International World Wide Web Conference Committee (IW3C2), April 26-30, 2010.
[3] Sergey Brin and Lawrence Page, "The anatomy of a large-scale hypertextual Web search engine", Computer Networks and ISDN Systems, April 1998.
[4] Guojun Yu, Xiaoyao Xie and Zhijie Liu, "The Design and Realization of Open-Source Search Engine Based on Nutch", IEEE, July 18-20, 2010, pp. 176-180.
[5] https://nutch.apache.org/
[6] Md. Abu Kausar, V. S. Dhaka and Sanjeev Kumar Singh, "Web Crawler: A Review", International Journal of Computer Applications, February 2013, p. 31.
[7] Monica Peshave, "How Search Engines Work and a Web Crawler Application", http://www.micsymposium.org/mics_2005/papers/paper89.pdf
[8] http://lucene.apache.org/
[9] http://www.getopt.org/luke/
[10] Tom White, "Introduction to Nutch", January 10, 2006, https://today.java.net/pub/a/today/2006/01/10/introduction-to-nutch-1.html
[11] Gheoca Razvan, Papadopoulos Constantinos and Pop Aurel, "Every Cam: Analysis of a Web Camera Search Engine", Computer Science Department, University of Washington, Seattle, WA 98105.
[12] Steve Watt, "Web Crawling and Data Gathering with Apache Nutch", January 31, 2011, http://www.slideshare.net/wattsteve/web-crawling-and-data-gathering-with-apache-nutch