CRAWL THE WEB USING APACHE NUTCH
AND LUCENE
Abstract: The availability of information in large quantities on the Web makes it difficult for users to select resources that match their information needs. A search engine collects data from the Web using a software program called a crawler, bot, or spider. In this study, I focus on making the Web crawler fetch only pages on relevant topics and reject those that are not relevant. I use Apache Nutch and Apache Lucene to illustrate how Web crawling works. Both Nutch and Lucene are released by the Apache Software Foundation. Nutch is a web search engine that searches and indexes Web pages from the World Wide Web (WWW). Nutch is free and open source, built on top of Lucene, a free, open-source information retrieval library that provides the software components needed to fetch and crawl the Web. A benefit of using Lucene in this study is that it does not matter what form the information on the WWW takes, such as PDF, txt, or MS Word: when this information is indexed, Lucene converts it into documents that are useful to the user. Apache Nutch and Lucene are written in Java.
Keywords: Search Engine, Web Crawling, Apache Nutch, Apache Lucene, Java, Open Source.
1. Introduction
The World Wide Web has become the largest source of information for Internet users around the world. Because of the considerable increase of information on the Web, it is difficult for users to select resources that match their information needs. Users obtain this information from the Web through "search engines". At the heart of every search engine there is a program called a "Web crawler", which fetches URLs from the Web in large quantities. A Web crawler typically fetches many pages on irrelevant topics, while the user needs only a few pages on relevant topics. Search engines are essential services for finding content on the Web; Google and Yahoo! are popular search engines [1]. Commercial search engines like Google and Yahoo! have over a hundred billion documents indexed [2] [3]. As an example, we searched Google for the keyword "tubitak" and got a very large number of results (about 1,500,000), as shown in Figure 1 below.
Figure 1: Google results for the "tubitak" keyword
So search engines are nowadays becoming more and more necessary and popular for surfing the Internet [4]. However, how search engines like Google and Yahoo! work is unknown to many people. We used the Apache Nutch Web crawler because it is open source, so we can modify and extend it. The aim of this paper is, first, to introduce how a common search engine works through a study of the open-source search engine Nutch and the Lucene library, and second, to show how, using Nutch, we can retrieve topics related to the user's query and exclude topics that are not related.
2. BACKGROUND
2.1 Web Crawling
A web crawler is a program or automated script that browses the World Wide Web in a systematic, automated manner. The structure of the WWW is a graph: the links presented in a web page may be used to open other web pages. The Internet can be seen as a directed graph with each webpage as a node and each hyperlink as an edge, so the search operation may be summarized as a process of traversing a directed graph. By following the link structure of the Web, a crawler may reach many new web pages starting from a single webpage. A web crawler moves from page to page by using the graph structure of the web pages. Such programs are also known as robots, spiders, and worms. Web crawlers are designed to retrieve Web pages and insert them into a local repository. Crawlers are basically used to create a replica of all the visited pages, which are later processed by a search engine that indexes the downloaded pages to support quick searches. A search engine's job is to store information about many web pages, which it retrieves from the WWW. These pages are retrieved by a Web crawler, an automated Web browser that follows every link it sees [6].

Nibras Othman Abdulwahid
Graduate School of Natural and Applied Sciences
Dept. of Mathematics and Computer Science
Ministry of Higher Education and Scientific Research, Iraq
2.2 Working of Web Crawler
The working of a Web crawler begins with an initial set of URLs known as seed URLs. The crawler downloads the web pages for the seed URLs and extracts the new links present in the downloaded pages. The retrieved web pages are stored and indexed in a storage area so that, with the help of these indexes, they can later be retrieved when required. The URLs extracted from a downloaded page are checked to determine whether their documents have already been downloaded. If not, the URLs are assigned to the crawler for further downloading. This process is repeated until no more URLs remain to download. Millions of pages are downloaded per day by a crawler to complete its target. Figure 2 shows the crawling process [6].
Figure 2: Flow of a crawling process
The working of a web crawler may be summarized as follows [6], [7]:
1. Select a starting seed URL or URLs.
2. Add it to the frontier.
3. Pick a URL from the frontier.
4. Fetch the web page corresponding to that URL.
5. Parse that web page to find new URL links.
6. Add all the newly found URLs to the frontier.
7. Go to step 3 and repeat until the frontier is empty.
Thus a web crawler recursively keeps inserting new URLs into the database repository of the search engine. The major functions of a web crawler are therefore to insert new links into the frontier and to select a fresh URL from the frontier for further processing after every recursive step [6].
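The loop above can be sketched in Python. This is an illustrative sketch only: `fetch_page` and `extract_links` are hypothetical stand-ins for a real HTTP fetcher and HTML link parser, not part of any actual crawler.

```python
from collections import deque

def crawl(seed_urls, fetch_page, extract_links, max_pages=100):
    """Minimal sketch of the crawler loop described above.

    fetch_page(url)     -> page content (hypothetical fetcher)
    extract_links(page) -> URLs found in the page (hypothetical parser)
    """
    frontier = deque(seed_urls)        # steps 1-2: seed the frontier
    visited = set(seed_urls)
    repository = {}                    # local store of downloaded pages

    while frontier and len(repository) < max_pages:
        url = frontier.popleft()       # step 3: pick a URL
        page = fetch_page(url)         # step 4: fetch the page
        repository[url] = page
        for link in extract_links(page):   # step 5: parse for new links
            if link not in visited:        # skip already-seen URLs
                visited.add(link)
                frontier.append(link)      # step 6: add to the frontier
    return repository                  # loop ends when the frontier is empty
```

With a toy in-memory "web" such as `{"a": ["b", "c"], "b": ["c"], "c": []}`, calling `crawl(["a"], ...)` visits all three pages exactly once, which is the behaviour the numbered steps describe.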
3. MATERIALS AND IMPLEMENTATION OF WORK
3.1 MATERIALS
This part briefly describes Apache Nutch and Apache Lucene and the function of each of them. Then we display the results.
3.1.1 Apache Nutch
Nutch is an open-source search engine based on Lucene Java, an open-source information retrieval library supported by the Apache Software Foundation, for its search and index components; it provides a crawler program, an index engine, and a query engine. Nutch consists of the following three parts [4], [11]:
1. Page collection (fetch): The page-collection program, by timely or incremental collection, chooses the URLs through which pages are to be visited, and the pages are then fetched to the local disk by the crawler.
2. Index creation: The indexing program converts the pages or other files into text documents, divides them into segments, filters out useless information, and then creates and assembles indexes, which are composed of smaller indexes based on keywords or inverted documents.
3. Searcher: The searcher program accepts the user's query words and, through segmentation and filtering, divides them into groups of keywords, according to which the corresponding pages are matched in the index repository. It then orders the matches by sorting and returns the results to the users.
The overall framework of Nutch is shown in Figure 3.
Figure 3: Framework of Nutch
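The index creation and searching described in parts 2 and 3 above can be illustrated with a toy inverted index in Python. This is a simplified sketch of the idea, not Nutch's actual implementation; the function names are ours.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each term to the set of document ids that contain it,
    i.e. a toy version of the keyword-based inverted index."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents matching every query keyword
    (intersection of the per-term posting sets)."""
    terms = query.lower().split()
    if not terms:
        return set()
    results = index.get(terms[0], set()).copy()
    for term in terms[1:]:
        results &= index.get(term, set())
    return results
```

For example, with documents `{1: "open source search engine", 2: "web search crawler", 3: "open web"}`, the query "open web" matches only document 3, because a multi-word query here requires every keyword to appear. A real engine would additionally rank the matches, as the searcher part describes.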
3.1.2 Apache Lucene
Apache Lucene is a high-performance, full-featured
text search engine library written entirely in Java. It is
a technology suitable for nearly any application that
requires full-text search, especially cross-platform.
Figure 4 below displays the framework of Lucene [8].
Figure 4: Framework of Lucene
Lucene follows three basic steps. First, document conversion: Lucene does not care about the source of the data, its format, or even its language, as long as you can convert it to text. This means you can use Lucene to index and search data stored in files, web pages on remote web servers, documents stored in local file systems, simple text files, Microsoft Word documents, HTML or PDF files, or any other format from which you can extract textual information. Figure 5 below displays the document conversion.
Figure 5: Document Converting via Lucene
The second step is analysis. After the data conversion process is complete and Lucene Documents have been created, the indexed data must be prepared and sorted; Lucene makes the data more convenient for indexing. To do this, Lucene segments the textual data into parts and removes from the input all tokens that are frequent but carry no meaning, such as stop words (a, an, the, in, on, and so on) in English text. An important point concerns documents that contain metadata such as the author, the title, the last-modified date, and potentially much more. Metadata is "data about data"; this information must be separated out and indexed as a separate section. The third step is storing the index: Lucene sorts the indexed documents, such as words or numbers, for quick access. Lucene breaks indexed documents into terms; for example, given "This is a red car", Lucene separates the sentence into the tokens This, is, a, red, car, and then filters out the stop words, as shown in Figure 6 below.
Figure 6: Indexing Process
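The tokenizing and stop-word filtering illustrated above can be sketched in Python. This is a minimal illustration of the idea only; the stop-word list here is a small illustrative subset, not Lucene's actual analyzer.

```python
# Illustrative subset of English stop words (not Lucene's real list).
STOP_WORDS = {"a", "an", "the", "in", "on", "this", "is"}

def analyze(text):
    """Lower-case the text, split it into tokens, and drop stop words,
    mirroring the 'This is a red car' example above."""
    tokens = text.lower().split()
    return [t for t in tokens if t not in STOP_WORDS]
```

Running `analyze("This is a red car")` keeps only the meaningful tokens `red` and `car`, which is exactly the filtering step shown in Figure 6.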
3.2 IMPLEMENTATION A NUTCH WEB
CRAWLING
In this part we run our Web crawler. It must start from a seed URL; in this study we chose the URL (http://tubitak.gov.tr/). There are four factors affecting the crawler's behavior:
1. Depth: the depth of the download.
2. topN: the maximum number of page hyperlinks to fetch in each round.
3. Threads: the number of threads the download program uses.
4. Delay: the delay time between visits to a host.
The work process of Nutch's crawler includes four steps:
1. Create the initial collection of URLs.
2. Begin fetching based on the pre-defined Depth, topN, Threads, and Delay.
3. Create the new URL waiting list and start the new round of fetching.
4. Unite the downloaded resources on the local disk.
Figure 7 below shows the Crawl Life cycle [12].
Figure 7: Crawl Life cycle
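The round-based fetching controlled by Depth and topN can be illustrated with a small Python simulation. This is a sketch under our own simplifying assumptions, not Nutch's code; `link_graph` is a toy stand-in for the Web.

```python
def rounds_of_fetching(link_graph, seeds, depth, top_n):
    """Simulate round-based crawling: one round per depth level,
    fetching at most top_n not-yet-fetched URLs per round."""
    fetched = []
    waiting = list(seeds)                     # step 1: initial URL list
    for _ in range(depth):                    # step 2: one round per level
        batch = [u for u in waiting if u not in fetched][:top_n]
        if not batch:
            break
        next_waiting = []
        for url in batch:
            fetched.append(url)               # "download" the page
            next_waiting.extend(link_graph.get(url, []))
        waiting = next_waiting                # step 3: new waiting list
    return fetched                            # step 4: the united local result
```

For example, with `link_graph = {"seed": ["a", "b", "c"], "a": ["d"], "b": ["e"]}`, a crawl with `depth=2, top_n=2` downloads the seed in round one and then only two of its three links in round two, showing how topN caps each round and Depth caps the number of rounds.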
4. RESULTS
4.1 Crawl Testing
First, the test starts by crawling the TUBITAK website http://tubitak.gov.tr/. Figures 7 and 8 show the start and finish of the process.
Figure 7: Start Crawling
Figure 8: Finish Crawling
4.2 Search Testing
We deploy the project into Tomcat, start Tomcat, and visit http://localhost:8080/nutch-1.1/en/ ; the page in Figure 9 then appears to the user.
Figure 9: Nutch Interface
The user can then enter a query or keyword in the search box; the results can be seen in Figure 10.
Figure 10: Result
4.3 USE LUKE TO ANALYZE THE
LUCENE INDEXES
Luke is a handy development and diagnostic tool, which accesses already existing Lucene indexes and allows you to display and modify their content in several ways [9]:
Browse by document number, or by term
View documents / copy to clipboard
Retrieve a ranked list of most frequent terms
Execute a search, and browse the results
Analyze search results
Selectively delete documents from the index
Reconstruct the original document fields, edit them, and re-insert them into the index
Optimize indexes
Luke allows you to look at individual documents in an
index, as well as perform ad hoc queries. Figure 11
shows the merged index for our example, found in the
index directory.
Figure 11: Luke Lucene Results
We can search for any word that exists on http://tubitak.gov.tr/; for example, we searched for "sonucunda" and got results. Figure 12 shows the results.
Figure 12: Search result about "sonucunda" keyword
4.4 WORD FREQUENCY
In this study, we took a sample example to explain word frequency, i.e., how many times each word appears in a Web page. We use the Python programming language to count the frequency of each word in the Web page. Our work is divided into two parts: the first part reads the Web page and extracts all the words appearing in it; the second part separates the words into two groups, meaningful words and stop words. A meaningful word is, for example, "University"; stop words are words such as "the", "to", and "an". The meaningful words are the important ones: they are counted, and all the stop words are removed. In this study we took the Cankaya University Web page and counted all its words. Figure 13 below shows the Cankaya history Web page with its words.
Figure 13: Cankaya University Web page
We took a sample example to test our code by reading some sentences; Figure 14 shows the list of words.
Figure 14: List of words
So our code correctly reads a list of sentences.
We then read the Cankaya history Web page and index all the words in it. The words of the Cankaya history page lie between the tags <p align=justify> and </p>; Figure 15 shows the code for reading the Cankaya history Web page.
Figure 15: Python code for indexing words
Now we can pass the URL (http://www.cankaya.edu.tr/universite/tarihce_en.php) to our code, which returns all the words of the Web page with their frequencies, sorted in descending order of frequency, and prints the list of words. Figure 16 below shows the word frequencies.
Figure 16: Word frequencies
After running the code above, we remove all the stop words and keep only the meaningful words; Figure 17 shows the stop words removed from the list.
Figure 17: stop words list
def stripTags(pageContents):
    # Keep only the content starting at the paragraph that holds the page body.
    startLoc = pageContents.find("<p align=justify>")
    pageContents = pageContents[startLoc:]
    inside = 0   # 1 while the current character is inside an HTML tag
    text = ''
    for char in pageContents:
        if char == '<':
            inside = 1
        elif inside == 1 and char == '>':
            inside = 0
        elif inside == 1:
            continue
        else:
            text += char   # keep characters outside of tags
    return text
Figure 18 and Table 1 below show the list of meaningful words after removing all the stop words. After this step we draw a chart of the meaningful words with the number of repetitions of each word in the Web page.
Figure 18: meaningful words and chart
Table 1: List of word- Frequencies
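The whole word-frequency procedure above (strip tags, count words, drop stop words, sort by descending frequency) can be sketched in Python on an inline HTML snippet. This is an illustrative, self-contained sketch; the stop-word list is a small illustrative subset.

```python
import re
from collections import Counter

# Illustrative subset of stop words, matching the examples in the text.
STOP_WORDS = {"the", "to", "an", "a", "in", "of", "and", "is"}

def word_frequencies(html):
    """Strip HTML tags, count word occurrences, drop stop words,
    and return (word, count) pairs sorted by descending frequency."""
    text = re.sub(r"<[^>]+>", " ", html)           # crude tag stripping
    words = re.findall(r"[a-zA-Z]+", text.lower()) # tokenize to letters only
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return counts.most_common()                    # sorted by count, descending
```

On the snippet `"<p align=justify>The University is a research University in the city</p>"`, the most frequent meaningful word is "university" with a count of 2, while "the", "is", "a", and "in" are filtered out as stop words.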
4.5 TAG CLOUD AND WORD
COUNTER
After the completion of the indexing process, we used a Tag Cloud and a Word Counter to view the most important words in each Web page indexed via Lucene. The advantage of using the Tag Cloud is that it displays the words in different shapes and colors, and we can control the font size, the colors, and the placement of the words. Figures 13 and 14 show the Tag Cloud (or Text Cloud); in the figures below, important words such as University and Research appear in a large size. We used two URLs to view the important words:
1. http://www.cankaya.edu.tr/universite/tarihce_en.php
2. http://tubitak.gov.tr/tr/kurumsal/icerik-hakkimizda
Figure 13: Cankaya University Tag Cloud
Figure 14: TUBITAK Web Page Tag Cloud
5. CONCLUSIONS
In this study we used Apache Nutch, an open-source search engine. It contains many rich libraries that can be used for information retrieval from the Web, and it supports multiple languages. An important reason to use Nutch is that it works on top of Lucene. It is open source and written in Java, and it does not care about the type of the data, such as PDF, MS Word, or HTML. After Nutch fetches web pages, Lucene indexes and analyzes the data, converting it to text. One of the biggest advantages of the Nutch search engine is its high transparency: because it is open source, we can extend and develop it.
ACKNOWLEDGMENT
I would like to thank the Minister and the employees of the Ministry of Higher Education and Scientific Research, who have helped enrich my knowledge, for all their support and encouragement. Finally, I would like to thank my husband and my family for all their support and belief in me, and for boosting my morale during rough times.
I would like to thank my advisor, Dr. Abdül Kadir, for providing me with the valuable resources and insights needed for my project. He has been a guide for me throughout the project, and I really appreciate the valuable time he spent for the betterment of this project.
REFERENCES
[1] http://www.ebizmba.com/articles/search-engines
[2] B. Barla Cambazoglu, Flavio P. Junqueira and
Vassilis Plachouras, (2010, April 26–30). A
Refreshing Perspective of Search Engine
Caching. International World Wide Web
Conference Committee (IW3C2).
[3] Brin, Sergey and Page Lawrence. The anatomy of a
large-scale hypertextual Web search engine.
Computer Networks and ISDN Systems, April
1998
[4] Guojun Yu, Xiaoyao Xie, and Zhijie Liu, (18-20 July 2010). "The Design and Realization of Open-Source Search Engine Based on Nutch". IEEE, pp. 176-180.
[5] https://nutch.apache.org/
[6] Md. Abu Kausar, V. S. Dhaka and Sanjeev Kumar
Singh, (February 2013). Web Crawler: A Review.
International Journal of Computer Applications,
pp.31.
[7] Monica Peshave, How Search Engines Work and a Web Crawler Application. http://www.micsymposium.org/mics_2005/papers/paper89.pdf
[8] http://lucene.apache.org/
[9] http://www.getopt.org/luke/
[10] https://today.java.net/pub/a/today/2006/01/10/introduction-to-nutch-1.html, by Tom White, January 10, 2006
[11] Gheoca Razvan, Papadopoulos Constantinos, Pop
Aurel, "Every Cam", Analysis of a Web Camera
Search Engine. Computer Science Department,
University of Washington, Seattle WA 98105
[12] Steve Watt, Web Crawling and Data Gathering with Apache Nutch, Jan 31, 2011, http://www.slideshare.net/wattsteve/web-crawling-and-data-gathering-with-apache-nutch