thesis final on project search

8/18/2019 Thesis Final on project search


CHAPTER I

INTRODUCTION

1.1 E-BUSINESS

Electronic business, commonly referred to as "e-business" or an internet business, is defined as the application of information and communication technologies (ICT) in support of all the activities of business, such as buying and selling products, with all the business-related transactions automated online. Examples of e-business web sites are amazon.com, ebay.com, flipkart.com, snapdeal.com and many more. Nowadays, the quantity of information available on e-business sites is rising rapidly. One of the most important goals of these e-businesses is to provide relevant information to the user at the click of a mouse.

In such a huge, fragmented and unstructured information collection, today's greatest problem is finding relevant information. For this online information retrieval system we use a machine learning technique, the genetic algorithm, to find relevant information.

1.2 INFORMATION RETRIEVAL

Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers). As defined in this way, information retrieval used to be an activity that only a few people engaged in: reference librarians, paralegals, and similar professional searchers. Now the world has changed, and hundreds of millions of people engage in information retrieval every day when they use a web search engine or search their email. Information retrieval is fast becoming the dominant form of information access, overtaking traditional database-style searching.

IR can also cover other kinds of data and information problems beyond that specified in the core definition above. The term "unstructured data" refers to data which does not have clear, semantically overt, easy-for-a-computer structure. It is the opposite of structured data, the canonical example of which is a relational database, of the sort companies usually use to maintain product inventories and personnel records. The field of information retrieval also covers supporting users in browsing or filtering document collections or further processing a set of retrieved documents. Given a set of documents, clustering is the task of coming up with a good grouping of the documents based on their contents. It is similar to arranging books on a bookshelf according to their topic.

Information retrieval systems can also be distinguished by the scale on which they operate, and it is useful to distinguish three prominent scales. In web search, the system has to provide search over billions of documents stored on millions of computers. Distinctive issues include gathering documents for indexing, building systems that work efficiently at this enormous scale, and handling particular aspects of the web, such as the exploitation of hypertext and not being fooled by site providers manipulating page content in an attempt to boost their search engine rankings, given the commercial importance of the web.

An information retrieval system is needed for the following reasons:

1. To process large document collections quickly. The amount of online data has grown at least as quickly as the speed of computers, and we would now like to be able to search collections that total in the order of billions to trillions of words.

2. To allow more flexible matching operations. For example, it is impractical to perform the query Romans NEAR countrymen with grep, where NEAR might be defined as "within 5 words" or "within the same sentence".

3. To allow ranked retrieval: in many cases you want the best answer to an information need among many documents that contain certain words.

1.3 WEB SEARCH ENGINE

A web search engine is a software system that is designed to search for information on the World Wide Web. The search results are generally presented in a line of results, often referred to as search engine results pages (SERPs). The information may be a mix of web pages, images, information and other types of files. Some search engines also mine data available in databases or open directories. Unlike web directories, which are maintained only by human editors, search engines also maintain real-time information by running an algorithm on a web crawler.

Fig 1.1 The various components of a web search engine

A search engine operates in the following order:

1. Web crawling

2. Indexing

3. Searching

1.4 WEB CRAWLING

Web crawling is the process by which we gather pages from the Web, in order to index them and support a search engine. The objective of crawling is to quickly and efficiently gather as many useful web pages as possible, together with the link structure that interconnects them. A web crawler is sometimes referred to as a spider.

1.4.1 FEATURES A CRAWLER MUST PROVIDE

We list the desiderata for web crawlers in two categories: features that web crawlers must provide, followed by features they should provide.

Robustness: The Web contains servers that create spider traps, which are generators of web pages that mislead crawlers into getting stuck fetching an infinite number of pages in a particular domain. Crawlers must be designed to be resilient to such traps. Not all such traps are malicious; some are the inadvertent side-effect of faulty website development.

Politeness: Web servers have both implicit and explicit policies regulating the rate at which a crawler can visit them. These politeness policies must be respected.

1.4.2 FEATURES A CRAWLER SHOULD PROVIDE

Distributed: The crawler should have the ability to execute in a distributed fashion across multiple machines.

Scalable: The crawler architecture should permit scaling up the crawl rate by adding extra machines and bandwidth.

Performance and efficiency: The crawl system should make efficient use of various system resources including processor, storage and network bandwidth.

Quality: Given that a significant fraction of all web pages are of poor utility for serving user query needs, the crawler should be biased towards fetching "useful" pages first.

Freshness: In many applications, the crawler should operate in continuous mode: it should obtain fresh copies of previously fetched pages. A search engine crawler, for instance, can thus ensure that the search engine's index contains a fairly current representation of each indexed web page. For such continuous crawling, a crawler should be able to crawl a page with a frequency that approximates the rate of change of that page.

Extensible: Crawlers should be designed to be extensible in many ways, to cope with new data formats, new fetch protocols, and so on. This demands that the crawler architecture be modular.

1.4.3 CRAWLING

The basic operation of any hypertext crawler (whether for the Web, an intranet or other hypertext document collection) is as follows. The crawler begins with one or more URLs that constitute a seed set. It picks a URL from this seed set, and then fetches the web page at that URL. The fetched page is then parsed, to extract both the text and the links from the page (each of which points to another URL). The extracted text is fed to a text indexer. The extracted links (URLs) are then added to a URL frontier, which at all times consists of URLs whose corresponding pages have yet to be fetched by the crawler. Initially, the URL frontier contains the seed set; as pages are fetched, the corresponding URLs are deleted from the URL frontier. The entire process may be viewed as traversing the web graph. In continuous crawling, the URL of a fetched page is added back to the frontier for fetching again in the future.
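The crawl loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the crawler used in this thesis: the in-memory `TOY_WEB` dictionary, the `fetch` stand-in and the `max_pages` limit are all hypothetical, and no network access, politeness policy or spider-trap protection is modelled.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
TOY_WEB = {
    "http://a.example/": ("seed page", ["http://b.example/", "http://c.example/"]),
    "http://b.example/": ("page b", ["http://c.example/", "http://a.example/"]),
    "http://c.example/": ("page c", ["http://d.example/"]),
    "http://d.example/": ("page d", []),
}


def fetch(url):
    """Stand-in for an HTTP fetch plus parse: returns (text, links)."""
    return TOY_WEB.get(url, ("", []))


def crawl(seed_urls, max_pages=100):
    """Traverse the toy web graph breadth-first, starting from the seed set."""
    frontier = deque(seed_urls)   # URL frontier: URLs whose pages are not yet fetched
    seen = set(seed_urls)         # URLs already added to the frontier at some point
    index = {}                    # stands in for the text indexer
    while frontier and len(index) < max_pages:
        url = frontier.popleft()  # pick a URL and delete it from the frontier
        text, links = fetch(url)
        index[url] = text         # feed the extracted text to the "indexer"
        for link in links:        # add newly discovered links to the frontier
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return index
```

Calling `crawl(["http://a.example/"])` visits every page reachable from the seed exactly once. A continuous crawler would instead re-append each fetched URL to the frontier so that it is fetched again later.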

Fig 1.2 The web as a graph

1.4.4 ARCHITECTURE OF WEB CRAWLER

Fig 1.3 Architecture of a basic web crawler. Components shown: WWW, DNS, Fetch, Parse, Content Seen?, URL Filter, Dup URL Elim and URL Frontier, with the data stores Doc FP's, Robots Templates and URL Set.



1.5 PREPROCESSING

1. Collect the documents to be indexed.

2. Tokenize the text. Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation. Here is an example of tokenization:

Input: Friends, Romans, Countrymen, Lend me your ears;

Output: Friends | Romans | Countrymen | Lend | me | your | ears

3. Do linguistic preprocessing of tokens.

I. Dropping common terms (stop words): Sometimes, some extremely common words which would appear to be of little value in helping select documents matching a user need are excluded from the vocabulary entirely. These words are called stop words. Example:

a an and are as at be by for from

has he in is it its of on that the

to was were will with

II. Capitalization/case-folding: A common strategy is to do case-folding by reducing all letters to lower case. Often this is a good idea: it will allow instances of Automobile at the beginning of a sentence to match with a query of automobile.

III. Stemming and lemmatization: The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma. For instance:

am, are, is → be

car, cars, car's, cars' → car

The result of this mapping of text will be something like:

the boy's cars are different colors → the boy car be differ color

The most common algorithm for stemming English, and one that has repeatedly been shown to be empirically very effective, is Porter's algorithm (Porter 1980).
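The preprocessing steps above can be illustrated with a short Python sketch. It is an illustration only: the function names are hypothetical, the tokenizer is a simple regular expression, and `crude_stem` is a deliberately naive suffix stripper that stands in for a real stemmer. It is not Porter's algorithm and does no lemmatization.

```python
import re

# The stop word list given in the text above.
STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "by", "for", "from",
    "has", "he", "in", "is", "it", "its", "of", "on", "that", "the",
    "to", "was", "were", "will", "with",
}


def tokenize(text):
    """Chop text into tokens, throwing away punctuation other than apostrophes."""
    return re.findall(r"[A-Za-z0-9']+", text)


def crude_stem(token):
    """Naive suffix stripping (an assumption for illustration, NOT Porter's algorithm)."""
    for suffix in ("'s", "s'", "ing", "ed", "s"):
        # Only strip when at least three characters of the word remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, case-fold, drop stop words, then stem."""
    tokens = [t.lower() for t in tokenize(text)]
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [crude_stem(t) for t in tokens]
```

For example, `tokenize("Friends, Romans, Countrymen, Lend me your ears;")` yields the seven tokens shown in the tokenization example. Because this sketch removes stop words before stemming, "are" is dropped rather than mapped to "be", and "different" is left unchanged, so its output differs from what a full stemmer or lemmatizer would give.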

1.6 INDEXING IN VECTOR SPACE MODEL

1.6.1 VECTOR SPACE MODEL

The representation of a set of documents as vectors in a common vector space is known as the vector space model and is fundamental to a host of information retrieval operations, ranging from scoring documents on a query to document classification and document clustering. In this model, a document is viewed as a vector in n-dimensional document space (where n is the number of distinguishing terms used to describe the contents of the documents in a collection), and each term represents one dimension in the document space. A query is treated in the same way and is constructed from the terms and weights provided in the user request. Document retrieval is based on the measurement of the similarity between the query and the documents. This means that documents with a higher similarity to the query are judged to be more relevant to it and should be retrieved by the IR system in a higher position in the list of retrieved documents. In this method, the retrieved documents can be presented to the user in order of their relevance to the query.

TERM FREQUENCY: Assign to each term in a document a weight for that term that depends on the number of occurrences of the term in the document. We would like to compute a score between a query term t and a document d, based on the weight of t in d. The simplest approach is to assign the weight to be equal to the number of occurrences of term t in document d. This weighting scheme is referred to as term frequency and is denoted $\mathrm{tf}_{t,d}$, with the subscripts denoting the term and the document in order.

DOCUMENT FREQUENCY: The document frequency $\mathrm{df}_t$ is defined to be the number of documents in the collection that contain a term t.

INVERSE DOCUMENT FREQUENCY: Denoting the total number of documents in a collection by N, we define the inverse document frequency (idf) of a term t as follows:

$$\mathrm{idf}_t = \log \frac{N}{\mathrm{df}_t}$$

Thus the idf of a rare term is high, whereas the idf of a frequent term is likely to be low.

TF-IDF WEIGHTING: We now combine the definitions of term frequency and inverse document frequency, to produce a composite weight for each term in each document. The tf-idf weighting scheme assigns to term t a weight in document d given by

$$\text{tf-idf}_{t,d} = \mathrm{tf}_{t,d} \times \mathrm{idf}_t \qquad \text{(i)}$$

In other words, $\text{tf-idf}_{t,d}$ assigns to term t a weight in document d that is

1. highest when t occurs many times within a small number of documents (thus lending high discriminating power to those documents);

2. lower when the term occurs fewer times in a document, or occurs in many documents (thus offering a less pronounced relevance signal);

3. lowest when the term occurs in virtually all documents.
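The definitions of tf, df, idf and tf-idf above can be checked on a toy collection with the following Python sketch. The function name `tf_idf` and the choice of a base-10 logarithm are assumptions made for illustration; the text does not fix the base of the logarithm, and any base gives the same ranking of terms.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: tf-idf weight} dict per document."""
    n_docs = len(docs)
    # tf_{t,d}: raw number of occurrences of each term in each document
    tf = [Counter(doc) for doc in docs]
    # df_t: number of documents that contain term t (each document counted once)
    df = Counter(term for counts in tf for term in counts)
    # idf_t = log(N / df_t); base 10 is an assumption
    idf = {term: math.log10(n_docs / df_t) for term, df_t in df.items()}
    return [
        {term: count * idf[term] for term, count in counts.items()}
        for counts in tf
    ]
```

With the three toy documents `["car", "car", "insurance"]`, `["car", "auto"]` and `["car", "best", "auto"]`, the term "car" occurs in every document, so its idf and therefore its tf-idf weight is zero everywhere, while the rare term "insurance" receives the highest weight. This matches the three cases listed above.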

DOCUMENT VECTOR: At this point, we may view each document as a vector with one component corresponding to each term in the dictionary, together with a weight for each component that is given by (i). For dictionary terms that do not occur in a document, this weight is zero. This vector form will prove to be crucial to scoring and ranking. As a first step, we introduce the overlap score measure: the score of a document d is the sum, over all query terms, of the number of times each of the query terms occurs in d. We can refine this idea so that we add up not the number of occurrences of each query term t in d, but instead the tf-idf weight of each term in d.

$$\mathrm{Score}(q, d) = \sum_{t \in q} \text{tf-idf}_{t,d}$$
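As a small illustration of the refined overlap score, the sketch below sums the tf-idf weights of the query terms in one document. The function name and the example weights are hypothetical; terms that do not occur in the document contribute a weight of zero, as stated above.

```python
def overlap_score(query_terms, doc_weights):
    """Score(q, d): sum of the tf-idf weights in document d over the query terms.

    doc_weights maps each term of d to its tf-idf weight; absent terms weigh zero.
    """
    return sum(doc_weights.get(term, 0.0) for term in query_terms)


# Hypothetical tf-idf weights for a single document.
example_weights = {"car": 0.0, "insurance": 0.48, "auto": 0.18}
example_score = overlap_score(["auto", "insurance", "bike"], example_weights)
```

Here only "auto" and "insurance" contribute to `example_score`, because "bike" does not occur in the document.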

DOT PRODUCT: We denote by $\vec{V}(d)$ the vector derived from document d, with one component in the vector for each dictionary term. Cosine similarity: the standard way of quantifying the similarity between two documents $d_1$ and $d_2$ is to compute the cosine similarity of their vector representations $\vec{V}(d_1)$ and $\vec{V}(d_2)$:

$$\mathrm{sim}(d_1, d_2) = \frac{\vec{V}(d_1) \cdot \vec{V}(d_2)}{|\vec{V}(d_1)|\,|\vec{V}(d_2)|} \qquad \text{(ii)}$$

Fig 1.4 Cosine similarity illustrated

where the numerator represents the dot product (also known as the inner product) of the vectors $\vec{V}(d_1)$ and $\vec{V}(d_2)$, while the denominator is the product of their Euclidean lengths. The dot product $\vec{x} \cdot \vec{y}$ of two vectors is defined as $\sum_{i=1}^{M} x_i y_i$. Let $\vec{V}(d)$ denote the document vector for d, with M components $\vec{V}_1(d), \ldots, \vec{V}_M(d)$. The Euclidean length of d is defined to be

$$|\vec{V}(d)| = \sqrt{\sum_{i=1}^{M} \vec{V}_i^{\,2}(d)}$$

The effect of the denominator of Equation (ii) is thus to length-normalize the vectors $\vec{V}(d_1)$ and $\vec{V}(d_2)$ to unit vectors $\vec{v}(d_1) = \vec{V}(d_1)/|\vec{V}(d_1)|$ and $\vec{v}(d_2) = \vec{V}(d_2)/|\vec{V}(d_2)|$. We can then rewrite (ii) as

$$\mathrm{sim}(d_1, d_2) = \vec{v}(d_1) \cdot \vec{v}(d_2)$$

Similarly, the query is represented as the vector $\vec{V}(q)$, and the similarity between the query and a document vector is calculated as

$$\mathrm{Score}(q, d) = \frac{\vec{V}(q) \cdot \vec{V}(d)}{|\vec{V}(q)|\,|\vec{V}(d)|}$$
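The cosine score above can be sketched directly from its definition. In this illustration, vectors are plain Python lists of weights over a shared, fixed term order; the function names, the `rank` helper and the convention of returning 0.0 for a zero-length vector are assumptions, not part of the text.

```python
import math


def dot(x, y):
    """Dot (inner) product: sum of x_i * y_i over the M components."""
    return sum(a * b for a, b in zip(x, y))


def euclidean_length(x):
    """Euclidean length: square root of the sum of squared components."""
    return math.sqrt(dot(x, x))


def cosine_similarity(x, y):
    """Dot product divided by the product of the Euclidean lengths."""
    denom = euclidean_length(x) * euclidean_length(y)
    # Assumption: an all-zero vector is treated as having similarity 0.
    return dot(x, y) / denom if denom else 0.0


def rank(query_vec, doc_vecs):
    """Return document indices in decreasing order of Score(q, d)."""
    scores = [(cosine_similarity(query_vec, d), i) for i, d in enumerate(doc_vecs)]
    return [i for _, i in sorted(scores, key=lambda s: -s[0])]
```

Because of the length normalization, a document and a copy of it with every weight doubled receive the same score for any query, so long documents are not favoured merely for being long.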

1.7 GENETIC ALGORITHM

Genetic algorithms are search algorithms based on the mechanics of natural selection and natural genetics. They combine survival of the fittest among string structures with a structured yet randomized information exchange to form a search algorithm with some of the innovative flair of human search.

Genetic algorithms were developed by John Holland and his colleagues at the University of Michigan.



1.7.1 A SIMPLE GENETIC ALGORITHM

The mechanics of a simple genetic algorithm are surprisingly simple, involving nothing more complex than copying strings and swapping partial strings. The explanation of why this simple process works is more subtle and powerful.

A simple genetic algorithm that yields good results in many practical problems is composed of three operators:

1. Reproduction

2. Crossover

3. Mutation

REPRODUCTION is a process in which individual strings are copied according to their objective function values, f (biologists call this function the fitness function). Intuitively, we can think of the function f as some measure of profit, utility or goodness that we want to maximize. Copying strings according to their fitness values means that strings with a higher value have a higher probability of contributing one or more offspring in the next generation. This operator is an artificial version of natural selection, a Darwinian survival of the fittest among string creatures.

CROSSOVER may proceed in two steps:

1. Members of the newly reproduced strings in the mating pool are mated at random.

2. Each pair of strings undergoes crossing over as follows:

a. An integer position k along the string is selected uniformly at random between 1 and the string length less one, [1, l-1].

b. Two new strings are created by swapping all characters between positions k+1 and l inclusively.

MUTATION plays a decidedly secondary role in the operation of genetic algorithms. Mutation is needed because, even though reproduction and crossover effectively search and recombine extant notions, occasionally they may become overzealous and lose some potentially useful genetic material (1's or 0's at particular locations). In artificial genetic systems, the mutation operator protects against such an irrecoverable loss. In the simple genetic algorithm, mutation is the occasional (with small probability) random alteration of the value of a string position. In a binary coding, this simply means changing a 1 to a 0 and vice versa.
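The three operators can be sketched on binary strings as follows. This is an illustration under assumptions rather than a definitive implementation: the function names are hypothetical, reproduction is implemented as fitness-proportionate ("roulette wheel") selection, which is one common way of copying strings according to their fitness, and fitness values are assumed to be non-negative with at least one positive value.

```python
import random


def reproduce(population, fitness, rng):
    """Copy strings into a mating pool with probability proportional to fitness.

    Assumes non-negative fitness values, at least one of them positive.
    """
    weights = [fitness(s) for s in population]
    return rng.choices(population, weights=weights, k=len(population))


def crossover(a, b, rng):
    """One-point crossover at a site k chosen uniformly from [1, l-1]."""
    k = rng.randint(1, len(a) - 1)
    # Swap all characters from position k+1 to l (1-based), i.e. slice [k:].
    return a[:k] + b[k:], b[:k] + a[k:]


def mutate(s, p_mutation, rng):
    """Flip each bit independently with small probability p_mutation."""
    return "".join(
        ("1" if c == "0" else "0") if rng.random() < p_mutation else c
        for c in s
    )
```

For example, with `rng = random.Random(0)`, crossing `"11111"` with `"00000"` gives two complementary strings such as a run of 1's followed by 0's and its mirror, and `mutate(s, 0.0, rng)` returns `s` unchanged.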

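1.7.2 GENETIC ALGORITHM STEPS

Flowchart of the genetic algorithm:

1. Generate initial population

2. Evaluate each individual

3. Stopping criteria met? YES: STOP. NO: continue with step 4

4. Reproduction

5. Crossover

6. Mutation, then return to step 2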
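The flowchart can be read as the loop sketched below. All parameter values (population size, number of generations, crossover and mutation probabilities, seed) and the function name `run_ga` are assumptions chosen for illustration, and the stopping criterion is simply a fixed number of generations. The population size is assumed to be even so that strings can be paired.

```python
import random


def run_ga(fitness, length=8, pop_size=20, generations=30,
           p_crossover=0.6, p_mutation=0.01, seed=0):
    """Simple genetic algorithm on bit strings; returns the best string seen."""
    rng = random.Random(seed)
    # Generate initial population of random bit strings.
    population = ["".join(rng.choice("01") for _ in range(length))
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):  # stopping criterion: fixed generation count
        # Evaluate each individual; the small constant keeps all weights positive
        # (fitness is assumed to be non-negative).
        weights = [fitness(s) + 1e-9 for s in population]
        # Reproduction: fitness-proportionate selection into the mating pool.
        pool = rng.choices(population, weights=weights, k=pop_size)
        next_pop = []
        # Crossover: the pool is in random order, so adjacent strings are mated.
        for a, b in zip(pool[0::2], pool[1::2]):
            if rng.random() < p_crossover:
                k = rng.randint(1, length - 1)
                a, b = a[:k] + b[k:], b[:k] + a[k:]
            next_pop += [a, b]
        # Mutation: flip each bit with small probability.
        population = [
            "".join(("1" if c == "0" else "0") if rng.random() < p_mutation else c
                    for c in s)
            for s in next_pop
        ]
        best = max(population + [best], key=fitness)
    return best
```

As a usage example, `run_ga(lambda s: s.count("1"))` searches for a string with as many 1's as possible. Tracking `best` across generations is a convenience of this sketch; it does not feed the best string back into the population.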



1.8 PROBLEM STATEMENT

TO OPTIMIZE THE SEARCH RESULT: When the user enters a query to search for particular information, the information retrieval system or search engine retrieves search results that are both relevant and irrelevant to the user.

1.9 AIM AND OBJECTIVE

Here our basic aim is to retrieve relevant results and to reduce the number of irrelevant results retrieved. The results retrieved must also be presented in decreasing order of relevance.

OBJECTIVE

• Implement the vector space model.

• Optimize the query using the genetic algorithm.

• Retrieve the results using the optimized query.

1.10 ORGANIZATION OF THESIS

The rest of this thesis report is organized as follows:

We present the literature review in Chapter 2.

The research analysis, i.e. theoretical, computational, and analytical, is presented in Chapter 3.

The results and discussion of the thesis are presented in Chapter 4.

We present the conclusion of the whole thesis in Chapter 5.

Finally, we present the future scope of work in Chapter 6.

CHAPTER II




LITERATURE REVIEW

E-business may be defined as the conduct of industry, trade and commerce using computer networks. The term "e-business" was coined by IBM's marketing and Internet teams in 1996 [1]. Electronic business methods enable companies to link their internal and external data processing systems more efficiently and flexibly, to work more closely with suppliers and partners, and to better satisfy the needs and expectations of their customers. The internet is a public throughway. Firms use more private and hence more secure networks for more effective and efficient management of their internal functions. In practice, e-business is more than just e-commerce. While e-business refers to a more strategic focus with an emphasis on the functions that occur using electronic capabilities, e-commerce is a subset of an overall e-business strategy. E-commerce seeks to add revenue streams using the World Wide Web or the Internet to build and enhance relationships with clients and partners and to improve efficiency using the Empty Vessel strategy. Often, e-commerce involves the application of knowledge management systems.

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze [2], Introduction to Information Retrieval, Cambridge University Press, 2008. The first eight chapters of the book are devoted to the basics of information retrieval, and in particular the heart of search engines; we consider this material to be the core of information retrieval. Chapter 1 introduces inverted indexes, and shows how simple Boolean queries can be processed using such indexes. Chapter 2 builds on this introduction by detailing the manner in which documents are preprocessed before indexing and by discussing how inverted indexes are augmented in various ways for functionality and speed. Chapter 3 discusses search structures for dictionaries and how to process queries that have spelling errors and other imprecise matches to the vocabulary in the document collection being searched. Chapter 4 describes a number of algorithms for constructing the inverted index from a text collection, with particular attention to highly scalable and distributed algorithms that can be applied to very large collections. The desire to measure the extent to which a document matches a query, or the score of a document for a query, motivates the development of term weighting and the computation of scores in Chapters 6 and 7, leading to the idea of a list of documents that are rank-ordered for a query. Chapter 8 focuses on the evaluation of an information retrieval system based on the relevance of the documents it retrieves, allowing us to compare the relative performances of different systems on benchmark document collections and queries. Chapter 9 discusses methods by which retrieval can be enhanced through the use of techniques like relevance feedback and query expansion, which aim at increasing the likelihood of retrieving relevant documents. Chapter 15 introduces support vector machines, which many researchers currently view as the most effective text classification method. Connections are also developed in this chapter between the problem of classification and seemingly disparate topics such as the induction of scoring functions from a set of training examples. Chapter 19 gives a summary of the basic challenges in web search, together with a set of techniques that are pervasive in web information retrieval. Next, Chapter 20 describes the architecture and requirements of a basic web crawler. Finally, Chapter 21 considers the power of link analysis in web search, using in the process several methods from linear algebra and advanced probability theory.

Genetic Algorithms in Search, Optimization & Machine Learning [3], David E. Goldberg, with foreword by John Holland: This text introduces the theory, operation, and application of genetic algorithms, which are search algorithms based on the mechanics of natural selection and genetics.

H. Chen [4]: Information retrieval using probabilistic techniques has attracted significant attention on the part of researchers in information and computer science over the past few decades. In the 1980s, knowledge-based techniques also made an impressive contribution to "intelligent" information retrieval and indexing. More recently, information science researchers have turned to other newer artificial-intelligence based inductive learning techniques including neural networks, symbolic learning, and genetic algorithms. These newer techniques, which are grounded on diverse paradigms, have provided great opportunities for researchers to enhance the information processing and retrieval capabilities of current information storage and retrieval systems.

This article provides an overview of these newer techniques and their use in information science research. It covers three popular methods: the connectionist Hopfield network, the symbolic ID3/ID5, and evolution-based genetic algorithms, and discusses their knowledge representations and algorithms in the context of information retrieval. These techniques are robust in their ability to analyze user queries, identify users' information needs, and suggest alternatives for search. With proper user-system interactions, these methods can greatly complement the prevailing full-text, keyword-based, probabilistic, and knowledge-based techniques.

Ahmed A. A. Radwan, Bahgat A. Abdel Latef, Abdel Mgeid A. Ali, and Osman A. Sadek [5]: This study investigates the use of genetic algorithms in information retrieval. The method is shown to be applicable to three well-known document collections, where more relevant documents are presented to users in the genetic modification. The paper presents a new fitness function for approximate information retrieval which is faster and more flexible than the cosine similarity fitness function.

Eman Al Mashagba, Feras Al Mashagba and Mohammad Othman Nassar [6]: In information retrieval research, genetic algorithms (GA) can be used to find global solutions in many difficult problems. The study used different similarity measures (Dice, Inner Product) in the VSM. For each similarity measure, ten different GA approaches based on different fitness functions, different mutations and different crossover strategies were compared to find the best strategy and fitness function that can be used when the data collection is in the Arabic language. The results show that the GA approach which uses a one-point crossover operator, point mutation and Inner Product similarity as a fitness function is the best IR system in the VSM.

Cristina López-Pujalte, Vicente P. Guerrero-Bote and Félix de Moya-Anegón [7]: Recently there have been new applications of genetic algorithms to information retrieval, most of them specifically to relevance feedback. The evolution of the possible solutions is guided by fitness functions that are designed as measures of the goodness of the solutions. These functions are naturally the key to achieving a reasonable improvement, and which function is chosen most distinguishes one experiment from another. In previous work, the authors found that, among the functions implemented in the literature, the ones that yield the best results are those that take into account not only which documents are retrieved, but also the order in which they are retrieved. Here, therefore, they evaluate the efficacy of a genetic algorithm with various order-based fitness functions for relevance feedback (some of them of their own design), and compare the results with the Ide dec-hi method, one of the best traditional methods.

A. S. Siva Sathya and B. Philomina Simon [8]: As information has been increasing enormously in the world, it is difficult to retrieve the proper information to the user's satisfaction. In this work, a document crawler is used for gathering and extracting information from the documents available from online databases and other databases. Since the search space is too large, a genetic algorithm (GA) is used to find the combination terms. In the proposed document retrieval system, the keywords are extracted from the document crawler, and with these keywords the GA generates combination terms. The proposed work has three main features. The first is to extract keywords and other information from the database by a document crawler. The second is to generate the combination terms using the genetic algorithm. Third, the results generated from the GA are applied to the information retrieval system to generate better results. From the results obtained, the relevance of the documents is verified using the evaluation measures precision and recall.

D<! L8,7, Abdelmgeid A. Aly [9]: This paper presents an adaptive method using a genetic algorithm to modify users' queries, based on relevance judgments. The algorithm was adapted for three well-known document collections (CISI, NLP and CACM). The method is shown to be applicable to large text collections, where more relevant documents are presented to users in the genetic modification. The algorithm shows the effects of applying GA to improve the effectiveness of queries in IR systems. Further studies are planned to adjust the system parameters to improve its effectiveness. The goal is to retrieve the most relevant documents with fewer non-relevant documents with respect to the user's query in an information retrieval system using a genetic algorithm.

Philomina Simon [10]: Retrieval of relevant documents from a large document collection is a challenging task. Document retrieval is concerned with indexing and retrieving documents provided in a document collection. Documents are represented by document descriptors, which are defined as terms or keywords extracted from the textual documents. Formulating an optimal query with a set of document descriptors involves searching a huge search space for the better permutation and combination of terms. As the genetic algorithm is well suited for searching huge search spaces, this paper proposes a two-stage method for an efficient information retrieval system using a genetic algorithm. The genetic algorithm generates the best combination terms from a set of the document descriptors.

Abdelmgeid A. Aly [11]: Similar to the genetic algorithm, an evolution strategy is a process of continuous reproduction, trial and selection. Each new generation is an improvement on the one that went before. This paper presents two different proposals based on the vector space model (VSM) as a traditional model in information retrieval (TIR). The first uses an evolution strategy (ES). The second uses the document centroid (DC) in a query expansion technique. The results are then compared; it was noticed that the ES technique is more efficient than the other methods.

Andrew Trotman [12]: Current approaches to information retrieval rely on the creativity of individuals to develop new algorithms. This investigation uses genetic algorithms (GA) and genetic programming (GP) to learn IR algorithms. Document structure weighting is a technique whereby different parts of a document (title, abstract, etc.) contribute unevenly to the overall document weight during ranking. Near optimal weights can be learned with a GA. Doing so shows a statistically significant 5% relative improvement in MAP for vector space inner product and Croft's probabilistic ranking, but no improvement for BM25. Two applications of this approach are suggested: offline learning, and relevance feedback. In a second set of experiments, a new ranking function was learned using GP. This new function yields a statistically significant 11% relative improvement on unseen queries tested on the training documents. Portability tests to different collections (not used in training) demonstrate that the performance of the new function exceeds vector space and probability, and slightly exceeds BM25. Learning weights for this new function is proposed. The application of genetic learning to stemming and thesaurus construction is discussed. Stemming rules such as those of the Porter algorithm are candidates for GP learning, whereas synonym sets are candidates for GA learning.



D. Vrajitoru [13]: Genetic algorithms (GAs) search for good solutions to a problem by operations inspired by the natural selection of living beings. Among their many uses, we can count information retrieval (IR). In this field, the aim of the GA is to help an IR system to find, in a huge document text collection, a good reply to a query expressed by the user. The analysis of phenomena seen during the implementation of a GA for IR has led to a new crossover operation. This article introduces this new operation and compares it with other learning methods.

The goal of this article is to introduce a new crossover operator for the GA used in IR. The analysis presented in the third section shows the origin of the new operator, and the results, compared to the classical GA, indicate that the crossover operator can be improved. A comparison between this application of the GA and the method of relevance feedback shows that, even if the GA is less efficient than more direct methods, it still has its advantages and will probably continue to be studied in the future.

Ilmério R. Silva, João Nunes Souza and Karina S. Santos [14]: The vector space model is a mathematical model that represents terms, documents and queries by vectors and provides a ranking. In this model, the subspace of interest is formed by a set of pairwise orthogonal term vectors, indicating that terms are mutually independent. However, this is a simplification that does not correspond to reality. Based on this scenario, this work proposes an extension to the vector space model that takes into account the correlation between terms. In the proposed model, term vectors are rotated in space, geometrically reflecting the dependence semantics among terms. Terms are rotated based on a data mining technique called association rules. The retrieval effectiveness of the proposed model is evaluated, and the results show that the model improves average precision, relative to the standard vector space model, for all collections evaluated, leading to a gain of up to 31%.

Hai-yan Kang et al. [14]: This paper puts forward an information retrieval algorithm that combines the strengths of the Genetic Algorithm and the Vector Space Model on the basis of natural language. The Genetic Algorithm is used for a predication case-frame of the query in this system. Based on HowNet, the algorithm obtains the inherent characteristics of data objects and retrieves the useful information according to those characteristics, so the system implements information retrieval from hierarchical knowledge of concepts. The paper also introduces the key technology of information retrieval. Following the research on the algorithm model, the authors design an IR system for the financial domain. It is particularly effective in question-answering systems, and it can be extended to other domains. The algorithm considerably enhances the degree of intelligence of information retrieval.

Shaozi Li, Changfe Zhou and Huowang Chen [15]: A mass of distributed and dynamic information on the Web has resulted in "information overload". With this flood of information, searching the Web on the basis of traditional information retrieval technology has become an important research issue. However, the variety of systems and the ambiguous terminology of information retrieval on the Web cause much trouble to users in application and to researchers in development. This paper proposes a uniform interface for Web document retrieval, using a model based on multiple agents. Each document in the document base or from the Web is represented as a vector in the vector space of classable sememes. The query from the user is also represented as a vector. The relevance between them can be measured using the cosine of the angle between the query and its nearest neighbors in the vector space. Experiments have been carried out, and their results show that this scheme yields good results.

Amir Karshenas and Kamil Dimililer [16]: This work designed an information retrieval system, "PIRS" (Precision Information Retrieval System), based on a modified vector space model that introduces a new query weighting formula and similarity function. These modifications of the classical vector space model aimed to improve the average precision level of the system. Two well-known IR parameters, precision and recall, were used to measure the performance of the system.

Louis S. Wang [17]: The Vector Space Model is one of the most common information retrieval (IR) methods for text document search. The cosine of the angle or the Euclidean distance between the query vector and each document vector is commonly used to measure similarity for query matching. Even though the vector space model starts with a term-by-document matrix, it inevitably loses the information about relations between query terms in the document. This paper presents a modified vector space model for measuring similarity between the query and the document when responding to a multi-term query. More weight is assigned to the keywords based on the adjacency between the terms in the documents. Thus, when a document contains adjacent query terms, its vector will typically move closer to the query vector, showing stronger relevancy between the query and the document.



CHAPTER III

RESEARCH METHODOLOGY

3.1 QUERY

Our goal is to develop a system to address the ad hoc retrieval task. This is the most standard IR task. In it, a system aims to provide documents from within the collection that are relevant to an arbitrary user information need, communicated to the system by means of a one-off, user-initiated query. An information need is the topic about which the user desires to know more, and is differentiated from a query, which is what the user conveys to the computer in an attempt to communicate the information need. A document is relevant if it is one that the user perceives as containing information of value with respect to their personal information need. For example, a user interested in a topic like "pipeline leaks" would like to find relevant documents regardless of whether they use precisely those words or express the concept with other words such as "pipeline rupture". To assess the effectiveness of an IR system (i.e., the quality of its search results), a user will usually want to know two key statistics about the system's returned results for a query:

Precision: What fraction of the returned results are relevant to the information need?

Recall: What fraction of the relevant documents in the collection were returned by the system?

3.2 QUERY OPTIMIZATION

Query optimization means optimizing the query so that it retrieves more relevant results and fewer irrelevant documents.

3.3 QUERY OPTIMIZATION SEARCH SYSTEM

3.3.1 THE VECTOR SPACE MODEL

• Documents and queries are both represented as vectors:

$$ d_i = (w_{i,1}, w_{i,2}, \ldots, w_{i,t}) $$

• Each $w_{i,j}$ is a weight for term j in document i.

• "Bag-of-words" representation.

• The similarity of a document vector to a query vector is the cosine of the angle θ between them.

Fig. 3.1 Angle between Document and Query

Cosine Similarity Measure

Since $x \cdot y = |x|\,|y| \cos\theta$,

$$ \mathrm{Sim}(d_i, q) = \cos\theta = \frac{d_i \cdot q}{|d_i|\,|q|} = \frac{\sum_j w_{i,j}\, w_{q,j}}{\sqrt{\sum_j w_{i,j}^2}\ \sqrt{\sum_j w_{q,j}^2}} $$

• Cosine is a normalized dot product.

• Documents are ranked by decreasing cosine value:

o Sim(d, q) = 1 when d = q

o Sim(d, q) = 0 when d and q share no terms
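The cosine measure above translates almost line by line into code. The following is a minimal sketch, assuming that documents and queries are already available as arrays of term weights of equal length; the class and method names are illustrative and do not come from the system described here.

```java
// Illustrative sketch: cosine similarity between two term-weight vectors.
// The class and method names are hypothetical, not taken from the thesis code.
final class CosineSimilarity {

    /** Returns cos(theta) for vectors d and q; 0 if either vector has zero length. */
    static double cosine(double[] d, double[] q) {
        if (d.length != q.length) {
            throw new IllegalArgumentException("vectors must have the same number of terms");
        }
        double dot = 0.0, normD = 0.0, normQ = 0.0;
        for (int j = 0; j < d.length; j++) {
            dot += d[j] * q[j];      // numerator: sum of w_ij * w_qj
            normD += d[j] * d[j];    // squared length of d
            normQ += q[j] * q[j];    // squared length of q
        }
        if (normD == 0.0 || normQ == 0.0) {
            return 0.0;              // assumption: an empty vector matches nothing
        }
        return dot / (Math.sqrt(normD) * Math.sqrt(normQ));
    }

    public static void main(String[] args) {
        double[] doc = {0.5, 0.0, 0.5};
        double[] query = {1.0, 0.0, 0.0};
        System.out.println(cosine(doc, query));
    }
}
```

Returning 0 for an all-zero vector is a choice made for this sketch so that a document or query without indexed terms never ranks above one that shares terms with the query.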

3.3.2 BUILDING IR SYSTEM

The proposed system is based on the Vector Space Model (VSM), in which both documents and queries are represented as vectors. First, to determine the document terms, we used the following procedure:

• Extraction of all the words from each document.

• Elimination of the stop-words using a stop-word list.

• Stemming of the remaining words using the Porter stemmer, which is the most commonly used stemmer for English.

This procedure yields the final set of terms that describes all documents of the collection. We then assigned the weights using the following formula, proposed by Salton and Buckley:

$$ a_{ij} = \frac{\left(0.5 + 0.5\,\dfrac{tf_{ij}}{\max tf}\right) \times \log\dfrac{N}{n_j}}{\sqrt{\sum_{k}\left(0.5 + 0.5\,\dfrac{tf_{ik}}{\max tf}\right)^{2} \times \left(\log\dfrac{N}{n_k}\right)^{2}}} \qquad (3.1) $$

where $a_{ij}$ is the weight assigned to the term $t_j$ in document $D_i$, $tf_{ij}$ is the number of times that term $t_j$ appears in document $D_i$, $\max tf$ is the largest term frequency in $D_i$, $n_j$ is the number of documents indexed by the term $t_j$ and, finally, N is the total number of documents in the database.

Finally, we normalize the vectors, dividing them by their Euclidean norm. This follows the study of Noreault et al., in which the best similarity measures were those that compare the angles between vectors.
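A short sketch may make the weighting step concrete. It assumes a dense term-frequency matrix with at least one document and, as a simplification that the text does not state, gives weight 0 to terms that do not occur in a document. The natural logarithm is used; the base cancels out once each vector is divided by its norm. All names are illustrative.

```java
// Illustrative sketch of the normalized term weights described above.
// Assumption (not stated in the text): a term that does not occur in a
// document gets weight 0. The class name is hypothetical.
final class TermWeights {

    /** tf[i][j] = occurrences of term j in document i; returns unit-length rows of weights. */
    static double[][] weights(int[][] tf) {
        int nDocs = tf.length;
        int nTerms = tf[0].length;

        // n_j: number of documents that contain term j
        int[] docFreq = new int[nTerms];
        for (int[] row : tf) {
            for (int j = 0; j < nTerms; j++) {
                if (row[j] > 0) docFreq[j]++;
            }
        }

        double[][] a = new double[nDocs][nTerms];
        for (int i = 0; i < nDocs; i++) {
            int maxTf = 0;
            for (int f : tf[i]) maxTf = Math.max(maxTf, f);

            double sumSquares = 0.0;
            for (int j = 0; j < nTerms; j++) {
                if (tf[i][j] == 0) continue;   // absent term keeps weight 0
                double augmentedTf = 0.5 + 0.5 * tf[i][j] / maxTf;
                double idf = Math.log((double) nDocs / docFreq[j]);
                a[i][j] = augmentedTf * idf;
                sumSquares += a[i][j] * a[i][j];
            }
            double norm = Math.sqrt(sumSquares);
            for (int j = 0; j < nTerms && norm > 0.0; j++) {
                a[i][j] /= norm;               // divide by the Euclidean norm
            }
        }
        return a;
    }
}
```

For the matrix {{2, 1, 0}, {1, 0, 1}, {0, 1, 1}} every term occurs in two of the three documents, so the IDF factor is the same for all terms and the first row reduces to the weights 0.8, 0.6 and 0.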

We carry out a similar procedure with the collection of queries, thereby obtaining the normalized query vectors. Then, we apply the following steps:

• For each collection, each query is compared with all the documents using the cosine similarity measure. This yields a list giving the similarity of each query to every document of the collection.

• This list is ranked in decreasing order of similarity.

• The training data is formed from the top 5 documents of the list together with the corresponding query.

• The keywords (terms) are retrieved automatically from the training data and are used to form a binary query vector.

• The query vector is adapted using the genetic approach.

3.3.3 THE GENETIC APPROACH

Once significant keywords have been extracted from the training data (relevant and irrelevant documents), weights are assigned to the keywords. The binary weights of the keywords form a query vector. We applied the GA with two fitness functions to obtain an optimal or near-optimal query vector, and we compared the results of the two GA approaches with a classical IR system that does not use a GA. This is explained in the following subsections.

I. Representation of the chromosomes

The chromosomes use a binary representation, which is converted to a real-valued representation by a random function. Each chromosome has as many genes (components) as there are terms with non-zero weights in the query and the feedback documents: the set of terms contained in these documents and the query is computed, and the size of the chromosomes is equal to the number of terms in that set. The query vector is obtained as a binary representation, and the random function is applied to turn the term weights into real values. Our GA approach receives an initial population of chromosomes corresponding to the top 15 documents retrieved by classical IR for that query.

II. Fitness function

The fitness function is a performance measure or reward function that evaluates how good each solution is. In our work, we used two GAs with two different fitness functions: (a) the first GA system (GA1) uses the cosine similarity between the query vector and the chromosomes of the population as its fitness function, given by

$$ \frac{\sum_{i=1}^{t} x_i \cdot y_i}{\sqrt{\sum_{i=1}^{t} x_i^{2} \cdot \sum_{i=1}^{t} y_i^{2}}} \qquad (3.2) $$

where $x_i$ is the real-valued weight of term i in the chromosome, $y_i$ is the real-valued weight of that term in the query vector, and t is the total number of terms, which is the same for the query vector and for any given chromosome. The value of the cosine similarity lies in the interval [0, 1], according to the similarity between a chromosome and the query.

III. Selection

As the selection mechanism, the GA uses "simple random sampling". This consists of constructing a roulette wheel with the same number of slots as there are individuals in the population, in which the size of each slot is directly related to the individual's fitness value. Hence, the best chromosomes will on average receive more copies, and the worst fewer copies. We have also used the "elitism" strategy as a complement to the selection mechanism. After the new population has been generated, if the best chromosome of the preceding generation is by chance absent, the worst individual of the new population is withdrawn and replaced by that chromosome.
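The roulette wheel and the elitism step can be sketched as follows. This is only an illustration under simplifying assumptions: chromosomes are binary int arrays, fitness values are non-negative, and a uniform pick is used when every fitness value is zero. The names are hypothetical.

```java
import java.util.Arrays;
import java.util.Random;

// Illustrative sketch of roulette-wheel selection with elitism.
// Chromosomes are binary int arrays here; all names are hypothetical.
final class RouletteSelection {

    /** Spins the wheel once: index i is chosen with probability fitness[i] / sum(fitness). */
    static int spin(double[] fitness, Random rng) {
        double total = 0.0;
        for (double f : fitness) total += f;
        if (total <= 0.0) {
            return rng.nextInt(fitness.length);   // assumption: uniform pick when all fitness is 0
        }
        double ball = rng.nextDouble() * total;
        double cumulative = 0.0;
        for (int i = 0; i < fitness.length; i++) {
            cumulative += fitness[i];             // slot i ends at this cumulative value
            if (ball < cumulative) return i;
        }
        return fitness.length - 1;                // guard against floating-point rounding
    }

    /** Elitism: if the previous best is missing from the new population, it replaces the worst. */
    static void keepElite(int[][] newPop, double[] newFitness, int[] best, double bestFitness) {
        int worst = 0;
        for (int i = 0; i < newPop.length; i++) {
            if (Arrays.equals(newPop[i], best)) return;   // best chromosome survived
            if (newFitness[i] < newFitness[worst]) worst = i;
        }
        newPop[worst] = best.clone();
        newFitness[worst] = bestFitness;
    }
}
```

Calling spin once per slot of the new population reproduces the behaviour described above: fitter chromosomes own wider slots and are therefore copied more often on average.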

IV. Operators

In our GA approaches, we use two GA operators to produce offspring chromosomes:

• Crossover is the genetic operator that mixes two chromosomes together to form new offspring. Crossover occurs only with crossover probability Pc. Chromosomes that are not subjected to crossover remain unmodified. The intuition behind crossover is the exploration of new solutions and the exploitation of old solutions. GAs construct a better solution by mixing the good characteristics of chromosomes. A chromosome with higher fitness has a greater chance of being selected than one with lower fitness, so good solutions survive to the next generation. We use single-point crossover, which exchanges the weights of a sub-vector between the two chromosomes chosen for this process.

• Mutation is the second operator used in our GA systems. Mutation modifies the gene values of a solution with some probability Pm. Changing some bit values of the chromosomes produces different breeds. A mutated chromosome may be better or poorer than the old chromosome. If it is poorer, it is eliminated in the selection step. The objective of mutation is to restore lost information and to explore the variety of the data.
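Both operators are easy to express on binary chromosomes. The sketch below is one possible reading of the description, assuming int arrays of 0 and 1, a cut point chosen uniformly inside the chromosome, and independent bit flips; the probabilities pc and pm are passed in by the caller, and all names are illustrative.

```java
import java.util.Random;

// Illustrative sketch of single-point crossover and bit-flip mutation on
// binary chromosomes. All names are hypothetical.
final class GeneticOperators {

    /** Swaps the genes of a and b from position cut to the end (single-point crossover). */
    static void crossoverAt(int[] a, int[] b, int cut) {
        for (int j = cut; j < a.length; j++) {
            int tmp = a[j];
            a[j] = b[j];
            b[j] = tmp;
        }
    }

    /** With probability pc, crosses a and b at a random interior point; otherwise leaves both unchanged. */
    static boolean crossover(int[] a, int[] b, double pc, Random rng) {
        if (a.length < 2 || rng.nextDouble() >= pc) return false;
        int cut = 1 + rng.nextInt(a.length - 1);   // cut lies in 1..length-1
        crossoverAt(a, b, cut);
        return true;
    }

    /** Flips each bit independently with probability pm. */
    static void mutate(int[] chromosome, double pm, Random rng) {
        for (int j = 0; j < chromosome.length; j++) {
            if (rng.nextDouble() < pm) chromosome[j] = 1 - chromosome[j];
        }
    }
}
```

With a = 1111 and b = 0000, crossoverAt(a, b, 2) leaves a = 1100 and b = 0011, so each child keeps the head of one parent and the tail of the other.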

Fig. 3.2 Proposed Architecture of Query Optimization Search System (blocks: Web Crawler, Document Database, Information Retrieval System, Keywords collected, Initial population, Fitness function, Apply genetic operators with single-point crossover, Genetic best combination terms)

3.3.4 GENETIC ALGORITHM STEPS

Fig. 3.3 Steps of Genetic Algorithm

3.4 EVALUATION OF QUERY OPTIMIZATION SEARCH SYSTEM

There are several ways to measure the quality of the Query Optimization Search System (QOSS), such as the system efficiency and effectiveness, and several subjective aspects related to user satisfaction. Traditionally, the retrieval effectiveness (usually based on the document relevance with respect to the user's needs) is the most considered. There are different criteria to measure this aspect, with precision and recall being the most used.

Precision (P) is the ratio between the number of relevant documents retrieved by the IR system in response to a query and the total number of documents retrieved, whilst Recall (R) is the ratio between the number of relevant documents retrieved and the total number of documents relevant to the query that exist in the database. The mathematical expression of each of them is as follows:

$$ P = \frac{\text{number of documents retrieved and relevant}}{\text{total retrieved}} = \frac{\sum_d r_d \cdot f_d}{\sum_d f_d} \qquad (3.3) $$

$$ R = \frac{\text{number of documents retrieved and relevant}}{\text{total relevant in collection}} = \frac{\sum_d r_d \cdot f_d}{\sum_d r_d} \qquad (3.4) $$

with $r_d \in \{0, 1\}$ being the relevance of document d for the user and $f_d \in \{0, 1\}$ being the retrieval of document d in the processing of the current query. Notice that both measures are defined in [0, 1], with 1 being the optimal value.
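The indicator form of the two measures maps directly onto two parallel arrays. The following sketch assumes that relevant[d] holds r_d and retrieved[d] holds f_d for every document d of the collection, and it returns 0 when a denominator is empty; the class name is illustrative.

```java
// Illustrative sketch of precision and recall from the 0/1 indicators r_d and f_d.
// The class name is hypothetical.
final class PrecisionRecall {

    /** Precision: retrieved-and-relevant documents divided by all retrieved documents. */
    static double precision(int[] relevant, int[] retrieved) {
        int hits = 0, nRetrieved = 0;
        for (int d = 0; d < relevant.length; d++) {
            hits += relevant[d] * retrieved[d];   // r_d * f_d is 1 only for a relevant hit
            nRetrieved += retrieved[d];
        }
        return nRetrieved == 0 ? 0.0 : (double) hits / nRetrieved;
    }

    /** Recall: retrieved-and-relevant documents divided by all relevant documents. */
    static double recall(int[] relevant, int[] retrieved) {
        int hits = 0, nRelevant = 0;
        for (int d = 0; d < relevant.length; d++) {
            hits += relevant[d] * retrieved[d];
            nRelevant += relevant[d];
        }
        return nRelevant == 0 ? 0.0 : (double) hits / nRelevant;
    }
}
```

For a five-document collection with relevant = {1, 1, 1, 1, 0} and retrieved = {1, 0, 1, 0, 1}, two of the three retrieved documents are relevant and two of the four relevant documents are retrieved, giving P = 2/3 and R = 1/2.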

The evaluation function used here is the non-interpolated average precision, which is similar to average precision but with the cut-off points equivalent to the training documents. In this measure, the documents are simply ranked. Let $d_1, d_2, \ldots, d_{|D|}$ denote the documents sorted in decreasing order of the value of the similarity measure, where |D| is the number of training documents. The function r(d) gives the relevance of a document d: it returns 1 if d is relevant, and 0 otherwise. The non-interpolated average precision is defined as follows:

$$ \mathrm{AvgP} = \frac{1}{|D|} \sum_{i=1}^{|D|} \left( r(d_i) \cdot \sum_{j=i}^{|D|} \frac{1}{j} \right) \qquad (3.5) $$

Because the inner sum starts at the rank i of the document, a relevant document contributes more the higher it is ranked.
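The measure can be computed in a single backward pass over the ranked list. The sketch below assumes that the inner sum runs from the document's own rank i up to |D|, which is what makes the score depend on the ranking; the input is the relevance of each document in ranked order. The class name is illustrative.

```java
// Illustrative sketch of the rank-sensitive average precision measure,
// assuming the inner sum runs from the document's own rank i to |D|.
// The class name is hypothetical.
final class OrderBasedFitness {

    /** relevant[k] is r(d_{k+1}) for documents sorted by decreasing similarity. */
    static double score(boolean[] relevant) {
        int n = relevant.length;
        if (n == 0) return 0.0;
        double tail = 0.0;    // running value of 1/(k+1) + ... + 1/n
        double total = 0.0;
        for (int k = n - 1; k >= 0; k--) {
            tail += 1.0 / (k + 1);
            if (relevant[k]) total += tail;   // relevant document at rank k+1
        }
        return total / n;
    }
}
```

With two documents, a relevant document ranked first scores 0.75, whereas the same document ranked second scores 0.25. A list in which every document is relevant scores 1.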

CHAPTER IV

RESULTS AND DISCUSSION

4.1 QUERY OPTIMIZATION SEARCH SYSTEM EXAMPLE

4.1.1 DATABASE

The database contains these documents:

D1.txt: Shipment of gold damaged in a fire

D2.txt: Delivery of silver arrived in silver truck

D3.txt: Shipment of gold arrived in a truck

D4.txt: Shipment of coal arrived in a truck

Fig. 4.1 Documents

The documents are preprocessed, and an index is built in the vector space model.



Terms      Q   D1  D2  D3  D4  df  D/df        IDF     Wq      Wd1     Wd2     Wd3     Wd4
Arrived    0   0   1   1   1   3   4/3 = 1.33  0.1238  0       0       0.1238  0.1238  0.1238
Coal       0   0   0   0   1   1   4/1 = 4     0.6020  0       0       0       0       0.6020
Damaged    0   1   0   0   0   1   4/1 = 4     0.6020  0       0.6020  0       0       0
Delivery   0   0   1   0   0   1   4/1 = 4     0.6020  0       0       0.6020  0       0
Fire       0   1   0   0   0   1   4/1 = 4     0.6020  0       0.6020  0       0       0
Gold       1   1   0   1   0   2   4/2 = 2     0.3010  0.3010  0.3010  0       0.3010  0
Shipment   0   1   0   1   1   3   4/3 = 1.33  0.1238  0       0.1238  0       0.1238  0.1238
Silver     1   0   1   0   0   1   4/1 = 4     0.6020  0.6020  0       0.6020  0       0
Truck      1   0   1   1   1   3   4/3 = 1.33  0.1238  0.1238  0       0.1238  0.1238  0.1238

Table 4.1 Vector Space Index

A document vector (Doc) with n keywords and a query vector with m query terms can be represented as

$$ \mathrm{Doc} = (term_1, term_2, term_3, \ldots, term_n) $$

$$ \mathrm{Query} = (qterm_1, qterm_2, qterm_3, \ldots, qterm_m) $$

We use binary term vectors, so each $term_i$ (or $qterm_j$) is either 0 or 1. $term_i$ is set to zero when the term is not present in the document and to one when the term is present in the document.

For example, a user enters a query into our system that retrieves 4 documents. These documents are

D1 = {shipment, gold, damaged, fire}

D2 = {delivery, silver, arrived, truck}

D3 = {shipment, gold, arrived, truck}

D4 = {shipment, coal, arrived, truck}

All keywords of these documents can be arranged in ascending order as

arrived, coal, damaged, delivery, fire, gold, shipment, silver, truck

and encoded in the chromosome representation as

D1 = 0 0 1 0 1 1 1 0 0

D2 = 1 0 0 1 0 0 0 1 1

D3 = 1 0 0 0 0 1 1 0 1

D4 = 1 1 0 0 0 0 1 0 1

Q = 0 0 0 0 0 1 0 1 1

These chromosomes form the initial population that is fed into the genetic operator process. The length of a chromosome depends on the number of keywords in the documents retrieved by the user query. In our example the length of each chromosome is 9 bits.
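The encoding step of this example can be reproduced with a few lines of code. The sketch below assumes that each document and the query are given as sets of lower-case keywords; the sorted vocabulary is built from the documents, and each set is then written as one bit per vocabulary term. The class and method names are illustrative.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Illustrative sketch: encoding keyword sets as binary chromosomes over a
// sorted vocabulary. All names are hypothetical.
final class ChromosomeEncoder {

    /** Collects the keywords of all documents in ascending (alphabetical) order. */
    static List<String> vocabulary(List<Set<String>> documents) {
        TreeSet<String> sorted = new TreeSet<>();
        for (Set<String> doc : documents) sorted.addAll(doc);
        return new ArrayList<>(sorted);
    }

    /** Writes one bit per vocabulary term: 1 if the term is in the set, 0 otherwise. */
    static String encode(Set<String> terms, List<String> vocabulary) {
        StringBuilder bits = new StringBuilder();
        for (String term : vocabulary) {
            bits.append(terms.contains(term) ? '1' : '0');
        }
        return bits.toString();
    }

    public static void main(String[] args) {
        Set<String> d1 = new TreeSet<>(Arrays.asList("shipment", "gold", "damaged", "fire"));
        Set<String> d2 = new TreeSet<>(Arrays.asList("delivery", "silver", "arrived", "truck"));
        Set<String> d3 = new TreeSet<>(Arrays.asList("shipment", "gold", "arrived", "truck"));
        Set<String> d4 = new TreeSet<>(Arrays.asList("shipment", "coal", "arrived", "truck"));
        Set<String> q = new TreeSet<>(Arrays.asList("gold", "silver", "truck"));

        List<String> vocab = vocabulary(Arrays.asList(d1, d2, d3, d4));
        System.out.println(vocab);
        System.out.println("D1 = " + encode(d1, vocab));
        System.out.println("Q  = " + encode(q, vocab));
    }
}
```

A TreeSet is used so that the vocabulary comes out in ascending order without an explicit sort, which keeps the bit positions stable across documents and the query.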

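The length of each vector is the square root of the sum of its squared weights:

$$ |D_i| = \sqrt{\sum_j w_{i,j}^{2}} \qquad (4.1) $$

$$ |D_1| = \sqrt{0.6020^2 + 0.6020^2 + 0.3010^2 + 0.1238^2} = \sqrt{0.8307} = 0.9114 $$

$$ |D_2| = \sqrt{0.1238^2 + 0.6020^2 + 0.6020^2 + 0.1238^2} = \sqrt{0.7555} = 0.8692 $$

$$ |D_3| = \sqrt{0.1238^2 + 0.3010^2 + 0.1238^2 + 0.1238^2} = \sqrt{0.1366} = 0.3696 $$

$$ |D_4| = \sqrt{0.1238^2 + 0.6020^2 + 0.1238^2 + 0.1238^2} = \sqrt{0.4084} = 0.6390 $$

$$ |Q| = \sqrt{\sum_j w_{Q,j}^{2}} \qquad (4.2) $$

$$ |Q| = \sqrt{0.3010^2 + 0.6020^2 + 0.1238^2} = \sqrt{0.4683} = 0.6843 $$

Compute all dot products (zero products ignored):

$$ Q \cdot D_i = \sum_j w_{Q,j}\, w_{i,j} \qquad (4.3) $$

$$ Q \cdot D_1 = 0.3010 \times 0.3010 = 0.0906 $$

$$ Q \cdot D_2 = 0.6020 \times 0.6020 + 0.1238 \times 0.1238 = 0.3777 $$

$$ Q \cdot D_3 = 0.3010 \times 0.3010 + 0.1238 \times 0.1238 = 0.1059 $$

$$ Q \cdot D_4 = 0.1238 \times 0.1238 = 0.0153 $$

Calculate the similarity values:

$$ \cos\theta_{D_1} = \frac{Q \cdot D_1}{|Q|\,|D_1|} = \frac{0.0906}{0.6843 \times 0.9114} = 0.1453 $$

$$ \cos\theta_{D_2} = \frac{Q \cdot D_2}{|Q|\,|D_2|} = \frac{0.3777}{0.6843 \times 0.8692} = 0.6350 $$

$$ \cos\theta_{D_3} = \frac{Q \cdot D_3}{|Q|\,|D_3|} = \frac{0.1059}{0.6843 \times 0.3696} = 0.4187 $$

$$ \cos\theta_{D_4} = \frac{Q \cdot D_4}{|Q|\,|D_4|} = \frac{0.0153}{0.6843 \times 0.6390} = 0.0350 $$

In general,

$$ \cos\theta_{D_i} = \mathrm{Sim}(Q, D_i) = \frac{\sum_j w_{Q,j}\, w_{i,j}}{\sqrt{\sum_j w_{Q,j}^{2}}\ \sqrt{\sum_j w_{i,j}^{2}}} $$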
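The arithmetic of this example can be checked mechanically. The sketch below hard-codes the IDF column of Table 4.1 and the binary term vectors of Q and D1 to D4, then evaluates the similarity formula for each document. It is an illustration only; the class and method names are hypothetical.

```java
// Illustrative sketch: recomputing the similarity values of the worked example
// from the IDF column of Table 4.1. All names are hypothetical.
final class WorkedExample {

    // IDF values in the term order: arrived, coal, damaged, delivery, fire,
    // gold, shipment, silver, truck.
    static final double[] IDF = {0.1238, 0.6020, 0.6020, 0.6020, 0.6020, 0.3010, 0.1238, 0.6020, 0.1238};

    /** Turns a 9-bit term vector such as "001011100" into a vector of IDF weights. */
    static double[] weights(String bits) {
        double[] w = new double[IDF.length];
        for (int j = 0; j < IDF.length; j++) {
            w[j] = bits.charAt(j) == '1' ? IDF[j] : 0.0;
        }
        return w;
    }

    static double dot(double[] x, double[] y) {
        double sum = 0.0;
        for (int j = 0; j < x.length; j++) sum += x[j] * y[j];
        return sum;
    }

    static double norm(double[] x) {
        return Math.sqrt(dot(x, x));   // the square root is taken exactly once
    }

    /** Cosine of the angle between the weighted query and document vectors. */
    static double similarity(String queryBits, String docBits) {
        double[] q = weights(queryBits);
        double[] d = weights(docBits);
        return dot(q, d) / (norm(q) * norm(d));
    }

    public static void main(String[] args) {
        String q = "000001011";
        String[] docs = {"001011100", "100100011", "100001101", "110000101"};
        for (int i = 0; i < docs.length; i++) {
            System.out.printf("Sim(Q, D%d) = %.4f%n", i + 1, similarity(q, docs[i]));
        }
    }
}
```

Running main prints one similarity value per document, with D2 closest to the query.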

4.1.2 FITNESS EVALUATION

The fitness function is a performance measure or reward function that evaluates how good each solution is. The information retrieval problem is how to retrieve the documents the user requires. We therefore use the cosine similarity in (4.4) as the fitness function, to measure how close a document is to the query.

$$ \cos\theta_{D_i} = \mathrm{Sim}(Q, D_i) = \frac{\sum_j w_{Q,j}\, w_{i,j}}{\sqrt{\sum_j w_{Q,j}^{2}}\ \sqrt{\sum_j w_{i,j}^{2}}} \qquad (4.4) $$

The result of this fitness function lies in the interval 0 to 1. A value of 1.0 means that the document and the query are identical. Values near 1.0 mean that the document and the query are more relevant to each other, and values near 0.0 mean that they are less relevant. The values produced by the fitness function are called "fitness".

4.1.3 SELECTION

chromosomes, give the different breeds. The new chromosomes may be better or poorer than the old chromosomes. If they are poorer than the old chromosomes, they are eliminated in the selection step. The objective of mutation is to restore lost information and to explore the variety of the data. For example, randomly mutate a chromosome at position 6:

Chromosome 0 0 0 1 0 0 0 1 1

Result 0 0 0 1 0 1 0 1 1

4.2 PROCESS OF OUR SYSTEM

1. The user enters a query into our system.

2. Keywords from the user query are matched against the list of keywords.

3. The documents retrieved by the user query are encoded as chromosomes (initial population).

4. The population is fed into the genetic operator process: selection, crossover and mutation.

5. Step 4 is repeated until the maximum generation is reached. This yields an optimized query chromosome for document retrieval.

6. The optimized query chromosome is decoded into a query, and documents are retrieved from the database.

4.3 TEST CASE FORMULATION

This experiment tests queries with the cosine coefficient as the fitness function. The fitness function is tested with the following parameters: probability of crossover (Pc = 0.8) and probability of mutation (Pm = 0.01, 0.10, 0.30), in order to compare the efficiency of the retrieval system. The information retrieval efficiency is measured by precision P, recall R and the test accuracy F1.

$$ P = \frac{|\text{relevant documents} \cap \text{documents retrieved}|}{|\text{documents retrieved}|} \qquad (4.5) $$

$$ R = \frac{|\text{relevant documents} \cap \text{documents retrieved}|}{|\text{relevant documents}|} \qquad (4.6) $$

$$ F_1 = \frac{2PR}{P + R} \qquad (4.7) $$

                            Result Percentage
Total Docs     Technique    P        R        F1
100            Without GA   100%     100%     100%
               With GA      100%     100%     100%
200            Without GA   97.92%   51.02%   67.09%
               With GA      96.00%   96.15%   96.07%
300            Without GA   92.16%   52.04%   66.52%
               With GA      85.71%   87.50%   86.60%
400            Without GA   92.16%   52.04%   66.52%
               With GA      85.71%   87.50%   86.60%
500            Without GA   92.16%   52.04%   66.52%
               With GA      85.71%   87.50%   86.60%
Average (Σ/5)  Without GA   94.88%   61.43%   73.33%
               With GA      90.63%   91.73%   91.17%

Table 4.2 Values of precision, recall and F1.

Fig. 4.2 The average percentage result for P, R and F1 (bar chart of Precision, Recall and F1 on a 0% to 100% scale; series: Without GA, With GA).

Fig. 4.2 shows the average results of precision, recall and F1 for e-business topics. The precision with GA (90.63%) is lower than the precision without GA (94.88%). This means that only some of the retrieved documents are relevant to the user's search. However, the recall with GA is 91.73%, compared to 61.43% without GA. This means that 91.73% of the relevant documents are successfully found by the system for the query selected by the user. The F1 with GA (91.17%) is also higher than the result without GA (73.33%). From these results, we believe that document search based on the GA has a higher accuracy rate than search without the GA.

Fig. 4.3 Precision (precision against total number of documents, 100 to 500; series: Without GA, With GA)

Fig. 4.4 Recall (recall against total number of documents, 100 to 500; series: Without GA, With GA)

Fig. 4.5 F1 (F1 against total number of documents, 100 to 500; series: Without GA, With GA)

As shown in Fig. 4.3, the precision values with and without the GA decrease as the total number of documents increases. This is caused by the keyword expansion of the GA process, which returns results that the system judges relevant but that are less accurate with respect to the user's search. However, the recall and F1 values with the GA, shown in Fig. 4.4 and Fig. 4.5, are higher than those without the GA.

4.4 SCREEN SHOTS

Fig. 4.6 Web Crawler

Fig. 4.7 Indexing in vector space model and searching interface

CHAPTER V

CONCLUSION

We have used a crawler, which retrieves documents from e-business web pages. These web pages then go through the information retrieval process.

The proposed Query Optimization Search System is a two-stage approach:

• The first stage uses a genetic algorithm to obtain the best combination of terms.

• The second stage uses the output of the first stage to retrieve more relevant results.

Thus a novel two-stage approach to document retrieval using a Genetic Algorithm has been proposed. The proposed information retrieval system is more efficient within a specific domain, as it retrieves more relevant results. This has been verified using the evaluation measures precision and recall.



CHAPTER VI

FUTURE SCOPE OF WORK

We found that, by using a genetic algorithm, the searching process of the e-business website is optimized.

Furthermore, a feedback mechanism can pass the user's suggestions about the documents found back to the search system, which leads to a new query generated by the genetic algorithm. In the new search stage, more relevant documents are given to the user.

The future research plan is to improve the performance of the user's search activities so that user profiles can be learned automatically.

We believe that the experimental results are interesting and useful for related research, and that the research issue identified should be studied further in other collaborative environments such as search engines and the search systems of personal computers.



REFERENCES

[1] Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008. http://informationretrieval.org/

[2] David E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley, 1989.

[3] H. Chen, "Machine learning for information retrieval: neural networks, symbolic learning, and genetic algorithms", Journal of the American Society for Information Science, 46(3), 1995, pp. 194-216.

[4] Ahmed A. A. Radwan, Bahgat A. Abdel Latef, Abdel Mgeid A. Ali and Osman A. Sadek, "Using Genetic Algorithm to Improve Information Retrieval Systems", World Academy of Science, Engineering and Technology 17.

[5] Eman Al Mashagba, Feras Al Mashagba and Mohammad Othman Nassar, "Query Optimization Using Genetic Algorithms in the Vector Space Model", IJCSI International Journal of Computer Science Issues, Vol. 8, Issue 5, No 3, September 2011, ISSN (Online): 1694-0814, www.IJCSI.org

[6] Cristina López-Pujalte, Vicente P. Guerrero-Bote and Félix de Moya-Anegón, "Order-Based Fitness Functions for Genetic Algorithms Applied to Relevance Feedback", Journal of the American Society for Information Science and Technology, 54(2), 2003, pp. 152-160.

[7] A. S. Siva Sathya and B. Philomina Simon, "A Document Retrieval System with Combination Terms Using Genetic Algorithm", International Journal of Computer and Electrical Engineering, Vol. 2, No. 1, February 2010, ISSN 1793-8163.

[8] Detelin Luchev and Abdelmgeid A. Aly, "Applying Genetic Algorithm in query improvement problem", International Journal "Information Technologies and Knowledge", Vol. 1, 2007.

[9] Philomina Simon, "Two Stage Approach to Document Retrieval using Genetic Algorithm", International Journal of Recent Trends in Engineering, Vol. 1, No. 1, May 2009.

[10] Abdelmgeid A. Aly, "Enhancing Information Retrieval by using Evolution Strategies", Information Theories and Applications, Vol. 15, 2008.

[11] Andrew T., "An Artificial Intelligence Approach to Information Retrieval", Information Processing and Management, 40(4), 2004, pp. 619-632.

[12] D. Vrajitoru, "Crossover improvement for the genetic algorithm in information retrieval", Information Processing & Management, 34(4), 1998, pp. 405-415.

[13] Ilmério R. Silva, João Nunes Souza and Karina S. Santos, "Dependence among Terms in Vector Space Model", Proceedings of the International Database Engineering and Applications Symposium (IDEAS'04), 1098-8068/04 © 2004 IEEE.

[14] Hai-yan Kang et al., "Research on Natural Language IR System based on Genetic Algorithm and VSM", Proceedings of the Third International Conference on Machine Learning and Cybernetics, Shanghai, 26-29 August 2004.

[15] Shaozi Li, Changfe Zhou and Huowang Chen, "Web Document Retrieval Based on Multi-agent", Proceedings of the 8th International Conference on Computer Supported Cooperative Work in Design.

[16] Amir Karshenas and Kamil Dimililer, "PIRS: An Information Retrieval System based on the Vector Space Model", IEEE.

[17] Louis S. Wang, "Relevance Weighting of Multi-Term Queries for Vector Space Model", 978-1-4244-2765-9/09/ © 2009 IEEE.



APPENDIX

Web Crawler written in Java

Usage: From command line

java WebCrawler <URL> [N]

where URL is the URL at which to start the crawl, and N (optional) is the maximum number of pages to download.

import java.text.*;
import java.util.*;
import java.net.*;
import java.io.*;

public class WebCrawler {

    public static final int SEARCH_LIMIT = 20; // Absolute max pages
    public static final boolean DEBUG = false;
    public static final String DISALLOW = "Disallow:";
    public static final int MAXSIZE = 20000; // Max size of file

    // URLs to be searched
    Vector newURLs;
    // Known URLs
    Hashtable knownURLs;
    // max number of pages to download
    int maxPages;

    // initializes data structures. argv is the command line arguments.
    public void initialize(String[] argv) {
        URL url;
        knownURLs = new Hashtable();
        newURLs = new Vector();
        try { url = new URL(argv[0]); }
        catch (MalformedURLException e) {
            System.out.println("Invalid starting URL " + argv[0]);
            return;
        }
        knownURLs.put(url, new Integer(1));
        newURLs.addElement(url);
        System.out.println("Starting search: Initial URL " + url.toString());
        maxPages = SEARCH_LIMIT;
        if (argv.length > 1) {
            int iPages = Integer.parseInt(argv[1]);
            if (iPages < maxPages) maxPages = iPages; }
        System.out.println("Maximum number of pages: " + maxPages);

        /* Behind a firewall set your proxy and port here */
        Properties props = new Properties(System.getProperties());
        props.put("http.proxySet", "true");
        props.put("http.proxyHost", "webcache-cup");
        props.put("http.proxyPort", "8080");
        Properties newprops = new Properties(props);
        System.setProperties(newprops);
    }

    // Check that the robot exclusion protocol does not disallow
    // downloading url.
    public boolean robotSafe(URL url) {
        String strHost = url.getHost();
        // form URL of the robots.txt file
        String strRobot = "http://" + strHost + "/robots.txt";
        URL urlRobot;
        try { urlRobot = new URL(strRobot);
        } catch (MalformedURLException e) {
            // something weird is happening, so don't trust it
            return false;
        }
        if (DEBUG) System.out.println("Checking robot protocol " + urlRobot.toString());



        int index = 0;
        while ((index = strCommands.indexOf(DISALLOW, index)) != -1) {
            index += DISALLOW.length();
            String strPath = strCommands.substring(index);
            StringTokenizer st = new StringTokenizer(strPath);
            if (!st.hasMoreTokens())
                break;
            String strBadPath = st.nextToken();
            // if the URL starts with a disallowed path, it is not safe
            if (strURL.indexOf(strBadPath) == 0)
                return false;
        }
        return true;
    }

    // adds new URL to the queue. Accept only new URLs that end in
    // htm or html. oldURL is the context, newUrlString is the link
    // (either an absolute or a relative URL).
    public void addnewurl(URL oldURL, String newUrlString)
    { URL url;
      if (DEBUG) System.out.println("URL String " + newUrlString);
      try { url = new URL(oldURL, newUrlString);
            if (!knownURLs.containsKey(url)) {
                String filename = url.getFile();
                int iSuffix = filename.lastIndexOf("htm");
                if ((iSuffix == filename.length() - 3) ||
                    (iSuffix == filename.length() - 4)) {
                    knownURLs.put(url, new Integer(1));
                    newURLs.addElement(url);
                    System.out.println("Found new URL " + url.toString());
                } } }
      catch (MalformedURLException e) { return; }
    }

    // Download contents of URL
    public String getpage(URL url)
    { try {
        // try opening the URL
        URLConnection urlConnection = url.openConnection();
        System.out.println("Downloading " + url.toString());
        urlConnection.setAllowUserInteraction(false);
        InputStream urlStream = url.openStream();
        // search the input stream for links
        // first, read in the entire URL
        byte b[] = new byte[1000];
        int numRead = urlStream.read(b);
        String content = new String(b, 0, numRead);
        while ((numRead != -1) && (content.length() < MAXSIZE)) {
            numRead = urlStream.read(b);
            if (numRead != -1) {
                String newContent = new String(b, 0, numRead);
                content += newContent;
            }
        }
        return content;
      } catch (IOException e) {
        System.out.println("ERROR: couldn't open URL ");
        return "";
      }
    }

    // Go through page finding links to URLs. A link is signalled
    // by <a href=" ... It ends with a close angle bracket, preceded
    // by a close quote, possibly preceded by a hatch mark (marking a
    // fragment, an internal page marker)
    public void processpage(URL url, String page)
    { String lcPage = page.toLowerCase(); // Page in lower case
      int index = 0; // position in page
      int iEndAngle, ihref, iURL, iCloseQuote, iHatchMark, iEnd;
      while ((index = lcPage.indexOf("<a", index)) != -1) {
          iEndAngle = lcPage.indexOf(">", index);
          ihref = lcPage.indexOf("href", index);
          if (ihref != -1) {
              iURL = lcPage.indexOf("\"", ihref) + 1;
              if ((iURL != -1) && (iEndAngle != -1) && (iURL < iEndAngle))
              { iCloseQuote = lcPage.indexOf("\"", iURL);
                iHatchMark = lcPage.indexOf("#", iURL);
                if ((iCloseQuote != -1) && (iCloseQuote < iEndAngle)) {
                    iEnd = iCloseQuote;
                    if ((iHatchMark != -1) && (iHatchMark < iCloseQuote))
                        iEnd = iHatchMark;
                    String newUrlString = page.substring(iURL, iEnd);
                    addnewurl(url, newUrlString);
                } } }
          index = iEndAngle;
      }
    }

    // Top-level procedure. Keep popping a url off newURLs, download
    // it, and accumulate new URLs
    public void run(String[] argv)
    { initialize(argv);
      for (int i = 0; i < maxPages; i++) {
          URL url = (URL) newURLs.elementAt(0);
          newURLs.removeElementAt(0);
          if (DEBUG) System.out.println("Searching " + url.toString());
          if (robotSafe(url)) {



package index;

import generic.Tuple2;
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.Vector;

/**
 * The inverted index much more efficiently stores the indexed data for later
 * retrieval. This forms the centre of the project and should be the only class
 * that needs to be instantiated. All useful functions can be accessed through this
 * class.



for(inti U D; i threads; iKK)

this! ThreadsiJ U ne% Inde$ Thread (this,i);

X

    /**
     * Query for a list of relevant files. If a term appears in a query but not in
     * the document collection it will be ignored.
     * @param s the query string
     * @param maxresults the maximum number of results to return
     * @return a list of files matching the query
     */
    public SortedSet<QueryResult> query(String s, int maxresults) {
        // Since the IndexThread converts every term to lower case so shall we.
        // Strings are immutable, so the result must be assigned back.
        s = s.toLowerCase();
        // Create a list of files to return
        SortedSet<QueryResult> retval = new TreeSet<QueryResult>();
        // Split the terms of the query by the non-word regular expression class
        String[] split = s.split("\\W+");
        // This implementation considers the query document vector a binary vector,
        // in other words, duplicates are not allowed
        Set<String> terms = new HashSet<String>();
        for (String str : split)
            terms.add(str);
        // For every document in the collection calculate the similarity coefficient
        for (int i = 0; i < this.docTab.size(); i++) {
            double sc = 0.0;
            for (String term : terms) {
                TermData td = index.get(term);

                if (td != null)
                    sc += td.getIDF() * td.getIDF() * td.getFreq(i);
            }
            File f = this.docTab.get(i).getFile();
            if (sc > 0)
                retval.add(new QueryResult(sc, f));
            // Keep only the best maxresults entries
            if (retval.size() > maxresults)
                retval.remove(retval.last());
        }
        return retval;
    }
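
The scoring loop above sums, for each document, the squared inverse document frequency times the term frequency over the distinct query terms, and keeps only the best maxresults entries. A minimal self-contained sketch of that ranking follows. The names TfIdfSketch and Hit, the array-based frequency table, and the choice idf = log10(N / df) are assumptions made for illustration; the listing does not show how TermData.getIDF() is computed or how QueryResult orders its entries.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical, simplified stand-in for the index; not the thesis classes. */
final class TfIdfSketch {

    /** A scored document, ordered by descending score, then ascending document id. */
    static final class Hit implements Comparable<Hit> {
        final double score;
        final int docId;

        Hit(double score, int docId) {
            this.score = score;
            this.docId = docId;
        }

        @Override
        public int compareTo(Hit o) {
            int c = Double.compare(o.score, score);
            return c != 0 ? c : Integer.compare(docId, o.docId);
        }
    }

    /** freq maps each term to an array holding its frequency in every document. */
    static TreeSet<Hit> query(Map<String, int[]> freq, int numDocs, String s, int maxResults) {
        // Binary query vector: lower-cased, split on non-word characters, duplicates dropped
        Set<String> terms = new HashSet<>();
        for (String t : s.toLowerCase().split("\\W+")) {
            if (!t.isEmpty()) terms.add(t);
        }
        TreeSet<Hit> hits = new TreeSet<>();
        for (int doc = 0; doc < numDocs; doc++) {
            double sc = 0.0;
            for (String term : terms) {
                int[] tf = freq.get(term);
                if (tf == null) continue;          // terms missing from the collection are ignored
                int df = 0;                        // number of documents containing the term
                for (int f : tf) if (f > 0) df++;
                if (df == 0) continue;
                double idf = Math.log10((double) numDocs / df); // assumed idf definition
                sc += idf * idf * tf[doc];
            }
            if (sc > 0) hits.add(new Hit(sc, doc));
            if (hits.size() > maxResults) hits.remove(hits.last()); // drop the weakest hit
        }
        return hits;
    }

    public static void main(String[] args) {
        Map<String, int[]> freq = new HashMap<>();
        freq.put("search", new int[] {2, 0, 1});
        freq.put("genetic", new int[] {0, 3, 0});
        freq.put("the", new int[] {1, 1, 1});      // appears everywhere, so its idf is 0
        for (Hit h : query(freq, 3, "Genetic search, the", 2)) {
            System.out.println("doc " + h.docId + " score " + h.score);
        }
    }
}
```

Breaking ties on the document id matters here: a TreeSet treats elements that compare as equal as duplicates, so two documents with the same score would otherwise collapse into one result.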

    /**
     * Forces the index to scan all of its indexed files. If some files have been
     * modified or removed then adjust the index appropriately.
     */
    public void forceUpdate() {
        // TODO Implement
    }

    /**
     * Adds a file to the queue for indexing
     * @param f the file to be indexed
     * @throws IllegalArgumentException the file is not readable
     * @throws FileNotFoundException the file is not found
     */
    public void index(File f) throws IllegalArgumentException, FileNotFoundException {
        // Make sure this file is not part of the collection already
        for (Document d : this.docTab) {
            if (d.getFile().equals(f))
                return;

        for (File doc : f)
            this.index(doc);
    }

    /**
     * <p>For internal use, notifies completion of indexing.</p>
     * <p>Must be synchronized as several threads may access this
     * function concurrently. Since Java collections objects should
     * not be accessed concurrently synchronized behaviour is required.</p>
     * @param t the thread that has completed
     */
    public synchronized void notifyCompletion(IndexThread t) {
        // Retrieve the values of this thread
        File f = t.getFile();
        Map<String, Integer> freq = t.getTermFrequency();
        // Add the document to the document table
        try {
            this.docTab.add(new Document(f));
        } catch (Exception e) {
            System.err.println("File'" + t.getFile().getName() + "' - " + e.getMessage());
            e.printStackTrace();
            return;
        }
        // Update the collection size so that idf can be calculated correctly
        for (TermData td : this.index.values())
            td.setDocumentCollectionSize(this.docTab.size());

        // Fire off collection size listeners
        this.fireCollectionChangeEvent();
        // Retrieve the document identifier
        int docId = this.docTab.size() - 1;
        // Merge the terms into the posting list
        for (Map.Entry<String, Integer> e : freq.entrySet()) {
            // Get the term from the index map
            TermData td = this.index.get(e.getKey());
            // If the term was not found create and add it to the index
            if (td == null) {
                td = new TermData(this.docTab.size());
                this.index.put(e.getKey(), td);
                this.fireTermChangeEvent();
            }
            // Add the frequency to the posting list
            td.addFrequency(e.getValue(), docId);
        }
        // Set this thread to idle
        t.setIdle();
        // If the queue is not empty poll it and start indexing again.
        // Otherwise, check the first idle pointer and set the thread to idle
        if (this.indexQueue.size() > 0) {
            try {
                t.index(indexQueue.poll());
                this.fireQueueChangeEvent();

    public void removeIndexListener(IndexListener l) {
        this.listeners.remove(l);
    }

    /** Convenience function for firing off collection size change events */
    private void fireCollectionChangeEvent() {
        for (IndexListener l : this.listeners)
            l.sizeChanged(IndexListener.szTypes.DOCUMENTCOLLECTION, this.size());
    }

    /** Convenience function for firing off queue size change events */
    private void fireQueueChangeEvent() {
        for (IndexListener l : this.listeners)
            l.sizeChanged(IndexListener.szTypes.FILEQUEUE, this.indexQueue.size());
    }

    /** Convenience function for firing off term size change events */
    private void fireTermChangeEvent() {
        for (IndexListener l : this.listeners)
            l.sizeChanged(IndexListener.szTypes.TERMCOLLECTION, this.index.size());
    }

    /** Returns the number of files indexed */
    public int size() {
        return docTab.size();
    }

    /** Returns the number of terms globally in the collection */
    public int termSize() {

        return this.index.size();
    }

    /** Returns the number of indexing threads this inverted index uses */
    public int getNumberThreads() {
        return this.threads.length;
    }

    /** Returns the current progress of each thread and the file they are indexing as a map. */
    public Map<Integer, Tuple2<Float, File>> getThreadMap() {
        Map<Integer, Tuple2<Float, File>> result = new HashMap<Integer, Tuple2<Float, File>>();
        for (int i = 0; i < this.threads.length; i++) {
            Tuple2<Float, File> p = new Tuple2<Float, File>();
            p.first = threads[i].progress();
            p.second = threads[i].getFile();
            result.put(i, p);
        }
        return result;
    }

    @Override
    public String toString() {
        StringBuilder b = new StringBuilder();
        b.append("Index\n");
        for (Map.Entry<String, TermData> e : this.index.entrySet()) {
