
USING FUZZY SETS FOR RETRIEVAL OF SOFTWARE FOR REUSE

A dissertation presented in partial fulfillment of the requirements for the degree of Doctor of Computer Science

By

Erin Colvin

Colorado Technical University

December, 2014

© Erin Colvin, 2014


Committee

Dr. Donald Kraft, Ph.D., Chair

Dr. Caroline Howard, Ph.D., Committee Member

Dr. Bo Sanden, Ph.D., Committee Member

Date Approved

5/30/2014



Abstract

This dissertation analyzes the use of fuzzy sets for the retrieval of software

for the purpose of reuse. The need for quality software components is growing exponentially, but

the ability of software developers to meet this need is not. One major goal of any programmer is

to develop quality software in an efficient amount of time. By reusing an already created and

tested piece of code, programmers can do just that. Most development teams shy away from

reuse because they cannot find quality software quickly. Software needs to be easily

accessible and the results of a query to find quality software need to meet a user’s expectation.

Most software search algorithms are based on a Boolean search, where a term used to describe a

given software component either matches or doesn’t match a queried term. Here, using fuzzy

logic, a term is given a weight based on its degree of membership, which will result in a

weighted list of matches while maintaining the semantics of Boolean logic. The result is a list of

matched documents in descending order by how well the document matches the queried term.

Various methods of information retrieval implementation, analysis of fuzzy set retrieval,

and benefits of its use for software reuse will be examined and presented in this dissertation. A

deeper explanation of the fundamentals of designing a fuzzy information retrieval system for

software lookup will also be provided. Future research options and necessary data storage

systems will be explained as well.



Acknowledgements

I would like to express the deepest appreciation to my mentor and committee chair Dr. Donald

Kraft, who has shown the attitude and the substance of nothing short of a saint and a genius: he

continually encouraged me and conveyed a spirit that nothing is impossible in regard to my

research. He opened many doors to other researchers and always knew the right thing to say to

encourage me to continue on and finish this dissertation. His wit and humor always put a smile

on my face and without his constant help this dissertation would not have been possible.

I would also like to thank my readers and other committee members, Dr. Howard and Dr.

Sanden, you both have been inspirational faculty to me while at Colorado Technical University

and I thank you for your time and advice that you have given me over the past 3 years. I hope

one day I will be as amazing and inspiring as you have been as part of the faculty at CTU.



Dedication

This paper and degree would not have been possible without the continued support of my

wonderful husband, thank you for all you do serving our country and our family. Thank you for

stepping up when my head was in a book, I love you.

To my parents, who taught me that I could do anything I set my mind to, I love you mom

and dad, thanks for all the support. Lastly, to my kids who I hope learn that any goal can be

accomplished if you set your mind to it.



Table of Contents

Abstract

Acknowledgements

Dedication

List of Figures

CHAPTER 1: INTRODUCTION

BACKGROUND OF THE PROBLEM

Software Reuse

Information Retrieval

Fuzzy Retrieval

Purpose of the Study

Statement of the Problem

Research Questions

Hypothesis

Theoretical Framework

Objectives of the Study

Assumptions and Limitations

Summary

CHAPTER 2: REVIEW OF THE LITERATURE

Introduction to Software Reuse

Searching for Software

Information Retrieval - Boolean Logic

Fuzzy Sets and Extended Boolean Logic

Document Pre-processing

Information Retrieval Software

Measures of similarity

Literature Review Summary

CHAPTER 3: METHODOLOGY

Lucene

Procedure

Data

Sample Size

Instrumentation

Validity and Reliability

Conclusion

CHAPTER 4: RESULTS

Data Collection Reviewed

Presentation and Discussion of Findings

Limitations of the Study

Measure of Similarity

Quantitative Methodology Measurement

Hypothesis Testing

Conclusion

CHAPTER 5: DISCUSSION OF RESULTS AND FUTURE WORKS

Research Question 1

Future Research

Conclusion

References

Appendix A

Appendix B



List of Figures

Table 1: Values calculated by Boolean logic search for term A and B

Table 2: Values calculated by vector-processing logic search for term A and B

Table 3: Returned data with file scores to show MAP calculation

Table 4: Similarity scores with file numbers to show ranking

Table 5: AP and MAP scores for all searches

Table 6: Search MAP scores and the ranking of best to worst search

Table 7: MAP scores for all searches and % increase over Boolean search

Figure 1: Indexing Process

Figure 2: List of files found in the UNIX data corpus

Figure 3: The alarm.1 file found in the data corpus

Figure 4: User interface for custom created software

Figure 5: List of files found in the UNIX directory

Figure 6: Graph of AP among all searches for all queries

Equation 1: RSV for an AND query

Equation 2: MMM similarity for an OR query

Equation 3: MMM similarity for an AND query

Equation 4: Paice similarity

Equation 5: P-Norm similarity for an OR query

Equation 6: P-Norm similarity for an AND query

Equation 7: Average precision

Equation 8: Mean average precision or MAP

CHAPTER 1: INTRODUCTION

The main goal of any programmer is to develop and deliver high quality software

applications that meet a customer’s needs effectively and efficiently. For years, programmers

have searched the web and other open source libraries for software components to reuse instead

of creating an application from scratch (Thummalapenta, 2011). The number of open-source

software libraries on the web has increased as well. Software reuse has been a proven effective

tool for developers to meet time-to-market deadlines and produce a solid, error-free piece of

software (Krueger, 1992). Software libraries are full of software components that have been

created and stored with proven working track records. Mockus found that 50% of the code being

created for production was using code from previous programs (2007). The trick is being able to

find an already-written piece of software when needed. When searching for data, most users try

to capture what they are looking for in as few words as possible, and the system yields

the results that it determines to be the best match (Bordogna, Carrara & Pasi, 1992). Searching

for data can be done in many ways; searching for software requires an alternative approach since

the programming language, the behavior of the software, or the intended purpose may require

different search terms.

Many information retrieval (IR) applications today use a Boolean match; a document

either contains the searched term, or it doesn’t. The IR system returns a list of documents in no

particular order, requiring the user to discern which match is the closest fit. Fuzzy logic attaches

a weighted value to those matches based on the degree of match, which improves the

chance of meeting the user’s goals (Bordogna, Carrara & Pasi, 1992).

By using software that is already created, tested and verified to work, programmers can

reduce development and test time. Less development and test time means a faster time to

market. Software reuse can include one of four types of reusable artifacts: data reuse,


architecture reuse, design reuse, and program reuse (Aziz & North, 2007). If software returned

from a search can be listed by the degree of match, a user will have a better choice of which

software component to use. Using a fuzzy method, the query will return a weighted list of

results, whereas a Boolean search returns an unranked list. The goal of this study is to implement algorithms using fuzzy

logic that will have a higher success rate of returning a better match of software that can be

reused.

BACKGROUND OF THE PROBLEM

Software Reuse

Software reuse was introduced in 1968 by McIlroy at a NATO conference on software

engineering. McIlroy explained that the industry needed to standardize so that stored software

could be used successfully and compatible, already-created components could be found

easily (McIlroy, 1968). This seminal work

has been cited many times. Krueger explained that software reuse was a great way for software

engineers to reutilize software components in order to save company money. He stated that

reuse of software has many benefits like minimized rework, more time for development and

more stabilized applications (Krueger, 1992).

Sojer and Henkel in 2010 found that developers are more inclined to reuse software

components if they are easily accessible or easily found in a search. Looking at open-source

software applications, they also found that with no standard for software implementation and

most programs lacking the descriptive comments that help determine the use of the software

component, finding the right software components was difficult to impossible (Sojer & Henkel,



2010). Software reuse has been around since the 1960s but has not yet become an industry

standard. One reason for this is the lack of a standard for software storage. Software can be

indexed or stored by behavior, function, text and comment text (Vishal, Chander & Kundu,

2012). This makes searching for software difficult and requires a search algorithm to conform to

multiple attributes. This research will not look at how software is stored or at the comments (if

any), but the point is made that software can be stored and accessed in multiple ways.

Information Retrieval

Information retrieval (IR) has been the topic of many research studies, from the advent of

information retrieval algorithms in the 1950s, which supported the data stores of the new computers of

the time, to the present-day Internet (Kraft, Bordogna & Pasi, 1998) (Pasi & Bordogna, 2013)

(Singhal, 2001). Each search has a primary target set of data, from a simple web search for

general information to specifics like restaurants near a certain location; each can utilize different

contextual information to help with the search (Baeza-Yates & Ribeiro-Neto, 2011) (Belkin &

Croft, 1992).

The basic structure of any information retrieval system is an archive where documents or

data are held and a search engine which will retrieve matches of a search query (Kraft, Bordogna

& Pasi, 1998). Matching in a Boolean information retrieval system doesn’t allow the relevance

of document matches to affect the outcome of the search (Bordogna & Pasi, 1993). If a

document contains the searched word once, that document is returned as a match that is equal to

a document that contains the word one hundred times. This limitation doesn’t always return the

most desired results. One adaptation to this is to add weighted values to documents with more

relevant matches. The main limitation of this approach is knowing which term deserves a higher value

(Bordogna & Pasi, 1993). When searching for a sentence or for multiple terms in a phrase, it is hard to decide

which term should get the highest weight, and when a document is returned

as a match, it is unclear how best to quantify a qualitative result (Bordogna & Pasi, 1993).

Another adaptation is adding weights for specific query terms. For example, if a user searches

the Java library for a function that adds two numbers, the likely index term is sum, but if the user

isn’t sure that sum is the right term and is afraid of missing documents by searching only for

sum, they can put a weight on the word sum and search for similar words, or expand the query

to include sum OR add OR plus. This gives preference to documents that contain sum, which

is what one would expect of the primary term. The inability to express the importance of terms in a

desired document is the main limitation of Boolean search systems (Bordogna, Carrara & Pasi,

1992).
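The contrast between Boolean matching and weighted matching described above can be sketched in a few lines of Java. This is a minimal illustration and not the search used in this study: the class name, the whitespace tokenizer, the raw term-count score and the query weights are all assumptions chosen for brevity.

```java
import java.util.Arrays;

// Hypothetical illustration: names, tokenizer and weights are assumptions, not the study's code.
class BooleanVsWeighted {

    // Count how often a term occurs in a document (case-insensitive, whitespace tokens).
    static long termCount(String document, String term) {
        String wanted = term.toLowerCase();
        return Arrays.stream(document.toLowerCase().split("\\s+"))
                .filter(token -> token.equals(wanted))
                .count();
    }

    // Boolean match: the document either contains the term or it does not.
    static boolean booleanMatch(String document, String term) {
        return termCount(document, term) > 0;
    }

    // Weighted OR query: each query term adds its occurrence count times a user-assigned weight.
    static double weightedOrScore(String document, String[] terms, double[] weights) {
        double score = 0.0;
        for (int i = 0; i < terms.length; i++) {
            score += weights[i] * termCount(document, terms[i]);
        }
        return score;
    }

    public static void main(String[] args) {
        String docA = "sum returns the sum of two numbers";
        String docB = "add two numbers";
        String[] terms = {"sum", "add", "plus"};
        double[] weights = {1.0, 0.5, 0.5}; // assumed weights that favor "sum"

        // Both documents are equal Boolean matches for sum OR add OR plus.
        System.out.println(booleanMatch(docA, "sum") + " " + booleanMatch(docB, "add"));
        // The weighted scores differ, so the documents can be ranked.
        System.out.println(weightedOrScore(docA, terms, weights));
        System.out.println(weightedOrScore(docB, terms, weights));
    }
}
```

Under these assumed weights, a document that uses sum repeatedly ranks above one that only mentions add, even though a Boolean OR query treats the two as equal matches.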

Fuzzy Retrieval

Using fuzzy logic for information retrieval is not a new concept; it was considered by

Radecki and Kraft in 1979, among others, and helped establish the application of fuzzy set

theory (Pasi & Bordogna, 2013). The three basic models of IR today are the Boolean, vector, and

probabilistic models (Baeza-Yates & Ribeiro-Neto, 2011).

A fuzzy set used for information retrieval contains the same two parts as most IR

systems, an archive of documents and a search engine. Often, a thesaurus can be used to

consider other terms related to the original terms in the document or the query. The system then

assigns a value of 0 for no match at all and 1 for an exact match; the numbers in between are used as

weights that are assigned based on the degree of match (Kraft, Bordogna & Pasi, 1998). This

study looks at using term weights for the information retrieval of software components in an in-

house software library for reuse.
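The idea of a degree of membership between 0 and 1 can be illustrated with the standard fuzzy set operators, where AND is the minimum, OR is the maximum and NOT is the complement. The sketch below is an assumption-laden example rather than the system built for this study: the class name is hypothetical, and normalizing a term count by the largest count in the corpus is only one possible way to obtain a membership value.

```java
// Hypothetical sketch: class and method names are assumptions, not the study's code.
class FuzzyMembership {

    // Map a raw term count to a degree of membership in [0, 1].
    // Dividing by the largest count in the corpus is an assumed weighting scheme.
    static double membership(int termCount, int maxCount) {
        if (maxCount <= 0) {
            return 0.0;
        }
        return Math.min(1.0, Math.max(0.0, (double) termCount / maxCount));
    }

    // Standard fuzzy set operators.
    static double and(double a, double b) { return Math.min(a, b); }
    static double or(double a, double b)  { return Math.max(a, b); }
    static double not(double a)           { return 1.0 - a; }

    public static void main(String[] args) {
        double sum = membership(8, 10); // strong match for the term "sum"
        double add = membership(2, 10); // weak match for the term "add"

        System.out.println("sum AND add = " + and(sum, add));
        System.out.println("sum OR add  = " + or(sum, add));
        System.out.println("NOT add     = " + not(add));
    }
}
```

When every membership value is exactly 0 or 1, these operators reduce to ordinary Boolean AND, OR and NOT, which is how a fuzzy query can produce a weighted list while keeping the semantics of Boolean logic.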

Purpose of the Study

This study draws on several areas. The first is software

reuse: why it is not more widely used and one way this can be overcome. This study also

looks at information retrieval techniques and more specifically utilizing fuzzy logic in

combination with information retrieval looking for software to increase the chance of reuse.

This study looks at an enhanced fuzzy set using index term weights, which should increase the

likelihood of a match when searching for software components. By utilizing many of the tools

involved with fuzzy set theory, this study enhances a search for in-house data on software

components. With a more successful return rate of software from a software library search, the

likelihood of reuse also increases.

Statement of the Problem

Current information retrieval algorithms are ineffective in finding software components

because they are either done on the Internet utilizing open source software found on the web or

are constructed using a Boolean logic based matching system or both. It is well documented in

the literature that in order for software to be reused, it has to be found and that the current

information retrieval algorithms are ineffective in finding quality software components in an

efficient manner (Prieto-Diaz, 1991) (Yao, Etzkorn & Virani, 2007).



Research Questions

The inability of any current information retrieval algorithm to accurately and efficiently

find matching software raises three questions:

Q1: Can software be retrieved using a fuzzy logic algorithm and return a more accurate match to

the user’s query?

Q2: Which algorithm, Mixed Min and Max (MMM), P-norm, or Paice, provides the most precise

match results, and are they all better than the Boolean method?

Q3: Can using a fuzzy logic approach to searching for software components reduce the number

of query words needed to find an appropriate match?

Hypothesis

Using term weights calculated in the Lucene software, a researcher-developed search

implementing the MMM, P-norm and Paice algorithms will result in a better matched list of

returned values over the standard Boolean logic method. Looking at the degree of membership

should yield a higher rate of successful match over the Boolean method and a higher success rate

in returned software components should result in a higher reuse rate.
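The three extended Boolean similarity measures named in the hypothesis can be sketched as follows. The formulas are the forms commonly cited in the information retrieval literature for MMM, Paice and P-norm with unweighted query terms; the class name, method names and all coefficient values (the MMM softness coefficient, the Paice ratio r and the exponent p) are assumptions for illustration and may differ from the values used in this study.

```java
import java.util.Arrays;

// Hypothetical sketch of the commonly cited MMM, Paice and P-norm similarity forms.
// The input array holds one document's term weights, each in [0, 1], for the query terms.
class ExtendedBooleanSimilarity {

    // MMM OR: mostly the maximum weight, softened by the minimum (0.5 < cOr <= 1 is typical).
    static double mmmOr(double[] w, double cOr) {
        double max = Arrays.stream(w).max().orElse(0.0);
        double min = Arrays.stream(w).min().orElse(0.0);
        return cOr * max + (1.0 - cOr) * min;
    }

    // MMM AND: mostly the minimum weight, softened by the maximum.
    static double mmmAnd(double[] w, double cAnd) {
        double max = Arrays.stream(w).max().orElse(0.0);
        double min = Arrays.stream(w).min().orElse(0.0);
        return cAnd * min + (1.0 - cAnd) * max;
    }

    // Paice: all weights contribute, with geometrically decaying importance r^(i-1).
    // Weights are taken in descending order for OR and ascending order for AND.
    static double paice(double[] w, double r, boolean isOr) {
        double[] sorted = w.clone();
        Arrays.sort(sorted); // ascending
        double numerator = 0.0;
        double denominator = 0.0;
        double factor = 1.0;
        for (int i = 0; i < sorted.length; i++) {
            double d = isOr ? sorted[sorted.length - 1 - i] : sorted[i];
            numerator += factor * d;
            denominator += factor;
            factor *= r;
        }
        return denominator == 0.0 ? 0.0 : numerator / denominator;
    }

    // P-norm OR: generalized mean of the weights with exponent p >= 1.
    static double pNormOr(double[] w, double p) {
        if (w.length == 0) {
            return 0.0;
        }
        double sum = 0.0;
        for (double d : w) {
            sum += Math.pow(d, p);
        }
        return Math.pow(sum / w.length, 1.0 / p);
    }

    // P-norm AND: one minus the generalized mean of the distances from a perfect match.
    static double pNormAnd(double[] w, double p) {
        if (w.length == 0) {
            return 0.0;
        }
        double sum = 0.0;
        for (double d : w) {
            sum += Math.pow(1.0 - d, p);
        }
        return 1.0 - Math.pow(sum / w.length, 1.0 / p);
    }

    public static void main(String[] args) {
        double[] weights = {0.2, 0.8}; // assumed term weights for a two-term query
        System.out.println("MMM OR:     " + mmmOr(weights, 0.7));
        System.out.println("MMM AND:    " + mmmAnd(weights, 0.7));
        System.out.println("Paice OR:   " + paice(weights, 0.5, true));
        System.out.println("Paice AND:  " + paice(weights, 1.0, false));
        System.out.println("P-norm OR:  " + pNormOr(weights, 2.0));
        System.out.println("P-norm AND: " + pNormAnd(weights, 2.0));
    }
}
```

In each measure an AND query scores lower than the corresponding OR query for the same weights, and when all weights are 0 or all are 1 every method returns 0 or 1 respectively, matching the Boolean result.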

Theoretical Framework

Information retrieval can be designed using Boolean logic or ranked approaches such as fuzzy logic,

the vector space model, and the probabilistic model. It has been shown in the literature that Boolean logic

is the easiest, cleanest, most common method for creating an information retrieval system but not

the most successful because it lacks a ranking method. Searching for software has primarily been



done on the Internet via open-source code data banks which can store the data in many different

formats and can be hard to search. Based on the literature, there have been no software

information retrieval systems implemented with fuzzy logic; this study will use fuzzy logic for

information retrieval for software components to increase the reuse of software. If software can

be easily accessed and retrieved, it can be reused, thereby saving programmers’ time and

development efforts and increasing time available for new program development (Mili, Mili &

Mili, 1995).

This study uses a researcher-developed application built on the Lucene 4.6.1

indexing plug-in and then runs a researcher-written search application using the

Lucene search plug-in. The search used the tf/idf (term frequency/inverse document frequency)

calculation as the term weights, which were then used in the similarity calculations for a query

using the Mixed Min and Max (MMM), P-Norm, Boolean and Paice algorithms, all

written in Java using the Eclipse development environment. Searched “documents” can be

anything from a text file to a song to a .JPEG, but for the purpose of this study we will be using

text files that contain software help instructions for the UNIX operating system. The Lucene

indexer can handle all types of files and indexes them into one indexed file, no matter the starting

format. The system will also be utilizing the stemming functionality built into Lucene to

combine versions of words that stem to the same root, like running, runs, etc. The resulting data

was presented and analyzed using standard precision, recall and the mean average precision

(MAP) measure.
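The evaluation measure named above can be illustrated with a short sketch of average precision (AP) and mean average precision (MAP). It follows the usual textbook definition, in which AP averages the precision at each rank where a relevant file appears and divides by the number of relevant files; the class name, method names and example file names are hypothetical, and the study's own equations may normalize differently.

```java
import java.util.List;
import java.util.Set;

// Hypothetical sketch of AP and MAP; not the instrumentation used in this study.
class MeanAveragePrecision {

    // Average precision for one query: the mean of precision@k over the ranks k
    // at which a relevant file is returned, divided by the number of relevant files.
    static double averagePrecision(List<String> ranked, Set<String> relevant) {
        if (relevant.isEmpty()) {
            return 0.0;
        }
        int hits = 0;
        double precisionSum = 0.0;
        for (int k = 0; k < ranked.size(); k++) {
            if (relevant.contains(ranked.get(k))) {
                hits++;
                precisionSum += (double) hits / (k + 1); // precision at rank k + 1
            }
        }
        return precisionSum / relevant.size();
    }

    // MAP: the arithmetic mean of the per-query AP values.
    static double meanAveragePrecision(double[] averagePrecisions) {
        if (averagePrecisions.length == 0) {
            return 0.0;
        }
        double sum = 0.0;
        for (double ap : averagePrecisions) {
            sum += ap;
        }
        return sum / averagePrecisions.length;
    }

    public static void main(String[] args) {
        // Assumed ranked result list and relevance judgments for a single query.
        List<String> ranked = List.of("alarm.1", "cat.1", "ls.1", "grep.1");
        Set<String> relevant = Set.of("alarm.1", "ls.1");

        double ap = averagePrecision(ranked, relevant);
        System.out.println("AP  = " + ap);
        System.out.println("MAP = " + meanAveragePrecision(new double[] {ap, 0.5}));
    }
}
```

Because AP rewards relevant files that appear early in the list, a search that ranks its results well earns a higher MAP than one that returns the same files in arbitrary order, which is why MAP suits a comparison between ranked fuzzy searches and an unranked Boolean search.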



Objectives of the Study

The objective of the study was to create a methodology that can be used to determine if

using fuzzy logic can improve the return of software components for the intended purpose of

reuse. The literature presented in this study shows that software reuse increases when searches return

better software matches. Increasing the reuse of software can

decrease the developer’s time spent on projects, which allows new projects to be created,

and yields a more stable and reliable product.

Assumptions and Limitations

There are several assumptions and limitations for this research. The first limitation is that

the indexing is done using the Lucene software application; more detailed information can be found

on its website, http://www.apache.lucene.org. This research also uses the UNIX help

library as its data corpus. The man help files will be downloaded for the latest version of the

FreeBSD software. No other software library was used. There were no software libraries

available on the TREC website or any other website. Using man pages instead of software files

may change the way the data is searched or the type of data that can be searched, but it should not

affect the search itself. There is no test data available to show this difference. The data will be

compared to the results of the study done by Maarek to show that the results achieved are in fact

the results expected. The number of relevant files should be similar to the study by Maarek,

since the data corpus is similar. Other exceptions will be noted and documented.

To determine what is relevant and aid the calculation of recall and precision, a panel

of UNIX experts was asked to rank the returned files in order from most relevant to

least. Ideally there would be a panel of software users and experts to help determine relevance of

UNIX files.

Summary

Chapter 1 has looked at the background issues associated with information retrieval and

software reuse. Also discussed has been the need for a good information retrieval system that

will successfully retrieve software components to help software be reused. Software can only be

reused if it is readily available, and the cost of reuse must be lower than the cost to create from

scratch. Software components that are previously used are tested and usually well documented,

reducing the need for new development. Information retrieval can be executed in a number of

ways, including Boolean logic and ranked approaches such as fuzzy logic and the vector space and probabilistic models. Boolean

logic will return all documents that contain the queried word; fuzzy logic will return a varying

degree of match based on the document’s use of the queried word.

The purpose of this study is to create a search for software in a document corpus local to

the computer versus searches in the literature that look at software on the internet. This study

will utilize the Lucene indexing application to create the index and then a researcher-modified

search using term weights. Recall, precision and MAP will be calculated and used to compare

the search algorithms to each other; the higher the MAP value the better the algorithm is at

retrieving successful matches. Chapter 2 looks at the current literature that discusses the topic of

software reuse and information retrieval along with other studies that have used information

retrieval for software. Chapter 3 will explain the methodology used in this study and how the

system is created and executed.



Chapter 4 will discuss the details of this study and the data that was produced by the

experiment. The data will be discussed in detail as to what it means for this study and for other

research. Chapter 4 will answer the research questions and discuss how this study either

supports or refutes the study’s hypothesis. Chapter 5 goes into detail about future research that

can expand on this study or take this study in a new direction.



CHAPTER 2: REVIEW OF THE LITERATURE

Chapter 1 described the need for software reuse in today’s programming industry and

explained some of the reasons why it is not widely used in everyday practice. One reason is the

need for an accurate look-up or retrieval system that can quickly and accurately return a query

with an appropriate software component. There are many methods to implement information

retrieval; the one at the focus of this study uses fuzzy sets. Because fuzzy set theory offers a

wider range of possible matches through its degree of match, it should deliver the most accurate result

to a user’s query.

This chapter will examine the literature that supports the need for software reuse from

when it was introduced in 1968 to today. Information retrieval algorithms come in three major

forms, Boolean, vector space and probabilistic; those applications will be looked at with a

comparison of a Boolean search vs. term-weighted applications. Finally, this chapter explores

the literature of fuzzy sets and how they can be used for information retrieval and their benefit

over other methods.

Introduction to Software Reuse

Software reuse was first introduced by McIlroy at a NATO conference in 1968 (McIlroy,

1968). McIlroy (1968) understood the importance of creating a solid software component and

the need to create an inventory system that will allow these components to be widely accessible

to different machines and users. McIlroy said then that in order for reuse to be widely used,

there is a need for a standardized library to store and index software components (McIlroy,



1968). This issue has been widely discussed throughout the literature as the primary obstacle

preventing software reuse from becoming an industry standard (Mili, Mili & Mili, 1995).

There are two major benefits to software reuse, as noted by Gibb, McCartan,

O'Donnell, Sweeney and Leon: 1. “Those components that have already been tested provide

higher guarantees of robustness and reliability in any future implementation and 2. Component

reuse should lead to faster development times and lower costs” (2000, p.212). With the

increasing demands for software development and the inability of programmers to keep up,

software reuse is a practice that can reduce the development time and lead to the increased

stability of a system (Yao, Etzkorn & Virani, 2008). Software reuse was first introduced as a

way to minimize creation time and help build a more stable system with components that have

been previously created and tested (Krueger, 1992). Charles Krueger states that although this

practice was introduced in the 1960s, it is still not widely used today in software

engineering. Krueger goes on to say that software reuse can be defined as the “direct reuse of

components or code, the abstraction of ideas, or adaptation of software to fit the needs of others”

(1992, p.131). Software reuse also reduces effort and development time which decreases time to

market for certain systems. By decreasing time to market, companies can take on workloads

and projects that they would normally not be able to handle, and the reuse of quality tested

software means less downtime and fewer repairs for network maintenance staff (Vishal, Chander &

Kundu, 2012).

In 2012, it was reported that software reuse had a 91% effect on lowering development

time, shortened the testing time by 83%, increased the overall product quality by 76% and



shortened time to market by 72% (Kokkoras, Ntonas, Kritikos, Kakarontzas & Stamelos, 2012).

Hewlett-Packard has shown improvements in product quality from code reuse of 24% to 76%

and improved time to market of 12% to 42% (Keswani, Joshi & Jatain, 2014). Using a reused

software component will also improve the quality of that component; the more times a

component gets reused, the more chances bugs can be found and fixed and also the initial cost of

creating the component can be made up in just a few reuses (Vishal, Chander & Kundu, 2012).

In their 1995 paper, Mili, Mili and Mili point out that in 1984, 60% of software created

could have been standardized and reused. Quantifying the amount of software that is actually

reused has posed a problem for researchers. Mockus says that more than 50% of the open source

software files that were available to his study in 2007 have been used in more than one program,

and this was based on a file being present in more than one program. But this is just open source

code that is widely available on the internet; Mockus’s 2007 study didn’t look at in-house

software libraries within a company. For a quantifiable measure of reuse, Mockus drew on a

previous empirical study by NASA, whose algorithm searched directories of source

code files and computed the number of shared files as a fraction of the total number of files (Mockus,

2007). In total, Mockus (2007) looked through 13.2 million open source code files and found

that 0.52, or 52%, of the files had been shared at least once. This suggests that any file shared on an

open source website has roughly a 50% chance of being reused (Mockus, 2007).
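
The shared-file measure described above can be illustrated with a short Python sketch. It is only a minimal illustration under stated assumptions, not the algorithm used by Mockus or by the NASA study: the function name reuse_fraction, the project names, and the idea of identifying a file by its name are all hypothetical.

```python
from collections import defaultdict


def reuse_fraction(projects):
    """Return the share of distinct files that occur in more than one project.

    `projects` maps a project name to an iterable of file identifiers.
    Both the mapping and the identifiers are assumptions of this sketch.
    """
    owners = defaultdict(set)
    for project, files in projects.items():
        for file_id in files:
            owners[file_id].add(project)
    if not owners:
        return 0.0
    shared = sum(1 for projs in owners.values() if len(projs) > 1)
    return shared / len(owners)


# Hypothetical data: two projects that share two of four distinct files.
example = {
    "proj_a": ["sort.c", "list.c", "main_a.c"],
    "proj_b": ["sort.c", "list.c", "main_b.c"],
}
print(reuse_fraction(example))  # 2 shared out of 4 distinct files -> 0.5
```

A measure of this kind counts a file as reused as soon as it appears in a second project, which is consistent with the observation above that presence in more than one program was the criterion.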

One issue with software reuse is the vast amount of information that is considered

software. Software, in general, consists of coded lines, comment lines and possibly a behavior

diagram (Aziz & North, 2007). Software reuse can include one of four types of reusable

artifacts: data reuse, architecture reuse, design reuse, and program reuse (Aziz & North, 2007;

Mili, Mili & Mili, 1995). Each of these four categories contains items that, if reused, qualify as

software reuse and need to be considered when items are being stored in a repository. Data reuse

is the reuse of standardized data formed or used by a software component; architecture reuse is

the reuse of a standardized set of design and programming techniques dealing with the

organization of the software; design reuse is the reuse of a standardized layout of software

components; and lastly program reuse is the reuse of any or all pieces of code used in a software

application (Mili, Mili & Mili, 1995). Rothenberger, Dooley, Kulkarni and Nada (2003) did a

study of 71 software development groups asking different questions about the reuse of software.

They found that the biggest hurdle is finding reusable software that matches their current architecture,

and that even when the code or components found do not match exactly, developers still

reuse some part of the code, whether by rewriting it or adapting it to fit their system. The

demand for quality software and the sustainability of code that has already been tested were the

number one driving factors they found in their study.

In a 2012 survey, 87% of software engineers said code was the most important reusable

artifact, while 80% said design was the most reusable and 75% said documentation was the most

reusable of software artifacts (participants were allowed to vote for more than one) (Kokkoras,

Ntonas, Kritikos, Kakarontzas & Stamelos, 2012). This diversity of opinion about the most

important reusable artifact makes it challenging to create a reusable repository of

software components.

Before software can be reused, it has to be stored in a way that makes it easy to find; this

is where programmers differ in opinion (Mili, Mili & Mili, 1995). You have to be able to

find the correct software components before they can be reused, says Barringer (1984). This has

led researchers to look for an effective storage and retrieval method for finding quality

software, and to the creation of software libraries that can be easily searched and

retained. Burton, Aragon, Bailey, Koehler and Mayes (1987) designed a software library for just

that purpose: the storage and reuse of software components. Their library was built specifically for

Ada components; the Ada programming language supports the reuse of components from legacy

systems through its use of generic and packaged components. The reusable software library stored

attribute values based on a software component’s functionality, complexity, structure, quality

of documentation, and level of testing. These attribute values were then compared against

the values given when a query was made (Burton, Aragon, Bailey, Koehler & Mayes, 1987).

Sandhu, Kaur and Singh (2009) looked at reusing software from currently active systems,

which they said had not previously been examined; most work on software reusability is based on older

versions of software that are stored and no longer in use. They observed that in order for

programmers to reuse software, they must be able to find it useful first (Sandhu, Kaur & Singh,

2009). They found that to get a more accurate measure of reuse, the domain must also be taken

into consideration and devised a neural network that would automatically evaluate the reusability

of object oriented software components. Their metric was successful in measuring the

reusability of software but was not the best as compared to other similar studies (Sandhu, Kaur &

Singh, 2009).

It has been shown that if programmers can find good quality software they are more

likely to reuse the code segment. In one study, programmers were asked specifically what drives

them to reuse code and what means they are most likely to use to find the code. The programmers

said that their main means of finding software was searching, either the internet or

databases they had access to such as SourceSafe, a source code check-in and check-out

application where developers can maintain their finished product (Sim, Clarke & Holt, 1998).

The biggest drivers of searching for software components included not wanting to rewrite large

sections of code such as a sort or a search; wanting to understand current code implementations; code

repair; the “desire to work on preferred tasks should lead developers to reuse code that they

prefer not to write on their own”; resource constraints such as a lack of time and testing

resources; and monetary incentives (Haefliger, von Krogh &

Spaeth, 2008, p. 183; Sim, Clarke & Holt, 1998). In agreement with the study by Haefliger et

al., a study by Agresti (2011) found that programmers saw a 26% increase in productivity

by reusing old code, and that overall programmers are willing to reuse code no matter the

time constraint: good code is good code. An interesting finding from this study was that

programmers did not hold the “if you want something done right, do it yourself” mentality,

but that one of the biggest deterrents to reusing code is a lack of documentation of what the

code actually does, or overly difficult code (Agresti, 2011). So far the studies examined have

dealt with technical obstacles for reuse of software, but Morisio, Ezran & Tully (2002) did a

study and found that non-technical issues played just as much of a role in reasons why

companies don’t reuse code. By interviewing those persons involved with certain projects, they

deduced that overall companies were reluctant to change when they had a system that worked,

and although there are many advantages to reusing code, the process seemed to halt at the top

management level. Executives need to get on board with the process and procedures and they

need to do more than set up a repository of software in order for software reuse to become a

widely practiced method of software development (Morisio, Ezran & Tully, 2002). Other non-technical

issues discussed in a 2014 paper by Keswani, Joshi and Jatain include economic

barriers (many organizations cannot afford to develop reuse groups);

administrative impediments (sometimes reuse across different business units is not feasible);

political impediments (programmers and management are often wary of people in other

groups or organizations and skeptical of using their code); and lastly psychological impediments

(programmers want code that they understand and, even though it is time consuming, are willing

to write all the code themselves). The biggest technical hurdle that Keswani, Joshi and Jatain

identified in their paper was the lack of technical skill among programmers. With such high

demand for software developers, companies are willing to overlook college or fundamental

training, and when it comes to understanding other people’s code, such programmers lack

the knowledge of design patterns and proper frameworks needed to understand how to reuse code.

Whatever the motivation for reusing software, this paper will show that code search is not only

easy but also effective. The concept that software, if easily found and accessible, can be

reused is the driving reason for this research.

Searching for Software

One of the biggest hurdles of storing software is the lack of industry specifications. How

a library is set up determines how it’s searched and different people store different items as

indices or keys, making any search difficult. Maarek et al. implemented an automatic way of

constructing software libraries that can be used for information. Their goal was to construct a

library of software that “provides a sufficient number of components that offer a spectrum of

domains that can be reused as is, or black box reuse, and is organized such that the code closest

to the user’s query is easy to locate” (Maarek, Berry & Kaiser, 1991, p.800). This system

classified software based on the code, internal documentation and contextual information by

creating an index which is then stored with the software as a tag. When a search is made on the

library, the search term is then transformed into an index using the same algorithm that classified

the library and the indices are compared for a match. If there is no exact match the system finds

a match that is similar, or functionally similar in nature (Maarek, Berry & Kaiser, 1991).

Since software can be searched and stored by many different attributes, such as behavior,

function or structure, a study by Frakes and Pole looked at allowing software to be searched by

any of these attributes (Frakes & Pole, 1994; Prieto-Diaz & Freeman, 1987). Using a database

of UNIX commands and the Proteus software application, Frakes and Pole (1994) not only

looked at recall and precision but also search time, user preference and helpfulness of the

methods. Using the test set based on the study by Maarek, these authors allowed 35 employees

of the Software Productivity Consortium to query the system seven times using the four

classifications of software storage: keyword, faceted, enumerated and attribute value.

Previous studies have concluded that, even though different kinds of

searches can be done, recall and precision are usually not significantly different from study to

study. Frakes and Pole had similar results: although the searches returned different documents,

recall and precision were not significantly different between searches based on faceted,

enumerated, keyword or attribute-based classification. The difference came in the search times. These

authors found that enumerated searches resulted in the biggest gap between expected and actual search

times; the others were close to predicted. An enumerated classification system is one

where an item is classified by subject and that subject is then broken down into “mutually

exclusive, usually hierarchical, classes”, for example, in the UNIX command list classification,

UNIX -> Directory -> Create -> Mkdir; the Dewey Decimal system is another

example of an enumerated classification (Frakes & Pole, 1994, p. 619). This study suggests that

search results do not depend much on whether one searches by attribute, behavior or keyword; this is one reason this

research will focus on the setup of the search versus focusing on the searched item (Frakes &

Pole, 1994).

Zhang et al. proposed a hash function for detecting reusable content (Zhang, Wu, Ding &

Huang, 2012). They created a signature for each sentence and stored that value with the software

content in the document. When a query is made, the distance between the stored sentence signature and the

query sentence is computed; if that distance is less than a certain threshold value,

that content is deemed a match. This study was successful, but because of the additional data stored

with the software component it was not the most efficient way to evaluate software (Zhang, Wu,

Ding & Huang, 2012).

Houhamdi and Ghoul designed and implemented the Reuse Description Formalism

(RDF). The RDF code categorization indexing is “capable of representing not only software

component at the code level, but it is also capable of representing more abstract or complex

software entities”, for example relationships such as is-a and

component-of (Houhamdi & Ghoul, 2001, p. 41). The RDF is also flexible enough

to add a new object to the library without having to re-identify all preceding objects,

and lastly the RDF “provides a consistency verification mechanism” (Houhamdi & Ghoul, 2001,

p. 41). The RDF is a great tool that seems to overcome all shortfalls of other software libraries

but getting it implemented into the mainstream is a big hurdle. Until there is a standardized

library for software there will be no need to standardize the library.

There are many ways to search for software components but they all start with how the

data is stored. Software can be stored using free-text keywords, a faceted index, or a semantic-net

representation (Aziz & North, 2007; Khalifa, Khayati & Ghezala, 2008). Software retrieval schemes

fall into groups which include “keyword search, faceted classification, signature matching,

behavioral matching and semantic-based method” (Khalifa, Khayati & Ghezala, 2008, p. 134).

Prieto-Diaz discussed how to implement a faceted classification scheme. In order for

software to be searched it has to be organized correctly, and implementing a faceted classification

system is one way to do that. In a faceted classification, keywords are selected from a predefined

list and assigned to the classes; each class gets a group of terms, and those terms are attributes from

selected facets (Prieto-Diaz & Freeman, 1987). Software can have attributes from groups like

objects, system type, and functional area, giving each class a triple description. Prieto-Diaz and

Freeman say that most software can be individually grouped by that method. With this

system, searches have a higher match rate; when compared to a database with no

organization scheme, the system had a 100% increase in precision and a 50% reduction in recall

(Prieto-Diaz & Freeman, 1987).

Khalifa, Khayati and Ghezala look at the behavioral matching model in their software

search algorithm. By creating and storing Unified Modeling Language (UML) diagrams based

on a piece of code’s behavior, they say that software searches have a better chance of finding a

match. After the UML document is created, it is parsed into a first-order logic representation which is

stored external to the software. Searching for software can be by keyword or behavior in this

setup, and with the introduction of a stored UML diagram the user does not need prior

knowledge of how the software behaves in order to search successfully (Khalifa, Khayati &

Ghezala, 2008).

Reiss did another study looking at semantic-based matching. Semantics is similar to

behavior in that it concerns what the system can do, but differs from behavior traits in that it is usually

found in the test cases and documentation. Reiss said this type of software searching was

common in the 1990s, when most searches would look at the signatures of functions and

components. Reiss took this technique one step further by including security

requirements and prerequisites plus the parameters of functions when searching for

semantic-based code. Using Eclipse to build the initial syntax tree, he added a semantic analyzer

that read over the tree and added annotations where needed for each node. The biggest hurdle

with this implementation is that each function has to be compiled before its result set can be

searched, and if there is a massive software library this could take some time (Reiss, 2009).

Yao, Etzkorn and Virani also looked at the semantic properties of software in

their 2008 study. They said their approach would increase the reuse of software by creating an

automated classification system for software components. Using a tagging mechanism, Yao,

Etzkorn and Virani created a description of the semantics and attached it to each software

component. Then a natural language description is assigned to the component and a simple

search engine is used to match a user’s query to either the simple description or semantic

descriptor using a modified version of RDF (Yao, Etzkorn & Virani, 2008).

Marri, Thummalapenta and Xie proposed a new approach to using a code search engine,

or CSE. They say that the CSEs that are available on the internet can only accomplish one simple

task at a time, for example finding a function for sum in C++. With the addition of an API that

assists with three common software development tasks, 1) to learn about a common API and its

programming rules, 2) to use those rules to detect flaws in a program and 3) to infer a fix for the

detected defect, the authors expected that they could not only find software but make it better and

increase its chances for reuse (Marri, Thummalapenta & Xie, 2009). Their code search life-cycle

model was able to assist developers with development, maintenance and verification by

searching and fixing software components on the web.

Using the CSEs available on the internet, Kokkoras, Ntonas, Kritikos, Kakarontzas and

Stamelos created a new system that fed on two or more of those CSEs. Their

system would have a single interface page and would direct search queries to other CSEs on

the web, such as Krugle or Koders, thereby eliminating the need to collect and store their own

data (Kokkoras, Ntonas, Kritikos, Kakarontzas & Stamelos, 2012). This study was successful in searching

and finding code on the internet but again was only using code search engines and offered no

increase to the speed or reliability of these searches.

Suresh Thummalapenta introduced a web based software search system to find software

that could be reused. In the dissertation, Thummalapenta says that the number of open source software

libraries has grown exponentially and with it the number of searches of those libraries has also

grown. Open source libraries on the web like SourceForge.net host approximately 230,000

projects. By implementing a web crawler to search the internet together with a parser, Thummalapenta

was able to effectively search and find software components using an XOR pattern only

(Thummalapenta, 2011). Although Thummalapenta’s study was done using the web, the way the

documents were searched looking for software components is similar to this study.

Isakowitz and Kauffman propose a method of search for software components that

utilizes hypertext technology. The idea of keyword matching for any information retrieval

system requires the manual labeling and storing of similar words or keywords for every software object.

This approach is tedious and not plausible for large software libraries, the authors argue. The other

method, introduced by Prieto-Diaz (faceted classification), uses keyword associations made by

software components by introducing the commonality of the domain, for example in the

Computer Aided Software Engineering (CASE) environment. They propose using hypertext

links from stored software objects to similar objects; this requires storing only one link for each

object. Once a link is stored, that link’s matching link will ultimately lead a user to multiple

matches of a searched word. The study concluded with three contributions: a successful

illustration of an approach to automated classification of an object repository in a CASE system;

a demonstration that hypertext technology provides a useful set of capabilities when

combined with a “repository-based application meta-model”; and a demonstration of how all this

can be combined to make a working reuse search support tool (Isakowitz & Kauffman,

1996, p. 421).

Rosalva Gallardo-Valencia and Susan Elliott Sim said that searching over the internet and

searching in a development environment pose two different types of problems, rather than

being the same issue on different domains. When searching for code in an integrated

development environment (IDE), programmers are usually looking for one particular piece of

code for “defect repair, reuse code, understand the problem or impact analysis”

(Gallardo-Valencia & Sim, 2009, p. 50). The authors say that internet-based code search is another topic

because code on the internet can be returned in the form of a code object or component, a

reference to how a program works, a completed program, or an http web address link to the

developer’s homepage (Gallardo-Valencia & Sim, 2009). This difference in motivation leads to

different sizes and different looks of query result sets; this, the authors say, is why internet-scale

code search is a new topic and should be treated as one, and not compared to IR

(Gallardo-Valencia & Sim, 2009).

Whether a code search is done on the internet or in an IDE, it uses a query to search

for a match. In the article Using Iterative Refinement to Find Reusable Software, Henninger

looks at creating the query and how it can be changed to make code searching easier and faster.

Henninger’s study looks at taking a simple query from a user and using CodeFinder’s interface,

which gets populated with the hierarchical structure of the current selection as the user continues to

click; this is an extended version of query expansion. As the graph populates and as the user

continues to select terms that fit their query, the system builds the query in the background, and

when the user finally selects search the query might have gone from print to print macs on a

Lisp application (Henninger, 1994). This method worked for the purposes of the article, but for

the research in this paper, building a query builder is out of scope, although query expansion will

be used if applicable in this study.

Sandhu and Singh implemented a neuro-fuzzy approach to find the reusability of

software components that will automatically create a value for a software component based on

reusability, reliability and quality of development. Their approach creates a COM object using

Visual Basic coding language that will run as a service on a user’s machine. This service will

extract a value for a software component based on the “nearest-neighbor-based, agglomerative,

hierarchical, unsupervised conceptual clustering” (Sandhu & Singh, 2007, p. 357). Using a

complete linkage algorithm, the entire document’s similarity value is compared to the value of

the query using the document similarity matrix based on the cophenetic distance (Sandhu &

Singh, 2007). Using a neural network with a fuzzy inference system incorporated, through much

iteration, the authors were able to refine a match successfully.

Finding content for reuse purposes has added challenges, according to Zhang, Wu, Ding and

Huang. They say that detecting content for reuse is harder because “reuse may

happen at different levels” and a positive match may not be enough to indicate definite reuse

(Zhang, Wu, Ding & Huang, 2012, p. 405). Thummalapenta (2011) discovered this obstacle,

otherwise known as a false positive, in which a document matches but the use of the content is

not what the user intended. This issue will be documented in the future work section of this

paper.

Information Retrieval - Boolean Logic

In 1959 Maron and Kuhns introduced a novel technique to solve the library indexing

issue by defining an index for a document as a unique tag that identifies the information in that

document. Using this index, data is easier and faster to search through; although they did not

have electronic versions of documents, Maron and Kuhns’s system was still effective in

retrieving information from the library. This approach of searching for information using only a

small piece of data representative of the entire entry is the same principle that underlies the Dewey

Decimal system and other information retrieval systems (Maron & Kuhns, 1959).

Boolean logic is simple and clean: the results are returned in a list in which every item has the

same chance of matching the query as the next. The problem with that, in today’s world of data

storage, is that the lists returned can be very large (Baeza-Yates & Ribeiro-Neto, 2011).

Even though a weighted system would return a more accurate list, the Boolean retrieval model is

still the most popular among search based algorithms (Bordogna & Pasi, 1993). Most searches

today incorporate some form of weighted system for data searches to reduce the amount of

returned data and to limit the number of relevant returns to only a certain percentage of matches.

A Boolean system gives a value of 1 to a document that contains the queried term and a

value of 0 to a document that does not contain the queried term. This is based on only one

occurrence of the term, so documents containing the term a different number of times get returned

the same way (Baeza-Yates & Ribeiro-Neto, 2011). This means a document that contains the

searched term 100 times gets returned with the same priority as a document that contains the

searched term 1 time. Although this is a fast way to search through documents, when listing the

matches for the user, the resulting list can be deceiving (Baeza-Yates & Ribeiro-Neto, 2011).
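
The behavior described above can be seen in a minimal Python sketch. It assumes a naive whitespace tokenizer, and the function name boolean_score and the three sample documents are hypothetical; it is meant only to show that a Boolean score ignores how often a term occurs.

```python
def boolean_score(document, term):
    """Return 1 if the term occurs at least once in the document, else 0."""
    return 1 if term in document.lower().split() else 0


# Hypothetical documents used only for illustration.
docs = {
    "d1": "sort " * 100,           # the term appears 100 times
    "d2": "a quick sort routine",  # the term appears once
    "d3": "binary search tree",    # the term does not appear
}

scores = {name: boolean_score(text, "sort") for name, text in docs.items()}
matches = [name for name, score in scores.items() if score == 1]

print(scores)   # d1 and d2 receive the same score of 1
print(matches)  # both are returned with equal priority
```

Because d1 and d2 are indistinguishable in the result, any ordering between them has to come from the user rather than from the retrieval model.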

Although it is a simple and clean way to search, Boolean logic does have a number of

disadvantages. Salton, Fox and Wu (1983) say that the size of the output is hard to control or

even predict, because matching is done regardless of the number of times a term occurs in a

document. The output is also not ranked by how well a document matches the queried term,

so choosing the document that best meets the user’s query is left to the user.

Fuzzy Sets and Extended Boolean Logic

It has been shown that in the Boolean logic model, if one searches for two words

connected by the Boolean AND, a document that matches only one word will be discarded just like

a document that matches neither word, as shown in Table 1 (Baeza-Yates & Ribeiro-Neto, 2011;

Bookstein, 1979; Fox & Sharan, 1986; Verma & Sharma, 2013).

A    B    A OR B     Value
T    T    T          1
T    F    T          1
F    T    T          1
F    F    F          0

A    B    A AND B    Value
T    T    T          1
T    F    F          0
F    T    F          0
F    F    F          0


Table 1: Results of Boolean logic on a document searching for terms A and B

Miyamoto (1990) says that, put simply, an IR system using fuzzy logic takes a query as input and returns

a list of documents as output, and the score of each document is a measure of the degree of

match of that document to the query.
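
The contrast between the all-or-nothing values of Table 1 and a graded degree of match can be sketched in Python. The sketch assumes the standard minimum and maximum operators for fuzzy AND and OR, which the text above does not specify; the function names and the membership weights are hypothetical.

```python
def fuzzy_and(weight_a, weight_b):
    """Fuzzy AND, assumed here to be the minimum of the two weights."""
    return min(weight_a, weight_b)


def fuzzy_or(weight_a, weight_b):
    """Fuzzy OR, assumed here to be the maximum of the two weights."""
    return max(weight_a, weight_b)


# With weights restricted to 0 and 1, the operators reproduce Table 1.
for a in (1, 0):
    for b in (1, 0):
        print(a, b, "OR:", fuzzy_or(a, b), "AND:", fuzzy_and(a, b))

# With graded weights (hypothetical values), the score is graded as well.
weight_a, weight_b = 0.9, 0.2
print(fuzzy_and(weight_a, weight_b))  # 0.2
print(fuzzy_or(weight_a, weight_b))   # 0.9
```

Under these assumed operators a document that carries term A strongly and term B weakly receives a small but nonzero AND score instead of being discarded, while a document whose weight for one term is exactly 0 still scores 0 for AND.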

Bordogna and Pasi created a fuzzy linguistic approach with generalized Boolean IR

which they say combines the imprecision of a fuzzy model with the

accuracy of a Boolean match. In their study, Bordogna and Pasi replaced the numeric fuzzy weights with

linguistic values, for example replacing values such as .8 and .9 with very important, not important, etc. They

say that by replacing the numeric value with a qualitative descriptor, they can get a better sense of how

a term matches a user’s query. They found that this allowed the user to be able to calculate the

recall and precision much easier and to not have to quantify a specific number for the degree of

importance of a term in a document (Bordogna & Pasi, 1993).

There are three basic elements of any information retrieval system: its sets of documents

and terms, an indexing system and perhaps a filter to limit the number of responses. A fuzzy

thesaurus will determine the matching result of documents by its membership value. With a

certain range of membership values assigned to a set, a search will return a broad range of

matches. Nomoto, Kubo and Kosuge tested the use of fuzzy thesaurus generation by searching

for document matches in their 1995 paper. By creating their own fuzzy thesaurus and

implementing a cross-index matrix which was done by assigning the index of each document to a

set and narrowing, broadening, and then taking a cross value of the resulting matrix, they were

able to match terms more accurately, all depending on user’s preference (Nomoto, Kubo &

Kosuge, 1995).


There is a difference in data gathering for information retrieval and artificial intelligence.

Maarek, Berry and Kaiser (1991) say indexing for IR is text specific versus indexing for AI

which is knowledge based. One looks at text only while the other looks at context and referring

knowledge. The issue with text only is comparing natural language elements. To tackle this

issue, Chau and Yeh (2004) suggest a generic, language-independent domain to which every

document is converted before being searched for like terms. This helps combat missing matches due to

different terms not translating but adds another step to the process: translation and storing the

translation (Chau & Yeh, 2004).

There are two options to consider when indexing a document. Free-text indexing places

no limit on the number of indexes that are allowed, whereas controlled indexing

allows only a limited number of indexes. Both indexing methods have the same effect

on outcomes, but when looking for software an uncontrolled method, or free-text indexing, is the

best option for reasons of cost and performance (Maarek, Berry & Kaiser, 1991). This research

will use a free text method to index the files.

A study performed by Bordogna, Carrara and Pasi extended the Boolean information

retrieval methodology to help satisfy a user’s query better using a weighted system (Bordogna,

Carrara & Pasi, 1992). The authors used the Retrieval Status Value (RSV), which is obtained by

combining the resulting weights from the function F : D × T → [0, 1] (Bordogna, Carrara & Pasi,

1992). D is the set of all documents, T is the set of terms, and F is a function of the occurrences

of a term t ∈ T in a document d ∈ D (Bordogna, Carrara & Pasi, 1992). This RSV is used to represent the closeness to

the ideal document. A value of 0 for an index term means there is no match for the indexed term

and a value of 1 indicates a perfect match (Kraft, Bordogna & Pasi, 1998). Based on this value a

constraint system is set up that affirms the value as degree of relevance to the desired ideal. The

results are then listed in descending order from perfect match.
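
A minimal Python sketch of this kind of weighted ranking follows. The weighting used for F is an assumption made only for illustration (the count of the term divided by the count of the most frequent term in the document); it is not the formula used by Bordogna, Carrara and Pasi, and the names index_weight and rank are hypothetical.

```python
from collections import Counter


def index_weight(document, term):
    """Hypothetical F(d, t): the count of the term divided by the count of
    the most frequent term in the document, so the value lies in [0, 1]."""
    counts = Counter(document.lower().split())
    if not counts:
        return 0.0
    return counts[term.lower()] / max(counts.values())


def rank(documents, term):
    """Return (name, weight) pairs with a nonzero weight, best match first."""
    scored = [(name, index_weight(text, term)) for name, text in documents.items()]
    return sorted((pair for pair in scored if pair[1] > 0),
                  key=lambda pair: pair[1], reverse=True)


# Hypothetical documents used only for illustration.
docs = {
    "d1": "sort sort sort list",
    "d2": "list list sort",
    "d3": "tree search",
}
print(rank(docs, "sort"))  # [('d1', 1.0), ('d2', 0.5)]
```

For a single-term query the weight itself plays the role of the retrieval status value; a query with several terms would need a rule for combining the weights, which this sketch leaves out.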

The other statistical models are the vector and probabilistic models (Srinivasan, Ruiz,

Kraft & Chen, 2000). The vector model assigns a non-binary weight to indexed terms and a

degree of similarity is calculated between the query and each stored document in the system. The resulting list

of matched documents is sorted and presented in descending order; this allows documents that

are only partially a match to get returned to the user. The formal definition of the vector model

is such that “the weight w_{i,j} associated with a term-document pair (k_i, d_j) is non-negative and

non-binary” (Baeza-Yates & Ribeiro-Neto, 2011, p.77). The probabilistic model looks at the

documents and once it finds its matches it assigns a value of probability that the user will find the

document relevant. Instead of assigning a degree of match, it looks at the statistics of

probability. Baeza-Yates and Ribeiro-Neto say “given a query q, the probabilistic model assigns

to each document d_j, as a measure of its similarity to the query, the ratio P(d_j relevant-to q) /

P(d_j non-relevant-to q), which computes the odds of the document d_j being relevant to the query

q” (2011, p. 80).
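
The partial matching of the vector model can be sketched in Python. The sketch assumes cosine similarity as the degree of similarity, a common choice for this model that the text above does not name; the function name and all term weights are hypothetical.

```python
import math


def cosine_similarity(doc_weights, query_weights):
    """Cosine of the angle between two sparse term-weight vectors."""
    terms = set(doc_weights) | set(query_weights)
    dot = sum(doc_weights.get(t, 0.0) * query_weights.get(t, 0.0) for t in terms)
    norm_doc = math.sqrt(sum(w * w for w in doc_weights.values()))
    norm_query = math.sqrt(sum(w * w for w in query_weights.values()))
    if norm_doc == 0 or norm_query == 0:
        return 0.0
    return dot / (norm_doc * norm_query)


# Hypothetical non-binary term weights for a query and three documents.
query = {"sort": 1.0, "list": 1.0}
documents = {
    "d1": {"sort": 2.0, "list": 1.0},  # contains both query terms
    "d2": {"sort": 1.0},               # contains only one query term
    "d3": {"tree": 1.0},               # contains neither query term
}

ranked = sorted(documents,
                key=lambda name: cosine_similarity(documents[name], query),
                reverse=True)
print(ranked)  # ['d1', 'd2', 'd3']
```

Here d2 matches only one of the two query terms yet still receives a nonzero similarity, so it is returned below d1 rather than being dropped.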

Lotfi Zadeh is credited with the creation of fuzzy-set theory in 1965, but parts of the

theory can be traced back to the 1920s (Miller, 1996). Fuzzy-set theory is described as a cross

between Boolean logic and multi-valued set theory (Miller, 1996). Zadeh says of fuzzy-set

theory that “as a system becomes more complex, the need to describe it with precision becomes

less important” (Miller, 1996, p. 29). Fuzzy sets allow a degree of match or varying scale of

relationship that is otherwise not included in mathematics.

A fuzzy set is defined by Klir and Yuan as:

“mathematically by assigning to each possible individual in the universe of discourse a

value representing its grade of membership in the fuzzy set. For example, a fuzzy set

representing our concept of sunny might assign a degree of membership of 1 to a cloud

cover of 0%, .8 to a cloud cover of 20%, .4 to a cloud cover of 30% and 0 to a cloud

cover of 75%” (1995, p.491).

By assigning a degree of match, there is more flexibility in data searches for finding a better

match.
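The quoted cloud-cover example can be written down directly as a small lookup. The sketch below is only an illustration: the names `SUNNY`, `membership` and `crisp_member` are hypothetical, returning 0.0 for unlisted values is an assumption, and the 0.5 cutoff used for the Boolean view is an arbitrary choice made for contrast.

```python
# Grades of membership in the fuzzy set "sunny", keyed by cloud cover (%).
# The four points come from the quoted example; nothing between them is implied.
SUNNY = {0: 1.0, 20: 0.8, 30: 0.4, 75: 0.0}


def membership(fuzzy_set, element):
    """Return the grade of membership (assumed 0.0 for elements not listed)."""
    return fuzzy_set.get(element, 0.0)


def crisp_member(fuzzy_set, element, cutoff=0.5):
    """Boolean view of the same set; the cutoff is a hypothetical choice."""
    return membership(fuzzy_set, element) >= cutoff
```

A cloud cover of 30% is simply "not sunny" in the Boolean view, while the fuzzy set keeps the information that it is sunny to degree 0.4.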

Using the Klir analysis of degree of match, Triantafyllos, Vassiliadis and Pechanek

developed a database system that would answer natural language questions. The system was

developed to help evaluate the bookkeeping library in the IBM 4381 system. Using a degree of

confidence, they set a limit of acceptable response and anything below that confidence level was

dismissed. Using this system, they were able to get results that were close to manual evaluation

of information retrieval on the same system.

Zadeh (1994) says there are two central concepts to fuzzy logic, the linguistic variable

and the fuzzy if-then rule. The linguistic variable is any variable whose value can be found in

natural language like sentences or words (Zadeh, 1994). The if-then rule says that the

“antecedent and consequents are propositions containing linguistic variables” (Zadeh, 1994, p.

49). Fuzzy logic is set up in a way that helps group similar terms together, the way the human

mind summarizes data. For example, if the linguistic values are young, old, and infant, the linguistic

variable would be age. The other way to describe this way of grouping is by referring to it as a

membership function. Using linguistic values or membership functions, a series of rules can be

defined. These rules are fundamental to the Fuzzy Dependency and Command Language or

FDCL. The FDCL, unlike Fuzzy Prolog, is not a fuzzified version of a standard programming

language and like all languages FDCL is defined by its semantics and syntax (Zadeh, 1994).

FDCL allows many different rules from fuzzy if-then to simple fuzzy. According to

Zadeh, a typical rule links “m antecedent variables X1, …, Xm to n consequent variables, Y1, …, Yn

and has the form: if X1 is A1 and …Xm is Am, then Y1 is B1 and … Yn is Bn, where X = (X1, …,

Xm) and Y = (Y1,…,Yn) are linguistic variables and (A1,…An) and (B1,…Bn) their respective

linguistic values” (Zadeh, 1994, p. 51). For example, if Temperature is low and Pressure is low

then Volume is large. Rules can have two structures, surface or deep. The surface structure is a

rule in its symbolic form, if X is A then Y is B, and the deep structure contains all the

dependencies that define the membership function of a rule (Zadeh, 1994).
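The rule "if Temperature is low and Pressure is low then Volume is large" can be sketched as follows. This is a minimal illustration, not FDCL: the linear membership function `low` and its breakpoints are hypothetical, and taking the conjunction as the minimum of the two memberships is one common choice that the source does not specify.

```python
def low(x, full=0.0, none=50.0):
    """Hypothetical membership function for the linguistic value "low".

    Returns 1.0 at or below `full`, 0.0 at or above `none`,
    and falls linearly in between.
    """
    if x <= full:
        return 1.0
    if x >= none:
        return 0.0
    return (none - x) / (none - full)


def rule_strength(temperature, pressure):
    """Degree to which 'Temperature is low and Pressure is low' holds.

    The conjunction is taken as the minimum (an assumption); the result is
    the degree of support for the consequent 'Volume is large'.
    """
    return min(low(temperature), low(pressure))
```

Here the surface structure is the rule text itself, while `low` plays the role of the deep structure that defines the membership function.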

In a typical Boolean retrieval model, a document that is queried with term A and term B,

using the AND will result in only the document with both terms present given the value of 1, and

the remaining options will be assigned 0. If the OR operator is used, the document that has

neither term will be assigned the value of 0 and the remaining a value of 1, as shown in Table 1.

Salton, Fox and Wu look at the vector-processing retrieval model, which assigns similarity

values based on the Euclidean distance calculation for those documents that match one but not

both of the query terms. Using the Euclidean distance from the point (1,1) for AND queries,

because (1,1) is the point where a document contains both terms and is therefore the ideal location for a

perfect match, those documents that contain one term but not both would be scored by using

term weights $d_A$ for term A and $d_B$ for term B, $\sqrt{(1-d_A)^2+(1-d_B)^2}$, and from the point (0,0) for

the OR operator the equation $\sqrt{(d_A-0)^2+(d_B-0)^2}$ (Salton, Fox & Wu, 1983). With a maximum

distance possible of √2, Table 2 shows the new calculations for documents that may fall

between (0,0) and (1,1). Although this study will not use the vector-processing retrieval model,

it is clear to see that using a weighted system will provide a closer match than a Boolean logic

model. The results of Salton, Fox and Wu’s vector-processing retrieval model showed a 172%

improvement in recall and precision from a Boolean logic retrieval model on the same data.

A    B    OR    Value
T    T    T     1
T    F    T     1/√2
F    T    T     1/√2
F    F    F     0

A    B    AND   Value
T    T    T     1
T    F    F     1 − 1/√2
F    T    F     1 − 1/√2
F    F    F     0

Table 2: Values calculated using vector-processing retrieval model for term A and term B.
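The values in Table 2 can be reproduced from the two distance expressions. The sketch below assumes that each distance is divided by the maximum possible distance √2, and that the AND value is one minus that normalized distance; this normalization is inferred from the table entries, and the function names are hypothetical.

```python
import math

MAX_DISTANCE = math.sqrt(2)  # distance between (0,0) and (1,1)


def or_value(d_a, d_b):
    """Normalized distance of (d_a, d_b) from (0,0), the worst case for OR."""
    return math.sqrt(d_a ** 2 + d_b ** 2) / MAX_DISTANCE


def and_value(d_a, d_b):
    """One minus the normalized distance from (1,1), the ideal case for AND."""
    return 1 - math.sqrt((1 - d_a) ** 2 + (1 - d_b) ** 2) / MAX_DISTANCE
```

With binary weights, `or_value(1, 0)` gives 1/√2 and `and_value(1, 0)` gives 1 − 1/√2, matching the rows where only one term is present.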

Table 2 also shows that a document matching only one query word

is ranked lower in the returned list than a document matching both query words, which has a

value of 1. Also, a document containing only one queried term receives a higher value than a

document containing neither term in an AND query (Salton, Fox & Wu, 1983). The vector-processing

model is effective in finding a successful match with more precision than a Boolean system, but

if there is more than one term it becomes impossible to determine which term gets the highest

precedence. For example if the search is Information AND Retrieval AND Software AND Reuse,

any document with the word reuse will get the same degree of match as a document containing

the word retrieval. So words that have other uses, like reuse which can be used to describe

anything not just software, will be included in the query results. Bookstein introduced a model

that added weights to not only the searched result list but also to the queried words. Bookstein

suggests reducing the retrieval status value (RSV) by the weight attached to the query term. For example, suppose the

term reuse has the membership values {(d1, 1), (d2, 0.8), (d3, 0)}, where d is a document and the

number that follows is the membership, or how well the document matches the indexed word reuse. If

the request becomes reuse with a weight of 0.5, then the retrieved set becomes {(d1, 0.5), (d2, 0.4), (d3, 0)}

(Bookstein, 1980).
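The scaling in this example amounts to multiplying every membership value by the query-term weight. A minimal sketch, assuming three documents labelled d1 to d3 (the third label is an assumption) and a hypothetical helper name:

```python
def apply_query_weight(memberships, weight):
    """Scale each document's membership value by the query-term weight."""
    return {doc: weight * value for doc, value in memberships.items()}


# Membership of each document in the fuzzy set for the index term "reuse".
reuse = {"d1": 1.0, "d2": 0.8, "d3": 0.0}
weighted = apply_query_weight(reuse, 0.5)
```

The ranking of the documents is unchanged by the scaling; what changes is how much this term can contribute once it is combined with other weighted terms.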

Bookstein’s approach is the foundation for Buell and Kraft’s 1981 paper that looks at a

weighted retrieval model. Buell and Kraft look at a model that replaces the standard 0 and 1

values assigned to an index term with a continuous membership value, which is

defined on the set of documents (D*) crossed with the set of indexed keywords (I*) (indexed keywords

are keywords from a document that get added to the index file with a corresponding number of

occurrences and location of occurrence added to the index file as well), resulting in a membership

value between 0 and 1: F: D* × I* → [0,1] (Buell & Kraft, 1981). Then the calculations for

membership become how much a document is about a query term, versus whether the term exists in the

document or not. The issue of how the system handles the membership of multiple query terms

is taken care of as well and if the Boolean AND is the joining term, taking the maximum

membership of all queried terms in the documents will give the membership for the entire query,

or Max[F(d,T), F(d,S)] = F(d,T) + F(d,S) – F(d,T)*F(d,S). As an example, if the query looks

like this C++ AND sum, where C++ becomes T and sum becomes S then the F(d,T) = .8 and

F(d,S) = .2 then

Max [.8, .2] = .8 + .2 - .8*.2

Max[.8, .2] = .8 + .2 - .16

Max[.8, .2] = 1.0 - .16

Max[.8 ,.2] = .84

The maximum of .2 and .8 is .8, and .84 rounds to .8

The model for the Boolean OR is: Min[F(d,T), F(d,S)] = F(d,T)*F(d,S)

Using the same values of F(d,T) = .8 and F(d,S) = .2, then

Min [.8, .2] =.8*.2

The minimum of .8 and .2 is .2, and .16 rounds to .2

Using this alone will yield a better match over a discrete Boolean 0,1 system, but if the query

terms are C++ AND sum, and if a document is all about C++ with no mention of the second

query term that will be returned as the top query match. Buell and Kraft looked at putting

weights on the query terms but found that if a document has a 0 RSV (retrieval status value), the

entire query’s RSV becomes 0. For the query (T, a) AND (S, b), where a and b are the values that

indicate the relevance each term has in the query, the equation becomes

Max[(F(d,T),a), (F(d,S),b)] = (F(d,T),a) + (F(d,S),b) - (F(d,T),a) * (F(d,S),b). (1)

If a or b = 0, the Max will always be 0 (Buell & Kraft, 1981). For this reason, the query will not

contain a weighted value in this study, but the terms in the documents will be weighted based on

the number of occurrences of the word per document. If both terms have the same weighted

value to a query, but the documents don’t contain any match for either term, the query should

result in no results found. This requires a special case in the algorithm that checks for a

returned RSV of 0. Buell and Kraft present another option for this situation: setting a

threshold value, a value that must be met in order for a document or set of documents to be

about a query enough to be returned. This threshold value is used like a checksum: if the

returned value is equal to or greater than the threshold then the document is relevant; if not, the

document is not about the query enough to present to the user. This is a good approach if users

want to make sure the returned set of documents meet a minimum requirement that is greater

than 0.
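The combination formulas and the threshold test described above can be sketched together. The two combination functions follow the expressions exactly as they are written in this section (the sum-minus-product form for AND, the product for OR); many fuzzy retrieval formulations instead pair AND with the minimum and OR with the maximum, so the pairing here should be read as this section's convention. All function names are hypothetical.

```python
def combine_and(f_t, f_s):
    """Form given in this section for an AND query: F(d,T) + F(d,S) - F(d,T)*F(d,S)."""
    return f_t + f_s - f_t * f_s


def combine_or(f_t, f_s):
    """Form given in this section for an OR query: F(d,T) * F(d,S)."""
    return f_t * f_s


def retrieve(scores, threshold):
    """Keep documents whose RSV meets the threshold, best match first."""
    kept = {doc: rsv for doc, rsv in scores.items() if rsv >= threshold}
    return sorted(kept.items(), key=lambda item: item[1], reverse=True)
```

For F(d,T) = .8 and F(d,S) = .2 the two functions return .84 and .16, the values worked out above. Calling `retrieve` with a threshold greater than 0 also handles the no-match case, since a document with an RSV of 0 is never returned.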

Another popular model for IR systems is BM25. The BM25 model works well with

plain text documents and has proven itself at TREC. The BM25 model looks at how often

a term is in a document and the average document length in the corpus (Robertson, Zaragoza &

Taylor, 2004). BM25 also has two tuning parameters that are commonly set at k = 2 and b

= .75 for best results. For this research, BM25 was used as the benchmark from which to

judge the success of the other methods. Lucene integrates BM25 as a similarity with

default values already set at k1 = 1.2 and b = .75, which will not be changed in this study.
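To make the roles of term frequency, document length and the two parameters concrete, the following sketch scores one term in one document. It uses the widely published Okapi BM25 form with one common idf variant; the formula is not spelled out in this section, so it is an assumption here, as are the function and parameter names. The parameters k and b are left to the caller.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k, b):
    """Score one query term for one document (standard BM25 form, assumed).

    tf          occurrences of the term in the document
    doc_len     length of the document in terms
    avg_doc_len average document length in the corpus
    n_docs      number of documents in the corpus
    doc_freq    number of documents containing the term
    k, b        the two tuning parameters
    """
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    length_norm = k * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k + 1) / (tf + length_norm)
```

The score grows with term frequency but saturates, and a document longer than average is penalized in proportion to b.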

Entering queries can be a tricky task that can change the outcome of a search. Query

expansion is the manipulation of similar words or placement of words in the query to search for

the same idea just using different terminology that may match the corpus better. Query

expansion is usually discussed along with relevance feedback as feedback methods (Baeza-Yates &

Ribeiro-Neto, 2011). For purposes of this research, query expansion will be explored in

the sense of modifying a query to include synonyms or words or phrases with similar meanings.

This has been shown to improve search results when a thesaurus is not used (Xu & Croft,

1996). An example of a query expansion looks like this: sum OR add OR plus. Words with the

same meaning are OR’d together to return any document that may include them.
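A minimal sketch of this kind of expansion is shown below. The synonym table is a hypothetical stand-in for whatever source of related words is used, and `expand_query` is an illustrative helper rather than part of any library.

```python
def expand_query(term, synonyms):
    """OR the term together with its listed synonyms, keeping the term first."""
    related = synonyms.get(term, [])
    return " OR ".join([term, *related])


# Hypothetical synonym table for illustration.
SYNONYMS = {"sum": ["add", "plus"]}
```

A term with no entry in the table is returned unchanged, so the expansion never narrows a query.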

Just searching for matching terms may not always yield the best result. By modifying the

term frequency-inverse document frequency scoring and by calculating the overlap between

documents with similar topics, Chowdhury and Bhuyan implemented a fuzzy information

retrieval model using clustering. By grouping documents by inter-document similarity, searches

could move from document to document quicker and had a 10% increase in both recall and

precision over BM25 and tf-idf calculations. Once a matching document was found, any

document in the same cluster would be compared to find a match, and as soon as a document

was not a match the search stopped (Chowdhury & Bhuyan, 2010).

Term frequency and inverse document frequency are calculated to determine the most

frequent words in a document and which document contains a word the most (Fox & Sharan,

1986). This indexing value helps search algorithms quickly look up a term and find the

correlating documents that contain that term. This term weighting function was first introduced

in 1972 as a way to rank documents for information retrieval systems. The basic formula for idf

says that given a set of N documents in which a term $t_i$ occurs in $n_i$ of the documents, the idf is

$\mathrm{idf}(t_i) = \log(N / n_i)$ (Robertson, 2004). The frequency of a term in a given document is then multiplied

by the idf to get the tf-idf number (Robertson, 2004). The tf/idf value will be used as the term

weights in this research. The number of relevant matches returned will be compared to the study

by Maarek (1991) and also verified by the researcher.
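The tf-idf weight described above can be computed as in the sketch below. It assumes raw term counts for tf and the natural logarithm for idf; the base of the logarithm is not stated in this section, and the function name and input layout are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Compute tf * log(N / n_i) for every term of every document.

    docs maps a document name to its list of tokens; the result maps each
    document name to a {term: weight} dictionary.
    """
    n_docs = len(docs)
    # n_i: number of documents in which each term occurs at least once.
    doc_freq = Counter(term for tokens in docs.values() for term in set(tokens))
    weights = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        weights[name] = {
            term: tf * math.log(n_docs / doc_freq[term])
            for term, tf in counts.items()
        }
    return weights
```

A term that occurs in every document gets an idf of log(1) = 0, so it contributes nothing to the ranking regardless of how often it appears.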

Three other models used for information retrieval that are considered an extended

Boolean approach are the MMM (mixed, min and max), Paice and the p-norm model. The

MMM model is based on work by Zadeh and says “an element has a varying degree of

membership to a given set instead of the traditional membership choice”, but only looks at the

min and max document weights for the index term (Frakes & Baeza-Yates, 1992, p. 395). The

MMM is based on the fuzzy set theory and says that each indexed term has a fuzzy set associated

with it and the weight of a document with respect to an index term is considered to be the degree

of membership of the document in the fuzzy set associated with it.

Using the term frequency calculation (tf/idf) as the term weight, the MMM is calculated by:

$SIM(Q_{or}, D) = C_{or1} \cdot \max(\text{tf/idf of queried terms}) + C_{or2} \cdot \min(\text{tf/idf of queried terms})$ (2)

$SIM(Q_{and}, D) = C_{and1} \cdot \min(\text{tf/idf of queried terms}) + C_{and2} \cdot \max(\text{tf/idf of queried terms})$ (3)

Where Q is the query with an OR or with an AND, D is the document with index-term weights

tf/idf, and C is a coefficient for “softness”. Frakes and Baeza-Yates say “since we would like to

give the maximum of the document weights more importance while considering an or query and

the minimum more importance while considering an and query” and usually $C_{or2}$ is just $1 - C_{or1}$

and $C_{and2}$ is calculated as $1 - C_{and1}$ (1992, p. 396). They found that the two $C_1$ coefficients performed best at 0.6 and

0.3. For purposes of this research we will use the coefficient pairs $(C_1, C_2) = (0.6, 0.4)$ and

$(C_1, C_2) = (0.3, 0.7)$.
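Equations (2) and (3) translate into a few lines of code. In the sketch below the softness coefficient C1 is supplied by the caller and C2 is taken as 1 − C1, as described above; the function names are hypothetical and no particular coefficient value is assumed for either operator.

```python
def mmm_or(weights, c1):
    """MMM similarity for an OR query: c1 * max + (1 - c1) * min."""
    return c1 * max(weights) + (1 - c1) * min(weights)


def mmm_and(weights, c1):
    """MMM similarity for an AND query: c1 * min + (1 - c1) * max."""
    return c1 * min(weights) + (1 - c1) * max(weights)
```

Here `weights` holds the tf/idf weights of the queried terms in one document. With C1 = 1 the two functions reduce to the plain fuzzy max and min, and lowering C1 softens the operator toward the other extreme.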

The Paice model was proposed by Paice in 1984 and is also based on fuzzy set theory.

Similar to the MMM model, the Paice model looks at the weighted indexes in the document but

doesn’t stop at the min and the max like the MMM model does; it considers all of the weights of

the document. The Paice value is calculated by

$$SIM(Q, D) = \frac{\sum_{i=1}^{n} r^{i-1} d_i}{\sum_{i=1}^{n} r^{i-1}} \qquad (4)$$

Where n is the number of query terms, Q is the query, and $d_i$ is the tf/idf weight in document D of the i-th query term. For an OR

query, Q = ($A_1$ or $A_2$ or … or $A_n$), where $A_1$ is query term 1, etc., and for an AND

query, Q = ($A_1$ and $A_2$ and … and $A_n$).
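Equation (4) can be sketched as below. Two details are assumptions that this section does not state: the weights are taken in descending order for an OR query and ascending order for an AND query, and the coefficient r is supplied by the caller. The function name is hypothetical, and at least one weight is expected.

```python
def paice(weights, r, operator):
    """Paice similarity: sum(r**(i-1) * d_i) / sum(r**(i-1)) for i = 1..n.

    weights  tf/idf weights of the query terms in one document
    r        coefficient applied with increasing powers
    operator "or" (weights sorted descending) or "and" (sorted ascending)
    """
    if operator not in ("or", "and"):
        raise ValueError("operator must be 'or' or 'and'")
    ordered = sorted(weights, reverse=(operator == "or"))
    # enumerate starts at 0, which matches the exponent i - 1 for i = 1..n.
    numerator = sum(r ** i * d for i, d in enumerate(ordered))
    denominator = sum(r ** i for i in range(len(ordered)))
    return numerator / denominator
```

Unlike the MMM model, every weight contributes to the result, with the influence of each successive weight shrinking by a factor of r.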

The p-norm model adds another angle to the Paice model by considering the weight of

the query as well as the weights of the documents (Frakes & Baeza-Yates, 1992). Prior research

has found that p = 2 gives good results, so for this research the value of p will be 2. The

p-norm model for an OR’d query is

$$SIM(Q_{or\,p}, D) = \sqrt[p]{\frac{a_1^p d_{A_1}^p + a_2^p d_{A_2}^p + \dots + a_n^p d_{A_n}^p}{a_1^p + a_2^p + \dots + a_n^p}} \qquad (5)$$

Where Q is the query, D is the document, a is the term weight, d is the document weight,

p is set to 2, and A is the term to which the document weight corresponds. For an AND’d

query the model is:

$$SIM(Q_{and\,p}, D) = 1 - \sqrt[p]{\frac{a_1^p (1-d_{A_1})^p + a_2^p (1-d_{A_2})^p + \dots + a_n^p (1-d_{A_n})^p}{a_1^p + a_2^p + \dots + a_n^p}} \qquad (6)$$

With a p value greater than 1 the computational cost is high, but in this study

computational expense is secondary to the quality of the results.
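The p-norm model can be sketched as below, using the standard form of the p-norm similarity, which is assumed here to be what equations (5) and (6) express. The function names are hypothetical. Setting every query weight to 1 corresponds to the unweighted queries used in this study.

```python
def p_norm_or(doc_weights, query_weights, p=2):
    """p-norm similarity for an OR'd query."""
    numerator = sum(a ** p * d ** p for a, d in zip(query_weights, doc_weights))
    denominator = sum(a ** p for a in query_weights)
    return (numerator / denominator) ** (1 / p)


def p_norm_and(doc_weights, query_weights, p=2):
    """p-norm similarity for an AND'd query."""
    numerator = sum(a ** p * (1 - d) ** p for a, d in zip(query_weights, doc_weights))
    denominator = sum(a ** p for a in query_weights)
    return 1 - (numerator / denominator) ** (1 / p)
```

With p = 2, unit query weights and binary document weights, a document containing one of two terms scores 1/√2 for OR and 1 − 1/√2 for AND, the same values as the Euclidean distance model discussed earlier.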

Based on the literature, no previous studies have examined how effective

an extended Boolean approach (MMM, Paice, or p-norm) is for information retrieval of software.

For this research, the MMM, Paice, Boolean and p-norm models will be used to

search a document for software components using the tf/idf value as weights calculated in the

software. Using a level deeper than the simple term weights of tf/idf, this study should provide a

better, more accurate match to a user’s search.

Document Pre-processing

Before a document is searched it is usually pre-processed: white space is stripped, common

words are removed, and the remaining terms are indexed. Indexing allows faster lookup and faster search times. There are

many different methods used to index files; the most common is to index a word by the number

of times it appears in a document or the frequency. During the indexing, terms are parsed out

and stored in an index file along with the frequency or some other ranking number that will help

quickly identify the term in a ranked list (Croft, Metzler & Strohman, 2010). The most common

indexing is the inverted index which, for every unique indexed term, contains a list of the

documents that contain that term (Croft, Metzler & Strohman, 2010). This research will be using

the inverted indexing that is included with the Lucene software.
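An inverted index of the kind described here can be illustrated in a few lines. This is a toy sketch, not Lucene's index format: it splits on white space only, and the function name and input layout are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map every term to the documents containing it, with its frequency there.

    docs maps a document name to its text.
    """
    index = defaultdict(dict)
    for name, text in docs.items():
        for term in text.lower().split():
            index[term][name] = index[term].get(name, 0) + 1
    return dict(index)
```

Looking up a query term is then a single dictionary access that returns the matching documents and the term frequencies needed for ranking.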

Figure 1. The indexing process

Stop words are common words that are found in most documents but not needed as

indexed terms. The most common stop words are the, and, and to; removing them can increase the

speed of a search and decrease the size of the indexed file (Croft, Metzler & Strohman, 2010).

This research will be searching software files, and a stop word algorithm will be used but the list

of stop words will be researcher created to remove software specific common terms like =, ;, for,

while, etc.
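A stop-word filter with a software-specific list might look like the sketch below. The list shown combines only the examples named in this section and is not the list used in the study; the names are hypothetical.

```python
# Example stop list: the English and software-specific terms named above.
STOP_WORDS = {"the", "and", "to", "=", ";", "for", "while"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stop list (case-insensitive)."""
    return [token for token in tokens if token.lower() not in stop_words]
```

Filtering before indexing keeps tokens such as loop keywords and operators from dominating the index of a source-code corpus.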

Stemming is another helpful action that is done in the indexer and will speed up search

times. Stemming reduces the variants of a word to one word, usually the shortest form (Croft,

Metzler & Strohman, 2010). For example the words run, running, and runs will stem to the

same word, run. This eliminates the need for extra space to store all versions of the indexed

words. The process of stemming takes a word’s base, after removing prefixes and suffixes, and

stores that word and also searches for matches of that base. This allows a word like running to

be matched to run and runs, which not only increases the chance of a match but also increases the

compression factor by 50% (Frakes & Baeza-Yates, 1992). The most common stemming

algorithm is the Porter stemming algorithm, which is included in Lucene’s SnowballAnalyzer,

which will be used in this research.
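The idea of stemming can be shown with a deliberately naive suffix stripper. The sketch below is not the Porter algorithm and handles only a handful of suffixes; it is a hypothetical illustration of how run, running and runs collapse to one index entry.

```python
def naive_stem(word):
    """Strip one common suffix and collapse a doubled final consonant.

    A toy illustration only; real stemmers apply many more rules.
    """
    stem = word.lower()
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters of stem remain.
        if stem.endswith(suffix) and len(stem) - len(suffix) >= 3:
            stem = stem[: -len(suffix)]
            break
    # "runn" -> "run", "stopp" -> "stop"; leave vowels and l, s, z doubled.
    if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeioulsz":
        stem = stem[:-1]
    return stem
```

Because all three variants map to the same stem, the index stores one entry instead of three, and a query for any variant matches documents containing the others.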

How a system matches like terms is critical to any system. A part of most information

retrieval systems is the thesaurus. The thesaurus comprises indexed terms, their

relationships to each other, and the design of how the relationships are laid out. The

relationship design can range from lists to multidirectional graphs (Baeza-Yates & Ribeiro-Neto,

2011). By saving the relationship between like terms, a system can be searched without having

to worry about other terms. For example, a user may search for a sum function, but some developers may

save their functions using the keywords add or plus; a good thesaurus will include all versions in a

search. Other methods include lexical analysis of the text, which turns streams of characters into

streams of words and usually includes removing of hyphens, apostrophes, and punctuation

marks.
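Lexical analysis of this kind can be sketched with a regular expression. The rules below (lowercase everything, drop apostrophes, treat hyphens and punctuation as separators) are one possible reading of the description and are assumptions; the function name is hypothetical.

```python
import re


def tokenize(text):
    """Turn a stream of characters into a list of lowercase word tokens."""
    # Remove straight and curly apostrophes so "don't" becomes "dont".
    cleaned = text.lower().replace("'", "").replace("\u2019", "")
    # Hyphens and punctuation act as separators between tokens.
    return re.findall(r"[a-z0-9]+", cleaned)
```

Whether a hyphenated word should become one token or two is a design choice; splitting it, as done here, favors recall because each part can match on its own.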

Information Retrieval Software

There are open source IR applications available on the internet that were researched for

this study. Several were considered for this research but, for different reasons, were not

pursued further. They include Lemur, WorldNet, Terrier, idSearch, Zettair, Sphinx

and SMART. Lemur has a self-contained library which would not work with this research, and it

also didn’t allow a software library to be substituted in place of its built-in library. WorldNet

was not user friendly and was difficult to install on Windows 8; the same was true of Zettair.

SMART too was difficult to install on Windows 8 and also implements a vector model. Terrier

worked with the TREC dataset, which contains no software libraries, so it was not applicable

to this research. There was very little documentation available for Zettair and Sphinx, so no further

time was spent exploring those systems. There are many other free information retrieval

software applications; they all work similarly but do differ in size, some can handle large

amounts of data while others are for small data systems (Eckard & Chappelier, 2007). There are

a growing number of systems that work with the TREC system of data, these systems are being

studied in academic fields everywhere. With the increasing amount of data on the web, search

engines need to be able to handle massive amounts of data in a short amount of time (Eckard &

Chappelier, 2007). Lucene is an open source application that is built on the Apache framework

of free software. The Lucene application contains an indexer and a searcher plug-in. The Lucene

search does include several different calculations of similarity, but none allow for a custom

similarity measure to be created; therefore, a researcher-written similarity will be written using

the Lucene indexer to implement the MMM, Paice and p-norm models.

Measures of similarity

There have been a number of measures created and used in the industry to measure the

quality of retrieval in an information retrieval system; the most well-known are recall and

precision, which will be the measures used in this study. Recall is defined by Croft, Metzler and

Strohman (2010, p. 309) as “the proportion of relevant documents that are retrieved” and

precision is “proportion of a retrieved set of documents that are actually relevant”. These

measures are inversely related, as precision goes up, recall goes down and vice versa (Binkley &

Lawrie, 2008). Recall = |A ∩ B| / |A| and precision = |A ∩ B| / |B|, where A is the set of

relevant documents and B is the set of retrieved documents. The degree of precision is not a

number that is easily calculated; one way precision can be calculated is to look at specific cutoff

points in the returned ranked list. For this study, the number of relevant documents found should

match those found in the study by Maarek since the same data corpus is used. The study by

Maarek will be used as a standard to which the data in this study should match. Precision will be

considered only based on the first ten items in the list, or the precision at ten documents retrieved.

Ten is a popular base because it is the number of items returned on a single result page used by

most search engine web pages (Turpin & Scholer, 2008).
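The three measures used here can be computed from a set of relevant documents and a ranked result list. The sketch below is illustrative: the function names are hypothetical, and precision at k divides by k even when fewer than k documents are returned, which is one common convention.

```python
def recall(relevant, retrieved):
    """|A ∩ B| / |A|: share of the relevant documents that were retrieved."""
    relevant, retrieved = set(relevant), set(retrieved)
    return len(relevant & retrieved) / len(relevant)


def precision(relevant, retrieved):
    """|A ∩ B| / |B|: share of the retrieved documents that are relevant."""
    relevant, retrieved = set(relevant), set(retrieved)
    return len(relevant & retrieved) / len(retrieved)


def precision_at_k(relevant, ranked, k=10):
    """Precision over the first k items of the ranked result list."""
    relevant = set(relevant)
    return sum(1 for doc in ranked[:k] if doc in relevant) / k
```

Precision at ten looks only at the first result page, so it rewards systems that place relevant documents early regardless of how long the full result list is.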

The more precise the match is, or the more specific the query gets, the lower the recall will be.

For example, if a search is for sum function there should be plenty of matches returned: low

precision, high recall. If the query is narrowed to integer sum function, the returned

solutions will be more precise, but fewer items will be returned. It has been

shown that, in general, the higher the recall the lower the precision. Precision will be calculated

similar to the Maarek study. By looking at different points of recall, the precision will be

graphed and average precision extrapolated when there is no exact precision calculated. This

research will follow a common procedure to calculate average precision, for each relevant file,

the precision is calculated dependent upon the location of each relevant file in the returned list of

documents (Maarek, 1991, p. 811). This way of calculating precision is common in IR. This

study also uses the mean average precision (or MAP) to calculate the overall effectiveness of the

system and use this value to compare to other systems (Manning, Raghavan & Schutze, 2008).

The MAP is just the sum of the average precision values of the individual queries divided by the total number of

queries.

Recall and precision are the most commonly used measures of information retrieval

algorithms, although Raghavan, Jung and Bollmann (1989) say there are issues with using these

two measures in an IR system. Recall and precision are not suitable measures for multiple

queries that eventually will get averaged; precision values can be off and are based on a user’s

interpretation of relevance (Raghavan, Jung & Bollmann, 1989). Even with the given issues, this

research will utilize the recall and precision along with the MAP measurement for data

evaluation.

As good as recall and precision are in showing the number of relevant documents and the

degree of precision or match an IR system can return, to compare systems to each other this study will

use the mean average precision measure. The mean average precision (or MAP) is a widely used

measure that results in a single numerical figure that represents the effectiveness of a system

(Turpin & Scholer, 2006). With a single measure of quality across recall values multiple

systems can now be compared to each other (Manning, Raghavan & Schutze, 2008). MAP is

defined as

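$$\text{Average Precision} = \frac{\sum_{r=1}^{N} \left( P(r) \times rel(r) \right)}{\text{number of relevant documents}} \qquad (7)$$

$$\text{Mean Average Precision} = \frac{\sum_{q=1}^{Q} AP(q)}{Q} \qquad (8)$$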

Where Q is the total number of queries, N is the number of retrieved documents, r is the rank in

the sequence of retrieved documents, P(r) is the precision at rank r, rel(r) is 1 if the item at r is

relevant and 0 if it is not. Average precision is calculated by “taking the mean of the precision

scores obtained after each relevant document is retrieved, with relevant documents that are not

retrieved receiving a precision score of zero. MAP is then the mean of average precision scores

over a set of queries” (Turpin & Scholer, 2006, p. 12). MAP is a popular metric used in IR

system comparisons and has been shown to be “stable across query set size and variations in

relevance” (Turpin & Scholer, 2006, p. 12). Because of its stability and ability to be used as a

comparison for IR systems that use different search algorithms, MAP will be the calculation used

to compare the IR systems in this study.
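Average precision and MAP as defined above can be computed as in the following sketch. The function names and the input layout (a ranked list plus a set of relevant documents per query) are hypothetical.

```python
def average_precision(ranked, relevant):
    """Sum P(r) * rel(r) over the ranked list, divided by the number of relevant documents.

    Relevant documents that are never retrieved add nothing to the sum,
    which is the same as giving them a precision of zero.
    """
    relevant = set(relevant)
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / len(relevant)


def mean_average_precision(runs):
    """Mean of the average precision over (ranked, relevant) pairs, one per query."""
    return sum(average_precision(ranked, relevant) for ranked, relevant in runs) / len(runs)
```

For a ranked list a, x, b with relevant documents a and b, the precision is 1 at rank 1 and 2/3 at rank 3, giving an average precision of about 0.83.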

Literature Review Summary

Review of the literature revealed many articles closely related to the topic of study. For

example, from the literature it was found that software that is easily found has a higher

probability of reuse and that there are many ways to search for data including Boolean and fuzzy

logic. The literature also showed that software reuse is not utilized to its fullest capacity and

could benefit companies if software were easier to retrieve and more accurately retrievable.

For the purpose of this study, it was important to understand the benefits of software

reuse, the many applications of information retrieval and their successes and failures, and how

fuzzy sets work and how they can benefit an information retrieval algorithm for software

components. The literature showed a wide range of applications of information retrieval of data

and the use of fuzzy sets for information retrieval but none showed an effective fuzzy set of

information retrieval specific to the retrieval of software. Software reuse has proven its

importance in the software development community and with the use of a fuzzy set to retrieve a

wider range of components this research hopes to increase the success of finding software for the

purpose of reuse.

Chapter 3 introduces the methodology used in this research including the algorithm and

test data sets. This chapter explores the quantitative research methodology and the reasons this

methodology was chosen over a qualitative approach. Chapter 4 discusses the study in more

detail and gives a detailed explanation of what the results mean. Chapter 4 will explain how the data

either confirms or refutes the hypothesis from Chapter 1. Chapter 5 will discuss the items that

were not discussed in this study, and answer the research questions that were not able to be

answered in Chapter 4. Other possible research that includes this study’s algorithms is discussed

in Chapter 5.

CHAPTER 3: METHODOLOGY

In Chapter 1 we explained the background of the study, the reason for the study and the

significance to the industry that this study will provide. In Chapter 2 we explored the literature

surrounding software reuse and the many possible ways to effectively retrieve information.

Fuzzy sets were defined and explored more as a way to obtain desired information. In this

chapter we look at the methodology used for employing fuzzy logic for information retrieval of

software, and the statistical evaluation to determine if that selected model was significant. We

also discuss the makeup of the data, how it is gathered, the instrumentation used to gather data

and the limitations of the study. Also discussed are the validity and reliability of the data that is

used for this research.

Lucene

Lucene is a Java based, open source, free software application that is developed on the

Apache framework which is an open-source software group that includes over 150 software

projects. The Lucene application includes an indexer and a searcher. For the purpose of this

application, the Lucene indexer will be used as is to index the data corpus and we developed a

search algorithm to search the data. Lucene will be used in an application we created in the

Eclipse environment using Java. Eclipse is a free integrated development environment (IDE)

that includes a full Java compiler. Lucene has four English language analyzers that can parse

documents into the index. The StandardAnalyzer is a general purpose analyzer, the

WhitespaceAnalyzer parses data separated by white space, the StopAnalyzer parses out stop

words which are common English language words that usually don’t help in indexing and the

SnowballAnalyzer parses out words based on the Porter Stemming algorithm, which parses

words to their roots, for example, the words stopping and stopped will be indexed to stop to

reduce the redundant indexing (Smart, 2006). Frakes, Harman and Candela all observed that in

studies where stemming was done, it produced better results; therefore, in this study the

SnowballAnalyzer was used to stem and index the documents (Frakes & Baeza-Yates, 1992).

Lucene 4.6.1 treats each document added to the index list as a collection of fields, one

field stores the contents of the data, one stores the path to the document, and the third stores the

time/date stamp of the last update to the index. The IndexConfig class creates an index file; if

one already exists, it will get overwritten. The IndexWriter class accepts data from

files as fields, which can be changed depending on the files. For this research the default is used,

which is the contents field, which contains the data from the file. The SnowballAnalyzer is the

analyzer used to parse the data first, stemming words using Porter’s algorithm. Then the

data is written to the index file using the IndexWriter. If there is an error or problem in this

process, a Java exception will be thrown and written to the error log file, the data will be skipped,

and indexing will continue. Once the index is created, the data can be searched. There are

applications that can read the index file and allow searches to be done; there are also search

plug-ins in Lucene to allow a user to create their own search. Luke is one of the third-party

applications that is designed to read a Lucene index. The issue with Luke and the other

applications available was the inability to change the search algorithm to allow for custom

algorithms. For this reason, we have written an adapted search in Java.
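The indexing flow described above (three fields per document, with failures logged and skipped) can be outlined without Lucene. The sketch below is a language-neutral illustration in Python rather than the Java used in the study: the function names, field names and the plain white-space analyzer are all hypothetical stand-ins.

```python
import time


def read_text(path):
    """Read a file's text; stands in for whatever loader the indexer uses."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()


def index_files(paths, read=read_text, analyze=str.split):
    """Build one record per file with contents, path and timestamp fields.

    Any error while reading or analyzing a file is written to the error log
    and that file is skipped, so indexing continues with the next one.
    """
    documents, error_log = [], []
    for path in paths:
        try:
            terms = analyze(read(path))
        except Exception as exc:  # deliberately broad: skip and keep going
            error_log.append(f"{path}: {exc!r}")
            continue
        documents.append({"contents": terms, "path": path, "modified": time.time()})
    return documents, error_log
```

Passing the reader and analyzer as parameters keeps the sketch testable and mirrors the way the analyzer can be swapped in the real indexer.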

To search the indexed file, a QueryParser class must be instantiated; this will parse the

search terms entered by the user, into the correct format required by Lucene. For example, if the

query is sum AND add, Lucene requires the search to be in the form +sum +add, where the ‘+’

indicates a required term. Since Lucene includes a parser, there was no need to change the way

the query was inputted, and we just allowed the software to do the alterations needed. The field

to be searched must also be entered when the QueryParser is created; for this research we will

access the contents field. For scoring, Lucene uses the Similarity class which contains four

options for similarity measure calculations. The DefaultSimilarity implements the tf/idf ranking,

BM25Similarity implements the BM25 ranking algorithm, MultiSimilarity implements the

CombSUM algorithm and a PerFieldSimilarityWrapper allows a different ranking method per

field to be specified (Carpenter, Morris & Baldwin, 2011). Because Lucene does not use a strict

tf/idf but rather a modified tf/idf implemented with boosting factors, like how often a term is

found in the entire corpus, the total number of words in a corpus and the total number of words

in a document, and because there was no way to change those values or implement a custom

similarity into Lucene, we have written a search algorithm to run the calculations needed. This

research will use Boolean, Mixed, Min and Max (MMM), Paice and P-norm model (calculations

in Chapter 2, equations 4-6) (Carpenter, Morris & Baldwin, 2011). The returned data will be

evaluated by experts and used to calculate recall, precision and MAP (calculations defined in

Chapter 2, equations 7-8). The higher the MAP, the more precise a system is.
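The query rewriting mentioned above (sum AND add becoming +sum +add) can be sketched as follows. The helper `to_required_form` is hypothetical and only mirrors the textual transformation for two-term queries; it is not Lucene's QueryParser. Leaving OR terms unprefixed reflects Lucene's classic query syntax, in which unprefixed terms are optional.

```python
def to_required_form(query):
    """Rewrite 'a AND b' as '+a +b'; for 'a OR b' return the two terms unprefixed."""
    parts = query.split()
    if len(parts) != 3 or parts[1] not in ("AND", "OR"):
        raise ValueError("expected 'termA AND termB' or 'termA OR termB'")
    term_a, operator, term_b = parts
    if operator == "AND":
        return f"+{term_a} +{term_b}"  # both terms are required
    return f"{term_a} {term_b}"        # either term may match


required = to_required_form("sum AND add")
optional = to_required_form("move OR delete")
```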

Procedure



The UNIX help libraries, downloaded as .txt files from the FreeBSD UNIX website, are used as the corpus of files and indexed with the Lucene indexing application. The data files are not only indexed but also stemmed to reduce redundant words in the index. A developer-created search application with algorithms to calculate the MMM (Mixed Min and Max), P-norm, Boolean and Paice values was executed with researcher-developed queries built from common UNIX help terms, for example print, file, move, etc. The returned lists of matched files were then presented in descending order of the similarity scores calculated by the implemented algorithms. Using a quantitative correlation experimental model, which infers the truth of theories by comparing quantitative data collected from experiments, each method’s returned list of matches will be compared to the others to determine whether correlation is present and whether the results give clear evidence for or against the hypothesis. Based on similar studies listed in the literature review, the Boolean method should yield the least precise result list, that is, the smallest MAP value.

Data

Using a quantitative methodology approach, we index all the data files in the corpus and then run searches for software components in the UNIX library. The UNIX help files were downloaded from the FreeBSD UNIX website on to the researcher’s computer. FreeBSD is the free version of the Berkeley Software Distribution, which is a popular free version of UNIX. The data were downloaded from http://www.freebsd.org/cgi/man.cgi/help.html. The UNIX library is divided into 8 categories: category 1 is for user commands, 2 for system calls, 3 for library functions, etc. Since we are trying to design an effective search for software, only the category 1 files are used in this study, which results in 681 files. The files in this data store were indexed and searched. The data files are simple text files whose sizes vary with the functionality of the command. Figure 2 shows the directory of files and Figure 3 shows the contents of the alarm.1 file. Figure 2 makes clear that the files all relate to specific commands found in the manual pages. The files contain the details of the command and any useful information that a user may need, such as last update, license, terms of use, etc.

Figure 2: Screenshot of the files in the directory used for the data corpus



Figure 3: Screenshot of the alarm.1 man file from the FreeBSD data corpus

Using the Lucene 4.6.1 IndexWriter and SnowballAnalyzer, the index was created from the corpus of files; it contains the individual terms, the path where each file is located, and the total number of files. These statistics were calculated at indexing time and were then available to the custom search. A user-created application using the Lucene 4.6.0 search plug-in was run on the index to search for the user-entered query.

We have developed searches that were run on the Lucene-created index after the queries were parsed into separate query terms, and that returned the matching documents containing the queried terms. If the query includes a Boolean “AND”, only the documents that include both queried terms were considered; if the query includes an “OR”, all files that contain either term or both terms were considered.



An array was created that stored the document IDs of the documents that fit the query. If the query was an “AND”, only the files that contain both terms were added to the result array; if the query was an “OR”, all files that contain at least one of the terms were added to the array. This became the final list of documents that matched the query. To calculate the similarity score, the tf/idf was calculated first. The tf/idfA is the term frequency inverse document frequency for search term A, and tf/idfB is the term frequency inverse document frequency of search term B. To calculate each of these, first add one to the log of the frequency of the term in the document, then multiply that number by the log of the total number of files in the corpus divided by the number of documents that contain the term. For example, if there are 681 files in the corpus, 30 of them contain TermA, and TermA is found 4 times in document 1, the tf/idf of TermA for document 1 would be (1 + log(4)) * log(681 / 30). Although there are many other variations of the tf/idf calculation, Baeza-Yates and Ribeiro-Neto say this calculation is the most frequently used and the most effective.
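The tf/idf weighting just described can be written as a short function. This is a minimal sketch: the text does not state the base of the logarithm, so base 2 is used here as an assumption (the `log` parameter lets it be swapped), and returning 0 for an absent term is likewise an illustrative choice.

```python
import math


def tf_idf(term_freq, total_docs, docs_with_term, log=math.log2):
    """Compute (1 + log(tf)) * log(N / n); the log base is an assumption."""
    if term_freq == 0 or docs_with_term == 0:
        return 0.0  # the term does not occur, so it carries no weight
    return (1 + log(term_freq)) * log(total_docs / docs_with_term)


# Worked example from the text: 681 files, 30 contain TermA, 4 occurrences in document 1.
weight = tf_idf(term_freq=4, total_docs=681, docs_with_term=30)
```

Passing `log=math.log` or `log=math.log10` gives the same weighting under a different base; the ranking of documents for a single term is unchanged, only the scale differs.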

Then those scores were used as term weights in the three extended Boolean algorithms. The pseudo code to calculate the Mixed Min and Max (MMM) similarity is below:

If the search is an AND then (1)

MMM = .4 * Min(tf/idfA, tf/idfB) + .6 * Max(tf/idfA, tf/idfB)

Else if the search is an OR then

MMM = .7 * Min(tf/idfA, tf/idfB) + .3 * Max(tf/idfA, tf/idfB)

End if
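The MMM pseudo code above translates directly into a small function. This is a sketch only: the coefficients are the ones given in the pseudo code, while the function name and the sample weights are assumptions for illustration.

```python
def mmm_score(weight_a, weight_b, operator):
    """Mixed Min and Max similarity using the coefficients from the pseudo code."""
    low, high = min(weight_a, weight_b), max(weight_a, weight_b)
    if operator == "AND":
        return 0.4 * low + 0.6 * high
    if operator == "OR":
        return 0.7 * low + 0.3 * high
    raise ValueError("operator must be 'AND' or 'OR'")


# Hypothetical term weights for one document.
and_score = mmm_score(2.0, 4.0, "AND")
or_score = mmm_score(2.0, 4.0, "OR")
```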



This number was then stored in the MMM array. Using 2 for p and 1 for the query term weights, the pseudo code for the P-Norm calculation is as follows (the full code for similarity calculations is available in Appendix B):

If the search is an AND then (2)

PNorm = $1 - \sqrt{\dfrac{(1)^2 (1 - tfidfA)^2 + (1)^2 (1 - tfidfB)^2}{(1)^2 + (1)^2}}$

Else it must be an OR search then

PNorm = $\sqrt{\dfrac{(1)^2 (tfidfA)^2 + (1)^2 (tfidfB)^2}{(1)^2 + (1)^2}}$

End if
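The P-Norm calculation with p = 2 and unit query weights can be sketched as below. The function follows the usual p-norm extended Boolean form and assumes, for illustration, that the term weights have already been scaled to the range 0 to 1; the function name and sample weights are not from the source.

```python
def p_norm_score(weights, operator, p=2.0, query_weights=None):
    """P-norm similarity for a list of document term weights (assumed to lie in [0, 1])."""
    if query_weights is None:
        query_weights = [1.0] * len(weights)  # unit query weights, as in the text
    denominator = sum(a ** p for a in query_weights)
    if operator == "AND":
        numerator = sum((a ** p) * ((1 - d) ** p) for a, d in zip(query_weights, weights))
        return 1 - (numerator / denominator) ** (1 / p)
    if operator == "OR":
        numerator = sum((a ** p) * (d ** p) for a, d in zip(query_weights, weights))
        return (numerator / denominator) ** (1 / p)
    raise ValueError("operator must be 'AND' or 'OR'")


# Hypothetical normalized weights for terms A and B in one document.
and_score = p_norm_score([0.6, 0.8], "AND")
or_score = p_norm_score([0.6, 0.8], "OR")
```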

The Paice was calculated using the recommended values of 1 for an AND query and .7 for an

OR query (Frakes, Baeza-Yates, 2012).

If the search is an AND then

Paice = $\dfrac{1^0 \cdot MIN(tfidfA, tfidfB) + 1^1 \cdot MAX(tfidfA, tfidfB)}{1^0 + 1^1}$ (3)

Else the search must be an OR then

Paice = $\dfrac{.7^0 \cdot MAX(tfidfA, tfidfB) + .7^1 \cdot MIN(tfidfA, tfidfB)}{.7^0 + .7^1}$

End if
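The Paice calculation can be sketched in the same way, using the values 1 (AND) and .7 (OR) given above. The ordering of the weights (ascending for AND, descending for OR) follows the Min/Max placement in the pseudo code; the function name and sample weights are assumptions.

```python
def paice_score(weights, operator):
    """Paice similarity: weights are ordered, then combined with powers of r."""
    if operator == "AND":
        r, ordered = 1.0, sorted(weights)                # smallest weight first
    elif operator == "OR":
        r, ordered = 0.7, sorted(weights, reverse=True)  # largest weight first
    else:
        raise ValueError("operator must be 'AND' or 'OR'")
    coefficients = [r ** i for i in range(len(ordered))]  # r^0, r^1, ...
    weighted_sum = sum(c * w for c, w in zip(coefficients, ordered))
    return weighted_sum / sum(coefficients)


# Hypothetical term weights for one document.
and_score = paice_score([2.0, 4.0], "AND")
or_score = paice_score([2.0, 4.0], "OR")
```

With r = 1 the AND score reduces to the plain average of the two weights, while r = .7 lets the larger weight dominate an OR query.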

There were many different situations that first needed to be taken into consideration in order to find all matched data. For AND queries the conditions were: 1) one or more files contain both queried terms, in which case similarity is calculated as normal; 2) files contain only one of the queried terms, in which case the similarity calculation is aborted. For OR queries the conditions were: 1) one or more files contain both queried terms and other files contain only one queried term, in which case similarity calculations run as normal; 2) one or more files contain the first term and one or more other files contain the second term, in which case similarity calculations use 0 for the term not found in a file; and 3) one or more files contain one searched term and the other search term is not found in any file, in which case similarity is calculated normally using 0 for the missing term.

Other conditions also needed to be accounted for. If no documents match term A but the list of documents matching term B is not empty, the check for an empty result set cannot stop as soon as one list is empty; both lists need to be checked, and the result is empty only if both are empty. In this case, the term A similarity needs to be 0 while calculating the similarity for term B, and vice versa.
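The matching conditions above reduce to a set intersection for AND and a set union for OR, with 0 substituted for a term that a file does not contain. The sketch below assumes each term's matches are held in a dictionary from document ID to tf/idf weight; that data structure, the function name and the sample values are illustrative assumptions.

```python
def candidate_weights(postings_a, postings_b, operator):
    """Return {doc_id: (weight_a, weight_b)} for the documents that satisfy the query."""
    if operator == "AND":
        doc_ids = postings_a.keys() & postings_b.keys()  # both terms must be present
    elif operator == "OR":
        doc_ids = postings_a.keys() | postings_b.keys()  # either term is enough
    else:
        raise ValueError("operator must be 'AND' or 'OR'")
    # A missing term contributes a weight of 0, so an empty list for one term
    # does not empty the OR result as long as the other term has matches.
    return {d: (postings_a.get(d, 0.0), postings_b.get(d, 0.0)) for d in doc_ids}


# Hypothetical postings: term A matches nothing, term B matches two files.
term_a, term_b = {}, {613: 2.0, 406: 1.5}
or_matches = candidate_weights(term_a, term_b, "OR")
and_matches = candidate_weights(term_a, term_b, "AND")
```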

After the MMM, P-Norm and Paice scores are calculated, they are stored in an array, sorted in descending order and displayed on the screen. The data returned from the experts was used to calculate the recall, precision and MAP.

Figure 4: user interface for researcher created software



Figure 4 shows the researcher-developed output of the search for a query (addend AND sum) and the results as calculated by the MMM, P-Norm, Paice and Boolean implementations. The returned files are listed in ranked order by score, and the bottom of the screen shows the total number of files that match each term. Since the query was an AND query, only the files that match both queried terms are returned. Term 2, sum, matched 13 files while addend matched only 1 file. Because a Boolean search scores 1 for terms being present and 0 for terms not being present, the Boolean score of 1 indicates that both search terms are found in file number 613. The other searches return a score based on the similarity equations described in Chapter 2 (equations 2-7, pp. 38 - 39).

Sample Size

Using the ratio of queries to documents that have been used to evaluate IR systems in the literature, “MED (collection of medical abstracts, 30 queries for 1033 documents) or CISI (information science abstracts, 35 queries for 1460 information abstracts”, the number of queries will be between 2-3% of the total number of files to be tested (Maarek, Berry & Kaiser, 1991, p.811). With a corpus of 681 files, that results in 20 queries (http://www.freebsd.org/cgi/man.cgi/help.html). Each query consists of at least two words joined by the Boolean operator AND or OR. The queries do not contain any keywords, to mimic a user searching for the command that performs a certain action. For example: copy AND file, move OR delete, etc.
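The arithmetic behind the sample size can be checked in a few lines. The figures are the ones quoted above; the variable names are illustrative only.

```python
corpus_size = 681

# Query-to-document ratios of the two collections quoted from the literature.
ratios = {"MED": 30 / 1033, "CISI": 35 / 1460}

# The 2-3% range applied to this study's corpus, and the ratio actually used.
low, high = 0.02 * corpus_size, 0.03 * corpus_size
chosen_ratio = 20 / corpus_size
```

Twenty queries fall at the upper end of the 2-3% range for a corpus of 681 files.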

Instrumentation

The data corpus is stored and the queries have been run. A program using the Lucene 4.6.1 plug-in application has been created in Java using the Eclipse development environment. An index has been created from the FreeBSD UNIX category 1 data files, and queries were run on that index to gather the results. After the queries were run and the MMM, Paice, P-norm and Boolean similarities calculated, the results were evaluated by a panel of UNIX experts, who decided whether the returned files were relevant and, if so, in which order. The two UNIX experts who participated in this study were one staff member from Colorado Technical University and one system administrator for the U.S. Missile Defense System at Peterson Air Force Base.

A software application has been created that first creates the index using the Lucene IndexWriter. The user is then prompted for a search query; once the query is entered, the program parses it and searches for each query term using the IndexReader, which is part of the IndexConfig class, to read through the index. Once matching documents are found, their document IDs are stored in an array for later use. The program then loops through the list of matching documents, calculates the similarity scores and stores them in another array. The arrays are then sorted from largest to smallest and printed.

Recall is calculated as the number of relevant documents returned over the total number of relevant documents in the corpus, and precision is calculated as the number of relevant documents returned over the total number of documents returned. A report was generated in Excel to show the returned list of documents per query and sent to the panel of experts. Once the experts returned their lists of relevant files, the mean average precision could be calculated. An in-house data set differs from an online data set and results in different values. As the literature has shown, the Boolean search should yield the worst results.
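Recall and precision for a single query can be computed from two lists, the files a search returned and the files the experts judged relevant. The sketch below uses the standard definitions (relevant returned over all relevant, and relevant returned over all returned); the function name and the sample file lists are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were returned
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical result list and expert judgment for one query.
returned_files = ["whatis", "printf.1", "stat.1", "sort.1"]
expert_files = ["printf.1", "sort.1", "awk.1"]
precision, recall = precision_recall(returned_files, expert_files)
```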

Validity and Reliability



Data retrieved from each search is compared using precision, recall and the calculated MAP value. To determine precision and the relevance of the files found in the search to the queried terms, a panel of 3 UNIX experts was established. Of the 3, only 2 were able to participate and return data. The panel decided whether the data returned was relevant and, if so, how relevant, by ranking the files from most relevant to least. If relevant files were not returned by the searches, the experts were to write them in as well. The files that were not returned but deemed relevant were then used to calculate recall. Precision was calculated by how well each search’s returned list of files matched the experts’ lists. Once the recall, precision and MAP values were calculated, the MAP value was used to compare the different search methods to each other; the higher the MAP, the more precise a system is.

Conclusion

This chapter looked at the methodology and procedure used to collect data for analysis by running multiple queries on a software data bank using weighted retrieval calculations to match terms. The data collected from these queries will be analyzed and discussed further in Chapter four. The methodology defined in this study is based on the assumptions that software in the UNIX help library has similar qualities to other software, and that response time is not an issue when searching for software. The MAP values are compared, and the search with the best MAP value is concluded to be the most precise search of software.



CHAPTER 4: RESULTS

In this chapter, the data results will be discussed, the search methods will be reviewed and the results will be analyzed as to how they answer the research questions. First, the data collection will be reviewed along with any issues encountered while collecting the data. Second, a thorough review of the results will be presented. Third, each of the IR methods used will be discussed along with its individual results. Last, the research questions will be examined with answers from the results.

Data Collection Reviewed

The data corpus for this study was initially intended to be a software library easily available for search. The literature had shown that the UNIX online manual pages, accessed in UNIX via the man command, have worked successfully as a corpus to simulate a software library (Maarek, Berry & Kaiser, 1991). The FreeBSD version of the UNIX manual pages contains not only user commands but also system calls, etc., so the researcher and the team of UNIX experts decided to use only the category 1 files. Category 1 contains the commands users would enter if using the online help system in a UNIX environment, with fields for each command describing what the command does and what other commands or parameters are needed to use the command correctly. Figure 5 shows the directory of category 1, where the command that each file refers to can easily be identified from the file name.



Figure 5: Directory of category 1 UNIX manual pages

Once the other categories were removed, category 1 was left with 681 files. The other categories include system calls, function calls, methods and their parameters, etc. that are used internally by UNIX; since those files are not accessible to the user, they were removed from the corpus to save indexing time and space.

Using the Lucene SnowballAnalyzer with the Lucene indexer, the data files were all read and parsed into an index. The SnowballAnalyzer was written by Martin Porter and is a stemming algorithm that reduces words to their lowest stem in order to reduce the size of the index and find more relatable matches (McCandless, Hatcher, Gospodnetic, 2010). For example, the terms stopping, stopped and stops all stem to the term stop. The Lucene index was created with three fields: filename, path, and contents. The contents field contains the data in the files; this was the field used to search for matching terms.
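The effect of stemming on the example words can be shown with a deliberately crude suffix stripper. This is not Porter's algorithm or the Snowball implementation; the `crude_stem` function and its two rules (strip one common suffix, then collapse a doubled final consonant) are assumptions chosen only to reproduce the stopping/stopped/stops example.

```python
def crude_stem(word):
    """Toy stemmer: strip 'ing', 'ed' or 's', then undo a doubled final consonant."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break  # remove at most one suffix
    # 'stopping' and 'stopped' leave 'stopp'; collapse the doubled consonant.
    if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in "aeiouls":
        word = word[:-1]
    return word


stems = {w: crude_stem(w) for w in ("stopping", "stopped", "stops", "stop")}
```

All four words map to the same index entry, which is how stemming shrinks the index and lets a query for one form match the others.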



Presentation and Discussion of Findings

The goal of this dissertation is to test if implementing fuzzy logic into an information

retrieval system could result in a better search outcome. In these models, “a document has a

weight associated with each index term. This document weight is a measure of the degree to

which the document is characterized by that term” (Frakes & Baeza-Yates, 1992, p. 395). Zadeh

has defined a fuzzy set as any set whose elements have a degree of membership, and because the

three models in this study are measured by the degree to which a term belongs to a document, we

can therefore say the Paice, P-Norm and MMM are fuzzy models (Zadeh, 1994).

Typical IR effectiveness is based on search precision and query recall. Precision measures how accurately the returned documents match what the user requests, and recall measures how many of the relevant files in the corpus are returned. Because recall and precision vary per query and per search, the mean average precision is the measure most IR studies use to compare searches to each other. The rest of this section discusses the quantitative measures, including recall, precision and the mean average precision, for four different search algorithms and tests the research hypothesis.

Limitations of the Study

The study was conducted on the UNIX man library pages. Because not all of the man pages were relevant, only the category 1 files were considered. Future research would expand the corpus to determine whether the same quality of results is achieved. The size of the data corpus will also affect the time of search; future research that compares search times should also look at corpora of different sizes to see how the time is affected.

The data was also verified by a panel of only two experts; ideally there should be more experts to verify the reliability of the data. A panel of users could also be included to determine whether the data returned matched a user’s needs specifically. Verifying that the search returns information that is relevant to users would be an area for future study. Finding an algorithm that can automatically determine what is considered relevant would also be a great topic for future research. For software specifically, the panel should consist of software engineers who use software components on a daily basis. This way the reusability of software could be measured more accurately.

The study also relied on the Lucene 4.6.0 indexer to index the data and read the index.

For future research a new index could be created that will be set up for software specifically.

Measure of Similarity

To calculate the MAP, the files returned by each search were compared to the files returned by the experts. The equations used to calculate MAP are

$$\text{Average Precision} = \frac{\sum_{r=1}^{N} P(r) \cdot rel(r)}{\text{number of relevant documents}} \qquad (7)$$

$$\text{Mean Average Precision} = \frac{\sum_{q=1}^{Q} AP(q)}{Q} \qquad (8)$$

where Q is the total number of queries, N is the number of retrieved documents, r is the rank in the sequence of retrieved documents, P(r) is the precision at rank r, rel(r) is 1 if the item at r is relevant and 0 if it is not, and AP is average precision.
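The average precision and MAP calculations can be sketched as follows. The ranked list is the MMM column for field AND float from Table 3; treating the experts' entries printf and sort as the files printf.1 and sort.1 is an assumption made for the example, as are the function names.

```python
def average_precision(ranked_docs, relevant):
    """Sum the precision at each relevant rank, divided by the number of relevant docs."""
    relevant = set(relevant)
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:      # rel(r) = 1
            hits += 1
            total += hits / rank  # P(r): relevant found so far over current rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(ap_values):
    """Average the per-query AP values over all queries."""
    return sum(ap_values) / len(ap_values)


mmm_ranking = ["whatis", "printf.1", "stat.1", "sort.1", "grn.1", "seq.1", "tcsh.1"]
ap = average_precision(mmm_ranking, {"printf.1", "sort.1"})
```

The two relevant files sit at ranks 2 and 4, giving precisions of 1/2 and 2/4 and an average precision of 0.5.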



The average precision is calculated from the positions of the relevant files in each search’s result list. For example, the search for field AND float returned the data in Table 3. To calculate the precision for the MMM in this search, the number of relevant files is 2, because 2 files were returned by the experts as relevant. The files used as relevant were the files that matched between experts; other returned files were disregarded. To find the precision at each relevant file, divide the number of relevant files found thus far by the position of that file in the search’s returned list. The first relevant file is located in position 2, so p1 = 1/2 = 0.5; the next relevant file is found in position 4, so p2 = 2/4 = 0.5. The AP is then the sum of these precisions divided by the total number of relevant files, AP = (0.5 + 0.5) / 2 = 0.5. Since the relevant files were found in the same positions for all searches, they all get an AP score of 0.5 for this query. Another example is the query for sum AND add. There is only one relevant file, nawk.1, which is found in position five in the MMM search, so the AP is 1/5 = 0.2; for the Paice model, the file is found in the fourth position, so 1/4 = 0.25; and for the Boolean model, the file is in the second position, so 1/2 = 0.5. In this example, the Boolean search, with an AP of 0.5, resulted in a better match than MMM and P-Norm at 0.2 and Paice at 0.25.

TermA  Bool Operator  TermB  Rank  MMM doc   Paice doc  P-norm doc  Boolean doc  Expert 1 / Expert 2 relevant list
field  AND            float  1     whatis    Whatis     whatis      grn.1        printf / printf
                             2     printf.1  printf.1   printf.1    printf.1     Sort / sort
                             3     stat.1    stat.1     stat.1      seq.1        Seq / awk
                             4     sort.1    sort.1     sort.1      sort.1       perl
                             5     grn.1     grn.1      grn.1       stat.1
                             6     seq.1     seq.1      seq.1       tcsh.1
                             7     tcsh.1    tcsh.1     tcsh.1      whatis
sum    AND            add    1     whatis    whatis     whatis      id.1         nawk / nawk
                             2     id.1      unxz.1     id.1        nawk.1       expr / perl
                             3     unxz.1    id.1       unxz.1      ps.1
                             4     ps.1      nawk.1     ps.1        unxz.1
                             5     nawk.1    ps.1       nawk.1      Whatis
find   AND            file   1     find.1    find.1     find.1      afmtodit.1   find / find
                             2     whatis    whatis     whatis      apropos.1    locate / locate
                             3     lex++.1   lex++.1    lex++.1     as.1         less / less
                             4     less.1    less.1     less.1      bsdgrep.1    vi
                             5     tcsh.1    tcsh.1     tcsh.1      bsdtar.1
                             6     cpio.1    cpio.1     cpio.1      bzip2.1
                             7     unxz.1    unxz.1     unxz.1      chflags.1
                             8     vi.1      vi.1       vi.1        ci.1
                             9     id.1      id.1       id.1        clang++.1
                             10    locate.1  locate.1   locate.1    cpio.1
                             16    test.1    test.1     test.1      find
                             25    xargs.1   xargs.1    xargs.1     less
                             29    bzip.1    bzip.1     bzip.1      locate.1

Table 3: Returned data from searches

The last example is the query for find AND file. For the MMM, Paice and P-Norm, the three relevant files are located in the same positions: the first file is found at the first position, 1/1 = 1; the second is found in the fourth position, 2/4 = 0.5; and the third is found in the tenth position, 3/10 = 0.3. To find the AP we add 1 + 0.5 + 0.3 = 1.8 and then divide by the number of relevant files, in this case 3, so 1.8/3 = 0.6. To find the AP for the Boolean search, we had to include more than the top ten returned files; the three relevant files were found at positions 16, 25 and 29 respectively. For this calculation, the first file gives 1/16 = 0.0625, the second 2/25 = 0.08 and the last 3/29 = 0.103. To find the AP, add 0.0625 + 0.08 + 0.103 = 0.2455 and then divide by the number of relevant files, 0.2455/3 = 0.0818. In this example it is clear that an AP of 0.6, which was the AP of the MMM, P-Norm and Paice models, is much higher than the Boolean score of 0.08.

TermA  Bool Operator  TermB  MMM doc  MMM score  Paice doc  Paice score  P-norm doc  P-norm score  Bool top doc  Score
field  AND            float  613      12.54      613        15.56        613         10.47         222           1
                             406      9.17       406        12.62        406         7.01          406           1
                             492      8.37       492        12.5         492         6.36          460           1
                             482      7.59       482        8.96         482         5.7           482           1
                             222      6.46       222        7.87         222         4.5           492           1
                             460      4.89       460        5.54         460         3.11          511           1
                             511      3.25       511        4.17         511         1.25          613           1
copy   AND            file   395      3.6        395        3            395         2.61          19            1
                             613      3.57       613        2.98         613         2.57          20            1
                             109      3.4        109        2.84         109         2.38          21            1
                             379      3.36       379        2.8          379         2.33          27            1
                             295      3.03       295        2.52         295         1.95          50            1
                             331      2.97       331        2.47         331         1.88          68            1
                             111      2.9        111        2.42         111         1.8           69            1
                             255      2.9        255        2.42         255         1.8           86            1
                             459      2.82       459        2.36         459         1.71          89            1
                             569      2.82       569        2.36         569         1.71          97            1

Table 4: Scores with file numbers returned

Because the Boolean search returned 1 for found and 0 for not found, every Boolean match has the same score, and the Boolean search listed its files in ascending order by file number. The other searches listed their results in descending order by score, then by file number. This is illustrated in Table 4 above: when the scores match, as for files 111 and 255 or files 459 and 569 in the copy AND file query, the files are then sorted by file number in ascending order. Because of this, two files with the same similarity will be returned in alphabetical order rather than in order of relevance to the user, which supports using the MAP as the measure for an IR study, as it looks not just at the ranked order of returned documents.
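The ordering visible in Table 4 (highest score first, tied scores listed by increasing file number) can be reproduced with a two-part sort key. The pairs below are taken from the copy AND file rows of the MMM columns; the variable names are illustrative.

```python
# (file number, MMM score) pairs from Table 4, deliberately shuffled.
scored = [(111, 2.9), (331, 2.97), (255, 2.9), (295, 3.03)]

# Sort by score from high to low, then by file number from low to high for ties.
ranked = sorted(scored, key=lambda pair: (-pair[1], pair[0]))
```

Files 111 and 255 share a score of 2.9, so their relative order is decided purely by file number, not by relevance.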

Quantitative Methodology Measurement

Table 5 shows the average precision per query, and the bottom rows contain the sum of the average precisions and that sum divided by the number of queries, which gives the mean average precision. As can be seen, the Boolean search returned a better average precision for only 3 of the 20 queries. The other searches returned similar results, and the Mixed Min and Max (MMM) and P-Norm searches returned exactly the same results. Based on these results, the Paice model was the best performing search, with a mean average precision of .5575.

TermA        Bool Operator  TermB      mmm    Paice   pnorm  boolean
Field        AND            float      0.5    0.5     0.5    0.5
Copy         AND            file       0.33   0.33    0.33   0.09
run          OR             execute    1      1       1      0.007
column       AND            height     1      1       1      1
constructor  AND            length     0      0       0      0
addend       AND            sum        0      0       0      0
math         AND            library    0.59   0.84    0.59   1
space        AND            gap        1      1       1      0.33
mergesort    OR             heapsort   1      1       1      1
cut          OR             delete     1      1       1      0.33
oldest       AND            command    1      1       1      0.5
sort         AND            mergesort  1      1       1      1
deploy       AND            run        0      0       0      0
find         AND            file       0.6    0.6     0.6    0.08
list         AND            files      0.2    0.25    0.2    0.5
sum          AND            add        0.1    0.13    0.1    0.25
subtract     AND            math       0.5    0.5     0.5    0.5
add          AND            subtract   0      0       0      0
printer      AND            network    0      0       0      0
mergesort    AND            print      1      1       1      1
Sum of the queries                     10.82  11.15   10.82  8.087
Sum/20                                 0.541  0.5575  0.541  0.40435

Table 5: Scores of average precision per query and resulting MAP scores

Because the Mixed Min and Max (MMM), Paice and P-Norm models use term weights, they are considered fuzzy, and we can say that the fuzzy models returned a better result than the Boolean search. Based on the MAP results, the searches ranked by best score are the fuzzy model Paice, followed by a tie between the fuzzy models MMM and P-Norm, and finally the Boolean search. Table 6 shows the search models with their resulting MAP scores and how they ranked.

Search Method  Paice   MMM    P-Norm  Boolean
MAP            0.5575  0.541  0.541   0.40435
Ranking        1       2      2       3

Table 6: Ranking of results of search performance

Hypothesis Testing


The original hypothesis stated that, using the term weights created in Lucene as the tf/idf, a researcher-developed search implementing the MMM, P-Norm and Paice algorithms will result in a more accurate list of files that meet the user’s needs than a Boolean search. Looking at the degree of membership will yield a higher rate of successful matches than the Boolean method, and a higher success rate in returned software components should result in a higher reuse rate. Table 5 shows the average precision for each query for each search algorithm. Table 7 shows the resulting MAP scores and the percentage increase of each search over the Boolean search. Based on these results, there is a 27.6% increase in the Paice model over the Boolean model and a 25.25% increase in the MMM and P-Norm models over the Boolean model. These results support the hypothesis that the fuzzy models will deliver a better search result than a Boolean model.

                                MMM     Paice   P-Norm  Boolean
MAP                             0.541   0.5575  0.541   0.40435
% Increase over Boolean search  25.25%  27.6%   25.25%  0

Table 7: Resulting MAP scores for all searches and their % increase over the Boolean search
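The percentages can be checked from the MAP values themselves. In this sketch the gap between each fuzzy model and the Boolean search is expressed two ways: as a share of the fuzzy model's MAP, which is assumed here (inferred from the numbers, not stated in the text) to be how the reported figures were derived, and as a share of the Boolean MAP for comparison. The variable names are illustrative.

```python
map_scores = {"Paice": 0.5575, "MMM": 0.541, "P-Norm": 0.541, "Boolean": 0.40435}
boolean_map = map_scores["Boolean"]

gap_as_share_of_fuzzy = {}
gap_as_share_of_boolean = {}
for model in ("Paice", "MMM", "P-Norm"):
    gap = map_scores[model] - boolean_map
    gap_as_share_of_fuzzy[model] = 100 * gap / map_scores[model]
    gap_as_share_of_boolean[model] = 100 * gap / boolean_map
```

Under the first reading the values come out at roughly 27.5% for Paice and 25.3% for MMM and P-Norm; measured against the Boolean MAP as the baseline, the same gaps are roughly 37.9% and 33.8%.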

Conclusion

Based on the experimental quantitative data shown in this study, the fuzzy logic methods do return a better match for a user’s query needs. The difference is clear and shows that a fuzzy method has a better chance of matching a user’s needs by returning a better ranked list of matched documents. Chapter 4 presented the results in detail and explained their meaning; Chapter 5 will detail where this study left off and what is to be done next.



CHAPTER 5: DISCUSSION OF RESULTS AND FUTURE WORKS

Any information retrieval system has the same main components: an indexer, a data corpus and a search algorithm. The literature reviewed in Chapter 2 has shown that the way the data is indexed can affect the way the data is searched, but the most influential part of any IR system is the search algorithm. Chapter 3 described our methodology in detail, and Chapter 4 compared the results to a Boolean search and showed that the fuzzy logic methods had a marked increase in mean average precision over a Boolean search. This chapter will focus on the two research questions that were not addressed in this study and also on other areas of future research that came up during this study.

Research Question 1

Research Question 1 asks whether software can be retrieved using a fuzzy logic algorithm that returns a more accurate match to the user’s query. This question was examined by looking at the resulting MAP scores from the different searches. The higher the MAP, the more accurate the search. Based on the results, Paice was the best overall search at returning an accurate match to the user’s needs, with a MAP of .5575. Table 7 shows that next were the MMM and P-Norm, with a MAP of .541, and last was the Boolean model, with a MAP of .40435. This resulted in a 27.6% increase from Paice over Boolean and a 25.25% increase in the MMM and P-Norm models over the Boolean search.



[Chart: Average Precision by model. Horizontal axis: queries 1 to 20; vertical axis: average precision from 0 to 1.2; series: mmm, paice, pnorm, boolean]

Figure 6: Graph of average precision results from all 20 queries

Although not every query showed an improvement for the fuzzy logic methods, the overall average was the better measure for this study, and the fuzzy methods’ MAP was higher than the Boolean’s MAP. Full results from the experts and the searches are listed in Appendix A. The graph in Figure 6 shows the average precision results for all 20 queries and how they mapped out per search method. The MMM and P-Norm models returned the same scores for every query, so their lines are one and the same.

Research Question 2

Research Question 2 asks whether searches for software require different parameters than searches for standard information. This question was not examined in this study but has the possibility for future research. Searching for software and searching for regular text should differ only in the number of similar words that may replace search terms. Searching for software should lower the number of possible synonyms for a search term because there are only so many programming terms that match a user’s intent. It avoids ambiguous words such as bat or can, and limits the ambiguity to how one describes a specific programming command. For example, the Java command System.out.println(), which prints output to the console followed by a new line, could be searched for by output, print, or even new line. The benefit of searching for software is that the general idea or meaning behind all searched terms should be clear, whereas a search for bat, which has multiple meanings, requires the search algorithm to decide not only which documents match but also which meaning of the term the user intends.

The number and order of terms in a query is a focus of latent semantic analysis (LSA). Deerwester, Dumais, Furnas, Landauer and Harshman (1990) take a different approach to indexing, based on vectors created from documents and indexed terms. Creating a matrix of terms and documents connects each term to the documents in which it is found. Finding a match to the query is then a matter of finding a point in space, and this also allows similar documents to be returned, since documents that are close to one another may be indexed with different terms. Because users use the same term to query only 20% of the time (Deerwester, Dumais, Furnas, Landauer and Harshman, 1990), this helps find synonymous terms even if the user doesn’t know any. Early testing was successful in finding synonymous terms, but more work needs to be done. The correct term for an idea is not always what is indexed in a document or thesaurus, so the authors say that including concept-based information would increase the success rate of matches. These systems are useful in web searches to find ‘like’ content, and can be helpful in a search for software components. A component that sums two numbers could also be found under add, math, etc., but the concept of addition or plus would include all applicable terms. Another topic of research related to LSA is probabilistic latent semantic analysis (PLSA). Using the same methodology as LSA, PLSA adds a statistical model that helps find matches based on statistical analysis of the vectors and helps introduce new language that may fit a user’s need better (Hofmann, 1999).
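The term-by-document matrix that LSA starts from can be sketched as follows. This is only the vector-space step, under the assumption of raw term counts and cosine similarity; LSA itself goes on to apply a singular value decomposition to the matrix, which is omitted here. The class, the methods, and the three sample documents are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical illustration of a term-document matrix; the SVD step of LSA is not shown.
final class TermDocumentSketch {

    /** Sorted vocabulary of all lower-cased, whitespace-separated terms in the documents. */
    static List<String> vocabulary(List<String> docs) {
        TreeSet<String> terms = new TreeSet<>();
        for (String doc : docs) {
            terms.addAll(Arrays.asList(doc.toLowerCase().trim().split("\\s+")));
        }
        return new ArrayList<>(terms);
    }

    /** Term-count vector of a text over the vocabulary (one column of the matrix). */
    static double[] vector(String text, List<String> vocab) {
        double[] v = new double[vocab.size()];
        for (String term : text.toLowerCase().trim().split("\\s+")) {
            int i = vocab.indexOf(term);
            if (i >= 0) {
                v[i] += 1.0;
            }
        }
        return v;
    }

    /** Cosine of the angle between two vectors; 0 if either vector is all zeros. */
    static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("add two numbers", "sum two numbers", "print output");
        List<String> vocab = vocabulary(docs);
        double[] query = vector("add numbers", vocab);
        for (String doc : docs) {
            System.out.println(doc + " -> " + cosine(query, vector(doc, vocab)));
        }
    }
}
```

With raw counts, the document that says sum instead of add is matched only through the shared word numbers. Closing that gap between synonymous terms is what the dimensionality reduction in LSA is intended to do.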

Research Question 3

Research Question 3 asks whether using a fuzzy logic approach to searching for software components reduces the number of query words needed to find an appropriate match. Reducing or eliminating the need for multiple queries, or for multiple synonymous words in a single query, to ensure that all relevant data is returned is another active topic in information retrieval. This study did not look at the structure of the query or the number of words needed to return all relevant files, but there is concurrent research in these areas. Sieg, Mobasher and Burke (2004) studied a web-based search algorithm that incorporates the user’s profile information and, based on a concept hierarchy, adds certain keywords without requiring the user to enter them explicitly in the search. This returned a better search result list in their experiment. Because the approach is web based and relies on user profiles, applying it to software components would require the search to tie into the software IDE that is running at the time. This may allow the search to identify exactly which software component the user is working on and then fill in other keywords that the user may not even realize are important in order to return all relevant documents.



Another topic concerning queries is the prediction of query difficulty. The difficulty of a query is how hard it is for a system to find matches, or how hard it is to agree on the most relevant document for different versions of a term; it is estimated by comparing the results of the query entered by the user with the results of sub-queries run in the background. Using the MAP or precision @ 10 calculations, Yom-Tov, Fine, Carmel and Darlow (2005) devised a learning algorithm that uses the most effective query words, whether entered by the user or created by the system in a sub-query, to run a more effective query on a system. This learning query system is a great addition for systems where query synonyms are prevalent and other contextual information is not available (Yom-Tov, Fine, Carmel & Darlow, 2005). Difficult queries can also be sent to system administrators so that they can alter systems to create more tags, better identifying information, etc. (Yom-Tov, Fine, Carmel & Darlow, 2005). This type of system would be a great addition to a software-searching IR system. Different programming languages use different key terms for different items, and if the system can automatically fill in and run a background search for similar terms, the user’s results could improve. An issue arises when the user knows exactly what they want to find and the system thinks otherwise. In this case the query expansion option would need to be shut off.
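The idea of background query expansion with an off switch might look like the sketch below. It is a minimal illustration, assuming a hand-built table of related terms; it is not the learning algorithm of Yom-Tov et al., and the class, its methods, and the sample terms are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of optional query expansion; the related-term table is an assumption.
final class QueryExpansionSketch {

    private final Map<String, List<String>> related = new HashMap<>();

    /** Registers terms that should be searched alongside the given term. */
    void addRelated(String term, String... others) {
        related.put(term.toLowerCase(), Arrays.asList(others));
    }

    /**
     * Returns the query terms, followed by their related terms when expansion is enabled.
     * With expansion disabled, the user's terms are returned unchanged (lower-cased).
     */
    List<String> expand(List<String> queryTerms, boolean expansionEnabled) {
        Set<String> result = new LinkedHashSet<>(); // keeps order, drops duplicates
        for (String term : queryTerms) {
            String t = term.toLowerCase();
            result.add(t);
            if (expansionEnabled) {
                result.addAll(related.getOrDefault(t, Collections.emptyList()));
            }
        }
        return new ArrayList<>(result);
    }

    public static void main(String[] args) {
        QueryExpansionSketch expander = new QueryExpansionSketch();
        expander.addRelated("print", "output", "println");
        System.out.println(expander.expand(Arrays.asList("print"), true));
        System.out.println(expander.expand(Arrays.asList("print"), false));
    }
}
```

Passing the flag per query, rather than fixing it in configuration, lets a user who knows the exact term turn expansion off for that one search.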

Future Research

There are many other areas for future research when it comes to information retrieval. Applying the algorithms in this study to a real-life application should be examined next, along with how any search for information can be improved. This study did not consider execution time or efficiency, but that is something this researcher would like to investigate. Because the code for this study was not written in the most efficient manner, the next step would be to rewrite parts of it to make it more efficient for the computer and the user. Most of the literature discussed in Chapter 2 includes the element of time when discussing the success or failure of a new search algorithm. This was not examined in this study because of the lack of mainframe and real-world testing ability, but it would be a next step for this study. A measure of how many times a returned piece of software gets reused would also be a great addition to future research. The reuse of a code component can only be judged by actual programmers who use components; therefore a panel of software engineers would be needed to verify the reusability of a piece of software. It was shown in Chapter 2 that just being able to easily find software components helps increase the reusability of a software component, so this study, based on those results, should increase the reuse of software since the search methods return a better match to a user’s query.

The weighting and ranking of an information retrieval system is another area that can be improved with future research but was not a focus of this study. How a system ranks documents that match a query can be based on a similarity score, the document’s term frequency, or the document name. The basic concept of a ranking system is to assign a score to each document that matches the user’s query, determined by how well that document matches. The list of documents is then displayed to the user in order of these scores. This was done in this research using the fuzzy similarity measures, compared to the Boolean score. Usunier, Buffoni and Gallinari (2009) say that only the few top documents that match the query are really relevant to the user and that those documents should have the highest precision and be at the top of the list, which they usually are not, so they devised a new ranking based on the Ordered Weighted Average (OWA). By looking at the number of relevant documents retrieved as a pairwise function with the number of irrelevant documents retrieved, they can determine which search’s similarity returns a better fit to the user’s query (Usunier, Buffoni, & Gallinari, 2009). This new ranking would be useful for returning the most precise documents first. In this study the most relevant documents were returned but not necessarily in the best order, so testing this OWA approach on this study may produce a better ranked list of returned documents.
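For readers unfamiliar with the Ordered Weighted Average, the sketch below shows the generic OWA aggregation operator only: the values are sorted in descending order and the weights are applied by position. It is not the ranking method of Usunier et al., and the class name and the sample values and weights are hypothetical. The weights are assumed to be non-negative and to sum to 1.

```java
import java.util.Arrays;

// Hypothetical sketch of the generic OWA operator; not the ranking algorithm cited above.
final class OwaSketch {

    /**
     * Ordered weighted average: weights[0] applies to the largest value,
     * weights[1] to the second largest, and so on.
     */
    static double owa(double[] values, double[] weights) {
        if (values.length != weights.length) {
            throw new IllegalArgumentException("values and weights must have the same length");
        }
        double[] sorted = values.clone();
        Arrays.sort(sorted); // ascending order
        double result = 0.0;
        for (int i = 0; i < sorted.length; i++) {
            // read the sorted array from the end to get descending order
            result += weights[i] * sorted[sorted.length - 1 - i];
        }
        return result;
    }

    public static void main(String[] args) {
        double[] scores = {0.2, 0.9, 0.5};
        System.out.println(owa(scores, new double[] {0.5, 0.3, 0.2})); // emphasis on top values
        System.out.println(owa(scores, new double[] {1.0, 0.0, 0.0})); // behaves like max
        System.out.println(owa(scores, new double[] {0.0, 0.0, 1.0})); // behaves like min
    }
}
```

Because the weights attach to positions rather than to particular items, placing more weight on the first positions emphasizes the top of a list, which is the property that makes the operator attractive for ranking.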

One more area for future research that came up in doing this study is the idea of relevance. Relevance in an information retrieval system can come from three different areas: user-centered, system-centered, or cognitive (Ingwersen, 2001). Deciding what is relevant to a user is difficult and has forced most IR systems to create a network or mapping of information to related terms. This matrix of information can differ per user and per topic, which is an area of study that was not a focus of this research. These types of data maps require extensive background information that is usually gathered over time and through experience. The human brain usually gathers knowledge about a topic over the course of years; to gather that kind of information, systems would have to be able to grow without a bound on storage space, all while connecting like terms. Although this is more of a data store topic, it will change how users retrieve from and even query a system. If users understand how the data is stored, they are better equipped to write a more intuitive query. Data stores are another area of future research that is of interest. How a system stores and connects data will greatly define how the data is searched, how fast the data is searched, and the success of finding a match. Crestani and Lalmas (2001) look at logic in an IR system and treat the relevance of a document as true if the document meets a user’s needs and false, meaning the document is irrelevant, if it does not. This logical approach to relevance is the basis for their logical IR system and, so that it is not simply a straight Boolean IR system, they incorporate other theories to devise a new IR system that they compare to traditional IR systems.

Conclusion

There are many applications where an IR system can be effective, and finding the most accurate and most effective IR method is something that can benefit many. As shown in this chapter, the fuzzy logic methods were an improvement over the Boolean search results. Because of this, more research can be done to find the most efficient system. There are many future routes along which this research can be expanded, from query improvement to semantic vector space analysis, but it is safe to say the fuzzy retrieval was a success over the Boolean model.



References

Addagada, S. (2007). Indexing and Searching Document Collections using Lucene. University of

New Orleans Theses and Dissertations. Paper 1070. Retrieved from

http://scholarworks.uno.edu/cgi/viewcontent.cgi?article=2051&context=td.

Agresti, W. (2011). Software Reuse: Developers’ Experiences and Perceptions. Journal of

Software Engineering and Applications. Doi: 10.4236. pp. 48-58.

Aziz, M., North, S. (2007). Retrieving Software Component using Clone Detection and Program

Slicing. The University of Sheffield.

Baeza-Yates, R., Ribeiro-Neto, B. (2011). Modern Information Retrieval the concepts and

technology behind search. Second edition. Addison Wesley; Harlow, England.

Barringer, H., Cheng, J. and Jones, C. (1984). A logic covering undefinedness in program proofs.

Acta Informatica, 21(3). Pp. 251-269.

Belkin, N. J., & Croft, W. B. (1992). Information filtering and information retrieval: two sides of

the same coin?. Communications of the ACM, 35(12), 29-38.

Binkley, D., & Lawrie, D. (2008). Applications of information retrieval to software

development. Encyclopedia of Software Engineering (P. Laplante, ed.),(to appear).

Bordogna, G., Carrara, G., Pasi, G. (1992). Extending Boolean Information Retrieval: A Fuzzy

Model Based on Linguistic Variables. In Fuzzy Systems, 1992 IEEE International

Conference on (pp. 769-776). IEEE.

Bordogna, G., Pasi, G. (1993). A Fuzzy Linguistic Approach Generalizing Boolean Information

Retrieval: A Model and Its Evaluation. JASIS, 44(2), 70-82.



Bookstein, A. (1980). Fuzzy requests: an approach to weighted Boolean searches. Journal of the

American Society for information Science, 31(4), 240-247.

Burton, B. A., Aragon, R. W., Bailey, S. A., Koehler, K. D., & Mayes, L. A. (1988). The

reusable software library. In Software reuse: emerging technology (pp. 129-137). IEEE

Computer Society Press.

Buell, D. A., & Kraft, D. H. (1981). A model for a weighted retrieval system. Journal of the

American Society for Information Science, 32(3), 211-216.

Carpenter, B., Morris, M., & Baldwin, B. (2011). Lucene Version 3.0 Tutorial. Draft of: March,

31.

Chau, R., Yeh, C. (2004). Fuzzy Conceptual Indexing for Concept-Based

Cross-Lingual Text Retrieval. IEEE Internet Computing. 2004. Pgs. 14-21.

Chowdhury, C. R., & Bhuyan, P. (2010, July). Information retrieval using fuzzy c-means

clustering and modified vector space model. In Computer Science and Information

Technology (ICCSIT), 2010 3rd IEEE International Conference on Vol. (1), pp. 696-700.

IEEE.

Crestani, F., & Lalmas, M. (2001). Logic and uncertainty in information retrieval. In Lectures on

information retrieval (pp. 179-206). Springer Berlin Heidelberg.

Croft, W., Metzler, D. & Strohman, T. (2010). Search Engines Information Retrieval in

Practice. Pearson Education; Boston, MA.

Deerwester, S., Dumais, S., Furnas, G., Landauer, T. & Harshman, R. (1990). Indexing by

Latent Semantic Analysis. Journal of the American Society for Information Science.



41(6). pp. 391-407.

Eckard, E., & Chappelier, J. C. (2007). Free Software for research in Information Retrieval and

Textual Clustering. Technical report, Ecole Polytechnique Federale de Lausanne.

Fox, E. A., & Sharan, S. (1986). A comparison of two methods for soft Boolean operator

interpretation in information retrieval.

Frakes, W.B., Baeza-Yates, R. (1992). Information Retrieval Data Structures & Algorithms.

Prentice Hall: Englewood Cliffs, New Jersey.

Frakes, W. B., & Pole, T. P. (1994). An empirical study of representation methods for reusable

software components. Software Engineering, IEEE Transactions on, 20(8), 617-630.

Gallardo-Valencia, R. E., & Elliott Sim, S. (2009, May). Internet-scale code search. In

Proceedings of the 2009 ICSE Workshop on Search-Driven Development-Users,

Infrastructure, Tools and Evaluation (pp. 49-52). IEEE Computer Society.

Gibb, F., McCartan, C., O’Donnell, R., Sweeney, N., & Leon, R. (2000). The integration of

information retrieval techniques within a software reuse environment. Journal of

Information Science, 26(4), 211-226.

Haefliger, S., Von Krogh, G., & Spaeth, S. (2008). Code reuse in open source software.

Management Science, 54(1), 180-193.

Henninger, S. (1994). Using Iterative Refinement to Find Reusable Software. Software,

IEEE, 11(5), 48-59.

Hofmann, T. (1999, July). Probabilistic latent semantic analysis. In Proceedings of the Fifteenth

conference on Uncertainty in artificial intelligence (pp. 289-296). Morgan Kaufmann



Publishers Inc.

Houhamdi, Z., & Ghoul, S. (2001). Classifying software for reusability. Mail of technical

and scientific knowledge, 41-47.

Huang, L., D. Milne, et al. (2012). Learning a concept-based document similarity measure.

Journal of the American Society for Information Science and Technology 63(8): 1593-

1608.

Ingwersen, P. (2001). Users in context. In Lectures on information retrieval (pp. 157-178).

Springer Berlin Heidelberg.

Isakowitz, T., Kauffman, R. (1996). Supporting Search for Reusable Software

Objects. IEEE Transactions on Software Engineering, 22(6): 407-422.

Keswani, R., Joshi, S., & Jatain, A. (2014, February). Software Reuse in Practice. In Advanced

Computing & Communication Technologies (ACCT), 2014 Fourth International

Conference on (pp. 159-162). IEEE.

Khalifa, H., Khayati, O., Ghezala, H. (2008) A Behavioral and Structural Components

Retrieval Technique for Software Reuse. Advanced Software Engineering and Its

Applications. IEEE, pp. 134- 137.

Klir, G. J., & Yuan, B. (1995). Fuzzy sets and fuzzy logic (pp. 487-499). New Jersey: Prentice

Hall.

Kokkoras, F., Ntonas, K., Kritikos, A., Kakarontzas, G., & Stamelos, I. (2012). Federated Search

for Open Source Software Reuse. In Software Engineering and Advanced Applications

(SEAA), September 2012 38th EUROMICRO Conference on (pp. 200-203). IEEE.



Kraft, D., Bordogna, G., Pasi, G. (1998). Information Retrieval Systems: Where is the Fuzz? In Fuzzy Systems Proceedings, 1998. IEEE World Congress on Computational Intelligence. The 1998 IEEE International Conference on (Vol. 2, pp. 1367-1372). IEEE.

Krueger, C. (1992). Software reuse. ACM Comput. Surv. 24, 2 (June 1992), 131-183.

DOI=10.1145/130844.130856 http://doi.acm.org/10.1145/130844.130856.

Maarek, Y., Berry, D., & Kaiser, G. (1991). An Information Retrieval Approach For

Automatically Constructing Software Libraries. IEEE Transactions on Software

Engineering. Vol 17(8). P. 800-813.

Manning, C., Raghavan, P., & Schutze, H. (2008). Introduction to Information Retrieval.

Cambridge University Press.

Marri, M. R., Thummalapenta, S., & Xie, T. (2009, May). Improving software quality via code

searching and mining. In Proceedings of the 2009 ICSE Workshop on Search-Driven

Development-Users, Infrastructure, Tools and Evaluation (pp. 33-36). IEEE Computer

Society.

Maron, M. E., & Kuhns, J. L. (1960). On relevance, probabilistic indexing and information

retrieval. Journal of the ACM (JACM), 7(3), 216-244.

McIlroy, M. (1968) Mass produced software components. In Software Engineering; Report on

a conference by the NATO Science Committee (Garmisch, Germany) Oct. pp. 138-150.

Mili, H., Mili, F., Mili, A. (1995). Reusing Software: Issues and Research Directions. IEEE Transactions on Software Engineering. 21(6): 528-562.



Miller, B. (1996). Fuzzy Logic. Electronics Now. May 1996. pp. 29-30, 56-60.

Miyamoto, S. (1990). Fuzzy Sets in Information Retrieval and Cluster Analysis. Kluwer

Academic Publishers. Dordrecht, Netherlands.

Mockus, A., (2007). Large-scale code reuse in open source software. International Workshop

on Emerging Trends in FLOSS Research and Development, 0:7, 2007.

Morisio, M., Ezran, M., & Tully, C. (2002). Success and failure factors in software reuse.

Software Engineering, IEEE Transactions on, 28(4), 340-357.

Nomoto, K., Kubo, T., Kosuge, Y. (1995). Fuzzy Thesaurus Generation Based on Cross-Index

Matrix for Case-Based Reasoning. IEEE. January, 1995. 4033-4038.

Pasi, G., Bordogna, G. (2013) The Role of Fuzzy Sets in Information Retrieval. On Fuzziness,

Vol 2, R. Seising et al. Springer –Verlag Berlin Heidelberg. P. 525-532.

Prieto-Diaz, R. (1991). Implementing Faceted Classification for Software Reuse.

Communications of the ACM. 34(5).

Prieto-Diaz, R., Freeman, P. (1987). Classifying Software for Reusability. IEEE

Software. January, 1987. P. 6-16. ProQuest database.

Radecki, T. (1979). Fuzzy set theoretical approach to document retrieval. Information

Processing & Management, 15(5), 247-259.

Reiss, S. P. (2009, May). Semantics-based code search. In Software Engineering, 2009. ICSE

2009. IEEE 31st International Conference (pp. 243-253). IEEE.

Robertson, S. (2004). Understanding inverse document frequency: on theoretical arguments for

IDF. Journal of documentation, 60(5), 503-520.



Robertson, S., Zaragoza, H. & Taylor, M. (2004). Simple BM25 Extension to Multiple

Weighted Fields. ACM Conference on Information Knowledge 2004. Nov. 8-13.

Rothenberger, M. A., Dooley, K. J., Kulkarni, U. R., & Nada, N. (2003). Strategies for software

reuse: A principal component analysis of reuse practices. Software Engineering, IEEE

Transactions on, 29(9), 825-837.

Salton, G., Fox, E., Wu, H. (1983). Extended Boolean Information Retrieval. Communications

of the ACM. 26 (12). pp. 1022 – 1036.

Sandhu, P. S., Kaur, H., & Singh, A. (2009). Modeling of Reusability of Object Oriented

Software System. World Academy of Science, Engineering and Technology, 56(32).

Sandhu, P., Singh, H. (2007). Automatical Reusability Appraisal of Software Components using

Neuro-fuzzy Approach. World Academy of Science, Engineering and Technology. Vol. 8.

Sieg, A., Mobasher, B., & Burke, R. (2004). Inferring user’s information context from user

profiles and concept hierarchies. In Classification, Clustering, and Data Mining

Applications (pp. 563-573). Springer Berlin Heidelberg.

Sim, S. E., Clarke, C. L., & Holt, R. C. (1998, June). Archetypal source code searches: A survey

of software developers and maintainers. In Program Comprehension, 1998. IWPC'98.

Proceedings., 6th International Workshop on (pp. 180-187). IEEE.

Singhal, A. (2001). Modern information retrieval: A brief overview. IEEE Data Eng. Bull.,

24(4), 35-43.

Sojer, M., & Henkel, J. (2010). Code reuse in Open Source Software development: Quantitative

evidence, drivers, and impediments. Journal of the Association for Information Systems,



11(12), 868-901.

Srinivasan, P., Ruiz, M., Kraft, D., Chen, J. (2001). Vocabulary mining for information

retrieval: rough sets and fuzzy sets. Information Processing and Management Vol. (37)

pp. 15-38.

Smart, J. (2006). Lucene Tutorial. http://oak.cs.ucla.edu/cs144/projects/lucene/.

Swain, M., Anderson, J. A., Swain, N., & Korrapati, R. (2005, April). Study of information

retrieval using fuzzy queries. In SoutheastCon, 2005. Proceedings. IEEE (pp. 527-533).

IEEE.

Thummalapenta, S., (2011). Improving Software Productivity and Quality via Mining Source

Code. (Doctoral dissertation) UMI Dissertation Publishing:3442531.

Triantafyllos, G., Vassiliadis, S., Pechanek, G. (1994). A Fuzzy Information Retrieval System.

IEEE World Congress on Computational Intelligence., Proceedings of the Third IEEE

Conference on Fuzzy Systems. IEEE. (p. 150-155).

Turpin, A., & Scholer, F. (2006, August). User performance versus precision measures for

simple search tasks. In Proceedings of the 29th annual international ACM SIGIR

conference on Research and development in information retrieval (pp. 11-18). ACM.

Usunier, N., Buffoni, D., & Gallinari, P. (2009, June). Ranking with ordered weighted pairwise

classification. In Proceedings of the 26th annual international conference on machine

learning (pp. 1057-1064). ACM.

Verma, R., Sharma, B. (2013). Fuzzy Generalized Prioritized Weighted Average Operator and

its Application to Multiple Attribute Decision Making. International Journal of



Intelligent Systems, Vol. 00 (1-24).

Vishal, Subhash, C., Kunda, J. (2012). An Effective Retrieval Scheme for Software Component

Reuse. International Journal on Computer Science and Engineering. Vol 4(7). ISSN:

0975-3397.

Xu, J., Croft, B. (1996). Query Expansion Using Local and Global Document Analysis. As

presented at SIGIR 1996, Zurich, Switzerland, ACM.

Yao, H., Etzkorn, L., Virani, S. (2008). Automated Classification and Retrieval of

Reusable Software Components. Journal of the American Society for Information

Science and Technology, 59(4): 613-627.

Yom-Tov, E., Fine, S., Carmel, D., & Darlow, A. (2005, August). Learning to estimate query

difficulty: including applications to missing content detection and distributed information

retrieval. In Proceedings of the 28th annual international ACM SIGIR conference on

Research and development in information retrieval (pp. 512-519). ACM

Zadeh, L.A., (1994). Soft Computing and Fuzzy Logic. IEEE Software. November 1994. pp.

48-56.

Zhang, Q., Wu, Y., Ding, Z., Huang X. (2012) Learning Hash Codes for Efficient Content Reuse

Detection. School of Computer Science, As presented at SIGIR 2012, ACM.



Appendix A

This is the resulting list from each search, with the first 10 files returned for large searches. The far right columns contain the experts’ lists of relevant files. The files in red text are the files that both experts deemed relevant; therefore those were the files deemed relevant for this search. File names highlighted in yellow are those that match the experts’ relevant file list. Highlighted in green are the files the experts deemed relevant but no search returned.

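Term A | Bool Operator | Term B | Rank | MMM doc | Paice doc | P-norm doc | Boolean doc | Expert 1 / Expert 2 relevant lists
field | AND | float | 1 | whatis | Whatis | whatis | grn.1 | printf / printf
field | AND | float | 2 | printf.1 | printf.1 | printf.1 | printf.1 | sort / sort
field | AND | float | 3 | stat.1 | stat.1 | stat.1 | seq.1 | seq / awk
field | AND | float | 4 | sort.1 | sort.1 | sort.1 | sort.1 | perl
field | AND | float | 5 | grn.1 | grn.1 | grn.1 | stat.1 |
field | AND | float | 6 | seq.1 | seq.1 | seq.1 | tcsh.1 |
field | AND | float | 7 | tcsh.1 | tcsh.1 | tcsh.1 | whatis |
copy | AND | File | 1 | pax.1 | pax.1 | pax.1 | addftinfo.1 | cp / cp
copy | AND | File | 2 | whatis | Whatis | whatis | addr2line.1 | vi / pax
copy | AND | File | 3 | cp.1 | cp.1 | cp.1 | afmtodit.1 | cpio
copy | AND | File | 4 | objcopy.1 | objcopy.1 | objcopy.1 | as.1 | tar
copy | AND | File | 5 | lex++.1 | lex++.1 | lex++.1 | bsdtar.1 |
copy | AND | File | 6 | mail.1 | mail.1 | mail.1 | c89.1 |
copy | AND | File | 7 | cpio.1 | cpio.1 | cpio.1 | c99.1 |
copy | AND | File | 8 | install.1 | install.1 | install.1 | chmod.1 |
copy | AND | File | 9 | sendbug.1 | sendbug.1 | sendbug.1 | ci.1 |
copy | AND | File | 10 | vi.1 | vi.1 | vi.1 | co.1 |
run | OR | execute | 1 | tcsh.1 | tcsh.1 | tcsh.1 | addr2line.1 | tcsh / tsch
run | OR | execute | 2 | objcopy.1 | objcopy.1 | objcopy.1 | afmtodit.1 | make / atq
run | OR | execute | 3 | strip.1 | strip.1 | strip.1 | apply.1 | clang++ / gbb
run | OR | execute | 4 | whatis | Whatis | whatis | as.1 | bash
run | OR | execute | 5 | lex++.1 | lex++.1 | lex++.1 | atq.1 | zsh
run | OR | execute | 6 | make.1 | make.1 | make.1 | atrm.1.gz |
run | OR | execute | 7 | clang++.1 | clang++.1 | clang++.1 | biff.1 |
run | OR | execute | 8 | gcov.1 | gcov.1 | gcov.1 | brandelf.1 |
run | OR | execute | 9 | gdb.1 | gdb.1 | gdb.1 | bsdtar.1 |
run | OR | execute | 10 | id.1 | id.1 | id.1 | bsnmpd.1 |
column | AND | height | 1 | dialog.1 | dialog.1 | dialog.1 | dialog.1 | dialog / dialog
column | AND | height | 2 | eqn.1 | less.1 | eqn.1 | eqn.1 | less
column | AND | height | 3 | less.1 | mdocml.1 | less.1 | less.1 |
column | AND | height | 4 | mdocml.1 | eqn.1 | mdocml.1 | mdocml.1 |
column | AND | height | 5 | units.1 | units.1 | units.1 | units.1 |
constructor | AND | length | 1 | id.1 | id.1 | id.1 | Id.1 | id not relevant / g++
addend | AND | sum | 1 | whatis | Whatis | whatis | Whatis | whatis not relev. / NONE
math | AND | library | 1 | whatis | bc.1 | whatis | bc.1 | bc / clang++
math | AND | library | 2 | bc.1 | Whatis | bc.1 | clang++ | clang++ / bc
math | AND | library | 3 | clang++.1 | clang++.1 | clang++.1 | Whatis |
space | AND | gap | 1 | pr.1 | pr.1 | pr.1 | id.1 | tcsh / pr
space | AND | gap | 2 | objcopy.1 | objcopy.1 | objcopy.1 | objcopy.1 | pr / objcopy
space | AND | gap | 3 | tcsh.1 | tcsh.1 | tcsh.1 | pr.1 |
space | AND | gap | 4 | id.1 | id.1 | id.1 | tcsh.1 |
mergesort | OR | heapsort | 1 | sort.1 | sort.1 | sort.1 | sort.1 | sort.1 / sort
mergesort | OR | heapsort | 2 | whatis | Whatis | whatis | Whatis |
cut | OR | delete | 1 | cut.1 | cut.1 | cut.1 | bsnmpwalk.1 | cut / cut
cut | OR | delete | 2 | bsnmpwalk.1 | bsnmpwalk.1 | bsnmpwalk.1 | colrm.1 | paste / awk
cut | OR | delete | 3 | whatis | whatis | whatis | cut.1 | perl
cut | OR | delete | 4 | file.1 | file.1 | file.1 | file.1 |
cut | OR | delete | 5 | gperf.1 | gperf.1 | gperf.1 | gperf.1 |
cut | OR | delete | 6 | last.1 | last.1 | last.1 | last.1 |
cut | OR | delete | 7 | idd.1 | idd.1 | idd.1 | idd.1 |
cut | OR | delete | 8 | paste.1 | paste.1 | paste.1 | paste.1 |
cut | OR | delete | 9 | stat.1 | stat.1 | stat.1 | ssh.1 |
cut | OR | delete | 10 | unxz.1 | unxz.1 | unxz.1 | unxz.1 |
oldest | AND | command | 1 | tcsh.1 | tcsh.1 | tcsh.1 | pgrep.1 | tcsh / tsch
oldest | AND | command | 2 | pgrep.1 | pgrep.1 | pgrep.1 | tcsh.1 | bash
oldest | AND | command | | | | | | history
sort | AND | mergesort | 1 | sort.1 | sort.1 | sort.1 | sort.1 | sort / sort
sort | AND | mergesort | 2 | whatis | whatis | whatis | Whatis |
deploy | AND | run | 1 | clang++.1 | clang++.1 | clang++.1 | atq.1 | clang++.1 / atq
deploy | AND | run | 2 | atq.1 | atq.1 | atq.1 | clang++.1 | rsh / ssh
find | AND | file | 1 | find.1 | find.1 | find.1 | afmtodit.1 | find / find
find | AND | file | 2 | whatis | whatis | whatis | apropos.1 | locate / locate
find | AND | file | 3 | lex++.1 | lex++.1 | lex++.1 | as.1 | less / less
find | AND | file | 4 | less.1 | less.1 | less.1 | bsdgrep.1 | vi
find | AND | file | 5 | tcsh.1 | tcsh.1 | tcsh.1 | bsdtar.1 |
find | AND | file | 6 | cpio.1 | cpio.1 | cpio.1 | bzip2.1 |
find | AND | file | 7 | unxz.1 | unxz.1 | unxz.1 | chflags.1 |
find | AND | file | 8 | vi.1 | vi.1 | vi.1 | ci.1 |
find | AND | file | 9 | id.1 | id.1 | id.1 | clang++.1 |
find | AND | file | 10 | locate.1 | locate.1 | locate.1 | cpio.1 |
list | AND | files | 1 | limits.1 | limits.1 | limits.1 | limits.1 | tcsh / Sh
list | AND | files | 2 | sh.1 | sh.1 | sh.1 | sh.1 | ls / Tcsh
list | AND | files | 3 | tcsh.1 | tcsh.1 | tcsh.1 | tcsh.1 | Ls / Find
list | AND | files | | | | | | Locate
sum | AND | add | 1 | whatis | whatis | whatis | id.1 | nawk / Nawk
sum | AND | add | 2 | id.1 | unxz.1 | id.1 | nawk.1 | expr / Perl
sum | AND | add | 3 | unxz.1 | id.1 | unxz.1 | ps.1 |
sum | AND | add | 4 | ps.1 | nawk.1 | ps.1 | unxz.1 |
sum | AND | add | 5 | nawk.1 | ps.1 | nawk.1 | Whatis |
subtract | AND | math | 1 | expr.1 | expr.1 | expr.1 | expr.1 | expr / expr
subtract | AND | math | | | | | | nawk / nawk
subtract | AND | math | | | | | | bc / perl
add | AND | subtract | 1 | objcopy.1 | objcopy.1 | objcopy.1 | expr.1 | expr / objcopy
add | AND | subtract | 2 | id.1 | id.1 | id.1 | id.1 | nawk / bc
add | AND | subtract | 3 | expr.1 | expr.1 | expr.1 | objcopy.1 | perl
add | AND | subtract | | | | | | nawk
printer | AND | network | 1 | whatis | whatis | whatis | tcsh.1 | telnet / lp
printer | AND | network | 2 | telnet.1 | telnet.1 | telnet.1 | telnet.1 | lp / pr
printer | AND | network | 3 | tcsh.1 | tcsh.1 | tcsh.1 | whatis |
mergesort | AND | print | 1 | sort.1 | sort.1 | sort.1 | sort.1 | sort / sort
mergesort | AND | print | 2 | whatis | whatis | whatis | whatis | awk / perl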



Appendix B

Function that calculates the MMM, Paice, P-Norm and Boolean similarity, sorts the results, and prints them to the console.
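Before reading the full listing, it may help to see the three soft Boolean scores in isolation for a two-term query. The sketch below follows the usual textbook forms of the MMM, Paice and P-Norm models with equal query-term weights. It assumes term weights already scaled to the range 0 to 1, and its coefficients and exponent are parameters chosen for illustration, so it is not a transcription of the listing, which works on tf-idf scores and hard-codes its own coefficients. All names here are hypothetical.

```java
// Hypothetical two-term sketch of the MMM, Paice and P-Norm models; a and b are
// document weights for the two query terms, assumed to lie in [0, 1].
final class SoftBooleanSketch {

    /** MMM OR: weighted mix of max and min, with cMax the share given to the max. */
    static double mmmOr(double a, double b, double cMax) {
        return cMax * Math.max(a, b) + (1 - cMax) * Math.min(a, b);
    }

    /** MMM AND: weighted mix of min and max, with cMin the share given to the min. */
    static double mmmAnd(double a, double b, double cMin) {
        return cMin * Math.min(a, b) + (1 - cMin) * Math.max(a, b);
    }

    /** Paice OR: weights sorted in descending order, then discounted by powers of r. */
    static double paiceOr(double a, double b, double r) {
        return (Math.max(a, b) + r * Math.min(a, b)) / (1 + r);
    }

    /** Paice AND: weights sorted in ascending order, then discounted by powers of r. */
    static double paiceAnd(double a, double b, double r) {
        return (Math.min(a, b) + r * Math.max(a, b)) / (1 + r);
    }

    /** P-Norm OR with equal term weights and exponent p >= 1. */
    static double pNormOr(double a, double b, double p) {
        return Math.pow((Math.pow(a, p) + Math.pow(b, p)) / 2, 1 / p);
    }

    /** P-Norm AND with equal term weights and exponent p >= 1. */
    static double pNormAnd(double a, double b, double p) {
        return 1 - Math.pow((Math.pow(1 - a, p) + Math.pow(1 - b, p)) / 2, 1 / p);
    }

    public static void main(String[] args) {
        double a = 0.6, b = 0.8; // illustrative weights only
        System.out.println("MMM    OR " + mmmOr(a, b, 0.7) + "  AND " + mmmAnd(a, b, 0.7));
        System.out.println("Paice  OR " + paiceOr(a, b, 0.7) + "  AND " + paiceAnd(a, b, 0.7));
        System.out.println("P-Norm OR " + pNormOr(a, b, 2) + "  AND " + pNormAnd(a, b, 2));
    }
}
```

In each model the OR score sits closer to the larger weight and the AND score closer to the smaller one, which is how these models soften the strict max and min of classical fuzzy set retrieval.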

////////////////////////////////////////////////////////////////////////////////////////////////
//*** GET ALL THE IDFS, Calculate MMM, PAICE, PNORM and BOOLEAN ***/
////////////////////////////////////////////////////////////////////////////////////////////////
static Map<String, Float> getIdfs(IndexReader reader, String field, String TermA, String TermB, int BoolTerm,
        BM25Similarity MySim) throws IOException
{
    // Result lists: column 0 holds the document ID, column 1 holds the similarity score
    Float [][] MatchingListPaice = new Float [10000][2];
    Float [][] MatchingListPNorm = new Float [10000][2];
    Float [][] MatchingList = new Float [10000][2];
    Float [][] MatchingListBoolean = new Float [10000][2];
    // Per-term statistics: [1] document ID, [2] term frequency in that document, [3] document frequency
    Integer [][] StatsListA = new Integer [100000][4];
    Integer [][] StatsListB = new Integer [100000][4];
    int MAXA = 0;
    int MAXB = 0;
    int MAXfreqA = 0;
    int MAXfreqB = 0;
    Fields fields = MultiFields.getFields(reader); //Get the Fields of the index
    Bits liveDocs = MultiFields.getLiveDocs(reader);
    //HERE, getting all the data but dynamically, need to search through the data for match of terms and then
    //calculate values and display results


    int b = 0;
    int x = 0;
    int k = 0;
    int TOTAL_NUM_MATCHING_DOCS = 0;
    int TOTAL_NUM_DOCS = reader.numDocs();
    for (String field2 : fields) {
        TermsEnum termEnum = MultiFields.getTerms(reader, field2).iterator(null);
        Terms newTermEnum = MultiFields.getTerms(reader, "contents");
        BytesRef bytesRef;
        while ((bytesRef = termEnum.next()) != null) {
            if (termEnum.seekExact(bytesRef, true)) {
                if (bytesRef.utf8ToString().equalsIgnoreCase(TermA) || (bytesRef.utf8ToString().equalsIgnoreCase(TermB))) {
                    DocsEnum docsEnum = termEnum.docs(liveDocs, null);
                    if (docsEnum != null) {
                        int doc;
                        // Walk the postings of the matched term and record per-document statistics
                        while ((doc = docsEnum.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
                            if (bytesRef.utf8ToString().equalsIgnoreCase(TermA)) {
                                Term termInstanceA = new Term("contents", TermA);
                                long termFreqA = reader.totalTermFreq(termInstanceA);
                                long docCountA = reader.docFreq(termInstanceA);
                                StatsListA[k][0] = b;
                                StatsListA[k][1] = doc;
                                StatsListA[k][2] = docsEnum.freq();
                                StatsListA[k][3] = (int) docCountA;
                                k++;
                            }
                            else {
                                Term termInstanceB = new Term("contents", TermB);
                                long docCountB = reader.docFreq(termInstanceB);
                                StatsListB[x][0] = b;
                                StatsListB[x][1] = doc;
                                StatsListB[x][2] = docsEnum.freq();
                                StatsListB[x][3] = (int) docCountB;
                                x++;
                            }
                            MAXA = k;
                            MAXB = x;
                        }
                    }
                }
            }
        }
    }

    /////////////////////////////////////////////////////////////////////////////////////////////
    ////getting documents, term frequency, calculating similarity
    /////////////////////////////////////////////////////////////////////////////////////////////
    float MAX1 = 0;
    float MAX2 = 0;
    float MAX3 = 0;
    float MAX4 = 0;
    long temp1 = 0;
    long temp2 = 0;
    int LAST_CHECK = 0;
    int AValue = 0;
    int BValue = 0;
    float ScoreCalcA;
    float ScoreCalcB;
    int numDocs = 0;
    if (MAXA <= MAXB) {
        for (int r = 0; r < MAXA; r++) {
            for (int t = 0; t < MAXB; t++) {
                AValue = StatsListA[r][1];
                BValue = StatsListB[t][1];
                //If the document ID match, there is a matching file
                if (BValue == AValue) { //compare document numbers, if they match, and the query is an AND go here
                    TOTAL_NUM_MATCHING_DOCS ++;
                    MatchingList[numDocs][0] = (float)StatsListA[r][1];
                    MatchingListPaice[numDocs][0] = (float)StatsListA[r][1];
                    MatchingListPNorm[numDocs][0] = (float)StatsListA[r][1];
                    MatchingListBoolean[numDocs][0] = (float) StatsListA[r][1];
                    MatchingListBoolean[numDocs][1] = (float) 1;
                    ScoreCalcA = StatsListA[r][2];
                    ScoreCalcB = StatsListB[t][2];
                    temp1 = fileList[AValue];
                    temp2 = fileList[BValue];


                    if (ScoreCalcA > 0) {
                        MAX1 = (float) Math.log(ScoreCalcA) + 1; // gets the tF
                    }
                    else {
                        MAX1 = 1; // gets the tF
                    }
                    if (ScoreCalcB > 0) {
                        MAX2 = (float) Math.log(ScoreCalcB) + 1;
                    }
                    else {
                        MAX2 = 1; // / StatsListB[t][3]);
                    }
                    // gets the IDF; the ratio inside the log is computed with integer division
                    MAX3 = (float) (Math.log((TOTAL_NUM_DOCS+1) / (MAXA+1)));
                    MAX4 = (float) (Math.log((TOTAL_NUM_DOCS+1) / (MAXB+1)));
                    ScoreCalcA = (float) MAX1 * MAX3;
                    ScoreCalcB = (float) MAX2 * MAX4;
                    if (BoolTerm == 1) { //OR
                        MatchingList[numDocs][1] = (float) (.7 * Math.min(ScoreCalcA, ScoreCalcB) + .3 *
                                Math.max(ScoreCalcA, ScoreCalcB));
                        MatchingListPNorm[numDocs][1] = (float) Math.sqrt((Math.pow(1,2) *
                                Math.pow(ScoreCalcA, 2) + Math.pow(1,2) * Math.pow(ScoreCalcB, 2)) /
                                (Math.pow(1,2) + Math.pow(1,2)));
                        //sort in descending order for Paice OR
                        if (ScoreCalcA >= ScoreCalcB) {
                            MatchingListPaice[numDocs][1] = (float) ((float) Math.pow(.7, 0) * ScoreCalcA +
                                    Math.pow(.7, 1) * ScoreCalcB / (Math.pow(.7, 0) + Math.pow(.7, 1)));
                        }
                        else {
                            MatchingListPaice[numDocs][1] = (float) ((float) Math.pow(.7, 0) * ScoreCalcB +
                                    Math.pow(.7, 1) * ScoreCalcA / (Math.pow(.7, 0) + Math.pow(.7, 1)));
                        }
                    }

Else { //AND

MatchingList[numDocs][1] = (float) (.4 * Math.min(ScoreCalcA, ScoreCalcB) + .6 *


MatchingListPNorm[numDocs][1] = Math.abs((float) (1 - Math.sqrt(( Math.pow(1,2)*

Math.pow(1 - ScoreCalcA, 2) + Math.pow(1,2)* Math.pow(1 - ScoreCalcB, 2))/

(Math.pow(1,2) + Math.pow(1,2)))));

if (ScoreCalcA <= ScoreCalcB) {

MatchingListPaice[numDocs][1] = (float) ((float) Math.pow(1, 0) * ScoreCalcA +

Math.pow(1, 1) * ScoreCalcB /(Math.pow(1, 0) + Math.pow(1, 1)));

}

else {

MatchingListPaice[numDocs][1] = (float) ((float) Math.pow(1, 0) * ScoreCalcB + Math.pow(1, 1)

* ScoreCalcA /(Math.pow(1, 0) + Math.pow(1, 1)));

}

}

numDocs++;

LAST_CHECK = AValue;

} //End if of if A==B

//if the A list is exhausted but documents are still left in B, do those, or if B is exhausted and A still has files

else if ((BValue < AValue && BoolTerm == 1 && BValue > LAST_CHECK) || (BoolTerm == 1 &&

AValue < BValue && (r == (MAXA -1)))) {


MatchingList[numDocs][0] = (float)StatsListB[t][1];

MatchingListPaice[numDocs][0] = (float)StatsListB[t][1];

MatchingListPNorm[numDocs][0] = (float)StatsListB[t][1];

MatchingListBoolean[numDocs][0] = (float) StatsListB[t][1];


ScoreCalcA = 0;

ScoreCalcB = StatsListB[t][2];

temp1 = 0;

temp2 = fileList[BValue];

if (ScoreCalcA > 0) {

MAX1 = (float) Math.log(ScoreCalcA) + 1;// gets the tF

}

else {

MAX1 = 0;

}

if (ScoreCalcB > 0) {

MAX2 = (float) Math.log(ScoreCalcB ) + 1;

}

else {

MAX2 = 1;

}

MAX3 = (float) Math.log((TOTAL_NUM_DOCS+1) / (MAXA+1));//gets the IDF

MAX4 = (float) Math.log((TOTAL_NUM_DOCS+1) / (MAXB+1));///

ScoreCalcA = (float) MAX1 * MAX3;

ScoreCalcB = (float) MAX2 * MAX4;

LAST_CHECK = StatsListB[t][1];

if (BoolTerm == 1) { //OR

MatchingList[numDocs][1] = (float) (.7 * Math.min(ScoreCalcA, ScoreCalcB) + .3 *

Math.max(ScoreCalcA, ScoreCalcB));

MatchingListPNorm[numDocs][1] = Math.abs((float) Math.sqrt(( Math.pow(1,2)*

Math.pow(ScoreCalcA, 2) + Math.pow(1,2)* Math.pow(ScoreCalcB, 2))/ (Math.pow(1,2) +

Math.pow(1,2))));

//sort in descending order for Paice OR

if (ScoreCalcA >= ScoreCalcB) {

MatchingListPaice[numDocs][1] = (float) ((float) Math.pow(.7, 0) * ScoreCalcA +

Math.pow(.7, 1) * ScoreCalcB /(Math.pow(.7, 0) + Math.pow(.7, 1)));

}

else {

MatchingListPaice[numDocs][1] = (float) ((float) Math.pow(.7, 0) * ScoreCalcB + Math.pow(.7, 1) * ScoreCalcA

/(Math.pow(.7, 0) + Math.pow(.7, 1)));

}

}

else { //AND

MatchingList[numDocs][1] = (float) (.4 * Math.min(ScoreCalcA, ScoreCalcB) + .6 *

Math.max(ScoreCalcA, ScoreCalcB));

MatchingListPNorm[numDocs][1] = Math.abs((float) (1 - Math.sqrt(( Math.pow(1,2)* Math.pow(1 -

ScoreCalcA, 2) + Math.pow(1,2)* Math.pow(1 - ScoreCalcB, 2))/ (Math.pow(1,2) +

Math.pow(1,2)))));

if (ScoreCalcA <= ScoreCalcB) {

MatchingListPaice[numDocs][1] = (float) ((float) Math.pow(1, 0) * ScoreCalcA + Math.pow(1, 1)

* ScoreCalcB /(Math.pow(1, 0) + Math.pow(1, 1)));

}

else {

MatchingListPaice[numDocs][1] = (float) ((float) Math.pow(1, 0) * ScoreCalcB + Math.pow(1, 1) *

ScoreCalcA /(Math.pow(1, 0) + Math.pow(1, 1)));

}

}

numDocs++;

}

}

}

}

else if (MAXB < MAXA) { //should come here if the number of docs in A is greater than number of docs in B

for (int r = 0; r < MAXB; r++) {

for (int t = 0; t < MAXA; t++){

AValue = StatsListA[t][1];

BValue = StatsListB[r][1];

if (AValue == BValue) {

MatchingList[numDocs][0] = (float)StatsListA[t][1];

MatchingListPaice[numDocs][0] = (float)StatsListA[t][1];

MatchingListPNorm[numDocs][0] = (float)StatsListA[t][1];

MatchingListBoolean[numDocs][0] = (float) StatsListA[t][1];


ScoreCalcA = StatsListA[t][2];

ScoreCalcB = StatsListB[r][2];


if (ScoreCalcA > 0) {

MAX1 = (float) Math.log(ScoreCalcA) + 1; // gets the tF

}

else {

MAX1 = 1;

}

if (ScoreCalcB > 0) {

MAX2 = (float) Math.log(ScoreCalcB ) + 1;

}

else {

MAX2 = 1;

}

MAX3 = (float) ((float) Math.log((TOTAL_NUM_DOCS +1) / (MAXA +1) ));

MAX4 = (float) Math.log((TOTAL_NUM_DOCS + 1) / (MAXB+1));

ScoreCalcA = (float) MAX1*MAX3;

ScoreCalcB = (float) MAX2*MAX4;

if (BoolTerm == 1) { //OR

MatchingList[numDocs][1] = (float) (.7 * Math.min(ScoreCalcA, ScoreCalcB) + .3 * Math.max(ScoreCalcA,

ScoreCalcB));

MatchingListPNorm[numDocs][1] = (float) Math.abs(Math.sqrt(( Math.pow(1,2)* Math.pow(ScoreCalcA,

2) + Math.pow(1,2)* Math.pow(ScoreCalcB, 2))/ (Math.pow(1,2) + Math.pow(1,2))));

if (ScoreCalcA >= ScoreCalcB) {

MatchingListPaice[numDocs][1] = (float) ((float) Math.pow(.7, 0) * ScoreCalcA + Math.pow(.7, 1) *

ScoreCalcB /(Math.pow(.7, 0) + Math.pow(.7, 1)));

}

else {

MatchingListPaice[numDocs][1] = (float) ((float) Math.pow(.7, 0) * ScoreCalcB + Math.pow(.7, 1) *

ScoreCalcA /(Math.pow(.7, 0) + Math.pow(.7, 1)));

}

}

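else { //AND

MatchingList[numDocs][1] = (float) (.4 * Math.min(ScoreCalcA, ScoreCalcB) + .6 * Math.max(ScoreCalcA,

ScoreCalcB));

MatchingListPNorm[numDocs][1] = (float) Math.abs((1 - Math.sqrt(( Math.pow(1,2)* Math.pow(1 -

ScoreCalcA, 2) + Math.pow(1,2)* Math.pow(1 - ScoreCalcB, 2))/ (Math.pow(1,2) +

Math.pow(1,2)))));

if (ScoreCalcA <= ScoreCalcB) {

MatchingListPaice[numDocs][1] = (float) ((float) Math.pow(1, 0) * ScoreCalcA + Math.pow(1, 1) *

ScoreCalcB /(Math.pow(1, 0) + Math.pow(1, 1)));

}

else {

MatchingListPaice[numDocs][1] = (float) ((float) Math.pow(1, 0) * ScoreCalcB + Math.pow(1, 1) *

ScoreCalcA /(Math.pow(1, 0) + Math.pow(1, 1)));

}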

}

numDocs++;

LAST_CHECK = BValue;

} //End if AValue == BValue

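else if ((AValue < BValue && BoolTerm == 1 && AValue > LAST_CHECK) || (BoolTerm == 1 && BValue <

AValue && r == (MAXB-1))) {

MatchingList[numDocs][0] = (float)StatsListA[t][1];

MatchingListPaice[numDocs][0] = (float)StatsListA[t][1];

MatchingListPNorm[numDocs][0] = (float)StatsListA[t][1];

MatchingListBoolean[numDocs][0] = (float) StatsListA[t][1];

ScoreCalcA = StatsListA[t][2];

ScoreCalcB = 0;

if (ScoreCalcA > 0) {

MAX1 = (float) Math.log(ScoreCalcA) + 1; // gets the tF

}

else {

MAX1 = 1;

}

if (ScoreCalcB > 0) {

MAX2 = (float) Math.log(ScoreCalcB ) + 1; ///(fileList[BValue]); // / StatsListB[t][3]);

}

else {

MAX2 = 0;

}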

MAX3 = (float)Math.log( (TOTAL_NUM_DOCS + 1) / (MAXA + 1));

MAX4 = (float) Math.log((TOTAL_NUM_DOCS + 1) / (MAXB + 1));

ScoreCalcA = (float) MAX1 * MAX3;

ScoreCalcB = (float) MAX2 * MAX4;

LAST_CHECK = StatsListA[t][1];

if (BoolTerm == 1) { //OR

MatchingList[numDocs][1] = (float) (.7 * Math.min(ScoreCalcA, ScoreCalcB) + .3 * Math.max(ScoreCalcA,

ScoreCalcB));

MatchingListPNorm[numDocs][1] = Math.abs((float) Math.sqrt(( Math.pow(1,2)* Math.pow(ScoreCalcA, 2) +

Math.pow(1,2)* Math.pow(ScoreCalcB, 2))/ (Math.pow(1,2) + Math.pow(1,2))));

if (ScoreCalcA >= ScoreCalcB) {

MatchingListPaice[numDocs][1] = (float) ((float) Math.pow(.7, 0) * ScoreCalcA + Math.pow(.7,

1) * ScoreCalcB /(Math.pow(.7, 0) + Math.pow(.7, 1)));

}

else {

MatchingListPaice[numDocs][1] = (float) ((float) Math.pow(.7, 0) * ScoreCalcB + Math.pow(.7,

1) * ScoreCalcA /(Math.pow(.7, 0) + Math.pow(.7, 1)));

}

}

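else{ //AND

MatchingList[numDocs][1] = (float) (.4 * Math.min(ScoreCalcA, ScoreCalcB) + .6 * Math.max(ScoreCalcA,

ScoreCalcB));

MatchingListPNorm[numDocs][1] = (float) (1 - Math.sqrt(( Math.pow(1,2)* Math.pow(1 - ScoreCalcA, 2)

+ Math.pow(1,2)* Math.pow(1 - ScoreCalcB, 2))/ (Math.pow(1,2) + Math.pow(1,2))));

if (ScoreCalcA <= ScoreCalcB) {

MatchingListPaice[numDocs][1] = (float) ((float) Math.pow(1, 0) * ScoreCalcA + Math.pow(1, 1) *

ScoreCalcB /(Math.pow(1, 0) + Math.pow(1, 1)));

}

else {

MatchingListPaice[numDocs][1] = (float) ((float) Math.pow(1, 0) * ScoreCalcB + Math.pow(1, 1) *

ScoreCalcA /(Math.pow(1, 0) + Math.pow(1, 1)));

}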

}

numDocs++;

}

}

}

}

//if one term has 0 matched files and it's an OR query, loop through the other terms matching files for score

if (MAXA == 0 && MAXB > 0 && BoolTerm == 1) {

for (int t=0; t< MAXB; t++) {


AValue = 0;

BValue = StatsListB[t][1];

MatchingList[numDocs][0] = (float)StatsListB[t][1];

MatchingListPaice[numDocs][0] = (float)StatsListB[t][1];

MatchingListPNorm[numDocs][0] = (float)StatsListB[t][1];

MatchingListBoolean[numDocs][0] = (float) StatsListB[t][1];


ScoreCalcA = 0;

ScoreCalcB = StatsListB[t][2];

if (ScoreCalcA > 0) {

MAX1 = (float) Math.log(ScoreCalcA) + 1 ; // gets the tF

}

else {

MAX1 = 0;

}

if (ScoreCalcB > 0) {

MAX2 = (float) Math.log(ScoreCalcB ) + 1; ///(fileList[BValue]); // / StatsListB[t][3]);

}

else {

MAX2 = 1;

}

MAX3 = (float) Math.log(1) ; //gets the IDF

MAX4 = (float) Math.log((TOTAL_NUM_DOCS + 1) / (MAXB + 1));

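ScoreCalcA = (float) 0;

ScoreCalcB = (float) MAX2 * MAX4;

LAST_CHECK = StatsListB[t][1];

if (BoolTerm == 1) { //OR

MatchingList[numDocs][1] = (float) (.7 * Math.min(ScoreCalcA, ScoreCalcB) + .3 * Math.max(ScoreCalcA,

ScoreCalcB));

MatchingListPNorm[numDocs][1] = Math.abs((float) Math.sqrt(( Math.pow(1,2)* Math.pow(ScoreCalcA,

2) + Math.pow(1,2)* Math.pow(ScoreCalcB, 2))/ (Math.pow(1,2) + Math.pow(1,2))));

if (ScoreCalcA >= ScoreCalcB) {

MatchingListPaice[numDocs][1] = (float) ((float) Math.pow(.7, 0) * ScoreCalcA + Math.pow(.7, 1) *

ScoreCalcB /(Math.pow(.7, 0) + Math.pow(.7, 1)));

}

else {

MatchingListPaice[numDocs][1] = (float) ((float) Math.pow(.7, 0) * ScoreCalcB + Math.pow(.7, 1) *

ScoreCalcA /(Math.pow(.7, 0) + Math.pow(.7, 1)));

}

}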

else //AND

{

MatchingList[numDocs][1] = (float) (.4 * Math.min(ScoreCalcA, ScoreCalcB) + .6 * Math.max(ScoreCalcA,

ScoreCalcB));

MatchingListPNorm[numDocs][1] = (float) (1 - Math.sqrt(( Math.pow(1,2)* Math.pow(1 - ScoreCalcA, 2) +

Math.pow(1,2)* Math.pow(1 - ScoreCalcB, 2))/ (Math.pow(1,2) + Math.pow(1,2))));

if (ScoreCalcA <= ScoreCalcB) {

MatchingListPaice[numDocs][1] = (float) ((float) Math.pow(1, 0) * ScoreCalcA + Math.pow(1, 1) * ScoreCalcB

/(Math.pow(1, 0) + Math.pow(1, 1)));

}

else {

MatchingListPaice[numDocs][1] = (float) ((float) Math.pow(1, 0) * ScoreCalcB + Math.pow(1, 1) * ScoreCalcA

/(Math.pow(1, 0) + Math.pow(1, 1))) ;

}

}

numDocs++;

}

}

//if one term has 0 matched files and it's an OR query, loop through the other terms matching files for score

else if (MAXB == 0 && MAXA > 0 && BoolTerm == 1) {

for (int t =0; t < MAXA; t++){

AValue = StatsListA[t][1];

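BValue = 0;

MatchingList[numDocs][0] = (float)StatsListA[t][1];

MatchingListPaice[numDocs][0] = (float)StatsListA[t][1];

MatchingListPNorm[numDocs][0] = (float)StatsListA[t][1];

MatchingListBoolean[numDocs][0] = (float) StatsListA[t][1];

ScoreCalcA = StatsListA[t][2];

ScoreCalcB = 0;

if (ScoreCalcA > 0){

MAX1 = (float) Math.log(ScoreCalcA) + 1;// gets the tF

}

else {

MAX1 = 1;

}

if (ScoreCalcB > 0) {

MAX2 = (float) Math.log(ScoreCalcB) + 1;

}

else

{

MAX2 = 0;

}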

MAX3 = (float)Math.log((TOTAL_NUM_DOCS + 1) / (MAXA + 1));

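MAX4 = (float) Math.log(1) ;

ScoreCalcA = (float) MAX1 * MAX3;

ScoreCalcB = (float) 0;

LAST_CHECK = StatsListA[t][1];

if (BoolTerm == 1) { //OR

MatchingList[numDocs][1] = (float) (.7 * Math.min(ScoreCalcA, ScoreCalcB) + .3 * Math.max(ScoreCalcA,

ScoreCalcB));

MatchingListPNorm[numDocs][1] = (float) Math.sqrt(( Math.pow(1,2)* Math.pow(ScoreCalcA, 2) +

Math.pow(1,2)* Math.pow(ScoreCalcB, 2))/ (Math.pow(1,2) + Math.pow(1,2)));

if (ScoreCalcA >= ScoreCalcB){

MatchingListPaice[numDocs][1] = (float) ((float) Math.pow(.7, 0) * ScoreCalcA + Math.pow(.7, 1) *

ScoreCalcB /(Math.pow(.7, 0) + Math.pow(.7, 1)));

}

else{

MatchingListPaice[numDocs][1] = (float) ((float) Math.pow(.7, 0) * ScoreCalcB + Math.pow(.7, 1) *

ScoreCalcA /(Math.pow(.7, 0) + Math.pow(.7, 1)));

}

}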

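else { //AND

MatchingList[numDocs][1] = (float) (.4 * Math.min(ScoreCalcA, ScoreCalcB) + .6 * Math.max(ScoreCalcA,

ScoreCalcB));

MatchingListPNorm[numDocs][1] = (float) (1 - Math.sqrt(( Math.pow(1,2)* Math.pow(1 - ScoreCalcA, 2) +

Math.pow(1,2)* Math.pow(1 - ScoreCalcB, 2))/ (Math.pow(1,2) + Math.pow(1,2))));

if (ScoreCalcA <= ScoreCalcB){

MatchingListPaice[numDocs][1] = (float) ((float) Math.pow(1, 0) * ScoreCalcA + Math.pow(1, 1) *

ScoreCalcB /(Math.pow(1, 0) + Math.pow(1, 1)));

}

else{

MatchingListPaice[numDocs][1] = (float) ((float) Math.pow(1, 0) * ScoreCalcB + Math.pow(1, 1) *

ScoreCalcA /(Math.pow(1, 0) + Math.pow(1, 1)));

}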

}

numDocs ++;

}

}

////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////

///////////////////////////////////////////////////////////////////////////////////////////////////////////////////////

//Sort and Print

////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////

///////////////////////////////////////////////////////////////////////////////////////////////////////////////////////

for (int i = 0; i < numDocs ; i++) {

float currentMax = MatchingList[i][1];

float currentPaiceMax = MatchingListPaice[i][1];

int currentPaiceIndex = i;

int currentMaxIndex = i;

float currentPNormMax = MatchingListPNorm[i][1];

int currentPNormIndex = i;

for (int j = i; j < numDocs; j++) {

if (currentMax < MatchingList[j][1] && MatchingList[currentMaxIndex][1] < MatchingList[j][1])

{

currentMax = MatchingList[j][1];

currentMaxIndex = j;

}

if (currentPaiceMax < MatchingListPaice[j][1] && MatchingListPaice[currentPaiceIndex][1] <

MatchingListPaice[j][1])

{

currentPaiceMax = MatchingListPaice[j][1];

currentPaiceIndex = j;

}

if (currentPNormMax < MatchingListPNorm[j][1] &&

MatchingListPNorm[currentPNormIndex][1] < MatchingListPNorm[j][1])

{

currentPNormMax = MatchingListPNorm[j][1];

currentPNormIndex = j;

}

}

if (currentMaxIndex != i) {

float temp0 = MatchingList[currentMaxIndex][0];

float temp11 = MatchingList[currentMaxIndex][1];

MatchingList[currentMaxIndex][0] = MatchingList[i][0];

MatchingList[currentMaxIndex][1] = MatchingList[i][1];

MatchingList[i][0] = temp0;

MatchingList[i][1] = temp11;

}

if (currentPaiceIndex != i) {

float temp12 = MatchingListPaice[currentPaiceIndex][0];

float temp3 = MatchingListPaice[currentPaiceIndex][1];

MatchingListPaice[currentPaiceIndex][0] = MatchingListPaice[i][0];

MatchingListPaice[currentPaiceIndex][1] = MatchingListPaice[i][1];

MatchingListPaice[i][0] = temp12;

MatchingListPaice[i][1] = temp3;

}

if (currentPNormIndex != i) {

float temp4 = MatchingListPNorm[currentPNormIndex][0];

float temp5 = MatchingListPNorm[currentPNormIndex][1];

MatchingListPNorm[currentPNormIndex][0] = MatchingListPNorm[i][0];

MatchingListPNorm[currentPNormIndex][1] = MatchingListPNorm[i][1];

MatchingListPNorm[i][0] = temp4;

MatchingListPNorm[i][1] = temp5;

}

}

int w = 0;

if (numDocs > 150){

w = 150;

}

else {

w = numDocs;

}

for (int p = 0; p < w; p++) {

System.out.println("Match " + (p + 1));

System.out.println(MatchingList[p][0] + " MMM " + MatchingList[p][1]);

System.out.println(MatchingListPaice[p][0] + " Paice " + MatchingListPaice[p][1]);

System.out.println(MatchingListPNorm[p][0] + " P-Norm " + MatchingListPNorm[p][1]);

System.out.println(MatchingListBoolean[p][0] + " Boolean " + MatchingListBoolean[p][1]);

System.out.println("+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++");

}

System.out.println("Number of docs found that match: " + TOTAL_NUM_MATCHING_DOCS);

System.out.println("Number of docs for Term 1: " + MAXA);

System.out.println("Number of docs for Term 2: " + MAXB);

return docFrequencies;

}

}
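The listing above repeats the same three scoring expressions (Mixed Min and Max, Paice, and P-Norm) in every branch. As a reading aid, the following is a minimal sketch of those formulas as standalone Java methods. The class name `FuzzyScoreSketch` and all method and parameter names are hypothetical and do not appear in the listing. The sketch assumes term weights normalized to [0, 1], uses the textbook Paice form in which the whole weighted sum is divided by the sum of coefficients, and computes the inverse document frequency in floating point. It is an illustration of the models, not a transcription of the listing's expressions.

```java
import java.util.Arrays;

/** Illustrative sketch only; names here are hypothetical, not taken from the listing. */
public class FuzzyScoreSketch {

    /** Log-scaled term frequency: ln(count) + 1 for a present term, 0 for an absent one. */
    public static double termFrequency(int count) {
        return count > 0 ? Math.log(count) + 1.0 : 0.0;
    }

    /** Smoothed inverse document frequency; the 1.0 literals force floating-point division. */
    public static double inverseDocFrequency(int totalDocs, int docsWithTerm) {
        return Math.log((totalDocs + 1.0) / (docsWithTerm + 1.0));
    }

    /** Mixed Min and Max: a linear blend of the smaller and the larger term weight. */
    public static double mmm(double a, double b, double minCoeff, double maxCoeff) {
        return minCoeff * Math.min(a, b) + maxCoeff * Math.max(a, b);
    }

    /**
     * Paice model: weights are ordered (descending for OR, ascending for AND),
     * the i-th weight is scaled by r^i, and the sum is normalized by the sum of r^i.
     */
    public static double paice(double r, boolean descending, double... weights) {
        double[] sorted = weights.clone();
        Arrays.sort(sorted); // ascending order
        double numerator = 0.0;
        double denominator = 0.0;
        for (int i = 0; i < sorted.length; i++) {
            double w = descending ? sorted[sorted.length - 1 - i] : sorted[i];
            double coeff = Math.pow(r, i);
            numerator += coeff * w;
            denominator += coeff;
        }
        return denominator == 0.0 ? 0.0 : numerator / denominator;
    }

    /** P-Norm OR for two equally weighted terms; assumes a and b lie in [0, 1]. */
    public static double pNormOr(double a, double b, double p) {
        return Math.pow((Math.pow(a, p) + Math.pow(b, p)) / 2.0, 1.0 / p);
    }

    /** P-Norm AND for two equally weighted terms; assumes a and b lie in [0, 1]. */
    public static double pNormAnd(double a, double b, double p) {
        return 1.0 - Math.pow((Math.pow(1.0 - a, p) + Math.pow(1.0 - b, p)) / 2.0, 1.0 / p);
    }

    public static void main(String[] args) {
        double a = 0.6; // hypothetical weight of term A in a document
        double b = 0.8; // hypothetical weight of term B in the same document
        // The coefficients .7/.3, .4/.6, r = .7, r = 1 and p = 2 mirror the constants in the listing.
        System.out.println("MMM OR      " + mmm(a, b, 0.7, 0.3));
        System.out.println("MMM AND     " + mmm(a, b, 0.4, 0.6));
        System.out.println("Paice OR    " + paice(0.7, true, a, b));
        System.out.println("Paice AND   " + paice(1.0, false, a, b));
        System.out.println("P-Norm OR   " + pNormOr(a, b, 2.0));
        System.out.println("P-Norm AND  " + pNormAnd(a, b, 2.0));
    }
}
```

Collecting the formulas in one place makes two details easier to check against the listing: whether the Paice division applies to the whole weighted sum or only to its last term, and whether the document-count ratio inside the logarithm is evaluated with integer or floating-point division.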
