USING FUZZY SETS FOR RETRIEVAL OF SOFTWARE FOR REUSE
A dissertation presented in partial fulfillment of the requirements for the degree of Doctor of Computer Science
By
Erin Colvin
Colorado Technical University
December, 2014
© Erin Colvin, 2014
Committee
Dr. Donald Kraft, Ph.D., Chair
Dr. Caroline Howard, Ph.D., Committee Member
Dr. Bo Sanden, Ph.D., Committee Member
Date Approved
5/30/2014
Abstract
This dissertation analyzes the use of fuzzy sets for retrieving software for the purpose of
reuse. The need for quality software components is growing exponentially, but the ability of
software developers to meet this need is not. One major goal of any programmer is to develop
quality software in an efficient amount of time. By reusing an already created and tested piece
of code, programmers can do just that. Most development teams shy away from reuse because they
cannot find quality software quickly. Software needs to be easily accessible, and the results of
a query to find quality software need to meet a user’s expectation. Most software search
algorithms are based on a Boolean search, where a term used to describe a given software
component either matches or doesn’t match a queried term. Here, using fuzzy logic, a term is
given a weight based on its degree of membership, which results in a weighted list of matches
while maintaining the semantics of Boolean logic. The result is a list of matched documents in
descending order by how well each document matches the queried term.
Various methods of information retrieval implementation, an analysis of fuzzy set retrieval,
and the benefits of its use for software reuse are examined and presented in this dissertation.
The fundamentals of designing a fuzzy information retrieval system for software lookup are
also explained, along with future research options and the necessary data storage systems.
Acknowledgements
I would like to express the deepest appreciation to my mentor and committee chair Dr. Donald
Kraft, who has shown the attitude and the substance of nothing short of a saint and a genius: he
continually encouraged me and conveyed a spirit that nothing is impossible in regard to my
research. He opened many doors to other researchers and always knew the right thing to say to
encourage me to continue on and finish this dissertation. His wit and humor always put a smile
on my face and without his constant help this dissertation would not have been possible.
I would also like to thank my readers and other committee members, Dr. Howard and Dr.
Sanden. You both have been inspirational faculty to me while at Colorado Technical University,
and I thank you for the time and advice that you have given me over the past 3 years. I hope
one day I will be as amazing and inspiring as you have been as part of the faculty at CTU.
Dedication
This paper and degree would not have been possible without the continued support of my
wonderful husband. Thank you for all you do serving our country and our family. Thank you for
stepping up when my head was in a book. I love you.
To my parents, who taught me that I could do anything I set my mind to: I love you, Mom
and Dad, and thanks for all the support. Lastly, to my kids, who I hope learn that any goal can
be accomplished if you set your mind to it.
Table of Contents
Abstract
Acknowledgements
Dedication
List of Figures
CHAPTER 1: INTRODUCTION
BACKGROUND OF THE PROBLEM
Software Reuse
Information Retrieval
Fuzzy Retrieval
Purpose of the Study
Statement of the Problem
Research Questions
Hypothesis
Theoretical Framework
Objectives of the Study
Assumptions and Limitations
Summary
CHAPTER 2: REVIEW OF THE LITERATURE
Introduction to Software Reuse
Searching for Software
Information Retrieval - Boolean Logic
Fuzzy Sets and Extended Boolean Logic
Document Pre-processing
Information Retrieval Software
Measures of similarity
Literature Review Summary
CHAPTER 3: METHODOLOGY
Lucene
Procedure
Data
Sample Size
Instrumentation
Validity and Reliability
Conclusion
CHAPTER 4: RESULTS
Data Collection Reviewed
Presentation and Discussion of Findings
Limitations of the Study
Measure of Similarity
Quantitative Methodology Measurement
Hypothesis Testing
Conclusion
CHAPTER 5: DISCUSSION OF RESULTS AND FUTURE WORKS
Research Question 1
Research Question 2
Research Question 3
Future Research
Conclusion
References
Appendix A
Appendix B
List of Figures
Table 1: Values calculated by Boolean logic search for term A and B
Table 2: Values calculated by vector-processing logic search for term A and B
Table 3: Returned data with file scores to show MAP calculation
Table 4: Similarity scores with file numbers to show ranking
Table 5: AP and MAP scores for all searches
Table 6: Search MAP scores and the ranking of best to worst search
Table 7: MAP scores for all searches and % increase over Boolean search
Figure 1: Indexing Process
Figure 2: List of files found in the UNIX data corpus
Figure 3: The alarm.1 file found in the data corpus
Figure 4: User interface for custom created software
Figure 5: List of files found in the UNIX directory
Figure 6: Graph of AP among all searches for all queries
Equation 1: RSV for an AND query
Equation 2: MMM similarity for an OR query
Equation 3: MMM similarity for an AND query
Equation 4: Paice similarity
Equation 5: P-Norm similarity for an OR query
Equation 6: P-Norm similarity for an AND query
Equation 7: Average precision
Equation 8: Mean average precision or MAP
CHAPTER 1: INTRODUCTION
The main goal of any programmer is to develop and deliver high quality software
applications that meet a customer’s needs effectively and efficiently. For years, programmers
have searched the web and other open source libraries for software components to reuse instead
of creating an application from scratch (Thummalapenta, 2011). The number of open-source
software libraries on the web has increased as well. Software reuse has proven to be an effective
tool for developers to meet time-to-market deadlines and produce a solid, error-free piece of
software (Krueger, 1992). Software libraries are full of software components that have been
created and stored with proven working track records. Mockus (2007) found that more than 50%
of the open source files in his study had been used in more than one program. The trick is being
able to find an already-written piece of software when needed. When searching for data, most
users try to express what they are looking for in as few words as possible, and the system yields
the results that it determines to be the best match (Bordogna, Carrara & Pasi, 1992). Searching
for data can be done in many ways; searching for software requires an alternative approach since
the programming language, the behavior of the software, or the intended purpose may require
different search terms.
Many information retrieval (IR) applications today use a Boolean match: a document
either contains the searched term or it doesn’t. The IR system returns a list of documents in no
particular order, requiring the user to discern which match is the closest fit. Fuzzy logic attaches
a weighted value to those matches based on the degree of match, which gives a better
chance of meeting the user’s goals (Bordogna, Carrara & Pasi, 1992).
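The contrast between the two kinds of match can be sketched in a few lines of Python. The
sketch is illustrative only: the three-document corpus, the function names, and the choice of
normalizing a term's count by the largest count in the corpus are assumptions made for this
example, not the weighting scheme used in the study.

```python
from collections import Counter

# Hypothetical three-document corpus, invented for illustration only.
DOCS = {
    "doc_a": "sort a list of numbers then sort again and sort once more",
    "doc_b": "copy a file and sort nothing else",
    "doc_c": "print the current date",
}


def boolean_match(term, docs):
    """Return the unordered set of documents containing the term at least once."""
    return {name for name, text in docs.items() if term in text.split()}


def weighted_match(term, docs):
    """Return (name, weight) pairs sorted by descending weight.

    The weight is the term's count divided by the largest count in any
    document, so it lies in [0, 1]. This normalization is an assumption
    chosen for the sketch.
    """
    counts = {name: Counter(text.split())[term] for name, text in docs.items()}
    top = max(counts.values())
    if top == 0:
        return []  # the term occurs nowhere in the corpus
    ranked = [(name, count / top) for name, count in counts.items() if count > 0]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)
```

For the term sort, the Boolean match returns doc_a and doc_b as an unordered set, while the
weighted match lists doc_a (weight 1.0) ahead of doc_b (weight about 0.33).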
By using software that is already created, tested and verified to work, programmers can
reduce development and test time. Less development and test time means a faster time to
market. Software reuse can involve any of four types of reusable artifacts: data reuse,
architecture reuse, design reuse, and program reuse (Aziz & North, 2007). If software returned
from a search can be listed by the degree of match, a user will have a better choice of which
software component to use. Unlike a Boolean search, a query using a fuzzy method returns a
weighted list of results. The goal of this study is to implement algorithms using fuzzy
logic that will have a higher success rate of returning a better match of software that can be
reused.
BACKGROUND OF THE PROBLEM
Software Reuse
Software reuse was introduced in 1968 by McIlroy at a NATO conference on software
engineering. McIlroy explained that the industry needed standardization so that stored
software could be put to use and compatible components that were already created and
available could be found easily (McIlroy, 1968). This seminal work
has been cited many times. Krueger explained that software reuse was a great way for software
engineers to reutilize software components in order to save their company money. He stated that
reuse of software has many benefits, such as minimized rework, more time for development and
more stable applications (Krueger, 1992).
Sojer and Henkel in 2010 found that developers are more inclined to reuse software
components if they were easily accessible or easily found in a search. Looking at open-source
software applications, they also found that with no standard for software implementation and
most programs lacking the descriptive comments that help determine the use of the software
component, finding the right software components was difficult to impossible (Sojer & Henkel,
2010). Software reuse has been around since the 1960s but has not yet become standard industry
practice. One reason for this is the lack of a standard for software storage. Software can be
indexed or stored by behavior, function, text and comment text (Vishal, Chander & Kundu,
2012). This makes searching for software difficult and requires a search algorithm to conform to
multiple attributes. This research will not look at how software is stored or at the comments (if
any), but the point is made that software can be stored and accessed in multiple ways.
Information Retrieval
Information retrieval (IR) has been the topic of many research studies, from the advent of
information retrieval algorithms in the 1950s, which served the data stores of the new computers
of the time, to the present-day Internet (Kraft, Bordogna & Pasi, 1998) (Pasi & Bordogna, 2013)
(Singhal, 2001). Each search has a primary target set of data, from a simple web search for
general information to specifics like restaurants near a certain location; each can utilize different
contextual information to help with the search (Baeza-Yates & Ribeiro-Neto, 2011) (Belkin &
Croft, 1992).
The basic structure of any information retrieval system is an archive where documents or
data are held and a search engine which will retrieve matches of a search query (Kraft, Bordogna
& Pasi, 1998). Matching in a Boolean information retrieval system doesn’t allow the relevance
of document matches to affect the outcome of the search (Bordogna & Pasi, 1993). A document
that contains the searched word once is returned as a match equal to a document that contains
the word one hundred times. Because of this limitation, a Boolean search doesn’t always return
the most desired results. One adaptation is to add weighted values to documents with more
relevant matches. The main limitation of this approach is knowing which term deserves a higher
value (Bordogna & Pasi, 1993). When searching for a sentence or for multiple terms in a phrase,
it is hard to decide which term should get the highest weight and, when a document is returned
as a match, how best to quantify a qualitative return (Bordogna & Pasi, 1993).
Another adaptation is adding weights for specific query terms. For example, suppose a user
searches the Java library for a function that adds two numbers and expects it to be indexed
under the keyword sum. If the user isn’t sure that sum is the right keyword and is afraid of
missing documents by using only sum, they can put a weight on the word sum and expand the
query to include sum OR add OR plus. This gives preference to documents that contain sum,
which is what one would expect of the primary keyword. A Boolean search system cannot express
the importance of terms in a desired document in this way, and this inability is its main
limitation (Bordogna, Carrara & Pasi,
1992).
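A weighted query of this kind might be sketched as follows. The query weights (1.0 for sum,
0.6 for add, 0.3 for plus), the four component descriptions, and the use of the maximum as
the fuzzy OR are all assumptions made for illustration; they do not come from the study.

```python
# Hypothetical weighted query: "sum" is the preferred term, the others are fallbacks.
QUERY = {"sum": 1.0, "add": 0.6, "plus": 0.3}

# Hypothetical component descriptions, each reduced to a set of index terms.
COMPONENTS = {
    "IntegerUtil": {"sum", "integer", "return"},
    "ListHelper": {"add", "element", "list"},
    "Formatter": {"plus", "sign", "format"},
    "Clock": {"time", "zone"},
}


def score(query, doc_terms):
    """Score a document for a weighted OR query.

    Each query term contributes its weight if the document contains it;
    the OR of the contributions is taken as their maximum (a common
    fuzzy-set choice, assumed here).
    """
    return max((w for term, w in query.items() if term in doc_terms), default=0.0)


def rank(query, docs):
    """Return (name, score) pairs with a non-zero score, best match first."""
    scored = [(name, score(query, terms)) for name, terms in docs.items()]
    matched = [pair for pair in scored if pair[1] > 0]
    return sorted(matched, key=lambda pair: pair[1], reverse=True)
```

Ranking the hypothetical components places IntegerUtil (which contains sum) first, then
ListHelper, then Formatter, and omits Clock. A plain Boolean OR would return the first three
as equal matches.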
Fuzzy Retrieval
Using fuzzy logic for information retrieval is not a new concept; it was considered by
Radecki and Kraft in 1979, among others, and helped establish the application of fuzzy set
theory (Pasi & Bordogna, 2013). The three basic models of IR today are the Boolean, vector and
probabilistic models (Baeza-Yates & Ribeiro-Neto, 2011).
A fuzzy information retrieval system contains the same two parts as most IR
systems: an archive of documents and a search engine. Often, a thesaurus can be used to
consider other terms related to the original terms in the document or the query. The system then
assigns a value of 0 for no match at all and 1 for an exact match; the numbers in between are used as
weights that are assigned based on the degree of match (Kraft, Bordogna & Pasi, 1998). This
study looks at using term weights for the information retrieval of software components in an in-
house software library for reuse.
Purpose of the Study
This study draws on several areas. The first is software reuse: why it is not more widely
practiced and one way this can be overcome. This study also
looks at information retrieval techniques and, more specifically, at combining fuzzy logic with
information retrieval when looking for software, to increase the chance of reuse.
This study looks at an enhanced fuzzy set using index term weights, which should increase the
likelihood of a match when searching for software components. By utilizing many of the tools
involved with fuzzy set theory, this study enhances a search for in-house data on software
components. With a more successful return rate of software from a software library search, the
likelihood of reuse also increases.
Statement of the Problem
Current information retrieval algorithms are ineffective in finding software components
because they search open source software found on the Internet, are constructed using a
Boolean logic based matching system, or both. It is well documented in
the literature that in order for software to be reused, it has to be found and that the current
information retrieval algorithms are ineffective in finding quality software components in an
efficient manner (Prieto-Diaz, 1991) (Yao, Etzkorn & Virani, 2007).
Research Questions
The inability of any current information retrieval algorithm to accurately and efficiently
find matching software raises three questions:
Q1: Can software be retrieved using a fuzzy logic algorithm that returns a more accurate match to
the user’s query?
Q2: Which algorithm, mixed min and max (MMM), P-norm, or Paice, provides the most precise
match results, and are they all better than the Boolean method?
Q3: Can a fuzzy logic approach to searching for software components reduce the number
of query words needed to find an appropriate match?
Hypothesis
Using term weights calculated in the Lucene software, a researcher-developed search
implementing the MMM, P-norm and Paice algorithms will result in a better matched list of
returned values than the standard Boolean logic method. Looking at the degree of membership
should yield a higher rate of successful match than the Boolean method, and a higher success rate
in returned software components should result in a higher reuse rate.
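For reference, the three extended Boolean measures named in the hypothesis can be sketched
as below, using the forms in which they commonly appear in the information retrieval
literature. The coefficient defaults (0.6 and 0.4), the Paice parameter r, and the exponent
p = 2 are placeholder assumptions rather than the values used in this study, and the function
names are hypothetical. Each function takes the weights (degrees of membership in [0, 1]) that
one document has for the query terms.

```python
def mmm_or(weights, c_or1=0.6, c_or2=0.4):
    """Mixed min and max for an OR query: leans toward the maximum weight."""
    return c_or1 * max(weights) + c_or2 * min(weights)


def mmm_and(weights, c_and1=0.6, c_and2=0.4):
    """Mixed min and max for an AND query: leans toward the minimum weight."""
    return c_and1 * min(weights) + c_and2 * max(weights)


def paice(weights, r, operator):
    """Paice similarity: every weight contributes, scaled by r**(position).

    Weights are taken in descending order for OR and ascending order for AND,
    so the dominant weight for each operator receives the largest coefficient.
    """
    if operator not in ("and", "or"):
        raise ValueError("operator must be 'and' or 'or'")
    ordered = sorted(weights, reverse=(operator == "or"))
    numerator = sum(r ** i * w for i, w in enumerate(ordered))
    denominator = sum(r ** i for i in range(len(ordered)))
    return numerator / denominator


def pnorm_or(weights, p=2.0):
    """P-norm similarity for an OR query with equally weighted query terms."""
    return (sum(w ** p for w in weights) / len(weights)) ** (1.0 / p)


def pnorm_and(weights, p=2.0):
    """P-norm similarity for an AND query with equally weighted query terms."""
    return 1.0 - (sum((1.0 - w) ** p for w in weights) / len(weights)) ** (1.0 / p)
```

With p = 1 both P-norm forms reduce to the plain average of the weights, and as p grows they
approach the strict maximum (OR) and minimum (AND) of standard fuzzy set operations.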
Theoretical Framework
Information retrieval can be designed using Boolean logic or fuzzy logic, for example,
the vector space and probabilistic models. It has been shown in the literature that Boolean logic
is the easiest, cleanest, most common method for creating an information retrieval system but not
the most successful because it lacks a ranking method. Searching for software has primarily been
done on the Internet via open-source code data banks which can store the data in many different
formats and can be hard to search. Based on the literature, there have been no software
information retrieval systems implemented with fuzzy logic; this study will use fuzzy logic for
information retrieval for software components to increase the reuse of software. If software is
easily accessible and retrievable, it can be reused, saving programmers’ time and
development effort and increasing the time available for new program development (Mili, Mili &
Mili, 1995).
This study executes a researcher-developed application that uses the indexing plug-in of
Lucene 4.6.1 and then executes a researcher-written search application that uses the
Lucene search plug-in. The search used the tf/idf (term frequency/inverse document frequency)
calculation as the term weights, which were then used to calculate the similarity of a query
under the mixed min and max (MMM), P-norm, Boolean and Paice algorithms, all
written in Java using the Eclipse development environment. Searched “documents” can be
anything from a text file to a song to a .JPEG, but for the purpose of this study we use
text files that contain software help instructions for the UNIX operating system. The Lucene
indexer accepts text extracted from any type of file and combines it into one index, no matter
the starting format. The system also uses the stemming functionality built into Lucene to
combine versions of words that stem to the same root, such as running and runs. The resulting data
was presented and analyzed using standard precision, recall and the mean average precision
(MAP) measure.
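A textbook form of the tf/idf weight is sketched below to make the idea concrete. It rests on
several assumptions: it uses raw term counts and log(N/df), scales the result by the largest
weight so that values fall in [0, 1] and can be read as degrees of membership, and skips
stemming. Lucene computes its own scoring internally, and this sketch does not reproduce that
formula or the study's Java code; the toy corpus in the usage note is invented.

```python
import math
from collections import Counter


def tf_idf_weights(docs):
    """Compute normalized tf-idf weights.

    docs maps a document name to its list of tokens. The result maps each
    document name to a dict of term -> weight in [0, 1], where
    weight = count * log(N / df), divided by the largest such value.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))  # count each term once per document

    raw = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        raw[name] = {
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }

    top = max((w for ws in raw.values() for w in ws.values()), default=0.0)
    if top == 0:
        # Every term occurs in every document, so nothing discriminates.
        return {name: {term: 0.0 for term in ws} for name, ws in raw.items()}
    return {name: {t: w / top for t, w in ws.items()} for name, ws in raw.items()}
```

For a hypothetical corpus of three man-page-like documents, alarm = [alarm, signal, alarm],
sleep = [sleep, signal] and kill = [kill, signal], the term signal gets weight 0 everywhere
because it appears in every document, while alarm gets the top weight in the alarm document.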
Objectives of the Study
The objective of the study was to create a methodology that can be used to determine whether
fuzzy logic can improve the return of software components for the intended purpose of
reuse. The literature presented in this study indicates that software reuse increases when a
search returns better software matches. Increasing the reuse of software can
decrease the time developers spend on projects, which allows new projects to be created
and yields a more stable and reliable product.
Assumptions and Limitations
There are several assumptions and limitations for this research. The first limitation is that
the indexing is done using the Lucene software application; more detailed information can be found
on its website, http://www.apache.lucene.org. This research also uses the UNIX help
library for its data corpus. The man help files were downloaded for the latest version of the
FreeBSD software. No other software library was used. There were no software libraries
available on the TREC website or any other website. Using man pages instead of software files
may change the way the data is searched or the type of data that can be searched but should not
affect the search itself. There is no test data available to show this difference. The data will be
compared to the results of the study done by Maarek to show that the results achieved are in fact
the results expected. The number of relevant files should be similar to the study by Maarek,
since the data corpus is similar. Other exceptions will be noted and documented.
To determine what is relevant and aid the calculation of recall and precision, a panel
of UNIX experts was asked to rank the returned files in order from most relevant to
least. Ideally there would be a panel of software users and experts to help determine relevance of
UNIX files.
Summary
Chapter 1 has looked at the background issues associated with information retrieval and
software reuse. It has also discussed the need for a good information retrieval system that
will successfully retrieve software components to help software be reused. Software can only be
reused if it is readily available, and the cost to reuse must be lower than the cost to create from
scratch. Software components that have been used previously are tested and usually well
documented, reducing the need for new development. Information retrieval can be executed in a
number of ways, including Boolean logic or fuzzy logic like the vector space and probabilistic
models. Boolean logic will return all documents that contain the queried word; fuzzy logic will
return a varying degree of match based on the document’s use of the queried word.
The purpose of this study is to create a search for software in a document corpus local to
the computer versus searches in the literature that look at software on the internet. This study
will utilize the Lucene indexing application to create the index and then a researcher modified
search using term weights. Recall, precision and MAP will be calculated and used to compare
the search algorithms to each other; the higher the MAP value the better the algorithm is at
retrieving successful matches. Chapter 2 looks at the current literature that discusses the topic of
software reuse and information retrieval along with other studies that have used information
retrieval for software. Chapter 3 will explain the methodology used in this study and how the
system is created and executed.
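Average precision and MAP, as they are commonly defined, can be sketched as follows. The
rankings and relevance judgments in the usage note are invented for illustration, and the
function names are hypothetical.

```python
def average_precision(ranked, relevant):
    """Average precision (AP) for one query.

    ranked is the list of returned documents, best match first; relevant is
    the set of documents judged relevant. Precision is sampled at the rank
    of each relevant document that was returned, and the sum is divided by
    the total number of relevant documents.
    """
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for position, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / position  # precision at this cut-off
    return total / len(relevant)


def mean_average_precision(runs):
    """MAP: the mean of AP over a sequence of (ranked, relevant) pairs."""
    runs = list(runs)
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

For a ranking a, b, c, d in which a and c are relevant, AP is (1/1 + 2/3) / 2, or about 0.83.
A search that places relevant files nearer the top of its list earns a higher AP, which is
why a higher MAP indicates a better algorithm.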
Chapter 4 will discuss the details of this study and the data that was produced by the
experiment. The data will be discussed in detail as to what it means for this study and for other
research. Chapter 4 will answer the research questions and discuss how this study either
supports or refutes the study’s hypothesis. Chapter 5 goes into detail about future research that
can expand on this study or take this study in a new direction.
CHAPTER 2: REVIEW OF THE LITERATURE
Chapter 1 described the need for software reuse in today’s programming industry and
explained some of the reasons why it is not widely used in everyday practice. One reason is the
need for an accurate look-up or retrieval system that can quickly and accurately return a query
with an appropriate software component. There are many methods to implement information
retrieval; the one at the focus of this study uses fuzzy sets. Because fuzzy set theory offers a
wider range of possible matches through its degree of match, it should deliver the most accurate
result to a user’s query.
This chapter will examine the literature that supports the need for software reuse from
when it was introduced in 1968 to today. Information retrieval algorithms come in three major
forms, Boolean, vector space and probabilistic; those applications will be looked at with a
comparison of a Boolean search vs. term-weighted applications. Finally, this chapter explores
the literature of fuzzy sets and how they can be used for information retrieval and their benefit
over other methods.
Introduction to Software Reuse
Software reuse was first introduced by McIlroy at a NATO conference in 1968 (McIlroy,
1968). McIlroy (1968) understood the importance of creating a solid software component and
the need to create an inventory system that will allow these components to be widely accessible
to different machines and users. McIlroy said then that in order for reuse to be widely used,
there is a need for a standardized library to store and index software components (McIlroy,
1968). This issue has been widely discussed through the literature as the primary issue that is
hindering software reuse today from becoming an industry standard (Mili, Mili & Mili, 1995).
There are two major benefits to software reuse, as noted by Gibb, McCartan,
O'Donnell, Sweeney and Leon: 1. “Those components that have already been tested provide
higher guarantees of robustness and reliability in any future implementation and 2. Component
reuse should lead to faster development times and lower costs” (2000, p.212). With the
increasing demands for software development and the inability of programmers to keep up,
software reuse is a practice that can reduce the development time and lead to the increased
stability of a system (Yao, Etzkorn & Virani, 2008). Software reuse was first introduced as a
way to minimize creation time and help build a more stable system with components that have
been previously created and tested (Krueger, 1992). Charles Krueger states that although this
practice was introduced in the 1960’s it is still a practice that is not widely used today in software
engineering. Krueger goes on to say that software reuse can be defined as the “direct reuse of
components or code, the abstraction of ideas, or adaptation of software to fit the needs of others”
(1992, p.131). Software reuse also reduces effort and development time, which decreases time to
market for certain systems. By decreasing time to market, companies can take on work
and projects that they would normally not be able to handle. The reuse of quality tested
software also means less down time and fewer repairs for network maintenance people (Vishal,
Chander & Kundu, 2012).
In 2012, it was reported that software reuse had a 91% effect on lowering development
time, shortened the testing time by 83%, increased the overall product quality by 76% and
shortened time to market by 72% (Kokkoras, Ntonas, Kritikos, Kakarontzas & Stamelos, 2012).
Hewlett-Packard has shown improvements in product quality from code reuse ranging from 24% to
76% and improvements in time to market ranging from 12% to 42% (Keswani, Joshi & Jatain, 2014).
Reuse also improves the quality of the component itself: the more times a component is reused,
the more chances there are for bugs to be found and fixed. In addition, the initial cost of
creating the component can be recovered in just a few reuses (Vishal, Chander & Kundu, 2012).
In their 1995 paper, Mili, Mili and Mili point out that in 1984, 60% of software created
could have been standardized and reused. Quantifying the amount of software that is actually
reused has posed a problem for researchers. Mockus (2007) reports that more than 50% of the open
source software files available to his study had been used in more than one program, where reuse
was counted as a file being present in more than one program. This covers only open source
code that is widely available on the internet; Mockus’s 2007 study didn’t look at in-house
software libraries within a company. For a quantifiable measure of reuse, Mockus drew on a
previous empirical study by NASA whose algorithm searched directories of source code files and
computed the number of shared files as a fraction of the total number of files (Mockus,
2007). In total, Mockus (2007) looked through 13.2 million open source code files and found
that 0.52, or 52%, of the files had been shared at least once. This suggests that a file shared
on an open source website has roughly a 50% chance of being reused (Mockus, 2007).
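The shared-file measure described above can be illustrated with a minimal sketch. It assumes, as a simplification, that a file counts as shared when the same file identifier appears in more than one project; the project names and identifiers are hypothetical, and this is not Mockus's actual algorithm.

```python
from collections import defaultdict


def shared_file_fraction(projects):
    """Return the fraction of distinct files that occur in more than one project.

    `projects` maps a project name to the set of file identifiers it contains
    (hypothetical content hashes in this sketch).
    """
    owners = defaultdict(set)
    for project, files in projects.items():
        for file_id in files:
            owners[file_id].add(project)
    if not owners:
        return 0.0
    shared = sum(1 for used_by in owners.values() if len(used_by) > 1)
    return shared / len(owners)


# Hypothetical data: four distinct files, two of which appear in both projects.
projects = {
    "editor": {"hash_a", "hash_b", "hash_c"},
    "viewer": {"hash_a", "hash_b", "hash_d"},
}
fraction = shared_file_fraction(projects)
```

Here two of the four distinct files are present in both projects, so the fraction is one half.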
One issue with software reuse is the vast amount of information that is considered
software. Software, in general, consists of coded lines, comment lines and possibly a behavior
diagram (Aziz & North, 2007). Software reuse can include one of four types of reusable
artifacts: data reuse, architecture reuse, design reuse, and program reuse (Aziz & North, 2007)
(Mili, Mili & Mili, 1995). Each of these four categories contains items that, if reused, qualify
as software reuse and need to be considered when items are being stored in a repository. Data
reuse is the reuse of standardized data formed or used by a software component. Architecture
reuse is the reuse of a standardized set of design and programming techniques dealing with the
organization of the software. Design reuse is the reuse of a standardized layout of software
components. Lastly, program reuse is the reuse of any or all pieces of code used in a software
application (Mili, Mili & Mili, 1995). Rothenberger, Dooley, Kulkarni and Nada (2003) studied
71 software development groups, asking different questions about the reuse of software.
They found that the biggest hurdle is finding reusable software that matches the group’s current
architecture, and that even when the code or components found don’t match exactly, groups
reuse some part of the code anyway, whether by rewriting it or adapting it to fit their system.
The demand for quality software and the sustainability of code that has already been tested was
the number one driving factor they found in their study.
In a 2012 survey, 87% of software engineers said code was the most important reusable
artifact, while 80% said design was the most reusable and 75% said documentation was the most
reusable of software artifacts (participants were allowed to vote for more than one) (Kokkoras,
Ntonas, Kritikos, Kakarontzas & Stamelos, 2012). This diversity of opinion on the most
important artifact makes it difficult to create a reusable repository of
software components.
Before software can be reused, it has to be stored so that it can be found easily; this
is where programmers differ in opinion (Mili, Mili & Mili, 1995). You have to be able to
find the correct software components before they can be reused, says Barringer (1984). This has
led researchers to look for an effective storage and retrieval method for finding quality
software, and to the creation of software libraries that can be easily searched and
retained. Burton, Aragon, Bailey, Koehler and Mayes (1987) designed a software library for just
that purpose, the storage and reuse of software components. Their library was specifically for
Ada components; the Ada programming language supports the reuse of components from legacy
systems with its use of generic and packaged components. The reusable software library stored
attribute values based on a software component’s functionality, complexity, structure, quality
of documentation, and level of testing. These attribute values were then used for comparison
when a query was made (Burton, Aragon, Bailey, Koehler & Mayes, 1987).
Sandhu, Kaur and Singh (2009) looked at reusing software from currently active systems,
which they said had not previously been examined; most software reusability work is based on
older versions of software that are stored and no longer in use. They observed that in order for
programmers to reuse software, they must first find it useful (Sandhu, Kaur & Singh,
2009). They found that to get a more accurate measure of reuse, the domain must also be taken
into consideration, and devised a neural network that would automatically evaluate the
reusability of object oriented software components. Their metric was successful in measuring the
reusability of software but did not perform as well as those of other similar studies (Sandhu,
Kaur & Singh, 2009).
It has been shown that if programmers can find good quality software they are more
likely to reuse the code segment. In one study, programmers were asked specifically what drives
them to reuse code and what means they are most likely to use to find the code. Programmers
said that the main means of finding software was searching, either the internet or
databases they had access to, such as SourceSafe, a source code check-in and check-out
application where developers can maintain their finished product (Sim, Clarke & Holt, 1998).
The biggest drivers of searching for software components included not wanting to rewrite large
sections of code such as a sort or search, wanting to understand current code implementations,
and code repair. In addition, the “desire to work on preferred tasks should lead developers to
reuse code that they prefer not to write on their own”, and resource constraints such as lack
of time and testing resources, as well as monetary incentives, encouraged developers to reuse
code (Haefliger, von Krogh & Spaeth, 2008, p. 183) (Sim, Clarke & Holt, 1998). In agreement
with the study by Haefliger et al., a study by Agresti (2011) found that programmers saw a 26%
increase in productivity by reusing old code, and that overall programmers are willing to reuse
code no matter the time constraint: good code is good code. This study also found that
programmers did not in fact hold an “if you want something done right, do it yourself” mentality,
but that one of the biggest deterrents to reusing code is a lack of documentation of what the
code actually does, or overly difficult code (Agresti, 2011). So far the studies examined have
dealt with technical obstacles to the reuse of software, but Morisio, Ezran & Tully (2002)
found that non-technical issues played just as much of a role in why
companies don’t reuse code. By interviewing those persons involved with certain projects, they
deduced that overall companies were reluctant to change when they had a system that worked,
and that although there are many advantages to reusing code, the process seemed to halt at the
top management level. Executives need to get on board with the process and procedures, and they
need to do more than set up a repository of software in order for software reuse to become a
widely practiced method of software development (Morisio, Ezran & Tully, 2002). Other
non-technical issues discussed in a 2014 paper by Keswani, Joshi and Jatain include economic
barriers (many organizations aren’t able to afford the development of reuse groups),
administrative impediments (sometimes reuse across different business units isn’t feasible),
political impediments (programmers and management are often wary of people in other
groups or organizations and skeptical of using their code), and lastly psychological impediments
(programmers want code that they understand and, even though it is time consuming, are willing
to write all the code themselves). The biggest technical hurdle that Keswani, Joshi and Jatain
identified in their paper was the lack of technical skill of programmers. With such a high
demand for software developers, companies are willing to overlook the college or fundamental
training of good programmers, and when it comes to understanding other people’s code, those
programmers lack the knowledge of design patterns and proper frameworks needed to understand how
to reuse code. Whatever the motivation for reusing software, this paper will show that code
search is not only easy but also effective; the concept that software, if easily found and
accessible, can be reused is the driving reason for this research.
Searching for Software
One of the biggest hurdles of storing software is the lack of industry specifications. How
a library is set up determines how it’s searched, and different people store different items as
indices or keys, making any search difficult. Maarek et al. implemented an automatic way of
constructing software libraries that can be used for information retrieval. Their goal was to
construct a library of software that “provides a sufficient number of components that offer a
spectrum of domains that can be reused as is, or black box reuse, and is organized such that the
code closest to the user’s query is easy to locate” (Maarek, Berry & Kaiser, 1991, p.800). This
system classified software based on the code, internal documentation and contextual information
by creating an index which is then stored with the software as a tag. When a search is made on
the library, the search term is transformed into an index using the same algorithm that
classified the library, and the indices are compared for a match. If there is no exact match,
the system finds a similar or functionally similar match (Maarek, Berry & Kaiser, 1991).
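The index-and-compare idea described above can be sketched as follows. This is only an illustration under stated assumptions: the index here is a plain set of word tokens and closeness is measured with Jaccard overlap, neither of which is claimed to be the indexing scheme of Maarek, Berry and Kaiser; the component names and descriptions are hypothetical.

```python
import re


def build_index(text):
    """Reduce free text to a set of lowercase word tokens (a stand-in index)."""
    return set(re.findall(r"[a-z]+", text.lower()))


def similarity(index_a, index_b):
    """Jaccard overlap between two indices, in the range [0, 1]."""
    if not index_a and not index_b:
        return 0.0
    return len(index_a & index_b) / len(index_a | index_b)


def search(query, library):
    """Index the query like the library entries and return the closest component."""
    query_index = build_index(query)
    scored = [(similarity(query_index, index), name) for name, index in library.items()]
    best_score, best_name = max(scored)
    return best_name, best_score


# Hypothetical components and their documentation strings.
docs = {
    "sort_list": "Sort a list of numbers in ascending order",
    "copy_file": "Copy a file to a new location",
}
library = {name: build_index(text) for name, text in docs.items()}
best_name, best_score = search("sort numbers", library)
```

Because the query and the stored components pass through the same `build_index` function, an exact match scores 1.0 and a partial match still returns the nearest component.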
Since software can be searched and stored by many different attributes, such as behavior,
function or structure, a study by Frakes and Pole looked at allowing software to be searched by
any of these attributes (Frakes & Pole, 1994) (Prieto-Diaz & Freeman, 1987). Using a database
of UNIX commands and the Proteus software application, Frakes and Pole (1994) looked not only
at recall and precision but also at search time, user preference and helpfulness of the
methods. Using a test set based on the study by Maarek, these authors allowed 35 employees
of the Software Productivity Consortium to query the system seven times using the four
classifications of software storage: keyword, faceted, enumerated and attribute value.
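Recall and precision are used as measures throughout this chapter, so a minimal sketch of the standard definitions may help: precision is the share of retrieved items that are relevant, and recall is the share of relevant items that were retrieved. The command lists below are hypothetical and are not data from Frakes and Pole.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query result."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query result: four commands returned, three commands truly relevant.
precision, recall = precision_recall(
    retrieved=["ls", "mkdir", "rmdir", "cat"],
    relevant=["mkdir", "rmdir", "cd"],
)
```

In this example two of the four retrieved commands are relevant (precision 0.5) and two of the three relevant commands were retrieved (recall of two thirds).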
Previous studies have concluded that, although different kinds of search can be performed,
recall and precision are usually not significantly different from study to
study. Frakes and Pole had similar results: although the searches returned different documents,
recall and precision were not significantly different between faceted,
enumerated, keyword and attribute based searches. The difference came in the search times. These
authors found that enumerated searches resulted in the biggest gap between expected and actual
search times; the others were close to predicted. An enumerated classification system is one
where an item is classified by subject and that subject is then broken down into “mutually
exclusive, usually hierarchical, classes”, for example, in the UNIX command list classification,
UNIX -> Directory -> Create -> Mkdir; the Dewey Decimal system is another
example of an enumerated classification (Frakes & Pole, 1994, p. 619). This study suggests that
retrieval effectiveness does not depend on whether one searches by attribute, behavior or
keyword; this is one reason this research will focus on the setup of the search rather than on
the searched item (Frakes & Pole, 1994).
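An enumerated classification can be pictured as a tree that is walked one class at a time. The sketch below uses a nested dictionary; only the UNIX -> Directory -> Create -> Mkdir path comes from the example above, and the other branches are hypothetical additions for illustration.

```python
# Each level is a mutually exclusive class; leaves hold the classified commands.
taxonomy = {
    "UNIX": {
        "Directory": {"Create": ["mkdir"], "Remove": ["rmdir"]},  # "Remove" is hypothetical
        "File": {"Copy": ["cp"]},                                   # hypothetical branch
    }
}


def lookup(tree, path):
    """Follow a path of class labels from the root and return what is stored there."""
    node = tree
    for label in path:
        node = node[label]
    return node


commands = lookup(taxonomy, ["UNIX", "Directory", "Create"])
```

A searcher has to choose the right class at every level, which is one plausible reason enumerated searches could take longer than expected.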
Zhang et al. proposed a hash function for detecting reusable content (Zhang, Wu, Ding &
Huang, 2012). They created a signature of a sentence and stored that value with the software
content in the document. When a query is made, the distance between the stored sentence
signature and the query sentence is computed; if that value is less than a certain threshold,
the content is deemed a match. This study was successful, but because of the increased data
stored with each software component it was not the most efficient way to evaluate software
(Zhang, Wu, Ding & Huang, 2012).
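The signature-and-threshold idea can be sketched as follows. The hash used here is a SimHash-style 64-bit signature compared by Hamming distance, and the threshold of 8 bits is arbitrary; both are assumptions chosen for illustration, not the hash function defined by Zhang, Wu, Ding and Huang.

```python
import hashlib

BITS = 64  # signature width (assumed)


def signature(sentence):
    """SimHash-style signature: each token votes on every bit of the result."""
    counts = [0] * BITS
    for token in sentence.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        value = int.from_bytes(digest[:8], "big")
        for bit in range(BITS):
            counts[bit] += 1 if (value >> bit) & 1 else -1
    return sum(1 << bit for bit in range(BITS) if counts[bit] > 0)


def hamming(a, b):
    """Number of bit positions in which two signatures differ."""
    return bin(a ^ b).count("1")


def is_match(stored_signature, query_sentence, threshold=8):
    """Deem the content a match when the distance is below the threshold."""
    return hamming(stored_signature, signature(query_sentence)) < threshold


# The signature is computed once and stored alongside the content.
stored = signature("open the file and read each line")
same = is_match(stored, "Open the file and read each line")
```

Only the compact signature has to be stored and compared, but it is extra data kept with every component, which is the overhead noted above.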
Houhamdi and Ghoul designed and implemented the Reuse Description Formalism
(RDF). The RDF code categorization indexing is “capable of representing not only software
component at the code level, but it is also capable of representing more abstract or complex
software entities”, for example through its capability to represent relationships such as is-a
and component-of (Houhamdi & Ghoul, 2001, p. 41). The RDF is also flexible enough
to add a new object to the library without having to re-identify all preceding objects,
and lastly the RDF “provides a consistency verification mechanism” (Houhamdi & Ghoul, 2001,
p. 41). The RDF is a great tool that seems to overcome all shortfalls of other software libraries
but getting it implemented in the mainstream is a big hurdle. Until there is a standardized
library for software, there will be little demand for a standardized way of describing its contents.
There are many ways to search for software components, but they all start with how the
data is stored. Software can be stored using free-text keywords, a faceted index or a semantic
net (Aziz & North, 2007) (Khalifa, Khayati & Ghezala, 2008). Software retrieval schemes
fall into groups which include “keyword search, faceted classification, signature matching,
behavioral matching and semantic-based method” (Khalifa, Khayati & Ghezala, 2008, p. 134).
Prieto-Diaz discussed how to implement a faceted classification scheme. In order for
software to be searched it has to be organized correctly, and implementing a faceted
classification system is one way to do that. In a faceted classification, keywords are selected
from a predefined list and assigned to the classes; each class gets a group of terms, and those
terms are attributes from selected facets (Prieto-Diaz & Freeman, 1987). Software can have
attributes from groups like objects, system type, and functional area, giving each class a
triple description. Prieto-Diaz and
Freeman say that most software can be individually grouped by that method. With this
system, a software search has a higher rate of match; compared to a database with no
organization scheme, it had a 100% increase in precision and a 50% reduction in recall
(Prieto-Diaz & Freeman, 1987).
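The triple description can be sketched as a small catalog in which every component carries one term per facet. The facet names follow the three groups mentioned above (objects, system type, functional area); the components and terms are hypothetical, and the lookup is a simplified illustration rather than the scheme of Prieto-Diaz and Freeman.

```python
from typing import NamedTuple


class Facets(NamedTuple):
    obj: str              # the "objects" facet
    system_type: str      # the "system type" facet
    functional_area: str  # the "functional area" facet


# Hypothetical catalog: each component is described by one term per facet.
catalog = {
    "qsort": Facets("array", "library", "sorting"),
    "bsearch": Facets("array", "library", "searching"),
    "log_writer": Facets("file", "service", "logging"),
}


def find(catalog, **wanted):
    """Return the components whose facets equal every requested term."""
    return sorted(
        name
        for name, facets in catalog.items()
        if all(getattr(facets, facet) == term for facet, term in wanted.items())
    )


array_tools = find(catalog, obj="array")
array_sorters = find(catalog, obj="array", functional_area="sorting")
```

Adding a facet to the query narrows the result, which is how a faceted scheme lets a user trade breadth of results for precision.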
Khalifa, Khayati and Ghezala look at the behavioral matching model in their software
search algorithm. By creating and storing Unified Modeling Language (UML) diagrams based
on a piece of code’s behavior, they say that software searches have a better chance of finding a
match. After the UML document is created, it is parsed into first order logic, which is
stored external to the software. In this setup software can be searched by keyword or by
behavior, and with the stored UML diagram the user does not need prior
knowledge of how the software behaves in order to search successfully (Khalifa, Khayati &
Ghezala, 2008).
Reiss did another study looking at semantic-based matching. Semantics is similar to
behavior, in that it describes what the system can do, but differs from behavior traits in that
it is usually found in the test cases and documentation. Reiss said this type of software
searching was common in the 1990’s, when most searches would look at the signatures of functions
and components. Reiss took this technique one step further by including security
requirements/prerequisites plus the parameters of functions and using them when searching for
semantic based code. Using Eclipse to build the initial syntax tree, he added a semantic analyzer
that read over the tree and added annotations where needed for each node. The biggest hurdle
with this implementation is that each function has to be compiled before its result set can be
searched, and if there is a massive software library this could take some time (Reiss, 2009).
In their 2008 study, Yao, Etzkorn and Virani also looked at the semantic properties of
software. They said their study would increase the reuse of software by creating an
automated classification system of software components. Using a tagging mechanism, Yao,
Etzkorn and Virani created a description of the semantics and attached it to each software
component. A natural language description is then assigned to the component, and a simple
search engine is used to match a user’s query to either the simple description or the semantic
descriptor using a modified version of RDF (Yao, Etzkorn & Virani, 2008).
Marri, Thummalapenta and Xie proposed a new approach to using a code search engine,
or CSE. They say that the CSE’s that are available on the internet can only accomplish one
simple task at a time, for example finding a function for sum in C++. The authors added an API
that assists with three common software development tasks: 1) to learn about a common API and
its programming rules, 2) to use those rules to detect flaws in a program and 3) to infer a fix
for the detected defect. With it, they assumed that they could not only find software but make
it better and increase its chances for reuse (Marri, Thummalapenta & Xie, 2009). Their code
search life-cycle model was able to assist developers with development, maintenance and
verification by searching and fixing software components on the web.
Using the CSE’s available on the internet, Kokkoras, Ntonas, Kritikos, Kakarontzas and
Stamelos created a new system that fed on two or more CSE’s available on the internet. Their
system would have one interfaced page and would direct their search queries to other CSE’s on
the web, such as Krugle or Koders, thereby eliminating the need to collect and store their own
data (Kokkoras, Ntonas, Kritikos, Kakarontzas & Stamelos, 2012). This study was successful in
searching and finding code on the internet but again was only using code search engines and
offered no increase to the speed or reliability of these searches.
Suresh Thummalapenta introduced a web based software search system to find software
that could be reused. The dissertation says that the number of open source software
libraries has grown exponentially, and with it the number of searches of those libraries.
Open source libraries on the web like SourceForge.net host approximately 230,000
projects. By implementing a web crawler to search the internet, together with a parser,
Thummalapenta was able to effectively search and find software components using an XOR pattern
only (Thummalapenta, 2011). Although Thummalapenta’s study was done using the web, the way the
documents were searched for software components is similar to this study.
Isakowitz and Kauffman propose a method of search for software components that
utilizes hypertext technology. Keyword matching in any information retrieval
system requires the manual labeling and storing of keywords or similar words for every software
object. This is tedious and not plausible for large software libraries, the authors argue. The
other method, introduced by Prieto-Diaz (Faceted Classification), uses keyword associations made
between software components by exploiting the commonality of the domain, for example in the
Computer Aided Software Engineering (CASE) environment. They propose using hypertext
links from stored software objects to similar objects; this requires storing only one link for
each object. Once a link is stored, that link’s matching link will ultimately lead a user to
multiple
matches of a searched word. The study concluded with three contributions: a successful
illustration of an approach to automated classification of an object repository in a CASE
system; a demonstration that hypertext technology provides a useful set of capabilities when
combined with a “repository-based application meta-model”; and a demonstration of how all this
can be combined to make a working reuse search support tool (Isakowitz & Kauffman,
1996, p. 421).
Rosalva Gallardo-Valencia and Susan Elliot Sim said that searching over the internet and
searching in a development environment pose two different types of problems, rather than
being the same issue in different domains. When searching for code in an integrated
development environment (IDE), programmers are usually looking for one particular piece of
code for “defect repair, reuse code, understand the problem or impact analysis”
(Gallardo-Valencia & Sim, 2009, p. 50). The authors say that internet based code search is
another topic because code on the internet can be returned in the form of a code object or
component, a reference to how a program works, a completed program, or an http web address link
to the developer’s homepage (Gallardo-Valencia & Sim, 2009). This difference in motivation leads
to different sizes and different looks of query result sets; this, the authors say, is why
internet-scale code search is a new topic and should be treated as one, and not compared to IR
(Gallardo-Valencia & Sim, 2009).
Whether a code search is done on the internet or in an IDE, they all use a query to search
for a match. In the article Using Iterative Refinement to Find Reusable Software, Henninger
looks at creating the query and how it can be changed to make code searching easier and faster.
Henninger’s study takes a simple query from a user and uses CodeFinder’s interface,
which is populated with the hierarchical structure of the current selection as the user
continues to click; this is an extended version of query expansion. As the graph populates and
the user continues to select terms that fit their query, the system builds the query in the
background, and when the user finally selects search the query might have gone from print to
print macs on a Lisp application (Henninger, 1994). This method worked for the purposes of the
article, but for the research in this paper building a query builder is out of scope, although
query expansion will be used if applicable.
Sandhu and Singh implemented a neuro-fuzzy approach to assessing the reusability of
software components that automatically creates a value for a software component based on
reusability, reliability and quality of development. Their approach creates a COM object,
written in Visual Basic, that runs as a service on a user’s machine. This service
extracts a value for a software component based on the “nearest-neighbor-based, agglomerative,
hierarchical, unsupervised conceptual clustering” (Sandhu & Singh, 2007, p. 357). Using a
complete linkage algorithm, the entire document’s similarity value is compared to the value of
the query using the document similarity matrix based on the cophenetic distance (Sandhu &
Singh, 2007). Using a neural network with an incorporated fuzzy inference system, the authors
were able, through many iterations, to refine a match successfully.
Finding content for reuse purposes brings added challenges, according to Zhang, Wu and
Ding. They say that detecting content for reuse is harder because “reuse may
happen at different levels” and a positive match may not be enough to indicate definite reuse
(Zhang, Wu, Ding & Huang, 2012, p. 405). Thummalapenta (2011) encountered this obstacle,
otherwise known as a false positive, in which a document matches but the use of the content is
not what the user intended. This issue will be documented in the future work section of this
paper.
Information Retrieval - Boolean Logic
In 1959 Maron and Kuhns introduced a novel technique to solve the library indexing
issue by defining an index for a document as a unique tag that identifies the information in that
document. Using this index, data is easier and faster to search through. Although they did not
have electronic versions of documents, Maron and Kuhns’s system was still effective in
retrieving information from the library. This system of searching for information using only a
small piece of data representative of the entire entry has been the foundation for the Dewey
Decimal system and other information retrieval systems (Maron & Kuhns, 1959).
Boolean logic is simple and clean: the results are returned in a list in which every item
has the same chance of matching the query as the next. The problem with that, in today’s world
of data storage, is that the lists returned can be very large (Baeza-Yates & Ribeiro-Neto, 2011).
Even though a weighted system would return a more accurate list, the Boolean retrieval model is
still the most popular among search based algorithms (Bordogna & Pasi, 1993). Most searches
today incorporate some form of weighted system for data searches to reduce the amount of
returned data and to limit the number of relevant returns to only a certain percentage of matches.
A Boolean system gives a value of 1 to a document that contains the queried term and a
value of 0 to a document that does not contain the queried term. This is based on only one
occurrence of the term, so documents containing the term multiple times get returned
the same way (Baeza-Yates & Ribeiro-Neto, 2011). This means a document that contains the
searched term 100 times gets returned with the same priority as a document that contains the
searched term 1 time. Although this is a fast way to search through documents, when listing the
matches for the user, the resulting list can be deceiving (Baeza-Yates & Ribeiro-Neto, 2011).
Although it is a simple and clean way to search, Boolean logic does have a number of
disadvantages. Salton, Fox and Wu (1983) say that the size of the output is hard to control or
even predict, because matching is done regardless of the number of times a term occurs in a
document. The output is also not initially ranked by how well a document matches the queried
term, so choosing the document that best meets the query is up to the user.
Fuzzy Sets and Extended Boolean Logic
It has been shown that in the Boolean logic model, if one searches for two words
connected by the Boolean AND, a document that matches only one word will be discarded just like
a document that matches neither word, as shown in Table 1 (Baeza-Yates & Ribeiro-Neto, 2011)
(Bookstein, 1979) (Fox & Sharan, 1986) (Verma & Sharma, 2013).
A    B    A OR B     Value
T    T    T          1
T    F    T          1
F    T    T          1
F    F    F          0

A    B    A AND B    Value
T    T    T          1
T    F    F          0
F    T    F          0
F    F    F          0
Table 1: Results of Boolean logic on a document searching for terms A and B
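The behavior summarized in Table 1 can be reproduced with a short sketch. A document receives 1 if it contains a term at least once and 0 otherwise, so term frequency plays no part; the three documents are hypothetical.

```python
# Hypothetical documents; "d1" contains the term "sort" three times.
docs = {
    "d1": "sort sort sort list",
    "d2": "sort array",
    "d3": "search list",
}


def contains(text, term):
    """Boolean weight: 1 if the term occurs at least once, otherwise 0."""
    return 1 if term in text.split() else 0


def boolean_and(docs, a, b):
    """Documents that contain both terms."""
    return [name for name, text in docs.items() if contains(text, a) and contains(text, b)]


def boolean_or(docs, a, b):
    """Documents that contain at least one of the terms."""
    return [name for name, text in docs.items() if contains(text, a) or contains(text, b)]


and_hits = boolean_and(docs, "sort", "list")
or_hits = boolean_or(docs, "sort", "list")
```

With AND, the documents that hold only one of the two terms are discarded together with those that hold neither, and the document containing "sort" three times receives the same weight as the one containing it once.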
Miyamoto (1990) says that, put simply, an IR system using fuzzy logic takes a query as input
and returns a list of documents as output, where each document’s score is a measure of the
degree of match between the document and the query.
Bordogna and Pasi created a fuzzy linguistic approach with generalized Boolean IR,
which they say combines the imprecision handled by a fuzzy model with the
accuracy of a Boolean match. In their study, Bordogna and Pasi replaced the numeric fuzzy
weights with linguistic values, for example replacing values such as .8 and .9 with descriptors
such as very important or not important. They say that by replacing the numeric value with a
qualitative descriptor, they can get a better sense of how a term matches a user’s query. They
found that this allowed the user to calculate recall and precision much more easily without
having to quantify a specific number for the degree of
importance of a term in a document (Bordogna & Pasi, 1993).
There are three basic elements of any information retrieval system: its sets of documents
and terms, an indexing system and perhaps a filter to limit the number of responses. A fuzzy
thesaurus determines which documents match by their membership values. With a
certain range of membership values assigned to a set, a search will return a broad range of
matches. Nomoto, Kubo and Kosuge tested the use of fuzzy thesaurus generation for finding
document matches in their 1995 paper. By creating their own fuzzy thesaurus and
implementing a cross-index matrix, built by assigning the index of each document to a
set and narrowing, broadening, and then taking a cross value of the resulting matrix, they were
able to match terms more accurately, depending on the user’s preference (Nomoto, Kubo &
Kosuge, 1995).
There is a difference in data gathering for information retrieval and artificial intelligence.
Maarek, Berry and Kaiser (1991) say indexing for IR is text specific versus indexing for AI
which is knowledge based. One looks at text only while the other looks at context and referring
knowledge. The issue with text only is comparing natural language elements. To tackle this
issue, Chau and Yeh (2004) suggest a generic, language independent domain to which every
document is converted before it is searched for like terms. This helps combat missed matches due
to different terms not translating, but adds another step to the process: translating and
storing the translation (Chau & Yeh, 2004).
There are two options to consider when indexing a document. Free text indexing places
no limit on the number of index terms that are allowed, whereas controlled indexing
allows only a limited number of index terms. Both indexing methods have the same effect
on outcomes, but when looking for software an uncontrolled method, or free-text indexing, is the
best option for reasons of cost and performance (Maarek, Berry & Kaiser, 1991). This research
will use a free text method to index the files.
A study performed by Bordogna, Carrara and Pasi extended the Boolean information
retrieval methodology to better satisfy a user’s query using a weighted system (Bordogna,
Carrara & Pasi, 1992). The authors used the Retrieval Status Value (RSV), which is obtained by
combining the resulting weights from the function F : D × T → [0, 1] (Bordogna, Carrara & Pasi,
1992). D is the set of all documents, T is the set of terms, and F is a function of the
occurrences of a term from T in a document from D (Bordogna, Carrara & Pasi, 1992). This RSV is
used to represent the closeness to the ideal document. A value of 0 for an index term means
there is no match for the indexed term
and a value of 1 indicates a perfect match (Kraft, Bordogna & Pasi, 1998). Based on this value a
constraint system is set up that affirms the value as degree of relevance to the desired ideal. The
results are then listed in descending order from perfect match.
The other statistical models are the vector and probabilistic models (Srinivasan, Ruiz,
Kraft & Chen, 2000). The vector model assigns a non-binary weight to indexed terms, and a
degree of similarity is calculated between the query and each stored document in the system. The resulting list
of matched documents is sorted and presented in descending order; this allows documents that
are only partially a match to get returned to the user. The formal definition of the vector model
is such that “the weight w_{i,j} associated with a term-document pair (k_i, d_j) is non-negative and
non-binary” (Baeza-Yates & Ribeiro-Neto, 2011, p. 77). The probabilistic model looks at the
documents and, once it finds its matches, assigns a probability that the user will find the
document relevant. Instead of assigning a degree of match, it looks at the statistics of
probability. Baeza-Yates and Ribeiro-Neto say “given a query q, the probabilistic model assigns
to each document d_j, as a measure of its similarity to the query, the ratio P(d_j relevant-to q) / P(d_j
non-relevant-to q), which computes the odds of the document d_j being relevant to the query
q” (2011, p. 80).
Lotfi Zadeh is credited with the creation of fuzzy-set theory back in 1965, but parts of the
theory can be traced back to the 1920s (Miller, 1996). Fuzzy-set theory is described as a cross
between Boolean logic and multi-valued set theory (Miller, 1996). Zadeh says of fuzzy-set
theory that “as a system becomes more complex, the need to describe it with precision becomes
less important” (Miller, 1996, p. 29). Fuzzy sets allow a degree of match or varying scale of
relationship that is otherwise not included in mathematics.
A fuzzy set is defined by Klir and Yuan as:
“mathematically by assigning to each possible individual in the universe of discourse a
value representing its grade of membership in the fuzzy set. For example, a fuzzy set
representing our concept of sunny might assign a degree of membership of 1 to a cloud
cover of 0%, .8 to a cloud cover of 20%, .4 to a cloud cover of 30% and 0 to a cloud
cover of 75%” (1995, p.491).
By assigning a degree of match, there is more flexibility in data searches for finding a better
match.
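The sunny example can be turned into a small membership function. The sketch below is only an illustration: it uses the four (cloud cover, membership) points quoted from Klir and Yuan, and the choice of linear interpolation between those points, as well as the names SUNNY_POINTS and sunny_membership, are assumptions made here and are not part of the quoted definition.

```python
from bisect import bisect_right

# (cloud cover in percent, degree of membership in "sunny"),
# taken from the Klir and Yuan example quoted above.
SUNNY_POINTS = [(0.0, 1.0), (20.0, 0.8), (30.0, 0.4), (75.0, 0.0)]


def sunny_membership(cloud_cover):
    """Return the degree of membership in the fuzzy set 'sunny'.

    Values between the quoted points are linearly interpolated,
    which is an assumption of this sketch.
    """
    xs = [x for x, _ in SUNNY_POINTS]
    if cloud_cover <= xs[0]:
        return SUNNY_POINTS[0][1]
    if cloud_cover >= xs[-1]:
        return SUNNY_POINTS[-1][1]
    i = bisect_right(xs, cloud_cover)
    (x0, y0), (x1, y1) = SUNNY_POINTS[i - 1], SUNNY_POINTS[i]
    return y0 + (y1 - y0) * (cloud_cover - x0) / (x1 - x0)
```

A cloud cover of 25%, for instance, falls between the quoted points for 20% and 30% and receives a membership between .8 and .4, rather than being forced to 0 or 1 as it would be under a crisp set.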
Using the Klir analysis of degree of match, Triantafyllos, Vassiliadis and Pechanek
developed a database system that would answer natural language questions. The system was
developed to help evaluate the bookkeeping library in the IBM 4381 system. Using a degree of
confidence, they set a limit of acceptable response and anything below that confidence level was
dismissed. Using this system, they were able to get results that were close to manual evaluation
of information retrieval on the same system.
Zadeh (1994) says there are two central concepts to fuzzy logic, the linguistic variable
and the fuzzy if-then rule. The linguistic variable is any variable whose value can be found in
natural language like sentences or words (Zadeh, 1994). The if-then rule says that the
“antecedent and consequents are propositions containing linguistic variables” (Zadeh, 1994, p.
49). Fuzzy logic is set up in a way that helps group similar terms together, the way the human
mind summarizes data. For example, if the linguistic values are young, old, and infant, the linguistic
variable would be age. The other way to describe this way of grouping is by referring to it as a
membership function. Using linguistic values or membership functions, a series of rules can be
defined. These rules are fundamental to the Fuzzy Dependency and Command Language or
FDCL. The FDCL, unlike Fuzzy Prolog, is not a fuzzified version of a standard programming
language and like all languages FDCL is defined by its semantics and syntax (Zadeh, 1994).
FDCL allows many different rules from fuzzy if-then to simple fuzzy. According to
Zadeh, a typical rule links “m antecedent variables X1, …, Xm to n consequent variables Y1, …, Yn
and has the form: if X1 is A1 and … Xm is Am, then Y1 is B1 and … Yn is Bn, where X = (X1, …,
Xm) and Y = (Y1, …, Yn) are linguistic variables and (A1, …, Am) and (B1, …, Bn) their respective
linguistic values” (Zadeh, 1994, p. 51). For example, if Temperature is low and Pressure is low
then Volume is large. Rules can have two structures, surface or deep. The surface structure is a
rule in its symbolic form, if X is A then Y is B, and the deep structure contains all the
dependencies that define the membership function of a rule (Zadeh, 1994).
In a typical Boolean retrieval model, a document that is queried with term A and term B,
using the AND will result in only the document with both terms present given the value of 1, and
the remaining options will be assigned 0. If the OR operator is used, the document that has
neither term will be assigned the value of 0 and the remaining a value of 1, as shown in Table 1.
Salton, Fox and Wu look at the vector-processing retrieval model and assign similarity
values based on the Euclidean distance calculation for those documents that contain one but not
both of the queried terms. For AND queries the distance is measured from the point (1,1),
because (1,1) is the point where a document contains both terms and is therefore the ideal location for a
perfect match. Using term weights d_A for term A and d_B for term B, the distance for those
documents that contain one term but not both is

\[ \sqrt{(1 - d_A)^2 + (1 - d_B)^2} \]

and for the OR operator the distance is measured from the point (0,0):

\[ \sqrt{(d_A - 0)^2 + (d_B - 0)^2} \]

(Salton, Fox & Wu, 1983). With a maximum
distance possible of √2, Table 2 shows the new calculations for documents that may fall
between (0,0) and (1,1). Although this study will not use the vector-processing retrieval model,
it is clear that a weighted system will provide a closer match than a Boolean logic
model. The results of Salton, Fox and Wu’s vector-processing retrieval model showed a 172%
improvement in recall and precision over a Boolean logic retrieval model on the same data.
A    B    OR     Value
T    T    T      1
T    F    T      1/√2
F    T    T      1/√2
F    F    F      0

A    B    AND    Value
T    T    T      1
T    F    F      1 - 1/√2
F    T    F      1 - 1/√2
F    F    F      0
Table 2: Values calculated using vector-processing retrieval model for term A and term B.
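The values in Table 2 can be reproduced with a short calculation. The sketch below is an illustration, not Salton, Fox and Wu's code: it assumes the distances are divided by the maximum distance √2 and, for AND, subtracted from 1, which is the normalization that yields the table entries. The function names are made up for this example.

```python
import math

MAX_DISTANCE = math.sqrt(2)  # distance between (0,0) and (1,1)


def or_similarity(d_a, d_b):
    """Normalized distance of the point (d_a, d_b) from (0,0)."""
    return math.sqrt(d_a ** 2 + d_b ** 2) / MAX_DISTANCE


def and_similarity(d_a, d_b):
    """One minus the normalized distance of (d_a, d_b) from (1,1)."""
    return 1 - math.sqrt((1 - d_a) ** 2 + (1 - d_b) ** 2) / MAX_DISTANCE
```

With term weights restricted to 0 and 1 the functions return the four table rows; with weights in between, such as (0.8, 0.2), they return intermediate values that a Boolean model cannot express.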
Table 2 also shows that given a query with only one word match in a document, that
document will be lower in the returned list than a document with both queried word matches, which would be of
value 1. Also, if only one queried term is found in a document it receives a higher value than the
nonexistence of both terms in an AND query (Salton, Fox & Wu, 1983). The vector-processing
model is effective in finding a successful match with more precision than a Boolean system, but
if there is more than one term it becomes impossible to determine which term gets the highest
precedence. For example, if the search is Information AND Retrieval AND Software AND Reuse,
any document with the word reuse will get the same degree of match as a document containing
the word retrieval. So words that have other uses, like reuse, which can be used to describe
anything, not just software, will be included in the query results. Bookstein introduced a model
that added weights not only to the searched result list but also to the queried words. Bookstein
suggests reducing the retrieval status value (RSV) by the query term weight. For example, suppose the
term reuse has the membership values {(d1, 1), (d2, 0.8), (d3, 0)}, where d is a document and the
number that follows is the membership, or how well the document is described by the indexed word reuse. If
the request becomes reuse with a weight of 0.5, then the retrieved set becomes {(d1, 0.5), (d2, 0.4), (d3, 0)}
(Bookstein, 1980).
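The effect of a query term weight in this example is a simple scaling of each membership value. The following sketch assumes three documents named d1, d2 and d3 with memberships 1, 0.8 and 0; the function name apply_query_weight is hypothetical and the code is not taken from Bookstein's paper.

```python
def apply_query_weight(memberships, weight):
    """Scale each document's membership by the weight given to the query term."""
    return {doc: value * weight for doc, value in memberships.items()}


# Memberships of three example documents for the indexed term "reuse".
reuse = {"d1": 1.0, "d2": 0.8, "d3": 0.0}
weighted = apply_query_weight(reuse, 0.5)
```

The ranking of the documents is unchanged by the scaling; what changes is how much the term reuse can contribute when it is combined with other, more heavily weighted query terms.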
Bookstein’s approach is the foundation for Buell and Kraft’s 1981 paper that looks at a
weighted retrieval model. Buell and Kraft look at a model that replaces the standard 0 and 1
values assigned to an index term with a continuous membership value, which is
defined on the set of documents (D*) crossed with the set of indexed keywords (I*). (Indexed keywords
are keywords from a document that get added to the index file, along with the corresponding number of
occurrences and the location of each occurrence.) The result is a membership
value between 0 and 1: F: D* × I* → [0,1] (Buell & Kraft, 1981). The calculations for
membership then express how much a document is about a query term, rather than whether the term exists in the
document or not. The issue of how the system handles the membership of multiple query terms
is taken care of as well: if the Boolean AND is the joining term, taking the maximum
membership of all queried terms in the documents will give the membership for the entire query,
computed here with the algebraic sum as Max[F(d,T), F(d,S)] ≈ F(d,T) + F(d,S) – F(d,T)*F(d,S). As an example, if the query is
C++ AND sum, where C++ becomes T and sum becomes S, with F(d,T) = .8 and
F(d,S) = .2, then
F(d,T) + F(d,S) – F(d,T)*F(d,S) = .8 + .2 – .8*.2
                                = .8 + .2 – .16
                                = 1.0 – .16
                                = .84
The maximum of .2 and .8 is .8, and the algebraic sum of .84 rounds to .8.
The model for the Boolean OR is: Min[F(d,T), F(d,S)] ≈ F(d,T)*F(d,S)
Using the same values of F(d,T) = .8 and F(d,S) = .2, then
F(d,T)*F(d,S) = .8*.2 = .16
The minimum of .8 and .2 is .2, and the product of .16 rounds to .2.
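The arithmetic in this example can be checked with a few lines of code. The sketch below only evaluates the four expressions that appear in the passage (maximum, minimum, algebraic sum and product) for the example memberships .8 and .2; the function names are illustrative and are not taken from Buell and Kraft.

```python
def algebraic_sum(a, b):
    """a + b - a*b, the expression shown alongside Max in the text."""
    return a + b - a * b


def algebraic_product(a, b):
    """a*b, the expression shown alongside Min in the text."""
    return a * b


f_t, f_s = 0.8, 0.2  # F(d,T) and F(d,S) from the C++ / sum example

results = {
    "max": max(f_t, f_s),
    "algebraic_sum": algebraic_sum(f_t, f_s),
    "min": min(f_t, f_s),
    "algebraic_product": algebraic_product(f_t, f_s),
}
```

For these inputs the maximum is .8 while the algebraic sum is .84, and the minimum is .2 while the product is .16, so the two pairs of operators agree only approximately.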
Using this alone will yield a better match than a discrete Boolean 0,1 system, but if the query
terms are C++ AND sum, and a document is all about C++ with no mention of the second
query term, that document will be returned as the top query match. Buell and Kraft looked at putting
weights on the query terms but found that if a document has an RSV (retrieval status value) of 0, the
entire query’s RSV becomes 0. For the query (T, a) AND (S, b), where a and b are the values that
indicate the relevance each term has in the query, the equation becomes
Max[(F(d,T),a), (F(d,S),b)] = (F(d,T),a) + (F(d,S),b) - (F(d,T),a) * (F(d,S),b). (1)
If a or b = 0, the Max will always be 0 (Buell & Kraft, 1981). For this reason, the query will not
contain a weighted value in this study, but the terms in the documents will be weighted based on
the number of occurrences of the word per document. If both terms have the same weighted
value to a query, but the documents don’t contain any match for either term, the query should
result in no results found. This requires an exception in the algorithm that checks for a
returned RSV of 0. Buell and Kraft present another option for this situation: setting a
threshold value, a value that must be met in order for a document or set of documents to be
about a query enough to be returned. This threshold value is used like a checksum: if the
returned value is equal to or greater than the threshold then the document is relevant; if not, the
document is not about the query enough to present to the user. This is a good approach if users
want to make sure the returned set of documents meet a minimum requirement that is greater
than 0.
Another popular model for IR systems is BM25. The BM25 model works well with
plain text documents and has proven itself at TREC. The BM25 model looks at how often
a term is in a document and the average document length in the corpus (Robertson, Zaragoza &
Taylor, 2004). BM25 also has two boosting variables that are commonly set at k = 2 and b
= .75 for best results. For this research, BM25 was used as the benchmark from which to
judge the success of the other methods. Lucene integrates BM25 as a similarity with
default values for k and b already set, which will not be changed in this study.
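To make the roles of k and b concrete, the sketch below scores one query term against one document using a commonly cited form of BM25. It is an illustration only: the particular idf variant, the parameter names, and the default values (k = 2 and b = .75, the commonly used settings mentioned above) are assumptions of this example and do not describe Lucene's internal implementation.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k=2.0, b=0.75):
    """Score a single term for a single document.

    tf          -- occurrences of the term in the document
    doc_len     -- length of the document in tokens
    avg_doc_len -- average document length in the corpus
    n_docs      -- number of documents in the corpus
    doc_freq    -- number of documents that contain the term
    """
    # One common, always non-negative idf variant (an assumption of this sketch).
    idf = math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1.0)
    # b controls how strongly long documents are penalized.
    length_norm = 1.0 - b + b * doc_len / avg_doc_len
    # k controls how quickly repeated occurrences stop adding to the score.
    return idf * tf * (k + 1.0) / (tf + k * length_norm)
```

A document's score for a multi-term query would be the sum of these per-term scores. Raising tf increases the score with diminishing returns, and a document longer than average scores lower than an average-length document with the same tf.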
Entering queries can be a tricky task that can change the outcome of a search. Query
expansion is the substitution of similar words, or the rearrangement of words, in the query to search for
the same idea using different terminology that may match the corpus better. Query
expansion is usually discussed along with relevance feedback as feedback methods (Baeza-Yates
& Ribeiro-Neto, 2011). For purposes of this research, query expansion will be explored in
the sense of modifying a query to include synonyms or words or phrases with similar meanings.
This has been shown to improve search results when a thesaurus is not used (Xu & Croft,
1996). An example of a query expansion looks like this: sum OR add OR plus. Words with the
same meaning are OR’d together to return any document that may include them.
Just searching for matching terms may not always yield the best result. By modifying the
term frequency-inverse document frequency scoring and by calculating the overlap between
documents with similar topics, Chowdhury and Bhuyan implemented a fuzzy information
retrieval model using clustering. By grouping documents by inter-document similarity, searches
could move from document to document quicker and had a 10% increase in both recall and
precision over BM25 and tf-idf calculations. Once a matching document was found, any
document in the same cluster would be compared to find a match, and as soon as a document
was not a match the search stopped (Chowdhury & Bhuyan, 2010).
Term frequency and inverse document frequency are calculated to determine the most
frequent words in a document and which document contains a word the most (Fox & Sharan,
1986). This indexing value helps search algorithms quickly look up a term and find the
corresponding documents that contain that term. This term weighting function was first introduced
in 1972 as a way to rank documents for information retrieval systems. The basic formula for idf
says that, given a set of N documents in which a term t_i occurs in n_i of the documents,

\[ \mathrm{idf}(t_i) = \log \frac{N}{n_i} \]

(Robertson, 2004). The frequency of a term in a given document is then multiplied
by the idf to get the tf-idf number (Robertson, 2004). The tf/idf value will be used as the term
weights in this research. The number of relevant matches returned will be compared to the study
by Maarek (1991) and also verified by the researcher.
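The tf/idf calculation described above can be sketched in a few lines. This is a minimal illustration that assumes raw term counts for tf, the natural logarithm for idf, and documents that are already tokenized; the function name and the tiny example corpus are made up here and are not the weighting code used in the study.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Compute tf * log(N / n_i) for every term of every document.

    docs maps a document name to its list of tokens.
    Returns a dict mapping each document name to {term: weight}.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))  # count each term once per document
    weights = {}
    for name, tokens in docs.items():
        term_freq = Counter(tokens)
        weights[name] = {
            term: freq * math.log(n_docs / doc_freq[term])
            for term, freq in term_freq.items()
        }
    return weights


# Hypothetical two-file corpus.
corpus = {"a.c": ["sum", "sum", "int"], "b.c": ["int", "sort"]}
w = tf_idf(corpus)
```

In this corpus the term int appears in every file, so its idf and therefore its weight is 0, while sum, which appears twice in one file only, receives the highest weight.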
Three other models used for information retrieval that are considered an extended
Boolean approach are the MMM (mixed, min and max), Paice and the p-norm model. The
MMM model is based on work by Zadeh and says “an element has a varying degree of
membership to a given set instead of the traditional membership choice”, but only looks at the
min and max document weights for the index term (Frakes & Baeza-Yates, 1992, p. 395). The
MMM is based on the fuzzy set theory and says that each indexed term has a fuzzy set associated
with it and the weight of a document with respect to an index term is considered to be the degree
of membership of the document in the fuzzy set associated with it.
Using the term frequency calculation (tf/idf) as the term weight, the MMM is calculated by:

\[ \mathrm{SIM}(Q_{or}, D) = C_{or1} \cdot \max(\text{tf/idf of queried terms}) + C_{or2} \cdot \min(\text{tf/idf of queried terms}) \quad (2) \]

\[ \mathrm{SIM}(Q_{and}, D) = C_{and1} \cdot \min(\text{tf/idf of queried terms}) + C_{and2} \cdot \max(\text{tf/idf of queried terms}) \quad (3) \]
Where Q_{or} is the query with an OR and Q_{and} is the query with an AND, D is the document with index-term weights
tf/idf, and C is a coefficient for “softness”. Frakes and Baeza-Yates say “since we would like to
give the maximum of the document weights more importance while considering an or query and
the minimum more importance while considering an and query” and usually C_{or2} is just 1 - C_{or1}
and C_{and2} is calculated as 1 - C_{and1} (1992, p. 396). They found that C_{or1} performed best at 0.6 and
C_{and1} performed best at 0.3. For purposes of this research we will use the values of C_{or1} = 0.6, C_{or2} =
0.4, C_{and1} = 0.3, C_{and2} = 0.7.
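The two MMM formulas reduce to a weighted mix of the largest and smallest term weights. The sketch below is an illustration only: it assumes the weights have been scaled to the range 0 to 1, takes the first coefficient as an explicit argument, and derives the second as one minus the first. The function names and the example coefficient of 0.6 are chosen for demonstration.

```python
def mmm_or(weights, c1):
    """OR query: c1 weights the maximum, (1 - c1) weights the minimum."""
    return c1 * max(weights) + (1 - c1) * min(weights)


def mmm_and(weights, c1):
    """AND query: c1 weights the minimum, (1 - c1) weights the maximum."""
    return c1 * min(weights) + (1 - c1) * max(weights)


# Document weights for two query terms, for example C++ and sum.
doc_weights = [0.8, 0.2]
or_score = mmm_or(doc_weights, 0.6)
and_score = mmm_and(doc_weights, 0.6)
```

Setting the coefficient to 1 recovers the pure max (for OR) or pure min (for AND); values below 1 soften the operator so that the other extreme still contributes to the score.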
The Paice model was proposed by Paice in 1984 and is also based on fuzzy set theory.
Similar to the MMM model, the Paice model looks at the weighted indexes in the document, but
it doesn’t stop at the min and the max like the MMM model does; it considers all of the weights of
the document. The Paice value is calculated by

\[ \mathrm{SIM}(Q, D) = \frac{\sum_{i=1}^{n} r^{i-1} d_i}{\sum_{i=1}^{n} r^{i-1}} \quad (4) \]

Where n is the number of query terms, Q is the query, r is a constant coefficient, and d_i is the tf/idf weight of the document for
query term A_i. For an OR query, Q = (A_1 or A_2 or … or A_n), and for an AND
query, Q = (A_1 and A_2 and … and A_n).
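A small sketch of the Paice calculation follows. It relies on two assumptions that the passage does not spell out: the document weights are sorted in descending order for an OR query and in ascending order for an AND query before the formula is applied, and the coefficient r is supplied by the caller. The function name is hypothetical.

```python
def paice_similarity(weights, r, operator):
    """Paice similarity: sum(r**(i-1) * d_i) / sum(r**(i-1)).

    weights  -- document weights for the query terms
    r        -- coefficient between 0 and 1
    operator -- "or" (descending sort) or "and" (ascending sort)
    """
    if not weights:
        raise ValueError("at least one term weight is required")
    if operator == "or":
        ordered = sorted(weights, reverse=True)
    elif operator == "and":
        ordered = sorted(weights)
    else:
        raise ValueError("operator must be 'or' or 'and'")
    # enumerate starts at 0, so r**i here equals r**(i-1) for i = 1..n.
    numerator = sum(r ** i * d for i, d in enumerate(ordered))
    denominator = sum(r ** i for i in range(len(ordered)))
    return numerator / denominator
```

With r = 1 every weight counts equally and the result is the plain average; as r approaches 0 only the first weight in the sorted list matters, which gives the maximum for OR and the minimum for AND.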
The p-norm model adds another angle to the Paice model by considering the weight of
the query as well as the weights of the documents (Frakes & Baeza-Yates, 1992). The research
has found that p = 2 gives good results, so for this research the value of p will be 2. The p-norm
model for an OR’d query is

\[ \mathrm{SIM}(Q_{or\,p}, D) = \sqrt[p]{\frac{a_1^p d_1^p + a_2^p d_2^p + \dots + a_n^p d_n^p}{a_1^p + a_2^p + \dots + a_n^p}} \quad (5) \]

Where Q is the query, D is the document, a_i is the weight of query term A_i, d_i is the document weight
for term A_i, and p is set to 2. For an AND’d
query the model is:

\[ \mathrm{SIM}(Q_{and\,p}, D) = 1 - \sqrt[p]{\frac{a_1^p (1-d_1)^p + a_2^p (1-d_2)^p + \dots + a_n^p (1-d_n)^p}{a_1^p + a_2^p + \dots + a_n^p}} \quad (6) \]
With a p value greater than 1 the computational toll is high, but to get a better result
computational expense will be second to results for this study.
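As a rough illustration of the p-norm calculation, the sketch below follows the usual statement of the model from Salton, Fox and Wu: the OR score is the weighted p-norm of the document weights, and the AND score is one minus the weighted p-norm of their complements. Document weights are assumed to lie between 0 and 1, and the function names are made up for this example.

```python
def p_norm_or(doc_weights, query_weights, p=2.0):
    """OR'd query: weighted p-norm of the document weights."""
    numerator = sum((a ** p) * (d ** p) for a, d in zip(query_weights, doc_weights))
    denominator = sum(a ** p for a in query_weights)
    return (numerator / denominator) ** (1.0 / p)


def p_norm_and(doc_weights, query_weights, p=2.0):
    """AND'd query: one minus the weighted p-norm of (1 - document weight)."""
    numerator = sum((a ** p) * ((1 - d) ** p) for a, d in zip(query_weights, doc_weights))
    denominator = sum(a ** p for a in query_weights)
    return 1 - (numerator / denominator) ** (1.0 / p)
```

With p = 2, equal query weights, and document weights (1, 0), these functions return 1/√2 for OR and 1 - 1/√2 for AND, the same values as in Table 2. With p = 1 both operators collapse to a weighted average of the document weights.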
Based on the literature, no previous studies have looked at how effective
an extended Boolean approach (MMM, Paice, or p-norm) is for information retrieval of
software. For this research, the MMM, Paice, Boolean and p-norm models will be used to
search documents for software components, using the tf/idf values calculated in the
software as weights. By going a level deeper than the simple term weights of tf/idf, this study should provide a
better, more accurate match to a user’s search.
Document Pre-processing
Before a document is searched it is usually processed: stripped of white space, with common
words removed, and indexed. Indexing is for faster lookup and faster search times. There are
many different methods used to index files; the most common is to index a word by the number
of times it appears in a document, or its frequency. During the indexing, terms are parsed out
and stored in an index file along with the frequency or some other ranking number that will help
quickly identify the term in a ranked list (Croft, Metzler & Strohman, 2010). The most common
index is the inverted index which, for every unique indexed term, contains a list of the
documents that contain that term (Croft, Metzler & Strohman, 2010). This research will use
the inverted indexing that is included with the Lucene software.
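The structure of an inverted index can be shown with a small sketch. This is an illustration of the general idea only and is unrelated to Lucene's on-disk format: it assumes whitespace tokenization and lowercasing, stores a per-document frequency for each term, and uses a made-up function name and corpus.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to {document name: number of occurrences}.

    docs maps a document name to its text.
    """
    index = defaultdict(dict)
    for name, text in docs.items():
        for term in text.lower().split():
            index[term][name] = index[term].get(name, 0) + 1
    return dict(index)


# Hypothetical two-file corpus.
index = build_inverted_index({"a.c": "sum int sum", "b.c": "int sort"})
```

Looking up a query term is then a single dictionary access that returns every document containing the term, together with the frequency needed for weighting.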
Figure 1. The indexing process
Stop words are common words that are found in most documents but not needed as
indexed terms. The most common words are the, and and to and removing them can increase the
speed of a search and decrease the size of the indexed file (Croft, Metzler & Strohman, 2010).
This research will be searching software files, and a stop word algorithm will be used but the list
of stop words will be researcher created to remove software specific common terms like =, ;, for,
while, etc.
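A stop-word filter for source code might look like the sketch below. The stop list shown is a hypothetical example built from the terms mentioned above, not the researcher-created list used in this study, and the tokenizer is a simple regular expression that keeps identifiers and single punctuation symbols while ignoring numeric literals.

```python
import re

# Hypothetical stop list: common English words plus software-specific tokens.
SOFTWARE_STOP_WORDS = {"the", "and", "to", "for", "while", "=", ";"}


def tokenize(source):
    """Split source text into identifiers and single punctuation symbols."""
    return re.findall(r"[A-Za-z_][A-Za-z0-9_]*|[^\sA-Za-z0-9_]", source)


def remove_stop_words(tokens, stop_words=SOFTWARE_STOP_WORDS):
    """Drop every token that appears in the stop list (case-insensitive)."""
    return [token for token in tokens if token.lower() not in stop_words]


kept = remove_stop_words(tokenize("while total = sum ;"))
```

Only the tokens that carry meaning for retrieval, here total and sum, remain to be indexed.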
Stemming is another helpful action that is done in the indexer and will speed up search
times. Stemming reduces the variants of a word to one form, usually the shortest (Croft,
Metzler & Strohman, 2010). For example, the words run, running, and runs will stem to the
same word, run. This eliminates the need for extra space to store all versions of the indexed
words. The process of stemming takes a word’s base, after removing prefixes and suffixes,
stores that word, and also searches for matches of that base. This allows a word like running to
be matched to run and runs; this not only increases the chance of a match but also increases the
compression factor by 50% (Frakes & Baeza-Yates, 1992). The most common stemming
algorithm is the Porter stemming algorithm, which is included in Lucene’s SnowballAnalyzer
and will be used in this research.
How a system matches like terms is critical to any system. A part of most information
retrieval systems is the thesaurus. The thesaurus is comprised of indexed terms, their
relationships to each other, and the design of how the relationships are laid out. The
relationship design can range from lists to a multidirectional graph (Baeza-Yates & Ribeiro-Neto,
2011). By saving the relationship between like terms, a system can be searched without having
to worry about other terms. For example a user may search for a sum function but some may
save their functions using keywords add or plus, a good thesaurus will include all versions in a
search. Other methods include lexical analysis of the text, which turns streams of characters into
streams of words and usually includes removing of hyphens, apostrophes, and punctuation
marks.
Information Retrieval Software
There are open source IR applications available on the internet that were researched for
this study. Several were considered for this research but, for different reasons, were not
pursued further. They include Lemur, WorldNet, Terrier, idSearch, Zettair, Sphinx
and SMART. Lemur has a self-contained library which would not work with this research, and it
also didn’t allow a software library to be substituted in place of its built-in library. WorldNet
was not user friendly and was difficult to install on Windows 8; the same was true of Zettair.
SMART too was difficult to install on Windows 8 and also implements a vector model. Terrier
worked with the TREC dataset, which contains no software libraries, so it was not applicable
to this research. There was very little documentation available for Zettair and Sphinx, so no further
time was spent exploring those systems. There are many other free information retrieval
software applications; they all work similarly but differ in size: some can handle large
amounts of data while others are for small data systems (Eckard & Chappelier, 2007). There are
a growing number of systems that work with the TREC system of data, and these systems are being
studied in academic fields everywhere. With the increasing amount of data on the web, search
engines need to be able to handle massive amounts of data in a short amount of time (Eckard &
Chappelier, 2007). Lucene is an open source application that is built on the Apache framework
of free software. The Lucene application contains an indexer and a searcher plug-in. The Lucene
search does include multiple different calculations of similarity, but none allows for a custom
similarity measure to be created; therefore a researcher-written similarity will be written using
the Lucene indexer to implement the MMM, Paice and P-Norm models.
Measures of similarity
There have been a number of measures created and used in the industry to measure the
quality of retrieval in an information retrieval system. The most well-known are recall and
precision, which will be the measures used in this study. Recall is defined by Croft, Metzler and
Strohman (2010, p. 309) as “the proportion of relevant documents that are retrieved” and
precision is “proportion of a retrieved set of documents that are actually relevant”. These
measures are inversely related: as precision goes up, recall goes down and vice versa (Binkley &
Lawrie, 2008). Recall = |A ∩ B| / |A| and precision = |A ∩ B| / |B|, where A is the set of
relevant documents and B is the set of retrieved documents. The degree of precision is not a
number that is easily calculated; one way precision can be calculated is to look at specific cutoff
points in the returned ranked list. For this study, the number of relevant documents found should
match those found in the study by Maarek since the same data corpus is used. The study by
Maarek will be used as a standard which the data in this study should match. Precision will be
considered only based on the first ten items in the list, or the precision at ten documents retrieved.
Ten is a popular base because it is the number of items returned on a single result page used by
most search engine web pages (Turpin & Scholer, 2008).
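The set definitions of recall and precision, and the cutoff at ten documents, translate directly into code. The sketch below is a minimal illustration with made-up function names; dividing by k even when fewer than k documents are returned is one common convention and is an assumption here.

```python
def recall(relevant, retrieved):
    """|A ∩ B| / |A| where A is the relevant set and B the retrieved set."""
    relevant, retrieved = set(relevant), set(retrieved)
    return len(relevant & retrieved) / len(relevant) if relevant else 0.0


def precision(relevant, retrieved):
    """|A ∩ B| / |B| where A is the relevant set and B the retrieved set."""
    relevant, retrieved = set(relevant), set(retrieved)
    return len(relevant & retrieved) / len(retrieved) if retrieved else 0.0


def precision_at_k(relevant, ranked, k=10):
    """Fraction of the first k ranked documents that are relevant."""
    relevant = set(relevant)
    return sum(1 for doc in ranked[:k] if doc in relevant) / k
```

For example, with four relevant files a, b, c and d and a ranked result list of a, x, b, recall is 2/4, precision is 2/3, and the precision at a cutoff of two documents is 1/2.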
The more precise the match is, or the more specific the query gets, the lower recall will be.
For example, if a search is for sum function there should be plenty of matches returned: low
precision, high recall. If the query is narrowed to integer sum function, the returned solutions
will be more precise, but the number of items returned will be smaller, because it has been
shown that, in general, the higher the recall the lower the precision. Precision will be calculated
similarly to the Maarek study. By looking at different points of recall, the precision will be
graphed and average precision extrapolated when there is no exact precision calculated. This
research will follow a common procedure to calculate average precision: for each relevant file,
the precision is calculated dependent upon the location of each relevant file in the returned list of
documents (Maarek, 1991, p. 811). This way of calculating precision is common in IR. This
study also uses the mean average precision (or MAP) to calculate the overall effectiveness of the
system and uses this value to compare to other systems (Manning, Raghavan & Schutze, 2008).
The MAP is just the sum of the average precision values of the individual queries divided by the
total number of queries.
Recall and precision are the most commonly used measures of information retrieval
algorithms, although Raghavan, Jung and Bollmann (1989) say there are issues with using these
two measures in an IR system. Recall and precision are not suitable measures for multiple
queries that eventually will get averaged, precision values can be off and are based on a user’s
interpretation of relevant (Raghavan, Jung & Bollmann, 1989). Even with the given issues, this
research will utilize the recall and precision along with the MAP measurement for data
evaluation.
As good as recall and precision are in showing the number of relevant documents and the
degree of precision or match an IR system can return, to compare systems to each other this study will
use the mean average precision measure. The mean average precision (or MAP) is a widely used
measure that results in a single numerical figure that represents the effectiveness of a system
(Turpin & Scholer, 2006). With a single measure of quality across recall values, multiple
systems can now be compared to each other (Manning, Raghavan & Schutze, 2008). MAP is
defined as

\[ \text{Average Precision} = \frac{\sum_{r=1}^{N} \left( P(r) \times rel(r) \right)}{\text{number of relevant documents}} \quad (7) \]

\[ \text{Mean Average Precision} = \frac{\sum_{q=1}^{Q} AP(q)}{Q} \quad (8) \]
Where Q is the total number of queries, N is the number of retrieved documents, r is the rank in
the sequence of retrieved documents, P(r) is the precision at rank r, rel(r) is 1 if the item at r is
relevant and 0 if it is not. Average precision is calculated by “taking the mean of the precision
scores obtained after each relevant document is retrieved, with relevant documents that are not
retrieved receiving a precision score of zero. MAP is then the mean of average precision scores
over a set of queries” (Turpin & Scholer, 2006, p. 12). MAP is a popular metric used in IR
system comparisons and has been shown to be “stable across query set size and variations in
relevance” (Turpin & Scholer, 2006, p. 12). Because of its stability and ability to be used as a
comparison for IR systems that use different search algorithms, MAP will be the calculation used
to compare the IR systems in this study.
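Average precision and MAP as defined above can be sketched as follows. The function names and the two example queries are illustrative assumptions; relevant documents that are never retrieved contribute a precision of zero, as stated in the Turpin and Scholer definition.

```python
def average_precision(relevant, ranked):
    """Sum of P(r) * rel(r) over the ranked list, divided by the number of relevant documents."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:  # rel(r) = 1
            hits += 1
            total += hits / rank  # P(r), the precision at rank r
    return total / len(relevant)


def mean_average_precision(runs):
    """Mean of the average precision over (relevant, ranked) pairs, one per query."""
    return sum(average_precision(rel, ranked) for rel, ranked in runs) / len(runs)


# Two hypothetical queries with their relevant sets and ranked results.
runs = [({"a", "b"}, ["a", "x", "b"]), ({"a", "c"}, ["x", "a"])]
map_score = mean_average_precision(runs)
```

In the first query the relevant files appear at ranks 1 and 3, giving an average precision of (1/1 + 2/3) / 2. In the second query only one of the two relevant files is retrieved, at rank 2, giving (1/2) / 2. MAP is the mean of those two values.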
Literature Review Summary
Review of the literature revealed many articles closely related to the topic of study. For
example, from the literature it was found that software that is easily found has a higher
probability of reuse and that there are many ways to search for data, including Boolean and fuzzy
logic. The literature also showed that software reuse is not utilized to its fullest capacity and
that companies could benefit if software were easier to retrieve and more accurately retrievable.
For the purpose of this study, it was important to understand the benefits of software
reuse, the many applications of information retrieval and their successes and failures, and how
fuzzy sets work and how they can benefit an information retrieval algorithm for software
components. The literature showed a wide range of applications of information retrieval of data
and the use of fuzzy sets for information retrieval, but none showed an effective fuzzy set approach to
information retrieval specific to the retrieval of software. Software reuse has proven its
importance in the software development community, and with the use of fuzzy sets to retrieve a
wider range of components this research hopes to increase the success of finding software for the
purpose of reuse.
Chapter 3 introduces the methodology used in this research including the algorithm and
test data sets. This chapter explores the quantitative research methodology and the reasons this
methodology was chosen over a qualitative approach. Chapter 4 discusses the study in more
detail and detailed explanation of what the results mean. Chapter 4 will explain how the data
either confirms or refutes the hypothesis from Chapter 1. Chapter 5 will discuss the items that
were not discussed in this study, and answer the research questions that were not able to be
answered in Chapter 4. Other possible research that includes this study’s algorithms is discussed
in Chapter 5.
CHAPTER 3: METHODOLOGY
In Chapter 1 we explained the background of the study, the reason for the study and the
significance to the industry that this study will provide. In Chapter 2 we explored the literature
surrounding software reuse and the many possible ways to effectively retrieve information.
Fuzzy sets were defined and explored more as a way to obtain desired information. In this
chapter we look at the methodology used for employing fuzzy logic for information retrieval of
software, and the statistical evaluation to determine if that selected model was significant. We
also discuss the makeup of the data, how it is gathered, the instrumentation used to gather data
and the limitations of the study. Also discussed is the validity and reliability of the data that is
used for this research.
Lucene
Lucene is a Java based, open source, free software application that is developed on the
Apache framework which is an open-source software group that includes over 150 software
projects. The Lucene application includes an indexer and a searcher. For the purpose of this
application, the Lucene indexer will be used as is to index the data corpus and we developed a
search algorithm to search the data. Lucene will be used in an application we created in the
Eclipse environment using Java. Eclipse is a free integrated development environment (IDE)
that includes a full Java compiler. Lucene has four English language analyzers that can parse
documents into the index. The StandardAnalyzer is a general purpose analyzer; the
WhiteSpaceAnalyzer parses data separated by white space; the StopAnalyzer parses out stop
words, which are common English language words that usually don’t help in indexing; and the
SnowballAnalyzer parses out words based on the Porter Stemming algorithm, which reduces
words to their roots. For example, the words stopping and stopped will be indexed as stop to
reduce redundant indexing (Smart, 2006). Frakes, Harman and Candela all observed that in
studies where stemming was done, it produced better results; therefore in this study the
SnowballAnalyzer was used to stem and index the documents (Frakes & Baeza-Yates, 1992).
Lucene 4.6.1 treats each document added to the index as a collection of fields: one field
stores the contents of the data, one stores the path to the document, and the third stores the
time/date stamp of the last update to the index. The IndexWriterConfig class configures how the
index is created; if an index has already been created, it is overwritten. The IndexWriter
class accepts data from files as fields, which can be changed depending on the files; this
research uses the default, the contents field, which contains the data from the file. The
SnowballAnalyzer parses the data first, stemming words using Porter's algorithm. The data is
then written to the index using the IndexWriter. If an error occurs in this process, a Java
exception is thrown and written to the error log file, the data is skipped and indexing
continues. Once the index is created, the data can be searched. Some applications can read the
index file and run searches, and Lucene also provides search plug-ins that let users create
their own search. Luke is one third-party application designed to read a Lucene index. The
issue with Luke and the other available applications was that the search algorithm could not be
replaced with a custom algorithm. For this reason, we have written an adapted search in Java.
To search the index, a QueryParser must be instantiated; it parses the search terms entered by
the user into the format required by Lucene. For example, if the query is sum AND add, Lucene
requires the search to be in the form +sum +add, where the ‘+’ marks a required term. Since
Lucene includes a parser, there was no need to change the way the query was entered, and we let
the software make the alterations needed. The field to be searched must also be given when the
QueryParser is created; for this research we access the contents field. For scoring, Lucene
uses the Similarity class, which offers four options for similarity calculations.
DefaultSimilarity implements tf/idf ranking, BM25Similarity implements the BM25 ranking
algorithm, MultiSimilarity implements the CombSUM algorithm and PerFieldSimilarityWrapper
allows a different ranking method to be specified per field (Carpenter, Morris & Baldwin,
2011). Lucene does not use a strict tf/idf but rather a modified tf/idf with boosting factors,
such as how often a term is found in the entire corpus, the total number of words in the corpus
and the total number of words in a document. Because we found no way to change those values or
to implement a custom similarity in Lucene, we have written a search algorithm to run the
calculations needed. This research uses the Boolean, Mixed Min and Max (MMM), Paice and P-norm
models (calculations in Chapter 2, equations 4-6) (Carpenter, Morris & Baldwin, 2011). The
returned data is evaluated by experts and used to calculate recall, precision and MAP
(calculations defined in Chapter 2, equations 7-8). The higher the MAP, the more precise a
system is.
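The query handling described above can be illustrated with a short Python sketch. It is not the
Lucene QueryParser and not the Java code written for this study; the function names are
hypothetical, and it assumes queries of exactly two terms joined by an uppercase AND or OR.

```python
import re


def parse_boolean_query(query):
    """Split 'termA AND termB' or 'termA OR termB' into (term_a, operator, term_b)."""
    match = re.fullmatch(r"\s*(\S+)\s+(AND|OR)\s+(\S+)\s*", query)
    if match is None:
        raise ValueError("expected 'termA AND termB' or 'termA OR termB'")
    term_a, operator, term_b = match.groups()
    return term_a.lower(), operator, term_b.lower()


def to_prefix_syntax(term_a, operator, term_b):
    """Render the query in prefix form, where '+' marks a required term."""
    if operator == "AND":
        return f"+{term_a} +{term_b}"
    # Assumption: for OR, neither term is required, so no prefix is added.
    return f"{term_a} {term_b}"
```

For the example in the text, `to_prefix_syntax(*parse_boolean_query("sum AND add"))` yields
`+sum +add`.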
Procedure
The UNIX help libraries, downloaded as .txt files from the FreeBSD UNIX website, are used as
the corpus of files and indexed with the Lucene indexing application. The data files are not
only indexed but also stemmed to reduce redundant words in the index. A developer-created
search application, with algorithms to calculate the MMM (Mixed Min and Max), P-norm, Boolean
and Paice values, was executed with researcher-developed queries built from common UNIX help
terms, for example print, file, move, etc. The returned list of matched files was then
presented in descending order by the similarity scores calculated by the implemented
algorithms. Using a quantitative correlational experimental model, which infers the truth of
theories by comparing quantitative data collected from experiments, each method's returned list
of matches is compared to the others to determine whether correlation is present and whether
the results give clear evidence that the hypothesis is true or not true. Based on similar
studies listed in the literature review, the Boolean method should yield the least precise
result list, that is, the smallest MAP value.
Data
Using a quantitative methodology, we index all the data files in the corpus and then run
searches for software components in the UNIX library. The UNIX help files were downloaded from
the FreeBSD UNIX website onto the researcher's computer. FreeBSD is the free version of the
Berkeley Software Distribution, which is a popular free version of UNIX. The data was
downloaded from http://www.freebsd.org/cgi/man.cgi/help.html. The UNIX library is divided into
8 categories: category 1 is for user commands, 2 for system calls, 3 for library functions,
etc. Since we are trying to design an effective search for software, only the category 1 files
are used in this study, which results in 681 files. These files form the data store that is
indexed and searched. The data files are simple text files of varying size, depending on the
functionality of the command. Figure 2 shows the directory of files and Figure 3 shows the
contents of the alarm.1 file. Figure 2 shows that the files all relate to specific commands
found in the manual pages. The files contain the details of the command and any useful
information that a user may need, such as last update, license, terms of use, etc.
Figure 2: Screenshot of the files in the directory used for the data corpus
Figure 3: Screenshot of the alarm.1 man file from the FreeBSD data corpus
Using the Lucene 4.6.1 IndexWriter and SnowballAnalyzer, the index was created from the corpus
of files. It contains the individual terms, the path where each file is located, and the total
number of files. These statistics were calculated at indexing time and were then available to
the custom search. A user-created application using the Lucene 4.6.0 search plug-in was run on
the index to search for the user-entered query.
We developed searches that were run on the Lucene-created index after the queries were parsed
into separate query terms, and that returned the documents containing the queried terms. If the
query includes a Boolean “AND”, only the documents that include both queried terms were
considered; if the query includes an “OR”, all files that contain either term or both terms
were considered.
An array was created that stored the document IDs of the documents that fit the query. If the
query was an “AND”, only the files that contain both terms were added to the result array; if
the query was an “OR”, all files containing at least one of the terms were added to the array.
This became the final list of documents that matched the query. To calculate the similarity
score, the tf/idf was calculated first. The tf/idfA is the term frequency inverse document
frequency for search term A, and tf/idfB is the term frequency inverse document frequency of
search term B. To calculate each of these, first add one to the log of the term's frequency in
the document, then multiply that number by the log of the total number of files in the corpus
divided by the number of documents in which the term is located. For example, if there are 681
files in the corpus, 30 of them contain TermA, and TermA is found 4 times in document 1, the
tf/idf of TermA for document 1 would be (1 + log(4)) * log(681 / 30). Although there are many
other variations of the tf/idf calculation, Baeza-Yates and Ribeiro-Neto say this calculation
is the most frequently used and the most effective. Those scores were then used as term weights
in the three extended Boolean algorithms. The pseudo code to calculate the Mixed Min and Max
similarity is below:
If the search is an AND then (1)
MMM = .4 * Min(tf/idfA, tf/idfB) + .6 * Max(tf/idfA, tf/idfB)
Else (the search is an OR)
MMM = .7 * Min(tf/idfA, tf/idfB) + .3 * Max(tf/idfA, tf/idfB)
End if
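As a rough illustration of the term weighting and of pseudo code (1), the following Python
sketch computes the tf/idf weight and the MMM score. It is not the Java code listed in
Appendix B. The function names are hypothetical, and the base of the logarithm is not stated in
the text, so base 10 is an assumption made here.

```python
import math


def tf_idf(term_freq, doc_freq, total_docs, log=math.log10):
    """Weight = (1 + log(tf)) * log(N / df); 0 when the term is absent.

    The log base is an assumption (base 10); pass log=math.log2 or math.log to change it.
    """
    if term_freq == 0 or doc_freq == 0:
        return 0.0
    return (1 + log(term_freq)) * log(total_docs / doc_freq)


def mmm_score(weight_a, weight_b, operator):
    """Mixed Min and Max score with the coefficients given in pseudo code (1)."""
    low, high = min(weight_a, weight_b), max(weight_a, weight_b)
    if operator == "AND":
        return 0.4 * low + 0.6 * high
    return 0.7 * low + 0.3 * high  # OR query
```

With the example above (4 occurrences, 30 of 681 files) and base-10 logarithms,
`tf_idf(4, 30, 681)` comes to roughly 2.17.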
This number was then stored in the MMM array. Using 2 for p and 1 for the document weights,
the pseudo code for the P-Norm calculation is as follows (the full code for similarity
calculations is available in Appendix B):
If the search is an AND then (2)
$\mathit{PNorm} = 1 - \sqrt{\frac{(1)^2 (1 - \mathit{tfidfA})^2 + (1)^2 (1 - \mathit{tfidfB})^2}{(1)^2 + (1)^2}}$
Else (it must be an OR search)
$\mathit{PNorm} = \sqrt{\frac{(1)^2 (\mathit{tfidfA})^2 + (1)^2 (\mathit{tfidfB})^2}{(1)^2 + (1)^2}}$
End if
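The P-Norm pseudo code can be sketched in Python as follows. This is an illustration under
stated assumptions rather than the Appendix B code: the function name is hypothetical, the
defaults follow the text (p = 2, weights of 1), and the standard P-norm formula assumes term
weights between 0 and 1, so un-normalized tf/idf values would need scaling first.

```python
def p_norm_score(weight_a, weight_b, operator, p=2.0, query_weights=(1.0, 1.0)):
    """Extended Boolean P-norm similarity for a two-term query."""
    qa, qb = query_weights
    denominator = qa ** p + qb ** p
    if operator == "AND":
        # Distance from the ideal point (1, 1); abs() keeps the power real-valued.
        total = (qa ** p) * abs(1 - weight_a) ** p + (qb ** p) * abs(1 - weight_b) ** p
        return 1 - (total / denominator) ** (1 / p)
    # OR query: distance from the worst point (0, 0).
    total = (qa ** p) * abs(weight_a) ** p + (qb ** p) * abs(weight_b) ** p
    return (total / denominator) ** (1 / p)
```

A document with weight 1 for both terms scores 1 for an AND query, and a document with weight 0
for both terms scores 0 for an OR query.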
The Paice was calculated using the recommended values of 1 for an AND query and .7 for an
OR query (Frakes, Baeza-Yates, 2012).
If the search is an AND then
$\mathit{Paice} = \frac{1^0 \cdot \min(\mathit{tfidfA}, \mathit{tfidfB}) + 1^1 \cdot \max(\mathit{tfidfA}, \mathit{tfidfB})}{1^0 + 1^1}$ (3)
Else (the search must be an OR)
$\mathit{Paice} = \frac{.7^0 \cdot \max(\mathit{tfidfA}, \mathit{tfidfB}) + .7^1 \cdot \min(\mathit{tfidfA}, \mathit{tfidfB})}{.7^0 + .7^1}$
End if
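A minimal Python sketch of the Paice calculation follows. The function name is hypothetical and
this is not the Appendix B code; it assumes the usual form of the Paice model, in which the
weights are sorted ascending for AND and descending for OR and the i-th weight is multiplied by
the coefficient raised to the power i (starting at 0).

```python
def paice_score(weights, operator, r_and=1.0, r_or=0.7):
    """Paice similarity for a non-empty list of term weights."""
    if operator == "AND":
        r, ordered = r_and, sorted(weights)  # ascending: MIN comes first
    else:
        r, ordered = r_or, sorted(weights, reverse=True)  # descending: MAX comes first
    coefficients = [r ** i for i in range(len(ordered))]
    weighted_sum = sum(c * w for c, w in zip(coefficients, ordered))
    return weighted_sum / sum(coefficients)
```

With a coefficient of 1, the AND score reduces to the plain average of the two term weights.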
Several different situations first needed to be taken into consideration in order to find all
matched data. For AND queries: 1) one or more files contain both queried terms, in which case
similarity is calculated as normal; 2) files contain only one of the queried terms, in which
case the similarity calculation is aborted. For OR queries: 1) one or more files contain both
queried terms and other files contain only one queried term, in which case similarity
calculations run as normal; 2) one or more files contain the first term and one or more other
files contain the second term, in which case similarity calculations use 0 for the term not
found in a file; 3) one or more files contain only one searched term and the other search term
is not found in any file, in which case similarity is calculated normally using 0 for the other
term's similarity.
Another condition that needed to be accounted for: if no documents match term A but the list of
documents matching term B contains one or more documents, the check for an empty result set
cannot stop as soon as one list has length 0; both lists need to be checked, and both must have
length 0. In this case, the term A similarity needs to be 0 while calculating the similarity
for term B, and vice versa.
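The cases above can be summarized in a short Python sketch. It is an illustration only, with a
hypothetical function name and data layout (one dictionary per term, mapping document ID to
tf/idf weight), not the Java code used in the study.

```python
def candidate_documents(postings_a, postings_b, operator):
    """Return {doc_id: (weight_a, weight_b)} for the documents that fit the query.

    AND keeps only documents containing both terms; OR keeps documents containing
    either term and uses a weight of 0 for the term a document does not contain.
    """
    if operator == "AND":
        doc_ids = postings_a.keys() & postings_b.keys()
    else:
        doc_ids = postings_a.keys() | postings_b.keys()
    return {
        doc_id: (postings_a.get(doc_id, 0.0), postings_b.get(doc_id, 0.0))
        for doc_id in sorted(doc_ids)
    }
```

Because the OR branch takes the union of both lists, the result is empty only when both lists
are empty, which is the check described above.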
After the MMM, P-Norm and Paice scores are calculated, they are stored in arrays, sorted in
descending order and displayed on the screen. The data returned from the experts was used to
calculate the recall, precision and MAP.
Figure 4: user interface for researcher created software
Figure 4 shows the researcher-developed output of the search for a query (addend AND sum) and
the results as calculated by the MMM, P-Norm, Paice and Boolean implementations. The returned
files are listed in ranked order by score, and the total number of files that match each term
is shown at the bottom. Since the query was an AND query, only the files that match both
queried terms are returned. Term 2, sum, is contained in 13 files, while addend is contained in
only 1 file. Because a Boolean search scores 1 for terms being present and 0 for terms not
being present, the Boolean score of 1 indicates the search terms are both found in file number
613. The other searches return a score based on the similarity equations described in Chapter 2
(equations 2-7, pp. 38 - 39).
Sample Size
Using the ratio of queries to documents that has been used to evaluate IR systems in the
literature, “MED (collection of medical abstracts, 30 queries for 1033 documents) or CISI
(information science abstracts, 35 queries for 1460 information abstracts)”, the number of
queries is set between 2% and 3% of the total number of files to be tested (Maarek, Berry &
Kaiser, 1991, p.811). With a corpus of 681 files, that results in 20 queries
(http://www.freebsd.org/cgi/man.cgi/help.html). Each query is a minimum of two words joined by
the Boolean operator AND or OR. The queries do not contain any keywords, to mimic a user
searching for the command that performs a certain action. For example: copy AND file, move OR
delete, etc.
Instrumentation
The data corpus is stored and the queries have been run. A program using the Lucene 4.6.1
plug-in application has been created in Java using the Eclipse development environment. An
index has been created from the FreeBSD UNIX category 1 data files, and queries were run on
that index to gather the results. After the queries were run and the MMM, Paice, P-norm and
Boolean similarities calculated, the results were evaluated by a panel of UNIX experts, who
decided whether the returned files were relevant and, if so, in which order. The two UNIX
experts who participated in this study are one staff member from Colorado Technical University
and one system administrator for the U.S. Missile Defense System at Peterson Air Force Base.
A software application has been created that first creates the index using the Lucene
IndexWriter. The user is then prompted for a search query; once the query is entered, the
program parses it and searches for each query term, using Lucene's IndexReader to read through
the index. Once matching documents are found, the document IDs are stored in an array for later
use. The program then loops through the list of matching documents, calculates the similarity
scores and stores them in another array. The arrays are then sorted from largest to smallest
and printed.
Recall is calculated as the number of relevant documents returned over the total number of
relevant documents in the corpus, and precision is calculated as the number of relevant
documents returned over the total number of documents returned. A report was generated in Excel
to show the returned list of documents per query and sent to the panel of experts. Once the
experts returned their lists of relevant files, the mean average precision could be calculated.
An in-house data set differs from an online data set and results in different values. As the
literature has shown, the Boolean search should yield the worst results.
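For reference, the standard set-based definitions of the two measures (relevant documents
returned over documents returned, and relevant documents returned over relevant documents) can
be sketched in Python. The function name and the sample file names are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant documents retrieved / documents retrieved
    recall    = relevant documents retrieved / relevant documents in the corpus
    """
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall
```

If a search returns four files, two of which are among three files the experts marked as
relevant, precision is 2/4 and recall is 2/3.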
Validity and Reliability
Data retrieved from each search is compared using precision, recall and the calculated MAP
value. To determine precision and the relevance of the files found in the search to the queried
terms, a panel of 3 UNIX experts was established. Of the 3, only 2 were able to participate and
return data. The panel decided whether the data returned was relevant and, if so, how relevant,
by ranking the files from most to least relevant. If there were relevant files not returned by
the searches, the experts were to write them in as well. The files that were not returned but
deemed relevant were then used to calculate recall. Precision was calculated by how well each
search's returned list of files matched the experts' lists. Once the recall, precision and MAP
values were calculated, the MAP value was used to compare the different search methods to each
other; the higher the MAP, the more precise a system is.
Conclusion
This chapter looked at the methodology and procedure used to collect data for analysis by
running multiple queries on a software data bank using weighted retrieval calculations to match
terms. The data collected by running these queries is analyzed and discussed further in
Chapter 4. The methodology defined in this study is based on the assumptions that software in
the UNIX help library has similar qualities to other software, and that response time is not an
issue when searching for software. The MAP values are compared, and the search with the best
MAP value is concluded to be the most precise search of software.
CHAPTER 4: RESULTS
In this chapter, the data results are discussed, the search methods are reviewed and the
results are analyzed as to how they answer the research questions. First, the data collection
is reviewed along with any issues encountered while collecting the data. Second, a thorough
review of the results is presented. Third, each of the IR methods used is discussed along with
its individual results. Last, the research questions are examined with answers from the
results.
Data Collection Reviewed
The data corpus for this study was initially intended to be a software library easily
available for search. The literature had shown that the UNIX online manual pages, accessed in
UNIX via the man command, have worked successfully as a corpus to simulate a software library
(Maarek, Berry & Kaiser, 1991). The FreeBSD version of the UNIX manual pages contains not only
user commands but also system calls, etc., so the researcher and the team of UNIX experts
decided to use only the category 1 files in the manual pages. Category 1 contains the commands
users would enter if using the online help system in a UNIX environment, and contains fields
for each command describing what the command does and what other commands or parameters are
needed to use the command correctly. Figure 5 shows the directory of category 1, where the
command each file refers to can easily be identified from the file name.
Figure 5: Directory of category 1 UNIX manual pages
Once the other categories were removed, category 1 was left with 681 files. The other
categories include system calls, function calls, methods and their parameters, etc. that are
used internally by UNIX; since those files are not accessible to the user, they were removed
from the corpus to save indexing time and space.
Using the Lucene SnowballAnalyzer with the Lucene indexer, the data files were all read and
parsed into an index. The Snowball analyzer was written by Martin Porter and is a stemming
algorithm that reduces words to their lowest stem in order to reduce the size of the index and
find more related matches (McCandless, Hatcher, Gospodnetic, 2010). For example, the terms
stopping, stopped and stops all stem to the term stop. The Lucene index was created with three
fields: filename, path, and contents. The contents field contains the data in the files; this
was the field used to search for matching terms.
Presentation and Discussion of Findings
The goal of this dissertation is to test if implementing fuzzy logic into an information
retrieval system could result in a better search outcome. In these models, “a document has a
weight associated with each index term. This document weight is a measure of the degree to
which the document is characterized by that term” (Frakes & Baeza-Yates, 1992, p. 395). Zadeh
has defined a fuzzy set as any set whose elements have a degree of membership, and because the
three models in this study are measured by the degree to which a term belongs to a document, we
can therefore say the Paice, P-Norm and MMM are fuzzy models (Zadeh, 1994).
Typical IR effectiveness is based on search precision and query recall. Precision is based on
how accurately the returned documents match what the user requests, and recall is the number of
relevant results returned relative to the number of relevant files in the corpus. Because
recall and precision vary per query per search, the mean average precision is the measure most
IR studies use to compare searches to each other. The rest of this section discusses the
quantitative measures, including recall, precision and the mean average precision, for four
different search algorithms and tests the research hypothesis.
Limitations of the Study
The study was conducted on the UNIX man library pages. Because not all of the man pages were
relevant, only the category 1 files were considered. Future research could expand the corpus to
determine whether the same quality of results is achieved. The size of the data corpus will
also affect the search time; future research that compares search times should also look at
data corpora of different sizes to see how the time is affected.
The data was verified by a panel of only two experts; ideally there should be more experts to
verify the reliability of the data. A panel of users could also be included to determine
whether the data returned matched a user's needs specifically. Verifying that the search
returns information that is relevant to users would be an area for future study. Finding an
algorithm that can automatically determine what is considered relevant would also be a great
topic for future research. For software specifically, the panel should consist of software
engineers who use software components on a daily basis. This way the reusability of software
could be measured more accurately.
The study also relied on the Lucene 4.6.0 indexer to index the data and read the index. For
future research a new index could be created that is set up for software specifically.
Measure of Similarity
To calculate the MAP, the returned files were compared to the files returned by the
expert. The equation used to calculate MAP is
$\text{Average Precision} = \frac{\sum_{r=1}^{N} \left( P(r) \cdot rel(r) \right)}{\text{number of relevant documents}}$ (7)
$\text{Mean Average Precision} = \frac{\sum AP}{Q}$ (8)
Where Q is the total number of queries, N is the number of relevant documents, r is the rank in
the sequence of retrieved documents, P(r) is the precision at rank r, rel(r) is 1 if the item at r is
relevant and 0 if it is not. AP is average precision.
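Equations 7 and 8 can be sketched in Python as follows. The function names are hypothetical and
this is an illustration, not the procedure used to produce the reported scores. Precision at
rank r is taken as the number of relevant documents found so far divided by r, and only ranks
that hold a relevant document contribute to the sum.

```python
def average_precision(ranked_docs, relevant_docs):
    """Equation 7: sum of P(r) * rel(r) over the ranked list, divided by the
    number of relevant documents."""
    relevant = set(relevant_docs)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:  # rel(r) = 1
            hits += 1
            precision_sum += hits / rank  # P(r)
    return precision_sum / len(relevant)


def mean_average_precision(ap_values):
    """Equation 8: average of the AP values over all Q queries."""
    return sum(ap_values) / len(ap_values)
```

For instance, three relevant documents found at ranks 1, 4 and 10 give
(1/1 + 2/4 + 3/10) / 3 = 0.6.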
The average precision is calculated from the positions of the relevant files in each search's
result list. For example, the search for field AND float returned the data in Table 3. To
calculate the precision for the MMM in this search, the number of relevant files = 2; 2 files
were returned by the experts as relevant. The files used as relevant were the files that
matched between experts; other returned files were disregarded and not used. To find the
precision at each relevant file, divide the number of relevant files found thus far by the
position of that file in the search's returned list. The first relevant file is located in
position 2, so p1 = 1/2 = 0.5; the next relevant file is found in position 4, so the second
relevant file (2) divided by position 4 gives p2 = 2/4 = 0.5. Then AP = (0.5 + 0.5)/2 = 0.5,
that is, the sum of the precisions divided by the total number of relevant files. Since the
relevant files were found in the same positions for all searches, they all get an AP score of
0.5 for this query. Another example is the query for sum AND add. There is only one relevant
file, nawk.1, which is found in position five in the MMM search, so the average precision for
MMM is 1/5 = 0.2; for the Paice model, the file is found in the fourth position, so 1/4 = 0.25;
and for the Boolean model, the file is in the second position, so 1/2 = 0.5. In this example,
the Boolean search with an AP of 0.5 resulted in a better match, compared to MMM and P-Norm at
0.2 and Paice at 0.25.
TermA  Bool Operator  TermB  Rank  MMM doc   Paice doc  P-norm doc  Boolean doc  Expert 1 / Expert 2 relevant list
field  AND            float  1     whatis    Whatis     whatis      grn.1        printf / printf
                             2     printf.1  printf.1   printf.1    printf.1     Sort / sort
                             3     stat.1    stat.1     stat.1      seq.1        Seq / awk
                             4     sort.1    sort.1     sort.1      sort.1       perl
                             5     grn.1     grn.1      grn.1       stat.1
                             6     seq.1     seq.1      seq.1       tcsh.1
                             7     tcsh.1    tcsh.1     tcsh.1      whatis
sum    AND            add    1     whatis    whatis     whatis      id.1         nawk / nawk
                             2     id.1      unxz.1     id.1        nawk.1       expr / perl
                             3     unxz.1    id.1       unxz.1      ps.1
                             4     ps.1      nawk.1     ps.1        unxz.1
                             5     nawk.1    ps.1       nawk.1      Whatis
find   AND            file   1     find.1    find.1     find.1      afmtodit.1   find / find
                             2     whatis    whatis     whatis      apropos.1    locate / locate
                             3     lex++.1   lex++.1    lex++.1     as.1         less / less
                             4     less.1    less.1     less.1      bsdgrep.1    vi
                             5     tcsh.1    tcsh.1     tcsh.1      bsdtar.1
                             6     cpio.1    cpio.1     cpio.1      bzip2.1
                             7     unxz.1    unxz.1     unxz.1      chflags.1
                             8     vi.1      vi.1       vi.1        ci.1
                             9     id.1      id.1       id.1        clang++.1
                             10    locate.1  locate.1   locate.1    cpio.1
                             16    test.1    test.1     test.1      find
                             25    xargs.1   xargs.1    xargs.1     less
                             29    bzip.1    bzip.1     bzip.1      locate.1
Table 3: Returned data from searches
The last example is the query for find AND file. For the MMM, Paice and P-Norm, the three
relevant files are located in the same positions: the first file is found at the first
position, 1/1 = 1; the second is found in the fourth position, 2/4 = 0.5; and the third is
found in the tenth position, 3/10 = 0.3. To find the AP we add 1 + 0.5 + 0.3 = 1.8 and then
divide by the number of relevant files, in this case 3: 1.8/3 = 0.6. To find the AP for the
Boolean search, we had to include more than the top ten returned files; the three relevant
files were found at positions 16, 25 and 29 respectively. For this calculation, the first file
gives 1/16 = 0.0625; the second file, 2/25 = 0.08; and the last file, 3/29 = 0.103. To find the
AP, add 0.0625 + 0.08 + 0.103 = 0.2455, then divide by the number of relevant files,
0.2455/3 = 0.0818. In this example it is clear that an AP of 0.6, which was the AP of the MMM,
P-Norm and Paice models, is much higher than the Boolean score of 0.08.
TermA  Bool Operator  TermB  MMM doc  MMM score  Paice doc  Paice score  P-norm doc  P-Norm score  Bool top doc  Bool score
field  AND            float  613      12.54      613        15.56        613         10.47         222           1
                             406      9.17       406        12.62        406         7.01          406           1
                             492      8.37       492        12.5         492         6.36          460           1
                             482      7.59       482        8.96         482         5.7           482           1
                             222      6.46       222        7.87         222         4.5           492           1
                             460      4.89       460        5.54         460         3.11          511           1
                             511      3.25       511        4.17         511         1.25          613           1
copy   AND            file   395      3.6        395        3            395         2.61          19            1
                             613      3.57       613        2.98         613         2.57          20            1
                             109      3.4        109        2.84         109         2.38          21            1
                             379      3.36       379        2.8          379         2.33          27            1
                             295      3.03       295        2.52         295         1.95          50            1
                             331      2.97       331        2.47         331         1.88          68            1
                             111      2.9        111        2.42         111         1.8           69            1
                             255      2.9        255        2.42         255         1.8           86            1
                             459      2.82       459        2.36         459         1.71          89            1
                             569      2.82       569        2.36         569         1.71          97            1
Table 4: Scores with file numbers returned
Because the Boolean search returned 1 for found and 0 for not found, all of its scores tie, and
the Boolean search listed its files in order of file number. The other searches listed their
results in descending order by score, then by file number. This is illustrated in Table 4
above: where the scores match (for example, files 111 and 255 in the copy AND file query), the
files are then sorted by file number in ascending order. Because of this, two files with the
same similarity are returned in alphabetical (file number) order, not in order of relevance to
the user, which supports using the MAP as the measure of similarity for an IR study, as it
looks not just at the ranked order of returned documents.
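The ordering described here can be sketched in Python with a two-part sort key. The function
name is hypothetical; ties are broken by ascending file number, which is the order in which the
tied rows appear in Table 4.

```python
def rank_results(scores):
    """Sort {file_number: score} by score (highest first), then by file number."""
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

For example, `rank_results({255: 2.9, 111: 2.9, 395: 3.6})` places file 395 first and then
files 111 and 255, which share a score, in file-number order.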
Quantitative Methodology Measurement
Table 5 shows the average precision per query; the bottom rows contain the total average
precision and that total divided by the number of queries, which gives the mean average
precision. As can be seen, the Boolean search returned a better average precision in only 3 of
the 20 queries. The other searches returned similar results, and the Mixed Min and Max (MMM)
and P-Norm searches returned exactly the same results. Based on these results, the Paice was
the best-performing search, with a mean average precision of .5575.
TermA        Bool Operator  TermB      mmm    Paice   pnorm  boolean
field        AND            float      0.5    0.5     0.5    0.5
copy         AND            file       0.33   0.33    0.33   0.09
run          OR             execute    1      1       1      0.007
column       AND            height     1      1       1      1
constructor  AND            length     0      0       0      0
addend       AND            sum        0      0       0      0
math         AND            library    0.59   0.84    0.59   1
space        AND            gap        1      1       1      0.33
mergesort    OR             heapsort   1      1       1      1
cut          OR             delete     1      1       1      0.33
oldest       AND            command    1      1       1      0.5
sort         AND            mergesort  1      1       1      1
deploy       AND            run        0      0       0      0
find         AND            file       0.6    0.6     0.6    0.08
list         AND            files      0.2    0.25    0.2    0.5
sum          AND            add        0.1    0.13    0.1    0.25
subtract     AND            math       0.5    0.5     0.5    0.5
add          AND            subtract   0      0       0      0
printer      AND            network    0      0       0      0
mergesort    AND            print      1      1       1      1
Sum of the queries                     10.82  11.15   10.82  8.087
Sum/20                                 0.541  0.5575  0.541  0.40435
Table 5: Scores of average precision per query and resulting MAP scores
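The column totals can be checked with a few lines of Python. The per-query values below are
transcribed from Table 5 in row order; the Boolean value for oldest AND command is read as 0.5,
the value consistent with the reported column sum of 8.087. The variable and function names are
hypothetical.

```python
AVERAGE_PRECISION = {
    "mmm": [0.5, 0.33, 1, 1, 0, 0, 0.59, 1, 1, 1,
            1, 1, 0, 0.6, 0.2, 0.1, 0.5, 0, 0, 1],
    "paice": [0.5, 0.33, 1, 1, 0, 0, 0.84, 1, 1, 1,
              1, 1, 0, 0.6, 0.25, 0.13, 0.5, 0, 0, 1],
    "pnorm": [0.5, 0.33, 1, 1, 0, 0, 0.59, 1, 1, 1,
              1, 1, 0, 0.6, 0.2, 0.1, 0.5, 0, 0, 1],
    "boolean": [0.5, 0.09, 0.007, 1, 0, 0, 1, 0.33, 1, 0.33,
                0.5, 1, 0, 0.08, 0.5, 0.25, 0.5, 0, 0, 1],
}


def mean_average_precision(ap_values):
    """Sum of the per-query average precisions divided by the number of queries."""
    return sum(ap_values) / len(ap_values)


MAP_SCORES = {
    model: mean_average_precision(values)
    for model, values in AVERAGE_PRECISION.items()
}
```

Up to floating-point rounding, `MAP_SCORES` reproduces the bottom row of the table: 0.541 for
MMM and P-Norm, 0.5575 for Paice and 0.40435 for Boolean.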
Because the Mixed Min and Max (MMM), Paice and P-Norm models use term weights, they are
considered fuzzy, and we can say the fuzzy models returned a better result than the Boolean
search. Based on the MAP results, the searches ranked by best score are the fuzzy model Paice,
followed by a tie between the fuzzy models MMM and P-Norm, and finally the Boolean search.
Table 6 shows the search models with their resulting MAP scores and how they ranked.
Search Method  Paice   MMM    P-Norm  Boolean
MAP            0.5575  0.541  0.541   0.40435
Ranking        1       2      2       3
Table 6: Ranking of results of search performance
Hypothesis Testing
The original hypothesis stated that, using the term weights created in Lucene as the tf/idf, a
researcher-developed search implementing the MMM, P-Norm and Paice algorithms will result in a
more accurate list of files that meet the user's needs than a Boolean search. Looking at the
degree of membership will yield a higher rate of successful matches than the Boolean method,
and a higher success rate in returned software components should result in a higher reuse rate.
Table 5 shows the average precision for each query for each search algorithm. Table 7 shows the
resulting MAP scores and the percentage increase of each search over the Boolean search. Based
on these results, there is a 27.6% increase in the Paice model over the Boolean model, and a
25.25% increase in the MMM and P-Norm models over the Boolean model. These results support the
hypothesis that the fuzzy models deliver a better search result than a Boolean model.
                                MMM     Paice   P-Norm  Boolean
MAP                             0.541   0.5575  0.541   0.40435
% Increase over Boolean search  25.25%  27.6%   25.25%  0
Table 7: Resulting MAP scores for all searches and their % increase over the Boolean search
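The size of a percentage difference depends on which MAP is used as the baseline. The Python
sketch below (hypothetical names, MAP values taken from Table 7) computes the difference both
ways. Dividing by the Boolean MAP gives roughly 37.9% for Paice and 33.8% for MMM and P-Norm;
dividing by the fuzzy model's own MAP gives roughly 27.5% and 25.3%, which is close to the
figures reported in Table 7. That the second convention was the one used is an inference from
the arithmetic, not something the text states.

```python
MAP_SCORES = {"paice": 0.5575, "mmm": 0.541, "pnorm": 0.541, "boolean": 0.40435}


def increase_over_baseline(score, baseline):
    """Percentage increase of score relative to the baseline score."""
    return (score - baseline) / baseline * 100


def gap_relative_to_score(score, baseline):
    """Difference between score and baseline as a percentage of score itself."""
    return (score - baseline) / score * 100


for model in ("paice", "mmm", "pnorm"):
    over_boolean = increase_over_baseline(MAP_SCORES[model], MAP_SCORES["boolean"])
    of_own_map = gap_relative_to_score(MAP_SCORES[model], MAP_SCORES["boolean"])
    print(f"{model}: {over_boolean:.1f}% over Boolean, {of_own_map:.1f}% of its own MAP")
```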
Conclusion
Based on the experimental quantitative data shown in this study, the fuzzy logic methods do
return a better match for a user's query needs. The difference is clear and shows that using a
fuzzy method has a better chance of matching a user's needs by returning a better-ranked list
of matched documents. Chapter 4 showed the results in detail and explained their meaning;
Chapter 5 details where this study left off and what is to be done next.
CHAPTER 5: DISCUSSION OF RESULTS AND FUTURE WORKS
Any information retrieval system has the same main components: an indexer, a data corpus and a
search algorithm. The literature reviewed in Chapter 2 has shown that the way the data is
indexed can affect the way the data is searched, but the most influential part of any IR system
is the search algorithm. Chapter 3 described our methodology in detail and compared the results
to a Boolean search. Chapter 4 showed that the fuzzy logic methods had a substantial increase
in mean average precision over a Boolean search. This chapter will focus on the two research
questions that were not addressed in this study and also other areas of future research that
came up during this study.
Research Question 1
Research Question 1 asks whether software can be retrieved using a fuzzy logic algorithm that
returns a more accurate match to the user’s query. This question is examined by looking at the
resulting MAP scores from the different searches. The higher the MAP, the more accurate the
search. Based on the results in Table 7, Paice was the best overall search at returning an accurate
match to the user’s needs, with a MAP of 0.5575; next were MMM and P-Norm, each with
a MAP of 0.541; last was the Boolean model, with a MAP of 0.40435. This resulted in a
27.6% increase for Paice over Boolean and a 25.25% increase for the MMM and P-Norm
models over the Boolean search.
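The MAP comparison above can be illustrated with a minimal sketch. It assumes the standard definition of average precision (the sum of the precision values at each rank where a relevant file appears, divided by the number of relevant files); the exact variant used in this study is defined in the methodology, so the class name MapScore and its methods are hypothetical and shown only for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not the code used in this study.
final class MapScore {

    /** Average precision of one ranked result list against the set of relevant files. */
    static double averagePrecision(List<String> ranked, Set<String> relevant) {
        if (relevant.isEmpty()) {
            return 0.0; // assumption: a query with no relevant files scores 0
        }
        int hits = 0;
        double sum = 0.0;
        for (int i = 0; i < ranked.size(); i++) {
            if (relevant.contains(ranked.get(i))) {
                hits++;
                sum += (double) hits / (i + 1); // precision at this rank
            }
        }
        return sum / relevant.size();
    }

    /** Mean of the per-query average precision values. */
    static double meanAveragePrecision(double[] averagePrecisions) {
        if (averagePrecisions.length == 0) {
            return 0.0;
        }
        double total = 0.0;
        for (double ap : averagePrecisions) {
            total += ap;
        }
        return total / averagePrecisions.length;
    }

    public static void main(String[] args) {
        List<String> ranked = Arrays.asList("sort.1", "whatis", "stat.1");
        Set<String> relevant = new HashSet<>(Arrays.asList("sort.1", "stat.1"));
        System.out.println(averagePrecision(ranked, relevant));
    }
}
```

Computing averagePrecision once per query and per model, then passing the 20 values for a model to meanAveragePrecision, would give one MAP figure per model of the kind reported in Table 7.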
[Figure 6: line chart titled "Average Precision by model"; x-axis: queries 1 to 20; y-axis: average precision, 0 to 1.2; series: mmm, paice, pnorm, boolean]
Figure 6: Graph of average precision results from all 20 queries
Although not every query showed an improvement for the fuzzy logic methods, the overall
average is the better measure for this study, and it is higher for each fuzzy method than the
Boolean MAP. Full results from the experts and the searches are listed in Appendix A. The graph
in Figure 6 shows the average precision results for all 20 queries and how they mapped out per
search method. The MMM and P-Norm models returned similar scores for every query, so their
lines coincide.
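For reference, the three fuzzy similarity measures compared here can be sketched as follows. This is a minimal sketch based on the usual textbook definitions of Mixed Min and Max (MMM), Paice, and P-Norm for a query whose terms have document weights in [0, 1]; the class name FuzzySimilarity and the parameters c, r, and p are assumptions for illustration and are not the settings or the code used in this study (the study's code is listed in Appendix B).

```java
import java.util.Arrays;

// Hypothetical helper illustrating the textbook formulas.
final class FuzzySimilarity {

    /** MMM: weighted mix of min and max; c is the coefficient on min (AND) or max (OR). */
    static double mmm(double[] w, boolean isAnd, double c) {
        double min = Arrays.stream(w).min().orElse(0.0);
        double max = Arrays.stream(w).max().orElse(0.0);
        return isAnd ? c * min + (1 - c) * max
                     : c * max + (1 - c) * min;
    }

    /** Paice: weights sorted ascending (AND) or descending (OR), combined with powers of r. */
    static double paice(double[] w, boolean isAnd, double r) {
        double[] sorted = w.clone();
        Arrays.sort(sorted); // ascending
        double numerator = 0.0;
        double denominator = 0.0;
        double coeff = 1.0; // r^0, r^1, r^2, ...
        for (int i = 0; i < sorted.length; i++) {
            double d = isAnd ? sorted[i] : sorted[sorted.length - 1 - i];
            numerator += coeff * d;
            denominator += coeff;
            coeff *= r;
        }
        return denominator == 0.0 ? 0.0 : numerator / denominator;
    }

    /** P-Norm with equal query-term weights. */
    static double pNorm(double[] w, boolean isAnd, double p) {
        if (w.length == 0) {
            return 0.0;
        }
        double sum = 0.0;
        for (double d : w) {
            sum += Math.pow(isAnd ? 1 - d : d, p);
        }
        double norm = Math.pow(sum / w.length, 1.0 / p);
        return isAnd ? 1 - norm : norm;
    }

    public static void main(String[] args) {
        double[] weights = {0.8, 0.4};
        System.out.println(mmm(weights, true, 0.7));
        System.out.println(paice(weights, true, 1.0));
        System.out.println(pNorm(weights, true, 2.0));
    }
}
```

One observation from these formulas: for a two-term query, MMM with c = 0.5 and P-Norm with p = 1 both reduce to the mean of the two term weights, which is one setting under which the two models would return identical scores. Whether that matches the parameters behind Figure 6 is not stated in this chapter.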
Research Question 2
Research Question 2 asks whether searches for software require different parameters than
searches for standard information. This question was not examined in this study but is a
possibility for future research. Searching for software and searching for regular text should
differ only in the number of similar words that may replace search terms. Searching for software
should lower the number of possible synonyms for a search term because there are only so many
programming terms that match a user’s intent. Ambiguous words such as bat or can are less of a
concern; the ambiguity is limited to how one describes a specific programming command. For
example, the Java command System.out.println(), which prints output to the console followed by a
line break, could be searched for by output, print, or even new line. The benefit of searching for
software is that the general meaning behind the search terms should be clear, whereas in searching for
bat, which has multiple meanings, a search algorithm must decide not only which
documents match but also which meaning of the term the user intends.
The number and order of terms in a query is a focus of latent
semantic analysis (LSA). Deerwester, Dumais, Furnas, Landauer and Harshman (1990) take a
different approach to indexing, based on vectors created from documents and indexed terms. By
creating a matrix of terms and documents, multiple terms connect to the multiple documents
in which they are found. Finding a match to the query is then a matter of finding a point in space,
and this also allows similar documents to be returned, since documents that are close together may be
indexed with different terms. Because users use the same term to query only 20% of the time
(Deerwester, Dumais, Furnas, Landauer and Harshman, 1990), this helps find synonymous
terms even if the user doesn’t know any. The results of early testing were successful in finding
synonymous terms, but more work needs to be done. The correct term for an idea is
not always what is indexed in a document or thesaurus, so the authors say that including
concept-based information would increase the success rate of matches. These systems are useful in web
searches to find ‘like’ content and can be helpful in a search for software components. A
component that sums two numbers could also be found under add, math, etc., but the concept
of addition or plus would include all applicable terms. Another topic of research related to
LSA is probabilistic latent semantic analysis (PLSA). Using the same methodology as
LSA, PLSA adds a statistical model to help find matches and to
introduce new language that may fit a user’s need better (Hofmann, 1999).
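The idea of matching a query to documents as points in a term space can be sketched as follows. This is only the plain term-document vector comparison that LSA starts from; it omits the singular value decomposition that LSA applies to the matrix, and the class name VectorSpaceMatch and its methods are hypothetical.

```java
// Hypothetical sketch: each vector holds one weight per indexed term.
final class VectorSpaceMatch {

    /** Cosine of the angle between two term vectors; 0 if either vector is all zeros. */
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Index of the document vector closest to the query, or -1 if there are no documents. */
    static int closestDocument(double[] query, double[][] documents) {
        int best = -1;
        double bestScore = -1.0;
        for (int d = 0; d < documents.length; d++) {
            double score = cosine(query, documents[d]);
            if (score > bestScore) {
                bestScore = score;
                best = d;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Term order assumed for illustration: sum, add, print
        double[][] documents = {{1, 1, 0}, {0, 0, 1}};
        double[] query = {1, 0, 0};
        System.out.println(closestDocument(query, documents));
    }
}
```

In this toy layout, a document indexed under both sum and add is still the nearest point to a query that mentions only sum, which is the behaviour the paragraph above describes.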
Research Question 3
Research Question 3 asks whether using a fuzzy logic approach to searching for software
components reduces the number of query words needed to find an appropriate match. The idea
of reducing or eliminating the need for multiple queries, or for including multiple synonymous words
in a search query to ensure all relevant data is returned, is another active topic in information
retrieval. This study did not look at the structure of the query or the number of words needed
to return all relevant files, but there is concurrent research in such areas. Sieg, Mobasher
and Burke (2004) studied a web-based search algorithm that incorporates the
user’s profile information and, based on a concept hierarchy, adds certain
keywords without requiring the user to enter them explicitly in the search. This returned a better
search result list in their experiment. Because the approach is web based and relies on user profiles,
applying it to software components would require the search to tie into the software IDE running at the
time. This might allow the search to identify exactly which software component the user is working
on and then fill in other keywords that the user may not realize are important in order to
return all relevant documents.
Another topic concerning queries is the prediction of query difficulty. The difficulty of a
query is how hard it is for a system to find matches, or how difficult it is to agree on the most
relevant document for different versions of a term; it is estimated by comparing the results of the
query entered by the user with those of sub-queries run in the background. Using MAP or precision at 10
calculations, Yom-Tov, Fine, Carmel and Darlow (2005) devised a learning algorithm that uses the
most effective query words, whether entered by the user or created by the system in a sub-query,
to run a more effective query on a system. This learning query system is a useful addition
for systems where query synonyms are prevalent and other contextual information is not
available (Yom-Tov, Fine, Carmel & Darlow, 2005). Difficult queries can also be reported to system
administrators so that they can alter systems to create more tags, more identifiable
information, etc. (Yom-Tov, Fine, Carmel & Darlow, 2005). This type of system would be a
valuable addition to a software-searching IR system. Different programming languages use different
key terms for different items, and if the system can automatically fill in and run in the background a
search for similar terms, the user’s results could improve. The issue arises when the user knows exactly
what they want to find and the system thinks otherwise. In this case the query expansion option
would need to be shut off.
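A query expansion option that can be shut off might look like the following minimal sketch. It is not the learning algorithm of Yom-Tov et al.; it only illustrates a synonym lookup guarded by a switch. The class name QueryExpander, the synonym table, and the flag expansionEnabled are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of background query expansion with an off switch.
final class QueryExpander {

    /**
     * Returns the user's terms, followed by any synonyms from the table when
     * expansion is enabled. Order is preserved and duplicates are dropped.
     */
    static List<String> expand(List<String> terms,
                               Map<String, List<String>> synonyms,
                               boolean expansionEnabled) {
        Set<String> expanded = new LinkedHashSet<>(terms);
        if (expansionEnabled) {
            for (String term : terms) {
                expanded.addAll(synonyms.getOrDefault(term, Collections.<String>emptyList()));
            }
        }
        return new ArrayList<>(expanded);
    }

    public static void main(String[] args) {
        Map<String, List<String>> synonyms = new HashMap<>();
        synonyms.put("sum", Arrays.asList("add", "plus"));
        System.out.println(expand(Arrays.asList("sum"), synonyms, true));
        System.out.println(expand(Arrays.asList("sum"), synonyms, false));
    }
}
```

With the flag set to false the user's query is passed through unchanged, which covers the case where the user knows exactly what they want to find.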
Future Research
There are many other areas for future research in information retrieval.
Applying the algorithms in this study to a real-life application is one that we should look
at next, along with how we can improve any search for information. This study did not consider
execution time or efficiency, but that is something that this researcher would like to
investigate. Because the code for this study was not written in the most efficient manner, the next step
would be to rewrite parts of it to make it more efficient for the computer and the user. Most of the
literature discussed in Chapter 2 includes the element of time when discussing the success or
failure of a new search algorithm. Time was not examined in this study because of the lack of
mainframe and real-world testing ability, but it is the next step for this study.
A measure of how many times a returned piece of software gets reused would also be a valuable
addition to future research. The reusability of a code component can only be judged by the
programmers who actually use components; therefore a panel of software engineers would be needed to
verify the reusability of a piece of software. It was shown in Chapter 2 that simply being able to
find software components easily helps increase their reuse, so the search methods in this
study, which return a better match to a user’s query, should increase the reuse of software.
The weighting and ranking in an information retrieval system is another area that can be
improved with future research but was not a focus of this study. How a system ranks
documents that match a query can be based on a similarity score, the document’s term frequency,
or the document name. The basic concept of a ranking system is to assign a score to each
document that matches the user’s query, based on how well that document matches.
The list of documents is then displayed to the user in order of these
ranked scores. This was done in this research using the fuzzy similarity measures compared to
the Boolean score. Usunier, Buffoni and Gallinari (2009) say that only a few top documents that
match the query are really relevant to the user and that those documents should have the highest
precision and be at the top of the list, which they usually are not, so they devised a new ranking
based on the Ordered Weighted Average (OWA). By looking at the number of relevant
documents retrieved as a pairwise function with the number of irrelevant documents
retrieved, they can determine which search’s similarity returns a better fit to the user’s query
(Usunier, Buffoni & Gallinari, 2009). This new ranking would help return the most
precise documents first. In this study the most relevant documents were returned but not
necessarily in the best order, so testing the OWA approach on this study’s data may produce a better
ranked list of returned documents.
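The OWA operator itself is simple to state: the input values are sorted in descending order and then combined with a fixed weight vector, so the weights attach to positions rather than to particular inputs. The sketch below shows only that operator, not the ranking algorithm of Usunier, Buffoni and Gallinari; the class name Owa and the example weights are assumptions for illustration.

```java
import java.util.Arrays;

// Hypothetical sketch of the Ordered Weighted Average operator.
final class Owa {

    /**
     * Sorts the values in descending order and returns the weighted sum,
     * where weights[0] applies to the largest value, weights[1] to the next, and so on.
     */
    static double aggregate(double[] values, double[] weights) {
        if (values.length != weights.length) {
            throw new IllegalArgumentException("values and weights must have the same length");
        }
        double[] sorted = values.clone();
        Arrays.sort(sorted); // ascending; read from the end for descending order
        double result = 0.0;
        for (int i = 0; i < weights.length; i++) {
            result += weights[i] * sorted[sorted.length - 1 - i];
        }
        return result;
    }

    public static void main(String[] args) {
        double[] scores = {0.2, 0.9, 0.5};
        double[] weights = {0.5, 0.3, 0.2}; // assumed weights that emphasise the top positions
        System.out.println(aggregate(scores, weights));
    }
}
```

Putting all the weight on the first position makes the operator behave like a maximum, and putting it on the last position makes it behave like a minimum, which is why decreasing weights can be used to emphasise the top of a ranked list.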
One more area for future research that came up during this study is the idea of relevance.
Relevance in an information retrieval system can be viewed from three perspectives: user-centered,
system-centered, or cognitive (Ingwersen, 2001). Deciding what is relevant to a user is difficult
and has forced most IR systems to create a network or mapping of information to related terms.
This matrix of information can differ per user and per topic, which is an area of study that was
not a focus of this research. These types of data maps require extensive background
information that is usually gathered over time and through experience. The
human brain usually gathers knowledge about a topic over the course of years; to gather that
kind of information, systems would need to be able to grow without a fixed limit on space
while connecting like terms. Although this is more of a data store topic, it will change how
users retrieve from and even query a system. If users understand how the data is stored, they are
better equipped to write a more intuitive query. Data stores are another area of future research that
is of interest. How a system stores and connects data will greatly define how the data is
searched, how fast the data is searched, and the success of finding a match. Crestani and Lalmas (2001)
look at logic in an IR system and treat the relevance of a document as true if the document
meets a user’s needs; if not, the document is irrelevant. This logical approach to
relevance is the basis for their logical IR system, and, so that it is not simply a straight Boolean IR
system, they incorporate other theories to devise a new IR system that they compare to
traditional IR systems.
Conclusion
There are many applications where an IR system can be effective, and finding the most accurate
and most effective IR method can benefit many. As shown in this chapter,
the fuzzy logic methods were an improvement over the Boolean search results. Because of this,
more research can be done to find the most efficient system. There are many routes by which
this research can be expanded, from query improvement to semantic vector space analysis, but
the results indicate that fuzzy retrieval was a success over the Boolean model.
References
Addagada, S. (2007). Indexing and Searching Document Collections using Lucene. University of
New Orleans Theses and Dissertations. Paper 1070. Retrieved from
http://scholarworks.uno.edu/cgi/viewcontent.cgi?article=2051&context=td.
Agresti, W. (2011). Software Reuse: Developers’ Experiences and Perceptions. Journal of
Software Engineering and Applications. Doi: 10.4236. pp. 48-58.
Aziz, M., North, S. (2007). Retrieving Software Component using Clone Detection and Program
Slicing. The University of Sheffield.
Baeza-Yates, R., Ribeiro-Neto, B. (2011). Modern Information Retrieval the concepts and
technology behind search. Second edition. Addison Wesley; Harlow, England.
Barringer, H., Cheng, J. and Jones, C. (1984). A logic covering undefinedness in program proofs.
Acta Informatica, 21(3). Pp. 251-269.
Belkin, N. J., & Croft, W. B. (1992). Information filtering and information retrieval: two sides of
the same coin?. Communications of the ACM, 35(12), 29-38.
Binkley, D., & Lawrie, D. (2008). Applications of information retrieval to software
development. Encyclopedia of Software Engineering (P. Laplante, ed.),(to appear).
Bordogna, G., Carrara, G., Pasi, G. (1992). Extending Boolean Information Retrieval: A Fuzzy
Model Based on Linguistic Variables. In Fuzzy Systems, 1992 IEEE International
Conference on (pp. 769-776). IEEE.
Bordogna, G., Pasi, G. (1993). A Fuzzy Linguistic Approach Generalizing Boolean Information
Retrieval: A Model and Its Evaluation. JASIS, 44(2), 70-82.
Bookstein, A. (1980). Fuzzy requests: an approach to weighted Boolean searches. Journal of the
American Society for information Science, 31(4), 240-247.
Burton, B. A., Aragon, R. W., Bailey, S. A., Koehler, K. D., & Mayes, L. A. (1988). The
reusable software library. In Software reuse: emerging technology (pp. 129-137). IEEE
Computer Society Press.
Buell, D. A., & Kraft, D. H. (1981). A model for a weighted retrieval system. Journal of the
American Society for Information Science, 32(3), 211-216.
Carpenter, B., Morris, M., & Baldwin, B. (2011). Lucene Version 3.0 Tutorial. Draft of: March,
31.
Chau, R., Yeh, C. (2004). Fuzzy Conceptual Indexing for Concept-Based
Cross-Lingual Text Retrieval. IEEE Internet Computing. 2004. Pgs. 14-21.
Chowdhury, C. R., & Bhuyan, P. (2010, July). Information retrieval using fuzzy c-means
clustering and modified vector space model. In Computer Science and Information
Technology (ICCSIT), 2010 3rd IEEE International Conference on Vol. (1), pp. 696-700.
IEEE.
Crestani, F., & Lalmas, M. (2001). Logic and uncertainty in information retrieval. In Lectures on
information retrieval (pp. 179-206). Springer Berlin Heidelberg.
Croft, W., Metzler, D. & Strohman, T. (2010). Search Engines Information Retrieval in
Practice. Pearson Education; Boston, MA.
Deerwester, S., Dumais, S., Furnas, G., Landauer, T. & Harshman, R. (1990). Indexing by
Latent Semantic Analysis. Journal of the American Society for Information Science.
41(6). pp. 391-407.
Eckard, E., & Chappelier, J. C. (2007). Free Software for research in Information Retrieval and
Textual Clustering. Technical report, Ecole Polytechnique Federale de Lausanne.
Fox, E. A., & Sharan, S. (1986). A comparison of two methods for soft Boolean operator
interpretation in information retrieval.
Frakes, W.B., Baeza-Yates, R. (1992). Information Retrieval Data Structures & Algorithms.
Prentice Hall: Englewood Cliffs, New Jersey.
Frakes, W. B., & Pole, T. P. (1994). An empirical study of representation methods for reusable
software components. Software Engineering, IEEE Transactions on, 20(8), 617-630.
Gallardo-Valencia, R. E., & Elliott Sim, S. (2009, May). Internet-scale code search. In
Proceedings of the 2009 ICSE Workshop on Search-Driven Development-Users,
Infrastructure, Tools and Evaluation (pp. 49-52). IEEE Computer Society.
Gibb, F., McCartan, C., O’Donnell, R., Sweeney, N., & Leon, R. (2000). The integration of
information retrieval techniques within a software reuse environment. Journal of
Information Science, 26(4), 211-226.
Haefliger, S., Von Krogh, G., & Spaeth, S. (2008). Code reuse in open source software.
Management Science, 54(1), 180-193.
Henninger, S. (1994). Using Iterative Refinement to Find Reusable Software. Software,
IEEE, 11(5), 48-59.
Hofmann, T. (1999, July). Probabilistic latent semantic analysis. In Proceedings of the Fifteenth
conference on Uncertainty in artificial intelligence (pp. 289-296). Morgan Kaufmann
Publishers Inc.
Houhamdi, Z., & Ghoul, S. (2001). Classifying software for reusability. Mail of technical
and scientific knowledge, 41-47.
Huang, L., D. Milne, et al. (2012). Learning a concept-based document similarity measure.
Journal of the American Society for Information Science and Technology 63(8): 1593-
1608.
Ingwersen, P. (2001). Users in context. In Lectures on information retrieval (pp. 157-178).
Springer Berlin Heidelberg.
Isakowitz, T., Kauffman, R. (1996). Supporting Search for Reusable Software
Objects. IEEE Transactions on Software Engineering, 22(6): 407-422.
Keswani, R., Joshi, S., & Jatain, A. (2014, February). Software Reuse in Practice. In Advanced
Computing & Communication Technologies (ACCT), 2014 Fourth International
Conference on (pp. 159-162). IEEE.
Khalifa, H., Khayati, O., Ghezala, H. (2008) A Behavioral and Structural Components
Retrieval Technique for Software Reuse. Advanced Software Engineering and Its
Applications. IEEE, pp. 134- 137.
Klir, G. J., & Yuan, B. (1995). Fuzzy sets and fuzzy logic (pp. 487-499). New Jersey: Prentice
Hall.
Kokkoras, F., Ntonas, K., Kritikos, A., Kakarontzas, G., & Stamelos, I. (2012). Federated Search
for Open Source Software Reuse. In Software Engineering and Advanced Applications
(SEAA), September 2012 38th EUROMICRO Conference on (pp. 200-203). IEEE.
Kraft, D., Bordogna, G., Pasi, G. (1998) Information Retrieval Systems: Where is the Fuzz? In
Fuzzy Systems Proceedings, 1998. IEEE World Congress on Computational
Intelligence.
The 1998 IEEE International Conference on (Vol. 2, pp. 1367-1372). IEEE.
Krueger, C. (1992). Software reuse. ACM Comput. Surv. 24, 2 (June 1992), 131-183.
DOI=10.1145/130844.130856 http://doi.acm.org/10.1145/130844.130856.
Maarek, Y., Berry, D., & Kaiser, G. (1991). An Information Retrieval Approach For
Automatically Constructing Software Libraries. IEEE Transactions on Software
Engineering. Vol 17(8). P. 800-813.
Manning, C., Raghavan, P., & Schutze, H. (2008). Introduction to Information Retrieval.
Cambridge University Press.
Marri, M. R., Thummalapenta, S., & Xie, T. (2009, May). Improving software quality via code
searching and mining. In Proceedings of the 2009 ICSE Workshop on Search-Driven
Development-Users, Infrastructure, Tools and Evaluation (pp. 33-36). IEEE Computer
Society.
Maron, M. E., & Kuhns, J. L. (1960). On relevance, probabilistic indexing and information
retrieval. Journal of the ACM (JACM), 7(3), 216-244.
McIlroy, M. (1968) Mass produced software components. In Software Engineering; Report on
a conference by the NATO Science Committee (Garmisch, Germany) Oct. pp. 138-150.
Mili, H., Mili, F., Mili, A. (1995). Reusing Software: Issues and Research Directions.
IEEE Transactions on Software Engineering. 21(6): 528-562.
Miller, B. (1996). Fuzzy Logic. Electronics Now. May 1996. pp. 29-30, 56-60.
Miyamoto, S. (1990). Fuzzy Sets in Information Retrieval and Cluster Analysis. Kluwer
Academic Publishers. Dordrecht, Netherlands.
Mockus, A., (2007). Large-scale code reuse in open source software. International Workshop
on Emerging Trends in FLOSS Research and Development, 0:7, 2007.
Morisio, M., Ezran, M., & Tully, C. (2002). Success and failure factors in software reuse.
Software Engineering, IEEE Transactions on, 28(4), 340-357.
Nomoto, K., Kubo, T., Kosuge, Y. (1995). Fuzzy Thesaurus Generation Based on Cross-Index
Matrix for Case-Based Reasoning. IEEE. January, 1995. 4033-4038.
Pasi, G., Bordogna, G. (2013) The Role of Fuzzy Sets in Information Retrieval. On Fuzziness,
Vol 2, R. Seising et al. Springer –Verlag Berlin Heidelberg. P. 525-532.
Prieto-Diaz, R. (1991). Implementing Faceted Classification for Software Reuse.
Communications of the ACM. 34(5).
Prieto-Diaz, R., Freeman, P.. (1987). Classifying Software for Reusability. IEEE
Software. January, 1987. P. 6-16. ProQuest database.
Radecki, T. (1979). Fuzzy set theoretical approach to document retrieval. Information
Processing & Management, 15(5), 247-259.
Reiss, S. P. (2009, May). Semantics-based code search. In Software Engineering, 2009. ICSE
2009. IEEE 31st International Conference (pp. 243-253). IEEE.
Robertson, S. (2004). Understanding inverse document frequency: on theoretical arguments for
IDF. Journal of documentation, 60(5), 503-520.
Robertson, S., Zaragoza, H. & Taylor, M. (2004). Simple BM25 Extension to Multiple
Weighted Fields. ACM Conference on Information Knowledge 2004. Nov. 8-13.
Rothenberger, M. A., Dooley, K. J., Kulkarni, U. R., & Nada, N. (2003). Strategies for software
reuse: A principal component analysis of reuse practices. Software Engineering, IEEE
Transactions on, 29(9), 825-837.
Salton, G., Fox, E., Wu, H. (1983). Extended Boolean Information Retrieval. Communications
of the ACM. 26 (12). pp. 1022 – 1036.
Sandhu, P. S., Kaur, H., & Singh, A. (2009). Modeling of Reusability of Object Oriented
Software System. World Academy of Science, Engineering and Technology, 56(32).
Sandhu, P., Singh, H. (2007). Automatical Reusability Appraisal of Software Components using
Neuro-fuzzy Approach. World Academy of Science, Engineering and Technology. Vol. 8.
Sieg, A., Mobasher, B., & Burke, R. (2004). Inferring user’s information context from user
profiles and concept hierarchies. In Classification, Clustering, and Data Mining
Applications (pp. 563-573). Springer Berlin Heidelberg.
Sim, S. E., Clarke, C. L., & Holt, R. C. (1998, June). Archetypal source code searches: A survey
of software developers and maintainers. In Program Comprehension, 1998. IWPC'98.
Proceedings., 6th International Workshop on (pp. 180-187). IEEE.
Singhal, A. (2001). Modern information retrieval: A brief overview. IEEE Data Eng. Bull.,
24(4), 35-43.
Sojer, M., & Henkel, J. (2010). Code reuse in Open Source Software development: Quantitative
evidence, drivers, and impediments. Journal of the Association for Information Systems,
11(12), 868-901.
Srinivasan, P., Ruiz, M., Kraft, D., Chen, J. (2001). Vocabulary mining for information
retrieval: rough sets and fuzzy sets. Information Processing and Management Vol. (37)
pp. 15-38.
Smart, J. (2006). Lucene Tutorial. http://oak.cs.ucla.edu/cs144/projects/lucene/.
Swain, M., Anderson, J. A., Swain, N., & Korrapati, R. (2005, April). Study of information
retrieval using fuzzy queries. In SoutheastCon, 2005. Proceedings. IEEE (pp. 527-533).
IEEE.
Thummalapenta, S., (2011). Improving Software Productivity and Quality via Mining Source
Code. (Doctoral dissertation) UMI Dissertation Publishing:3442531.
Triantafyllos, G., Vassiliadis, S., Pechanek, G. (1994). A Fuzzy Information Retrieval System.
IEEE World Congress on Computational Intelligence., Proceedings of the Third IEEE
Conference on Fuzzy Systems. IEEE. (p. 150-155).
Turpin, A., & Scholer, F. (2006, August). User performance versus precision measures for
simple search tasks. In Proceedings of the 29th annual international ACM SIGIR
conference on Research and development in information retrieval (pp. 11-18). ACM.
Usunier, N., Buffoni, D., & Gallinari, P. (2009, June). Ranking with ordered weighted pairwise
classification. In Proceedings of the 26th annual international conference on machine
learning (pp. 1057-1064). ACM.
Verma, R., Sharma, B. (2013). Fuzzy Generalized Prioritized Weighted Average Operator and
its Application to Multiple Attribute Decision Making. International Journal of
Intelligent Systems, Vol. 00 (1-24).
Vishal, Subhash, C., Kunda, J. (2012). An Effective Retrieval Scheme for Software Component
Reuse. International Journal on Computer Science and Engineering. Vol 4(7). ISSN:
0975-3397.
Xu, J., Croft, B. (1996). Query Expansion Using Local and Global Document Analysis. As
presented at SIGIR 1996, Zurich, Switzerland, ACM.
Yao, H., Etzkorn, L., Virani, S. (2008). Automated Classification and Retrieval of
Reusable Software Components. Journal of the American Society for Information
Science and Technology, 59(4): 613-627.
Yom-Tov, E., Fine, S., Carmel, D., & Darlow, A. (2005, August). Learning to estimate query
difficulty: including applications to missing content detection and distributed information
retrieval. In Proceedings of the 28th annual international ACM SIGIR conference on
Research and development in information retrieval (pp. 512-519). ACM
Zadeh, L.A., (1994). Soft Computing and Fuzzy Logic. IEEE Software. November 1994. pp.
48-56.
Zhang, Q., Wu, Y., Ding, Z., Huang X. (2012) Learning Hash Codes for Efficient Content Reuse
Detection. School of Computer Science, As presented at SIGIR 2012, ACM.
Appendix A
This is the resulting list from each search, with the first 10 files returned for large searches.
The far right columns contain the experts’ lists of relevant files; the files in red text are the files that
both experts deemed relevant, and therefore those were the files deemed relevant for this search.
File names highlighted in yellow are those that match the experts’ relevant file list. Highlighted in
green are the files the experts deemed relevant but no search returned.
Columns: Term A, Bool Operator, Term B, Rank, MMM Doc, Paice Doc, P-Norm Doc, Boolean Doc, Expert 1 Relevant List, Expert 2 Rel. List
field AND float 1 whatis Whatis whatis grn.1 printf printf
2 printf.1 printf.1 printf.1 printf.1 sort sort
3 stat.1 stat.1 stat.1 seq.1 seq awk
4 sort.1 sort.1 sort.1 sort.1 perl
5 grn.1 grn.1 grn.1 stat.1
6 seq.1 seq.1 seq.1 tcsh.1
7 tcsh.1 tcsh.1 tcsh.1 whatis
copy AND File 1 pax.1 pax.1 pax.1 addftinfo.1 cp cp
2 whatis Whatis whatis addr2line.1 vi pax
3 cp.1 cp.1 cp.1 afmtodit.1 cpio
4 objcopy.1 objcopy.1 objcopy.1 as.1 tar
5 lex++.1 lex++.1 lex++.1 bsdtar.1
6 mail.1 mail.1 mail.1 c89.1
7 cpio.1 cpio.1 cpio.1 c99.1
8 install.1 install.1 install.1 chmod.1
9 sendbug.1 sendbug.1 sendbug.1 ci.1
10 vi.1 vi.1 vi.1 co.1
run OR execute 1 tcsh.1 tcsh.1 tcsh.1 addr2line.1 tcsh tsch
2 objcopy.1 objcopy.1 objcopy.1 afmtodit.1 make atq
3 strip.1 strip.1 strip.1 apply.1 clang++ gbb
4 whatis Whatis whatis as.1 bash
5 lex++.1 lex++.1 lex++.1 atq.1 zsh
6 make.1 make.1 make.1 atrm.1.gz
7 clang++.1 clang++.1 clang++.1 biff.1
8 gcov.1 gcov.1 gcov.1 brandelf.1
9 gdb.1 gdb.1 gdb.1 bsdtar.1
10 id.1 id.1 id.1 bsnmpd.1
column AND height 1 dialog.1 dialog.1 dialog.1 dialog.1 dialog dialog
2 eqn.1 less.1 eqn.1 eqn.1 less
3 less.1 mdocml.1 less.1 less.1
4 mdocml.1 eqn.1 mdocml.1 mdocml.1
5 units.1 units.1 units.1 units.1
constructor AND length 1 id.1 id.1 id.1 Id.1 id not relevant g++
addend AND sum 1 whatis Whatis whatis Whatis whatis not relev. NONE
math AND library 1 whatis bc.1 whatis bc.1 bc clang++
2 bc.1 Whatis bc.1 clang++ clang++ bc
3 clang++.1 clang++.1 clang++.1 Whatis
space AND gap 1 pr.1 pr.1 pr.1 id.1 tcsh pr
2 objcopy.1 objcopy.1 objcopy.1 objcopy.1 pr objcopy
3 tcsh.1 tcsh.1 tcsh.1 pr.1
4 id.1 id.1 id.1 tcsh.1
mergesort OR heapsort 1 sort.1 sort.1 sort.1 sort.1 sort.1 sort
2 whatis Whatis whatis Whatis
cut OR delete 1 cut.1 cut.1 cut.1 bsnmpwalk.1 cut cut
2 bsnmpwalk.1 bsnmpwalk.1 bsnmpwalk.1 colrm.1 paste awk
3 whatis whatis whatis cut.1 perl
4 file.1 file.1 file.1 file.1
5 gperf.1 gperf.1 gperf.1 gperf.1
6 last.1 last.1 last.1 last.1
7 idd.1 idd.1 idd.1 idd.1
8 paste.1 paste.1 paste.1 paste.1
9 stat.1 stat.1 stat.1 ssh.1
10 unxz.1 unxz.1 unxz.1 unxz.1
oldest AND command 1 tcsh.1 tcsh.1 tcsh.1 pgrep.1 tcsh tsch
2 pgrep.1 pgrep.1 pgrep.1 tcsh.1 bash
history
sort AND mergesort 1 sort.1 sort.1 sort.1 sort.1 sort sort
2 whatis whatis whatis Whatis
deploy AND run 1 clang++.1 clang++.1 clang++.1 atq.1 clang++.1 atq
2 atq.1 atq.1 atq.1 clang++.1 rsh ssh
find AND file 1 find.1 find.1 find.1 afmtodit.1 find find
2 whatis whatis whatis apropos.1 locate locate
3 lex++.1 lex++.1 lex++.1 as.1 less less
4 less.1 less.1 less.1 bsdgrep.1 vi
5 tcsh.1 tcsh.1 tcsh.1 bsdtar.1
6 cpio.1 cpio.1 cpio.1 bzip2.1
7 unxz.1 unxz.1 unxz.1 chflags.1
8 vi.1 vi.1 vi.1 ci.1
9 id.1 id.1 id.1 clang++.1
10 locate.1 locate.1 locate.1 cpio.1
list AND files 1 limits.1 limits.1 limits.1 limits.1 tcsh Sh
2 sh.1 sh.1 sh.1 sh.1 ls Tcsh
3 tcsh.1 tcsh.1 tcsh.1 tcsh.1 Ls Find
Locate
sum AND add 1 whatis whatis whatis id.1 nawk Nawk
2 id.1 unxz.1 id.1 nawk.1 expr Perl
3 unxz.1 id.1 unxz.1 ps.1
4 ps.1 nawk.1 ps.1 unxz.1
5 nawk.1 ps.1 nawk.1 Whatis
subtract AND math 1 expr.1 expr.1 expr.1 expr.1 expr expr nawk nawk bc perl
add AND subtract 1 objcopy.1 objcopy.1 objcopy.1 expr.1 expr objcopy
2 id.1 id.1 id.1 id.1 nawk bc
3 expr.1 expr.1 expr.1 objcopy.1 perl
nawk
printer AND network 1 whatis whatis whatis tcsh.1 telnet lp
2 telnet.1 telnet.1 telnet.1 telnet.1 lp pr
3 tcsh.1 tcsh.1 tcsh.1 whatis
mergesort AND print 1 sort.1 sort.1 sort.1 sort.1 sort sort
2 whatis whatis whatis whatis awk perl
Appendix B
Function that calculates the MMM, Paice, P-Norm and Boolean similarity, sorts the results, and prints
them to the console.
////////////////////////////////////////////////////////////////////////////////////////////////
//*** GET ALL THE IDFS, Calculate MMM, PAICE, PNORM and BOOLEAN ***/
////////////////////////////////////////////////////////////////////////////////////////////////
static Map<String, Float> getIdfs(IndexReader reader, String field, String TermA, String TermB, int BoolTerm,
BM25Similarity MySim) throws IOException
{
Float [][] MatchingListPaice = new Float [10000][2];
Float [][] MatchingListPNorm = new Float [10000][2];
Float [][] MatchingList = new Float [10000][2];
Float [][] MatchingListBoolean = new Float [10000][2];
Integer [][] StatsListA = new Integer [100000][4];
Integer [][] StatsListB = new Integer [100000][4];
int MAXA = 0;
int MAXB = 0;
int MAXfreqA = 0;
int MAXfreqB = 0;
Fields fields = MultiFields.getFields(reader); //Get the Fields of the index
Bits liveDocs = MultiFields.getLiveDocs(reader);
//HERE, getting all the data but dynamically, need to search through the data for match of terms and then
calculate //values and display results
93
USING FUZZY SETS FOR RETRIEVAL OF SOFTWARE FOR REUSE
int b = 0;
int x = 0;
int k = 0;
int TOTAL_NUM_MATCHING_DOCS = 0;
int TOTAL_NUM_DOCS = reader.numDocs();
for (String field2 : fields) {
TermsEnum termEnum = MultiFields.getTerms(reader, field2).iterator(null);
Terms newTermEnum = MultiFields.getTerms(reader, "contents");
BytesRef bytesRef;
while ((bytesRef = termEnum.next()) != null) {
if (termEnum.seekExact(bytesRef, true)) {
if (bytesRef.utf8ToString().equalsIgnoreCase(TermA) ||(bytesRef.utf8ToString().equalsIgnoreCase(TermB))){
DocsEnum docsEnum = termEnum.docs(liveDocs, null);
if (docsEnum != null) {
int doc;
while ((doc = docsEnum.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS)) {
if (bytesRef.utf8ToString().equalsIgnoreCase(TermA)) {
Term termInstanceA = new Term("contents", TermA);
long termFreqA = reader.totalTermFreq(termInstanceA);
long docCountA = reader.docFreq(termInstanceA);
StatsListA[k][0] = b;
StatsListA[k][1] = doc;
StatsListA[k][2] = docsEnum.freq();
StatsListA[k][3] = (int) docCountA;
k++;
}
else {
Term termInstanceB = new Term("contents", TermB);
long docCountB = reader.docFreq(termInstanceB);
StatsListB[x][0] = b;
StatsListB[x][1] = doc;
StatsListB[x][2] = docsEnum.freq();
StatsListB[x][3] = (int) docCountB;
x++;
}
MAXA = k;
MAXB = x;
}
}
}
}
}
}
/////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
////getting documents, term frequency, calculating similarity
////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
float MAX1 = 0;
float MAX2 = 0;
float MAX3 = 0;
float MAX4 = 0;
long temp1 = 0;
long temp2 = 0;
int LAST_CHECK = 0;
int AValue = 0;
int BValue = 0;
float ScoreCalcA;
float ScoreCalcB;
int numDocs = 0;
if (MAXA <= MAXB) {
for (int r = 0; r < MAXA; r++) {
for (int t = 0; t < MAXB; t++) {
AValue = StatsListA[r][1];
BValue = StatsListB[t][1];
//If the document ID match, there is a matching file
if (BValue == AValue) { //compare document numbers, if they match, and the query is an AND go here
TOTAL_NUM_MATCHING_DOCS ++;
MatchingList[numDocs][0] = (float)StatsListA[r][1];
MatchingListPaice[numDocs][0] = (float)StatsListA[r][1];
MatchingListPNorm[numDocs][0] = (float)StatsListA[r][1];
MatchingListBoolean[numDocs][0] = (float) StatsListA[r][1];
MatchingListBoolean[numDocs][1] = (float) 1;
ScoreCalcA = StatsListA[r][2];
ScoreCalcB = StatsListB[t][2];
temp1 = fileList[AValue];
temp2 = fileList[BValue];
if (ScoreCalcA > 0) {
MAX1 = (float) Math.log(ScoreCalcA)+ 1;// gets the tF
}
else {
MAX1 = 1; // gets the tF
}
if (ScoreCalcB > 0) {
MAX2 = (float) Math.log(ScoreCalcB) + 1;
}
else {
MAX2 = 1; // / StatsListB[t][3]);
}
MAX3 = (float) (Math.log((TOTAL_NUM_DOCS+1) / (MAXA+1))); //gets the IDF
MAX4 = (float) (Math.log((TOTAL_NUM_DOCS+1) / (MAXB+1)));
ScoreCalcA = (float) MAX1 * MAX3;
ScoreCalcB = (float) MAX2 * MAX4;
if (BoolTerm == 1) { //OR
MatchingList[numDocs][1] = (float) (.7 * Math.min(ScoreCalcA, ScoreCalcB)+ .3 *
Math.max(ScoreCalcA, ScoreCalcB));
MatchingListPNorm[numDocs][1] = (float) Math.sqrt(( Math.pow(1,2)*
Math.pow(ScoreCalcA, 2) + Math.pow(1,2)* Math.pow(ScoreCalcB, 2))/
(Math.pow(1,2) + Math.pow(1,2)));
//sort in descending order for Paice OR
if (ScoreCalcA >= ScoreCalcB) {
MatchingListPaice[numDocs][1] = (float) ((float) Math.pow(.7, 0) * ScoreCalcA +
Math.pow(.7, 1) * ScoreCalcB /(Math.pow(.7, 0) + Math.pow(.7, 1)));
}
else {
MatchingListPaice[numDocs][1] = (float) ((float) Math.pow(.7, 0) * ScoreCalcB +
Math.pow(.7, 1) * ScoreCalcA /(Math.pow(.7, 0) + Math.pow(.7, 1)));
}
}
else { //AND
MatchingList[numDocs][1] = (float) (.4 * Math.min(ScoreCalcA, ScoreCalcB) + .6 *
Math.max(ScoreCalcA, ScoreCalcB));
MatchingListPNorm[numDocs][1] = Math.abs((float) (1 - Math.sqrt(( Math.pow(1,2)*
Math.pow(1 - ScoreCalcA, 2) + Math.pow(1,2)* Math.pow(1 - ScoreCalcB, 2))/
(Math.pow(1,2) + Math.pow(1,2)))));
if (ScoreCalcA <= ScoreCalcB) {
MatchingListPaice[numDocs][1] = (float) ((float) Math.pow(1, 0) * ScoreCalcA +
Math.pow(1, 1) * ScoreCalcB /(Math.pow(1, 0) + Math.pow(1, 1)));
}
else {
MatchingListPaice[numDocs][1] = (float) ((float) Math.pow(1, 0) * ScoreCalcB + Math.pow(1, 1)
* ScoreCalcA /(Math.pow(1, 0) + Math.pow(1, 1)));
}
}
numDocs++;
LAST_CHECK = AValue;
} //End if of if A==B
//if the A list is exhausted but documents remain in B, process those; likewise if B is exhausted and A still has files
else if ((BValue < AValue && BoolTerm == 1 && BValue > LAST_CHECK) || (BoolTerm == 1 &&
AValue < BValue && (r == (MAXA -1)))) {
TOTAL_NUM_MATCHING_DOCS ++;
MatchingList[numDocs][0] = (float)StatsListB[t][1];
MatchingListPaice[numDocs][0] = (float)StatsListB[t][1];
MatchingListPNorm[numDocs][0] = (float)StatsListB[t][1];
MatchingListBoolean[numDocs][0] = (float) StatsListB[t][1];
MatchingListBoolean[numDocs][1] = (float) 1;
ScoreCalcA = 0;
ScoreCalcB = StatsListB[t][2];
temp1 = 0;
temp2 = fileList[BValue];
if (ScoreCalcA > 0) {
MAX1 = (float) Math.log(ScoreCalcA) + 1;// gets the tF
}
else {
MAX1 = 0;
}
if (ScoreCalcB > 0) {
MAX2 = (float) Math.log(ScoreCalcB ) + 1;
}
else {
MAX2 = 1;
}
MAX3 = (float) Math.log((TOTAL_NUM_DOCS+1) / (MAXA+1));//gets the IDF
MAX4 = (float) Math.log((TOTAL_NUM_DOCS+1) / (MAXB+1));///
ScoreCalcA = (float) MAX1 * MAX3;
ScoreCalcB = (float) MAX2 * MAX4;
LAST_CHECK = StatsListB[t][1];
if (BoolTerm == 1) { //OR
MatchingList[numDocs][1] = (float) (.7 * Math.min(ScoreCalcA, ScoreCalcB) + .3 *
Math.max(ScoreCalcA, ScoreCalcB));
MatchingListPNorm[numDocs][1] = Math.abs((float) Math.sqrt(( Math.pow(1,2)*
Math.pow(ScoreCalcA, 2) + Math.pow(1,2)* Math.pow(ScoreCalcB, 2))/ (Math.pow(1,2) +
Math.pow(1,2))));
//sort in descending order for Paice OR
if (ScoreCalcA >= ScoreCalcB) {
MatchingListPaice[numDocs][1] = (float) ((float) Math.pow(.7, 0) * ScoreCalcA +
Math.pow(.7, 1) * ScoreCalcB /(Math.pow(.7, 0) + Math.pow(.7, 1)));
}
else {
MatchingListPaice[numDocs][1] = (float) ((float) Math.pow(.7, 0) * ScoreCalcB + Math.pow(.7, 1) * ScoreCalcA
/(Math.pow(.7, 0) + Math.pow(.7, 1)));
}
}
else { //AND
MatchingList[numDocs][1] = (float) (.4 * Math.min(ScoreCalcA, ScoreCalcB) + .6 *
Math.max(ScoreCalcA, ScoreCalcB));
MatchingListPNorm[numDocs][1] = Math.abs((float) (1 - Math.sqrt(( Math.pow(1,2)* Math.pow(1 -
ScoreCalcA, 2) + Math.pow(1,2)* Math.pow(1 - ScoreCalcB, 2))/ (Math.pow(1,2) +
Math.pow(1,2)))));
if (ScoreCalcA <= ScoreCalcB) {
MatchingListPaice[numDocs][1] = (float) ((float) Math.pow(1, 0) * ScoreCalcA + Math.pow(1, 1)
* ScoreCalcB /(Math.pow(1, 0) + Math.pow(1, 1)));
}
else {
MatchingListPaice[numDocs][1] = (float) ((float) Math.pow(1, 0) * ScoreCalcB + Math.pow(1, 1) *
ScoreCalcA /(Math.pow(1, 0) + Math.pow(1, 1)));
}
}
numDocs++;
}
}
}
}
else if (MAXB < MAXA) { //should come here if the number of docs in A is greater than number of docs in B
for (int r = 0; r < MAXB; r++) {
for (int t = 0; t < MAXA; t++){
AValue = StatsListA[t][1];
BValue = StatsListB[r][1];
if (AValue == BValue) {
TOTAL_NUM_MATCHING_DOCS ++;
MatchingList[numDocs][0] = (float)StatsListA[t][1];
MatchingListPaice[numDocs][0] = (float)StatsListA[t][1];
MatchingListPNorm[numDocs][0] = (float)StatsListA[t][1];
MatchingListBoolean[numDocs][0] = (float) StatsListA[t][1];
MatchingListBoolean[numDocs][1] = (float) 1;
ScoreCalcA = StatsListA[t][2];
ScoreCalcB = StatsListB[r][2];
if (ScoreCalcA > 0) {
MAX1 = (float) Math.log(ScoreCalcA) + 1; // gets the tF
}
else {
MAX1 = 1;
}
if (ScoreCalcB > 0) {
MAX2 = (float) Math.log(ScoreCalcB ) + 1;
}
else {
MAX2 = 1;
}
MAX3 = (float) ((float) Math.log((TOTAL_NUM_DOCS +1) / (MAXA +1) ));
MAX4 = (float) Math.log((TOTAL_NUM_DOCS + 1) / (MAXB+1));
ScoreCalcA = (float) MAX1*MAX3;
ScoreCalcB = (float) MAX2 * MAX4;
if (BoolTerm == 1) { //OR
MatchingList[numDocs][1] = (float) (.7 * Math.min(ScoreCalcA, ScoreCalcB) + .3 *
Math.max(ScoreCalcA, ScoreCalcB));
MatchingListPNorm[numDocs][1] = (float) Math.abs(Math.sqrt(( Math.pow(1,2)* Math.pow(ScoreCalcA,
2) + Math.pow(1,2)* Math.pow(ScoreCalcB, 2))/ (Math.pow(1,2) + Math.pow(1,2))));
if (ScoreCalcA >= ScoreCalcB) {
MatchingListPaice[numDocs][1] = (float) ((float) Math.pow(.7, 0) * ScoreCalcA + Math.pow(.7, 1) *
ScoreCalcB /(Math.pow(.7, 0) + Math.pow(.7, 1)));
}
else {
MatchingListPaice[numDocs][1] = (float) ((float) Math.pow(.7, 0) * ScoreCalcB + Math.pow(.7, 1) *
ScoreCalcA /(Math.pow(.7, 0) + Math.pow(.7, 1)));
}
}
else { //AND
MatchingList[numDocs][1] = (float) (.4 * Math.min(ScoreCalcA, ScoreCalcB) + .6 *
Math.max(ScoreCalcA, ScoreCalcB));
MatchingListPNorm[numDocs][1] = (float) Math.abs((1 - Math.sqrt(( Math.pow(1,2)* Math.pow(1 -
ScoreCalcA, 2) + Math.pow(1,2)* Math.pow(1 - ScoreCalcB, 2))/ (Math.pow(1,2) +
Math.pow(1,2)))));
if (ScoreCalcA <= ScoreCalcB) {
MatchingListPaice[numDocs][1] = (float) ((float) Math.pow(1, 0) * ScoreCalcA + Math.pow(1, 1) *
ScoreCalcB /(Math.pow(1, 0) + Math.pow(1, 1)));
}
else {
MatchingListPaice[numDocs][1] = (float) ((float) Math.pow(1, 0) * ScoreCalcB + Math.pow(1, 1) *
ScoreCalcA /(Math.pow(1, 0) + Math.pow(1, 1)));
}
}
numDocs++;
LAST_CHECK = BValue;
} //End if AValue == BValue
else if ((AValue < BValue && BoolTerm == 1 && AValue > LAST_CHECK) || (BoolTerm == 1 && BValue <
AValue && r == (MAXB-1))) {
TOTAL_NUM_MATCHING_DOCS ++;
MatchingList[numDocs][0] = (float)StatsListA[t][1];
MatchingListPaice[numDocs][0] = (float)StatsListA[t][1];
MatchingListPNorm[numDocs][0] = (float)StatsListA[t][1];
MatchingListBoolean[numDocs][0] = (float) StatsListA[t][1];
MatchingListBoolean[numDocs][1] = (float) 1;
ScoreCalcA = StatsListA[t][2];
ScoreCalcB = 0;
if (ScoreCalcA > 0) {
MAX1 = (float) Math.log(ScoreCalcA) + 1; // gets the tF
}
else {
MAX1 = 1;
}
if (ScoreCalcB > 0) {
MAX2 = (float) Math.log(ScoreCalcB ) + 1; ///(fileList[BValue]); // / StatsListB[t][3]);
}
else {
MAX2 = 0;
}
MAX3 = (float)Math.log( (TOTAL_NUM_DOCS + 1) / (MAXA + 1));
MAX4 = (float) Math.log((TOTAL_NUM_DOCS + 1) / (MAXB + 1));
ScoreCalcA = (float) MAX1*MAX3;
ScoreCalcB = (float) MAX2 * MAX4;
LAST_CHECK = StatsListA[t][1];
if (BoolTerm == 1) { //OR
MatchingList[numDocs][1] = (float) (.7 * Math.min(ScoreCalcA, ScoreCalcB) + .3 * Math.max(ScoreCalcA,
ScoreCalcB));
MatchingListPNorm[numDocs][1] = Math.abs((float) Math.sqrt(( Math.pow(1,2)* Math.pow(ScoreCalcA, 2) +
Math.pow(1,2)* Math.pow(ScoreCalcB, 2))/ (Math.pow(1,2) + Math.pow(1,2))));
if (ScoreCalcA >= ScoreCalcB) {
MatchingListPaice[numDocs][1] = (float) ((float) Math.pow(.7, 0) * ScoreCalcA + Math.pow(.7,
1) * ScoreCalcB /(Math.pow(.7, 0) + Math.pow(.7, 1)));
}
else {
MatchingListPaice[numDocs][1] = (float) ((float) Math.pow(.7, 0) * ScoreCalcB + Math.pow(.7,
1) * ScoreCalcA /(Math.pow(.7, 0) + Math.pow(.7, 1)));
}
}
else{ //AND
MatchingList[numDocs][1] = (float) (.4 * Math.min(ScoreCalcA, ScoreCalcB) + .6 *
Math.max(ScoreCalcA, ScoreCalcB));
MatchingListPNorm[numDocs][1] = (float) (1 - Math.sqrt(( Math.pow(1,2)* Math.pow(1 - ScoreCalcA, 2)
+ Math.pow(1,2)* Math.pow(1 - ScoreCalcB, 2))/ (Math.pow(1,2) + Math.pow(1,2))));
if (ScoreCalcA <= ScoreCalcB) {
MatchingListPaice[numDocs][1] = (float) ((float) Math.pow(1, 0) * ScoreCalcA + Math.pow(1, 1) *
ScoreCalcB /(Math.pow(1, 0) + Math.pow(1, 1)));
}
else {
MatchingListPaice[numDocs][1] = (float) ((float) Math.pow(1, 0) * ScoreCalcB + Math.pow(1, 1) *
ScoreCalcA /(Math.pow(1, 0) + Math.pow(1, 1)));
}
}
numDocs++;
}
}
}
}
//if one term has 0 matched files and it's an OR query, loop through the other term's matching files to score them
if (MAXA == 0 && MAXB > 0 && BoolTerm == 1) {
for (int t=0; t< MAXB; t++) {
TOTAL_NUM_MATCHING_DOCS ++;
AValue = 0;
BValue = StatsListB[t][1];
MatchingList[numDocs][0] = (float)StatsListB[t][1];
MatchingListPaice[numDocs][0] = (float)StatsListB[t][1];
MatchingListPNorm[numDocs][0] = (float)StatsListB[t][1];
MatchingListBoolean[numDocs][0] = (float) StatsListB[t][1];
MatchingListBoolean[numDocs][1] = (float) 1;
ScoreCalcA = 0;
ScoreCalcB = StatsListB[t][2];
if (ScoreCalcA > 0) {
MAX1 = (float) Math.log(ScoreCalcA) + 1 ; // gets the tF
}
else {
MAX1 = 0;
}
if (ScoreCalcB > 0) {
MAX2 = (float) Math.log(ScoreCalcB ) + 1; ///(fileList[BValue]); // / StatsListB[t][3]);
}
else {
MAX2 = 1;
}
MAX3 = (float) Math.log(1) ; //gets the IDF
MAX4 = (float) Math.log((TOTAL_NUM_DOCS + 1) / (MAXB + 1));
ScoreCalcA = (float) 0;
ScoreCalcB = (float) MAX2 * MAX4;
LAST_CHECK = StatsListB[t][1];
if (BoolTerm == 1) { //OR
MatchingList[numDocs][1] = (float) (.7 * Math.min(ScoreCalcA, ScoreCalcB) + .3 *
Math.max(ScoreCalcA, ScoreCalcB));
MatchingListPNorm[numDocs][1] = Math.abs((float) Math.sqrt(( Math.pow(1,2)* Math.pow(ScoreCalcA,
2) + Math.pow(1,2)* Math.pow(ScoreCalcB, 2))/ (Math.pow(1,2) + Math.pow(1,2))));
//sort in descending order for Paice OR
if (ScoreCalcA >= ScoreCalcB) {
MatchingListPaice[numDocs][1] = (float) ((float) Math.pow(.7, 0) * ScoreCalcA + Math.pow(.7, 1) *
ScoreCalcB /(Math.pow(.7, 0) + Math.pow(.7, 1)));
}
else {
MatchingListPaice[numDocs][1] = (float) ((float) Math.pow(.7, 0) * ScoreCalcB + Math.pow(.7, 1) *
ScoreCalcA /(Math.pow(.7, 0) + Math.pow(.7, 1)));
}
}
else //AND
{
MatchingList[numDocs][1] = (float) (.4 * Math.min(ScoreCalcA, ScoreCalcB) + .6 * Math.max(ScoreCalcA,
ScoreCalcB));
MatchingListPNorm[numDocs][1] = (float) (1 - Math.sqrt(( Math.pow(1,2)* Math.pow(1 - ScoreCalcA, 2) +
Math.pow(1,2)* Math.pow(1 - ScoreCalcB, 2))/ (Math.pow(1,2) + Math.pow(1,2))));
if (ScoreCalcA <= ScoreCalcB) {
MatchingListPaice[numDocs][1] = (float) ((float) Math.pow(1, 0) * ScoreCalcA + Math.pow(1, 1) * ScoreCalcB
/(Math.pow(1, 0) + Math.pow(1, 1)));
}
else {
MatchingListPaice[numDocs][1] = (float) ((float) Math.pow(1, 0) * ScoreCalcB + Math.pow(1, 1) * ScoreCalcA
/(Math.pow(1, 0) + Math.pow(1, 1)));
}
}
numDocs++;
}
}
//if one term has 0 matched files and it's an OR query, loop through the other term's matching files to score them
else if (MAXB == 0 && MAXA > 0 && BoolTerm == 1) {
for (int t =0; t < MAXA; t++){
AValue = StatsListA[t][1];
BValue = 0;
TOTAL_NUM_MATCHING_DOCS ++;
MatchingList[numDocs][0] = (float)StatsListA[t][1];
MatchingListPaice[numDocs][0] = (float)StatsListA[t][1];
MatchingListPNorm[numDocs][0] = (float)StatsListA[t][1];
MatchingListBoolean[numDocs][0] = (float) StatsListA[t][1];
MatchingListBoolean[numDocs][1] = (float) 1;
ScoreCalcA = StatsListA[t][2];
ScoreCalcB = 0;
if (ScoreCalcA > 0){
MAX1 = (float) Math.log(ScoreCalcA) + 1;// gets the tF
}
else {
MAX1 = 1;
}
if (ScoreCalcB > 0) {
MAX2 = (float) Math.log(ScoreCalcB) + 1;
}
else
{
MAX2 = 0;
}
MAX3 = (float)Math.log((TOTAL_NUM_DOCS + 1) / (MAXA + 1));
MAX4 = (float) Math.log(1) ;
ScoreCalcA = (float) MAX1*MAX3;
ScoreCalcB = (float) 0;
LAST_CHECK = StatsListA[t][1];
if (BoolTerm == 1) { //OR
MatchingList[numDocs][1] = (float) (.7 * Math.min(ScoreCalcA, ScoreCalcB) + .3 * Math.max(ScoreCalcA,
ScoreCalcB));
MatchingListPNorm[numDocs][1] = (float) Math.sqrt(( Math.pow(1,2)* Math.pow(ScoreCalcA, 2) +
Math.pow(1,2)* Math.pow(ScoreCalcB, 2))/ (Math.pow(1,2) + Math.pow(1,2)));
if (ScoreCalcA >= ScoreCalcB){
MatchingListPaice[numDocs][1] = (float) ((float) Math.pow(.7, 0) * ScoreCalcA + Math.pow(.7, 1) *
ScoreCalcB /(Math.pow(.7, 0) + Math.pow(.7, 1)));
}
else{
MatchingListPaice[numDocs][1] = (float) ((float) Math.pow(.7, 0) * ScoreCalcB + Math.pow(.7, 1) *
ScoreCalcA /(Math.pow(.7, 0) + Math.pow(.7, 1)));
}
}
else { //AND
MatchingList[numDocs][1] = (float) (.4 * Math.min(ScoreCalcA, ScoreCalcB) + .6 * Math.max(ScoreCalcA,
ScoreCalcB));
MatchingListPNorm[numDocs][1] = (float) (1 - Math.sqrt(( Math.pow(1,2)* Math.pow(1 - ScoreCalcA, 2) +
Math.pow(1,2)* Math.pow(1 - ScoreCalcB, 2))/ (Math.pow(1,2) + Math.pow(1,2))));
if (ScoreCalcA <= ScoreCalcB){
MatchingListPaice[numDocs][1] = (float) ((float) Math.pow(1, 0) * ScoreCalcA + Math.pow(1, 1) *
ScoreCalcB /(Math.pow(1, 0) + Math.pow(1, 1)));
}
else{
MatchingListPaice[numDocs][1] = (float) ((float) Math.pow(1, 0) * ScoreCalcB + Math.pow(1, 1) *
ScoreCalcA /(Math.pow(1, 0) + Math.pow(1, 1)));
}
}
numDocs ++;
}
}
////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
///////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
//Sort and Print
////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
///////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
for (int i = 0; i < numDocs ; i++) {
float currentMax = MatchingList[i][1];
float currentPaiceMax = MatchingListPaice[i][1];
int currentPaiceIndex = i;
int currentMaxIndex = i;
float currentPNormMax = MatchingListPNorm[i][1];
int currentPNormIndex = i;
for (int j = i; j < numDocs; j++) {
if (currentMax < MatchingList[j][1] && MatchingList[currentMaxIndex][1] < MatchingList[j][1])
{
currentMax = MatchingList[j][1];
currentMaxIndex = j;
}
if (currentPaiceMax < MatchingListPaice[j][1] && MatchingListPaice[currentPaiceIndex][1] <
MatchingListPaice[j][1])
{
currentPaiceMax = MatchingListPaice[j][1];
currentPaiceIndex = j;
}
if (currentPNormMax < MatchingListPNorm[j][1] &&
MatchingListPNorm[currentPNormIndex][1] < MatchingListPNorm[j][1])
{
currentPNormMax = MatchingListPNorm[j][1];
currentPNormIndex = j;
}
}
if (currentMaxIndex != i) {
float temp0 = MatchingList[currentMaxIndex][0];
float temp11 = MatchingList[currentMaxIndex][1];
MatchingList[currentMaxIndex][0] = MatchingList[i][0];
MatchingList[currentMaxIndex][1] = MatchingList[i][1];
MatchingList[i][0] = temp0;
MatchingList[i][1] = temp11;
}
if (currentPaiceIndex != i) {
float temp12 = MatchingListPaice[currentPaiceIndex][0];
float temp3 = MatchingListPaice[currentPaiceIndex][1];
MatchingListPaice[currentPaiceIndex][0] = MatchingListPaice[i][0];
MatchingListPaice[currentPaiceIndex][1] = MatchingListPaice[i][1];
MatchingListPaice[i][0] = temp12;
MatchingListPaice[i][1] = temp3;
}
if (currentPNormIndex != i) {
float temp4 = MatchingListPNorm[currentPNormIndex][0];
float temp5 = MatchingListPNorm[currentPNormIndex][1];
MatchingListPNorm[currentPNormIndex][0] = MatchingListPNorm[i][0];
MatchingListPNorm[currentPNormIndex][1] = MatchingListPNorm[i][1];
MatchingListPNorm[i][0] = temp4;
MatchingListPNorm[i][1] = temp5;
}
}
int w = 0;
if (numDocs > 150){
w = 150;
}
else {
w = numDocs;
}
for (int p = 0; p < w; p++) {
System.out.println("Match " + (p + 1)); // parentheses so the match number is added, not concatenated
System.out.println(MatchingList[p][0] + " MMM " + MatchingList[p][1]);
System.out.println(MatchingListPaice[p][0] + " Paice " + MatchingListPaice[p][1]);
System.out.println(MatchingListPNorm[p][0] + " P-Norm " + MatchingListPNorm[p][1]);
System.out.println(MatchingListBoolean[p][0] + " Boolean " + MatchingListBoolean[p][1]);
System.out.println("+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++");
}
System.out.println("Number of docs found that match: " + TOTAL_NUM_MATCHING_DOCS);
System.out.println("Number of docs for Term 1: " + MAXA);
System.out.println("Number of docs for Term 2: " + MAXB);
return docFrequencies;
}
}
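The sort-and-print step above orders three parallel arrays by hand with a selection sort and then prints at most 150 rows. The same ranking can be sketched with the standard collections API. This is an illustrative alternative under stated assumptions, not the dissertation's implementation: the names RankedMatches, ScoredDoc, and topMatches are hypothetical, and each similarity measure would be ranked through its own list of (document ID, score) pairs.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical helper; the class, type, and method names are not from the listing. */
final class RankedMatches {
    private RankedMatches() {}

    /** One matched document and its similarity score under a single measure. */
    static final class ScoredDoc {
        final int docId;
        final float score;

        ScoredDoc(int docId, float score) {
            this.docId = docId;
            this.score = score;
        }
    }

    /** Returns at most {@code limit} matches, highest score first; the input list is not modified. */
    static List<ScoredDoc> topMatches(List<ScoredDoc> matches, int limit) {
        List<ScoredDoc> sorted = new ArrayList<>(matches);
        // Descending order by score; List.sort is stable, so ties keep their original order.
        sorted.sort(Comparator.comparingDouble((ScoredDoc d) -> d.score).reversed());
        int count = Math.max(0, Math.min(limit, sorted.size()));
        return new ArrayList<>(sorted.subList(0, count));
    }
}
```

Calling topMatches(matches, 150) once per measure would mirror the 150-row cap used in the listing while keeping each document ID attached to its score.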