
Media Analyzer
David Witherspoon

University of Colorado at Boulder
Boulder, Colorado 80309

ABSTRACT

Users of document and web search engines often have to scan through many short descriptions (snippets) of the documents returned as relevant to their search query. In this paper I present the ability to organize search results into groups, or clusters, which lets the user focus their search for information by selecting a specific cluster of related documents that interests them, based on the initial search query.

This paper presents a search application that allows the user to perform multiple search queries and presents the resulting relevant documents both as traditional linear text and as a graphical visualization. The application allows the user to select between my K-means clustering implementation and the Carrot2 Lingo clustering algorithm, which is then used to cluster the documents returned by the search query.

Keywords

Information Retrieval, K-means, Lingo, Clustering, Lucene Indexing, Prefuse.

1. INTRODUCTION
1.1 Motivation
I am currently working on a Research and Development (R&D) project that has a basic search capability: the user enters a search query and is presented with a list of results. This approach breaks down when the query is too general, containing only a few terms, each with a high document frequency. Document frequency is the number of documents in the collection that contain a given term [2]; a term with a high document frequency appears in most of the documents, so such a query returns many documents that are not relevant to what the user is trying to find. After the query has been performed, the results are presented as a list spanning multiple pages, and the user has to read through all of the snippets to determine whether each document is relevant to what they are actually trying to find. In general, if users cannot find what they are looking for on the first page and must continue to the next of N pages, they will typically abandon the search and try again.

Another concern with a typical search query is how the document results are presented to the user. In a typical search application, the results appear as a linear list of document snippets, with a limit on the number of documents displayed per page. The search component developed on the R&D project follows this format. The problem is that relevant documents may sit on pages 5, 10, and 15 of the N pages of results, but the user will never scan through that many documents to find them. Finding relevant documents scattered across many pages requires more time than a typical user is willing to spend.

I am looking at providing a solution to both of these problems to improve the current search component within our R&D project.

1.2 Contributions
The above concerns led to the development of the application presented in this paper. The first concern, a result list containing too many relevant and irrelevant documents for the user to look through, is addressed by clustering similar documents. The advantage of organizing similar results into a group, or cluster, is that it helps the user locate a cluster of documents, based on the cluster name, that is relevant to what they are trying to accomplish, without having to abandon the search query. Presenting each cluster of documents with a cluster name allows the user to scan many cluster topic names quickly and decide whether to look at the documents the cluster contains. Once the user has selected a cluster that interests them, they can see all of the documents belonging to that cluster, which should therefore be relevant to what they are looking for. The clustering algorithms I use are my own implementation of K-means and an integration of Carrot2's Lingo clustering algorithm.

This leads to the second aspect of the project: the way the data is presented to the user. The application provides two ways of presenting the clusters of documents from the performed search: a text-based display and a graphical visualization. The text-based display shows each cluster name as a header with the documents belonging to that cluster listed below it, giving the user the ability to judge from the cluster's name whether it might be of interest. The second approach presents the results graphically using Prefuse. This visualization lets the user see all of the clusters linked off the search query, and all of the documents linked off their cluster.

Because the current search component is tightly integrated with company-specific code, I created this system from scratch. Also, because the data used by the R&D project is classified, I used the Reuters news data set instead. While developing the application I kept in mind that it needs to be designed to fit within our R&D project with few modifications.

2. RELATED WORK
2.1 Vector Space Model
To compare the similarity between documents, the idea arose of representing each document in a high-dimensional space. The structure used to support this representation is the Vector Space Model (VSM). The VSM is a technique that transforms the problem of comparing textual documents into the problem of comparing algebraic vectors in a high-dimensional space [4] [1] [7]. When a document is indexed, each unique term and the number of occurrences of that term within the document is kept within the VSM. We can then compare two documents term by term, using their term frequencies, to determine how similar they are. This concept is used within the application to compute document similarity in both the k-means and Lingo clustering algorithms. We will see later how the Lucene index is set up to store the vector space model, or term frequency vector, for each document; the application then uses this information to calculate document similarities.
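To make the vector representation concrete, here is a minimal sketch (my illustration, not code from the application) of documents as sparse term-frequency maps compared with cosine similarity, one standard VSM measure. The application itself uses Euclidean distance for k-means, as described in section 4.3.1, and all class and variable names here are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: documents as sparse term-frequency vectors compared with cosine. */
public class VsmSketch {

    /** Cosine similarity between two sparse term-frequency vectors. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other; // only terms shared by both documents contribute
            }
        }
        return dot == 0.0 ? 0.0 : dot / (norm(a) * norm(b));
    }

    static double norm(Map<String, Integer> v) {
        double sum = 0.0;
        for (int tf : v.values()) {
            sum += (double) tf * tf;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        Map<String, Integer> d1 = new HashMap<String, Integer>();
        d1.put("oil", 3);
        d1.put("price", 2);
        Map<String, Integer> d2 = new HashMap<String, Integer>();
        d2.put("oil", 1);
        d2.put("market", 4);
        System.out.println(cosine(d1, d2)); // higher value = more similar documents
    }
}
```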

2.2 Suffix Tree Clustering
Suffix Tree Clustering (STC) is a clustering algorithm based on identifying phrases that are common to groups of documents [8]. Each cluster is defined by the common phrase shared by all of the documents in that cluster, where a phrase is a sequential group of words or terms common to all of those documents. STC has three logical steps. The first is document cleaning [8]: removing terms that have a high document frequency and are therefore not unique to a few documents within the corpus. As with the first processing step of other clustering algorithms, this uses a stemmer and other filters to remove these common terms. The next step is identifying base clusters, which can be viewed as building an inverted index of phrases for the corpus [8]; this is accomplished using a technique called suffix trees [8]. The final step combines the base clusters into clusters. This is done because base clusters may overlap, with the same documents containing the phrases of multiple base clusters. The algorithm therefore merges base clusters with high overlap into a new cluster, analogous to the final cluster-pruning step of other clustering algorithms.

2.3 Lingo Clustering Algorithm
Lingo is a novel algorithm for clustering search results with an emphasis on the quality of cluster descriptions [4] [3]. Lingo was designed to find the labels of the clusters first and assign documents to the clusters last. This avoids the problem of documents being clustered together without a good label being available to describe the cluster. The Lingo algorithm consists of five major steps, and pseudo code can be found in [4]. The first step, as with almost all clustering algorithms, is preprocessing: stemming, stop word removal, and other filtering. Next is frequent phrase extraction, where frequent phrases are recurring ordered sequences of terms appearing in the input documents [4]; this is similar to the second step of suffix tree clustering. During this step the Lingo algorithm looks for candidate cluster labels among these frequent phrases and single terms. Each candidate label must fulfill the following criteria [4]:

1. Appear in the input documents at least a certain number of times.

2. Not cross sentence boundaries.

3. Be a complete phrase.

4. Not begin or end with a stop word.

Once the frequent phrases have been discovered, they are used in the next step of the algorithm, cluster label induction. This process has four parts, defined in [4]: term-document matrix building, abstract concept discovery, phrase matching, and label pruning. After cluster label induction there are k clusters, each with an assigned label. The next step uses the Vector Space Model to assign documents to the cluster labels [4]. A document is assigned to a cluster if its similarity exceeds the Snippet Assignment Threshold, a control parameter. If a document does not exceed the threshold for any cluster, Lingo assigns it to a cluster labeled Others. At this point the algorithm has assigned every document in the corpus to one of the k+1 clusters (the Others cluster accounts for the +1), and the final step is to sort the clusters by score. A cluster's score is its label score multiplied by the number of documents in the cluster [4]. The Lingo algorithm has then finished, providing a clustering of the documents in the search results with human-friendly labels.

2.4 K-means Clustering Algorithm
K-means differs from both of the clustering algorithms discussed above: it finds the clusters first instead of finding the cluster labels first. K-means starts with k clusters and assigns k randomly selected centroids, or means, to those clusters. The algorithm then compares each document's term frequency vector to each cluster's mean vector to determine which cluster the document is most similar to. Once the documents have been assigned to clusters, the next step is to calculate each cluster's new mean from the term frequency vectors of the documents assigned to it; this new mean vector is then assigned to the cluster. The process of assigning documents and recalculating mean vectors is repeated until convergence [1] [5] [6]. After convergence, the final step is to select a label for each cluster. This can be done either by selecting a list of words common to all the documents in the cluster or by taking the title of the document closest to the cluster's mean vector [5].

3. RESOURCES AND DATA SOURCES
3.1 Prefuse
Prefuse is a visualization toolkit that I use to display the search query, the query's relationships to the different clusters, and each cluster's relationship to the documents it contains. Prefuse is available at http://prefuse.org/, where you can download the latest version and see many examples of data visualizations.
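As a rough sketch of how the query, cluster, and document structure could be handed to Prefuse, the following builds only the data graph; the Visualization and Display setup that actually renders it is omitted, and the exact data API may differ between Prefuse releases. The node labels are borrowed from the example figures later in the paper; the document title is made up.

```java
import prefuse.data.Graph;
import prefuse.data.Node;

/** Illustrative only: building the query -> cluster -> document data graph. */
public class GraphSketch {
    public static void main(String[] args) {
        Graph g = new Graph();              // default: undirected graph
        g.addColumn("label", String.class); // label column read by the renderer

        Node query = g.addNode();
        query.setString("label", "corp");   // the search query is the root node

        Node cluster = g.addNode();
        cluster.setString("label", "PAYOUT");
        g.addEdge(query, cluster);          // each cluster links off the query

        Node doc = g.addNode();
        doc.setString("label", "Example Corp declares payout"); // hypothetical title
        g.addEdge(cluster, doc);            // each document links off its cluster

        // A prefuse.Visualization and prefuse.Display would then render g (omitted).
    }
}
```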

3.2 Reuters Data Source
The data source I used is the XML version of Reuters-21578, which has defined XML elements and attributes. The data is easy to find on the web using a standard search engine with the query "Reuters 21578 xml dataset"; for example, one location I found is http://modnlp.berlios.de/reuters21578.html. The dataset contains 22 files: 21 of the files contain 1,000 documents each, and the last file contains 578 documents.

3.3 PostgreSQL
The database used by this application is PostgreSQL 8.3, which can be found at http://www.postgresql.org/. More recent versions of PostgreSQL exist, but this is the version used by the R&D project I am working on, so it made sense to keep the same version, since the plan is to integrate this work as a feature within that application. There is no reason another relational database, such as Oracle or SQL Server, could not be used; any of them will work for what the application uses the database for, which is explained in section 4.

3.4 Lingo Clustering
The Lingo clustering algorithm is one of the clustering algorithms provided by Carrot2, an open source framework for building search engines. The Java API can be downloaded at http://project.carrot2.org/download-java-api.html. Carrot2 supports multiple clustering algorithms, but I decided to work with Lingo.
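For reference, here is a minimal sketch of invoking Lingo through the Carrot2 Java API, written against the 3.x API as I understand it; the exact classes and signatures depend on the Carrot2 version, and the sample title and snippet are made up.

```java
import java.util.ArrayList;
import java.util.List;

import org.carrot2.clustering.lingo.LingoClusteringAlgorithm;
import org.carrot2.core.Cluster;
import org.carrot2.core.Controller;
import org.carrot2.core.ControllerFactory;
import org.carrot2.core.Document;
import org.carrot2.core.ProcessingResult;

public class LingoSketch {
    public static void main(String[] args) {
        // Each search hit becomes a Carrot2 Document (title, snippet).
        List<Document> docs = new ArrayList<Document>();
        docs.add(new Document("Example Corp declares payout",
                "Example Corp said it will pay a quarterly dividend..."));

        // Run Lingo over the documents, passing the original query as a hint.
        Controller controller = ControllerFactory.createSimple();
        ProcessingResult result =
                controller.process(docs, "corp", LingoClusteringAlgorithm.class);

        for (Cluster c : result.getClusters()) {
            System.out.println(c.getLabel() + " (" + c.getAllDocuments().size() + " documents)");
        }
    }
}
```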

3.5 Lucene
Lucene is an open source search framework, which is used to perform the indexing and search capabilities within the application. Lucene can be downloaded from http://lucene.apache.org/.

4. PROJECT APPROACH
4.1 Data Gathering and Indexing
I started by locating and downloading the XML version of the Reuters dataset, which provided a corpus of 21,578 documents. Once the documents were downloaded, I created the table within the PostgreSQL database to hold all of the documents keyed by their docId, an attribute on the Reuters element of each document. After creating the table, I was ready to process the XML documents. I created a file reader that reads the 22 files, extracts each document into a JDOM document, and stores it in a collection. This collection of JDOM documents is used by two components. The first stores the documents by docId in the database table; the application takes advantage of having the complete document stored, as discussed in section 4.4. The second component takes each document through the Lucene indexing process.

Indexing each JDOM document starts with creating a Plain Old Java Object (POJO) to hold all of the data from the JDOM document; the POJO is populated with the elements and attributes of the JDOM document using XPath. Once this is complete, I have a POJO that represents the original Reuters XML document, which I use to create a Lucene document. This abstraction allows me to swap out the indexing component for something other than Lucene if needed. When creating the Lucene document, I determine what the fields are going to be and how they are stored and indexed. The fields I decided to create are DocId, Title, Subject, and Body. All of these fields are stored, and all except DocId are indexed. The Body field is the only one for which I store the TermVector. This tells Lucene to store the term frequency vector for each document, which I use later in the application. The term frequency vector lists each term within the document and the number of times it occurs there [2]. It does not contain every term in the dictionary; a term appears only if it exists in the document (so its term frequency is one or more).
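A sketch of the field setup just described, assuming a Lucene 3.x-era API; the lower-case field names are my own convention for the DocId, Title, Subject, and Body fields above.

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class ReutersDocumentFactory {
    /** Build the Lucene document from the POJO's values. */
    static Document build(String docId, String title, String subject, String body) {
        Document doc = new Document();
        // Stored but not indexed: used only to look the full record up in PostgreSQL later.
        doc.add(new Field("docId", docId, Field.Store.YES, Field.Index.NO));
        doc.add(new Field("title", title, Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field("subject", subject, Field.Store.YES, Field.Index.ANALYZED));
        // Body also keeps its term vector so the TermFreqVector can be read back for k-means.
        doc.add(new Field("body", body,
                Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES));
        return doc;
    }
}
```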

With the Lucene document created, I need an Analyzer and an IndexWriter so that Lucene can write the document to the index. The Analyzer I am using is the SnowballAnalyzer, which uses the StandardTokenizer to split the Lucene document into individual tokens and then applies a chain of filters. The first filter normalizes the tokens, the next lower-cases them, then the stop word list is applied, and finally the SnowballFilter is applied. The stop word list contains words that are extremely common and provide little help in matching the user's query to documents within the index [2]; the stop word filter removes any token that matches the list. The SnowballFilter applies a stemmer, the process of applying heuristics to remove letters from a word in the hope of finding its root [2]; for example, the stemmer aims to reduce both cat and cats to the same word, cat. With the Analyzer defined, I use it and the location of the index to create the Lucene IndexWriter, and call its method to add the Lucene document created earlier, so that it is analyzed and indexed. At this point all of the Reuters documents are indexed and the complete documents are stored in the database.
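A sketch of creating the SnowballAnalyzer and IndexWriter and adding the document, again assuming a Lucene 3.x-era API; the index path and the choice of stop word set are assumptions of mine.

```java
import java.io.File;

import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class ReutersIndexer {
    static void index(Document doc) throws Exception {
        // English Snowball stemmer plus the standard English stop word list.
        SnowballAnalyzer analyzer = new SnowballAnalyzer(
                Version.LUCENE_30, "English", StandardAnalyzer.STOP_WORDS_SET);
        IndexWriter writer = new IndexWriter(
                FSDirectory.open(new File("reuters-index")), // index location (assumed path)
                analyzer,
                true, // create a new index
                IndexWriter.MaxFieldLength.UNLIMITED);
        writer.addDocument(doc); // tokenized, filtered, stemmed, then written to the index
        writer.close();
    }
}
```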

4.2 Search
After storing and indexing the entire corpus, I added the capability for the user to enter a search query and get back the top 200 relevant documents. We start by creating a QueryParser that uses the same Analyzer used for indexing to analyze the query provided by the user. The other part of the QueryParser is the definition of which field(s) to search across; in this case we search over the Title and Body fields. When the user submits a query, the application uses the QueryParser to parse it into a search statement that is the disjunction of all the resulting terms across both fields. Had I used the conjunction of the terms, the application would miss many documents that are relevant but do not contain every term. Performing the disjunction of the terms therefore makes more sense, so that relevant documents are not left out of the result set. The search statement is then passed to the Lucene IndexSearcher, which searches across the entire corpus and returns a collection of the top 200 documents.
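A sketch of the search step under the same Lucene 3.x assumption: MultiFieldQueryParser covers both fields, the default operator is set to OR for the disjunction described above, and the hit count is capped at 200.

```java
import java.io.File;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.queryParser.MultiFieldQueryParser;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class ReutersSearcher {
    static TopDocs search(String userQuery, Analyzer analyzer) throws Exception {
        IndexSearcher searcher =
                new IndexSearcher(FSDirectory.open(new File("reuters-index")));
        // Parse the query against both fields with the same analyzer used for indexing.
        MultiFieldQueryParser parser = new MultiFieldQueryParser(
                Version.LUCENE_30, new String[] { "title", "body" }, analyzer);
        parser.setDefaultOperator(QueryParser.OR_OPERATOR); // disjunction of the terms
        Query query = parser.parse(userQuery);
        return searcher.search(query, 200); // the top 200 relevant documents
    }
}
```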

4.3 Clustering
4.3.1 K-means Clustering Algorithm
At this point we have the top 200 relevant documents and need the TermFreqVector of each returned document, since this information is used by the k-means clustering algorithm. K-means is a method of cluster analysis which aims to partition n documents into k clusters in which each document belongs to the cluster with the nearest mean [1]. My implementation of k-means starts by selecting the initial means of the k clusters. Instead of choosing them completely at random, I select documents from the result set at a fixed offset: the number of documents divided by the number of clusters. Starting from the first document in the collection, I assign its term frequency vector as the mean of the first cluster, then add the offset to reach the next document, whose term frequency vector is assigned to the next cluster, and repeat until every cluster has an initial mean, as seen in Figure 1.
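A sketch of this offset-based seeding, assuming the term frequency vectors are held as sparse maps from term to value; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class KMeansSeeding {
    /** Seed each of the k clusters with the term vector of an evenly spaced document. */
    static List<Map<String, Double>> initialMeans(List<Map<String, Double>> termVectors, int k) {
        List<Map<String, Double>> means = new ArrayList<Map<String, Double>>();
        int offset = termVectors.size() / k; // spacing between seed documents
        for (int i = 0; i < k; i++) {
            // Copy so later mean updates do not mutate the document's own vector.
            means.add(new HashMap<String, Double>(termVectors.get(i * offset)));
        }
        return means;
    }
}
```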

Figure 1. Step 1 Assign Initial Means
Once the initial means have been established, the next step is to assign every document to the cluster with the nearest mean, using a similarity measure, as seen in Figure 2.

Figure 2. Step 2 Assign Documents to Clusters
The similarity measure I use is the Euclidean distance between the cluster's mean vector and the document's term frequency vector. Since a term frequency vector does not contain every term in the dictionary, only the terms present in that document, there may be terms in the document that are not in the cluster's mean vector and vice versa. I therefore create a collection S containing all of the terms from both the document's term frequency vector and the cluster's mean vector. For each term in S, I sum the squares of the difference between the document's term frequency and the cluster's mean value for that term; after summing across all terms, I take the square root of that value and return the distance. The formula is provided below:

distance(D, C) = sqrt( sum over t in S of (tf(D, t) − mean(C, t))² )    (1)
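A sketch of formula (1) over sparse vectors; a term absent from one of the two vectors contributes a value of zero on that side.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class KMeansDistance {
    /** Euclidean distance between a document's term vector and a cluster mean, formula (1). */
    static double distance(Map<String, Double> doc, Map<String, Double> mean) {
        Set<String> s = new HashSet<String>(doc.keySet()); // S: union of both term sets
        s.addAll(mean.keySet());
        double sum = 0.0;
        for (String term : s) {
            double tf = doc.containsKey(term) ? doc.get(term) : 0.0; // missing term counts as 0
            double mu = mean.containsKey(term) ? mean.get(term) : 0.0;
            double diff = tf - mu;
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }
}
```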

We then compare each document's term frequency vector with every cluster's mean to determine which cluster is at the shortest distance, and assign the document to that cluster. This continues until every document has been assigned to a cluster. Next, for each cluster we calculate a new mean vector: for each term across all of the cluster's documents we calculate the mean value of that term, producing a new mean term vector for the cluster, as shown in Figure 3.

Figure 3. Step 3 Assign New Mean to Clusters
We now compare the new mean term vector with the previous one to determine whether we have reached convergence. If the mean term vector is unchanged, the documents remain in the same clusters and we have reached convergence; if it continues to change, the membership of that cluster must still be changing. If convergence has not been reached, we repeat steps 2 and 3 until it has. At that point we have the k clusters and the documents that belong to them, as shown in Figure 4.
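A sketch of the assign/recompute loop, reusing the distance method from the previous sketch; convergence is detected when no cluster mean changes between iterations, and an empty cluster simply keeps an empty mean here, a simplification of my own.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class KMeansLoop {
    /** Repeat steps 2 and 3 until no cluster mean changes between iterations. */
    static int[] cluster(List<Map<String, Double>> docs, List<Map<String, Double>> means) {
        int[] assignment = new int[docs.size()];
        boolean converged = false;
        while (!converged) {
            // Step 2: assign each document to the cluster with the nearest mean.
            for (int d = 0; d < docs.size(); d++) {
                int best = 0;
                for (int c = 1; c < means.size(); c++) {
                    if (KMeansDistance.distance(docs.get(d), means.get(c))
                            < KMeansDistance.distance(docs.get(d), means.get(best))) {
                        best = c;
                    }
                }
                assignment[d] = best;
            }
            // Step 3: recompute each cluster's mean; unchanged means signal convergence.
            converged = true;
            for (int c = 0; c < means.size(); c++) {
                Map<String, Double> next = meanOf(docs, assignment, c);
                if (!next.equals(means.get(c))) {
                    means.set(c, next);
                    converged = false;
                }
            }
        }
        return assignment;
    }

    /** Term-by-term average of the vectors of the documents assigned to cluster c. */
    static Map<String, Double> meanOf(List<Map<String, Double>> docs, int[] assignment, int c) {
        Map<String, Double> sum = new HashMap<String, Double>();
        int n = 0;
        for (int d = 0; d < docs.size(); d++) {
            if (assignment[d] != c) continue;
            n++;
            for (Map.Entry<String, Double> e : docs.get(d).entrySet()) {
                Double cur = sum.get(e.getKey());
                sum.put(e.getKey(), (cur == null ? 0.0 : cur) + e.getValue());
            }
        }
        if (n > 0) {
            for (Map.Entry<String, Double> e : sum.entrySet()) {
                e.setValue(e.getValue() / n);
            }
        }
        return sum; // an empty cluster yields an empty mean (simplification)
    }
}
```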


Figure 4. Step 4 Clusters Converge
At this point all of the clusters and their associated documents have been discovered, and labels for the clusters need to be created. To come up with a cluster's label, the application searches through all of the documents associated with that cluster and determines which one is closest to the centroid using the Euclidean distance formula. The title of that document is then used as the label for the cluster. This works well because the documents in a cluster are very similar and titles contain a limited number of terms.
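A sketch of this labeling step, again reusing the distance method from the earlier sketch; the fallback label for an empty cluster is my own choice.

```java
import java.util.List;
import java.util.Map;

public class ClusterLabeler {
    /** Label cluster c with the title of the document nearest its mean vector. */
    static String labelFor(List<Map<String, Double>> docs, List<String> titles,
                           int[] assignment, int c, Map<String, Double> mean) {
        int best = -1;
        double bestDist = Double.MAX_VALUE;
        for (int d = 0; d < docs.size(); d++) {
            if (assignment[d] != c) continue;
            double dist = KMeansDistance.distance(docs.get(d), mean);
            if (dist < bestDist) {
                bestDist = dist;
                best = d;
            }
        }
        return best >= 0 ? titles.get(best) : "Other"; // fallback label is my own choice
    }
}
```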

4.3.2 Lingo Clustering Algorithm
I decided it would be good to allow multiple clustering algorithms to be used, depending on the user's choice. This allows different algorithms to be added to the application in the future without much integration work, and it is why I included the Lingo clustering algorithm: to prove that integrating a new clustering algorithm would work. After downloading the Lingo clustering algorithm and its dependent jars from the Carrot2 web site, I began integrating it into the application. I started by wiring the search capability of the Lingo API into the application's search box. Once the application could accept a search query, send the request to the Lingo API, and get the result set back, I was ready to map that result set onto the Java objects I had already created for k-means clustering, so that the clusters and related documents could be displayed in the two existing views. Once the integration was complete, I could use the Lingo clustering algorithm to cluster the top 200 documents returned by the search query.

4.4 User Interface Design
The application is a thick client that provides the typical area to fill in a search query. Currently it allows the user to select between the k-means and Lingo clustering algorithms, as seen in Figure 5.

Figure 5. Clustering Algorithm Selection
Once the algorithm has been selected and the user enters the search query, a new tab is created that contains both the text version of the results and the graphical visualization of the clusters and documents. The default is the text version of the clustered results, where the header of each section shows the name of the cluster, the number of documents in that cluster, and the cluster's score. How the cluster score is computed depends on the clustering algorithm. An example using the search query "corp", shown below in Figure 6, demonstrates the results presented to the user. The user also sees a search header with the search terms in the tab label, indicating that the result contains 200 documents and 41 clusters.

Figure 6. Text Based Results
In the text area of the results page you can see a list of all the clusters, each denoted by its header, with the documents associated with the cluster listed below it. Each document is presented as its title and a snippet of the body, which allows the user to judge whether the document is of interest.


Switching over to the Visualization tab under the search results, you can view the clusters and documents in six different ways. Figure 7 shows the results presented in the cloud view. When the user selects nodes within the graph, the nodes linked to them are highlighted. In Figure 7 I selected the search query, which is highlighted in red; all of the clusters connected to it are highlighted in tan, which allows the user to scan the cluster titles and their documents much more quickly than a list.

Figure 7. Cloud Visualization

When you select a cluster node, both the query node and the document nodes belonging to that cluster are highlighted, as shown in Figure 8. Figure 8 also shows the graph presented in the tree visualization, demonstrating that the user can work in whichever visualization works best for them. As you can see below, this allows the user to scan easily over the documents in the cluster to see whether they are relevant to what they are looking for. In this example of the "corp" search, a user interested in finding more information on payouts would select the "PAYOUT" cluster node to see links to the relevant documents.

Figure 8. Tree Visualization

After looking at the document titles, the user might want to see the contents of a document. The user selects the document and right-clicks; as shown in Figure 9, the Display Document context menu is presented, and clicking it displays the document.

Figure 9. Document Selection
When the user asks the application to display the document, this is where storing the DocId in both the index and the database pays off. The application takes the DocId from the document object, selects the record with that DocId, retrieves the original record from the database, and presents it in a dialog, as shown in Figure 10. The user can thus see the complete contents of a relevant document from a general search query, by following the cluster of documents rather than performing another, more specific query.

Figure 10. Displaying Document Contents
Finally, the application also supports presenting multiple queries to the user at the same time, as shown in Figure 11. I followed the standard convention of opening each new search query in a new tab, as most web browsers do. The user can always remove search tabs they are no longer interested in, and can compare search results between tabs by looking back and forth.

Figure 11. Multiple Searches
For navigating the graph I provide an overview on the right-hand side of the application, which shows what is being displayed on the left-hand side within the white box. The user can select the box and move it around to change what is viewed on the left-hand side, or select and drag the left-hand side itself. The user can also use the mouse wheel to zoom in and out, and can select the left-most button in the tab's tool bar to center the contents of the graph within the frame. All of these features let the user interact with the graph of the search results and find clusters of relevant documents more quickly.

5. EVALUATION AND RESULTS
I evaluate the clusters generated by the two algorithms by comparing how closely each cluster produced by the application matches a set of previously assigned topics. In the Reuters corpus every document has already been judged and assigned a topic, so I take advantage of this existing human topic labeling.

After the search query has been run and the top 200 relevant documents have been returned and clustered, I take those clusters and calculate the Precision, Recall, and F(1) measure for all the clusters. The formulas use the following variables:

RelevantItemsRetrieved = number of documents judged to be of topic T in cluster C.

RetrievedItems = number of documents in cluster C.

RelevantItems = number of documents judged to be of topic T in all the clusters.

The following are the formulas used for calculating Precision (P), Recall (R) and F(1) for cluster C and topic T:

Precision(C, T) = RelevantItemsRetrieved / RetrievedItems

Recall(C, T) = RelevantItemsRetrieved / RelevantItems

F(1) measure = (2PR) / (P + R)
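A sketch of these three measures as they would be computed per cluster and topic; the guards against empty counts are my own addition.

```java
public class ClusterEvaluation {
    /** Precision, Recall and F(1) for one cluster C against one Reuters topic T. */
    static double[] evaluate(int relevantItemsRetrieved, int retrievedItems, int relevantItems) {
        double p = retrievedItems == 0 ? 0.0 : (double) relevantItemsRetrieved / retrievedItems;
        double r = relevantItems == 0 ? 0.0 : (double) relevantItemsRetrieved / relevantItems;
        double f1 = (p + r) == 0.0 ? 0.0 : (2 * p * r) / (p + r);
        return new double[] { p, r, f1 };
    }

    public static void main(String[] args) {
        // Example: a cluster of 10 documents, 8 of topic T, with 9 topic-T documents overall.
        double[] prf = evaluate(8, 10, 9);
        System.out.printf("P=%.2f R=%.2f F1=%.2f%n", prf[0], prf[1], prf[2]);
    }
}
```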

Performing the search query "Corp" with the Lingo clustering algorithm, I got Recall = 0.97, Precision = 0.80, and F(1) = 0.87. Performing the same query with the K-means clustering algorithm, I got Recall = 0.97, Precision = 0.91, and F(1) = 0.94. So there appears to be a slight precision gain when clusters are created by similarity to a centroid, instead of using labels to define what the clusters are and how documents are assigned to them.

6. CONCLUSION
The goal of this project was to let the user scan the many documents returned by a general search query more easily by clustering similar documents. This allows the user to look at the cluster labels to determine relevance without having to scan every document. By providing both the text version of the clusters and related documents and the graphical visualization, I believe I have achieved this goal. Supporting both the k-means and Lingo clustering algorithms let me experience two different approaches to building clusters. The Lingo clustering algorithm, with its focus on producing meaningful cluster names, is much more advanced than the technique I implemented; on the other hand, the work on the k-means clustering algorithm and cluster labeling was a great learning experience and produced good results.

7. FUTURE WORK
Presenting the search query results in a graphical visualization does help the user scan through the results more quickly, thanks to the clustering and the relationships between the query, clusters, and documents. In future development, the graphical display offers the opportunity to add more information that could help the user even more. With clustering algorithms that provide a relevance score against the query, the lines linking the nodes could reflect this score through different colors, a score label on the link, or varying line thickness. Because this is a visual display, more attributes can be added and data can be presented in different, more meaningful ways. Future enhancements to the k-means clustering would be to look into using kd-trees to speed up the process of determining the k-means clusters, and to compare the clusters generated by k-means against k-medoids to see which produces better clusters.

Another area of future work I would like to explore, to add more functionality to this application, is Information Extraction, specifically Named Entity Extraction and relationship mining. I can see possibilities for clustering documents based on the named entities extracted from them; for example, the application could group documents by the people they mention in common, or the locations found in the documents, and so on. This could be another way of sorting the information to find what is relevant.

The last improvement I would consider is a cluster reduction step for the k-means algorithm. This would allow flexibility in the number of clusters found, because in some cases the true number of unique, non-overlapping clusters may be much smaller than the k clusters the system originally starts with.

8. REFERENCES
[1] Han, J. and Kamber, M. 2006. Data Mining: Concepts and Techniques, Second Edition. Morgan Kaufmann Publishing, San Francisco, CA.

[2] Manning, C. D., Raghavan, P., and Schutze, H. 2008. Introduction to Information Retrieval. Cambridge University Press.

[3] Osinski, S. Improving Quality of Search Results Clustering with Approximate Matrix Factorizations. Poznan Supercomputing and Networking Center, Poznan, Poland.

[4] Osinski, S., Stefanowski, J., and Weiss, D. 2004. Lingo: Search Results Clustering Algorithm Based on Singular Value Decomposition. Poznan University of Technology, Poznan, Poland.

[5] Russell, S. and Norvig, P. 2003. Artificial Intelligence: A Modern Approach, Second Edition. Prentice Hall.

[6] Segaran, T. 2007. Programming Collective Intelligence: Building Smart Web 2.0 Applications, First Edition. O'Reilly.

[7] Steinbach, M., Karypis, G., and Kumar, V. 2000. A Comparison of Document Clustering Techniques. University of Minnesota, Technical Report #00-034.

[8] Zamir, O. and Etzioni, O. Web Document Clustering: A Feasibility Demonstration. Department of Computer Science and Engineering, University of Washington, Seattle, WA.