
Document Clustering Algorithms, Representations and Evaluation for Information Retrieval

Christopher M. De Vries
[email protected]

Submitted for Doctor of Philosophy in Computer Science
Electrical Engineering and Computer Science
Computational Intelligence and Signal Processing
Queensland University of Technology
Brisbane, Australia

Prior Degrees
Masters of Information Technology by Research
Queensland University of Technology
Brisbane, Australia

Bachelor of Information Technology
Software Engineering and Data Communications
Graduated with Distinction
Queensland University of Technology
Brisbane, Australia

Supervisor
Associate Professor Shlomo Geva
Electrical Engineering and Computer Science
Computational Intelligence and Signal Processing
Queensland University of Technology
Brisbane, Australia

Associate Supervisor
Associate Professor Andrew Trotman
Department of Computer Science
University of Otago
Dunedin, New Zealand

2014


The purpose of computing is insight, not numbers.

– Richard Hamming (1962) Numerical Methods for Scientists and Engineers.


Abstract

Digital collections of data continue to grow exponentially as the information age continues to infiltrate every aspect of society. These sources of data take many different forms such as unstructured text on the world wide web, sensor data, images, video, sound, results of scientific experiments and customer profiles for marketing. Clustering is an unsupervised learning approach that groups similar examples together without any human labeling of the data. Due to the very broad nature of this definition, there have been many different approaches to clustering explored in the scientific literature.

This thesis addresses the computational efficiency of document clustering in an information retrieval setting. This includes compressed representations, efficient and scalable algorithms, and evaluation with a specific use case for increasing the efficiency of a distributed and parallel search engine. Furthermore, it addresses the evaluation of document cluster quality for large scale collections containing millions to billions of documents, where complete labeling of a collection is impractical. The cluster hypothesis from information retrieval is also tested using several different approaches throughout the thesis.

This research introduces novel approaches to clustering algorithms, document representations, and clustering evaluation. The combination of document clustering algorithms and document representations outlined in this thesis is able to process large scale web search data sets. This has not previously been reported in the literature without resorting to sampling and producing a small number of clusters. Furthermore, the algorithms are able to do this on a single machine. However, they can also function in a distributed and parallel setting to further increase their scalability.

This thesis introduces new clustering algorithms that can also be applied to problems outside the information retrieval domain. There are many large data sets being produced from advertising, social networks, videos, images, sound, satellites, sensor data, and a myriad of other sources. Therefore, I anticipate these approaches will become applicable to more domains as larger data sets become available.

Keywords

document clustering, representations, algorithms, evaluation, information retrieval, clustering evaluation, clustering, cluster based search, hierarchical clustering, vector quantization, vector space model, dimensionality reduction, compressed representations, document signatures, random projections, random indexing, topology, hashing, machine learning, unsupervised learning, search, distributed search, similarity measures, pattern recognition, efficiency, indexing, text mining, signature files, distributed information retrieval, vector space IR, K-tree, EM-tree, TopSig, cluster hypothesis, large scale learning, INEX, Wikipedia, XML Mining, retrieval models, bit vectors


Acknowledgments

Many thanks go to my supervisor, Shlomo, who has provided valuable direction, advice and insights into this line of research. His advice and direction have been essential in ensuring the success of this research. I wish to thank my associate supervisor, Andrew, for providing feedback on many of the publications contained in this thesis. I wish to thank my family and friends for continued support during my candidature. I wish to thank all of the co-authors on the papers included in this thesis for their feedback and contributions. Finally, I thank QUT for providing an opportunity for me to undertake this research and the High Performance Computing center for supporting the Big Data Lab where many of the experiments were run.

External Contributions

All papers where I am first named author have been written entirely by myself. For papers where I am not first named author, I have made contributions to each of the papers. In the TopSig paper I wrote all material relating to document clustering and contributed to other sections. In the INEX 2009 overview paper I performed the analysis of results and created the graphs for the paper. The contributions I have made are listed in Chapter 4 as research questions.

Statement of Original Authorship

The work contained in this thesis has not been previously submitted to meet requirements for an award at this or any other higher education institution. To the best of my knowledge and belief, the thesis contains no material previously published or written by another person except where due reference is made.

Signature August 26, 2014


QUT Verified Signature


Contents

1 Introduction
   1.1 Document Clustering
   1.2 Representations
   1.3 Algorithms
   1.4 Evaluation
   1.5 Information Retrieval
   1.6 How To Read This Thesis
   1.7 Outline

2 High Level Overview
   2.1 Publication Information

3 Background
   3.1 Document Clustering
      3.1.1 Evaluation
   3.2 Document Representations
   3.3 Ad Hoc Information Retrieval
      3.3.1 Collection Distribution and Selection
      3.3.2 Evaluation
   3.4 Cluster Hypothesis
      3.4.1 Measuring the Cluster Hypothesis
      3.4.2 Cluster Hypothesis in Ranked Retrieval
      3.4.3 Sub-document Cluster Hypothesis in Ranked Retrieval

4 Research Questions
   4.1 Representations
   4.2 Algorithms
   4.3 Evaluation
   4.4 Information Retrieval

5 Representations
   5.1 TopSig: topology preserving document signatures
   5.2 Pairwise Similarity of TopSig Document Signatures
   5.3 Clustering with Random Indexing K-tree and XML Structure

6 Algorithms
   6.1 TopSig: topology preserving document signatures
   6.2 Clustering with Random Indexing K-tree and XML Structure
   6.3 Distributed Information Retrieval: Collection Distribution, Selection and the Cluster Hypothesis for Evaluation of Document Clustering
   6.4 The EM-tree algorithm

7 Evaluation
   7.1 Overview of the INEX 2009 XML Mining track: Clustering and Classification of XML Documents
   7.2 Overview of the INEX 2010 XML Mining track: Clustering and Classification of XML Documents
   7.3 Document Clustering Evaluation: Divergence from a Random Baseline

8 Information Retrieval
   8.1 Distributed Information Retrieval: Collection Distribution, Selection and the Cluster Hypothesis for Evaluation of Document Clustering

9 Overview of the INEX 2009 XML Mining track: Clustering and Classification of XML Documents
   9.1 Evaluation

10 Clustering with Random Indexing K-tree and XML Structure
   10.1 Representations
   10.2 Algorithms

11 Overview of the INEX 2010 XML Mining track: Clustering and Classification of XML Documents
   11.1 Evaluation

12 TopSig: topology preserving document signatures
   12.1 Representations
   12.2 Algorithms

13 Document Clustering Evaluation: Divergence from a Random Baseline
   13.1 Evaluation

14 Pairwise Similarity of TopSig Document Signatures
   14.1 Representations

15 Distributed Information Retrieval: Collection Distribution, Selection and the Cluster Hypothesis for Evaluation of Document Clustering
   15.1 Algorithms
   15.2 Information Retrieval

16 The EM-tree Algorithm
   16.1 Algorithms

17 Conclusions, Implications and Future Directions
   17.1 Future Directions


List of Figures

2.1 A Map of Contributions in this Thesis

9.1 The YAGO Category Distribution
9.2 Purity and NCCG performance of different teams using the entire dataset
9.3 Purity performance of different teams using the subset data
9.4 NCCG performance of different teams using the subset data
9.5 The supervised classification task

10.1 Random Indexing Example

11.1 Complicated and Noisy Wikipedia Category Graph
11.2 Algorithm to Extract Categories from the Wikipedia
11.3 Legend
11.4 Micro Purity
11.5 Micro Negentropy
11.6 NMI
11.7 NCCG

12.1 RMSE Drop Precision
12.2 INEX 2009 Precision vs Recall
12.3 INEX 2009 P@n
12.4 Wall Street Journal P@n
12.5 INEX 2009 P@10 by Topic
12.6 INEX 2010 Micro Purity
12.7 INEX 2010 NCCG
12.8 INEX 2010 Clustering Speed Up

13.1 A Clustering Produced by a Clustering Algorithm
13.2 A Random Baseline Distributing Documents into Buckets the Same Size as a Clustering
13.3 Legend
13.4 Purity
13.5 Purity Subtracted from a Random Baseline
13.6 NCCG
13.7 NCCG Subtracted from a Random Baseline
13.8 NMI
13.9 NMI Subtracted from a Random Baseline
13.10 RMSE
13.11 RMSE Subtracted from a Random Baseline

14.1 INEXrandom pmf
14.2 INEXreference pmf
14.3 WSJ pmf
14.4 INEXrandom cdf
14.5 INEXreference cdf
14.6 WSJ cdf
14.7 INEXrandom cdf First 100 signatures from estimated distribution
14.8 INEXreference cdf First 100 signatures from estimated distribution
14.9 WSJ cdf First 100 signatures from estimated distribution
14.10 INEXrandom cdf First 100 signatures from Binomial distribution
14.11 INEXreference cdf First 100 signatures from Binomial distribution
14.12 WSJ cdf First 100 signatures from Binomial distribution

15.1 Wikipedia – k-means versus K-tree clusters
15.2 Wikipedia – Varying the number of clusters
15.3 Wikipedia – MAP: BM25 sum versus sum of squares
15.4 Wikipedia – P@10: BM25 sum versus sum of squares
15.5 Wikipedia – P@30: BM25 sum versus sum of squares
15.6 Wikipedia – P@100: BM25 sum versus sum of squares
15.7 ClueWeb 2009 Category B
15.8 ClueWeb 2009 Category B – P@10
15.9 ClueWeb 2009 Category B – P@20
15.10 ClueWeb 2009 Category B – P@30

16.1 An m-way tree illustrating means
16.2 A two level m-way tree illustrating the structure of the k-means algorithm
16.3 INEX 2010 XML Mining collection – Convergence
16.4 INEX 2010 XML Mining Collection – RMSE vs Number of Clusters
16.5 INEX 2010 XML Mining Collection – Runtime vs Number of Clusters
16.6 INEX 2010 XML Mining Collection – Streaming EM-tree vs TSVQ

List of Tables

9.1 Top-10 Category Distribution

10.1 K-tree Clusters
10.2 Clusters

11.1 Relevant Documents Missing from the XML Mining Subset
11.2 XML Mining Categories
11.3 Type I and II Classification Errors
11.4 Classification Results

12.1 Detailed Evaluation of 36 clusters

14.1 Number of Signatures Generated by TopSig
14.2 Nearest Neighbours Expected from cdf_b
14.3 Nearest Neighbours Expected from cdf_e
14.4 Signatures Expected within d/2

16.1 INEX 2010 XML Mining collection – Convergence
16.2 ClueWeb 2009 Category B – 104,000 Clusters
16.3 INEX 2010 XML Mining Collection – Streaming EM-tree vs TSVQ

List of Abbreviations

TopSig: Topology preserving document Signatures
IR: Information Retrieval
CPU: Central Processing Unit
LSA: Latent Semantic Analysis
NMF: Non-negative Matrix Factorization
DCT: Discrete Cosine Transform
SVD: Singular Value Decomposition
EM: Expectation Maximization
RMSE: Root Mean Squared Error
TSVQ: Tree Structured Vector Quantization
INEX: INitiative for the Evaluation of XML retrieval
XML: eXtensible Markup Language
BM25: Okapi Best Match 25
CBM625: Cluster Best Match 625
NCCG: Normalized Cumulative Cluster Gain
TREC: Text REtrieval Conference
CLEF: Conference and Labs of the Evaluation Forum
NTCIR: NII Testbed and Community for Information access Research
MAP: Mean Average Precision
NMI: Normalized Mutual Information
TF-IDF: Term Frequency - Inverse Document Frequency
RI: Random Indexing
HITS: Hyperlink Induced Topic Search
VSM: Vector Space Model
TSM: Tensor Space Model


Chapter 1
Introduction

Digital collections of data continue to grow exponentially as the information age continues to infiltrate every aspect of society. These sources of data take many different forms such as unstructured text on the world wide web, sensor data, images, video, sound, results of scientific experiments and customer profiles for marketing. Clustering is an unsupervised learning approach that groups similar examples together without any human labeling of the data. Due to the very broad nature of this definition, there have been many different approaches to clustering explored in the scientific literature.

This thesis addresses the computational efficiency of document clustering in an information retrieval setting. This includes compressed representations, efficient and scalable algorithms, and evaluation with a specific use case for increasing the efficiency of a distributed and parallel search engine. Furthermore, it addresses the evaluation of document cluster quality for large scale collections containing millions to billions of documents, where complete labeling of a collection is impractical. The cluster hypothesis from information retrieval is also tested using several different approaches throughout the thesis. This thesis also introduces new clustering algorithms that can be applied to problems outside the information retrieval domain.

I briefly introduce the five main areas relating to the thesis. These are document clustering, representations, algorithms, evaluation and information retrieval. A more detailed introduction is contained in the high level overview, background and research questions. The results are then presented as publications for each research area.


1.1 Document Clustering

Document clustering analyses written language in unstructured text to place documents into topically related groups or clusters. Documents such as web pages are automatically grouped together so that pages talking about the same concepts are in the same cluster and those talking about different concepts are in different clusters. This is performed in an unsupervised manner where there is no manual labeling of the documents for these concepts, topics or other semantic information. All semantic information is derived from the documents themselves. The core concept that allows this to happen is the definition of a similarity between two documents. An algorithm uses this similarity measure and optimizes it so that the most similar documents are placed together.

1.2 Representations

Documents are often represented using high dimensional vectors where each term in the vocabulary of the collection represents a dimension. This allows the definition of similarity between documents using geometric measures such as Euclidean distance and cosine similarity. This thesis explores the use of TopSig document signatures for document clustering. TopSig produces a compressed representation of document vectors using random projections and numeric quantization. These document signatures preserve the topology of the document vectors. The properties of the topology of these document signatures have been investigated. TopSig vectors are represented as dense bit strings or bit vectors that can be operated on 64 dimensions at a time using modern 64-bit digital processors. This leads to gains in computational efficiency when comparing entire document vectors.
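To make the efficiency argument concrete, the following is a minimal sketch, not the TopSig implementation itself, of comparing two signatures stored as 64-bit words: one XOR and one population count per word compares 64 dimensions at once. The signature values are illustrative assumptions.

```python
def hamming_distance(sig_a: list[int], sig_b: list[int]) -> int:
    """Hamming distance between two signatures packed into 64-bit words."""
    distance = 0
    for word_a, word_b in zip(sig_a, sig_b):
        # XOR marks differing bits; bit_count() (Python 3.10+) tallies them,
        # comparing 64 dimensions per word on a 64-bit processor.
        distance += (word_a ^ word_b).bit_count()
    return distance

# Two illustrative 128-bit signatures, two 64-bit words each.
a = [0xFFFF0000FFFF0000, 0x0123456789ABCDEF]
b = [0xFFFF0000FFFF00FF, 0x0123456789ABCDEF]
print(hamming_distance(a, b))  # -> 8
```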

1.3 Algorithms

Clustering is often thought of as unfeasible for large scale document collections. Therefore, the ability of clustering algorithms to scale to large collections is of particular interest. Algorithms with an O(n^2) time complexity pose problems when scaling to collections containing millions to billions of documents. This thesis explores the scalability of clustering algorithms in the setting of information retrieval.

A novel variant of the k-means algorithm is presented that operates directly with TopSig bit vectors. It allows a one to two order of magnitude performance increase over traditional sparse vector approaches. It does this without reducing the quality of the clusters produced. The K-tree algorithm uses the k-means algorithm to perform splits in its tree structure. Building upon this result, the TopSig K-tree document clustering approach can scale to clustering 50 million documents, into over one hundred thousand clusters, on a single machine, in a single thread of execution, in 10 hours. Clustering of documents at this scale has not previously been reported in the literature.
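As a rough sketch of the idea, here is a toy k-means over binary signatures, assuming Hamming distance for assignment and a per-bit majority vote to keep centroids in signature space. These are natural choices for illustration only; the exact update rule used in this thesis is defined in the TopSig paper.

```python
import random

def bit_kmeans(signatures: list[int], k: int, bits: int, iters: int = 10) -> list[int]:
    """Toy k-means over binary signatures represented as Python ints."""
    centroids = random.sample(signatures, k)
    for _ in range(iters):
        # Assign each signature to the centroid at smallest Hamming distance.
        clusters: list[list[int]] = [[] for _ in range(k)]
        for sig in signatures:
            nearest = min(range(k), key=lambda c: (sig ^ centroids[c]).bit_count())
            clusters[nearest].append(sig)
        # Update each centroid by per-bit majority vote over its members.
        for c, members in enumerate(clusters):
            if not members:
                continue
            centroid = 0
            for b in range(bits):
                ones = sum((m >> b) & 1 for m in members)
                if 2 * ones > len(members):
                    centroid |= 1 << b
            centroids[c] = centroid
    return centroids
```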

The EM-tree algorithm is introduced and it is shown to have further computational advantages over the K-tree when using the bit vectors produced by TopSig. Furthermore, it is much more amenable to parallel and distributed implementations due to the batch nature of its optimization process. The convergence of this new algorithm has been proven. The EM-tree also has advantages over previous similar tree structured clustering algorithms when the data set being clustered exceeds the size of main memory and must be streamed from disk.

1.4 Evaluation

Category based document clustering evaluation does not have a specific use case. Clusters are compared to a known solution referred to as the ground truth. The ground truth is determined by humans who assign each document to one or more classes based on the topics they believe exist in a document. In this thesis, I offer a novel and alternative evaluation using ad hoc relevance judgments from information retrieval evaluation forums. This approach is motivated by the cluster hypothesis, which states that documents relevant to information needs tend to be more similar to each other than non-relevant documents – the relevant documents tend to cluster together. Therefore, if the cluster hypothesis holds, relevant documents for a given query will tend to appear in relatively few document clusters. This evaluation tests document clustering for a specific use case. It evaluates how effective a clustering is at increasing the throughput of an information retrieval system by only searching a subset of a document collection using collection selection. Using ad hoc relevance judgments for evaluation overcomes issues with labeling document collections containing millions to billions of documents. This has already been addressed for the evaluation of ad hoc information retrieval via the use of pooling. Furthermore, I suggest that evaluation of document clustering via the cluster hypothesis from information retrieval allows for a general evaluation of clustering. If the cluster hypothesis does not hold for a particular clustering, then the clustering is not likely to be useful in other contexts besides ad hoc information retrieval.

Evaluation of document clustering using ad hoc information retrieval has taken place at the INEX XML Mining track in 2009 and 2010. It is a collaborative forum where different groups of researchers work on different approaches to classification and clustering of semi-structured XML documents. In 2010, a new approach to extracting category information from the Wikipedia was introduced that finds categories using shortest paths through the noisy and sometimes nonsensical category graph. It is motivated by Occam's Razor, where the simplest solution is often the best. In this case, the shortest paths through the category structure are simplest. The Normalized Cumulative Cluster Gain evaluation measure was introduced in this forum. It ranks clusters in an optimal order using ad hoc relevance judgments from the ad hoc track at INEX. Therefore, it is referred to as an "oracle" collection selection approach. It represents an upper bound on ranking of clusters for collection selection given a clustering and a set of queries and their relevance judgments. This evaluation indicated that as the number of clusters of a collection increased, less of the collection had to be searched to retrieve relevant documents.
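As a simplified illustration of the oracle ranking idea, not the NCCG formula itself, which is defined formally in the overview papers, one can rank clusters by how many relevant documents they contain for a query and measure how much of the collection must be searched to cover them all:

```python
def oracle_fraction_searched(clusters: list[set[str]], relevant: set[str]) -> float:
    """Fraction of the collection an oracle searches to cover all relevant
    documents, visiting clusters in order of relevant-document count."""
    total = sum(len(c) for c in clusters)
    ranked = sorted(clusters, key=lambda c: len(c & relevant), reverse=True)
    searched, found = 0, set()
    for cluster in ranked:
        searched += len(cluster)
        found |= cluster & relevant
        if relevant <= found:
            break
    return searched / total

# With fine grained clusters that concentrate the relevant documents,
# the fraction is small; a single giant cluster yields a value near 1.0.
```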

The divergence from a random baseline meta-evaluation measure was introduced to deal with ineffective clusterings. It takes any existing measure of cluster quality, internal or external, and adjusts it for random chance by constructing a random baseline that has the same cluster size distribution as the clustering being evaluated. Each clustering being evaluated has its own baseline where the cluster size distribution matches. Ineffective clusterings exploit a particular cluster quality measure to assign high quality scores to clusterings that are of no particular use. Take the NCCG measure as an example. If one cluster contains almost the entire document collection and every other cluster contains a single document, then the one large cluster will likely contain all the relevant documents and always be ranked first by the "oracle" collection selection approach. Therefore, it achieves the highest score possible, but the entire collection still has to be searched. A random baseline with the same cluster size distribution will achieve the same score. Random guessing performs no useful work, and therefore this ineffective clustering is assigned a score of zero. When adjusting the internal measure of cluster quality, RMSE, or distortion, the divergence from random baseline approach provides a clear optimum for the number of clusters with respect to RMSE that is not apparent otherwise.
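A minimal sketch of the adjustment follows, assuming `score_fn` is any cluster-quality measure mapping a clustering to a number; the published measure works in the same spirit.

```python
import random

def divergence_from_random(score_fn, clustering: list[list[str]], trials: int = 10) -> float:
    """Subtract from a clustering's score the mean score of random baselines
    that share its exact cluster size distribution (a sketch of the idea)."""
    documents = [d for cluster in clustering for d in cluster]
    sizes = [len(c) for c in clustering]
    baseline_total = 0.0
    for _ in range(trials):
        shuffled = random.sample(documents, len(documents))
        buckets, i = [], 0
        for size in sizes:  # same size distribution as the clustering
            buckets.append(shuffled[i:i + size])
            i += size
        baseline_total += score_fn(buckets)
    return score_fn(clustering) - baseline_total / trials
```

A clustering that merely exploits the size distribution, like the one-giant-cluster example above, scores the same as its baseline and is adjusted to roughly zero.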

1.5 Information Retrieval

Web search engines have to deal with some of the largest data sets available. Furthermore, they are frequently updated and have a high query load. As document clustering places similar documents in the same cluster, the cluster hypothesis supports the notion that only a small fraction of document clusters need to be searched to fulfill a user's information need when submitting a query to a search engine. Therefore, clustering of large document collections distributed over many machines can be exploited, so that only a fraction of the collection has to be searched for a given query. The query can be dispatched to machines containing relevant parts of the index for a particular query. Within each individual machine, clustering can reduce the number of documents that must be searched. Furthermore, fine grained clustering allows the cluster hypothesis to be better exploited in terms of efficiency in a search engine. Using the TopSig K-tree and CBM625 cluster ranking approach outlined in this thesis, only 0.1% of the 50 million document ClueWeb 2009 Category B collection has to be searched. This is a 13 fold improvement over the previously best known approach.
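The following sketches generic query-time collection selection: rank clusters by the similarity of their centroids to the query and search only the top few. The centroid representation and dot-product scoring are assumptions for illustration; the CBM625 ranking function used in this thesis is defined in the distributed information retrieval paper (Chapter 15).

```python
def select_clusters(query: dict[str, float],
                    centroids: list[dict[str, float]],
                    top_m: int = 3) -> list[int]:
    """Return the ids of the top_m clusters whose centroids best match
    the query, so only those clusters' documents are searched."""
    def dot(u: dict[str, float], v: dict[str, float]) -> float:
        return sum(weight * v.get(term, 0.0) for term, weight in u.items())
    ranked = sorted(range(len(centroids)),
                    key=lambda c: dot(query, centroids[c]),
                    reverse=True)
    return ranked[:top_m]
```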

1.6 How To Read This Thesis

This thesis is presented as a thesis by publication. One of the first things you may notice is repeated material between papers. Each paper has been published with relevant material for completeness. This way a reader does not have to refer to several different papers to get the whole picture for the results presented. Therefore, as you encounter repeated material, I encourage you to skip those parts if you have already read them.

The overall structure of the contributions is described in the high level overview section of the thesis in Chapter 2. The aim of this section is to allow the reader to piece together a high level overview of all the contributions in the thesis and how they fit together and build upon or support each other. The problems of knowledge representation, learning algorithms, their evaluation in an information retrieval context and applying them to ad hoc information retrieval are all highly inter-related. Each of the problems feeds into problems in different areas.

1.7 Outline

Each of the major areas of contribution is introduced. These are representations, algorithms, evaluation, and information retrieval. The relevant parts of each paper are highlighted in Chapters 5 through 8.

Conversely, each paper is introduced and it is explained how it fits into the topics of representations, algorithms, evaluation, and information retrieval in Chapters 9 through 16.

I hope that this will allow you to understand the contributions from both points of view: from the abstract concepts of representations, algorithms, evaluation, and information retrieval, and also from their concrete instantiations in each of the papers and the associated open source software, which often touch multiple abstract concepts. This way you can have your cake and eat it too.

In Chapter 2, the overall structure of the contributions is mapped out.

In Chapter 3, relevant research and background information related to the thesis is presented.

In Chapter 4, the research questions addressed by the research are defined. This chapter also provides a more detailed overview of how all the publications fit together.

In Chapter 5, the sections of the papers relevant to representations are highlighted.

In Chapter 6, the sections of the papers relevant to algorithms are highlighted.

In Chapter 7, the sections of the papers relevant to evaluation are highlighted.

In Chapter 8, the sections of the papers relevant to information retrieval are highlighted.

In Chapters 9 to 16, each paper is introduced and the reverse mapping of the previous four chapters is explained: how each paper maps onto the concepts of representations, algorithms, evaluation, and information retrieval.

In Chapter 17, the thesis is concluded and the implications of the research are discussed. The potential directions for further research are highlighted.


Chapter 2
High Level Overview

This thesis presents many inter-related results that build upon each other. New document representations are used for clustering. New clustering algorithms are used to improve ad hoc search. The INEX evaluation motivated the use of fine grained clusters for ad hoc search. A map of the contributions of this thesis is contained in Figure 2.1. Each arrow indicates that the following contribution builds upon the results of the previous contribution. Additionally, the relevant papers are listed in the bottom right of the relevant contribution. These correspond to the numbers listed in the bibliography. I highlight some examples from the map in the following paragraphs.

Figure 2.1: A Map of Contributions in this Thesis

The TopSig representation was built upon random projections, which led to the TopSig k-means algorithm. Document clusters produced by TopSig k-means were used to evaluate the cluster hypothesis for ad hoc retrieval in the INEX evaluation forum. The results of the INEX evaluation indicated that fine grained clusters supported better ranking of relevant documents in clusters. Meanwhile, the TopSig K-tree and EM-tree algorithms were devised for scaling clustering of 50 million documents into hundreds of thousands of clusters in a single thread of execution. The combined result of the INEX evaluation and scalable clustering approaches fed into the CBM625 approach to cluster-based retrieval. This led to a 13 fold decrease in the number of documents retrieved in a cluster-based search engine over the previous best known result on the 50 million document ClueWeb09 Category B test collection.

The EM-tree algorithm provides further opportunities for efficiency and scaling of algorithms when using the TopSig representation. The EM-tree algorithm can optimize any cluster tree given as input and therefore was combined with sampling and the TSVQ algorithm to produce lower distortion clusters in less time than the TopSig K-tree approach. Furthermore, the batch optimization of a cluster tree makes the EM-tree algorithm particularly suitable for distributed, streaming and parallel implementations. Based upon the results of the single threaded implementation, it is expected that a streaming, distributed and parallel implementation of the EM-tree algorithm will be able to cluster the entire searchable Internet of 45 billion documents into millions of clusters. As this scale of document clustering has not previously been possible, there are many unknown applications for clustering of the entire searchable Internet.

2.1 Publication Information

This thesis presents eight papers. Six conference or workshop papers have been peer reviewed and accepted for publication. Two journal articles have been submitted for peer review. The papers are listed below in chronological order.


Overview of the INEX 2009 XML Mining track: Clustering and Classification of XML Documents [171]

– Topics → Evaluation
– Chapter 9
– R. Nayak, C.M. De Vries, S. Kutty, S. Geva, L. Denoyer, and P. Gallinari.

This paper appears in the proceedings of the INEX workshop in the Springer Lecture Notes in Computer Science. The paper was peer reviewed. It introduces the use of ad hoc relevance judgments for evaluating document clustering in the context of its use.

Clustering with Random Indexing K-tree and XML Structure [59]

– Topics → Representations, Algorithms
– Chapter 10
– C.M. De Vries, L. De Vine, and S. Geva.

This paper appears in the proceedings of the INEX workshop in the Springer Lecture Notes in Computer Science. The paper was peer reviewed. It introduces approaches to clustering documents using XML structure and content, combining Random Indexing and K-tree.

Overview of the INEX 2010 XML Mining track: Clustering and Classification of XML Documents [62]

– Topics → Evaluation
– Chapter 11
– C.M. De Vries, R. Nayak, S. Kutty, S. Geva, and A. Tagarelli.

This paper appears in the proceedings of the INEX workshop in the Springer Lecture Notes in Computer Science. The paper was peer reviewed. It builds upon the evaluation of document clustering using ad hoc relevance judgments in the previous overview paper.

TopSig: topology preserving document signatures [90]

– Topics → Representations, Algorithms
– Chapter 12
– S. Geva, and C.M. De Vries.

This paper appears in the proceedings of the 20th ACM Conference on Information and Knowledge Management. The paper was peer reviewed. It introduces the TopSig representation and applies it to ad hoc retrieval and document clustering.

Document Clustering Evaluation: Divergence from a Random Baseline [60]

– Topics → Evaluation
– Chapter 13
– C.M. De Vries, S. Geva, and A. Trotman.

This paper appears in the proceedings of the Workshop Information 2012 held at the Technical University of Dortmund. The paper was peer reviewed. It investigates in detail an evaluation approach that was introduced at INEX to differentiate pathological clusterings.

Pairwise Similarity of TopSig Document Signatures [58]

– Topics → Representations
– Chapter 14
– C.M. De Vries, and S. Geva.

This paper appears in the proceedings of the Seventeenth Australasian Document Computing Symposium in 2012, held at the University of Otago in New Zealand. The paper was peer reviewed. It investigates the topology of TopSig document signatures as a means to explain their performance characteristics as a retrieval model.

Distributed Information Retrieval: Collection Distribution, Selection and the Cluster Hypothesis for Evaluation of Document Clustering [61]

– Topics → Algorithms, Information Retrieval
– Chapter 15
– C.M. De Vries, S. Geva, and A. Trotman.

This journal article has been submitted to the ACM Transactions on Information Systems for peer review. It investigates using scalable clustering algorithms to produce fine grained clusters of web scale document collections to reduce the number of documents that must be inspected by an information retrieval system.

The EM-tree Algorithm [55]

– Topics → Algorithms
– Chapter 16
– C.M. De Vries, L. De Vine, and S. Geva.

This journal article has been submitted to the Journal of Machine Learning Research for peer review. It introduces the EM-tree algorithm and presents a proof of its convergence. It demonstrates improvements over K-tree in clustering document signatures in terms of quality and run-time efficiency.


Chapter 3
Background

This chapter introduces document clustering, clustering algorithms, ad hoc information retrieval and the cluster hypothesis. These are the primary areas of research investigated in this thesis. For an extensive review of clustering algorithms see "Pattern Classification" by Duda et al. [74]. For an extensive review of information retrieval, including document clustering, ad hoc information retrieval, the cluster hypothesis and their evaluation, see "Introduction to Information Retrieval" by Manning et al. [163].

3.1 Document Clustering

Document clustering is used in many different contexts, such as exploration of structure in a document collection for knowledge discovery [214], dimensionality reduction for other tasks such as classification [136], clustering of search results for an alternative presentation to the ranked list [102] and pseudo-relevance feedback in retrieval systems [145].

Recently there has been a trend towards exploiting semi-structured documents [172, 67]. This uses features such as the XML tree structure and hyperlink graphs to derive data from documents to improve the quality of clustering.

Document clustering groups documents into topics without any knowledge of the category structure that exists in a document collection. All semantic information is derived from the documents themselves. It is often referred to as unsupervised clustering. In contrast, document classification is concerned with the allocation of documents to predefined categories where there are labeled examples. Classification is referred to as supervised learning, where a classifier is learned from labeled examples and used to predict the classes of unseen documents.


The goal of clustering is to find structure in data to form groups. As a result, there are many different models, learning algorithms, encodings of documents and similarity measures. Many of these choices lead to different induction principles [77], which result in discovery of different clusters. An induction principle is an intuitive notion as to what constitutes groups in data. For example, algorithms such as k-means [157] and Expectation Maximization [64] use a representative based approach to clustering where a prototype is found for each cluster. These prototypes are referred to as means, centers, centroids, medians and medoids [77]. A similarity measure is used to compare the representatives to the examples being clustered. These choices determine the clusters discovered by a particular approach.

A popular model for learning with documents is the Vector Space Model (VSM) [198]. Each dimension in the vector space is associated with one term in the collection. Term frequency statistics are collected by parsing the document collection and counting how many times each term appears in each document. This is supported by the distributional hypothesis [101] from linguistics, which theorizes that words that occur in the same context tend to have similar meanings. If two documents use a similar vocabulary and have similar term frequency statistics, then they are likely to be topically related. The result is a high dimensional, sparse document-by-term matrix with properties that can be explained by Zipf distributions [254] in term occurrence. The matrix represents a document collection where each row is a document and each column is a term in the vocabulary. In the clustering process, document vectors are often compared using the cosine similarity measure. The cosine similarity measure has two properties that make it useful for comparing documents. Document vectors are normalized to unit length when they are compared. This normalization is important since it accounts for the higher term frequencies that are expected in longer documents. The inner product that is used in computing the cosine similarity has non-zero contributions only from words that occur in both documents. Furthermore, the sparse document representation allows for efficient computation.
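As a small illustration of these two properties, here is cosine similarity over sparse term-weight vectors, with dictionaries standing in for sparse rows of the document-by-term matrix; the example documents are made up.

```python
from math import sqrt

def cosine(u: dict[str, float], v: dict[str, float]) -> float:
    """Cosine similarity of two sparse term vectors: only shared terms
    contribute to the inner product, and the norms correct for length."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm = sqrt(sum(w * w for w in u.values())) * sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

d1 = {"cluster": 3.0, "document": 2.0, "search": 1.0}
d2 = {"cluster": 1.0, "search": 2.0, "ranking": 4.0}
print(cosine(d1, d2))  # only "cluster" and "search" contribute
```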

Different approaches exist to weight the term frequency statistics contained in the document-by-term matrix. The goal of this weighting is to take into account the relative importance of different terms, and thereby facilitate improved performance in common tasks such as classification, clustering and ad hoc retrieval. Two popular approaches are TF-IDF [196] and BM25 [190, 235].

Clustering algorithms can be characterized by two properties. The first determines whether cluster membership is discrete. Hard clustering algorithms assign each document to exactly one cluster. Soft clustering algorithms assign documents to one or more clusters in varying degrees of membership. The second determines the structure of the clusters found as being either flat or hierarchical. Flat clustering algorithms produce a fixed number of clusters with no relationships between the clusters. Hierarchical approaches produce a tree of clusters, starting with the broadest level clusters at the root and the narrowest at the leaves.

K-means [157] is one of the most popular learning algorithms for use with document clustering and other clustering problems. It has been reported as one of the top 10 algorithms in data mining [239]. Despite research into many other clustering algorithms, it is often the primary choice for practitioners due to its simplicity [98] and quick convergence [6]. Other hierarchical clustering approaches such as repeated bisecting k-means [210], K-tree [56] and agglomerative hierarchical clustering [210] have also been used. Note that the K-tree paper is by the author of this thesis and other related content can be found in the associated Masters thesis [52]. Further methods such as graph partitioning algorithms [120], matrix factorization [241], topic modeling [25] and Gaussian mixture models [64] have also been used.

The k-means algorithm [157] uses the vector space model by iteratively optimizing k centroid vectors, which represent clusters. These clusters are updated by taking the mean of the nearest neighbors of the centroid. The algorithm proceeds to iteratively optimize the sum of squared distances between the centroids and the nearest neighbor set of vectors (clusters). This is achieved by iteratively updating the centroids to the cluster means and reassigning nearest neighbors to form new clusters, until convergence. The centroids are initialized by selecting k vectors from the document collection uniformly at random. It is well known that k-means is a special case of Expectation Maximization with hard cluster membership and isotropic Gaussian distributions [182].

The k-means algorithm has been shown to converge in a finite amount of time [203], as each iteration of the algorithm visits a possible permutation without revisiting the same permutation twice, leading to a worst case analysis of exponential time. Arthur et al. [6] have performed a smoothed analysis to explain the quick convergence of k-means theoretically. This is the same analysis that has been applied to the simplex algorithm, which has an exponential worst case complexity but usually converges in linear time on real data. While there are point sets that can force k-means to visit every permutation, they rarely appear in practical data. Furthermore, most practitioners limit the number of iterations k-means can run for, which results in linear time complexity for the algorithm. While the original proof of convergence applies to k-means using squared Euclidean distance [203], newer results show that other similarity measures from the Bregman divergence class of measures can be used with the same complexity guarantees [14]. This includes similarity measures such as KL-divergence, logistic loss, Mahalanobis distance and Itakura-Saito distance. Ding and He [72] demonstrate the relationship between k-means and Principal Component Analysis. PCA is usually thought of as a matrix factorization approach for dimensionality reduction whereas k-means is considered a clustering algorithm. It is shown that PCA provides a solution to the relaxed k-means problem by ignoring some constants, thus formally creating a link between k-means and matrix factorization methods.

3.1.1 Evaluation

Evaluating document clustering is a difficult task. Intrinsic or internal measures of quality such as distortion [179] or log likelihood [22] only indicate how well an algorithm optimized a particular representation. Intrinsic comparisons are inherently limited by the given representation and are not comparable between different representations. Extrinsic or external measures of quality compare a clustering to an external knowledge source such as a ground truth labeling of the collection or ad hoc relevance judgments. This allows comparison between different approaches. Extrinsic views of truth are created by humans and suffer from the tendency for humans to interpret document topics differently. Whether a document belongs to a particular topic or not can be subjective. To further complicate the problem, there are many valid ways to cluster a document collection. It has been noted that clustering is ultimately in the eye of the beholder [77].

Most of the current literature on clustering evaluation utilizes the classes-to-clusters evaluation, which assumes that the classes of the documents are known. Each document has known category labels. Any clustering of these documents can be evaluated with respect to this predefined classification. It is important to note that the class labels are not used in the process of clustering, but only for the purpose of evaluation of the clustering results.

Evaluation Metrics

When comparing a cluster solution to a labeled ground truth, the standard measures of Purity, Entropy, NMI and F1 are often used to determine the quality of clusters with regard to the categories. Let $\omega = \{w_1, w_2, \ldots, w_K\}$ be the set of clusters for the document collection $D$ and $\xi = \{c_1, c_2, \ldots, c_J\}$ be the set of categories. Each cluster and category is a subset of the document collection, $\forall c \in \xi, w \in \omega : c, w \subset D$. Purity assigns a score based on the fraction of a cluster that is the majority category label,

$$\operatorname*{arg\,max}_{c \in \xi} \frac{|c \cap w_k|}{|w_k|}, \tag{3.1}$$

in the interval $[0, 1]$ where 0 is the absence of purity and 1 is total purity. Entropy defines a probability for each category and combines them to represent order within a clustering,

$$-\frac{1}{\log J} \sum_{j=1}^{J} \frac{|c_j \cap w_k|}{|w_k|} \log \frac{|c_j \cap w_k|}{|w_k|}, \tag{3.2}$$

which falls in the interval $[0, 1]$ where 0 is total order and 1 is complete disorder. F1 identifies a true positive ($tp$) as two documents of the same category in the same cluster, a true negative ($tn$) as two documents of different categories in different clusters, a false negative ($fn$) as two documents of the same category in different clusters, and a false positive ($fp$) as two documents of different categories in the same cluster. The score combines these classification judgments using the harmonic mean,

$$\frac{2 \times tp}{2 \times tp + fn + fp}. \tag{3.3}$$

The Purity, Entropy and F1 scores assign a score to each cluster, which can be micro or macro averaged across all the clusters. The micro average weights each cluster by its size, giving each document in the collection equal importance in the final score. The macro average is simply the arithmetic mean, ignoring the size of the clusters. NMI makes a trade-off between the number of clusters and quality in an information theoretic sense. For a detailed explanation of these measures please consult [163].
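As a small worked sketch of Equations 3.1 and 3.2, with sets of document ids standing in for clusters and categories (the example values are illustrative):

```python
from math import log

def purity(cluster: set[str], categories: list[set[str]]) -> float:
    """Eq. (3.1): fraction of the cluster covered by its majority category."""
    return max(len(c & cluster) for c in categories) / len(cluster)

def entropy(cluster: set[str], categories: list[set[str]]) -> float:
    """Eq. (3.2): entropy of category membership within the cluster,
    normalized by log J so the result lies in [0, 1]."""
    total = 0.0
    for c in categories:
        p = len(c & cluster) / len(cluster)
        if p > 0:
            total += p * log(p)
    return -total / log(len(categories))

cats = [{"d1", "d2", "d3"}, {"d4", "d5"}]   # two categories
w = {"d1", "d2", "d4"}                      # one cluster to evaluate
print(purity(w, cats))   # -> 0.666..., the majority category covers 2 of 3
print(entropy(w, cats))  # -> ~0.918, a fairly disordered cluster
```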

Single and Multi Label Evaluation

Both the clustering approaches and the ground truth can be single or multi label. Examples of algorithms that produce multi label clusterings are soft or fuzzy approaches such as fuzzy c-means [21], Latent Dirichlet Allocation [25] and Expectation Maximization [64]. A ground truth is multi label if it allows more than one category label for each document. Any combination of single or multi label clusterings and ground truths can be used for evaluation. However, it is only reasonable to compare approaches using the same combination of single or multi label clustering and ground truth. Multi label approaches are less restrictive than single label approaches as documents can exist in more than one category. There is redundancy in the data whether it is a clustering or a ground truth. A ground truth can be viewed as a clustering and compared to another ground truth to measure how well the ground truths fit each other.


3.2 Document Representations

There are many different techniques that affect the generation of representations

for clustering. By pre-processing documents, using new representations for dif-

ferent sources of information, and improving the representation of knowledge

for learning, both the quality of clusters discovered and the run-time cost of

producing clusters can be improved.

When representing natural language, different components of the language

lead to different representations. Popular models include the bag-of-words ap-

proach [210], n-grams [167], part of speech tagging [3], and semantic spaces

[159].

Before any term based representation can be built, what constitutes a term

must be determined via a number of factors such as term tokenization, conver-

sion of words to their stems [181], removal of stop words [82], and conversion of characters such as removing accents. Additionally, automated techniques may be used to remove irrelevant boilerplate content, such as the navigation structure of a web page, from documents [243].

There are many approaches to weight terms in the bag-of-words model.

Whissel and Clarke [234] investigate different weighting strategies for docu-

ment clustering. TF-IDF [196] and BM25 [190, 235] are two popular approaches to

weighting terms in the bag-of-words model.
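As an illustration, a minimal sketch of one common TF-IDF variant (raw term frequency multiplied by log inverse document frequency); the exact formulations in [196, 234] may differ:

```python
from collections import Counter
from math import log

def tfidf_vectors(docs):
    # docs: list of token lists; returns one {term: weight} dict per document
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: f * log(n / df[t]) for t, f in tf.items()})
    return vectors
```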

Links between documents can be represented for use in document clustering.

There are algorithms that work directly with the graph structure in an iterative

fashion to produce clusters [223]. Other approaches embed the graph structure

in a document representation such as vector space representations [187, 160].

The representations are then used with general clustering algorithms. The links

can also be weighted by popular algorithms from ad hoc information retrieval

such as PageRank [176] and HITS [126].

Other internal structural information about documents can be represented

implicitly or explicitly. One effective implicit approach is to double the weight

of title words in a bag-of-words representation [44]. Other approaches explicitly

represent structure, such as those that represent the XML structure of docu-

ments. For example, Kutty et al. [134] represent both content and structure

explicitly using a Tensor Space Model. Tagarelli et al. [213] combine structure

and content together using tree mining and modified similarity measures.

There are many dimensionality reduction techniques aimed at reducing the

number of dimensions required to represent a document in a vector space model.

They all aim to preserve the information available in documents. However, some

particular approaches such as Latent Semantic Analysis [63] based upon the

Singular Value Decomposition have been suggested to improve document repre-


sentations by finding “concepts” in text. Popular approaches include Principal Component Analysis [23], Singular Value Decomposition [92], Non-negative Matrix Factorization [241], the Wavelet transform [170], the Discrete Cosine Transform [83], Semantic Hashing [195], Locality Sensitive Hashing [108], Random Indexing [194], and Random Manhattan Indexing [184].

Random Indexing or Random Projections can be implemented using different

strategies. Achlioptas [1] introduced an approach that is friendly to database environments, using simpler uniform random distributions and ternary vec-

tors. This approach is also used by Random Indexing [194] and TopSig [90],

which is a contribution of this thesis. TopSig further compresses these represen-

tations by quantizing each dimension to a single bit. Kanerva [116] begins with

binary vectors and combines them into a dense binary representation. Plate

[180] uses holographic representations for nested compositional structure using

real valued vectors.
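A minimal sketch in the spirit of these approaches, assuming a sparse Achlioptas-style ternary code per term followed by sign quantization to one bit per dimension, as in TopSig; the dimensionality, sparsity and hashing scheme are illustrative assumptions, not the published implementation:

```python
import hashlib
import numpy as np

def ternary_code(term, dim=1024, nonzero=64):
    # deterministic sparse ternary code in {-1, 0, +1} for a term
    seed = int.from_bytes(hashlib.md5(term.encode()).digest()[:4], "little")
    rng = np.random.default_rng(seed)
    code = np.zeros(dim)
    positions = rng.choice(dim, size=nonzero, replace=False)
    code[positions] = rng.choice([-1.0, 1.0], size=nonzero)
    return code

def topsig_like_signature(term_weights, dim=1024):
    # project the weighted bag-of-words onto the random codes, then
    # quantize each dimension to a single bit by taking its sign
    projection = np.zeros(dim)
    for term, weight in term_weights.items():
        projection += weight * ternary_code(term, dim)
    return projection > 0  # boolean array, one bit per dimension
```

Because each term's code is derived deterministically from the term alone, each document can be encoded independently of the rest of the collection.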

With the rise of the social web, many web pages now have social annotation

such as tags. These can be introduced into a document representation to further

disambiguate the topic of a document [185].

Various other multimedia sources of information can be represented in doc-

uments such as audio [158], video [242] and images [153]. The processing of

these signals constitutes many research fields in themselves, but there have been approaches incorporating this information into the document clustering

process [18].

Documents can be mapped onto ontologies to provide a conceptual frame-

work for the representation of meaning in documents. This has been used to

improve the quality of document clusters [105].

Additionally, incorporating information from other category systems such as the Wikipedia has been shown to improve the quality of clusters [106].

Many approaches use a single additional source of information to improve document clustering via improved representation. However, there appear to be few approaches that combine multiple sources and compare how each contributes to the identification of clusters.

3.3 Ad Hoc Information Retrieval

Ad hoc information retrieval is concerned with the retrieval of documents given a

user query. The phrase ad hoc is of Latin origin and literally means “for this”.

Users create a query specific to their information need and retrieval systems

return relevant documents. Much of modern IR research has been concerned

with ranked retrieval of documents. Each document is assigned an estimate of

relevance for a given query and the results are displayed in descending order of


relevance.

Retrieval systems or search engines provide automated ad hoc retrieval. The

systems are typically implemented in software, but there have been hardware

based retrieval systems [233]. The inverted file is the data structure that under-

pins most modern retrieval systems [163]. Like an index at the back of a textbook, an inverted file maps words to instances where they appear. The inverted

file maps each term in the collection vocabulary to a postings list that identifies

documents containing the term and the frequency with which it occurs. Various

ranking functions re-weight query and document scores using the inverted file

to improve ad hoc retrieval effectiveness. Popular approaches include TF-IDF

[196], Language Models [247, 161], Divergence from Randomness [4] and Okapi

BM25 [189]. However, signature file based approaches have been popular in the

past [78] and have recently reappeared in the mainstream IR literature using

random projections or random indexing [90], which is a contribution of this

thesis. Signature files use a n bit binary string to represent documents and

queries which are compared using the Hamming distance. The motivation of

this approach is to exploit efficient bit-wise instructions on modern CPUs.
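For example, with signatures packed into 64-bit words, the Hamming distance is an XOR followed by a population count per word; a minimal sketch (int.bit_count requires Python 3.10+):

```python
def hamming_distance(sig_a, sig_b):
    # signatures as equal-length sequences of 64-bit integer words;
    # XOR exposes the differing bits, bit_count() counts them (popcount)
    return sum((a ^ b).bit_count() for a, b in zip(sig_a, sig_b))

# e.g. two 128-bit signatures stored as two 64-bit words each
assert hamming_distance([0b1010, 0x0], [0b0110, 0x1]) == 3
```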

The study of retrieval models, most commonly realized as different similarity functions, focuses on the quality aspects of information retrieval systems: how the quality of a search engine's results can be improved with respect to human judgement. In contrast, indexing implementations focus on efficiency, and approximate indexing makes a trade-off between the quality of search results and efficiency.

Document signatures have been missing from mainstream publications about

search engines and the Information Retrieval field in general for many years. Zo-

bel et al. clearly demonstrate the inferior performance of traditional document

signatures in their paper, “Inverted Files Versus Signature Files for Text In-

dexing” [255]. They conclude that signature files or document signatures are

slower, offer less functionality, and require more storage. Witten [237] comes to

the same conclusion. Traditional signature based indexes are not competitive

with inverted files. They offer no distinct advantages over inverted indexes.

Traditional bit slice signature files [78] use efficient bit-wise operators. This

is presented in an ad hoc manner and is motivated by efficient bit-wise processing

without respect to Information Retrieval theory. Each document is allocated a signature of $N$ bits. Each term is assigned a random signature where only $n \ll N$ bits are set. These are combined using the bit-wise OR operation. This is a

Bloom filter [26] for the set of terms in a document. To avoid collisions resulting

in errors, the difference between $n$ and $N$ must be very large. In contrast, the

TopSig signatures [90] presented in this thesis compress the same representation

that underlies the inverted index, the document-by-term matrix. It first begins


with full precision vectors and compresses them via random projections and

numeric quantization. This is founded in the theory of preservation of the

topological relationships of low dimensional embeddings of vectors presented in

the Johnson-Lindenstrauss lemma [114].
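A minimal sketch of the traditional superimposed coding described above, with illustrative parameters; term bit positions are derived from a hash, and membership tests can yield false positives but never false negatives:

```python
import hashlib

def term_signature(term, N=4096, n=8):
    # set n pseudo-random bit positions out of N for this term
    sig = 0
    for i in range(n):
        digest = hashlib.md5(f"{term}:{i}".encode()).digest()
        sig |= 1 << (int.from_bytes(digest[:4], "little") % N)
    return sig

def document_signature(terms, N=4096, n=8):
    # superimposed coding: OR together the signatures of all terms
    sig = 0
    for term in terms:
        sig |= term_signature(term, N, n)
    return sig

def might_contain(doc_sig, term, N=4096, n=8):
    # Bloom filter test: false positives possible, false negatives impossible
    t = term_signature(term, N, n)
    return doc_sig & t == t
```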

Other recent approaches to similarity search [249] have focussed on map-

ping documents to N bit strings for comparison using the Hamming distance.

However, they have only focussed on short strings for document-to-document

search for tasks such as near duplicate detection or the “find other documents

like these” queries. They do not investigate use of such structures for ad hoc

information retrieval or clustering. Additionally they use machine learning ap-

proaches to find a better mapping for signatures, which may be computationally prohibitive for web scale document collections containing billions of documents.

3.3.1 Collection Distribution and Selection

A distributed information retrieval system consists of many different machines

connected via a data communications network working together as one to pro-

vide search services for documents stored on each of the machines.

In a distributed retrieval setting, several different information resources can

be stored in different retrieval systems and indexes. These resources are com-

bined to provide a single query interface. This often results in an uncooperative

distributed system. Uncooperative systems only allow use of the standard query

interface provided by each system. All necessary information about a system

that is acting as part of the distributed system has to be gained via these inter-

faces. The collection selection approach has to build an appropriate model by

querying the systems individually.

Alternatively, a large collection such as the World Wide Web must be distributed over more than one machine so it can be stored and processed. A large collection

such as this is often processed in the setting of a cooperative distributed infor-

mation retrieval system. Full access to the indexes of each system is possible.

This enables use of global statistics and any other approaches that require more

than the use of the standard query interface.

Collection distribution is the strategy that determines which documents are

assigned to which machine. Common approaches include random [128, 73],

source [240] and topic based allocation [128, 240]. A common criticism of topic

based allocation is that automated methods for document clustering are not

scalable for large collections.

Collection selection is the process that determines which machines to search

in a distributed system given a query. There are many approaches to collection

selection such as SHiRE [130], gGlOSS [94] and Cue-Validity Variance [246]


that use collection statistics to summarize the collection stored on a particular

machine.

3.3.2 Evaluation

Information retrieval evaluation allows comparative evaluation of retrieval sys-

tems. System effectiveness is quantified by one of two approaches. User based

evaluations measure user satisfaction with the system whereas system based

evaluation determines how well a system can rank documents with respect to

relevance judgments of users. The system based approach has become most

popular due to its repeatability in a laboratory setting and the difficulty and

cost of performing user based evaluations [229].

The Cranfield paradigm [42] is an experimental approach to the evaluation

of retrieval systems. It has been adapted over the years and provides the theory

that supports modern information retrieval evaluation efforts such as CLEF

[81], NTCIR [115], INEX [8] and TREC [39]. It is a system based approach

that evaluates how different systems rank relevant documents. For systems to

be compared, the same set of information needs and documents have to be used.

A test collection consists of documents, statements of information need, and

relevance judgments [229]. Relevance judgments are often binary. A document

is considered relevant if any of its contents can contribute to the satisfaction of

the specified information need. Information needs are also referred to as topics

and contain a textual description of the information need, including guidelines

as to what may or may not be considered relevant. Typically, only the keyword

based query of a topic is given to a retrieval system.

Many different measures exist for evaluating the distribution of relevant

documents in ranked lists. The simplest measure is precision at n (P@n), defined

as the fraction of the documents that are relevant in the first n documents in

the ranked list. A strong justification for the use of P@n is that it reflects a

real use case [256]. Most users only view the first few pages of results presented

by a retrieval system. On the web, users may not even view the entire first

page of results. Users reformulate their queries rather than continue searching

the result list. Zobel et al. [256] argue that recall does not reflect a user’s

need except in some specific circumstances such as medical and legal retrieval

that require exhaustive searching. Furthermore, achieving total recall requires

complete knowledge of the collection. The mean average precision measure or

MAP takes into account both precision and recall. It is the area under the

precision versus recall curve. Precision is the fraction of retrieved documents that are relevant at a given recall level, and a recall level is the fraction of all relevant documents that have been returned. As recall increases, precision


tends to drop since systems have a higher density of relevant results early in the

ranked list. This naturally is the goal of a search engine.
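As a sketch, both measures can be computed from a ranked list and a set of judged-relevant documents; average precision below is the common rank-based computation of the area under the precision versus recall curve:

```python
def precision_at_n(ranked, relevant, n):
    # P@n: fraction of the first n results that are judged relevant
    return sum(1 for d in ranked[:n] if d in relevant) / n

def average_precision(ranked, relevant):
    # mean of the precision values at every rank holding a relevant document
    hits, total = 0, 0.0
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    # runs: one (ranked_list, relevant_set) pair per query
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```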

The original Cranfield experiments required complete relevance judgments

for the entire collection. This approach rules out large collections due to the

necessity of exhaustive evaluation. On large scale collections with millions to

billions of documents it is not possible to evaluate the relevance of every single

document to every query. Therefore, an approach called pooling was developed

[209]. Each system in the experiment returns a ranked list of the first 1,000

to 2,000 documents in decreasing order of estimated relevance. A proportion of the highest ranked documents, often around 100, is taken from each system

and pooled together, and duplicates are removed since often there is a large

overlap between the results lists of different systems. These pooled results are

presented to the assessors in an unranked order. The assessors then determine

relevance on this relatively small pool.
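A minimal sketch of pool construction, assuming each run is a ranked list of document identifiers and the pool depth is illustrative:

```python
def build_pool(runs, depth=100):
    # runs: one ranked list of document identifiers per system;
    # the top `depth` documents from each run are merged and de-duplicated
    pool = set()
    for ranked in runs:
        pool.update(ranked[:depth])
    return pool  # presented to assessors in an unranked order
```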

There has been much debate and experimentation to determine the effect of

pooling. As test collections are designed to be reused, new approaches may lead

to the return of relevant documents that were not represented in the pool that

was assessed. These are un-judged relevant documents. Zobel [255] found this is

true for un-judged runs, but it does not unfairly affect a system’s performance on

TREC collections. However, thought is required when designing an information

retrieval evaluation experiment. The pool depth and diversity may need to

change based on the type and size of the collection. Furthermore, tests such as

those suggested by Zobel [255] can be carried out to determine the reliability

and reusability of a test collection.

Saracevic [200] performed a meta-analysis of judge consistency. They con-

cluded that consistency changes dramatically depending on the expertise of the

assessors. Assessors with higher expertise agreed more often on the relevance

of a document. Consistency also changed from experiment to experiment using

the same assessors. Cranfield based evaluation of information retrieval has been

scrutinized heavily over the years. However, assessor disagreement does not af-

fect this type of evaluation because the rank order of each retrieval system does

not often change when using different assessments from different judges. The

stability only applies when averaging retrieval system performance over a set of

queries. The retrieval quality can change drastically from query to query given

a different assessor.

3.4 Cluster Hypothesis

The cluster hypothesis connects ad hoc information retrieval and document

clustering. Manning et al. [163] state that “documents in the same cluster


behave similarly with respect to relevance to information needs”. If a document

from a cluster is relevant to a query, then it is likely other documents in the

same cluster are also relevant to the same query. This is due to the clustering

algorithm grouping documents with similar terms together. There may also

be some higher order correlation due to effects caused by the distributional

hypothesis [101] and limitation of analysis to the size of a document; i.e. words

are defined by the company they keep. It should be noted that the relevance of

a document is assumed in an unsupervised manner in an operational retrieval

system. While our evaluation of document clustering uses relevance judgments

from humans, these are not always available or are noisy as they are derived

from click through data.

The cluster hypothesis has been analyzed in three different forms. Origi-

nally the cluster hypothesis was tested at the collection level by clustering all

documents in the collection. It was then extended to cluster only search results

to improve display and ranking of documents. Finally, it has been extended to

the sub-document level to deal with the multi-topic nature of longer documents.

3.4.1 Measuring the Cluster Hypothesis

There have been direct and indirect approaches to measuring the cluster hypoth-

esis. Direct approaches measure similarity between documents to determine if

relevant documents are more similar than non-relevant documents. Indirect

measures use other IR related tasks such as ad hoc retrieval or document clus-

tering to evaluate the hypothesis. If the cluster hypothesis is true for ad hoc

search, then only a small fraction of the collection needs to be searched without degrading, and potentially even improving, the quality of search results. The

same idea applies to the spread of relevant documents over a clustering of a

collection. If the hypothesis is true, then the documents will only appear in a

few clusters.

Voorhees [227] proposed the nearest neighbor (NN) test. It analyzes the

five nearest neighbors of relevant documents according to a similarity measure.

It determines the fraction of relevant documents located nearby other relevant

documents. The NN test is comparable across different test collections and

similarity measures making it useful for comparative evaluations. Voorhees

[227] states that the test chose five nearest neighbors arbitrarily. If the number

of neighbors is increased, then the test becomes less local in nature. The NN

test overcame problems with the original cluster hypothesis test suggested by

[111]. The original test plots the similarity between all pairs of relevant–relevant

and relevant–non-relevant documents separately using a histogram approach.

It does so for each query and then the results are averaged over all queries. A


separation between the distributions indicates that the cluster hypothesis holds.

The criticism of this approach is that there will always be more relevant–non-

relevant pairs than relevant–relevant pairs. Voorhees [227] found that these

two different measures give very different pictures of the cluster hypothesis.

However, van Rijsbergen and Jones [225] found that the original measure was

useful for explaining the different performance of retrieval systems on different

collections.
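One possible formalization of the NN test as a sketch, assuming a pairwise similarity function over documents is given; the exact details in [227] may differ:

```python
def nn_test(relevant, collection, similarity, k=5):
    # for each relevant document, find its k nearest neighbors (excluding
    # itself) and record the fraction of them that are also relevant;
    # the test score is the average of these fractions
    fractions = []
    for doc in relevant:
        neighbors = sorted((d for d in collection if d != doc),
                           key=lambda d: similarity(doc, d),
                           reverse=True)[:k]
        fractions.append(sum(1 for d in neighbors if d in relevant) / k)
    return sum(fractions) / len(fractions)
```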

Smucker and Allan [207] introduce a global measure of the cluster hypothesis

based on the relevant document networks and measures from network analysis.

The NN test is a local measure of clustering as it only inspects the five nearest

neighbors of each relevant document. The measure proposed builds a directed,

weighted, fully connected graph between all relevant documents where vertices

are documents and edges represent the rank of the destination document when

the source document is used as a query. This represents all relationships be-

tween relevant documents and not just those that fall within the five nearest

as in the NN test. The normalized, mean reciprocal distance (nMRD) from

small world network analysis [143] is used to measure the global efficiency based

upon all shortest path distances between pairs of documents in the network.

Each measure of the cluster hypothesis allows for a different view, allowing for

multiple comparisons across multiple approaches to ranking.
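A sketch of the relevant document network and the mean reciprocal distance over its shortest paths, using networkx (assumed available) for the shortest path computation; the exact normalization of nMRD in [207] is omitted here and should be taken from the paper:

```python
import networkx as nx  # assumed available

def mean_reciprocal_distance(relevant, rank_of):
    # rank_of(a, b): rank of document b in the result list obtained by
    # using document a as the query (lower rank = stronger edge)
    g = nx.DiGraph()
    docs = list(relevant)
    for a in docs:
        for b in docs:
            if a != b:
                g.add_edge(a, b, weight=rank_of(a, b))
    # shortest path distances between all ordered pairs of relevant documents
    dist = dict(nx.all_pairs_dijkstra_path_length(g, weight="weight"))
    n = len(docs)
    return sum(1.0 / dist[a][b]
               for a in docs for b in docs if a != b) / (n * (n - 1))
```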

In the following sections, we review the measurement of the cluster hypoth-

esis via indirect measures.

3.4.2 Cluster Hypothesis in Ranked Retrieval

Many retrieval systems return a ranked list of documents in descending or-

der of estimated relevance. Clustering provides an alternative approach to the

organization of retrieved results that aims to reduce the cognitive load of analyz-

ing retrieved documents. It achieves this by clustering documents into topical

groups. The cluster hypothesis implies that most relevant documents will ap-

pear in a small number of clusters. A user can easily identify these clusters and

find most of the relevant results. Alternatively, clusters are used to improve

ranked results such as pseudo-relevance feedback and re-ranking using clusters.

Early studies in clustering ad hoc retrieval results by Croft [48] and Voorhees

[228] performed static clustering of the entire collection. More recent studies

found that clustering results for each query is more effective for improving the

quality of results [102, 150, 215, 145]. However, clustering of an entire docu-

ment collection allows for increased efficiency and its discussion has reappeared

recently in the literature [171, 62, 128].

Evaluation of the cluster hypothesis changed significantly after the result


of Hearst and Pedersen [102] that confirmed that the cluster hypothesis can

improve result ranking significantly. It did this by clustering results for each

query rather than entire collections. Hearst and Pedersen [102] revise the cluster

hypothesis. If two documents are similar with respect to one query, they are not necessarily similar with respect to another. As documents have very high dimensionality, the

definition of nearest neighbors changes depending on the query. Their system

dynamically clusters retrieval results for each query using the interactive Scat-

ter/Gather approach. The Scatter/Gather approach displays five clusters of the

top n documents retrieved by a system. A user selects one or more clusters indi-

cating that only those documents are to be viewed. This process can be applied

recursively, performing the Scatter/Gather process again on the selected sub-

set. The experiments using this system were performed on the TREC/Tipster

collection using the TREC-4 queries. The top ranked clusters always contained

at least 50 percent of the relevant documents retrieved. The results retrieved

by the Scatter/Gather system were shown to be statistically significantly better

than not using the system according to the t-test.

Crestani and Wu [47] investigate the cluster hypothesis in a distributed in-

formation retrieval environment. Even though there are many issues that intro-

duce noise when presenting results from heterogeneous systems in a distributed

environment, they demonstrate that clustering is still an effective approach for

displaying search results to users.

Lee et al. [145] present a cluster-based resampling method for pseudo-

relevance feedback. Previous attempts to use clusters for pseudo-relevance feed-

back have not resulted in statistically significant increases in retrieval effectiveness.

This is achieved by repeatedly selecting dominant documents, which is moti-

vated by boosting from machine learning and statistics, which repeatedly selects hard examples to move the decision boundary toward them. It uses

overlapping or soft clusters where documents can exist in more than one clus-

ter. A dominant document is one that participates in many clusters. Repeatedly

sampling dominant documents can emphasize the topics in a query and thus this

approach clusters the top 100 documents in the ranked list. The most relevant

terms are selected from the most relevant clusters and are used to expand the

original query. It was most effective for large scale document collections where

there are likely to be many topics present in the results for a query.

Re-ranking of the result list using clusters has been found to be effective using vector space approaches [147, 146], language models [156] and local score regularization [71]. In these approaches, the original ranked list is combined

with the rank of clusters, rather than just the original query-to-document score.

Kulkarni and Callan [128] investigate the use of topical clusters for efficient

and effective retrieval of documents. This work finds similar results to those


discovered at the INEX XML Mining track [171, 62] when clustering large doc-

ument collections, that relevant results for a given query cluster tightly and only

fall in a few topical document clusters. Kulkarni and Callan solve the scalability problem of clustering by sampling, using k-means with the Kullback-

Leibler divergence similarity measure. This work also implements and evaluates

a collection selection approach called ReDDE and finds that topical clusters

outperform random partitioning and that as little as one percent of a collection

needs to be searched to maintain early precision.

3.4.3 Sub-document Cluster Hypothesis in Ranked Retrieval

Lamprier et al. [138] investigate the use of sub-document segments or passages

to cluster documents in information retrieval systems. They extend the notion

of the cluster hypothesis to the sub-document level where different segments of

a document can have different topics that are relevant to different queries. If

the cluster hypothesis holds at the sub-document level, then relevant passages

will tend to cluster together. The approach proposed by Lamprier et al. [138]

returns whole documents but uses passages to cluster multi-topic documents

more effectively.

Lamprier et al. [138] suggest several different approaches to the segmentation

of documents into passages. The authors chose to select an approach based on

an evolutionary search algorithm that minimizes similarity between segments

called SegGen [137]. They dismissed using arbitrary regions of overlapping text

even though this was shown to be effective for passage retrieval by Kaszkiel

and Zobel [121]. The reasons given are added complexity and that only disjoint

subsets of the documents are appropriate in this case.

Lamprier et al. [138] performed experiments using sub-document clustering

on ad hoc retrieval results with the aim of improving the quality. The au-

thors did not consider sub-document clustering of the entire collection because

earlier studies suggested this was not effective for whole document clustering.

They found that sub-document clustering allowed better thematic clustering

that placed more relevant results together than whole document clustering.


Chapter 4

Research Questions

The research questions relating to the publications presented in this thesis are

broken into the topics of representations, algorithms, evaluation and information

retrieval.

4.1 Representations

The major research question for representations is, How can compressed

representations improve the scalability and efficiency of document

clustering?

How can compressed document representations be generated effi-

ciently and effectively for use with document clustering algorithms?

Document vectors are often represented for learning using sparse vector ap-

proaches. Dense document representations are typically more computationally

efficient when comparing entire documents. A dense representation that uses the

same number of bytes is often faster because the entire vector is allocated con-

tiguously, and therefore prefetching, branch prediction, and caches work more

effectively on modern CPUs. Additionally, dimensionality reduction may require fewer bytes to represent the same information. However, dense representations

such as those produced by dimensionality reduction techniques such as the Sin-

gular Value Decomposition, Principal Component Analysis and Non-Negative Matrix Factorization require many iterations of optimization. Furthermore,

they require the entire document-by-term matrix to be analyzed.

In this thesis, I have investigated the use of random projections or random

indexing with the combination of numeric quantization for document cluster-

ing. This resulted in the TopSig [90] document representation being used for


document clustering. It represents documents as binary vectors that are com-

pared using the Hamming distance. This allows the vectors to be processed

64 dimensions at a time in a single CPU instruction. Unlike iterative matrix

factorization approaches to dimensionality reduction such as SVD, PCA and

NMF, TopSig can encode documents as dense binary vectors without global

analysis of the document-by-term matrix. Each document can be mapped onto

a lower dimensional dense vector independently of all other documents in the

document-by-term matrix. TopSig provides state-of-the-art compression when

compared to inverted index approaches to compressing document vectors. Typ-

ically, document hashes or binary signatures are 64 or 128 bits in length and are

used for near duplicate detection. TopSig produces much longer signatures from

1024 to 4096 bits that allow it to be effective for document clustering. However,

this requires algorithms that work directly with the binary vectors produced by

TopSig and this problem is addressed by later research questions.

How can compressed document representations be generated for clus-

tering in isolation without using global statistics?

The use of random indexing allows document vectors to be produced using

only the term frequencies within a single document in the TopSig model. There

is some evidence to suggest that these vectors offer a trade-off between the cost of using global statistics and the quality of the representa-

tion. Any machine indexing documents can simply encode the document as a

binary vector and append it to a file, or store it with the object it represents.

The primary advantage of this approach is that a document signature can be

generated completely in isolation and compared using the universal Hamming

distance metric. Alternatively, statistics can be gathered from a smaller hetero-

geneous general knowledge document collection such as the Wikipedia. Terms

that do not appear in the Wikipedia can be assumed to only appear in the given

document for calculation of global statistics. This approach was undertaken to

index the 50 million document ClueWeb09 Category B document collection [61].

Can compressed document representations improve the efficiency of

document clustering?

Binary document vectors such as those produced by TopSig have been ex-

tensively used for nearest neighbor search. Many clustering algorithms perform

a nearest neighbor search as part of their optimization process. However, implementing these algorithms efficiently using binary vectors requires the intermediate representations used in the algorithms to also be represented as binary

vectors. These questions are answered in the algorithms section and have led

to a one to two orders of magnitude increase in clustering speed by using the


binary representations of TopSig [90] without a reduction in document cluster

quality. However, other algorithms have been developed that offer a further

trade-off between efficiency and cluster quality. The combination of the TopSig

representation and clustering algorithms is addressed in more detail in the al-

gorithms section of the research questions.

Why do TopSig signatures exhibit state-of-the-art performance at

early recall but not at deep recall?

The TopSig representation has been used for ad hoc information retrieval

and has been found to be competitive with state-of-the-art approaches at early

recall. However, at higher recall it suffers in comparison to probabilistic and

language models. This question has been answered by analyzing the topology

of the document vectors produced by TopSig [58]. The pairwise distances be-

tween document vectors were compared. The TopSig representation does bias

the distribution of random binary vectors as represented by the binomial distri-

bution, so that there is non-uniformity and therefore clustering of the feature

space. However, many documents become equidistant around the middle of the

distribution. It is the tail of the distribution that allows differentiation at early

recall. At higher recall most of the documents become equidistant and this is

often referred to as the “curse of dimensionality”.

How can the entity based XML structure of the INEX 2009 Wikipedia

improve the quality of document clustering?

The XML entity structure of the INEX 2009 Wikipedia was extracted using

a bag of features approach. The terms used to cluster documents were restricted

to those occurring inside the XML entity tags; i.e. the terms associated with

entities. Furthermore, the entity tags were represented as a bag-of-tags where

each tag represented the type of entity such as city or person. These document

vectors were represented using random indexing to produce dense document

vectors. They were then clustered using the random indexing K-tree [59]. Each

representation was tried individually and in combination. The

entity text approach proved most effective when using a category based evalua-

tion. The combination of entity text and tags proved most effective when using

ad hoc relevance judgements for evaluation.

4.2 Algorithms

The major research question for algorithms is, How can binary document

representations be exploited to increase the scalability and efficiency


of clustering algorithms?

How can document signatures be clustered efficiently?

One of the most popular clustering algorithms is the k-means algorithm that

finds clusters by iteratively updating prototypes that are called means, cen-

troids, cluster representatives or cluster centers. Comparison of binary vectors

is only efficient if the centroids are also binary vectors. Therefore, in the TopSig

paper [90] I introduced an algorithm inspired by k-means. Every centroid is

a binary vector of the same length as the signatures being clustered. Centroids are updated by taking the median of all vectors assigned to them. For binary valued vectors this is the most frequently occurring bit value, 0 or 1, in each dimension, computed over the nearest neighbors associated with the centroid. While there is no proof to show that this

approach will converge as is the case for k-means when using squared Euclidean

distance and updating of centroids as means, the algorithm has converged in

practice. The convergence can be seen in the EM-tree paper [55]. When using

4096-bit vectors the TopSig k-means approach that works directly with binary

vectors offers a 10 to 20 fold improvement in clustering speed. There is no statistically significant difference in the quality of the results.
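A sketch of this majority-bit update with signatures as NumPy boolean arrays; this is an illustrative rendering, not the thesis implementation, which packs bits into machine words:

```python
import numpy as np

def majority_bit_centroid(members):
    # members: (m, dim) boolean array of signatures assigned to one centroid;
    # the per-dimension median of binary values is the majority bit
    return members.sum(axis=0) * 2 >= members.shape[0]

def assign_to_centroids(signatures, centroids):
    # Hamming distance is the count of differing bits in each pair
    distances = (signatures[:, None, :] != centroids[None, :, :]).sum(axis=2)
    return distances.argmin(axis=1)

def binary_kmeans(signatures, k, iterations=10, seed=0):
    rng = np.random.default_rng(seed)
    centroids = signatures[rng.choice(len(signatures), size=k, replace=False)]
    for _ in range(iterations):
        labels = assign_to_centroids(signatures, centroids)
        centroids = np.stack([
            majority_bit_centroid(signatures[labels == j])
            if (labels == j).any() else centroids[j]
            for j in range(k)
        ])
    return labels, centroids
```

A production implementation would pack the bits into 64-bit words and use XOR and popcount instead of the memory-hungry boolean broadcast above.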

What are the trade-offs that can be made between the efficiency and

quality of clustering? Can efficiency be gained without sacrificing

quality?

A trade-off between cluster quality and speed can be made by using shorter

document signatures with 1024-bit signatures providing most of the quality of

4096-bit signatures. This is highlighted in the experiments in the TopSig paper

[90].

From a different perspective, algorithms that attempt to approximate the

k-means clustering algorithm such as the delayed update K-tree, the EM-tree

algorithm and TSVQ [55] can be used to make a trade-off between cluster qual-

ity and the efficiency of finding clusters. The clusters found by k-means and the delayed update K-tree show very little difference in terms of search quality [61], although there is a difference in RMSE [55]. The

TopSig EM-tree algorithm when using 4096-bit vectors starts with an RMSE of

approximately 1950 bits when producing 100 clusters. It converges to a solution

approximately 280 bits lower of 1675 bits. The TSVQ algorithm converges to

a slightly better solution 290 bits lower than the original RMSE. The k-means

algorithm converges approximately another 10 bits lower. Given that a total of 300 bits of optimization is completed, a difference of 10 to 50 bits is not a huge sacrifice, considering the algorithms allow clustering of document collections that k-means cannot process without distributed and parallel com-


putation.

How can document clusters be found when the document represen-

tations are too large to fit in main memory and must be streamed

from disk? How can clustering algorithms be effectively parallelized

and distributed among many machines?

The EM-tree is suited to streaming implementations where the data set

is streamed from disk. The entire cluster tree is optimized at each iteration

allowing an iteration of optimization to be performed with one pass over the

data set. It is the batch nature of the optimization that makes this possible, and

is also advantageous for distributed and parallel implementations. The entire

cluster tree is frozen at each iteration. Therefore, the data set can be divided

among the units of execution and then once all examples have been inserted

into the tree, the results can be combined to update the tree. The details of

this approach are discussed in the EM-tree paper [55].

How can Terabytes to Petabytes of text be clustered into fine grained

clusters?

It is expected that a parallel, distributed and streaming EM-tree algorithm

can cluster billions of documents in under a day, given the projections calculated

based on the efficiency of a single threaded implementation [55]. Considering it

is estimated that Google indexes approximately 45 billion pages, this approach

is likely to be able to cluster the entire searchable Internet into millions of fine grained clusters! This problem has been attacked from three angles. Firstly, compu-

tationally efficient document representations were created for use with cluster-

ing, namely TopSig. Secondly, the k-means algorithm was adapted to work with

TopSig, offering 1 to 2 orders of magnitude of performance increase over sparse

vector approaches. Thirdly, the approach used in the TopSig k-means algorithm

was combined with approaches based on the m-way nearest neighbor search tree

to further improve the scalability of these algorithms. The EM-tree algorithm

paper [55] provides results that indicate the algorithm is suitable for a parallel,

distributed and streaming implementation. To prove beyond any doubt that

the entire searchable Internet can be clustered, this needs to be implemented.

This is an exciting avenue of future research building on the results presented in this thesis.

How can a data structure used for geometric clustering algorithms be

formalized so that several algorithms, including flat and hierarchical

algorithms can be constructed using the same data structure?

The EM-tree algorithm paper [55] introduces the m-way nearest neighbor

search tree data structure and defines it formally. It is demonstrated how k-


means, TSVQ, EM-tree, agglomerative hierarchical clustering and the K-tree

algorithms can be constructed using the data structure.

How can the convergence of the EM-tree algorithm be proven?

When the EM-tree algorithm was first devised it was not certain if the al-

gorithm was guaranteed to converge. The EM-tree paper [55] proves that the

k-means algorithm is being performed in the root node of the tree. This is then

used in combination with structural induction to prove that all tree nodes will

converge in finite time.

4.3 Evaluation

The major research question for evaluation is, How can document cluster-

ing be evaluated in the context of Information Retrieval?

How can document clusters be evaluated in the context of a spe-

cific use-case instead of using the proxy of classification accuracy for

evaluation?

Most of the current literature evaluates different clustering approaches using

a ground truth set of categories as if it is a classification problem. This the-

sis presents an alternative evaluation using ad hoc relevance judgements from

information retrieval evaluation. This is motivated by the cluster hypothesis,

stating that relevant documents tend to cluster together. Documents that are

similar to each other are likely to be relevant to the same query. Therefore, it

is expected that relevant results for a query will exist in relatively few docu-

ment clusters. If the clustering is learning something useful with respect to the

cluster hypothesis, then relevant documents will be expected in fewer clusters

than generating clusters by random allocation completely ignoring document

content. This approach to clustering evaluation has been tested throughout the

thesis. It began at the INEX evaluation forum [171, 62]. It was examined fur-

ther with a publication on the Divergence from a Random Baseline approach to

evaluation [60]. Finally, it was shown that the use-case indeed works with the

implementation of a cluster-based search engine [61].

Can the cluster hypothesis be verified through document clustering

and ad hoc retrieval?

This research question has been answered directly by all of the relevant ma-

terial in the evaluation section of the thesis. All experimental results indicate

that the cluster hypothesis holds and that only a fraction of a document collec-


tion is required to be searched to rank most of the relevant documents.

Are categories appropriate for evaluating document clusters?

The goal of clustering is to place similar documents in the same cluster.

Using categories for evaluation of clustering makes the assumption that two

documents of the same category are similar. This may not strictly be the case

for the broad definition of categories often used for evaluation of document clus-

tering. For example, while help desk support and VLSI chip design both fall

in the broad category of Information Technology, the disciplines have very lit-

tle in common. Additionally, the number of categories typically used during

evaluation does not reflect the hundreds of thousands of potential categories

that exist for a heterogeneous general knowledge document collection such as

the Wikipedia or the entire Internet. These arguments and more are presented

throughout the publications in this thesis. Firstly, the evaluation approach is

highlighted in Chapters 7, 9, and, 11. Then its use-case is implemented as a

search engine in Chapters 8 and 15. Additionally, apart from classification, the

use of categories has no use-case. Indeed, if the goal is document classification, it is more useful to train a classifier than to use document clusters, as the predictions will be more accurate.

How can large scale document collections be annotated with human

generated information for evaluation?

One of the largest drawbacks to using category based evaluation is the hu-

man cost associated with labeling millions to billions of documents. One of

the largest document collections available for evaluation using categories is the

RCV1 [152] collection. It consists of approximately 800,000 documents with

hundreds of categories. It was reported that this process of labeling documents was an arduous experience for the assessors involved. However, ad hoc retrieval evaluation

offers a solution to this problem via the use of pooling. It reduces assessor load

and has already been validated in information retrieval forums using large scale

collections. This argument is most directly covered by the paper “Distributed

Information Retrieval” [61].

How can a clustering be evaluated to measure how effective it is for collection selection, reducing the number of clusters that need to be searched while maintaining search result quality?

The Normalized Cumulative Cluster Gain (NCCG) measure was introduced

at the evaluation at INEX in 2009 [171] and 2010 [62]. It measures the spread of

relevant documents over a clustering for the optimal collection selection. It uses

an oracle to rank clusters according to the number of relevant documents they


contain. This is the best possible way to rank a clustering given a query. This

evaluation is driven by the use-case of reducing search load in a distributed in-

formation retrieval system by only searching the first few most relevant clusters.
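A sketch of the oracle ranking idea underlying NCCG; the exact gain computation and normalization in [171, 62] may differ:

```python
def oracle_cluster_order(clusters, relevant):
    # the oracle ranks clusters by how many relevant documents they contain
    return sorted(clusters,
                  key=lambda c: sum(1 for d in c if d in relevant),
                  reverse=True)

def cumulative_gain_curve(clusters, relevant):
    # cumulative fraction of relevant documents found as clusters are
    # searched in oracle order; concentrating relevant documents in few
    # clusters makes this curve rise quickly
    curve, found = [], 0
    for cluster in oracle_cluster_order(clusters, relevant):
        found += sum(1 for d in cluster if d in relevant)
        curve.append(found / len(relevant))
    return curve
```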

How can a global optimum for the number of clusters be found when

measuring distortion using RMSE?

When varying the number of clusters found by a particular approach from 1

to the number of examples being clustered, n, the RMSE continues to improve

until there are n clusters. The Divergence from Random Baseline approach [60]

augments the RMSE value such that it reaches a peak somewhere in between 1

and n. It then proceeds to decrease when too many clusters are created. This

provides a clear optimum for the number of clusters with respect to RMSE for a particular clustering approach.
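A sketch of the underlying idea: subtract the score of a random baseline clustering that preserves the cluster size distribution but assigns documents at random; the exact construction in [60] may differ:

```python
import random

def random_baseline(cluster_sizes, documents, seed=0):
    # a baseline clustering with the same cluster size distribution but
    # with documents assigned at random, ignoring their content
    docs = list(documents)
    random.Random(seed).shuffle(docs)
    baseline, start = [], 0
    for size in cluster_sizes:
        baseline.append(docs[start:start + size])
        start += size
    return baseline

def divergence_from_random(score, clustering, documents, seed=0):
    # score: any clustering quality function (e.g. NCCG or negated RMSE)
    sizes = [len(c) for c in clustering]
    return score(clustering) - score(random_baseline(sizes, documents, seed))
```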

How can the Wikipedia be used to obtain a reliable and useful ground

truth for classification evaluation?

The XML Mining track at INEX has used the Wikipedia for clustering and

classification of documents for several years. Prior to 2010 the categories were

extracted from the Wikipedia portals. The drawback to this approach is that

only documents related to portals can be used. The Wikipedia also contains a category graph, but it is noisy and contains many categories that are not useful for a particular document. Therefore, categories

were extracted from the Wikipedia category graph that follow the shortest path

between a particular page and one of the “Main Topic Classifications”. This is

motivated by Occam’s Razor where the simplest explanation is often the best.

As the subset of Wikipedia used for clustering and classification was defined by

the ad hoc reference run in 2010, this allowed categories to be extracted for all

pages. Furthermore, these categories behaved consistently in evaluation: when the quality of a clustering was degraded, the quality according to the categories was also reduced [90].
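A sketch of the shortest path extraction, assuming the category graph is given as a child-to-parents mapping; the actual extraction pipeline used at the track may differ:

```python
from collections import deque

def shortest_category_path(page, parents, main_topics):
    # breadth-first search from the page up the category graph to the
    # nearest "Main Topic Classifications" category; `parents` maps a
    # page or category to its parent categories
    queue = deque([(page, [])])
    seen = {page}
    while queue:
        node, path = queue.popleft()
        if node in main_topics:
            return path
        for parent in parents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append((parent, path + [parent]))
    return None  # no path to a main topic classification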

4.4 Information Retrieval

The major research question for information retrieval is, How can fine grained

clusters of large scale document collections be used so that only a frac-

tion of the collection has to be searched?

How can clusters be ranked given a query?

An approach to ranking the clusters is used once a document collection has


been clustered. Only the documents in the most relevant clusters are searched.

There are many existing approaches to ranking clusters in the literature. A new

approach named CBM625 [61] was introduced where cluster representatives are

built using BM25 document weights. It squares the BM25 weights before com-

bining them. This improves the ranking of relevant results in clusters on the

Wikipedia. Note that the clusters were not found using BM25. They were

found using the TopSig representation. The CBM625 approach builds a new

representation for clusters where there is a cluster inverted index. It does this

based on the groupings of the clusters; i.e. which documents belong to which

cluster.
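The paper [61] should be consulted for the exact formulation of CBM625; the following sketch only illustrates the stated idea of building cluster representatives from squared BM25 document weights and ranking clusters against a query:

```python
from collections import defaultdict

def cluster_representatives(clusters, bm25_weights):
    # bm25_weights: one {term: BM25 weight} dictionary per document;
    # a representative sums the squared BM25 weights of its members
    representatives = []
    for cluster in clusters:
        rep = defaultdict(float)
        for doc in cluster:
            for term, weight in bm25_weights[doc].items():
                rep[term] += weight * weight
        representatives.append(rep)
    return representatives

def rank_clusters(query_terms, representatives):
    # score each cluster against the query; only documents in the top
    # ranked clusters are then searched exhaustively
    scores = [sum(rep.get(t, 0.0) for t in query_terms)
              for rep in representatives]
    return sorted(range(len(representatives)),
                  key=lambda i: scores[i], reverse=True)
```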

What is the benefit of document clustering in terms of reduced pro-

cessing costs? How can a search engine be implemented to exploit

document clusters to reduce search load?

The results in Chapters 8 and 15 of this thesis [61] evaluated the search

load in the same manner as a prior publication that uses cluster-based retrieval

[128]. It measures the number of documents searched in the minimum number

of clusters that are required to produce P@10, P@20 and P@30 scores that are

not statistically significantly different from exhaustive retrieval. The approach

taken by Kulkarni and Callan [128] was published during the development of

the approaches outlined in this thesis. However, the approach outlined in [61]

is able to retrieve 13 times fewer documents than the approach of Kulkarni and

Callan [128] on large scale document collections while not reducing the quality

when compared to exhaustive search. Both of these separate lines of research

implement working retrieval systems that use queries with relevance judgements

for evaluation. Different approaches to clustering and retrieval are taken, how-

ever, the same results confirming the cluster hypothesis are found.

Do the results of the optimal collection selection based evaluation of

document clusters transfer to application in a working search engine?

The optimal collection selection results from INEX suggested that a finer

grained clustering would allow less of a document collection to be searched and

still rank most of the relevant documents. This was validated by implementation

of a cluster-based search engine that does exactly that [61].


Chapter 5

Representations

Different representations of documents can improve the quality of clusterings

and run-time efficiency of processing documents when clustering. If a represen-

tation does not allow adequate separation of the topics in a document collection,

then no learning algorithm is going to magically fix this. However, by exploit-

ing different sources of information the documents may become more separable

and therefore increase the quality of the clustering. Dimensionality reduction,

such as the approach presented in TopSig, can improve the run-time efficiency

of processing document vectors by reducing the total number of bytes required

to represent documents for learning.

This chapter presents different approaches for improving the quality and

efficiency when processing document vectors for document clustering.

5.1 TopSig: topology preserving document signatures

This paper introduces the TopSig document signature representation. TopSig

produces binary strings, bit vectors or document signatures by applying ran-

dom indexing or random projections and numeric quantization to document

vectors. This creates a dense compressed representation of documents. The

compressed binary representation of TopSig is one of the factors that allows

the improvement of computational efficiency of clustering approaches outlined

in this thesis. The binary representation allows efficient processing on 64-bit

processors by processing 64 dimensions of each bit vector in a single CPU in-

struction. The compressed nature of the representation allows for less data to be

processed than when using other vector representations for document clustering.


This also improves the performance and scalability of clustering approaches as

less data has to be processed.

5.2 Pairwise Similarity of TopSig Document Signatures

This paper investigates the topology of TopSig document signatures by analyz-

ing the distributions of the pairwise distances between signatures. It highlights

that the indexing process of TopSig is influencing the random codes used to per-

form dimensionality reduction. It skews the binomial distribution for uniform

random binary strings such that the feature space is no longer uniform and is

clustered. It is this non-uniformity that allows the differentiation of meaning.

Furthermore, it highlights that the curse of dimensionality still applies for docu-

ment signatures with most signatures being equidistant to each other around the

middle of the distribution. Only the signatures in the left tail of the distribution

allow adequate separation of documents; i.e. only the local neighborhood of the

signatures is interpretable. This also gives a reasonable explanation of why

the TopSig model is only competitive at early precision for ad hoc information

retrieval when compared to language and probabilistic models. There is only

adequate separation between documents in the head of the ranked list. When

proceeding down the ranked list, documents become equidistant at higher recall.

5.3 Clustering with Random Indexing K-tree and XML Structure

This paper investigates using the semantic XML structure of the INEX 2009 Wikipedia collection for improving the quality of clustering. The INEX 2009 Wikipedia contains entity markup as XML. The text inside the entity tags and the tags themselves are represented as a bag of features. They are then clustered by Random Indexing K-tree. It was found that by combining the entity tags and text, the NCCG measure for collection selection could be improved over using the entity text alone. However, the entity text was clearly better at discovering the categories used for evaluation, as measured by Micro Purity.


Chapter 6: Algorithms

This chapter addresses efficient clustering algorithms for scaling to large document collections. The processing of ever increasing amounts of data allows better estimation of the probability distributions discovered during the unsupervised learning process.

6.1 TopSig: topology preserving document signatures

This paper introduces a modified version of the k-means algorithm that works directly with the binary string representation of TopSig. All cluster centers and documents are binary strings. This allows the Hamming distance measure to be used for comparison, which can be computed 64 dimensions at a time on 64-bit processors. This allows a one to two order of magnitude increase in computational efficiency when compared to the sparse vector approaches used in the popular clustering toolkit, CLUTO. When using 4096 bit document signatures, the quality of the clustering is not statistically significantly different to the sparse vector approach used in CLUTO.
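A minimal sketch of the distance computation that enables this speed-up is shown below; it is illustrative rather than the paper's optimized implementation, and it assumes signatures stored as arrays of unsigned 64-bit words (Python 3.10+ for int.bit_count).

```python
import numpy as np

def hamming(sig_a, sig_b):
    """Hamming distance between two signatures stored as uint64 words:
    XOR compares 64 dimensions per word, popcount counts differing bits."""
    diff = np.bitwise_xor(sig_a, sig_b)
    return sum(int(word).bit_count() for word in diff)

a = np.array([0xFFFF0000FFFF0000, 0x0123456789ABCDEF], dtype=np.uint64)
b = np.array([0xFFFF0000FFFF000F, 0x0123456789ABCDEF], dtype=np.uint64)
print(hamming(a, b))  # 4
```

In the k-means variant described above, cluster centers are themselves binary strings, so the same word-at-a-time comparison applies to every distance evaluation.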

6.2 Clustering with Random Indexing K-tree and XML Structure

This paper investigates the use of Random Indexing K-tree for clustering the entire 2.7 million document INEX 2009 XML Wikipedia. Random Indexing uses random projections to create reduced dimensionality, dense, real valued vector representations from sparse high dimensional vectors. This paper demonstrates that the Random Indexing K-tree can scale to the relatively large Wikipedia document collection while producing tens to hundreds of thousands of clusters. Note that most researchers who participated in the evaluation at INEX chose to only submit clusterings of the smaller 50,000 document subset. This may indicate that many other clustering approaches have issues when scaling to larger collections.

6.3 Distributed Information Retrieval: Collection Distribution, Selection and the Cluster Hypothesis for Evaluation of Document Clustering

The TopSig K-tree is introduced to scale clustering to 50 million documents while producing approximately 140,000 clusters in 10 hours in a single thread of execution. To the best of my knowledge, clustering of a document collection this large into this many clusters has not been reported in the literature. A delayed update mechanism is introduced because the K-tree algorithm regularly updates means along the insertion path of vectors inserted into the tree. When using binary vectors the cost of updating a mean is high, as all the bits in the vectors associated with the mean must be unpacked and repacked.

6.4 The EM-tree algorithm

A new algorithm called the EM-tree algorithm is introduced in this paper. It iteratively optimizes an entire m-way nearest neighbor search tree. It is proven to converge. Furthermore, it provides efficiency advantages over the TopSig K-tree when using TopSig bit vectors because means are updated less often.


Chapter 7: Evaluation

This chapter investigates the evaluation of document clustering using ad hoc information retrieval relevance judgments. It proposes that the often narrow and well defined queries used for evaluation may be more meaningful than the broad and lofty categories often found in category based evaluation, such as "Arts". The use of pooling addresses the problem of creating external, human created information for evaluating collections containing millions to billions of documents. The evaluation of document clustering using relevance judgments is undertaken for a specific use case: increasing the throughput of an information retrieval system by only searching the subset of a collection contained in the most relevant document clusters for a given query.

7.1 Overview of the INEX 2009 XML Mining track: Clustering and Classification of XML Documents

This paper introduces the collaborative evaluation at INEX. It uses ad hoc relevance judgments to evaluate document clustering for collection selection. The Normalized Cumulative Cluster Gain measure is introduced to evaluate a clustering with an "oracle" collection selection approach. This represents an upper bound on collection selection performance given a clustering solution.


7.2 Overview of the INEX 2010 XML Mining track: Clustering and Classification of XML Documents

This paper continues the evaluation of document clustering at INEX. It also introduces a new method for extracting categories from the Wikipedia category graph.

7.3 Document Clustering Evaluation: Divergence from a Random Baseline

This paper introduces an approach to the correction of ineffective clusterings called Divergence from a Random Baseline. It produces a baseline that looks identical to the clustering being evaluated except that the documents are allocated to clusters uniformly at random; the cluster size distribution of the baseline matches that of the clustering being evaluated. This is able to differentiate problematic clusterings using the NCCG evaluation measure. It also reveals an optimum in distortion, as measured by RMSE, that is not apparent otherwise.
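A minimal sketch of the baseline construction follows; the function names are illustrative, not from the paper. The key property is that the baseline reuses the exact multiset of cluster labels, so its cluster size distribution matches the clustering being evaluated while document assignments are uniformly random.

```python
import random

def random_baseline(clustering):
    """clustering: list of cluster labels, one per document.
    Returns the same labels randomly reassigned to documents."""
    return random.sample(clustering, len(clustering))

def divergence_from_random(score_fn, clustering):
    """Adjust a clustering score by subtracting its matched random baseline."""
    return score_fn(clustering) - score_fn(random_baseline(clustering))
```

Any internal or external quality measure can be plugged in as score_fn; a clustering that only exploits the size distribution will score close to its own baseline and therefore diverge very little.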


Chapter 8: Information Retrieval

8.1 Distributed Information Retrieval: Collection Distribution, Selection and the Cluster Hypothesis for Evaluation of Document Clustering

This paper combines all of the previous sections on representations, algorithms and evaluation to implement the ideas as a cluster-based search engine. The final result is that only 0.1% of the 50 million document ClueWeb 2009 Category B collection has to be searched. This is a 13-fold improvement over the previously best known result. It also addresses issues surrounding the evaluation of document clustering and demonstrates the use case presented in the evaluation at INEX.


Chapter 9: Overview of the INEX 2009 XML Mining track: Clustering and Classification of XML Documents

I made a large proportion of the contributions in this paper relating to the clustering task. I wrote the evaluation software used for the clustering task¹, defined and expanded the scope of the evaluation, performed the analysis of submissions, created plots, performed data preparation, aided in the organization of the XML Mining track, and aided in writing the paper.

9.1 Evaluation

This paper introduces the collaborative evaluation at INEX. It uses ad hoc relevance judgments to evaluate document clustering for collection selection. The Normalized Cumulative Cluster Gain measure is introduced to evaluate a clustering with an "oracle" collection selection approach. This represents an upper bound on collection selection performance given a clustering solution.

¹ http://mloss.org/software/view/468/


Overview of the INEX 2009 XML Mining Track: Clustering and Classification of XML Documents

Richi Nayak, Christopher M. De Vries, Sangeetha Kutty, Shlomo Geva
Faculty of Science and Technology, Queensland University of Technology
GPO Box 2434, Brisbane Qld 4001, Australia
{r.nayak, christopher.devries, s.kutty, s.geva}@qut.edu.au

Ludovic Denoyer, Patrick Gallinari
University Pierre et Marie Curie, LIP6, 104 avenue du président Kennedy
75016 Paris, France
{Ludovic.denoyer, Patrick.gallinari}@lip6.fr

Abstract. This report explains the objectives, datasets and evaluation criteria of both the clustering and classification tasks set in the INEX 2009 XML Mining track. The report also describes the approaches and results obtained by the different participants.

Keywords: XML document mining, INEX, Wikipedia, Structure and content, Clustering, Classification.

1 Introduction

The XML Document Mining track was launched for exploring two main ideas: (1) identifying key problems and new challenges of the emerging field of mining semi-structured documents, and (2) studying and assessing the potential of Machine Learning (ML) techniques for dealing with generic ML tasks in the structured domain, i.e. classification and clustering of semi-structured documents. This track has run for five editions during INEX 2005, 2006, 2007, 2008 and 2009. The first four editions have been summarized in [2, 3, 4] and we focus here on the 2009 edition.

INEX 2009 included two tasks in the XML Mining track: (1) an unsupervised clustering task and (2) a semi-supervised classification task where documents are organized in a graph. The clustering task requires the participants to group the documents into clusters without any knowledge of cluster labels using an unsupervised learning algorithm. On the other hand, the classification task requires the participants to label the documents in the dataset into known classes using a supervised learning algorithm and a training set. This report gives the details of the clustering and classification tasks.

2 The Clustering Track

In the last decade, we have observed a proliferation of approaches for clustering XML documents based on their structure and content [9, 12]. There have been many approaches developed for diverse application domains. Many applications require data objects to be grouped by similarity of content, tags, paths, structure and semantics. The clustering task in INEX 2009 evaluates clustering approaches in the context of XML information retrieval.

The INEX 2009 clustering task is different from those of previous years due to its incorporation of a different evaluation strategy. The clustering task explicitly tests the Jardine and van Rijsbergen (1971) cluster hypothesis [8], which states that documents that cluster together have a similar relevance to a given query. It uses manual query assessments from the INEX 2009 Ad Hoc track. If the cluster hypothesis holds true, and if suitable clustering can be achieved, then a clustering solution will minimise the number of clusters that need to be searched to satisfy any given query. There are important practical reasons for performing collection selection on a very large corpus. If only a small fraction of clusters (hence documents) need to be searched, then the throughput of an information retrieval system will be greatly improved. The INEX 2009 clustering task provides an evaluation forum to measure the performance of clustering methods for collection selection on a large-scale test collection. The collection consists of a set of documents, their labels, a set of information needs (queries), and the answers to those information needs.

2.1 Corpus

The INEX XML Wikipedia collection is used as the dataset in this task. This 60 gigabyte collection contains 2.7 million English Wikipedia XML documents. The XML mark-up includes explicit tagging of named entities and document structure. In order to enable participation with minimal overheads in data preparation, the collection was pre-processed to provide various representations of the documents: for instance, a bag-of-words representation of terms and frequent phrases in a document, and frequencies of various XML structures in the form of trees, links, named entities, etc. These collection representations made this a lightweight task that allowed the participants to submit clustering solutions without worrying about pre-processing this huge data collection. There are a total of 1,970,515 terms after stemming, stopping, and eliminating terms that occur in a single document for this collection. There are 1,900,075 unique terms that appear more than once enclosed in entity tags. There are 5213 unique entity tags in the collection. There are a total of 110,766,016 unique links in the collection. There are a total of 348,552 categories that contain all documents except for a 118,685 document subset containing no category information. These categories are derived by using the YAGO ontology [16]. The YAGO categories appear to follow a power law distribution, as shown in Figure 1. The distribution of documents in the top-10 categories is shown in Table 1.

Figure 1: The YAGO Category Distribution

Category                                            Documents
Living people                                          307304
All disambiguation pages                               143463
Articles with invalid date parameter in template        77659
All orphaned articles                                   34612
All articles to be expanded                             33810
Year of birth missing (living people)                   32499
All articles lacking sources                            21084
Human name disambiguation pages                         18652
United States articles missing geocoordinate data       15363
IUCN Red List least concern species                     15241

Table 1: Top-10 Category Distribution


A subset of the collection containing about 50,000 documents (drawn from the entire INEX 2009 corpus) was also used in the task, to evaluate category label results only, for teams that were unable to process such a large data collection.

2.2 Tasks and Evaluation Measures

The task was to utilize unsupervised classification techniques to group the documents into clusters. Participants were asked to submit multiple clustering solutions containing different numbers of clusters such as 100, 500, 1000, 2500, 5000 and 10000. The clustering solutions are evaluated by two means. Firstly, we utilise the classes-to-clusters evaluation, which assumes that the classification of the documents in a sample is known (i.e., each document has a class label). Any clustering of these documents can then be evaluated with respect to this predefined classification. It is important to note that the class labels are not used in the process of clustering, but only for the purpose of evaluating the clustering results. The standard criterion of purity is used to determine the quality of clusters. These evaluation results were provided online on an ongoing basis, starting from mid-October. Entropy and F-Score were not used in the evaluation. The reason is that a document in the corpus maps to more than one category; due to the multiple labels a document can have, it was possible to obtain a higher Entropy or F-Score than the ideal solution.

Purity measures the extent to which each cluster contains documents primarily from one class. Each cluster is assigned the class label of the majority of documents in it. The macro and micro purity of a clustering solution cs are obtained as weighted sums of the individual cluster purities. In general, the larger the value of purity, the better the clustering solution is.

$$\mathrm{Purity}(k) = \frac{\text{number of documents with the majority label in cluster } k}{\text{number of documents in cluster } k}$$

$$\mathrm{Micro\mbox{-}Purity}(cs) = \frac{\sum_{k=1}^{K} \mathrm{Purity}(k) \cdot n_k}{\sum_{k=1}^{K} n_k}$$

where $n_k$ is the number of documents in cluster $k$ and $K$ is the total number of clusters.

$$\mathrm{Macro\mbox{-}Purity}(cs) = \frac{\sum_{k=1}^{K} \mathrm{Purity}(k)}{K}$$

The clustering solutions are also evaluated to determine the quality of clusters relative to the optimal collection selection goal, given a set of queries. Better clustering solutions in this context will tend to (on average) group together relevant results for (previously unseen) ad-hoc queries. Real ad-hoc retrieval queries and their manual assessment results are utilised in this evaluation. This novel approach evaluates the clustering solutions relative to a very specific objective: clustering a large document collection in an optimal manner in order to satisfy queries while minimising the search space. The Normalised Cumulative Gain is used to calculate the score of the best possible collection selection according to a given clustering solution of n clusters. The score is better when the query result set is contained in fewer, more cohesive clusters. The cumulative gain of a cluster (CCG) is calculated by counting the number of relevant documents in a cluster, c, for a topic, t, where c is the set of documents in a cluster and t is the set of relevant documents for the topic.

$$\mathrm{CCG}(c, t) = |c \cap t|$$

For a clustering solution and a given topic, a (sorted) vector CG is created representing each cluster by its CCG value. Clusters containing no relevant documents are represented by a value of zero. The cumulative sum of the vector CG is calculated and then normalised against the ideal gain vector. Each clustering solution cs is scored for how well it has split the relevant set into clusters using CCG for the topic t.

$$\mathrm{SplitScore}(t, cs) = \frac{\sum \mathrm{cumsum}(CG)}{n_r^2}$$

where $n_r$ is the number of relevant documents in the returned result set for the topic $t$.

A worst possible split is assumed to place each relevant document in a distinct cluster. Let $CG1$ be a vector that contains the cumulative gain of every cluster with one relevant document each.

$$\mathrm{MinSplitScore}(t, cs) = \frac{\sum \mathrm{cumsum}(CG1)}{n_r^2}$$

The normalized cluster cumulative gain (nCCG) for a given topic $t$ and a clustering solution $cs$ is given by

$$\mathrm{nCCG}(t, cs) = \frac{\mathrm{SplitScore}(t, cs) - \mathrm{MinSplitScore}(t, cs)}{1 - \mathrm{MinSplitScore}(t, cs)}$$

The mean and the standard deviation of the nCCG score over all the topics for a clustering solution $cs$ are then calculated, where $n$ is the total number of topics.

$$\mathrm{Mean\,nCCG}(cs) = \frac{\sum_{t=1}^{n} \mathrm{nCCG}(t, cs)}{n}$$

$$\mathrm{Std\,Dev\,nCCG}(cs) = \sqrt{\frac{\sum_{t=1}^{n} \big(\mathrm{nCCG}(t, cs) - \mathrm{Mean\,nCCG}(cs)\big)^2}{n}}$$
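A minimal sketch of the nCCG computation for a single topic follows. It is an interpretation of the definitions above: in particular, truncating or padding the sorted CG vector to length $n_r$ is an assumption made so that a perfect solution scores one, not something stated explicitly in the text.

```python
import numpy as np

def nccg(clusters, relevant):
    """clusters: list of sets of document ids; relevant: set of relevant ids."""
    nr = len(relevant)
    ccg = sorted((len(c & relevant) for c in clusters), reverse=True)
    cg = np.array((ccg + [0] * nr)[:nr], dtype=float)  # assumed length n_r
    split = np.cumsum(cg).sum() / nr**2
    min_split = np.cumsum(np.ones(nr)).sum() / nr**2   # one relevant doc per cluster
    return (split - min_split) / (1 - min_split)

clusters = [{0, 1, 2, 3}, {4, 5}, {6}]
relevant = {0, 1, 2, 4}
print(nccg(clusters, relevant))  # ~0.83: three of four relevant docs share a cluster
```

With this reading, placing all relevant documents in a single cluster yields a score of one, and the worst-case one-relevant-document-per-cluster split yields zero.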

A total of 68 topics were used to evaluate the quality of clusters generated on the full collection of about 2.7 million documents. A total of 52 topics were used to evaluate the quality of clusters generated on the subset of about 50,000 documents. A total of 4858 documents were found relevant by the manual assessors for the 68 topics, an average of 71 relevant documents per topic. The nCCG value varies from 0 to 1.

2.3 Participants, Submissions and Evaluation

A total of six research teams participated in the INEX 2009 clustering task. Two of them submitted results for the subset data only. We briefly summarise the approaches employed by the participants.

Exploiting Index Pruning Methods for Clustering XML Collections [1]
The Cover-Coefficient Based Clustering Methodology (C3M) was used to cluster the XML documents. C3M is a single-pass partitioning-type clustering algorithm which measures the probability of selecting a document given a term that has been selected from another document. As another approach, [1] adapted term-centric and document-centric index pruning techniques to obtain more compact representations of the documents. Documents are clustered with these reduced representations at various pruning levels, again using the C3M algorithm. All of the experiments were executed on the 50K document subset of the INEX 2009 corpus.

Clustering with Random Indexing K-tree and XML Structure [5]
The Random Indexing (RI) K-tree was used to cluster the entire 2,666,190 XML documents in the INEX 2009 Wikipedia collection. Clusters were created as close as possible to the 100, 500, 1000, 5000 and 10000 clusters required for evaluation. The algorithm produces clusters of many sizes in a single pass; the desired clustering granularity is selected by choosing a particular level in the tree. In the context of document representation, topology preserving dimensionality reduction is preserving document meaning, or at least this is the conjecture which the team tests here. Document structure was represented using a bag of words and a bag of tags derived from the semantic markup in the INEX 2009 collection. The term frequencies were weighted with BM25 where K1 = 2 and b = 0.75. The tag frequencies were not weighted.

Exploiting Semantic Tags in XML Clustering [10]
This technique combines the structure and content of XML documents for clustering. Each XML document in the INEX Wikipedia corpus is parsed and modeled as a rooted, labeled, ordered document tree. A constrained frequent subtree mining algorithm is then applied to extract the common structural features from the document trees in the corpus. Using the common structural features, the corresponding content features of the XML documents are extracted and represented in a Vector Space Model (VSM). The term frequencies in the VSM were weighted with both TF-IDF and BM25. There were 100, 500 and 1000 clusters created for evaluation.


Performance of K-Star at the INEX'09 Clustering Task [13]
The employed approach was quite simple and focused on high scalability. The team used a modified version of the Star clustering method which automatically obtains the number of clusters. In each iteration, this clustering method brings together all those items whose similarity value is higher than a given threshold T, which is typically assumed to be the average similarity of the whole document collection; the clustering method therefore "discovers" the number of clusters on its own. The run submitted to the INEX clustering task split the complete document collection into small subsets which were clustered with the above-mentioned clustering method.

Evaluation

Figure 2, Figure 3 and Figure 4 show the performance of the various teams in the clustering task. The legends are formatted in the following fashion: [metric] – [institution] (username) [method].


Figure 2: Purity and NCCG performance of different teams using the entire dataset  


Figure 3: Purity performance of different teams using the subset data


Figure 4: NCCG performance of different teams using the subset data


3. The Classification Track

Dealing with XML document collections is a particularly challenging task for ML and IR. XML documents are defined by their logical structure and their content (hence the name semi-structured data). Moreover, in a large majority of cases (Web collections for example), XML document collections are also structured by links between documents (hyperlinks for example). These links can be of different types and correspond to different information: for example, one collection can provide hierarchical links, hyperlinks, citations, and so on. Most models developed in the field of XML categorization simultaneously use the content information and the internal structure of XML documents (see [2] and [3] for a list of models) but they rarely use the external structure of the collection, i.e. the links between documents. Some methods using both content and links have been proposed in [4].

The XML Classification Task focuses on the problem of learning to classify documents organized in a graph of documents. Unlike the 2008 track, we consider here the problem of multiple label classification, where a document belongs to one or many different categories. This task considers a transductive context where, during the training phase, the whole graph of documents is known but the labels of only a part of them are given to the participants (Figure 5).

Figure 5: The supervised classification task (left: training set; right: final labelling). Colors/shapes correspond to categories; circle/white nodes are unlabeled nodes. Note that in this track, documents may belong to many categories.


3.1 Corpus

The corpus provided is a subset of the INEX 2009 corpus. We have extracted a set of 54,889 documents and the links between these documents. These links correspond to the links provided by the authors of the Wikipedia articles. The documents have been transformed into TF-IDF vectors by the organizers. The corpus thus corresponds to a set of 54,889 vectors of dimension 186,723. The documents belong to 39 categories that correspond to 39 Wikipedia portals. We have provided the labels of 20% of the documents. The corpus is composed of 4,554,203 directed links that correspond to hyperlinks between the documents of the corpus. Each document has 84.1 links on average.

Number of documents             54,889
Number of training documents    11,028
Number of test documents        43,861
Number of categories            39
Number of links                 4,554,203
Number of distinct words        186,723

3.2 Evaluation Measures

In order to evaluate the submissions of the participants, we have used different measures. The first set of measures are computed over each category and then averaged over the categories (using a micro or a macro average):

• Accuracy (ACC) corresponds to the classification error rate. Note that a system that returns no relevant categories for any document can still achieve a quite good accuracy.

• F1 score (F1) corresponds to the classical F1 measure and measures the ability of a system to find the relevant categories.

The second set of measures are computed over each document and then averaged over the documents.

• Average precision (APR) corresponds to the Average Precision computed over the list of categories returned for each document. It measures the ability of a system to correctly rank the relevant categories. This measure is based on a ranking score of each category for each document.

3.3 Participants and Submissions

Five different teams participated in the track. They submitted different runs and we present here only the best results obtained by each team. Note that, due to additional experiments made after the submission deadline, the results presented here and the results presented in the participants' articles can differ.

Team                          Micro ACC  Macro ACC  Micro F1  Macro F1  APR
University of Wollongong      92.5       94.6       51.2      47.9      68.0
University of Peking          94.7       96.2       51.8      48.0      70.2
XEROX Research Center         96.3       97.4       60.0      57.1      67.8
University of Saint Etienne   96.2       97.4       56.4      53.0      68.5
University of Granada         67.8       75.4       26.2      25.3      72.9

3.4 Summary of the methods

We give here a brief description of the methods submitted by the participants. Please refer to the participants' articles for a detailed description of the methods and for the final results obtained by the different teams.

Multi-label Wikipedia classification with textual and graph features [6]

This paper proposes to evaluate different classification methods that use both the textual features of the pages to classify and graph features computed from the structure of the graph. These features include, for example, the mean centrality, the degree centrality, etc. Different classifiers have been tested to handle the multi-label problem.


Supervised Encoding of Graph-of-Graphs for Classification and Regression Problems [7]

This article proposes a novel method which aims at encoding graph-of-graphs structures, where the data correspond to a graph of elements which are themselves composed of graphs. The graph-of-graphs structure is described and then used as a classification model based on back-propagation of the error through the different levels of the nested structure.

UJM at INEX 2009 XML Mining Track [11]

The authors use different classification strategies based on a set of content features to handle the classification problem. They mainly compare different feature selection methods and thresholding strategies.

Link-based text classification using Bayesian networks [14]

The article presents a Bayesian network model that is able to handle both content and links between documents. The proposed model is an extension of the Naïve Bayes model to documents organized in a graph.

Extended VSM for XML Document Classification using Frequent Subtrees [15]

The last paper proposes the structured link vector model, which aims at modeling both the content and the structure of the documents in a vector. Mainly, the authors propose to insert into classical content-based feature vectors information about the frequent XML subtrees and the links between documents.

4. Conclusion

The XML Mining track at INEX 2009 brought together researchers from the Information Retrieval, Data Mining, Machine Learning and XML fields. The clustering task allowed participants to evaluate clustering methods against a real use case and with significant volumes of data. The task was designed to facilitate participation with minimal effort by providing not only raw data, but also pre-processed data which can be easily used by existing clustering software. The classification task allowed participants to explore algorithmic, theoretical and practical issues regarding the classification of interdependent XML documents.


5. Acknowledgments

We would like to thank all the participants for their efforts and hard work.

6. References

1. Altingovde, I., Atilgan, D., Ulusoy, O.: Exploiting Index Pruning Methods for Clustering XML Collections. In: INEX 2009 Preproceedings.
2. Denoyer, L., Gallinari, P.: Report on the XML mining track at INEX 2005 and INEX 2006: categorization and clustering of XML documents. 41(1) (2007) 79–90
3. Denoyer, L., Gallinari, P.: Report on the XML mining track at INEX 2007: categorization and clustering of XML documents. 42(1) (2008) 22–28
4. Denoyer, L., Gallinari, P.: Overview of the INEX 2008 XML mining track. In: INEX. (2008) 401–411
5. De Vries, C., Geva, S., De Vine, L.: Clustering with Random Indexing K-tree and XML Structure. In: INEX 2009 Preproceedings.
6. Chidlovskii, B.: Multi-label Wikipedia classification with textual and graph features. In: INEX. (2009)
7. Hagenbuchner, M., Zhang, S., Scarselli, F., Chung Tsoi, A.: Supervised Encoding of Graph-of-Graphs for Classification and Regression Problems. In: INEX. (2009)
8. Jardine, N., van Rijsbergen, C.J.: The Use of Hierarchic Clustering in Information Retrieval. Information Storage and Retrieval 7 (1971) 217–240
9. Kutty, S., Nayak, R., Li, Y.: HCX: An Efficient Hybrid Clustering Approach for XML Documents. In: Proceedings of the ACM Symposium on Document Engineering, Munich, Germany (2009) 94–97
10. Kutty, S., Nayak, R., Li, Y.: Clustering XML documents using Multi-feature Model. In: INEX 2009 Preproceedings.
11. Largeron, C., Moulin, C., Gery, M.: UJM at INEX 2009 XML Mining Track. In: INEX. (2009)
12. Nayak, R.: Book chapter.
13. Pinto, D., Tovar, M., Vilariño, D., Beltran, B., Salazar, H.: BUAP: Performance of K-Star at the INEX'09 Clustering Task. In: INEX 2009 Preproceedings.
14. Romero, A.E., de Campos, L.M., Fernandez-Luna, J.M., Huete, J.F., Masegosa, A.R.: Link-based text classification using Bayesian networks. In: INEX. (2009)
15. Yang, J., Wang, S.: Extended VSM for XML Document Classification using Frequent Subtrees. In: INEX. (2009)
16. Suchanek, F., Kasneci, G., Weikum, G.: YAGO: A Core of Semantic Knowledge Unifying WordNet and Wikipedia. In: WWW 2007.


Chapter 10: Clustering with Random Indexing K-tree and XML Structure

For this paper I completed almost all tasks related to the contributions. Lance contributed some parts of the software¹ related to Random Indexing. I wrote the manuscript, defined the experimental setup, conducted the experiments, performed the data analysis, and wrote the software.

¹ http://ktree.sf.net

10.1 Representations

This paper investigates using the semantic XML structure of the INEX 2009 Wikipedia collection for improving the quality of clustering. The INEX 2009 Wikipedia contains entity markup as XML. The text inside the entity tags and the tags themselves are represented as a bag of features. They are then clustered by Random Indexing K-tree. It was found that by combining the entity tags and text, the NCCG measure for collection selection could be improved over using the entity text alone. However, the entity text was clearly better at discovering the categories used for evaluation, as measured by Micro Purity.

10.2 Algorithms

This paper investigates the use of Random Indexing K-tree for clustering the entire 2.7 million document INEX 2009 XML Wikipedia. Random Indexing uses random projections to create reduced dimensionality, dense, real valued vector representations from sparse high dimensional vectors. This paper demonstrates that the Random Indexing K-tree can scale to the relatively large Wikipedia document collection while producing tens to hundreds of thousands of clusters. Note that most researchers who participated in the evaluation at INEX chose to only submit clusterings of the smaller 50,000 document subset. This may indicate that many other clustering approaches have issues when scaling to larger collections.


Clustering with Random Indexing K-tree and XML Structure

Christopher M. De Vries, Shlomo Geva, and Lance De Vine

Faculty of Science and Technology, Queensland University of Technology, Brisbane, Australia
[email protected] [email protected] [email protected]

Abstract. This paper describes the approach taken to the clustering task at INEX 2009 by a group at the Queensland University of Technology. The Random Indexing (RI) K-tree has been used with a representation that is based on the semantic markup available in the INEX 2009 Wikipedia collection. The RI K-tree is a scalable approach to clustering large document collections. This approach has produced quality clustering when evaluated using two different methodologies.

Key words: INEX, XML, Mining, Documents, Clustering, Structure, K-tree, Random Indexing, Random Projection

1 Introduction

The cluster hypothesis suggests that documents that cluster together tend to have relevance to similar queries. The clustering task at INEX 2009 aims to evaluate the utility of clustering in collection selection. The goal of clustering is to minimise the spread of relevant results of ad-hoc queries over a clustering solution. The purpose of clustering in this context is to determine the distribution of a collection over multiple machines. We have a dual optimisation problem: it is desirable to maximise the number of clusters while minimising the spread of relevant results of ad-hoc queries over the clusters. Search efficiency can be increased with the distribution of clusters (sub-collections) on more machines. However, since it is not possible to produce clusters that split the collection to perfectly satisfy all conceivable ad-hoc queries, a good clustering solution is expected to optimise the distribution such that for most ad-hoc queries most of the results can be found in a small set of clusters. The goal of collection selection is then to rank the clusters (sub-collections) to identify the order in which they should be searched to satisfy any given query.

We have used K-tree [1, 2] to generate clustering solutions. The scalability of K-tree in a document clustering setting has been discussed by De Vries and Geva [3, 4]. The original contribution to K-tree in INEX 2009 is the use of Random Indexing (RI) to represent the documents. The K-tree algorithm has also been modified to work with the RI representation. RI facilitates an efficient and economical vector space representation. The RI K-tree provides a scalable approach to clustering large collections at multiple granularities. The latest Wikipedia collection has included semantic markup that is based on the YAGO ontology. This markup has been used in encoding the documents, and two simple approaches are described in Section 4.

This paper introduces and defines Random Indexing in Section 2 and explains its use with K-tree in Section 3. The representation of semantic markup is discussed in Section 4. The combination of RI K-tree and the representation of semantic markup introduced in earlier sections is applied to the INEX clustering task in Sections 5, 6 and 7. The paper ends with a conclusion in Section 8.

2 Random Indexing

RI [5] is an efficient, scalable and incremental approach to the implementation of a word space model. Word space models use the distribution of terms in documents to create high dimensional document vectors. The directions of these document vectors represent various semantic meanings and contexts.

Latent Semantic Analysis (LSA) [6] is a popular word space model. LSA creates context vectors from a document term occurrence matrix by performing Singular Value Decomposition (SVD). Dimensionality reduction is achieved through projection of the document term occurrence vectors onto the subspace spanned by the vectors with the largest singular values in the decomposition. This projection is optimal in the sense that it minimises the sum of squares of the difference between the original matrix and the projected matrix components. In contrast, Random Indexing first creates random context vectors of lower dimensionality and then combines them to create a term occurrence matrix in the dimensionally reduced space. Each term in the collection is assigned a random vector and the document term occurrence vector is then a superposition of all the term random vectors. There is no matrix decomposition and hence the process is efficient.

RI is also known as Random Projection and is explained by the Johnson-Lindenstrauss lemma [7]. It states that if points in a high dimensional space are projected into a lower dimensional, randomly selected subspace of sufficient dimensions, they will approximately retain the same topology. Any n point set in Euclidean space can be embedded in O(log n / ε²) dimensions without distorting the pair-wise distances between points by more than 1 ± ε, where 0 < ε < 1. Dasgupta and Gupta [8] have provided a proof for the Johnson-Lindenstrauss lemma, showing that the proposed bounds of the lemma hold.

The RI mapping is performed by producing r dimensional index vectors for each term in a collection, where r is the desired dimensionality of the reduced space. We have chosen these vectors to be sparse and ternary. Ternary index vectors were introduced by Achlioptas [9] as being friendly for database environments. Bingham and Mannila [10] have found via experimental analysis that the sparsity of index vectors does not affect the distortion of the embedding. Sparse index vectors reduce the time to complete RI as only 10 percent of the dimensions are non-zero. However, other choices exist for index vectors, such as binary spatter codes [11], which are randomly selected binary vectors, and holographic reduced representations [12], which are dense randomly selected real valued vectors. When indexing the INEX 2009 collection, the index vectors are multiplied by the BM25 weight for each term in each document and added to the RI document vector. The document vector becomes a superposition of the index vectors multiplied by the term weights as determined by BM25.

RI can be viewed as a matrix multiplication of a document by term matrix D and a random projection matrix I resulting in a reduced matrix R. Row vectors of I contain index vectors of r dimensions for each term in D. Moreover, n is the number of documents, t is the number of terms and r is the dimensionality of the reduced space. R is the reduced matrix where each row vector represents a document. Equation 1 defines RI as a matrix multiplication.

$$D_{n \times t} \, I_{t \times r} = R_{n \times r} \quad (1)$$

Note that the RI document vectors themselves are not random. They are composed of a superposition of random term vectors, and the superposition result depends on the BM25 term weights and document content.

Another way to view RI is to interpret each index vector as a code. These codes are nearly orthogonal to all other codes produced, resulting in minimal interference between terms in the reduced vector space. Orthogonality can be measured by creating a pair-wise distance matrix between index vectors using cosine similarity as a distance measure. If two vectors are orthogonal their cosine similarity will be zero; the closer the vectors are to orthogonal, the closer their cosine similarity will be to zero. Therefore, it is expected that the pair-wise distance matrix will contain values close to zero in every position except the main diagonal. Finding truly orthogonal codes is computationally expensive and therefore avoided. Nearly orthogonal codes are found by drawing values in the vector from a normal distribution. Figure 1 shows the addition of index vectors (nearly orthogonal codes) to create a document representation.

Fig. 1. Random Indexing Example
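A minimal sketch of this indexing process follows, assuming sparse ternary index vectors with roughly 10 percent non-zero entries as described above; the dimensionality, seeding scheme and weights are illustrative values, not those of the actual experiments.

```python
import numpy as np

def index_document(term_weights, r=1000, density=0.1, seed=0):
    """term_weights: list of (term id, BM25 weight) pairs for one document.
    Superposes one sparse ternary index vector per term, scaled by its weight."""
    nnz = int(r * density)                    # ~10% non-zero dimensions
    doc = np.zeros(r)
    for term_id, weight in term_weights:
        rng = np.random.default_rng(seed + term_id)  # reproducible code per term
        positions = rng.choice(r, size=nnz, replace=False)
        signs = rng.choice([-1.0, 1.0], size=nnz)
        doc[positions] += weight * signs      # add the term's index vector
    return doc

# Toy document containing term 3 (weight 2.5) and term 17 (weight 1.0).
print(index_document([(3, 2.5), (17, 1.0)])[:10])
```

Because each term's code is generated independently of the collection, new documents can be indexed incrementally, which is what makes RI a good match for the online K-tree algorithm described next.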


3 Random Indexing K-tree

The K-tree is an online and dynamic clustering algorithm that scales well by making many local decisions, resulting in a hierarchical tree structure. It is a hybrid of the B+-tree and k-means algorithms, where the B+-tree has been adapted for multi-dimensional data and the k-means algorithm is used to perform splits in the tree. It is built in a bottom-up manner as data arrives. De Vries and Geva [2–4] discuss the algorithm and its application to document clustering, including the scalability of the algorithm. K-tree was compared to the popular CLUTO clustering toolkit and found to cluster significantly faster when a large number of clusters are required [4]. The Random Indexing (RI) K-tree [13] combines K-tree with RI to improve the quality of results and run-time performance.

The time complexity of K-tree depends on the length of the document vectors. K-tree insertion incurs two costs: finding the appropriate leaf node for insertion, and k-means invocation during node splits. It is therefore desirable to operate with a lower dimensional vector representation.

The combination of RI with K-tree is a good fit. Both algorithms operate in an online and incremental mode. This allows the tree to track the distribution of data as it arrives and changes over time. K-tree insertions and deletions allow flexibility when tracking data in volatile and changing collections. Furthermore, K-tree performs best with dense vectors, such as those produced by RI.

Given the scalable and dynamic properties of the RI K-tree algorithm, we propose that it is a good solution for clustering large, volatile document collections. The logarithmic lookup time of K-tree [13] to find the most similar cluster is also of use in a functioning information retrieval system relying on collection selection. This allows a query broker to direct queries to the most relevant sub-collection in sub-linear time with respect to the size of the collection.

4 Document Representation

Document structure has been represented by using a bag of words and a bag of tags representation derived from the semantic markup in the INEX 2009 collection. Both are vector space representations.

The bag of words is made up of term frequencies contained within any entity tags in the collection. The term frequencies were weighted with BM25 [14] where K1 = 2 and b = 0.75. We hypothesise that terms contained within entity tags are more likely to indicate the topic of a document. Therefore, documents with the same topic will fall closer together in the vector space representation. This indirectly exploits the XML structure.

The bag of tags representation is made up in an analogous manner of XML entity tag frequencies. The tag frequencies were not weighted. Entity tags consist of concepts such as scientist, location and person. We conjecture that documents with similar tags will belong to the same topic. Future work may compare tag frequency based vectors to set based vectors. In set based vectors each tag would be recorded as existing in a document or not. This way it can be determined if the use of tag frequencies is worthwhile. If a power law distribution exists in tag frequencies, the Inverse Document Frequency heuristic may also prove useful, as it did with link graphs [3]. The entity tags directly exploit structure by indexing it.

The bag of words and tags representations were combined. This is done by adding the two vector space representations together and then normalising the resulting vector to unit length. As both of the representations are based on RI, the codes between representations will be nearly orthogonal. However, a larger number of dimensions may be required to accommodate the extra information.
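The combination step is a simple superposition; a minimal sketch, assuming both RI vectors share the same dimensionality:

```python
import numpy as np

def combine(words_vec, tags_vec):
    """Superpose the bag-of-words and bag-of-tags RI vectors,
    then normalise the result to unit length."""
    combined = words_vec + tags_vec
    return combined / np.linalg.norm(combined)
```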

5 Run-time Performance

The performance of RI K-tree has been measured when operating in main memory. The concern is with the performance of the clustering algorithm. Efficiency was not taken into account when indexing or loading the final representation into memory.

All performance figures are for processing all 2,666,190 XML Wikipedia documents. The RI operations took a total of 1860 seconds for the entity text representation. The randomly selected lower dimensional space had 1000 dimensions. The run time of the K-tree algorithm varies between 1200 and 1500 seconds depending on the tree order selected, between 15 and 50. This includes the process of re-inserting all vectors to their nearest neighbour leaves upon completion of the tree building process. This produces clustering at many different granularities at once. Table 1 lists the different sized clusters found by trees of order 20, 40, 60, 80 and 100, where m is the tree order.

Level   m = 20   m = 40   m = 60   m = 80   m = 100
1           12        3        8        3        92
2          111       89      356      129      2011
3          542     1260     5610     3090     53174
4         2161    12865    89612    67794
5         8529   154934
6        37230
7       197299

Table 1. K-tree Clusters

6 Experimental Setup

The Random Indexing (RI) K-tree has been used to cluster all 2,666,190 XML documents in the INEX 2009 Wikipedia collection. Bag of words and tags representations were used to create different clusters. Both representations were also combined to create a third set of clusters. Clusters were created as close as possible to the 100, 500, 1000, 5000 and 10000 clusters required for evaluation. The RI K-tree produces clusters in an unsupervised manner where the exact number of clusters can not be precisely controlled. It is determined by the tree order and the randomised seeding process. The algorithm produces clusters of many sizes in a single pass. The desired clustering granularity is selected by choosing a particular level in the tree.

Random Indexing (RI) is an efficient dimensionality reduction technique that projects points in a high dimensional space onto a randomly selected lower dimensional space. It is able to preserve the topology of the points. In the context of document representation, topology preserving dimensionality reduction is preserving document meaning, or at least this is the conjecture which we test here. The RI projection produces dense document vectors that work well with the K-tree algorithm.

Cluster quality has been measured with two metrics this year. Purity is a commonly used metric and it is measured against an external ground truth. In the case of the INEX 2009 collection, the categories were created by YAWN. Purity is the fraction of documents with the majority category in a cluster. Micro purity is the average across all clusters in a solution where each cluster's contribution is weighted by the fraction of documents it contains from the whole collection. Thus, smaller clusters have less influence and larger clusters have more influence on the average. Normalised Cumulative Cluster Gain (NCCG) is a new measure based on relevance judgments from search queries in the ad-hoc track. The ad-hoc track at INEX provides the most relevant documents for each topic based on manual human evaluations. Given the relevant results, an oracle cluster ranking system can be built, where clusters are sorted in descending order by the number of relevant documents they contain. NCCG measures the spread of relevant documents over the clusters. A score of one is achieved if all relevant documents appear in the first cluster and a score of zero is achieved if relevant documents are evenly spread across all clusters. NCCG rewards placing all relevant documents together. Therefore, it is testing the clustering hypothesis that states that relevant documents for a query tend to cluster together.

7 Experimental Results

Table 2 lists micro purity and NCCG scores for all submissions that clustered the full INEX 2009 collection. The table is split into sections corresponding to the required cluster sizes specified for the track. The RI K-tree, using the entity text representation, is clearly the best approach when it comes to finding high purity clusters using an approach that can scale to the full collection at all cluster sizes. The NCCG metric for collection selection favours the combination of entity text and tags over either representation alone. It changes the ordering of results when compared to the traditional ground truth based approach. The C3M based approach produced higher quality clusters with respect to the NCCG metric on two occasions, at 100 and 10,000 clusters.


Guyon et al. [15] argue that the context of clustering needs to be taken into account during evaluation. The evaluation of this INEX task tests the clustering hypothesis specifically in an information retrieval setting. Clustering is intended to facilitate document distribution and collection selection for ad-hoc retrieval, and it is tested in that setting. This differs greatly from evaluation where authors assign categories to documents and the categories are then used as the ground truth for the evaluation of clustering. Guyon et al. [15] argue, and we agree, that ground truth based evaluations are unsound. This is particularly true when it comes to an information retrieval setting where the number of potential topics (clusters) is virtually unconstrained. It is a virtually impossible task to compare alternative clustering possibilities by inspecting large numbers of documents in clusters. In contrast, the evaluation of topics represented as queries in an ad-hoc retrieval system achieves high levels of inter-judge agreement. These relevance judgments have been the backbone of ad-hoc information retrieval system evaluations for many years. They have also been exposed to criticism and review by many of the top researchers in the field. By exploiting this high quality, human generated information, we can have great confidence that we are testing clustering in the context of its use. The context is specifically clustering of documents in an information retrieval setting.

Guyon et al. go as far as to say "In our opinion, this approach [ground truth approach] is dangerous. The underlying assumption is that points with the same class labels form clusters. This might be the case for some data sets but not for others.". If the ground truth reflected the application of clustering in an information retrieval context, then the scores would agree between purity and NCCG. However, they do not. Therefore, we argue that the NCCG scores based on ad-hoc queries are more meaningful in an information retrieval setting.

Relevance of documents to queries can also be derived from click-through data in an operational search engine. This provides a potential mountain of relevance judgments.

8 Conclusion

In conclusion, the RI K-tree provided a scalable approach to clustering at multiple granularities in a single pass with quality comparable to other approaches. The hypothesis that combining entity text and tag based representations will improve quality held true for the new ad-hoc based evaluation. Furthermore, the evaluation provided insights into why it is important to take the context of use into account when evaluating clustering.


Method                            Clusters   Micro Purity   NCCG

RI K-tree Text                    88         0.1744         0.7859
RI K-tree Tags                    99         0.1427         0.7851
RI K-tree Text and Tags           105        0.1450         0.8003
C3m Content Only (bildb)          101        0.1566         0.8205

RI K-tree Text                    420        0.1918         0.6770
RI K-tree Tags                    477        0.1526         0.7546
RI K-tree Text and Tags           509        0.1668         0.7330

RI K-tree Text                    1009       0.2140         0.6450
RI K-tree Tags                    1026       0.1699         0.7021
RI K-tree Text and Tags           963        0.1690         0.7092
C3m Content Only (bildb)          1001       0.1617         0.6614

RI K-tree Text                    2450       0.2136         0.6100
RI K-tree Tags                    2407       0.1769         0.6348
RI K-tree Text and Tags           2536       0.1928         0.6575
BM25 BicMsGrowingKMeans (mark)    2263       0.1698         0.6349

RI K-tree Text                    4914       0.2384         0.5581
RI K-tree Tags                    4993       0.2020         0.5729
RI K-tree Text and Tags           4978       0.2038         0.6003

RI K-tree Text                    9725       0.2719         0.4736
RI K-tree Tags                    10453      0.2321         0.5274
RI K-tree Text and Tags           9896       0.2509         0.5492
C3m Content Only (bildb)          10001      0.1942         0.6035
BM25 BicMsGrowingKMeans (mark)    12636      0.2416         0.5885

Table 2. Micro Purity and NCCG scores for submissions clustering the full collection



References

1. K-tree project page. http://ktree.sourceforge.net (2009)
2. Geva, S.: K-tree: a height balanced tree structured vector quantizer. In: Proceedings of the 2000 IEEE Signal Processing Society Workshop on Neural Networks for Signal Processing X. Volume 1. (2000) 271–280
3. De Vries, C., Geva, S.: Document clustering with K-tree. In: Advances in Focused Retrieval: 7th International Workshop of the Initiative for the Evaluation of XML Retrieval, INEX 2008, Dagstuhl Castle, Germany, December 15-18, 2008. Revised and Selected Papers (2009) 420–431
4. De Vries, C., Geva, S.: K-tree: large scale document clustering. In: SIGIR '09: Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, New York, NY, USA, ACM (2009) 718–719
5. Sahlgren, M.: An introduction to random indexing. In: Methods and Applications of Semantic Indexing Workshop at the 7th International Conference on Terminology and Knowledge Engineering, TKE 2005 (2005)
6. Deerwester, S., Dumais, S., Furnas, G., Landauer, T., Harshman, R.: Indexing by latent semantic analysis. Journal of the American Society for Information Science 41(6) (1990) 391–407
7. Johnson, W., Lindenstrauss, J.: Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics 26 (1984) 189–206
8. Dasgupta, S., Gupta, A.: An elementary proof of the Johnson-Lindenstrauss lemma. Random Structures & Algorithms 22(1) (2002) 60–65
9. Achlioptas, D.: Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences 66(4) (2003) 671–687
10. Bingham, E., Mannila, H.: Random projection in dimensionality reduction: applications to image and text data. In: KDD '01: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, ACM (2001) 245–250
11. Kanerva, P.: The spatter code for encoding concepts at many levels. In: ICANN94, Proceedings of the International Conference on Artificial Neural Networks (1994)
12. Plate, T.: Distributed representations and nested compositional structure. PhD thesis (1994)
13. De Vries, C., De Vine, L., Geva, S.: Random indexing K-tree. In: ADCS09: Australian Document Computing Symposium 2009, Sydney, Australia (2009)
14. Robertson, S., Jones, K.: Simple, proven approaches to text retrieval. Update (1997)
15. Guyon, I., von Luxburg, U., Williamson, R.: Clustering: Science or Art?



Chapter 11
Overview of the INEX 2010 XML Mining track: Clustering and Classification of XML Documents

For this paper I completed almost all tasks related to the contributions. Unlike the previous INEX overview paper, I was involved in organizing both the clustering and the classification tasks. I wrote the manuscript, created the plots, performed data preparation and analysis, aided in the organization of the XML Mining track, wrote the evaluation software (http://mloss.org/software/view/468/), proposed and implemented the technique to extract categories from the Wikipedia graph, and proposed and implemented the divergence from a random baseline approach to identify pathological clusterings.

11.1 Evaluation

This paper continues the evaluation of document clustering at INEX. It also introduces a new method for extracting categories from the Wikipedia category graph.




Overview of the INEX 2010 XML Mining Track: Clustering and Classification of XML Documents

Christopher M. De Vries1, Richi Nayak1, Sangeetha Kutty1, Shlomo Geva1, Andrea Tagarelli2

1 Faculty of Science and Technology, Queensland University of Technology, Brisbane, Australia
2 University of Calabria, Italy

[email protected] {r.nayak, s.kutty, s.geva}@[email protected]

Abstract. This report explains the objectives, datasets and evaluation criteria of both the clustering and classification tasks set in the INEX 2010 XML Mining track. The report also describes the approaches and results obtained by participants.

Key words: XML document mining, INEX, Wikipedia, Structure, Content, Clustering, Classification

1 Introduction

The XML Document Mining track was launched for exploring two main ideas: (1) identifying key problems and new challenges of the emerging field of mining semi-structured documents, and (2) studying and assessing the potential of Machine Learning (ML) techniques for dealing with generic ML tasks in the structured domain, i.e., classification and clustering of semi-structured documents. This track has run for six editions during INEX 2005, 2006, 2007, 2008, 2009 and 2010. The first five editions have been summarized in [1,2,3,4] and we focus here on the 2010 edition.

INEX 2010 included two tasks in the XML Mining track: (1) an unsupervised clustering task and (2) a semi-supervised classification task where documents are organized in a graph. The clustering task requires the participants to group the documents into clusters without any knowledge of category labels, using an unsupervised learning algorithm. On the other hand, the classification task requires the participants to label the documents in the dataset with known categories, using a supervised learning algorithm and a training set. This report gives the details of the clustering and classification tasks.

2 Corpus

Working with XML documents is a particularly challenging task for ML and IR. XML documents are defined by their logical structure and content. The current Wikipedia collection contains structure as (1) document structure such as sections, titles and tables, (2) semantic structure as entities mined by YAWN, and (3) navigation structure as document to document links. In 2008 and 2009 the classification task focused on exploiting the link structure of the Wikipedia, and it continues to do so this year. The clustering task has continued in the same manner as previous years and uses any available content or structure.

A 146,225 document subset of the INEX XML Wikipedia collection was used as the data set for the clustering and classification tasks. The subset is determined by the reference run used for the ad hoc track. The reference run contains the 1500 highest ranked documents for each of the queries in the ad hoc track. The queries were searched using an implementation of Okapi BM25 in the ANT search engine. Using the reference run reduced the collection from 2,666,190 to 146,225 documents. This is a new approach for selecting the XML Mining subset. In previous years it was selected by choosing documents from Wikipedia portals.

The clustering evaluation uses ad hoc relevance judgements, and most of the relevant documents are contained in the subset. Table 1 contains details of the relevant documents missing from the subset. The reference run contains approximately 90 percent of the relevant documents.

Topic     Relevant  Missing        Topic     Relevant  Missing
2010003   231       24 (10.39%)    2010035   16        3 (18.75%)
2010004   124       29 (23.39%)    2010036   94        0 (0.00%)
2010006   151       20 (13.26%)    2010037   11        0 (0.00%)
2010007   49        6 (12.24%)     2010038   433       8 (1.85%)
2010010   251       6 (2.39%)      2010039   138       0 (0.00%)
2010014   64        7 (10.94%)     2010040   60        3 (5.00%)
2010016   506       72 (14.23%)    2010041   35        0 (0.00%)
2010017   5         0 (0.00%)      2010043   130       11 (8.46%)
2010018   34        0 (0.00%)      2010045   159       60 (37.74%)
2010019   6         0 (0.00%)      2010046   53        0 (0.00%)
2010020   34        0 (0.00%)      2010047   18        0 (0.00%)
2010021   203       28 (13.79%)    2010048   72        11 (15.28%)
2010023   115       31 (26.96%)    2010049   42        6 (14.29%)
2010025   19        0 (0.00%)      2010050   147       5 (3.40%)
2010026   54        5 (9.26%)      2010054   292       42 (14.38%)
2010027   77        4 (5.19%)      2010056   269       37 (13.75%)
2010030   80        32 (40.00%)    2010057   74        0 (0.00%)
2010031   18        1 (5.56%)      2010061   13        0 (0.00%)
2010032   23        2 (8.70%)      2010068   222       2 (0.90%)
2010033   134       16 (11.94%)    2010069   358       82 (22.91%)
2010034   115       9 (7.83%)

Total     5451      587 (10.77%)

Table 1. Relevant Documents Missing from the XML Mining Subset



3 Categories

In previous years, document categories were selected using Wikipedia portals, where each portal becomes a category. The drawback of this approach is that it only finds categories for documents related to portals. Last year the categories used for clustering evaluation were produced by YAWN, which creates categories based on entities found in the YAGO ontology. These categories are very fine grained and narrow and were found not to be particularly useful.

A new approach for extracting categories was taken this year. The Wikipedia categories listed for each document are very similar to the YAGO categories, as YAGO contains entities based on Wikipedia information. Both the Wikipedia and YAGO categories are noisy and very fine grained. However, the Wikipedia categories exist in a category graph where there are 24 high level topical categories called the “main topic classifications” (http://en.wikipedia.org/wiki/Category:Main_topic_classifications). Unfortunately, the category graph is not a hierarchy and contains cycles. Many of the paths from a document to the main topic classifications do not make sense. Additionally, users who add categories to Wikipedia pages often attach them to fine grained categories in the graph. They may not realize what links the internal structure of the graph contains when choosing particular categories. The category graph can also be changed over time, altering the original intent of the author. Therefore, categories were extracted by finding the shortest paths through the graph between a document and any of the main topic classifications. This is motivated by Occam's Razor, where the simplest explanation is often the correct one. Figure 1 illustrates the Wikipedia category graph and highlights a hypothetical shortest category path for the document Hydrogen.

For INEX 2010 the category graph from the 22nd of June 2010 Wikipedia dump was used. The graph consists of Wikipedia pages with the “Category:” prefix, such as “Category:Science”. The graph is extracted by finding links between category pages. Generally speaking, a category page links to another category page that is broader in scope. Wikipedia pages indicate their categories by linking to a category page.

Figure 2 lists the algorithm used to extract the categories. The INEX 2010 categories were extracted using only the 2 broadest levels of categories (t = 2). Only categories containing more than 3000 documents were used. This approach extracts multiple categories for a document, resulting in a multi-label set of documents for INEX 2010. Note that paths containing the “Category:Hidden” category were not used. Table 2 lists the categories that were extracted.

In Figure 2, P is the set of Wikipedia pages (articles) to find categories for. C is the set of categories in the Wikipedia. M is the set of categories in the main topic classifications. G = (V,E) is the Wikipedia category graph consisting of a set of vertices V and edges E, where the vertices consist of pages P and categories C, with P ⊂ V, C ⊂ V, M ⊂ V and M ⊂ C. Moreover, t is a parameter indicating the broadest t levels to consider as categories; if t is 1 then only the main topic classifications are considered; if t is 2 then the main topic classifications and any categories 1 edge away in the graph are considered, and so on.

Fig. 1. Complicated and Noisy Wikipedia Category Graph

Note that a path is a sequence of graph vertices visited from page p ∈ P to main topic m ∈ M. For example, Hydrogen → Category:Elements → Category:Chemistry → Category:Science is the hypothetical path for the Wikipedia document Hydrogen.

ExtractCategories(G, M, P, t)
1  E = a map from page p ∈ P to a list of categories for p
2  for p ∈ P
3      S = the set of shortest paths between p and any category in M
4      for s ∈ S
5          if path s does not contain Category:Hidden
6              B = the set of last t vertices in path s
7              for b ∈ B
8                  append b to list E[p]
9  return E

Fig. 2. Algorithm to Extract Categories from the Wikipedia
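A minimal Python sketch of this procedure is given below, assuming the category graph is available as a networkx directed graph whose edges point from a page or category to the broader categories it links to. The names are illustrative, and this follows one reading of line 3 (all shortest paths to each main topic); it is not the track's actual implementation.

import networkx as nx

def extract_categories(G, main_topics, pages, t=2):
    """Sketch of Figure 2: map each page to the t broadest categories
    on its shortest paths to the main topic classifications."""
    E = {p: [] for p in pages}
    for p in pages:
        for m in main_topics:
            if not G.has_node(p) or not nx.has_path(G, p, m):
                continue
            for s in nx.all_shortest_paths(G, p, m):
                if "Category:Hidden" in s:
                    continue
                # The last t vertices on the path are the broadest.
                E[p].extend(s[-t:])
    return E

In this sketch, duplicate categories accumulated from multiple shortest paths, and the threshold of more than 3000 documents per category, would be handled in a post-processing step.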

The category extraction process could be enhanced in the future using frequent pattern mining to find interesting repeated sequences in the shortest paths. Other graph algorithms, such as the Minimum Spanning Tree algorithm, could be used to simplify the graph. The browsable category tree starting at the “main topic classifications” appears to have processed the category graph as well. Using this post-processed graph could also improve the categories.

Category            Documents    Category             Documents
People              48186        Agriculture          5975
Society             34912        Education            4367
Culture             27986        Companies            4314
Geography           22747        Biology              4309
Politics            18519        Recreation           4276
Humanities          14738        Environment          4216
Countries           13966        Musical culture      4195
Arts                11979        Geography stubs      4052
History             10821        Information          3919
Business            10249        American musicians   3845
Applied sciences    9278         Language             3764
Life                9018         Literature           3660
Technology          8920         Belief               3412
Entertainment       8887         Creative works       3395
Nature              7400         Human geography      3370
Science             7311         Places               3202
Computing           6835         Law                  3156
Health              6329         Cultural history     3117

Table 2. XML Mining Categories

4 Clustering Task

The task was to utilize unsupervised machine learning techniques to group the documents into clusters. Participants were asked to submit multiple clustering solutions containing 50, 100, 200, 500 and 1000 clusters. The extracted set contained 36 categories, due to only using categories with more than 3000 documents. This choice was arbitrary, and the decision on cluster sizes was made based on the number of documents in the collection before the categories were extracted. As there are not really 36 “true” categories, a direct comparison of 36 clusters with 36 categories is not necessary. The number of categories in a document collection is extremely subjective. Measuring how the categories behave over multiple cluster sizes indicates the quality of clusters, and the trend can be visualized.

4.1 Clustering Evaluation Measures

The clustering solutions are evaluated by two means. Firstly, we utilize the categories-to-clusters evaluation, which assumes that the categorization of the documents in a sample is known (i.e., each document has known category labels). Any clustering of these documents can be evaluated with respect to this predefined categorization. It is important to note that the category labels are not used in the process of clustering, but only for the purpose of evaluation of the clustering results.

The standard measures of Purity, Entropy, NMI and F1 are used to determine the quality of clusters with regard to the categories. Negentropy [5] is also used. It measures the same system property as Entropy, but it is normalized and inverted so scores fall between 0 and 1, where 0 is the worst and 1 is the best. The evaluation measures the mapping of categories to clusters where the categories are multi-label but the clusters are not. A document can have multiple categories but can only belong to one cluster. Each measure is defined to deal with a multi-label ground truth.

Purity. The standard criterion of purity is used to determine the quality of clusters by measuring the extent to which each cluster contains documents primarily from one category. The simplicity and popularity of this measure mean that it was used as the only evaluation measure for the clustering task at INEX 2006 and INEX 2009. In general, the larger the value of purity, the better the clustering solution.

Let ω = {w_1, w_2, ..., w_K} denote the set of clusters for the dataset D and ξ = {c_1, c_2, ..., c_J} represent the set of categories. The purity of a cluster w_k is defined as:

P(w_k) = \frac{\max_j |w_k \cap c_j|}{|w_k|}    (1)

where w_k is the set of documents in cluster w_k and c_j is the set of documents that occur in category c_j. The numerator is the number of documents in cluster k that occur most often in a single category j, and the denominator is the number of documents in cluster w_k.

The purity of the clustering solution ω can be calculated based on micro-purity and macro-purity. Micro-purity of the clustering solution ω is obtained as a weighted sum of the individual cluster purities. Macro-purity is the unweighted arithmetic mean based on the total number of categories [5].

\text{Micro-Purity}(\omega, \xi) = \frac{\sum_{k=0}^{K} P(w_k) \cdot |w_k|}{\sum_{k=0}^{K} |w_k|}    (2)

\text{Macro-Purity}(\omega, \xi) = \frac{\sum_{k=0}^{K} P(w_k)}{J}    (3)
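A small Python sketch of Equations 1 to 3, assuming the categories-to-clusters assignment has been reduced to a contingency matrix n[k, j] counting the documents of category j in cluster k; the names are illustrative and, with multi-label categories, the true cluster sizes |w_k| should be supplied rather than taken as row sums.

import numpy as np

def purity_scores(n, cluster_sizes=None):
    """n: K x J contingency matrix, n[k, j] = documents of category j
    in cluster k."""
    n = np.asarray(n, dtype=float)
    if cluster_sizes is None:
        cluster_sizes = n.sum(axis=1)   # only exact for single-label data
    p = n.max(axis=1) / cluster_sizes                          # Eq. (1)
    micro = (p * cluster_sizes).sum() / cluster_sizes.sum()    # Eq. (2)
    macro = p.sum() / n.shape[1]                               # Eq. (3)
    return micro, macro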

Entropy. Entropy is used to measure the distribution of the documents over the various categories. Given a particular cluster ω_k of size n_k, the entropy of this cluster is defined to be:

E(\omega_k) = -\frac{1}{\log J} \sum_{j=1}^{J} \frac{n_k^j}{n_k} \log \frac{n_k^j}{n_k}    (4)



where J is the number of categories in the dataset, and n_k^j is the number of documents of the jth category that were assigned to the kth cluster [6]. The clustering solution can then be measured by the sum of the individual cluster entropies weighted according to cluster size, as defined below:

\text{Entropy} = \sum_{k=1}^{K} \frac{n_k}{n} E(\omega_k)    (5)

where n is the total number of documents.

Entropy is scaled from 0 to 1. A perfect clustering solution will have an entropy value of 0.

F1-measure. Another standard measure that is used to evaluate the clustering solution is the F1-measure. It accounts not only for the number of documents that are correctly grouped together in a cluster, but also for the number of documents that are misclassified with respect to the cluster.

In order to calculate the F1-measure, three types of decisions are used. Among them there are two types of correct decisions: True Positives (TP) and True Negatives (TN). A TP decision assigns two similar documents to the same cluster; a TN decision assigns two dissimilar documents to different clusters. On the other hand, a False Positive (FP) is an error decision that assigns two dissimilar documents to the same cluster [7]. Though there is another error decision, FN, which assigns two similar documents to different clusters, it is not used in calculating the F1-measure.

Using the TP, TN and FP decisions, the precision and the recall for the micro-F1 are defined as:

\text{precision}_{\text{micro-F1}} = \frac{\sum_{j=1}^{J} TP_j}{\sum_{j=1}^{J} TP_j + FP_j}    (6)

\text{recall}_{\text{micro-F1}} = \frac{\sum_{j=1}^{J} TP_j}{\sum_{j=1}^{J} TP_j + TN_j}    (7)

The precision and the recall for the macro-F1 are defined as

\text{precision}_{\text{macro-F1}} = \frac{\sum_{j=1}^{J} \frac{TP_j}{TP_j + FP_j}}{J}    (8)

\text{recall}_{\text{macro-F1}} = \frac{\sum_{j=1}^{J} \frac{TP_j}{TP_j + TN_j}}{J}    (9)

where TP_j is the number of documents in category c_j that exist in cluster w_k, FP_j is the number of documents that are not in category c_j but exist in cluster w_k, and TN_j is the number of documents that are in category c_j but do not exist in cluster w_k.

F1 can now be defined as:

F1 = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}    (10)



\text{Micro-F1} = \frac{2 \times \text{precision}_{\text{micro-F1}} \times \text{recall}_{\text{micro-F1}}}{\text{precision}_{\text{micro-F1}} + \text{recall}_{\text{micro-F1}}}    (11)

\text{Macro-F1} = \frac{2 \times \text{precision}_{\text{macro-F1}} \times \text{recall}_{\text{macro-F1}}}{\text{precision}_{\text{macro-F1}} + \text{recall}_{\text{macro-F1}}}    (12)

Normalized Mutual Information (NMI). Another evaluation measure is the Normalized Mutual Information (NMI), which helps to identify the trade-off between the quality of the clusters and the number of clusters [7].

NMI [7] is defined as:

\text{NMI}(\omega, \xi) = \frac{I(\omega; \xi)}{[H(\omega) + H(\xi)]/2}    (13)

I(\omega; \xi) = \sum_k \sum_j P(w_k \cap c_j) \log \frac{P(w_k \cap c_j)}{P(w_k) P(c_j)} = \sum_k \sum_j \frac{|w_k \cap c_j|}{N} \log \frac{N |w_k \cap c_j|}{|w_k| |c_j|}    (14)

where P(w_k), P(c_j) and P(w_k ∩ c_j) indicate the probabilities of a document being in cluster w_k, in category c_j, and in both w_k and c_j, respectively, and N is the number of documents.

H(ω) is the measure of uncertainty given by:

H(\omega) = -\sum_k P(w_k) \log P(w_k) = -\sum_k \frac{|w_k|}{N} \log \frac{|w_k|}{N}    (15)
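A companion Python sketch of Equations 13 to 15 over the same hypothetical contingency matrix; again an illustration rather than the track's evaluation software.

import numpy as np

def nmi(n):
    """n: K x J contingency matrix over N documents (Eq. 13-15)."""
    n = np.asarray(n, dtype=float)
    N = n.sum()
    pk = n.sum(axis=1) / N                      # P(w_k)
    pj = n.sum(axis=0) / N                      # P(c_j)
    pkj = n / N                                 # P(w_k ∩ c_j)
    mask = pkj > 0                              # treat 0 log 0 as 0
    i = (pkj[mask] * np.log(pkj[mask] / np.outer(pk, pj)[mask])).sum()
    h = lambda p: -(p[p > 0] * np.log(p[p > 0])).sum()
    return i / ((h(pk) + h(pj)) / 2.0)          # Eq. (13)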

Collection selection evaluation using the NCCG measure. This evaluation measure was used in evaluating the INEX 2009 dataset [4] and is based on Van Rijsbergen's clustering hypothesis. Van Rijsbergen and his co-workers [8] conducted an intensive study of the clustering hypothesis in information retrieval, which states that documents which are similar to each other may be expected to be relevant to the same requests; dissimilar documents, conversely, are unlikely to be relevant to the same requests. If the hypothesis holds true, then relevant documents will appear in a small number of clusters, and the document clustering solution can be evaluated by measuring the spread of relevant documents for a given set of queries.

To test this hypothesis on a real-life dataset, the INEX 2009 dataset, the clustering task was evaluated by determining the quality of clusters relative to the optimal collection selection [4]. Collection selection involves splitting a collection into subsets and recommending which subset needs to be searched for a given query. This allows a search engine to search fewer documents, resulting in improved runtime performance over searching the entire collection.

The evaluation of collection selection was conducted using the manual query assessments for a given set of queries from the INEX 2009 Ad Hoc track [4]. The manual query assessment is called the relevance judgment in Information Retrieval (IR) and has been used to evaluate ad hoc retrieval of documents. It involves defining a query based on an information need, a search engine returning results for the query, and humans judging whether the results returned by the search engine are relevant to the information need.



Better clustering solutions in this context will tend, on average, to group together relevant results for previously unseen ad hoc queries. Real ad hoc retrieval queries and their manual assessment results are utilised in this evaluation. This approach evaluates clustering solutions relative to a very specific objective – clustering a large document collection in an optimal manner in order to satisfy queries while minimising the search space. The metric used for evaluating collection selection is called the Normalized Cumulative Cluster Gain (NCCG) [4].

NCCG calculates the score of the best possible collection selection according to a given clustering solution of n clusters. The score is better when the query result set contains more cohesive clusters. The Cumulative Gain of a Cluster (CCG) is calculated by counting the number of documents of the cluster that appear in the relevant set returned for a topic by the manual assessors.

\text{CCG}(c, t) = \sum_{i=1}^{n} \text{Rel}_i    (16)

For a clustering solution for a given topic, a (sorted) vector CG is created, representing each cluster by its CCG value. Clusters containing no relevant documents are represented by a value of zero. The cumulated gain for the vector CG is calculated, which is then normalized by the ideal gain vector. Each clustering solution c is scored for how well it has split the relevant set into clusters using CCG for the topic t.

\text{SplitScore}(t, c) = \frac{\sum_{i=1}^{|CG|} \text{cumsum}(CG)_i}{n_r^2}    (17)

where n_r is the number of relevant documents in the returned result set for topic t.

The worst possible split is assumed to place each relevant document in a distinct cluster. Let CG1 be the vector that contains the cumulative gain of each such single-document cluster.

\text{MinSplitScore}(t, c) = \frac{\sum_{i=1}^{|CG1|} \text{cumsum}(CG1)_i}{n_r^2}    (18)

The normalized cluster cumulative gain (nCCG) for a given topic t and a clustering solution c is given by:

\text{nCCG}(t, c) = \frac{\text{SplitScore}(t, c) - \text{MinSplitScore}(t, c)}{1 - \text{MinSplitScore}(t, c)}    (19)

The mean and the standard deviation of the nCCG score over all the topics for a clustering solution are then calculated:

\text{Mean}(\text{nCCG}(c)) = \frac{\sum_{t=0}^{n} \text{nCCG}(t, c)}{\text{Total Number of topics}}    (20)

\text{Std Dev}(\text{nCCG}(c)) = \sqrt{\frac{\sum_{t=0}^{n} [\text{nCCG}(t, c) - \text{Mean}(\text{nCCG}(c))]^2}{\text{Total Number of topics}}}    (21)

The NCCG value varies from 0 to 1. A larger value of NCCG for a given clustering solution is better, since it indicates that a larger number of relevant documents are clustered together rather than spread across clusters. Further details of this metric can be found in [4].
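The sketch below is a literal transcription of Equations 16 to 19 for a single topic, under one reading of the definitions (CG sorted in descending order, and the worst case taken as n_r singleton clusters); the official evaluation software may treat empty clusters and sorting differently.

import numpy as np

def nccg(clusters, relevant):
    """clusters: list of sets of document ids, one set per cluster;
    relevant: set of document ids judged relevant for the topic."""
    nr = len(relevant)
    # CCG per cluster (Eq. 16), sorted descending into the vector CG.
    cg = np.sort([len(c & relevant) for c in clusters])[::-1]
    split_score = np.cumsum(cg).sum() / nr ** 2         # Eq. (17)
    # Worst split: each relevant document in a distinct cluster (CG1).
    cg1 = np.ones(nr)
    min_split = np.cumsum(cg1).sum() / nr ** 2          # Eq. (18)
    return (split_score - min_split) / (1 - min_split)  # Eq. (19)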

Divergence from Random. Most measures of cluster quality can be tricked by changing the number of clusters or documents in the submission. The Purity and Entropy measures can be fooled by placing each document in its own cluster: every cluster becomes pure because it only contains one document. The NCCG measure can be fooled by creating one cluster with almost all the documents and making every other cluster contain a single document. The NCCG measure orders clusters by the number of relevant documents they contain. A large cluster containing most documents will almost always be ranked first. Therefore, almost all relevant documents will exist in one cluster, achieving the highest score possible.

Any measure that can be tricked by a pathological clustering solution can be adjusted by subtracting the score of a uniformly random solution that has the same number of clusters and the same number of documents in each cluster. Apart from how documents are assigned to clusters, the random baseline appears the same as the real solution. Therefore, a uniform random baseline is generated for each submitted solution. This is done by shuffling the document IDs uniformly at random and splitting them into clusters of the same sizes as the solution being measured. The score of the uniform random solution is then subtracted from the score of the matching submission. The graphs and tables in the following section contain the results for all metrics where this approach was taken.
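A minimal sketch of the baseline construction, assuming a clustering is represented as a list of document-ID lists; score_fn stands in for any of the metrics above.

import random

def divergence_from_random(clusters, score_fn, seed=42):
    """Score of a clustering minus the score of a random clustering
    with the same cluster-size profile."""
    ids = [d for c in clusters for d in c]
    rng = random.Random(seed)
    rng.shuffle(ids)
    # Split the shuffled ids into clusters of the same sizes.
    baseline, i = [], 0
    for c in clusters:
        baseline.append(ids[i:i + len(c)])
        i += len(c)
    return score_fn(clusters) - score_fn(baseline)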

The submissions this year from BUAP contained several large clusters and many other small clusters. This tricked the NCCG metric into giving arbitrarily high scores. When the scores are subtracted from a uniform random baseline with the same properties, they performed no better than a randomly generated solution. This can be seen in Figure 7.

4.2 Clustering Participants, Submissions and Evaluations

The clustering tasks had submissions from three participants: Peking University, BUAP and the Queensland University of Technology. The submissions labeled Random are a random solution that does not use any information about the documents. A cluster for each document is chosen uniformly at random from one of the k clusters required. Figures 4 to 7 graph the best performing submissions from each participant for Purity, Negentropy, NMI and NCCG. The divergence from random for each metric is also graphed. Figure 3 contains the legend for all these graphs.



The full details of the results are listed in tables in a separate document available from http://de-vries.id.au/inex10/full_results.pdf. The tables have been broken into sections matching the required numbers of clusters: 50, 100, 200, 500 and 1000. Some submissions were outside the 5 percent tolerance on the number of clusters. These form separate groups in the tables.

The group from Peking University [9] made a submission based on the structured link vector model (SLVM). It incorporated document structure, links and content. This year they focused on the preprocessing step for document structure and links. They modified two popular clustering algorithms, the AHC clustering algorithm and the k-means algorithm, to work with this model.

The group from BUAP [10] proposed an iterative clustering method for grouping the Wikipedia documents. The recursive clustering process iteratively brings together subsets of the complete collection by using two different clustering methods: k-star and k-means. In each iteration, they select representative items for each group, which are then used for the next stage of clustering.

The group from the Queensland University of Technology used a 1024 bit document signature representation generated by quantizing random indexing or random projections of TF-IDF vectors. The k-means algorithm was modified to cluster binary strings of data using the Hamming distance, including a different approach to calculating the means of binary vectors.

Fig. 3. Legend

Fig. 4. Micro Purity and Micro Purity Divergence versus number of clusters

Fig. 5. Micro Negentropy and Micro Negentropy Divergence versus number of clusters

Fig. 6. NMI and NMI Divergence versus number of clusters

Fig. 7. NCCG and NCCG Divergence versus number of clusters

5 Classification Task

The goal of the classification task was to utilize supervised or semi-supervised machine learning techniques to predict the categories of documents from the set of known categories described in Section 3. The training set contained 17 percent of the collection, where each category had at least 20 percent of its category labels available.

5.1 Classification Evaluation Measures

Classification is evaluated using the Type I and II errors made by classifiers. Each category is transformed into a binary classification problem. One category is evaluated at a time using all documents. The scores are calculated based on the Type I and II errors and then micro and macro averaged. Micro averaging weights the average by the category size and macro averaging does not. Table 3 defines the Type I and II errors for a category.

                            In Category           Not in Category
Predicted in Category       True Positive (tp)    False Positive (fp)
Predicted not in Category   False Negative (fn)   True Negative (tn)

Table 3. Type I and II Classification Errors

The F1, Precision and Recall scores are calculated as described in Equations 22 to 24. F1 is the harmonic mean of precision and recall.

F1 = \frac{2 \times tp}{2 \times tp + fn + fp}    (22)

\text{Precision} = \frac{tp}{tp + fp}    (23)

\text{Recall} = \frac{tp}{tp + fn}    (24)
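A short Python sketch of Equations 22 to 24 with micro and macro averaging over categories, under the common reading that micro averaging pools the per-category counts; the per-category counts are hypothetical inputs.

def f1_scores(counts):
    """counts: list of (tp, fp, fn) tuples, one per category."""
    def f1(tp, fp, fn):
        return 2 * tp / (2 * tp + fn + fp) if tp else 0.0
    # Micro: pool the counts over all categories first.
    TP = sum(tp for tp, _, _ in counts)
    FP = sum(fp for _, fp, _ in counts)
    FN = sum(fn for _, _, fn in counts)
    micro = f1(TP, FP, FN)
    # Macro: unweighted mean of the per-category F1 scores.
    macro = sum(f1(*c) for c in counts) / len(counts)
    return micro, macro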

5.2 Classification Participants, Submissions and Evaluations

Two groups, from Peking University and the Queensland University of Technology (QUT), made submissions for the classification task. The results are listed in Table 4.

The group from Peking University [9] made a submission based on the structured link vector model (SLVM). It incorporated document structure, links and content. This year they focused on the preprocessing step for document structure and links.

The group from QUT made a submission using content only to provide a baseline approach. Documents were represented in the bag of words vector space model using BM25 weighting for each term, with the tuning parameters K1 = 2 and b = 0.75. A Support Vector Machine (SVM) was used to classify each document by treating each category as a binary classification problem.
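A sketch of such a baseline, assuming scikit-learn and BM25-weighted document vectors; the toy data stands in for the real collection, and this is an illustration, not the group's actual code.

import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Toy stand-ins for BM25-weighted document vectors and multi-label
# category sets.
X_train = np.array([[1.2, 0.0, 3.1], [0.0, 2.2, 0.1], [1.0, 2.0, 0.0]])
y_train = [{"People"}, {"Science"}, {"People", "Science"}]
X_test = np.array([[1.1, 0.1, 2.9]])

mlb = MultiLabelBinarizer()
Y_train = mlb.fit_transform(y_train)

# One binary linear SVM per category (one-vs-rest).
clf = OneVsRestClassifier(LinearSVC())
clf.fit(X_train, Y_train)
print(mlb.inverse_transform(clf.predict(X_test)))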

                                 F1              Precision       Recall
Submission                       Micro   Macro   Micro   Macro   Micro   Macro
QUT BM25 SVM                     0.536   0.473   0.562   0.527   0.523   0.440
Peking tree1 sim3 linkTxt 0      0.460   0.380   0.553   0.525   0.436   0.334
Peking tree2 sim3 linkTxt N      0.518   0.446   0.436   0.359   0.652   0.614
Peking tree2 sim2 linkTxt 0      0.452   0.371   0.562   0.536   0.423   0.321
Peking tree1 sim1 linkTxt 0      0.399   0.314   0.582   0.570   0.363   0.252
Peking tree1 sim3 linkTxt 67     0.508   0.435   0.422   0.345   0.653   0.612
Peking tree2 sim2                0.521   0.452   0.480   0.414   0.574   0.510
Peking tree1 sim3                0.518   0.444   0.443   0.368   0.635   0.582
Peking tree1 sim3 linkTxt N      0.517   0.444   0.432   0.356   0.656   0.613
Peking tree1 sim2 linkTxt N      0.521   0.454   0.456   0.389   0.615   0.559

Table 4. Classification Results

6 Conclusion

The XML Mining track at INEX 2010 brought together researchers from the Information Retrieval, Data Mining, Machine Learning and XML fields. The clustering task allowed participants to evaluate clustering methods against a real use case and with significant volumes of data. The task was designed to facilitate participation with minimal effort by providing not only raw data, but also pre-processed data which can easily be used by existing clustering software. The classification task allowed participants to explore algorithmic, theoretical and practical issues regarding the classification of interdependent XML documents.



References

1. Denoyer, L., Gallinari, P., Vercoustre, A.: Report on the XML mining track at INEX 2005 and INEX 2006. Comparative Evaluation of XML Information Retrieval Systems (2007) 432–443
2. Denoyer, L., Gallinari, P.: Report on the XML mining track at INEX 2007: categorization and clustering of XML documents. In: ACM SIGIR Forum. Volume 42., ACM (2008) 22–28
3. Denoyer, L., Gallinari, P.: Overview of the INEX 2008 XML mining track. Advances in Focused Retrieval (2009) 401–411
4. Nayak, R., De Vries, C., Kutty, S., Geva, S., Denoyer, L., Gallinari, P.: Overview of the INEX 2009 XML mining track: Clustering and classification of XML documents. Focused Retrieval and Evaluation (2010) 366–378
5. De Vries, C., Geva, S.: Document clustering with K-tree. Advances in Focused Retrieval (2009) 420–431
6. Zhao, Y., Karypis, G.: Criterion functions for document clustering: experiments and analysis. Technical Report 01-40, Department of Computer Science (November 2001)
7. Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press, NY, USA (2008)
8. van Rijsbergen, C.J.: Information Retrieval. Butterworths (1979)
9. Wang, S., Liang, F., Yang, J.: PKU at INEX 2010 XML mining track. INEX (2010)
10. Tovar, M., Cruz, A., Vazquez, B., Pinto, D., Vilarino, D.: An iterative clustering method for the XML-mining task of the INEX 2010. INEX (2010)



Chapter 12
TopSig: topology preserving document signatures

In this paper I contributed all sections related to document clustering with TopSig document signatures. I proposed and implemented the variant of k-means that performs optimization directly on the binary signatures produced by TopSig. Outside of the sections related to document clustering, I also contributed to the writing of the manuscript, experimental design, execution of experiments, and data analysis.

12.1 Representations

This paper introduces the TopSig document signature representation. TopSig produces binary strings, bit vectors or document signatures by applying random indexing or random projections and numeric quantization to document vectors. This creates a dense compressed representation of documents. The compressed binary representation of TopSig is one of the factors that allows the improvement of the computational efficiency of the clustering approaches outlined in this thesis. The binary representation allows efficient processing on 64-bit processors by processing 64 dimensions of each bit vector in a single CPU instruction. The compressed nature of the representation also means less data has to be processed than with other vector representations for document clustering, further improving the performance and scalability of clustering approaches.



12.2 Algorithms

This paper introduces a modified version of the k-means algorithm that works directly with the binary string representation of TopSig. All cluster centers and documents are binary strings. This allows the Hamming distance to be used for comparison, which can be computed 64 dimensions at a time on 64-bit processors. This allows a one to two order of magnitude increase in computational efficiency when compared to the sparse vector approaches used in the popular clustering toolkit, CLUTO. When using 4096 bit document signatures, the quality of the clustering is not statistically significantly different from the sparse vector approach used in CLUTO.
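A compact Python sketch of k-means over packed binary signatures, using XOR plus popcount for the Hamming distance and a per-bit majority vote as one plausible way to compute a binary “mean”; the paper's exact update rule may differ.

import numpy as np

def hamming_kmeans(sigs, k, iters=10, seed=0):
    """sigs: (n_docs, n_words) array of uint64 packed signatures."""
    rng = np.random.default_rng(seed)
    centers = sigs[rng.choice(len(sigs), k, replace=False)]
    popcount = np.vectorize(lambda x: bin(int(x)).count("1"))
    for _ in range(iters):
        # Hamming distance = popcount(xor), summed over 64-bit words.
        d = np.array([popcount(sigs ^ c).sum(axis=1) for c in centers])
        assign = d.argmin(axis=0)
        for j in range(k):
            members = sigs[assign == j]
            if len(members) == 0:
                continue
            # Majority vote per bit: unpack, average, re-pack.
            bits = np.unpackbits(members.view(np.uint8), axis=1)
            maj = (bits.mean(axis=0) >= 0.5).astype(np.uint8)
            centers[j] = np.packbits(maj).view(np.uint64)
    return assign, centers

On a 64-bit processor the XOR and popcount each cover 64 dimensions at a time, which is where the reported one to two orders of magnitude speed-up comes from.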



TopSig: Topology Preserving Document Signatures

Shlomo Geva
Faculty of Science and Technology
Queensland University of Technology
Brisbane, Australia
[email protected]

Christopher M. De Vries
Faculty of Science and Technology
Queensland University of Technology
Brisbane, Australia
[email protected]

ABSTRACT
Performance comparisons between File Signatures and Inverted Files for text retrieval have previously shown several significant shortcomings of file signatures relative to inverted files. The inverted file approach underpins most state-of-the-art search engine algorithms, such as Language and Probabilistic models. It has been widely accepted that traditional file signatures are inferior alternatives to inverted files. This paper describes TopSig, a new approach to the construction of file signatures. Many advances in semantic hashing and dimensionality reduction have been made in recent times, but these were not so far linked to general purpose, signature file based, search engines. This paper introduces a different signature file approach that builds upon and extends these recent advances. We are able to demonstrate significant improvements in the performance of signature file based indexing and retrieval, performance that is comparable to that of state of the art inverted file based systems, including Language models and BM25. These findings suggest that file signatures offer a viable alternative to inverted files in suitable settings, and from a theoretical perspective they position the file signature model in the class of Vector Space retrieval models.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—Retrieval Models, Relevance Feedback, Search Process, Clustering

General Terms
Algorithms, Experimentation, Performance, Theory

Keywords
Signature Files, Random Indexing, Topology, Quantisation, Vector Space IR, Search Engines, Document Clustering, Document Signatures

1. INTRODUCTION
Document signatures have been largely absent from mainstream IR publications about general-purpose search engines and ranking models for several years. The decline in the attention paid to this approach, which had received a lot of attention earlier, started with the publication of the paper “Inverted Files Versus Signature Files for Text Indexing” by Zobel et al [25]. This paper offers an extensive comparison between Signature Files and Inverted Files for text indexing. The authors systematically and comprehensively evaluated the Signature File and Inverted File approaches. Having examined several general approaches, they concluded that inverted files are distinctly superior to signature files. Signature files were found, in their studies, to be slower, to offer less functionality, and to require larger indexes. They concluded that Bit Sliced signature files under-perform on almost all counts and offer very little if any advantage over inverted files. Further discussion of signature files is offered in [21], and a similar picture emerges there too. It is clear from the experimental evidence that Bit Sliced signatures are not able to compete with state of the art inverted file approaches in terms of retrieval performance. Furthermore, the presumed advantages of efficient bit-wise processing and the potential for index compression are not generally achievable in practice. Signature files are found to be larger than inverted file indexes. This is perhaps surprising because Signature files were largely motivated by the desire to represent entire documents as relatively short bit strings; having fixed the signature length, the document signature is independent of actual document length. As it turns out, it is not possible to achieve competitive performance goals with compact signatures, and consequently signatures require even more space than compressed indexes.

For the sake of completeness, and since we offer a radically different approach to the construction of file signatures, it is necessary to describe the conventional approach first. Conventional Bit Sliced signature files, as described in [7], exploit efficient bit-wise operators that are available on standard digital processors. Unlike probabilistic models of IR and Language Models, the Signature File approach is presented in an ad-hoc manner and is computationally motivated by efficient bit-wise processing, without specific grounding in Information Retrieval theory. In traditional signature files a document is allocated a fixed-size signature of N bits. Each term that appears in the collection is assigned a random signature of width N, where only a small number of n << N bits are set to 1 with the use of a suitable hash function. Naturally, term signatures tend to collide on some bit positions, but this is unavoidable unless the number N is extremely large and the number n is very small. Given that the vocabulary of a document collection typically contains millions of distinct terms, collisions will occur, and frequently. The document signature in this approach is derived as the disjunction of all the term signatures within the document (bit-wise ORed). Query terms are similarly assigned a signature. A query is then evaluated by comparing the query signature to each document signature. Any document whose signature matches every bit that is set in a query term is taken as a potential match. It is only a potential match because hash collisions in generating term signatures can lead to false matches – situations where all the bits match, but the actual query term is not present in the document. Consequently, documents are retrieved and checked directly against the query to eliminate false matches. This is a very expensive operation even if the collection fits in memory, but with a disk based collection – the most likely scenario – it is prohibitively expensive.

Indeed, the method used in traditional file signatures is known in other domains as a Bloom filter. B.H. Bloom first described Bloom filters in 1970 [4], which well predates file signatures.

It is clear from the experimental evidence that Zobel et al [25] and Witten et al [21] provide that such signatures are not likely to compete with state of the art inverted file approaches in terms of retrieval performance. Furthermore, the presumed advantages of efficient bit-wise processing and the potential for index compression are not generally achieved in practice. Signature files are found to be larger than inverted file indexes.

Recent approaches to similarity search [23] have explored ideas similar to TopSig for mapping documents to N bit strings for comparison using Hamming distance. The approach taken by Zhang et al [23] and prior publications focuses on similarity comparisons between documents. Their models have not been applied to general-purpose ad-hoc retrieval. More importantly, Zhang et al [23] use a complicated approach to the static, off-line derivation of signatures, which involves supervised and unsupervised learning to generate document signatures. This in effect prevents the application of the approach to ad-hoc retrieval, where a query signature has to be derived at run-time. It is not practical in a very large collection due to the excessive computational load of supervised and unsupervised learning.

Unlike earlier attempts, we approach the design of TopSig document signatures from basic principles. TopSig is radically different from a Bloom filter in the construction of file signatures and in the manner in which the search is performed. We present results of extensive experiments, performed with large standard IR collections, where we compare TopSig with standard retrieval models such as BM25 and various Language Models. We also describe document clustering experiments that demonstrate the effectiveness of the approach relative to standard document representations for clustering.

The remainder of this paper is organised as follows. Section 2 introduces the TopSig approach in detail. Sections 3, 4 and 5 define and evaluate the use of this approach for ad-hoc retrieval and clustering. The paper is concluded with a discussion in Section 6.

2. TOPSIG
TopSig represents a radically different approach to the construction of signature files. Unlike the traditional ad-hoc approach [7], TopSig is principled, and signature files emerge naturally from a highly effective compression of the well understood and commonly used Vector Space representation of documents.

We approach the design of document signatures from the perspective of dimensionality reduction. TopSig starts from a straightforward application of a vector space representation of the collection – the term-by-document weight matrix. We then derive the signatures through extreme and lossy compression, in two steps, to produce topology preserving binary document signatures. While the actual mechanism that is proposed is highly efficient in signature construction and in searching, we first focus the discussion on the conceptual approach, its justification and theoretical grounding, while leaving the implementation and performance analysis details for later in the paper.

In this section we describe the concepts that underpin TopSig. These concepts are not new – Random Indexing and Numeric Quantisation – but when put together to form file signatures, the results are remarkable.

2.1 Random Indexing vs LSA
Latent Semantic Analysis [6] is a popular technique that is used with word space models. LSA [8] creates context vectors from a document-by-term occurrence matrix by performing Singular Value Decomposition (SVD). Dimensionality reduction is achieved through projection of the document-by-term occurrence vectors onto the subspace spanned by relatively few vectors having the largest singular values in the decomposition. This projection is optimal in the sense that it minimises the Frobenius norm of the difference between the original and the projected matrix. SVD is computationally expensive and this limits its application to large collections. For instance, in our own experiments, the SVD of a collection of 25,000 English Wikipedia articles – less than 1% of the collection – using the highly efficient parallel multi-processor implementation of the MATLAB svds function, took about 7 hours on a top-end quad-processor workstation with sufficient memory to be completely processor bound.

Random Indexing (RI) [17] is an efficient, scalable and incremental approach to dimensionality reduction. Word space models often use Random Indexing as an alternative to Latent Semantic Analysis. Both LSA and RI start from the term-by-document frequency matrix. Often term frequencies are replaced by term weights – for instance, one of the many TF-IDF variants. With LSA, Singular Value Decomposition is used to derive an optimal projection onto a lower dimensional space. Random Indexing is based on a random projection, avoiding the computational cost of matrix factorisation. Having obtained a projection matrix, both LSA and RI proceed to project the term occurrence matrix onto a subspace of significantly reduced dimensionality.

In practice, RI works with one document at a time, and one term at a time within the document. Terms are assigned random vectors, and the projected document vector is then the arithmetic sum of all term signatures within. The process is somewhat similar to the traditional signature file approach of [7], but the document vector is real valued; it is a superposition of all the random term vectors. There is no matrix factorisation and hence the process is efficient. It has linear complexity in the number of terms in a document and also in the collection size. This is a significant advantage over LSA, whose time complexity is prohibitive in large collections. As stated by Manning et al [14] in 2008, in relation to LSA – “The computational cost of the SVD is significant; at the time of this writing, we know of no successful experiment with over one million documents”.



The RI process is conceptually very different from LSA and does not carry the same optimality guarantees. At the foundation of RI is the Johnson-Lindenstrauss lemma [9]. It states that if points in a high-dimensional space are projected into a randomly chosen subspace, of sufficiently high dimensionality, then the distances between the points are approximately preserved. Although strictly speaking an orthogonal projection is ideal, nearly orthogonal vectors can be used and have been found to perform similarly [3]. These vectors are usually drawn from a random uniform distribution. This property of preserving relative distances between points is useful when comparing documents in the reduced space. RI offers dimensionality reduction at low computational cost and complexity while still preserving the topological relationships amongst document vectors under the projection.

In RI, each dimension in the original space is given a randomly generated index vector. The index vectors are high dimensional, sparse, and ternary. Sparseness is controlled via a parameter that specifies the number of randomly selected non-zero dimensions. Ternary term vectors consist of randomly and sparsely distributed +1 and -1 values in a vector that otherwise consists mostly of zeros. This choice ensures that the random vectors are nearly orthogonal.

RI can be expressed as a matrix multiplication of a randomly generated term-by-signature matrix T by a term-by-document matrix D, where R is the randomly projected term-by-document matrix:

R_{N \times d} = T_{N \times t} D_{t \times d}    (1)

Each of the d column vectors in D represents a document of dimensionality t, and each of the t column vectors in T is a randomly generated term vector of dimensionality N. R is the reduced matrix, where each of the d column vectors represents a randomly projected document vector of dimensionality N.
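A small numpy sketch of the projection in Equation 1, with sparse ternary index vectors; the dimensionality N and the number of non-zero entries here are illustrative.

import numpy as np

def random_index(D, N=1024, nonzeros=8, seed=0):
    """Project a t x d term-by-document matrix D to N x d (Eq. 1).
    Each term receives a sparse ternary index vector with a few
    randomly placed +1 and -1 entries."""
    t, d = D.shape
    rng = np.random.default_rng(seed)
    T = np.zeros((N, t))
    for term in range(t):
        idx = rng.choice(N, size=nonzeros, replace=False)
        T[idx, term] = rng.choice([-1.0, 1.0], size=nonzeros)
    return T @ D   # R = TD, the N x d randomly projected matrix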

RI has several advantages over LSA. It can be performed incrementally and online as data arrives. Any document can be indexed independently from all other documents in the collection. This eliminates the need to build and store the entire term-by-document matrix. Additionally, newly encountered terms are naturally accommodated without having to recalculate any of the projections of previously encoded documents. By contrast, LSA requires global analysis where the number of documents and terms are fixed. The time complexity of RI is also very attractive. It is linear in the number of terms in a document and hence linear in the collection size. RI makes virtually no demands on computer memory since each document is indexed in turn and the signatures are independent of each other.

TopSig deviates from Sahlgren’s basic Random Indexingby introducing term weights into the projection. In Sahlgren’sscheme, the term-by-document matrix contains unweightedterm counts. Search engine evaluation consistently showsthat unweighted term frequencies do not produce the bestperformance. Better results are obtained if the terms fre-quencies are weighted and this of course underlies the mostsuccessful search engine models, such as BM25 and Lan-guage models. The weighting of terms in TopSig is describedin Section 2.3.

Term weighting has an apparent drawback – it may appear to compromise the ability to encode new documents independently. The calculation of term weights, such as with TF-IDF or Language Models, requires global statistics. We observe however that in a large collection new documents have very little impact on these global statistics. Upon inserting a new document these global statistics are updated and the new document is encoded. As the collection grows, it is periodically re-indexed from scratch to bring all signatures into line, but this is a relatively cheap operation. On a modern multi-processor PC using the ATIRE search engine [18] we can index the entire English Wikipedia of 2.7 million documents, spanning 50 gigabytes of XML documents, in under 15 minutes.

2.2 Random Indexing and Other Approaches
Random Indexing shares many properties with other approaches. In this section we will highlight some of the more interesting properties shared with other dimensionality reduction approaches.

RI or random projections are closely related to compressed sensing from the field of signal processing. Compressed sensing is able to reconstruct signals with fewer samples than required by the Nyquist rate. Baraniuk et al [2] construct a proof showing how the Restricted Isometry Property that underlies compressed sensing is linked to the Johnson-Lindenstrauss lemma [9], which underlies RI.

A conceptually similar approach to RI is used for spread spectrum communication in Code Division Multiple Access (CDMA) [16]. In contrast to RI, CDMA uses orthogonal vectors for codes and increases the bandwidth of the signal. In CDMA, the use of random orthogonal codes allows for a division of the radio spectrum that is more resistant to noise introduced in radio frequency transmission.

Many other approaches to dimensionality reduction exist; again, many come from the field of signal processing, and many of them iteratively optimise an objective function. LSA offers the optimal linear projection for preserving the Frobenius norm. Other well known approaches include the Discrete Cosine Transform, Wavelet Transform, Non-Negative Matrix Factorisation, Principal Component Analysis and Cluster Analysis. The advantage of RI is that it still preserves the topological relationships between the vectors without having to directly optimise an objective function. This is where its computational efficiency comes from.

2.3 TopSig Signatures

Document signatures are fixed length bit patterns. In order to transform the real-valued projected term-by-document matrix into a signature matrix, we ask what numerical precision is required to represent the term-by-document matrix. It is obvious that there is no need for double precision, and one obtains identical results when evaluating search or clustering performance with single precision. One quickly finds that even when scaling the values to the range [0, 255] – i.e. a single byte – there is no appreciable difference. Even nibbles (4-bit integers) have been shown to be sufficient, with little appreciable difference in performance. This is exploited by all state of the art search engines to compress indexes. The reduction in precision still leaves the term-by-document matrix with a highly faithful representation of the similarity relationships between the original documents. Both clustering and ranking applications are concerned not with the actual similarity values, but rather with their rank order. As long as rank order is preserved, the distortion due to reduced numerical precision is not problematic.

In Section 2.1 we described how a real-valued document vector is obtained through random projection, as the sum of the random term signatures within. TopSig now takes the reduction in numerical precision to its ultimate conclusion, by taking this real-valued randomly indexed document and reducing the precision all the way down to a single bit. Binary signatures are obtained by taking only the sign bits of the projected document vectors. This is a key step in TopSig signature calculation; it may appear to be a highly excessive precision reduction, but it is in fact surprisingly effective, as we shall demonstrate with search and clustering experiments in the following sections.
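As a minimal sketch of this quantisation step (the projected vector below is a random stand-in for a randomly indexed document):

```python
import numpy as np

def to_signature(projected_doc):
    """Reduce a real-valued randomly projected vector to 1-bit precision:
    only the sign bit is kept. Zeros are encoded as positive, as described
    in Section 2.3.2."""
    return (projected_doc >= 0).astype(np.uint8)

projected = np.random.default_rng(1).normal(size=1024)  # stand-in for an RI vector
signature = to_signature(projected)                     # array of 0s and 1s
```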

2.3.1 Topological Distortion

In order to measure the impact of aggressive dimensionality reduction we conduct the following experiment. We take 1000 randomly chosen Wikipedia document vectors, in full TF-IDF representation, and compute their mutual distance matrix. Each element in the matrix represents the distance between a pair of document vectors in the full space. We then randomly project the vectors onto a lower dimensional subspace and compute the corresponding mutual distance matrix in the projection subspace. The mutual distance matrices are normalised such that the sum of elements in each matrix is equal to 1. If the mutual distances are perfectly preserved then the normalised matrices will be identical. However, with aggressive compression we expect a topological distortion due to information loss. To measure the impact, we calculate the topological distortion as the root mean squared error (RMSE) between distances in full precision and the corresponding distances at the reduced dimensionality and reduced precision. This calculation is performed for various dimensionality reduction values and various numeric precision values.
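The experiment can be sketched as follows; this illustrative reconstruction uses random data in place of the Wikipedia TF-IDF vectors and a dense Gaussian projection in place of sparse RI:

```python
import numpy as np

def normalised_distances(X):
    """Pairwise Euclidean distance matrix, normalised to sum to 1."""
    sq = (X ** 2).sum(axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    D = np.sqrt(d2)
    return D / D.sum()

rng = np.random.default_rng(0)
docs = rng.random((1000, 10_000))                # stand-in for TF-IDF vectors
projection = rng.standard_normal((10_000, 512))  # dense stand-in for RI

full = normalised_distances(docs)
reduced = docs @ projection
one_bit = np.where(reduced >= 0, 1.0, -1.0)      # 1-bit quantisation

# Topological distortion: RMSE between the two normalised distance matrices.
rmse = np.sqrt(np.mean((normalised_distances(one_bit) - full) ** 2))
print(rmse)
```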

Figure 1 depicts the results of our experiment. On the y-axis is the topological distortion, measured by RMSE. On the x-axis is the number of dimensions in the projection. Each of the curves on the plot corresponds to a different numerical precision. The bottom curve corresponds to double precision, and the plots above correspond to 8-bit quantisation, through 7-bit quantisation, and so on, all the way down to 1-bit quantisation. First we observe that as the dimensionality of the projected subspace is increased (moving to the right along the curves), the distortion becomes smaller. This is true regardless of numerical precision and it is expected. We also observe that most of the gain is achieved quite early, with relatively small dimensionality. This is the expected behaviour of both RI and LSA, where a relatively small number of dimensions is typically required to achieve good results with text documents. What is perhaps less expected is that as we reduce the numerical precision the deterioration is very small. It is only when precision is dropped to 3 bits that the difference in RMSE becomes noticeable. The curves from 8-bit down to 4-bit quantisation are barely separated. The distortion only increases significantly when we drop to 3-bit, 2-bit and 1-bit precision, corresponding to the three highest curves in the figure. Even with 1-bit precision we are still able to largely preserve topology, even at quite small dimensionality.

Figure 1: RMSE Drop Precision

2.3.2 Packing Ternary Vectors onto Binary Strings

To complete the generation of a document signature we need to pack the ±1 representation of signatures onto binary strings. This is done by representing positive signs as 1s and negative signs as 0s. The final result is thus a binary digital signature, but it still conceptually represents ±1 signatures.

We note that there is a possibility that very short documents will not occupy all bit positions in a signature. We can safely ignore this situation and encode zeros as positive (i.e. binary 1), although it may introduce some noise into the representation. The effect is negligible and studying it is outside the scope of this paper. Suffice to say that in our experiments, circumventing this by complicating the representation to also record the unoccupied positions resulted in no appreciable difference at all, and there was no practical advantage to maintaining this information.
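In NumPy terms the packing step amounts to a single call; a minimal sketch, assuming the signature width is a multiple of 8:

```python
import numpy as np

def pack_signature(projected_doc):
    """Pack signs onto a binary string: +1 (and 0) -> 1-bit, -1 -> 0-bit."""
    bits = (projected_doc >= 0).astype(np.uint8)  # zeros encoded as positive
    return np.packbits(bits)                      # e.g. 1024 signs -> 128 bytes

packed = pack_signature(np.random.default_rng(2).normal(size=1024))
```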

2.3.3 Summary of Binary Signatures

To summarise, TopSig introduces a principled approach to the generation of binary file signatures. The underlying data representation starts exactly as with inverted files, from the term-by-document weight matrix. This matrix is then subjected to aggressive lossy compression. Topology preserving dimensionality reduction is first achieved through Random Indexing, and it is immediately followed by aggressive numerical precision reduction, keeping only the sign bits of the projected term-by-document weight matrix. Unlike traditional signatures, TopSig does not emerge from bit-wise processor efficiency considerations; rather, it emerges as a consequence of aggressive compression of a well understood document representation. In this scheme, document signatures are no more than highly concise approximations of vector space document representations. TopSig maps an entire document collection onto corners of the $\{\pm 1\}^N$ hypercube.

3. AD-HOC RETRIEVAL

To provide a concrete description of the implementation and use of TopSig in ad-hoc retrieval we need to more precisely define document and query signatures, term weights, the ranking process, and how pseudo-relevance feedback is used. We then describe the evaluation of document retrieval using the INEX Wikipedia collection and the TREC Wall Street Journal (WSJ) collection. We conclude this section with a description of the document clustering experiments.

3.1 Document Signatures

So far we have not addressed the weighting of terms in the vector space representation of the term-by-document matrix. Weighting is all-important to improving precision and recall, and it is the basis of the most successful ranking functions, such as BM25 and Language models, which compute term weights in many different ways. With TopSig, the most effective weighting function we have found is described in Equations (2), (3) and (4):

$P(t|D) = \frac{tdf}{|D|}$    (2)

$P(t|C) = \frac{tcf}{|C|}$    (3)

$W(t,D) = \log\left(\frac{P(t|D)}{P(t|C)}\right)$    (4)

where $W(t,D)$ is the weight for term t in document D. We define tdf to be the term frequency for term t in document D, |D| as the total number of term occurrences in document D, tcf as the collection frequency for term t, and |C| as the total number of term occurrences in the collection. P(t|D) is an estimate of the probability of finding the term t given a document D, and P(t|C) is an estimate of the probability of finding term t given the collection C.

The weighting function W(t,D) produces a larger value if the frequency of a term in a document is higher than expected, and a smaller value if the frequency is lower than expected. The logarithm of the ratio of these probabilities is taken so as to dampen the effect of an inordinately large frequency of a term in a document.

The representation of a document is thus a bag of words, where the weight assigns an individual importance score to each term within a document. This effectively takes care of stop-words. We note that a term that occurs with approximately the expected frequency will have a weight close to zero. Negative weights that result from Equation (4) are set to zero, since a negative weight indicates that the term occurs in the document with even lower frequency than expected. This weighting scheme ensures that stop words are naturally discounted without special treatment. Anecdotally, a document that consists of only the sentence “To be, or not to be, that is the question” will retain all terms with appreciable weights when generating this document's signature, but in much larger documents most of these terms will have virtually negligible weights and thus be effectively stopped.
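A direct transcription of Equations (2)-(4), with the clipping of negative weights just described; the counts used below are illustrative:

```python
import math

def weight(tdf, doc_len, tcf, coll_len):
    """W(t, D) = log(P(t|D) / P(t|C)), floored at zero.

    tdf: frequency of t in D, doc_len: |D| (term occurrences in D),
    tcf: collection frequency of t, coll_len: |C| (term occurrences
    in the whole collection)."""
    p_doc = tdf / doc_len
    p_coll = tcf / coll_len
    return max(0.0, math.log(p_doc / p_coll))

# A content-bearing term occurring more often than expected gets a large
# weight; a stop word occurring at its expected rate gets weight zero.
print(weight(tdf=5, doc_len=100, tcf=1_000, coll_len=10_000_000))    # rare term
print(weight(tdf=5, doc_len=100, tcf=600_000, coll_len=10_000_000))  # stop word
```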

3.2 Alternative Term Weighting Functions

Surprisingly, TopSig performs quite respectably with no term weighting at all, where the raw unweighted term frequencies are simply randomly indexed. One advantage of this approach is that it requires no global statistics at all – a document can be encoded purely by looking at the document in isolation.

When using the BM25 weighting function to create a vector space representation, we found the retrieval performance was relatively poor. This is not surprising, as BM25 was originally intended to be treated as a probabilistic model and we did not use it in that manner.

The TF-IDF representation produces retrieval quality that lies between raw term frequencies and the approach described in Section 3.1.

A detailed comparison of different weighting functions is outside the scope of this paper. What we provide here is anecdotal evidence to paint a clearer picture of the approach we have taken to developing TopSig.

3.3 Alternative Document Representations

While the representation we have described here for ad-hoc retrieval is a bag of words model for keyword search, there is no limitation on encoding other representations using the TopSig approach. For example, it is possible to create vector space representations of structured data such as XML, and of other textual features such as phrases. As with many popular machine learning approaches, most increases in quality with respect to human judges come from how the data is represented.

3.4 Choice of Sparsity Parameters

During our experiments we found that setting the sparsity of the random codes for each term to 1 in 12 elements set to +1 and 1 in 12 set to -1 worked most effectively. As the density of the random codes or index vectors increases, the potential for cross talk between the codes increases. If the codes are made too sparse, there is not enough information for the query to successfully match against. There is an optimal point for sparsity with respect to a given set of queries. Detailed analysis of the effect of sparsity, including automated methods to learn the optimal sparsity, is outside the scope of this paper and will be investigated in further research.

3.5 Query Signatures

In order to search the collection with a given query, we need to generate a query signature. Query vectors are generated using standard TF-IDF weighting. This real-valued query vector is then converted to a signature using exactly the same process as used for document signatures: all the weighted query term signatures are added to create a real-valued randomly projected vector, and the sign bits are then taken to form the binary signature. It is of course necessary to use exactly the same process and parameters in generating the query signature as when generating document signatures.

The use of term weights in generating the query ensures that query terms that are a priori more significant (as determined through TF-IDF or some other weighting function) will tend to dominate the signature bits where there is a collision and a conflict. There is of course no concern when two colliding terms agree on the sign. This is easily understood by looking at a case where we have two query terms, for instance, “space” and “shuttle”. If the term “shuttle” has a larger TF-IDF value, then for any bit position where the two terms disagree on the sign, the term “shuttle” will dominate the sign in the signature. When there are multiple terms we effectively get a vote.

Document signatures are represented in binary form, where 1-bits correspond to +1 and 0-bits correspond to -1. Query signatures, before taking the sign bits, may contain a mix of three classes of values: positive, negative, or zero. This depends on the signs of the term signatures; a value of zero is obtained when none of the query terms occupy a given bit position. In fact, with short queries and sparse term signatures this is almost invariably the case. These zero-valued bit positions are those for which the query does not specify any preference. To account for this, a query mask is also generated to accompany the query signature. This mask has 1-bits in all positions other than those that are not covered by any term in the query. The set bits in the mask identify the subspace in signature space which the query terms cover. When comparing the query signature against document signatures, the similarity measure must not take account of differences in those bit positions. Conceptually, those are neither +1 nor −1.
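A minimal sketch of the query mask and of the masked comparison used in ranking (Section 3.6); the packed layout and the names are illustrative:

```python
import numpy as np

def masked_hamming(doc_sigs, query_sig, query_mask):
    """Hamming distance counted only in bit positions covered by the query.
    XOR finds disagreeing bits; AND with the mask discards positions that
    no query term occupies."""
    diff = (doc_sigs ^ query_sig) & query_mask
    return np.unpackbits(diff, axis=1).sum(axis=1)

rng = np.random.default_rng(3)
doc_sigs = rng.integers(0, 256, size=(5, 128), dtype=np.uint8)  # 5 docs, 1024 bits
query_vec = rng.normal(size=1024) * (rng.random(1024) < 0.1)    # sparse query
query_sig = np.packbits(query_vec > 0)    # sign bits of the query projection
query_mask = np.packbits(query_vec != 0)  # 1s only where query terms contribute

ranking = np.argsort(masked_hamming(doc_sigs, query_sig, query_mask))
```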

3.6 Ranking

Ranking with TopSig is performed with the Hamming distance, calculating the similarity score for each document. The Hamming distance is rank equivalent to the Euclidean distance since all signatures have the same vector length – we note that the signatures correspond to +1 and −1 values, not 1 and 0 values, and hence the length of each signature of N bits is $\sqrt{N}$. Since the mask is almost invariably different for each query, the Hamming distance for each query will generally be calculated in a different subspace. The distance metric is therefore a masked Hamming distance.

If the document and query are identical in the query subspace then the Hamming distance will be zero. The Hamming distance between two signatures of N bits is restricted to the range [0, N]. For a signature file with 1024 bits per document there are at most 1025 possible distances between the query and a document, and many fewer if the query is short. This means that in a collection such as the Wikipedia, with millions of documents, if we rank all the documents by the Hamming distance from the query, we are bound to get numerous ties.

Although document signatures are not completely random – they are biased by the document contents, and similar documents have similar signatures – we still expect the vast majority of the documents to be centred at about a Hamming distance of N/2 from the query signature. Indeed this is always observed. The distribution of distances always resembles a binomial distribution, which is what we would expect if the distribution of signatures were truly random. It is not quite that, but we still observe a strong resemblance to a truly random distribution.

We are interested in early precision, and here TopSig can still achieve granularity in the ranking of documents. This is because a large number of documents fall much closer than N/2 to the query signature, and the number of ties diminishes rapidly as the distance becomes smaller. Some ties still remain nevertheless, and these may be broken arbitrarily or by using simple heuristics or document features. For instance, PageRank can be used, or any one of the hundreds of document features that are reportedly used in commercial search engines.

3.6.1 Partial Index Scanning

Given an index where each document signature is N bits wide, it is possible to scan only the first f bits of each signature. This allows for further decreases in the time taken to rank. A multiple pass approach is possible where the documents are first ranked with relatively few bits, such as 640. The top ten percent of the documents ranked using 640 bits can then be re-ranked using the full precision of the document signatures stored in the index.
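A sketch of this two-pass idea over unpacked bit arrays; the prefix width and re-rank fraction are the examples above, and a real implementation would operate on packed signatures:

```python
import numpy as np

def two_pass_rank(doc_bits, query_bits, prefix=640, frac=0.10):
    """Rank on the first `prefix` bits, then re-rank the top fraction in full.

    doc_bits: (n_docs, n_bits) arrays of 0/1; query_bits: (n_bits,)."""
    coarse = (doc_bits[:, :prefix] != query_bits[:prefix]).sum(axis=1)
    top = np.argsort(coarse)[: max(1, int(len(doc_bits) * frac))]
    fine = (doc_bits[top] != query_bits).sum(axis=1)
    return top[np.argsort(fine)]

rng = np.random.default_rng(4)
docs = rng.integers(0, 2, size=(10_000, 1024), dtype=np.uint8)
query = rng.integers(0, 2, size=1024, dtype=np.uint8)
best = two_pass_rank(docs, query)
```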

3.7 Relevance Feedback

Pseudo-relevance feedback is known to improve the performance of a retrieval system. TopSig can implement pseudo-relevance feedback in the usual manner, through query expansion. This however is a generic approach that can be used with any search engine. There is an additional opportunity to apply pseudo-relevance feedback, an opportunity that is unique and specific to TopSig. An explanation of this pseudo-relevance feedback is required to completely describe the approach we have taken to ad-hoc retrieval with TopSig.

An initial TopSig search is first executed in the manner previously described in Section 3.6. This search is performed in the subspace of the query signature, the subspace spanned by the query terms. This is achieved by using the masked Hamming distance to rank all the documents in the collection. Now it is possible to proceed and apply pseudo-relevance feedback. The principle is the same as with all pseudo-relevance approaches – use some of the top ranked results to inform a subsequent search.

We take the top-k ranked documents and create a new query signature by computing the arithmetic mean of the corresponding signatures, treating the signatures as integer valued vectors, and then taking the sign bits in the manner described in Section 2.3.2. Since this signature is generated from full document signatures, it spans the full signature space and takes into account information from highly ranked results, including in bit positions that were not informed directly by the query terms. The query signature is now in fact based on the full content of the nearest k documents, through their signatures. The new query is constructed by inserting only the missing bits into the original signature. Therefore, the new signature consists of the original signature in all originally unmasked positions, and the feedback signature in all previously masked positions.
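A sketch of this feedback construction over unpacked bit arrays; the top-k results and query mask below are illustrative stand-ins:

```python
import numpy as np

def feedback_query(query_bits, query_mask_bits, topk_doc_bits):
    """Fill masked query positions from the mean of top-k document signatures.

    All arguments are 0/1 bit arrays; documents are treated as +/-1 vectors.
    Original (unmasked) query bits are kept; only masked positions change."""
    pm1 = np.where(topk_doc_bits == 1, 1.0, -1.0)          # bits -> +/-1
    feedback_bits = (pm1.mean(axis=0) >= 0).astype(np.uint8)  # sign bits of mean
    return np.where(query_mask_bits == 1, query_bits, feedback_bits)

rng = np.random.default_rng(5)
topk = rng.integers(0, 2, size=(10, 1024), dtype=np.uint8)  # top-10 results
q = rng.integers(0, 2, size=1024, dtype=np.uint8)
mask = (rng.random(1024) < 0.05).astype(np.uint8)           # sparse query coverage
expanded = feedback_query(q, mask, topk)
```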

The ranking of documents in relation to the new query is then repeated, but it is not necessary to search the entire collection again. It is sufficient to re-rank a very small fraction of the nearest signatures – usually those that were retained in a shortened result list following the initial search. This step consumes a negligible amount of additional computation – several orders of magnitude less than the initial search. The feedback leads to a statistically significant improvement in performance.

The approach to pseudo-relevance feedback we have described exploits the binary representation used by TopSig. It is conceptually similar to standard pseudo-relevance feedback, where the goal is to learn meaningful weights for relevant terms not in the query. However, the implementation of the approach with TopSig is drastically different, as we work directly in the dimensionality reduced space of the binary document signatures, rather than with specific terms not in the original query.

4. EVALUATION AND RESULTS

We have evaluated TopSig using the INEX Wikipedia 2009 collection and the TREC Wall Street Journal (WSJ) collection. The INEX Wikipedia collection contains 2,666,192 documents with a vocabulary of 2,132,352 terms. The mean document length in the Wikipedia is 360 terms; the shortest document has 1 term and the longest has 38,740 terms. We used all 68 queries from INEX 2009 for which there are relevance judgments. The Wall Street Journal collection consists of 173,252 documents and a vocabulary of 113,288 terms. The mean WSJ document length is 475 terms; the shortest document has 3 terms and the longest has 12,811 terms. We used TREC WSJ queries 51-100.

To compare TopSig with state-of-the-art approaches, we have used the ATIRE search engine [18], which was formerly known as ANT. ATIRE is a highly efficient state-of-the-art system which implements several ranking functions over an inverted file system. The ATIRE search engine has been thoroughly tested at INEX against other search engines, including several well known systems such as Zettair, Lucene, and Indri, and has been shown to produce accurate and reliable results.

The references given herein to the ranking functions that were compared with TopSig are to the actual papers that were followed in implementing the methods, as documented in the ATIRE search engine manual. These are Jelinek-Mercer (LMJM) [22], DLH13 [12], Divergence from Randomness [1], and Bose-Einstein [1]. The ranking functions were evaluated with relevance judgments from TREC and INEX, and the trec_eval program.

4.1 Recall-Precision

We first look at recall and precision over the full range of recall values. Figure 2 depicts the precision-recall curves for the INEX 2009 topics, against a tuned BM25 system, using k = 1.1 and b = 3, and with Rocchio pseudo-relevance feedback. This BM25 baseline curve is an optimistic, over-fitted approach – it is tuned with the actual queries, and indeed performs better than any official run at INEX 2009. But we are concerned with evaluating TopSig, and so this provides a very conservative yardstick by which to measure its performance. The figure shows several TopSig indexes, encoding the signatures with 64, 128, 256, 512, 768, 1024, 2048, 3072, and 4096 bits per signature. Only one in 12 vector elements were set in the random term signatures, to either +1 or −1, with the rest of the elements set to 0. It is interesting that even a 64 bit signature produced measurable early precision. As the number of bits in the signature increases, so does the recall. The performance of the file signatures is quite respectable once we allow for about 512 bits per signature – particularly at early precision.

All the other language model based ranking functions produce a recall-precision curve that falls below BM25 and just above the best TopSig curve, but they are not shown on the plot so as to reduce clutter. Note that the legends in the figures are ordered in decreasing order of area under the curve.

4.2 Early Recall

While Figure 2 may at first suggest that file signatures produce inferior retrieval quality, we must focus our attention on early precision, and this requires some justification.

Moffat and Zobel [15] found that P@n correlates with user satisfaction. A user who is given 7 relevant documents in the top 10 is better off than one who is only given 2. They argue that recall does not have a similar use case that reflects user satisfaction. Even for a recall oriented task, a user is unlikely to look past the top 30 results.

[Figure 2: INEX 2009 Precision vs Recall. Interpolated precision against recall for BM25 and TopSig signatures of 64 to 4096 bits.]

For most tasks, the first page or top 10 results are most useful to the user. Users achieve recall not through searching the entire ranked list but by reformulating queries. Recent work by Zobel and Moffat [24] suggests recall is not important except for a few recall oriented tasks, such as retrieval in the medical and legal domains. If a system provides 100% recall, it implies that the user can create a perfect query. Even in recall-based tasks, users tend to re-probe the collection with multiple queries to minimise the risk that they have missed important documents.

The same argument is applied to discount the importance attributed to the commonly used measure of Mean Average Precision (MAP), as it too depends on higher recall and a long tail of relevant results. Again, it is not clear what user satisfaction correlates with MAP. Turpin and Scholer [19] performed retrieval experiments where users completed search tasks using search results with MAP scores between 55% and 95%. They were unable to find a correlation between MAP scores and a precision based task requiring the first relevant document to be found. For recall-based tasks, they only found a weak link between MAP and the number of relevant documents found in a given time period. They conclude that MAP does not correlate with user performance on simple information finding web search tasks.

4.3 Analysis of Early Recall

Recall is not likely to be important to users except in some specific domains. Therefore, we focus our attention on a comparison of P@n results between TopSig and state of the art inverted file approaches. The results immediately make it obvious that TopSig is a viable option for common information finding tasks.

To assess TopSig at early precision we look at the P@n plots in Figures 3 and 4, for the 68 INEX 2009 ad-hoc queries and the TREC Wall Street Journal queries 51-100. It is immediately clear that TopSig performs similarly to the inverted file approaches. The only system that consistently outperforms TopSig is the over-fitted BM25 baseline. The legends in the figures are ordered by the area under the curve, so that the best performing systems appear first in the legend.

In order to look more carefully at the differences, we focused on the P@10 performance differences on the INEX Wikipedia collection between the best performing ranking function, BM25, and TopSig with 4096 bit signatures.

[Figure 3: INEX 2009 P@n. Precision against number of documents retrieved for BM25, Divergence from Randomness, DLH13, LMJM, Bose-Einstein and TopSig at 1024 to 4096 bits.]

[Figure 4: Wall Street Journal P@n. Precision against number of documents retrieved for Divergence from Randomness, Bose-Einstein, BM25, LMJM, DLH13 and TopSig at 1024 to 16384 bits.]

[Figure 5: INEX 2009 P@10 by Topic. P@10 for TopSig 4096 and BM25, with topics ordered by BM25 P@10.]

The average P@10 for BM25 is 0.54, and for TopSig it is 0.51. We looked at all 68 queries and performed a two-tailed paired t-test; there is no statistically significant difference, with p = 0.41. Figure 5 depicts the P@10 values for all 68 queries. The topics on the x-axis are ordered by increasing P@10 values for BM25, and the TopSig P@10 values are plotted in the same order. It is obvious that the two approaches produce very different results on a per-topic comparison. The two systems do not agree on which topics are difficult and which are not, and both sometimes fail (on different topics) to produce any relevant result in the top 10. It is a common and well understood phenomenon that this should occur, and it is true for all the ranking functions that we tested. However, there is a much stronger correlation among all the language models and BM25 as to which topics are hard and which are easy. No such correlation is observed for TopSig, which seems to behave quite differently despite producing similar overall precision. This leads us to conjecture that combining TopSig with BM25 (or any of the other models tested) may lead to better results than emerge from combining any other pair of more correlated ranking functions. Testing this conjecture is outside the scope of this paper.

By inspecting Figures 2 through 4, one can observe that as the number of bits in a document signature increases, the quality of the results increases logarithmically. As more and more bits are added, the increases in quality become smaller and smaller. This agrees with the Johnson-Lindenstrauss lemma [9], which states that the number of dimensions needed to embed a high dimensional Euclidean space into one of much lower dimension is logarithmic in the number of points.

4.4 Storage and Processing Overheads

TopSig is efficient and compares well with the inverted file approach. On a standard PC, a 1024 bit signature index of the 2.7 million English Wikipedia documents can be searched by brute force in about 175 milliseconds. The signature file for this collection is only 325 MB, less than 0.65% of the collection size, and so it easily fits in memory. By contrast, the highly compressed inverted file of ATIRE that underlies all the other models occupies 1.5 GB, or about 3% of the uncompressed text collection size. ATIRE itself is highly efficient; for comparison, the Indri index for the same collection occupies about 11% of the space.

Searching with TopSig is also efficient. We have not yet implemented a parallel multi-processor search, which would offer a linear speedup in the number of CPUs. Even so, all 68 INEX queries were completed in 12 seconds on the Wikipedia collection, and all 50 WSJ topics were completed in 4 seconds, on a basic laptop. This is comparable to the performance that is obtained with the inverted file system.

There is potential to further compress the index by sorting the binary strings lexicographically. Huffman coding can be used after sorting to represent the differences between successive document signatures. This approach has been shown to reduce a similar index used for near-duplicate detection by up to 50% when used with 64-bit codes [13]. Thorough testing of this style of approach is beyond the scope of this paper and is expected to be investigated in further research. It is also possible that document clustering can provide effective ways to further compress the index. Document signatures in a cluster fall within a small Hamming distance of each other; therefore, only a few bits differ between the cluster representative and the document signatures it represents.

5. CLUSTERING EVALUATION

The goal of clustering is to place documents into topical groups. To achieve this, clustering algorithms compare similarity between entire document vectors. Therefore, the space and time efficiency of the TopSig representation allows it to outperform current approaches using sparse vector representations. It is also competitive in terms of document cluster quality. We have modified the k-means algorithm to work with signatures. This approach is compared to the implementation of k-means in the CLUTO clustering toolkit [10] that is popular in the IR community. CLUTO uses full precision sparse vectors to represent documents.

The same approach is used to create document signatures as for ad-hoc retrieval, as described in Section 3. The sparsity of the signatures does not have a large impact on cluster quality, but we found that index vectors with 1 in 6 bits set performed best. Index vectors with 1 in 3 and 1 in 12 bits set were also tested.

The k-means algorithm [11] was modified to work with the bit string representation of TopSig. Cluster centroids and documents are N bit strings. Each bit in a centroid is the median for all documents it represents: if more than half the documents contain a bit set to 1, then the centroid contains a 1 in the corresponding position. As the 1 and 0 values represent +1 and -1, this is equivalent to adding all the vectors and taking the sign of each position. The standard Hamming distance measure is used to compare all vectors. The algorithm is initialized by selecting k random documents as centroids. This modified version of k-means always converged when the maximum number of iterations was not limited. Whether this modified version has the same convergence guarantees as the original algorithm is unknown.
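A minimal sketch of the modification, showing the majority-vote centroid update and one assignment/update step over unpacked bit arrays with illustrative data:

```python
import numpy as np

def majority_centroid(member_bits):
    """Per-bit median: a centroid bit is 1 iff at least half the members
    have it set (ties resolve to 1, matching the sign-of-sum rule)."""
    return (member_bits.mean(axis=0) >= 0.5).astype(np.uint8)

def kmeans_step(doc_bits, centroids):
    """One assignment + update step of k-means under Hamming distance."""
    dist = (doc_bits[:, None, :] != centroids[None, :, :]).sum(axis=2)
    assign = dist.argmin(axis=1)
    new = np.stack([
        majority_centroid(doc_bits[assign == c]) if np.any(assign == c)
        else centroids[c]                      # keep empty clusters unchanged
        for c in range(len(centroids))
    ])
    return assign, new

rng = np.random.default_rng(6)
docs = rng.integers(0, 2, size=(2_000, 256), dtype=np.uint8)
cents = docs[rng.choice(len(docs), size=8, replace=False)]  # random docs as seeds
assign, cents = kmeans_step(docs, cents)
```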

An implementation of the k-means clustering algorithm using bit vectors is available from the K-tree project subversion repository 1. Note that this is an unoptimised Java implementation. It is expected that further performance increases can be gained by implementation in a lower level language such as C.

5.1 Results

We have evaluated document clustering using the INEX 2010 XML Mining collection [5]. It is a 144,265 document subset of the INEX XML Wikipedia collection. Clusters are evaluated using two approaches. The standard approach of comparing clusters to a “ground truth” set of categories is measured via Micro Purity. Purity is the proportion of a cluster that carries the majority category label. The final score is micro averaged, where the Purity for each cluster is weighted by its size. On this collection, Purity produces approximately the same relationships between different clustering approaches as F1, Normalized Mutual Information and Entropy. There are 36 categories for documents, extracted from the Wikipedia category graph.
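A sketch of the Micro Purity computation as just defined; the cluster and category labels below are illustrative:

```python
import numpy as np

def micro_purity(cluster_ids, category_ids):
    """Micro-averaged purity: each cluster contributes the count of its
    majority category label, which weights each cluster by its size."""
    cluster_ids = np.asarray(cluster_ids)
    category_ids = np.asarray(category_ids)
    majority_total = 0
    for c in np.unique(cluster_ids):
        labels = category_ids[cluster_ids == c]
        majority_total += np.bincount(labels).max()
    return majority_total / len(cluster_ids)

# Two clusters: the first is pure, the second is 2/3 pure -> (3 + 2) / 6.
print(micro_purity([0, 0, 0, 1, 1, 1], [5, 5, 5, 7, 7, 2]))  # 0.833...
```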

An alternative evaluation is performed that has a specific application in information retrieval. Ad-hoc retrieval relevance judgments are used to measure the spread of relevant documents over clusters. This is motivated by the cluster hypothesis [20], which states that documents relevant to the same information need tend to cluster together. If this hypothesis holds, then most of the results will be in a small number of clusters. The Normalised Cumulative Cluster Gain (NCCG) measure represents how relevant documents are spread over clusters. It falls in the range [0, 1], where a score of 1 indicates all relevant documents were contained in 1 cluster and a score of 0 indicates all relevant documents were evenly spread across all clusters. Complete details of the evaluation are available in the track overview paper from INEX 2010 [5].

The sparse document vectors used to create the TopSig document signatures are used as input to the k-means implementation in CLUTO. Therefore, we are comparing the same algorithm on the same data, except that the TopSig representation is extremely compressed and has a different centroid representation and distance measure. Both implementations of k-means are initialized randomly and are allowed to run for a maximum of 10 iterations. 36, 100, 200, 500 and 1000 clusters were produced by each approach, where 36 was chosen to match the number of categories. This allows the trend of the measures to be visualised as the number of clusters is varied.

5.2 Analysis of Results

Figures 6 and 7 represent the quality of the clustering approaches using the Micro Purity and NCCG measures respectively. The TopSig representation nears the quality of CLUTO at 1024 bits and matches it at 4096 bits according to both measures. The best NCCG scores are all greater than 0.84 for all numbers of clusters, strongly supporting the cluster hypothesis, even when splitting the collection into 1,000 clusters.

Figure 8 shows how many times faster the TopSig clustering is than the traditional sparse vector approach in CLUTO. For example, using 4096 bit signatures to create 500 clusters is completed 20 times faster than with CLUTO, and 80 times faster at 1024 bits. This is a one to two order of magnitude increase in efficiency while still achieving the same quality as traditional approaches.

1 http://ktree.svn.sourceforge.net/viewvc/ktree/trunk/java/ktree/


[Figure 6: INEX 2010 Micro Purity. Micro purity against number of clusters for CLUTO and TopSig at 64 to 4096 bits.]

Method        Micro Purity     NCCG

CLUTO         0.543 ± 0.008    0.955 ± 0.003
TopSig 4096   0.540 ± 0.008    0.951 ± 0.007
TopSig 3072   0.528 ± 0.009    0.939 ± 0.005
TopSig 2048   0.520 ± 0.007    0.926 ± 0.007
TopSig 1024   0.480 ± 0.007    0.867 ± 0.012

Table 1: Detailed Evaluation of 36 clusters


Figures 6, 7 and 8 cannot be significance tested as they represent a single run of each algorithm. However, the graphs allow the general trends to be visualised. CLUTO k-means takes approximately 5 hours to produce 1,000 clusters on this relatively small collection. Therefore, the CLUTO and TopSig k-means algorithms were instead repeatedly run to produce 36 clusters given different starting conditions. Given each random initialisation, k-means converges to a different local minimum. The k-means implementations were run 20 times to measure this variability. Table 1 contains the results of this experiment, where TopSig approaches that are equivalent to the CLUTO approach are highlighted in boldface. Equivalence was tested using the t-test, with p > 0.05 indicating no statistically significant difference.

The time to produce the document signatures from the sparse document vectors was not included in the evaluation. This time is negligible in comparison to the time it takes to cluster using sparse document representations. Furthermore, when the k-means algorithm is limited in the number of iterations it can run for, its complexity becomes linear. The complexity of the document signature generation is also linear in the number of non-zero (nnz) elements in the term-by-document matrix. Since O(nnz) + O(nnz) = O(nnz), the complexity of the clustering system is not changed by the introduction of document signature generation.

[Figure 7: INEX 2010 NCCG. NCCG against number of clusters for CLUTO and TopSig at 64 to 4096 bits.]

[Figure 8: INEX 2010 Clustering Speed Up. Speed up over CLUTO against number of clusters for TopSig at 512 to 4096 bits.]


6. DISCUSSION

We have described TopSig, an approach to the construction of file signatures that emerges from aggressive compression of the conventional term-by-document weight matrix that underlies the most common and most successful inverted file approaches. When focusing on early precision, using P@n measures from P@5 up to P@30, TopSig is shown to be as effective as even the best models available, while requiring equal or smaller amounts of space for storing signatures. Significant reductions in signature index size can be achieved with TopSig as a trade-off, reducing the signature file size by orders of magnitude while accepting reduced early precision. Remarkably, even a single double precision variable – a 64-bit signature – is found to achieve 10% P@10 over the 2.7 million document Wikipedia collection. Testing with standard clustering benchmark tasks demonstrates TopSig to be equally effective and as accurate as a state-of-the-art clustering solution such as CLUTO, with a processing speedup of one to two orders of magnitude.

TopSig has been applied to documents of greatly varying lengths. Both the WSJ and the Wikipedia collections contain very short to very long documents, varying in size by up to five orders of magnitude. It has been suggested that file signatures are susceptible to this situation because of the increased probability of collisions on terms, but TopSig still performs well on these collections. In particular, we have tested TopSig with WSJ – the same collection that was used by Zobel et al. [25] to demonstrate the superiority of conventional inverted files. TopSig clearly outperforms the conventional file signatures that were previously discredited. In this paper we compare TopSig directly with inverted file approaches to demonstrate similar performance levels.

Unlike early approaches to searching with file signatures, TopSig does not necessitate the complicated and tedious removal of false matches, and it supports ranked retrieval in a straightforward manner. All the performance evaluation results reported in this paper were obtained without any attention being paid to false matches. Not only does TopSig produce comparable results, but with respect to false matches it is also virtually indistinguishable from a user perspective, because false matches do not occur unless far too aggressive compression is applied – for instance, compressing documents into 64 bit signatures.

There are certain differences between TopSig and inverted file based retrieval which may offer advantages in some application settings. TopSig performs the search in constant time, independently of query length. Comparing full documents to the collection in a filtering task, or processing long queries, takes exactly the same time as comparing a single term query. This may be useful in applications where predictability and quality of service guarantees are critical. Shortening the signature length can reduce the index size, with smooth degradation in retrieval performance. Signatures may offer significant advantages where storage space is at a premium and a robust trade-off is sought.

Distributed search is an attractive setting for TopSig – distributed indexing and retrieval have to resolve the problems of collection splitting and result fusion. With TopSig these operations are trivial to implement, since the Hamming distance between signatures can be used as a universal metric across the system. The gathering of global statistics can be avoided by using the raw term frequencies from each document. This further simplifies the use of TopSig in a distributed setting, and the trade-off with quality may be acceptable depending on the particular use of the system. If each text object in an enterprise carries its own signature – perhaps generated independently as a matter of routine by the applications that maintain the objects – then crawling and indexing the enterprise collection is as simple as collecting the signatures. Alternatively, TopSig can support the implementation of massively parallel search simply by distributing the query signature to every participating sub-system that maintains its own set of signatures. It is also trivial to implement distributed filtering with TopSig by maintaining a “watch list” of signatures that can be compared with incoming text objects at run time. TopSig is trivial to distribute on multi-processor platforms for the very same reasons. The simplicity of the search process means that with a shared memory processor architecture, a linear speedup in the number of concurrent hardware threads can be achieved.

TopSig is particularly efficient in indexing. It places virtually no memory requirements on indexing, processing an entire collection in a single pass (assuming term statistics are stable, as they are in very large collections). The most significant remaining drawback of TopSig is that it still requires a comparison with all signatures in the collection. Parallel processing offers a simple solution, but it is not entirely satisfactory: parallel search does not reduce the amount of computation that is required, it only distributes it. There are many reports in the research literature of more efficient approaches to signature file searching which operate in sub-linear time. Many tree based approaches have been described, and some solutions offer improvements. It is not a solved problem by any means; it is the subject of ongoing research with TopSig too.

This paper introduces TopSig, a new file signature approach that represents a viable alternative to conventional search engines. Our results demonstrate that with a different approach to signature construction and searching, the performance of file signatures is comparable to that of conventional language and probabilistic models at early precision. TopSig represents a principled approach to the construction of file signatures, placing it in the same conceptual framework as other models. This is very different from the conventional ad-hoc formulation of file signatures. Future work with TopSig will address a multi-processor implementation, a tree structured approach to the search process, and evaluation in a massively parallel, massively distributed setting. Early findings of experiments with longer documents indicate that further improved performance can be achieved with TopSig by splitting documents. This is the subject of ongoing research.

7. REFERENCES

[1] G. Amati and C.J. van Rijsbergen. Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Trans. Inf. Syst., 20:357–389, October 2002.

[2] R. Baraniuk, M. Davenport, R. DeVore, and M. Wakin. A simple proof of the restricted isometry property for random matrices. Constructive Approximation, 28(3):253–263, 2008.

[3] E. Bingham and H. Mannila. Random projection in dimensionality reduction: applications to image and text data. In KDD 2001, pages 245–250. ACM, 2001.


[4] B.H. Bloom. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13(7):422–426, 1970.

[5] C.M. De Vries, R. Nayak, S. Kutty, S. Geva, and A. Tagarelli. Overview of the INEX 2010 XML mining track. INEX 2010, LNCS, 2011.

[6] S. Deerwester, S.T. Dumais, G.W. Furnas, T.K. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407, 1990.

[7] C. Faloutsos and S. Christodoulakis. Signature files: an access method for documents and its analytical performance evaluation. ACM Trans. Inf. Syst., 2:267–288, October 1984.

[8] G. Golub and W. Kahan. Calculating the singular values and pseudo-inverse of a matrix. Journal of the Society for Industrial and Applied Mathematics: Series B, Numerical Analysis, 2(2):205–224, 1965.

[9] W.B. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26:189–206, 1984.

[10] G. Karypis. CLUTO – A Clustering Toolkit. 2002.

[11] S. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, March 1982.

[12] C. Macdonald and I. Ounis. Voting for candidates: adapting data fusion techniques for an expert search task. In CIKM 2006, pages 387–396. ACM, 2006.

[13] G.S. Manku, A. Jain, and A. Das Sarma. Detecting near-duplicates for web crawling. In WWW 2007, pages 141–150. ACM, 2007.

[14] C.D. Manning, P. Raghavan, and H. Schütze. An introduction to information retrieval. Cambridge University Press, Cambridge, UK, 2008.

[15] A. Moffat and J. Zobel. Rank-biased precision for measurement of retrieval effectiveness. ACM Trans. Inf. Syst., 27:2:1–2:27, December 2008.

[16] T.S. Rappaport. Wireless communications: principles and practice, volume 2. Prentice Hall Inc., 1996.

[17] M. Sahlgren. An introduction to random indexing. In TKE 2005, 2005.

[18] A. Trotman, X. Jia, and S. Geva. Fast and effective focused retrieval. In Focused Retrieval and Evaluation, LNCS, pages 229–241. 2010.

[19] A. Turpin and F. Scholer. User performance versus precision measures for simple search tasks. In SIGIR 2006, pages 11–18. ACM, 2006.

[20] C.J. van Rijsbergen. Information Retrieval. Butterworths, 1979.

[21] I.H. Witten, A. Moffat, and T.C. Bell. Managing gigabytes: compressing and indexing documents and images. Morgan Kaufmann, 1999.

[22] C. Zhai and J. Lafferty. A study of smoothing methods for language models applied to information retrieval. ACM Trans. Inf. Syst., 22:179–214, April 2004.

[23] D. Zhang, J. Wang, D. Cai, and J. Lu. Self-taught hashing for fast similarity search. In SIGIR 2010, pages 18–25. ACM, 2010.

[24] J. Zobel, A. Moffat, and L.A.F. Park. Against recall: is it persistence, cardinality, density, coverage, or totality? SIGIR Forum, 43:3–8, June 2009.

[25] J. Zobel, A. Moffat, and K. Ramamohanarao. Inverted files versus signature files for text indexing. ACM Trans. Database Syst., 23:453–490, December 1998.


Chapter 13

Document Clustering Evaluation: Divergence from a Random Baseline

I proposed and implemented all of the contributions in this paper. The co-authors provided valuable feedback on arrangement and style. I wrote the manuscript, wrote the software 1, devised the experimental design, conducted experiments, and performed the data analysis.

13.1 Evaluation

This paper introduces an approach to the correction of ineffective clusterings called Divergence from a Random Baseline. It produces a baseline that looks identical to the clustering being evaluated, except that the documents are allocated to clusters uniformly at random; the cluster size distribution of the baseline matches that of the clustering being evaluated. This approach is able to differentiate problematic clusterings using the NCCG evaluation measure. It also provides an optimum for distortion as measured by RMSE that is not apparent otherwise.
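A minimal sketch of the idea as defined above; the purity score and category labels are illustrative, and any measure of cluster quality could be substituted:

```python
import numpy as np

def divergence_from_baseline(score_fn, cluster_ids, seed=0):
    """Score of a clustering minus the score of a size-matched random
    baseline. Permuting the label vector keeps the cluster size
    distribution but allocates documents uniformly at random."""
    cluster_ids = np.asarray(cluster_ids)
    baseline = np.random.default_rng(seed).permutation(cluster_ids)
    return score_fn(cluster_ids) - score_fn(baseline)

# A clustering that matches hypothetical category labels perfectly scores
# well above its shuffled baseline; a meaningless clustering would not.
cats = np.array([0, 0, 0, 1, 1, 1])
def purity(c):
    return sum(np.bincount(cats[c == k]).max() for k in np.unique(c)) / len(c)

print(divergence_from_baseline(purity, np.array([0, 0, 0, 1, 1, 1])))
```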

1http://mloss.org/software/view/468/


Document Clustering Evaluation: Divergence from a Random Baseline

Christopher M. De Vries 1, Shlomo Geva 1 and Andrew Trotman 2

1 School of Electrical Engineering and Computer Science, Queensland University of Technology, Brisbane, Australia
2 Department of Computer Science, University of Otago, Dunedin, New Zealand

[email protected] [email protected] [email protected]

Abstract

Divergence from a random baseline is a technique for the evaluation of document clustering. It ensures that cluster quality measures are performing work, preventing high scores from being awarded to ineffective clusterings that provide no useful result. These concepts are defined and analysed using intrinsic and extrinsic approaches to the evaluation of document cluster quality. This includes the classical clusters to categories approach and a novel approach that uses ad hoc information retrieval. The divergence from a random baseline approach is able to differentiate ineffective clusterings encountered in the INEX XML Mining track. It also appears to perform a normalisation similar to the Normalised Mutual Information (NMI) measure, but it can be applied to any measure of cluster quality. When it is applied to the intrinsic measure of distortion as measured by RMSE, subtraction from a random baseline provides a clear optimum that is not apparent otherwise. This approach can be applied to any clustering evaluation; this paper describes its use in the context of document clustering evaluation.

1 Introduction

This paper extends, motivates and analyses a document clustering evaluation approach that compensates for ineffective document clusterings during evaluation. An ineffective clustering is one that achieves a high score according to a measure of document cluster quality but provides no value as a clustering solution. Divergence from a random baseline is introduced and formally defined to address ineffective clusterings in evaluation. A notion of the work performed by a clustering is introduced, where ineffective cases appear to perform no useful learning. The paper concludes with a detailed analysis of the results from the INEX 2010 XML Mining track. This paper clearly defines and motivates this approach with theoretical and experimental analysis.

Ineffective document clusterings have been investigated using two extrinsic evaluations. The first is the standard clusters to categories approach, where document clusters are compared to a ground truth set of category labels. The second approach evaluates document clustering using ad hoc information retrieval; it has a use case in collection selection, where a document collection is distributed across many machines and a broker needs to direct a search query to machines containing relevant documents. If the documents are allocated to machines by document cluster, it is expected that only a few topical clusters need to be searched. This is motivated by the cluster hypothesis [20], which states that relevant documents tend to be more similar to each other than non-relevant documents. The Normalised Cumulative Cluster Gain (NCCG) measure evaluates document clustering with respect to this use case for ad hoc information retrieval.

The paper proceeds as follows. Section 2 introduces the collaborative XML document mining evaluation forum at INEX. Section 3 introduces document clustering in an information retrieval context and discusses different approaches. Evaluation of document clustering using the clusters to categories approach and ad hoc relevance judgements is discussed in Section 4. Sections 5, 6 and 7 introduce and define ineffective clusterings that perform no useful learning and can be adjusted for by applying divergence from a random baseline. Section 8 analyses the application of divergence from a random baseline using the INEX 2010 XML Mining track. The paper is concluded in Section 9.

2 INEX XML Mining Track

The XML document mining track was run for six years at INEX, the Initiative for the Evaluation of XML Information Retrieval1 [12; 10; 11; 26; 8]. It explored the emerging field of classification and clustering of semi-structured documents.

Document clustering has been evaluated at INEX using the standard clusters to categories approach, where categories extracted from the Wikipedia were used as a ground truth. Clusterings produced by different systems were evaluated using measures such as Purity, Entropy, F1 and NMI, indicating how well the clusters match the categories.

A novel approach to document clustering evaluation was introduced at INEX in 2009 [26] and 2010 [8]. It used ad hoc information retrieval to evaluate document clustering by using relevance judgments from retrieval systems in the ad hoc track [34]. Ad hoc information retrieval evaluation is a system based approach that evaluates how different systems rank relevant documents. For systems to be compared, the same set of information needs and documents has to be used. A test collection consists of documents, statements of information need, and relevance judgments [36]. Relevance judgments are often binary, and a document is considered relevant if any of its contents can contribute to the satisfaction of the specified information need.

1 http://inex.otago.ac.nz/tracks/wiki-mine/wiki-mine.asp


However, the ad hoc track at INEX provides additional relevance information, where assessors highlight the relevant text in the documents. Information needs are also referred to as topics and contain a textual description of the information need, including guidelines as to what may or may not be considered relevant. Typically, only the keyword based query of a topic is given to a retrieval system.

The ad hoc information retrieval based evaluation of document clustering is motivated by the cluster hypothesis, which suggests that relevant documents are more similar to each other than non-relevant documents; relevant documents tend to cluster together. The spread of relevant documents over a clustering solution was measured using the Normalised Cumulative Cluster Gain (NCCG) measure in the INEX XML mining track in 2009 and 2010 [26; 8]. This evaluation approach also has a specific use case in information retrieval: it evaluates clustering of a document collection for collection selection. Collection selection involves selecting a subset of a collection given a query. Typically, these subsets are distributed on different machines. The goal is to cluster documents such that only a small fraction of clusters, and therefore machines, need to be searched to find most of the relevant documents for a given query. This leads to improved run time performance, as only a fraction of the collection needs to be searched. The total load over a distributed system is decreased, as only a few machines need to be searched per query instead of every machine. It also provides a clear use case for document clustering evaluation. By contrast, comparing document clusters to predefined categories only evaluates clustering as a match against a particular classification.

This paper uses the INEX 2010 XML Mining track dataset [8]. It is a 146,225 document subset of the INEX XML Wikipedia collection determined by the reference run used for the ad hoc track [2]. The reference run contains the 1500 highest ranked documents for each of the queries in the ad hoc track. The queries were searched using an implementation of Okapi BM25 in the ATIRE [35] search engine.

Topical categories for documents are one of many views of extrinsic cluster quality. They are derived from what humans perceive as topics in a document collection. When categories are used for evaluation, a document clustering system is given a score indicating how well the clusters match the predefined categories. This is the most prevalent approach to evaluation of document clustering in the research literature.

The categories for the INEX 2010 XML Mining collection were extracted from the Wikipedia category graph, which is noisy and nonsensical at times. Therefore, an approach using shortest paths in the graph was used to extract 36 categories [8].

3 Document Clustering

Document clustering is used in many different contexts, such as exploration of structure in a document collection for knowledge discovery [33], dimensionality reduction for other tasks such as classification [22], clustering of search results for an alternative presentation to the ranked list [19] and pseudo-relevance feedback in retrieval systems [23].

Recently there has been a trend towards exploiting semi-structured documents [27; 11]. This uses features such as the XML tree structure and hyper-link graphs to derive data from documents to improve the quality of clustering.

Document clustering groups documents into topics without any knowledge of the category structure that exists in a document collection. All semantic information is derived from the documents themselves. It is often referred to as unsupervised clustering. In contrast, document classification is concerned with the allocation of documents to predefined categories where there are labeled examples to learn from. Classification is referred to as supervised learning, where a classifier is learned from labeled examples and used to predict the classes of unseen documents.

The goal of clustering is to find structure in data to form groups. As a result, there are many different models, learning algorithms, encodings of documents and similarity measures. Many of these choices lead to different induction principles [14] which result in discovery of different clusters. An induction principle is an intuitive notion as to what constitutes groups in data. For example, algorithms such as k-means [24] and Expectation Maximisation [9] use a representative based approach to clustering where a prototype is found for each cluster. These prototypes are referred to as means, centers, centroids, medians and medoids [14]. A similarity measure is used to compare the representatives to the examples being clustered. These choices determine the clusters discovered by a particular approach.

A popular model for learning with documents is the Vector Space Model (VSM) [30]. Each dimension in the vector space is associated with one term in the collection. Term frequency statistics are collected by parsing the document collection and counting how many times each term appears in each document. This is supported by the distributional hypothesis [18] from linguistics, which theorises that words that occur in the same context tend to have similar meanings. If two documents use a similar vocabulary and have similar term frequency statistics then they are likely to be topically related. The end result is a high dimensional, sparse document-by-term matrix whose properties can be explained by Zipf distributions [41] in term occurrence. The matrix represents a document collection where each row is a document and each column is a term in the vocabulary. In the clustering process, document vectors are often compared using the cosine similarity measure. The cosine similarity measure has two properties that make it useful for comparing documents. Document vectors are normalised to unit length when they are compared. This normalisation is important since it accounts for the higher term frequencies that are expected in longer documents. The inner product that is used in computing the cosine similarity has non-zero contributions only from words that occur in both documents. Furthermore, sparse document representation allows for efficient computation.
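As an illustration of this comparison, a minimal sketch in Python, assuming dict-based sparse term-weight vectors; the function and document names are illustrative, not from the paper:

    import math

    def cosine_similarity(a, b):
        # a and b are sparse vectors: {term: weight}. Only terms that
        # occur in both documents contribute to the inner product.
        dot = sum(w * b[t] for t, w in a.items() if t in b)
        # Normalising to unit length accounts for the higher term
        # frequencies expected in longer documents.
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        if norm_a == 0 or norm_b == 0:
            return 0.0
        return dot / (norm_a * norm_b)

    d1 = {"cluster": 3.0, "document": 2.0, "retrieval": 1.0}
    d2 = {"cluster": 1.0, "retrieval": 2.0, "evaluation": 4.0}
    print(cosine_similarity(d1, d2))  # only shared terms contribute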

Different approaches exist to weight the term frequency statistics contained in the document-by-term matrix. The goal of this weighting is to take into account the relative importance of different terms, and thereby facilitate improved performance in common tasks such as classification, clustering and ad hoc retrieval. Two popular approaches are TF-IDF [29] and BM25 [28; 38].
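As a sketch of the first of these schemes, one common TF-IDF variant might look as follows; the exact formulations used in [29] and by BM25 differ:

    import math

    def tf_idf(tf, df, n_docs):
        # tf: frequency of the term in the document; df: number of
        # documents containing the term; n_docs: collection size.
        # Rare terms receive higher weights than common terms.
        return tf * math.log(n_docs / df)

    # A term occurring 5 times, found in 100 of 10,000 documents.
    print(tf_idf(5, 100, 10000))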

Clustering algorithms can be characterized by two properties. The first determines if cluster membership is discrete. Hard clustering algorithms assign each document to exactly one cluster. Soft clustering algorithms assign documents to one or more clusters in varying degrees of membership. The second determines the structure of the clusters found as being either flat or hierarchical. Flat clustering algorithms produce a fixed number of clusters with no relationships between the clusters. Hierarchical approaches produce a tree of clusters, starting with the broadest level clusters at the root and the narrowest at the leaves.

K-means [24] is one of the most popular learning algorithms for use with document clustering and other clustering problems. It has been reported as one of the top 10 algorithms in data mining [39]. Despite research into many other clustering algorithms, it is often the primary choice for practitioners due to its simplicity [17] and quick convergence [1]. Other hierarchical clustering approaches such as repeated bisecting k-means [32], K-tree [7] and agglomerative hierarchical clustering [32] have also been used. Further methods such as graph partitioning algorithms [21], matrix factorisation [40], topic modeling [5] and Gaussian mixture models [9] have also been used.

The k-means algorithm [24] uses the vector space model by iteratively optimising k centroid vectors which represent clusters. These centroids are updated by taking the mean of the vectors that are nearest neighbours of the centroid. The algorithm iteratively optimises the sum of squared distances between the centroids and the set of vectors that they are nearest neighbours to (the clusters), by updating the centroids to the cluster means and reassigning nearest neighbours to form new clusters, until convergence. The centroids are initialised by selecting k vectors from the document collection uniformly at random. It is well known that k-means is a special case of Expectation Maximisation [9] with hard cluster membership and isotropic Gaussian distributions.
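A minimal sketch of the algorithm as described, using dense vectors and squared Euclidean distance; a real document clustering implementation would operate on sparse vectors:

    import random

    def kmeans(vectors, k, iterations=20):
        # Initialise centroids with k vectors chosen uniformly at random.
        centroids = random.sample(vectors, k)
        for _ in range(iterations):
            # Assignment step: each vector joins its nearest centroid.
            clusters = [[] for _ in range(k)]
            for v in vectors:
                nearest = min(range(k), key=lambda i: sum(
                    (a - b) ** 2 for a, b in zip(v, centroids[i])))
                clusters[nearest].append(v)
            # Update step: move each centroid to the mean of its cluster.
            for i, cluster in enumerate(clusters):
                if cluster:
                    centroids[i] = [sum(dim) / len(cluster)
                                    for dim in zip(*cluster)]
        return centroids, clusters

In practice convergence is usually detected by checking that cluster assignments no longer change, rather than running a fixed number of iterations.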

The k-means algorithm has been shown to converge in a finite amount of time [31], as each iteration of the algorithm visits a possible partition without revisiting the same partition twice, leading to a worst case analysis of exponential time. Arthur et al. [1] have performed a smoothed analysis to explain the quick convergence of k-means theoretically. This is the same analysis that has been applied to the simplex algorithm, which has exponential worst case complexity but usually converges in linear time on real data. While there are point sets that can force k-means to visit every partition, they rarely appear in practical data. Furthermore, most practitioners limit the number of iterations k-means can run for, which results in linear time complexity for the algorithm. While the original proof of convergence applies to k-means using squared Euclidean distance [31], newer results show that other similarity measures from the Bregman divergence class of measures can be used with the same complexity guarantees [3]. This includes similarity measures such as KL-divergence, logistic loss, Mahalanobis distance and Itakura-Saito distance. Ding and He [13] demonstrate the relationship between k-means and Principal Component Analysis. PCA is usually thought of as a matrix factorisation approach for dimensionality reduction, whereas k-means is considered a clustering algorithm. It is shown that PCA provides a solution to the relaxed k-means problem, thus formally creating a link between k-means and matrix factorisation methods.

4 Document Clustering Evaluation

Evaluating document clustering is a difficult task. Intrinsic or internal measures of quality such as distortion or log likelihood only indicate how well an algorithm optimised a particular representation. Intrinsic comparisons are inherently limited by the given representation and are not comparable between different representations. Extrinsic or external measures of quality compare a clustering to an external knowledge source such as a ground truth labeling of the collection or ad hoc relevance judgments. This allows comparison between different approaches. Extrinsic views of truth are created by humans and suffer from the tendency for humans to interpret document topics differently. Whether a document belongs to a particular topic or not can be subjective. To further complicate the problem, there are many valid ways to cluster a document collection. It has been noted that clustering is ultimately in the eye of the beholder [14].

When comparing a cluster solution to a labeled ground truth, the standard measures of Purity, Entropy, NMI and F1 are often used to determine the quality of clusters with regard to the categories. Let $\omega = \{w_1, w_2, \ldots, w_K\}$ be the set of clusters for the document collection $D$ and $\xi = \{c_1, c_2, \ldots, c_J\}$ be the set of categories. Each cluster and category is a subset of the document collection, $\forall c \in \xi, w \in \omega : c, w \subset D$. Purity assigns a score based on the fraction of a cluster that is the majority category label,

$$\max_{c \in \xi} \frac{|c \cap w_k|}{|w_k|}, \qquad (1)$$

in the interval $[0, 1]$ where 0 is absence of purity and 1 is total purity. Entropy defines a probability for each category and combines them to represent order within a cluster,

$$-\frac{1}{\log J} \sum_{j=1}^{J} \frac{|c_j \cap w_k|}{|w_k|} \log \frac{|c_j \cap w_k|}{|w_k|}, \qquad (2)$$

which falls in the interval $[0, 1]$ where 0 is total order and 1 is complete disorder. F1 identifies a true positive (tp) as two documents of the same category in the same cluster, a true negative (tn) as two documents of different categories in different clusters, a false positive (fp) as two documents of different categories in the same cluster, and a false negative (fn) as two documents of the same category in different clusters. The score combines these classification judgements using the harmonic mean,

$$\frac{2 \times tp}{2 \times tp + fn + fp}. \qquad (3)$$

The Purity, Entropy and F1 scores assign a score to each cluster which can be micro or macro averaged across all the clusters. The micro average weights each cluster by its size, giving each document in the collection equal importance in the final score. The macro average is simply the arithmetic mean, ignoring the size of the clusters. NMI makes a trade-off between the number of clusters and quality in an information theoretic sense. For a detailed explanation of these measures please consult Manning et al. [25].
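For example, micro and macro averaged Purity might be computed as follows; a sketch with clusters and categories as sets of document identifiers, not the official evaluation code:

    def purity(cluster, categories):
        # Fraction of the cluster covered by its majority category.
        return max(len(cluster & c) for c in categories) / len(cluster)

    def averaged_purity(clusters, categories):
        scores = [purity(w, categories) for w in clusters]
        n = sum(len(w) for w in clusters)
        # Micro weights each cluster by its size; macro is the plain mean.
        micro = sum(s * len(w) for s, w in zip(scores, clusters)) / n
        macro = sum(scores) / len(scores)
        return micro, macro

    clusters = [{1, 2, 3}, {4, 5}]
    categories = [{1, 2, 4}, {3, 5}]
    print(averaged_purity(clusters, categories))  # (0.6, 0.583...)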

4.1 NCCG

The NCCG evaluation measure has been used for the evaluation of document clustering at INEX [26; 8]. It is motivated by van Rijsbergen's cluster hypothesis [20]. If the hypothesis holds true, then relevant documents will appear in a small number of clusters. A document clustering solution can be evaluated by measuring the spread of relevant documents for the given set of queries.

NCCG is calculated using manual result assessments from ad hoc retrieval evaluation. Evaluations of ad hoc retrieval occur in forums such as INEX [2], CLEF [15] and TREC [6]. The manual query assessments are called the relevance judgments and have been used to evaluate ad hoc retrieval of documents. The process involves defining a query based on the information need, a retrieval system returning results for the query, and humans judging whether the results returned by a system are relevant to the information need.

The NCCG measure tests a clustering solution to determine the quality of clusters relative to the optimal collection selection. Collection selection involves splitting a collection into subsets and recommending which subsets need to be searched for a given query. This allows a retrieval system to search fewer documents, resulting in improved run-time performance over searching the entire collection. The NCCG measure has complete knowledge of which documents are relevant to queries and orders clusters in descending order by the number of relevant documents they contain. We call this measure an "oracle" because it has complete knowledge of relevant documents. A working retrieval system does not have this property, so this measure represents an upper bound on collection selection performance.

Better clustering solutions in this context will tend to group together relevant results for previously unseen ad hoc queries. Real ad hoc retrieval queries and their manual assessment results are utilised in this evaluation. This approach evaluates the clustering solutions relative to a very specific objective: clustering a large document collection in an optimal manner in order to satisfy queries while minimising the search space. The measure used for evaluating the collection selection is called Normalised Cumulative Cluster Gain (NCCG) [26].

The Cumulative Gain of a Cluster (CCG) is defined by the number of relevant documents in a cluster, $CCG(c, t) = \sum_{i=1}^{n} Rel_i$. A sorted vector $CG$ is created for a clustering solution, $c$, and a topic, $t$, where each element represents the CCG of a cluster. It is normalised by the ideal gain vector,

$$\mathrm{SplitScore}(t, c) = \frac{\sum_{i=1}^{|CG|} \mathrm{cumsum}(CG)_i}{n_r^2}, \qquad (4)$$

where $n_r$ is the total number of relevant documents for the topic, $t$. The worst possible split places one relevant document in each cluster, represented by the vector $CG_1$,

$$\mathrm{MinSplitScore}(t, c) = \frac{\sum_{i=1}^{|CG_1|} \mathrm{cumsum}(CG_1)_i}{n_r^2}. \qquad (5)$$

NCCG is calculated using the previous functions,

$$\mathrm{NCCG}(t, c) = \frac{\mathrm{SplitScore}(t, c) - \mathrm{MinSplitScore}(t, c)}{1 - \mathrm{MinSplitScore}(t, c)}. \qquad (6)$$

It is then averaged across all topics.
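Following equations (4) to (6), NCCG for a single topic might be computed as follows; a sketch assuming clusters are sets of document identifiers and rel is the set of documents judged relevant for the topic:

    def nccg(clusters, rel):
        n_r = len(rel)
        # CCG: relevant documents per cluster, best cluster first.
        ccg = sorted((len(c & rel) for c in clusters), reverse=True)

        def split_score(cg):
            # Sum of the cumulative gains, normalised by n_r squared.
            total, cumulative = 0, 0
            for gain in cg:
                cumulative += gain
                total += cumulative
            return total / (n_r ** 2)

        score = split_score(ccg)
        # Worst split: one relevant document in each of n_r clusters.
        min_score = split_score([1] * n_r)
        return (score - min_score) / (1 - min_score)

    clusters = [{1, 2, 3, 4}, {5, 6}, {7, 8}]
    rel = {1, 2, 3, 5}
    print(nccg(clusters, rel))  # relevant documents mostly in one cluster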

4.2 Single and Multi Label Evaluation

Both the clustering approaches and the ground truth can be single or multi label. Examples of algorithms that produce multi label clusterings are soft or fuzzy approaches such as fuzzy c-means [4], Latent Dirichlet Allocation [5] or Expectation Maximisation [9]. A ground truth is multi label if it allows more than one category label for each document. Any combination of single or multi label clusterings and ground truths can be used for evaluation. However, it is only reasonable to compare approaches using the same combination of single or multi label clustering and ground truths. Multi label approaches are less restrictive than single label approaches as documents can exist in more than one category. There is redundancy in the data, whether it is a clustering or a ground truth. This redundancy has a real and physical cost when clustering is used for collection selection. More storage and compute resources are required with a multi label clustering, as one document has to be stored and processed on more than one computer. A ground truth can be considered a clustering and compared to another ground truth to measure how well the ground truths fit each other. Furthermore, a ground truth can be used as a clustering solution for collection selection.

The evaluation of document clustering using ad hoc information retrieval can be viewed as being similar to an evaluation using a multi label category based ground truth. A document can be relevant to more than one query. However, unlike a category based approach, each query is evaluated separately and then averaged across all queries. In contrast, all categories are evaluated at once and the score is not averaged across categories.

5 Ineffective Clustering

In this paper we introduce the concept of an ineffective clustering. An ineffective clustering produces a high score according to an evaluation measure but does not represent any inherent value as a clustering solution.

The Purity evaluation measure has an obvious ineffective case. If each cluster contains one document then it is 100% pure with respect to the ground truth, as a single document is the majority of its cluster. Since the goal of clustering is to produce groups of documents or to summarise the collection, this is obviously flawed as it does neither. The same applies to the Entropy measure: the probability of a label for a singleton cluster is 100%, resulting in the best possible Entropy score of total order.

The NCCG measure is ineffective when one cluster contains almost all the documents and every other cluster contains a single document. The NCCG measure orders clusters by the number of relevant documents they contain. A large cluster containing most documents will almost always be ranked first. Therefore, almost all relevant documents will exist in one cluster, achieving almost the highest score possible.

6 Work Performed by a Clustering

To overcome the ineffective clusterings in the previous section, we introduce the concept of work performed by a clustering approach. Work is defined as an increase in quality of a clustering over a simple approach that ignores the documents being clustered. A useful clustering performs work beyond an approach that is purely random and ignores document content. If a random approach that performs no useful learning performs equally to an approach that attempts to learn from the data, it would appear that nothing has been achieved by analysing the data. We suggest that an ineffective clustering performs no useful learning. This is supported by a theoretical and experimental analysis in the following sections.

Figures 1 and 2 illustrate an approach using a clustering algorithm and a random approach that ignores document content. The difference in cluster quality between these two approaches represents the work completed by a clustering algorithm.



Figure 1: A Clustering Produced by a Clustering Algorithm

Figure 2: A Random Baseline Distributing Documents into Buckets the Same Size as a Clustering

7 Divergence from a Random Baseline

Many measures of cluster quality can give high scores to clustering solutions that are not of high quality, simply through a particular choice of the number of clusters or the number of documents in each cluster.

Measures that can be misled by creating an ineffective clustering can be adjusted by subtracting the score of a randomly generated clustering with the same number of clusters and the same number of documents in each cluster. Figures 1 and 2 highlight this example, where the random baseline distributes documents into buckets the same size as the clusters found by the clustering algorithm. Apart from the random assignment of documents to clusters, the random baseline appears the same as the real solution. Therefore, each clustering evaluated requires a random baseline that is specific to that clustering. The baseline is created by shuffling the documents uniformly at random and splitting them into clusters the same size as the clustering being measured. The score for the random baseline clustering is subtracted from the score of the matching clustering being measured.

The divergence from a random baseline approach can be applied to any measure of cluster quality, whether it is intrinsic or extrinsic. However, it does require an existing measure of cluster quality. It is not a measure by itself but an approach to ensure a clustering is doing something sensible. Although we have highlighted its use for document clustering evaluation, it can be used for any clustering evaluation.

There are two issues at play here. Firstly, different distributions of cluster sizes can lead to arbitrarily high scores. Secondly, it has to be determined whether the clustering algorithm is effectively learning with respect to a measure of quality. The divergence from a random baseline takes care of ineffective solutions in either case. If the internal ordering of clusters is no better than random noise then a clustering achieves a score of zero. A negative score can also be achieved, as the random baseline scores a positive value using most measures on most data sets, and it is possible for a clustering to have a worse score than the baseline. For example, a clustering approach could maximise dissimilarity of documents in clusters. This will create a solution where the most dissimilar documents are placed together, resulting in a worse score than random assignment. The random assignment does not bias the clustering towards or away from the measure of quality. If a clustering approach is in fact learning something with respect to the measure of quality, then it is expected that it will be biased towards it. Alternatively, if we reverse the optimisation process, it should be biased away from it.
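The whole procedure can be sketched directly; quality_fn stands for any of the measures discussed above, and averaging several baselines would reduce the variance of the adjustment, though a single matched baseline is described here:

    import random

    def random_baseline(clustering, documents):
        # Shuffle the documents uniformly at random and split them into
        # clusters of exactly the same sizes as the submitted clustering.
        shuffled = list(documents)
        random.shuffle(shuffled)
        baseline, start = [], 0
        for cluster in clustering:
            baseline.append(set(shuffled[start:start + len(cluster)]))
            start += len(cluster)
        return baseline

    def divergence_from_random(clustering, documents, quality_fn):
        # Adjusted measure: m_a(s) = m(s) - m(r(s)).
        baseline = random_baseline(clustering, documents)
        return quality_fn(clustering) - quality_fn(baseline)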

Let $\omega = \{w_1, w_2, \ldots, w_K\}$ be the set of clusters for the document collection $D$ and $\xi = \{c_1, c_2, \ldots, c_J\}$ be the set of categories. Each cluster and category is a subset of the document collection, $\forall c \in \xi, w \in \omega : c, w \subset D$. We define the probability of a category in the baseline given a cluster as

$$P_b(c_j \mid w_k) = \frac{|c_j|}{\sum_i |c_i|}.$$

The probability of a category given a cluster in the baseline only depends on the size of the categories. The baseline is a uniformly randomly shuffled list of documents that has been split into clusters that match the cluster size distribution in the solution being evaluated. Thus, within each cluster in the baseline is uniform random noise. It is not biased by the document representation. So, it is expected that categories will occur at a rate proportional to the category's size. For example, if there are three categories $A, B, C$ containing 10, 20 and 30 documents, each cluster in the baseline is expected to contain approximately $\frac{10}{60} A$, $\frac{20}{60} B$ and $\frac{30}{60} C$. This only reflects the size distribution of the categories.

We let any measure of cluster quality be interpreted as a probability. Although this is not formally the case for all measures, it serves as a reasonable explanation. We define the probability of a category in a cluster given the ground truth as $P_s(c_j \mid w_k) = $ any measure of cluster quality.

The Purity measure assigns an actual probability to each cluster when there is a single label ground truth. All the probabilities combined accumulate to one, $\sum_j \frac{|c_j \cap w_k|}{|w_k|} = 1$, and the category with the largest maximum likelihood estimate is assigned to each cluster,

$$P_{\mathrm{Purity}}(c_j \mid w_k) = \max_{c_j} \frac{|c_j \cap w_k|}{|w_k|}.$$

This is the proportion of the cluster that has the majority category label. It also represents the same process of using clustering for classification with labeled data, where an unseen sample is labeled with the majority category label of the cluster it is nearest neighbour to. We define $d$ as a document in $D$. The ground truth is restricted to being single label, where a document, $d$, has only one label in one category in the ground truth, $\forall d \in D, c_i, c_j \in \xi : d \in c_i \wedge c_i \neq c_j \implies d \notin c_j$.

The adjusted measure is the difference between the submission and the baseline. We define the adjusted probability of a category given a cluster as $P_a(c_j \mid w_k) = P_s(c_j \mid w_k) - P_b(c_j \mid w_k)$.

An alternative formal view of divergence from a random baseline can be defined by a quality function, $m : \mathcal{PP}(\mathbb{Z} \times \mathbb{Z}) \to \mathbb{R}$, that takes a set of clusters as a set of sets of (document, category label) pairs, $s$, and returns a real number indicating the quality of the clustering. Examples of these cluster quality functions are Entropy, F1, NCCG, Negentropy, NMI and Purity. There exists a function, $r : \mathcal{PP}(\mathbb{Z} \times \mathbb{Z}) \to \mathcal{PP}(\mathbb{Z} \times \mathbb{Z})$, that generates a random baseline, $b$, given a clustering solution, $s$. The baseline has the same number of clusters as the clustering solution, $|b| = |s|$. For every cluster in each of the original clustering, $s$, and the baseline, $b$, the corresponding clusters contain the same number of documents, $\forall k : |s_k| = |b_k|$. The adjusted measure, $m_a : \mathcal{PP}(\mathbb{Z} \times \mathbb{Z}) \to \mathbb{R}$, becomes $m_a(s) = m(s) - m(r(s))$.

8 Application at the INEX 2010 XML Mining Track

Participants were asked to submit multiple clustering solutions containing approximately 50, 100, 200, 500 and 1000 clusters. The ground truth contained 36 categories because only categories with more than 3000 documents were used. This choice was arbitrary, and the decision on cluster sizes was made based on the number of documents in the collection before the categories were extracted. The number of categories in a document collection is subjective. Therefore, a direct comparison of 36 clusters with 36 categories is not necessary. Measuring how the categories behave over multiple cluster sizes indicates the quality of clusters, and the trend can be visualised.

Figure 3: Legend (k-star; Random with Uniform Cluster Size; Structured Linked Vector Model; TopSig 1024-bit k-means)

A legend for Figures 4 to 9 can be found in Figure 3. The Structured Linked Vector Model (SLVM) [37] incorporates document structure, links and content. The k-star [34] is an iterative clustering method for grouping documents. The TopSig approach [16] produces binary strings that represent documents and a modified k-means algorithm that works directly with this representation.

Figure 4: Purity (micro Purity vs. number of clusters)

Figure 5: Purity Subtracted from a Random Baseline (micro Purity divergence vs. number of clusters)

Submissions using the k-star method at INEX 2010 [34] contained several large clusters and many other small clusters. This exposed a weakness in the NCCG measure, which resulted in inappropriately high scores. When the scores are subtracted from a random baseline with the same properties, they performed no better than a randomly generated solution. This can be clearly seen in Figures 6 and 7, where the k-star method changes drastically between the original score and the score when subtracted from a random baseline.

The NMI measure is almost unaffected by subtraction from a random baseline, whereas other measures show a larger difference. Figures 8 and 9 highlight this property on submissions from INEX 2010. This suggests that the normalisation we have proposed is similar to that of NMI but is applicable to any measure of cluster quality, whether it is intrinsic or extrinsic. Figures 4 and 5 demonstrate how the difference between the adjusted and unadjusted measures is larger for measures that are not normalised. Each line represents a different document clustering system. The bottom most line in each graph is a randomly generated clustering submission where a category for a document is selected uniformly at random from the set of categories. Note that this random clustering in the figures differs from the random baseline: its cluster size distribution is uniform, whereas a random baseline has a cluster size distribution that is specific to the clustering being evaluated. When compared to the random baseline, the expected results are achieved, with a score of zero for all cluster sizes. Note that without adjusting for the cluster size distribution, it is not possible to differentiate ineffective clusterings as per the NCCG metric in Figure 7. Subtracting the random submission with uniform cluster sizes from the NCCG submission does not reduce its score to zero, as can be seen in Figure 6.

Figures 10 and 11 demonstrate the application of the divergence from a random baseline approach on an intrinsic measure. RMSE is the Root Mean Squared Error of the clustering using the cosine similarity measure. The higher the value, the better the clustering. A cosine similarity of 1 indicates the document and the cluster centre are identical. A score of 0 indicates they are orthogonal and therefore have no overlap in vocabulary. This experiment was run on a randomly selected sample of 10,000 documents. The k-means algorithm was used to produce k clusters between 1 and 10,000. Subtraction from a random baseline assigns a score of zero to these ineffective cases. Furthermore, it provides a clear maximum for RMSE.



Figure 6: NCCG (NCCG vs. number of clusters)

Figure 7: NCCG Subtracted from a Random Baseline (NCCG divergence vs. number of clusters)

Figure 8: NMI (NMI vs. number of clusters)

Figure 9: NMI Subtracted from a Random Baseline (NMI divergence vs. number of clusters)

Figure 10: RMSE (RMSE vs. number of clusters)

Figure 11: RMSE Subtracted from a Random Baseline (RMSE divergence vs. number of clusters)

9 Conclusion

In this paper we introduced problems encountered in the evaluation of document clustering, namely the concept of ineffective clustering and a notion of work performed by a clustering. The divergence from a random baseline approach deals with these corner cases and increases the confidence that a clustering approach is achieving meaningful learning with respect to any view of cluster quality. It is applicable to any clustering evaluation but was only discussed in the context of document clustering in this paper.



Divergence from a random baseline was formally defined and analysed experimentally with both intrinsic and extrinsic measures of cluster quality. Furthermore, this approach appears to perform a normalisation similar to that performed by NMI. It also provides a clear optimum for distortion as measured by RMSE.

References

[1] D. Arthur, B. Manthey, and H. Röglin. k-means has polynomial smoothed complexity. IEEE FOCS, 0:405–414, 2009.

[2] P. Arvola, S. Geva, J. Kamps, R. Schenkel, A. Trotman, and J. Vainio. Overview of the INEX 2010 ad hoc track. INEX 2010, pages 1–32, 2011.

[3] A. Banerjee, S. Merugu, I.S. Dhillon, and J. Ghosh. Clustering with Bregman divergences. JMLR, 6:1705–1749, 2005.

[4] J.C. Bezdek, R. Ehrlich, et al. FCM: The fuzzy c-means clustering algorithm. Computers & Geosciences, 10(2-3):191–203, 1984.

[5] D.M. Blei, A.Y. Ng, and M.I. Jordan. Latent Dirichlet allocation. The Journal of Machine Learning Research, 3:993–1022, 2003.

[6] C.L.A. Clarke, N. Craswell, I. Soboroff, and G.V. Cormack. Overview of the TREC 2010 web track. Technical report, DTIC Document, 2010.

[7] C.M. De Vries and S. Geva. Document clustering with K-tree. INEX 2008, pages 420–431, 2009.

[8] C.M. De Vries, R. Nayak, S. Kutty, S. Geva, and A. Tagarelli. Overview of the INEX 2010 XML mining track: Clustering and classification of XML documents. INEX 2010, pages 363–376, 2011.

[9] A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), pages 1–38, 1977.

[10] L. Denoyer and P. Gallinari. Report on the XML mining track at INEX 2007: categorization and clustering of XML documents. In ACM SIGIR Forum, volume 42, pages 22–28. ACM, 2008.

[11] L. Denoyer and P. Gallinari. Overview of the INEX 2008 XML mining track. INEX 2008, pages 401–411, 2009.

[12] L. Denoyer, P. Gallinari, and A.M. Vercoustre. Report on the XML mining track at INEX 2005 and INEX 2006. INEX 2006, pages 432–443, 2007.

[13] C. Ding and X. He. K-means clustering via principal component analysis. In ICML, page 29. ACM, 2004.

[14] V. Estivill-Castro. Why so many clustering algorithms: a position paper. ACM SIGKDD Explorations Newsletter, 4(1):65–75, 2002.

[15] P. Forner, J. Gonzalo, J. Kekäläinen, M. Lalmas, and M. De Rijke. CLEF 2011, volume 6941. 2011.

[16] S. Geva and C.M. De Vries. TopSig: topology preserving document signatures. In CIKM 2011, pages 333–338. ACM, 2011.

[17] I. Guyon, U. von Luxburg, and R.C. Williamson. Clustering: Science or art. In NIPS Workshop on Clustering Theory, 2009.

[18] Z.S. Harris. Distributional structure. Word, 1954.

[19] M.A. Hearst and J.O. Pedersen. Reexamining the cluster hypothesis: scatter/gather on retrieval results. In SIGIR, pages 76–84. ACM, 1996.

[20] N. Jardine and C.J. van Rijsbergen. The use of hierarchic clustering in information retrieval. Information Storage and Retrieval, 7(5):217–240, 1971.

[21] G. Karypis, E.H. Han, and V. Kumar. Chameleon: Hierarchical clustering using dynamic modeling. Computer, 32(8):68–75, 1999.

[22] A. Kyriakopoulou and T. Kalamboukis. Using clustering to enhance text classification. In SIGIR, pages 805–806. ACM, 2007.

[23] K.S. Lee, W.B. Croft, and J. Allan. A cluster-based resampling method for pseudo-relevance feedback. In SIGIR 2008, pages 235–242. ACM, 2008.

[24] S. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, March 1982.

[25] C.D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, NY, USA, 2008.

[26] R. Nayak, C.M. De Vries, S. Kutty, S. Geva, L. Denoyer, and P. Gallinari. Overview of the INEX 2009 XML mining track: Clustering and classification of XML documents. INEX 2009, pages 366–378, 2010.

[27] R. Nayak, R. Witt, and A. Tonev. Data mining and XML documents. 2002.

[28] S.E. Robertson, S. Walker, S. Jones, M.M. Hancock-Beaulieu, and M. Gatford. Okapi at TREC-3. In Proceedings of the Third Text REtrieval Conference, 1995.

[29] G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5):513–523, 1988.

[30] G. Salton, A. Wong, and C.S. Yang. A vector space model for automatic indexing. CACM, 18(11):613–620, 1975.

[31] S.Z. Selim and M.A. Ismail. K-means-type algorithms: a generalized convergence theorem and characterization of local optimality. IEEE Transactions on Pattern Analysis and Machine Intelligence, (1):81–87, 1984.

[32] M. Steinbach, G. Karypis, V. Kumar, et al. A comparison of document clustering techniques. In KDD Workshop on Text Mining, volume 400, pages 525–526. Boston, 2000.

[33] A.H. Tan. Text mining: The state of the art and the challenges. In PAKDD, pages 65–70, 1999.

[34] M. Tovar, A. Cruz, B. Vazquez, D. Pinto, and D. Vilariño. An iterative clustering method for the XML-mining task of the INEX 2010. INEX 2010, 2011.

[35] A. Trotman, X.F. Jia, and S. Geva. Fast and effective focused retrieval. INEX 2010, pages 229–241, 2010.

[36] E. Voorhees. The philosophy of information retrieval evaluation. In Evaluation of Cross-Language Information Retrieval Systems, pages 143–170. Springer, 2002.

[37] F. Wang, S. Liang, and J. Yang. PKU at INEX 2010 XML mining track. INEX 2010, 2010.

[38] J.S. Whissell and C.L.A. Clarke. Improving document clustering using Okapi BM25 feature weighting. Information Retrieval, pages 1–22, 2011.

[39] X. Wu, V. Kumar, J.R. Quinlan, J. Ghosh, Q. Yang, H. Motoda, G.J. McLachlan, A. Ng, B. Liu, P.S. Yu, et al. Top 10 algorithms in data mining. Knowledge and Information Systems, 14(1):1–37, 2008.

[40] W. Xu, X. Liu, and Y. Gong. Document clustering based on non-negative matrix factorization. In SIGIR, pages 267–273. ACM, 2003.

[41] G.K. Zipf. Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology. Addison-Wesley Press, 1949.



Chapter 14

Pairwise Similarity of TopSig Document Signatures

I contributed almost all of the contributions in this paper. Shlomo aided in the experimental design. I wrote the manuscript, wrote the software, aided in the experimental design, conducted the experiments, and performed the data analysis.

14.1 Representations

This paper investigates the topology of TopSig document signatures by analyzing the distributions of the pairwise distances between signatures. It highlights that the indexing process of TopSig is influencing the random codes used to perform dimensionality reduction. It skews the binomial distribution for uniform random binary strings such that the feature space is no longer uniform and is clustered. It is this non-uniformity that allows the differentiation of meaning. Furthermore, it highlights that the curse of dimensionality still applies for document signatures, with most signatures being equidistant to each other around the middle of the distribution. Only the signatures in the left tail of the distribution allow adequate separation of documents; i.e. only the local neighborhood of the signatures is interpretable. This also gives a reasonable explanation of why the TopSig model is only competitive at early precision for ad hoc information retrieval when compared to language and probabilistic models. There is only adequate separation between documents in the head of the ranked list. When proceeding down the ranked list, documents become equidistant at higher recall.



Pairwise Similarity of TopSig Document Signatures

Christopher M. De Vries
Electrical Engineering and Computer Science
Queensland University of Technology
Brisbane, Australia
[email protected]

Shlomo Geva
Electrical Engineering and Computer Science
Queensland University of Technology
Brisbane, Australia
[email protected]

ABSTRACT

This paper analyses the pairwise distances of signatures produced by the TopSig retrieval model on two document collections. The distribution of the distances is compared to purely random signatures. It explains why TopSig is only competitive with state of the art retrieval models at early precision. Only the local neighbourhood of the signatures is interpretable. We suggest this is a common property of vector space models.

Categories and Subject Descriptors

H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval – Retrieval Models

Keywords

Signature Files, Topology, Vector Space IR, Random Indexing, Document Signatures, Search Engines, Document Clustering, Near Duplicate Detection, Relevance Feedback

1. INTRODUCTION

This paper investigates the properties of the pairwise similarities of document signatures produced by TopSig. TopSig is a retrieval model where documents are represented by d-bit binary strings that lie on a d-dimensional collection hypercube. The signatures are produced by a random process called random indexing [10] or random projection [1] which compresses the standard term-by-document matrix.

Pairwise similarity plays an important role in many information retrieval related tasks such as ad hoc retrieval, clustering, classification, filtering, near duplicate detection and relevance feedback.

The paper proceeds as follows. In Section 2, the TopSig retrieval model is introduced. Section 3 describes the document collections used in the experiments. The experimental setup is introduced in Section 4 and the results are presented in Section 5. The paper is concluded by a discussion of the implications of the results in Section 6.

2. TOPSIG

TopSig [7] offers a radically different approach to the construction of file signatures. Traditional file signatures [6] have been shown to be inferior to approaches using inverted indexes, both in terms of the time and space required to process and store the index [12, 13]. However, TopSig overcomes previous criticisms aimed at file signatures by taking a principled approach using the vector space model, dimensionality reduction and numeric quantisation. Previous approaches to file signatures were constructed in an ad hoc fashion by combining random binary signatures using a bitwise XOR, which is a Bloom filter [2] for the terms contained in documents. In contrast, TopSig randomly indexes a weighted term-by-document matrix and then quantises it. TopSig is competitive with state of the art probabilistic and language retrieval models at early precision, and with clustering approaches [7].

Let $D = \{d_1, d_2, \ldots, d_n\}$ be a document collection of $n$ document signatures, $D \subset \{+1,-1\}^d$, $|D| = n$. Let $F = \{f_1, f_2, \ldots, f_n\}$ be the same document collection as $D$ where each document is represented by a $v$-dimensional real valued vector, $F \subset \mathbb{R}^v$, $|F| = n$, where $v$ is the size of the vocabulary of the document collection. $F$ is the term-by-document matrix in the full space of the collection vocabulary which underlies most modern retrieval systems.

TopSig indexes documents using a mapping function, $m : \mathbb{R}^v \to \{+1,-1\}^d$, that maps a document from the original $v$-dimensional continuous real valued term space to a $d$-dimensional discrete binary valued space. The index is constructed using the mapping function, $D = \{f \in F : m(f)\}$. The mapping function creates a sparse random ternary index vector of $d$ dimensions for each term in the document, with +1 and -1 values in random positions and the majority of positions containing 0 values. These randomly generated codes are almost orthogonal to each other and have been shown to provide comparable quality to orthogonal approaches such as principal component analysis [1]. The index vector is multiplied by the term weight and added to a $d$-dimensional real valued vector that represents the document. Once all the terms in a document have been processed, this reduced dimensionality document vector is quantised to a $d$-dimensional binary vector by thresholding each value in each dimension to 1 if greater than 0 and 0 otherwise. The 1 and 0 values in the binary vector represent +1 and -1 values. This mapping function can be applied to each document independently, meaning that new documents can be indexed in isolation without having to update the existing index. This is a key advantage of random indexing [10] over other dimensionality reduction techniques such as latent semantic analysis [5], which requires global analysis of the term-by-document matrix using the singular value decomposition [8].
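The indexing step can be sketched as follows. The parameters are illustrative: d = 1024 matches the signatures used later in this paper, but the number of non-zero positions per index vector and the term-seeded code generation are assumptions for the sketch, not the real system's configuration:

    import random

    def index_vector(term, d=1024, nonzero=16):
        # Sparse random ternary code for a term: a few +1/-1 entries,
        # zeros elsewhere. Seeding by the term makes it reproducible.
        rng = random.Random(term)
        positions = rng.sample(range(d), nonzero)
        return {p: rng.choice((+1, -1)) for p in positions}

    def signature(term_weights, d=1024):
        # Accumulate weighted index vectors, then quantise each
        # dimension: +1 if greater than 0, -1 otherwise.
        acc = [0.0] * d
        for term, weight in term_weights.items():
            for pos, sign in index_vector(term, d).items():
                acc[pos] += sign * weight
        return [1 if x > 0 else -1 for x in acc]

    sig = signature({"document": 2.5, "signature": 1.3, "topology": 0.7})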

The indexing process of TopSig is similar to that of SimHash [3]. However, TopSig uses signatures an order of magnitude longer than SimHash and it uses much sparser random codes. The search process for ad hoc retrieval also differs, where TopSig searches in the subspace of the query and applies relevance feedback.

The binary vectors in $D$ provide a faithful representation of the original document vectors in $F$. The topological relationships in the original space are preserved in the reduced dimensionality space. This is supported by the Johnson-Lindenstrauss lemma [9], which states that if points in a high-dimensional space are projected into a randomly chosen subspace of sufficiently high dimensionality, then the distances between the points are approximately preserved. It also states that the number of dimensions required to reproduce the topology is asymptotically logarithmic in the number of points.

3. DOCUMENT COLLECTIONS

We have used the INEX Wikipedia 2009 collection and the TREC Wall Street Journal (WSJ) collection to evaluate pairwise distances of TopSig signatures. The INEX Wikipedia collection contains 2,666,190 documents with a vocabulary of 2,132,352 terms. We have used 2 subsets of this collection during evaluation. The first is a 144,265 document subset used for the INEX 2010 XML Mining track [4]. This is the reference run for the ad hoc track in 2010 produced by an implementation of Okapi BM25 in the ATIRE search engine [11]. It is denoted by INEXreference. The second is a randomly selected 144,265 document subset chosen to match the size of the XML Mining subset. It is denoted by INEXrandom. Subsets of the INEX Wikipedia 2009 collection were used for this experiment because calculating pairwise distances has a time complexity of O(n²) and becomes intractable for millions of documents. The mean document length in the Wikipedia collection is 360 terms, the shortest document has 1 term and the longest has 38,740 terms. The Wall Street Journal collection consists of 173,252 documents and a vocabulary of 113,288 terms. The mean WSJ document length is 475 terms, the shortest has 3 terms, and the longest has 12,811 terms.

The INEX Wikipedia 2009 collection consists of 12GB of uncompressed text or 50GB of uncompressed XML, which includes semantic markup. The 2,666,190 documents are split into 3,617,380 passages. 1024-bit TopSig signatures use a total of 441MB to index the collection. The TREC Wall Street Journal collection consists of 518MB of uncompressed text. The 173,252 documents are split into 222,238 passages. 1024-bit TopSig signatures use a total of 27MB to index the collection.

4. EXPERIMENTAL SETUP

Pairwise similarities define the topology of a set of documents. Each document is compared to every other document. These similarities define the relationships between all documents in a collection. If two documents are near each other they share the same semantic context. TopSig uses the Hamming distance to measure similarity between two documents. It produces values in the range [0, d], where 0 indicates the documents are identical and values from 1 to d indicate decreasing similarity between documents, d being the most dissimilar two documents can be.
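With signatures packed into integers, the Hamming distance reduces to a bitwise XOR followed by a population count; a minimal sketch, whereas the real implementation operates on machine words for speed:

    def hamming(sig_a, sig_b):
        # XOR leaves a 1 bit wherever the two signatures disagree;
        # counting those bits gives the Hamming distance.
        return bin(sig_a ^ sig_b).count("1")

    a = 0b1011001110001111
    b = 0b1011101010001011
    print(hamming(a, b))  # 3 differing bit positions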

The TopSig indexing process uses random codes to compress document vectors. These random codes are also called index vectors in the random indexing process. The codes are influenced by the original document vectors. Documents that are close together in the original vector space are placed close together in the reduced binary vector space. Therefore, it is expected that the pairwise relationships between documents will be biased by this process. If the indexing process had no effect then the document signatures would appear no different to purely random signatures. The pairwise distances between randomly generated signatures can be described by the Binomial distribution. The distribution of pairwise distances produced by the TopSig indexing process can be estimated by creating a histogram of similarity counts at all Hamming distances.

All $\frac{n^2 - n}{2}$ pairwise distances between document signatures in $D$ are calculated. This is all the similarities contained in the upper triangular form of the pairwise distance matrix without the entries along the main diagonal. The lower half of the pairwise distance matrix does not need to be calculated as the Hamming distance is symmetric. Measuring a pair of signatures both ways around does not add any extra information. The Hamming distance similarity function, $s : \{+1,-1\}^d \times \{+1,-1\}^d \to \mathbb{N}$, is symmetric such that two documents compared in either order produce the same result, $d_x, d_y \in D : s(d_x, d_y) = s(d_y, d_x)$. The estimated probability of finding a signature at Hamming distance, $h$, is the fraction of similarities at that distance over the total number of distance comparisons. The probability mass function, $pmf_e : \mathcal{P}\{+1,-1\}^d \times \mathbb{N} \to \mathbb{R}$, produces the estimated probability from the pairwise distances in $D$, where $n$ is the number of signatures in the collection $D$, $|D| = n$,

$$pmf_e(D, h) = \frac{\left| \left\{ (d_x, d_y) : d_x, d_y \in D \wedge d_x \neq d_y \wedge s(d_x, d_y) = h \right\} \right|}{\frac{n^2 - n}{2}}. \qquad (1)$$

Note that $pmf_e$ is the estimated probability of finding a signature at distance $h$ when averaged across all documents in the collection, $D$.

The probability of finding a random binary code of length $d$ at Hamming distance $h$ is described by the Binomial probability mass function, $pmf_b : \mathbb{N} \times \mathbb{N} \to \mathbb{R}$,

$$pmf_b(d, h) = \binom{d}{h} p^h (1 - p)^{d - h}, \qquad (2)$$

where $p$ is the probability of a bit being set, $p = 0.5$.

The cumulative distribution function for either the estimated, $cdf_e : \mathcal{P}\{+1,-1\}^d \times \mathbb{N} \to \mathbb{R}$, or Binomial, $cdf_b : \mathbb{N} \times \mathbb{N} \to \mathbb{R}$, probability distribution is the sum of the corresponding probability mass function from 0 to $h$,

$$cdf_e(D, h) = \sum_{i=0}^{h} pmf_e(D, i), \qquad (3)$$

$$cdf_b(d, h) = \sum_{i=0}^{h} pmf_b(d, i). \qquad (4)$$
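Equations (1) to (4) might be computed as follows; a sketch using integer-packed signatures, quadratic in the number of signatures as noted in Section 3:

    from itertools import combinations
    from math import comb

    def estimated_pmf(signatures, d):
        # Histogram over all (n^2 - n) / 2 upper-triangular pairs.
        counts = [0] * (d + 1)
        for a, b in combinations(signatures, 2):
            counts[bin(a ^ b).count("1")] += 1
        total = sum(counts)
        return [c / total for c in counts]

    def binomial_pmf(d, h, p=0.5):
        # Probability that two random d-bit codes differ in h positions.
        return comb(d, h) * p ** h * (1 - p) ** (d - h)

    def cdf(pmf, h):
        # Probability of finding a signature within Hamming distance h.
        return sum(pmf[:h + 1])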

An implementation of the TopSig¹ search engine was used to index the document collections. It splits documents into passages on a sentence boundary between a minimum and maximum number of word tokens. If the maximum word token limit is reached before the end of a sentence, it is split at that point. Therefore, documents have multiple signatures. This has been found to be effective for retrieval of documents of varying length. The INEX collection was split on a minimum of 256 and a maximum of 280 word tokens. The WSJ collection was split on a minimum of 256 and a maximum of 384 word tokens. All indexes use 1024-bit signatures, resulting in the number of signatures listed in Table 1.

¹ http://topsig.googlecode.com

Collection      Documents   Signatures (Passages)
INEXreference   144,265     328,207
INEXrandom      144,265     195,369
WSJ             173,252     222,238

Table 1: Number of Signatures Generated by TopSig

Figure 1: INEXrandom pmf (expected document count vs. Hamming distance; estimated from data and Binomial)

The resulting probability distributions have been multiplied by the number of signatures in a collection to produce the expected number of signatures at a given Hamming distance. In this case, the pmf gives the average number of signatures expected at a particular Hamming distance when comparing a signature to the entire collection. The cdf gives the average number of signatures expected to lie within a given Hamming distance when comparing a signature to the entire collection, i.e. the number of nearest neighbours to expect within a particular Hamming distance.

5. EXPERIMENTAL RESULTS

Figures 1 to 12 highlight the difference between the distributions estimated from the pairwise distances and the distributions expected from random binary signatures under the Binomial distribution. It can be seen that all the estimated distributions are skewed to the left, towards a Hamming distance of 0. This indicates that the signatures produced by TopSig are biased in such a way that documents are more similar to each other. There are more documents expected at a more similar, lower Hamming distance than expected at random.

The probability mass functions in Figures 1, 2 and 3 represent the expected number of signatures to be seen at a particular Hamming distance. The graphs have been centred around the middle of the distributions to allow better visualisation of the separation between the distributions.

Figure 2: INEXreference pmf (expected document count vs. Hamming distance; estimated from data and Binomial)

Figure 3: WSJ pmf (expected document count vs. Hamming distance; estimated from data and Binomial)

Figure 4: INEXrandom cdf (expected nearest neighbours vs. Hamming distance; estimated from data and Binomial)

The tails of the distributions tend towards 0 as expected. For example, the graph in Figure 3 has a y value for the estimated distribution of 1033.39 at a Hamming distance of 441. When comparing a signature to the entire collection, it would be expected on average to encounter 1033.39 signatures that are exactly at a Hamming distance of 441. However, the expected number of signatures at a Hamming distance of 441 for purely random signatures is only 0.29. This suggests that the signatures produced by TopSig are not uniformly distributed throughout the feature space. The number of nearest neighbours at a given Hamming distance, as described by the pmf, quickly increases when starting from a Hamming distance of 0 and proceeding to a Hamming distance of d. This is the same order in which TopSig ranks signatures in the ranked list, or any other task that compares relative orderings of documents, such as clustering. This is true for both the estimated and Binomial distributions. As the neighbourhood of analysis is increased, more and more documents become equidistant; i.e. they share the same Hamming distance. This is a property of vector space models known as the "curse of dimensionality". However, the left skewness of the estimated distributions indicates that the pairwise distances of the document collections allow better differentiation between documents than expected at random. It is this left skewness of the distributions that allows TopSig to compete with state of the art retrieval models at early precision. Documents are topically clustered and are not random bags of words. Neither of the document collections has signatures further apart than a Hamming distance of 617, meaning that the indexing process has moved the random signatures from the right side of the distribution to the left. This again indicates that similar signatures are being placed closer together and are therefore more topically related and clustered.

The cumulative distribution functions in Figures 4, 5 and 6 represent the area under the curve for each of the probability mass functions. The y value at a given Hamming distance indicates the average number of nearest neighbours expected within that Hamming distance when comparing a signature to the entire collection. For example, the graph in Figure 6 has a y value for the estimated distribution of 25975.78 at a Hamming distance of 441. When comparing a signature to the entire collection, it would be expected on average to encounter 25975.78 nearest neighbours within a Hamming distance of 441. However, the expected number for purely random signatures is only 1.13. Again, the separation between the curves indicates that TopSig is placing semantically related documents close together and preserving the topological relationships of the original document vectors.

Figure 5: INEXreference cdf (expected nearest neighbours vs. Hamming distance; estimated from data and Binomial)

Figure 6: WSJ cdf (expected nearest neighbours vs. Hamming distance; estimated from data and Binomial)

Figure 7: INEXrandom cdf, first 100 signatures from estimated distribution

Figure 8: INEXreference cdf, first 100 signatures from estimated distribution

Figures 7, 8 and 9 zoom in on the cdf where the first 100 nearest neighbours are expected for the distribution estimated from the pairwise distances of the collections. In all cases, almost zero signatures are expected at random where there are TopSig signatures expected in the range [1, 100]. This indicates that the signatures produced by TopSig return nearest neighbours at a Hamming distance much earlier than expected for purely random signatures. The start and end points of these curves are listed in Table 2.

Figures 10, 11 and 12 zoom in on the cdf where the first 100 nearest neighbours are expected for random binary signatures as described by the Binomial distribution. The start

[Figure 9: WSJ cdf, first 100 signatures from the estimated distribution. Axes: Hamming distance versus expected nearest neighbours; estimated and Binomial curves.]

Collection      cdf_b @ cdf_e = 1     cdf_b @ cdf_e = 100
INEXreference   1.66 × 10^-168        1.104 × 10^-15
INEXrandom      1.80 × 10^-158        2.06 × 10^-34
WSJ             0                     1.59 × 10^-34

Table 2: Nearest Neighbours Expected from cdf_b

[Figure 10: INEXrandom cdf, first 100 signatures from the Binomial distribution. Axes: Hamming distance versus expected nearest neighbours; estimated and Binomial curves.]

Collection      cdf_e @ cdf_b = 1     cdf_e @ cdf_b = 100
INEXreference   567.86                2129.02
INEXrandom      1793.26               3673.78
WSJ             25495.67              50709.94

Table 3: Nearest Neighbours Expected from cdf_e


[Figure 11: INEXreference cdf, first 100 signatures from the Binomial distribution. Axes: Hamming distance versus expected nearest neighbours; estimated and Binomial curves.]

[Figure 12: WSJ cdf, first 100 signatures from the Binomial distribution. Axes: Hamming distance versus expected nearest neighbours; estimated and Binomial curves.]

Collection      cdf_b @ h = 512 / n    cdf_e @ h = 512 / n
INEXreference   0.51                   0.61
INEXrandom      0.51                   0.63
WSJ             0.51                   0.89

Table 4: Signatures Expected within d/2

and end points of these curves are listed in Table 3. There are many more signatures expected to be nearest neighbours when using TopSig signatures. However, both the estimated and Binomial distributions have many equidistant documents around the middle of their distributions. This suggests that only the local neighbourhood of the signatures has semantic meaning. This can also be seen in the pmf distributions where most of the signatures exist around the middle of the distribution. Another perspective is that there are too many ties at these distances for the feature space to differentiate signatures.

The skewness of the estimated distributions suggests thatthe feature space is not uniform and is clustered. Some areasof the space are more dense than others. This is vital for anydocument representation because it is this non-uniformitythat allows differentiation of meaning.

Table 3 lists the number of nearest neighbours expected from the distribution estimated from pairwise distances when the Binomial distribution expects 1 and 100 nearest neighbours, as listed in columns 2 and 3 respectively. For example, the INEXrandom collection expects on average 1793.26 signatures to be nearest neighbours when purely random signatures would expect 1. When purely random signatures expect on average 100 nearest neighbours, the INEXrandom collection expects 3673.78 nearest neighbours. These values are linearly interpolated as they fall between two Hamming distances under the cdf. They are the start and end points for the curves in Figures 10, 11 and 12. Table 2 lists the opposite, i.e., the number of nearest neighbours expected from the Binomial distribution when the distribution estimated from pairwise distances expects 1 and 100 nearest neighbours.

Table 4 lists the number of signatures expected within a Hamming distance of d/2. This summarises the distributions in a single number, where the difference between the distributions indicates the fraction of the signatures shifted from the left hand side of the Binomial distribution to the right by the indexing process. It is also the value under the pmf at d/2, which is also the y value of the cdf at d/2.

The difference in distributions between the INEX reference and random subsets indicates that the reference run is not suitable for estimating properties of the entire collection. This is to be expected as the reference run has been biased by the queries used for ad hoc retrieval. Table 3 shows that INEXreference expects 567.86 nearest neighbours whereas INEXrandom expects 1793.26 nearest neighbours when purely random signatures expect 1 nearest neighbour. This indicates that the reference run is less clustered than a random sample from the INEX Wikipedia collection. This can be explained because the documents returned by the reference run are more diverse than a random sample from the collection. As the diverse topics are further apart, i.e. more dissimilar, there are more inter-topic distances than


intra-topic distances, leading to fewer signatures being located nearby. Note that the reference run is determined by searching in only a few dimensions determined by the keywords in the queries, whereas the pairwise distances compare entire documents, using their entire vocabulary.

6. DISCUSSION

The results presented indicate why TopSig is only competitive at early precision in comparison to probabilistic and language models for ad hoc retrieval. As the Hamming distance increases when proceeding down the ranked list, more and more documents become equidistant. This can be seen in Figures 1, 2 and 3 containing plots of probability mass functions indicating the expected number of documents at a given Hamming distance. The curves quickly increase to the point where thousands of documents are equidistant. This is likely to be a property of any vector space model due to the "curse of dimensionality". Only the tails of the distribution of distances are useful for differentiating relevant and non-relevant documents.

Approaches to near duplicate detection such as SimHash [3] use short signatures that are 64 bits in length. This only allows the few nearest neighbours to be differentiated, which is adequate for near duplicate detection. This can be explained by the probability mass functions in Figures 1, 2 and 3. The x axis for 64-bit signatures only contains 65 positions for the Hamming distances 0 to 64. As the number of equidistant documents is a function of the Hamming distance, many documents will appear equidistant much sooner than with signatures of 1024 bits in length. The same curve has to be squeezed into 65 positions instead of 1025 positions. A duplicate is expected to be very similar to the documents it duplicates, so these short signatures suffice. In contrast, TopSig uses much longer signatures that allow for better separation for tasks such as ad hoc retrieval and clustering.
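
The signature comparison underlying this analysis is a Hamming distance, computed as the popcount of an XOR. A minimal sketch follows, representing a signature as a Python integer; a production implementation would operate on packed 64-bit words with a hardware popcount instruction.

    def hamming(a: int, b: int) -> int:
        """Number of differing bits between two equal-width signatures."""
        return (a ^ b).bit_count()  # int.bit_count() requires Python 3.10+

    # With 64-bit signatures only 65 distinct distances exist, so ties appear
    # almost immediately; 1024-bit signatures spread the same neighbourhood
    # structure over 1025 positions.
    print(hamming(0b1011_0110, 0b1001_0011))  # 3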

Document clustering places similar documents into groups of topically related documents. The results presented in this paper suggest that only document clusters that exist within the local neighbourhood of a vector space are interpretable. As the documents within a cluster become more dissimilar, the grouping of these documents loses its meaning for the same reason that precision at higher recall suffers in ad hoc retrieval: there are many equidistant documents that cannot be differentiated from one another. This suggests that only a large number of smaller document clusters are meaningful. The maximum interpretable radius for a document cluster can be estimated heuristically from the distributions estimated from the pairwise data. The heuristic is to stop at the point where the distribution starts to sharply increase. In Figure 1 this would be approximately a Hamming distance of 450, or the point before the elbow in the left hand side of the distribution occurs.
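
This elbow heuristic can be made concrete. The following is a hedged sketch, assuming the estimated pmf is available as an array of expected neighbour counts indexed by Hamming distance; the growth threshold is an illustrative assumption, since the paper identifies the elbow by inspection.

    from typing import Sequence

    def max_interpretable_radius(pmf: Sequence[float], growth: float = 2.0) -> int:
        """Largest Hamming distance before the expected neighbour count
        starts to increase sharply (the point before the elbow)."""
        for h in range(1, len(pmf)):
            prev, curr = pmf[h - 1], pmf[h]
            if prev > 0 and curr / prev > growth and curr > 1.0:
                return h - 1
        return len(pmf) - 1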

Furthermore, TopSig is likely to be useful for increasing the computational efficiency of document-to-document comparisons. Examples include clustering, classification, filtering, relevance feedback, near duplicate detection and explicit semantic analysis. All of these tasks can exploit the left tails of the probability mass function distributions depicted in Figures 1, 2 and 3. In fact, TopSig has been shown to provide a one to two order of magnitude increase in processing speed for document clustering [7] over traditional sparse vector representations.

The analysis presented in this paper is expected to be useful for any vector space model. It would be expected that similar behaviour would be exhibited whether comparing entire documents in the full vocabulary space of the term-by-document matrix or comparing dimensionality reduced documents in a continuous space such as those produced by latent semantic analysis, principal component analysis or random indexing.

7. REFERENCES

[1] E. Bingham and H. Mannila. Random projection in dimensionality reduction: applications to image and text data. In KDD 2001, pages 245–250. ACM, 2001.

[2] B.H. Bloom. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13(7):422–426, 1970.

[3] M.S. Charikar. Similarity estimation techniques from rounding algorithms. In STOC, pages 380–388, New York, NY, USA, 2002. ACM.

[4] C.M. De Vries, R. Nayak, S. Kutty, S. Geva, and A. Tagarelli. Overview of the INEX 2010 XML mining track. INEX 2010, LNCS, 2011.

[5] S. Deerwester, S.T. Dumais, G.W. Furnas, T.K. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407, 1990.

[6] C. Faloutsos and S. Christodoulakis. Signature files: an access method for documents and its analytical performance evaluation. ACM Trans. Inf. Syst., 2:267–288, October 1984.

[7] S. Geva and C.M. De Vries. TopSig: topology preserving document signatures. In CIKM 2011, pages 333–338. ACM, 2011.

[8] G. Golub and W. Kahan. Calculating the singular values and pseudo-inverse of a matrix. Journal of the Society for Industrial and Applied Mathematics: Series B, Numerical Analysis, 2(2):205–224, 1965.

[9] W.B. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26:189–206, 1984.

[10] M. Sahlgren. An introduction to random indexing. In TKE 2005, 2005.

[11] A. Trotman, X. Jia, and S. Geva. Fast and effective focused retrieval. In Focused Retrieval and Evaluation, LNCS, pages 229–241. 2010.

[12] I.H. Witten, A. Moffat, and T.C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann, 1999.

[13] J. Zobel, A. Moffat, and K. Ramamohanarao. Inverted files versus signature files for text indexing. ACM Trans. Database Syst., 23:453–490, December 1998.


Chapter 15

Distributed Information Retrieval: Collection Distribution, Selection and the Cluster Hypothesis for Evaluation of Document Clustering

I contributed almost all of this paper. Shlomo and Andrew provided much needed feedback and aided in experimental design. I wrote the manuscript, wrote software, aided in experimental design, conducted experiments, and performed the data analysis. I proposed, implemented and evaluated all of the contributions in this paper such as the TopSig K-tree and the cluster ranking process, CBM625.

15.1 Algorithms

TopSig K-tree is introduced to scale clustering to 50 million documents while producing approximately 140,000 clusters in 10 hours in a single thread of execution. To the best of my knowledge, clustering of a document collection this large into this many clusters has not been reported in the literature. A delayed update mechanism is introduced because the K-tree algorithm regularly updates means along the insertion path of vectors inserted into the tree. When using binary vectors the cost of updating a mean is high, as all the bits in vectors associated with the mean must be unpacked and repacked.
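
To illustrate where that cost comes from, the following is a minimal sketch of recomputing a binary cluster mean, not the TopSig K-tree implementation itself: every member signature is unpacked into per-position bit counts and the majority vote is repacked into a new binary vector.

    def binary_mean(signatures: list[int], width: int) -> int:
        """Majority vote per bit position over packed binary signatures."""
        counts = [0] * width
        for sig in signatures:              # unpack: tally set bits per position
            for pos in range(width):
                counts[pos] += (sig >> pos) & 1
        mean = 0
        half = len(signatures) / 2
        for pos in range(width):            # repack: set bit where the majority is set
            if counts[pos] > half:
                mean |= 1 << pos
        return mean

    print(bin(binary_mean([0b1100, 0b1110, 0b0110], width=4)))  # 0b1110

Delaying this recomputation until several insertions have accumulated amortises the unpack and repack cost along the insertion path.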

15.2 Information Retrieval

This paper combines all of the previous sections of representations, algorithms and evaluation to implement the ideas as a cluster-based search engine. The final result is that only 0.1% of the 50 million document ClueWeb 2009 Category B collection has to be searched. This is a 13 fold improvement over the previously best known result. It also addresses issues surrounding the evaluation of document clustering and demonstrates the use case presented in the evaluation at INEX.


Distributed Information Retrieval: Collection Distribution, Selection and the Cluster Hypothesis for Evaluation of Document Clustering

CHRISTOPHER M. DE VRIES, Queensland University of Technology
SHLOMO GEVA, Queensland University of Technology
ANDREW TROTMAN, University of Otago

This paper addresses the problem of using document clusters for ad hoc information retrieval in large scale distributed and parallel search engines. We present a novel approach to collection distribution and selection via the use of the TopSig K-tree for document clustering for collection distribution, and the CBM625 method for ranking clusters for collection selection. This approach allows a 13 fold decrease in the number of documents that have to be retrieved over the previous best known approach. We suggest that the major contributor to this improvement is the TopSig K-tree. It allows the creation of fine grained clusters of large scale document collections without resorting to sampling. The TopSig K-tree can be updated efficiently in a dynamic and online fashion. TopSig also allows the creation of document signatures in isolation without performing any global analysis. These properties are particularly useful for operational retrieval systems, where new documents must be regularly indexed. The whole world does not need to be stopped, and the indexes can be updated incrementally and efficiently. This research started with the evaluation of document clustering at the INEX XML Mining track in 2009. This paper also presents the use of ad hoc relevance judgments as an alternative approach to the evaluation of document clustering. Furthermore, this evaluation is motivated by a real use case that is demonstrated in this paper: the reduction in the number of documents required to be searched in a distributed retrieval system. These results further confirm the cluster hypothesis.

Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

General Terms: Algorithms, Design, Experimentation, Performance, Theory

Additional Key Words and Phrases: Distributed Information Retrieval, Document Clustering, Document Clustering Evaluation, Cluster Hypothesis, Collection Selection, Collection Distribution, Document Signatures, Signature Files, Search Engines, Vector Space IR, K-tree, k-means, TopSig, CBM625, BM25, ATIRE

ACM Reference Format:
Christopher M. De Vries, Shlomo Geva, Andrew Trotman, 2013. Distributed Information Retrieval: Collection Distribution, Selection and the Cluster Hypothesis for Evaluation of Document Clustering. ACM Trans. Inf. Syst. V, N, Article A (January YYYY), 38 pages. DOI: http://dx.doi.org/10.1145/0000000.0000000

This work is supported by the Queensland University of Technology Deputy Vice Chancellor's Scholarship, the Queensland University of Technology High Performance Computing Centre and the Queensland University of Technology Big Data Lab.
Author's addresses: C.M. De Vries and Shlomo Geva, School of Electrical Engineering and Computer Science, Queensland University of Technology, Brisbane, Australia; Andrew Trotman, Department of Computer Science, University of Otago, Dunedin, New Zealand.


1. INTRODUCTION

Category based document clustering evaluation does not have a specific use case. Clusters are compared to a known solution referred to as the ground truth. The ground truth is determined by humans who assign each document to one or more classes based on the topics they believe exist in a document. In this paper, we offer a novel and alternative evaluation using ad hoc relevance judgements from information retrieval evaluation forums. This approach is motivated by the cluster hypothesis, which states that documents relevant to information needs tend to be more similar to each other than non-relevant documents – the relevant documents tend to cluster together. Therefore, if the cluster hypothesis holds, relevant documents for a given query will tend to appear in relatively few document clusters. This evaluation tests document clustering for a specific use case. It evaluates how effective a clustering is at increasing the throughput of an information retrieval system by only searching a subset of a document collection.

The cluster hypothesis can be exploited for searching a fraction of a document collection while still covering and ranking most of the relevant documents. This use case is tested with several experiments and evaluation strategies, including tests with working retrieval systems and reusable information retrieval evaluation resources. The process of selecting clusters to search in an information retrieval system is referred to as collection selection. It is also discussed in the literature under the names of sharding or index partitioning. Each cluster is treated as a sub-collection so that each cluster can be distributed to a different machine. Each sub-collection or cluster is processed by a collection selection algorithm where clusters are ranked by their estimated relevance to a query. Only a small fraction of the most relevant clusters is searched. If the cluster hypothesis holds, then the most relevant clusters will contain most of the relevant documents. Whether these clusters are distributed over many machines or only exist on a single machine, they can be exploited so that only a fraction of a collection has to be searched. The clustering or partitioning of a collection does not strictly have to be produced by a clustering algorithm. It can be created by random partitioning or source based methods. The collection selection process simply takes a grouping of documents and recommends which groupings to search given a query. It can be seen as a broker or recommender that recommends the collections or clusters to search given a query, as in the sketch below.
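
The following is an illustrative broker only; the scoring function (query-term overlap with a per-cluster summary) is a placeholder assumption and not the CBM625 method defined later in this paper.

    def select_collections(query_terms: set[str],
                           cluster_summaries: dict[str, set[str]],
                           top_k: int = 3) -> list[str]:
        """Rank clusters by overlap with the query; return the top_k to search."""
        ranked = sorted(cluster_summaries.items(),
                        key=lambda item: len(query_terms & item[1]),
                        reverse=True)
        return [name for name, _ in ranked[:top_k]]

    clusters = {
        "sport":   {"football", "score", "league"},
        "finance": {"market", "stock", "bank"},
        "science": {"quantum", "cell", "energy"},
    }
    print(select_collections({"stock", "market", "crash"}, clusters, top_k=1))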

The study of collection distribution and selection is particularly relevant to distributed information retrieval where many machines are cooperating to provide a search service. It is expected that distributing a collection via document clustering and searching via collection selection can increase throughput. In this paper, we only consider the case of a cooperative distributed information retrieval system where full access to the indexes of each system is possible. This enables use of global statistics and any other approaches that require more than the use of the standard query interface. Uncooperative systems only allow use of the standard query interface. All necessary information about a system that is acting as part of the distributed system has to be gained via this interface. The collection selection approach has to build an appropriate model by querying the systems individually.

Document clustering is not typically thought of as an efficient and scalable process. Many popular clustering algorithms have a time complexity of O(n²), making them impractical for collections containing millions to billions of documents. However, we present document clustering approaches that can scale to large document collections and we demonstrate their effectiveness for use with collection selection so that only a fraction of a document collection has to be searched.


The material presented here builds upon and extends the evaluation of document clustering in the XML Mining track at the Initiative for the Evaluation of XML Information Retrieval (INEX) in 2009 [Nayak et al. 2010] and 2010 [De Vries et al. 2011]. It also builds upon the divergence from random baseline approach to the evaluation of document clustering published in 2012 [De Vries et al. 2012].

For a comprehensive review of clustering algorithms see Duda et al. [2001], and for a comprehensive review of information retrieval see Manning et al. [2008].

The paper proceeds as follows. Related publications are discussed in Section 2. Section 3 introduces the INEX XML Mining track where this line of research originated with a collaborative evaluation. Ad hoc information retrieval and its associated evaluation are briefly introduced in Section 6. The literature surrounding document clustering is discussed in Section 4. The cluster hypothesis that links document clustering and ad hoc information retrieval is discussed in Section 7. Section 6.1 discusses collection distribution and selection, and relevant publications in this area. The issues with current approaches to evaluation of document clustering are discussed in Section 4.1. It is explained how this evaluation overcomes these issues in Section 8. The Normalised Cumulative Cluster Gain evaluation approach is introduced in Section 9. It provides an upper bound on retrieval performance by using an "oracle" collection selection approach. Section 10 discusses the Divergence from Random Baseline approach to the evaluation of document clustering that adjusts for problematic clusterings in evaluation. The TopSig K-tree is presented for the first time in Section 11. A new cluster ranking approach for collection selection named CBM625 is defined in Section 12. The prior theoretical results from INEX are tested on a cluster based retrieval system in Section 13. Finally, the implications of all the results are discussed in Section 14.

2. RELATED WORK

Other researchers have investigated splitting document collections into shards. The study that is most similar to this study is that of Kulkarni and Callan [2010]. It investigates the use of topics learned in an unsupervised manner to search a small fraction of a large scale document collection. Documents are clustered using k-means with the KL-divergence similarity measure using a small sample of uniformly randomly sampled documents. The remaining documents are then assigned to the centroids found using the sample. The drawback of this approach is out of vocabulary terms when ranking clusters. The approach outlined in this paper does not suffer from this drawback as it clusters all documents. Xu and Croft [1999] perform a similar study, finding comparable results on smaller scale collections. All of these studies, including this study, have found the same result: topic based allocation of documents via document clustering outperforms uniform random assignment of documents.

Other studies have investigated alternative collection distribution strategies. Larkey et al. [2000] investigate the use of manual topical classification of patent documents, where it outperformed chronological allocation. Puppin et al. [2006] investigate the use of query logs to partition a document collection, with the drawback that not all documents can be clustered.

Ding et al. [2011] describe a histogram approach for collection selection. This is similar to the approach we tried in Section 12. However, we found that squaring the weights to produce document scores is more effective than a summation as in the histogram approach. The authors propose that random allocation and random selection of documents and servers is a good candidate due to its simplicity. Fine grained clusters such as those produced by the TopSig K-tree in Section 11 represent something much closer to a document than much larger clusters such as those used by Kulkarni and Callan [2010]. Therefore, randomly allocating fine grained clusters can retain the advantages of both the random and topical clustering approaches.


Fuhr et al. [2011] introduce the "optimum clustering framework". It augments document clustering with queries and relevance judgements so that relevant documents can be optimally placed into clusters to maximise the cluster hypothesis. This is no longer unsupervised learning and has been transformed into semi-supervised learning. Therefore, that work is essentially different from the studies of Kulkarni and Callan [2010] and Xu and Croft [1999], and from this study. However, we suggest that the optimum clustering framework may also be more effective when using scalable clustering approaches such as the TopSig K-tree.

[?] take a different approach to dealing with the issues surrounding evaluation of clustering using a ground truth. They suggest that using multiple different ground truths as different views of clustering can overcome issues with clustering evaluation. However, this does not deal with the problem of labelling large data sets. It only increases assessor load as each example has to be labelled multiple times.

3. INEX XML MINING TRACK

The XML document mining track was run for six years at INEX, the Initiative for the Evaluation of XML Information Retrieval¹ [Denoyer et al. 2007; Denoyer and Gallinari 2008; Denoyer and Gallinari 2009; Nayak et al. 2010; De Vries et al. 2011]. It explored the emerging field of classification and clustering of semi-structured documents.

Document clustering has been evaluated at INEX using the standard clusters to categories approach. Categories were extracted from the Wikipedia and used as a ground truth. Clusterings produced by different systems were evaluated using measures such as Purity, Entropy, F1 and NMI, indicating how well the clusters match the categories.

A novel approach to document clustering evaluation was introduced at INEX in 2009 [Nayak et al. 2010] and 2010 [De Vries et al. 2011]. It used ad hoc information retrieval to evaluate document clustering by using relevance judgments from retrieval systems in the ad hoc track [?]. Ad hoc information retrieval evaluation is a system based approach that evaluates how different systems rank relevant documents. For systems to be compared, the same set of information needs and documents have to be used. A test collection consists of documents, statements of information need, and relevance judgments [Voorhees 2002]. Relevance judgments are often binary, and any document is considered relevant if any of its contents can contribute to the satisfaction of the specified information need. However, the ad hoc track at INEX provides additional relevance information where assessors highlight the relevant text in the documents. Information needs are also referred to as topics and contain a textual description of the information need, including guidelines as to what may or may not be considered relevant. Typically, only the keyword based query of a topic is given to a retrieval system.

The ad hoc information retrieval based evaluation of document clustering is motivated by the cluster hypothesis that suggests relevant documents are more similar to each other than non-relevant documents; relevant documents tend to cluster together. The spread of relevant documents over a clustering solution was measured using the Normalised Cumulative Cluster Gain (NCCG) measure in the INEX XML mining track in 2009 and 2010 [Nayak et al. 2010; De Vries et al. 2011]. This evaluation approach also has a specific use case in information retrieval. It evaluates clustering of a document collection for collection selection. Collection selection involves selecting a subset of a collection given a query. Typically, these subsets are distributed on different machines. The goal is to cluster documents such that only a small fraction of clusters, and therefore machines, need to be searched to find most of the relevant

¹ http://inex.otago.ac.nz/tracks/wiki-mine/wiki-mine.asp


documents for a given query. This leads to improved run time performance as only a fraction of the collection needs to be searched. The total load over a distributed system is decreased as only a few machines need to be searched per query instead of every machine. It also provides a clear use case for document clustering evaluation. By contrast, comparing document clusters to predefined categories only evaluates clustering as a match against a particular classification.

Topical categories for documents are one of many views of extrinsic cluster quality. They are derived from what humans perceive as topics in a document collection. When categories are used for evaluation, a document clustering system is given a score indicating how well the clusters match the predefined categories. This is the most prevalent approach to evaluation of document clustering in the research literature.

The categories for the INEX 2010 XML Mining collection were extracted from the Wikipedia category graph, which is noisy and nonsensical at times. Therefore, an approach using shortest paths in the graph was used to extract 36 categories [De Vries et al. 2011].

4. DOCUMENT CLUSTERING

Document clustering is used in many different contexts, such as exploration of structure in a document collection for knowledge discovery [Tan 1999], dimensionality reduction for other tasks such as classification [Kyriakopoulou and Kalamboukis 2007], clustering of search results for an alternative presentation to the ranked list [Hearst and Pedersen 1996] and pseudo-relevance feedback in retrieval systems [Lee et al. 2008].

Recently there has been a trend towards exploiting semi-structured documents [Nayak et al. 2002; Denoyer and Gallinari 2009]. This uses features such as the XML tree structure and hyper-link graphs to derive data from documents to improve the quality of clustering.

Document clustering groups documents into topics without any knowledge of the category structure that exists in a document collection. All semantic information is derived from the documents themselves. It is often referred to as unsupervised clustering. In contrast, document classification is concerned with the allocation of documents to predefined categories where there are labeled examples. Classification is referred to as supervised learning, where a classifier is learned from labeled examples and used to predict the classes of unseen documents.

The goal of clustering is to find structure in data to form groups. As a result, there are many different models, learning algorithms, encodings of documents and similarity measures. Many of these choices lead to different induction principles [Estivill-Castro 2002] which result in discovery of different clusters. An induction principle is an intuitive notion as to what constitutes groups in data. For example, algorithms such as k-means [Lloyd 1982] and Expectation Maximization [Dempster et al. 1977] use a representative based approach to clustering where a prototype is found for each cluster. These prototypes are referred to as means, centers, centroids, medians and medoids [Estivill-Castro 2002]. A similarity measure is used to compare the representatives to examples being clustered. These choices determine the clusters discovered by a particular approach.

A popular model for learning with documents is the Vector Space Model (VSM) [Salton et al. 1975]. Each dimension in the vector space is associated with one term in the collection. Term frequency statistics are collected by parsing the document collection and counting how many times each term appears in each document. This is supported by the distributional hypothesis [Harris 1954] from linguistics that theorizes that words that occur in the same context tend to have similar meanings. If two documents use a similar vocabulary and have similar term frequency statistics,


then they are likely to be topically related. The result is a high dimensional, sparse document-by-term matrix with properties that can be explained by Zipf distributions [Zipf 1949] in term occurrence. The matrix represents a document collection where each row is a document and each column is a term in the vocabulary. In the clustering process, document vectors are often compared using the cosine similarity measure. The cosine similarity measure has two properties that make it useful for comparing documents. Document vectors are normalized to unit length when they are compared. This normalization is important since it accounts for the higher term frequencies that are expected in longer documents. The inner product that is used in computing the cosine similarity has non-zero contributions only from words that occur in both documents. Furthermore, sparse document representation allows for efficient computation.
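
A small sketch of this comparison, with documents as sparse term-weight dictionaries; only terms occurring in both documents contribute to the inner product, and the normalisation compensates for document length.

    from math import sqrt

    def cosine(u: dict[str, float], v: dict[str, float]) -> float:
        """Cosine similarity between two sparse term-weight vectors."""
        if len(v) < len(u):
            u, v = v, u                     # iterate over the shorter vector
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        norm = (sqrt(sum(w * w for w in u.values()))
                * sqrt(sum(w * w for w in v.values())))
        return dot / norm if norm else 0.0

    doc1 = {"cluster": 2.0, "document": 1.0, "retrieval": 1.0}
    doc2 = {"cluster": 1.0, "retrieval": 3.0, "query": 1.0}
    print(cosine(doc1, doc2))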

Different approaches exist to weight the term frequency statistics contained in the document-by-term matrix. The goal of this weighting is to take into account the relative importance of different terms, and thereby facilitate improved performance in common tasks such as classification, clustering and ad hoc retrieval. Two popular approaches are TF-IDF [Salton and Buckley 1988] and BM25 [Robertson et al. 1995; Whissell and Clarke 2011].
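
A minimal TF-IDF sketch under a common logarithmic formulation; the exact variant differs between systems, so this is illustrative rather than the precise weighting of Salton and Buckley [1988] or BM25.

    from math import log

    def tf_idf(corpus: list[list[str]]) -> list[dict[str, float]]:
        """Weight term frequencies by inverse document frequency."""
        n_docs = len(corpus)
        df: dict[str, int] = {}
        for doc in corpus:                  # document frequency of each term
            for term in set(doc):
                df[term] = df.get(term, 0) + 1
        weighted = []
        for doc in corpus:
            tf: dict[str, int] = {}
            for term in doc:                # raw term frequency
                tf[term] = tf.get(term, 0) + 1
            weighted.append({t: f * log(n_docs / df[t]) for t, f in tf.items()})
        return weighted

    print(tf_idf([["clustering", "documents"], ["documents", "retrieval"]]))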

Clustering algorithms can be characterized by two properties. The first determines if cluster membership is discrete. Hard clustering algorithms assign each document to exactly one cluster. Soft clustering algorithms assign documents to one or more clusters in varying degrees of membership. The second determines the structure of the clusters found as being either flat or hierarchical. Flat clustering algorithms produce a fixed number of clusters with no relationships between the clusters. Hierarchical approaches produce a tree of clusters, starting with the broadest level clusters at the root and the narrowest at the leaves.

K-means [Lloyd 1982] is one of the most popular learning algorithms for use with document clustering and other clustering problems. It has been reported as one of the top 10 algorithms in data mining [Wu et al. 2008]. Despite research into many other clustering algorithms, it is often the primary choice for practitioners due to its simplicity [Guyon et al. 2009] and quick convergence [Arthur et al. 2009]. Other hierarchical clustering approaches such as repeated bisecting k-means [Steinbach et al. 2000], K-tree [De Vries and Geva 2009] and agglomerative hierarchical clustering [Steinbach et al. 2000] have also been used. Note that the K-tree paper is by the author of this paper and other related content can be found in the associated Masters thesis [De Vries 2010b]. Further methods such as graph partitioning algorithms [Karypis et al. 1999], matrix factorization [Xu et al. 2003], topic modeling [Blei et al. 2003] and Gaussian mixture models [Dempster et al. 1977] have also been used.

The k-means algorithm [Lloyd 1982] uses the vector space model by iteratively optimizing k centroid vectors, which represent clusters. These centroids are updated by taking the mean of the nearest neighbor set of each centroid. The algorithm iteratively optimizes the sum of squared distances between the centroids and their nearest neighbor sets of vectors (the clusters). This is achieved by iteratively updating the centroids to the cluster means and reassigning nearest neighbors to form new clusters, until convergence. The centroids are initialized by selecting k vectors from the document collection uniformly at random. It is well known that k-means is a special case of Expectation Maximization with hard cluster membership and isotropic Gaussian distributions [Press 2007].
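
A compact sketch of the procedure just described, assuming dense vectors and squared Euclidean distance for brevity.

    import random

    def kmeans(points: list[list[float]], k: int, iters: int = 20) -> list[list[float]]:
        """Lloyd's algorithm: random initialisation, then alternate assignment
        to the nearest centroid with recomputing each centroid as its mean."""
        centroids = [p[:] for p in random.sample(points, k)]
        for _ in range(iters):
            clusters: list[list[list[float]]] = [[] for _ in range(k)]
            for p in points:                        # assignment step
                d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
                clusters[d.index(min(d))].append(p)
            for i, members in enumerate(clusters):  # update step
                if members:
                    centroids[i] = [sum(col) / len(members) for col in zip(*members)]
        return centroids

    pts = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 4.9]]
    print(kmeans(pts, k=2))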

The k-means algorithm has been shown to converge in a finite amount of time [Selim and Ismail 1984] as each iteration of the algorithm visits a possible permutation without revisiting the same permutation twice, leading to a worst case analysis of exponential time. Arthur et al. [2009] have performed a smoothed analysis to explain the quick convergence of k-means theoretically. This is the same analysis that has


been applied to the simplex algorithm, which has an exponential worst case complexity but usually converges in linear time on real data. While there are point sets that can force k-means to visit every permutation, they rarely appear in practical data. Furthermore, most practitioners limit the number of iterations k-means can run for, which results in linear time complexity for the algorithm. While the original proof of convergence applies to k-means using squared Euclidean distance [Selim and Ismail 1984], newer results show that other similarity measures from the Bregman divergence class of measures can be used with the same complexity guarantees [Banerjee et al. 2005]. This includes similarity measures such as KL-divergence, logistic loss, Mahalanobis distance and Itakura-Saito distance. Ding and He [2004] demonstrate the relationship between k-means and Principal Component Analysis. PCA is usually thought of as a matrix factorization approach for dimensionality reduction whereas k-means is considered a clustering algorithm. It is shown that PCA provides a solution to the relaxed k-means problem by ignoring some constants, thus formally creating a link between k-means and matrix factorization methods.

4.1. Evaluation

Evaluating document clustering is a difficult task. Intrinsic or internal measures of quality such as distortion [Pelleg et al. 2000] or log likelihood [Biernacki et al. 2000] only indicate how well an algorithm optimized a particular representation. Intrinsic comparisons are inherently limited by the given representation and are not comparable between different representations. Extrinsic or external measures of quality compare a clustering to an external knowledge source such as a ground truth labeling of the collection or ad hoc relevance judgments. This allows comparison between different approaches. Extrinsic views of truth are created by humans and suffer from the tendency for humans to interpret document topics differently. Whether a document belongs to a particular topic or not can be subjective. To further complicate the problem, there are many valid ways to cluster a document collection. It has been noted that clustering is ultimately in the eye of the beholder [Estivill-Castro 2002].

Most of the current literature on clustering evaluation utilizes the classes-to-clusters evaluation, which assumes that the classes of the documents are known. Each document has known category labels. Any clustering of these documents can be evaluated with respect to this predefined classification. It is important to note that the class labels are not used in the process of clustering, but only for the purpose of evaluation of the clustering results.

4.1.1. Evaluation Metrics. When comparing a cluster solution to a labeled ground truth, the standard measures of Purity, Entropy, NMI and F1 are often used to determine the quality of clusters with regard to the categories. Let ω = {w_1, w_2, ..., w_K} be the set of clusters for the document collection D and ξ = {c_1, c_2, ..., c_J} be the set of categories. Each cluster and category is a subset of the document collection, ∀c ∈ ξ, w ∈ ω : c, w ⊂ D. Purity assigns a score based on the fraction of a cluster that is the majority category label,

\[
\max_{c \in \xi} \frac{|c \cap w_k|}{|w_k|}, \tag{1}
\]

in the interval [0, 1] where 0 is the absence of purity and 1 is total purity. Entropy defines a probability for each category and combines them to represent order within a clustering,

\[
-\frac{1}{\log J} \sum_{j=1}^{J} \frac{|c_j \cap w_k|}{|w_k|} \log \frac{|c_j \cap w_k|}{|w_k|}, \tag{2}
\]


which falls in the interval [0, 1] where 0 is total order and 1 is complete disorder. F1 identifies a true positive (tp) as two documents of the same category in the same cluster, a true negative (tn) as two documents of different categories in different clusters, a false negative (fn) as two documents of the same category in different clusters, and a false positive (fp) as two documents of different categories in the same cluster. The score combines these classification judgments using the harmonic mean,

\[
\frac{2 \times tp}{2 \times tp + fn + fp}. \tag{3}
\]

The Purity, Entropy and F1 scores assign a score to each cluster which can be micro or macro averaged across all the clusters. The micro average weights each cluster by its size, giving each document in the collection equal importance in the final score. The macro average is simply the arithmetic mean, ignoring the size of the clusters. NMI makes a trade-off between the number of clusters and quality in an information theoretic sense. For a detailed explanation of these measures please consult Manning et al. [2008].
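
A sketch of the purity computation from Equation (1) with both averaging schemes; each inner list holds the category labels of one cluster's documents.

    def purity_scores(clusters: list[list[str]]) -> tuple[float, float]:
        """Return (micro, macro) averaged purity over all clusters."""
        per_cluster = []
        majority_total = 0
        doc_total = 0
        for labels in clusters:
            majority = max(labels.count(l) for l in set(labels))
            per_cluster.append(majority / len(labels))
            majority_total += majority
            doc_total += len(labels)
        micro = majority_total / doc_total            # weights clusters by size
        macro = sum(per_cluster) / len(per_cluster)   # plain arithmetic mean
        return micro, macro

    print(purity_scores([["a", "a", "b"], ["b"]]))  # (0.75, 0.8333...)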

4.1.2. Single and Multi Label Evaluation. Both the clustering approaches and the ground truth can be single or multi label. Examples of algorithms that produce multi label clusterings are soft or fuzzy approaches such as fuzzy c-means [Bezdek et al. 1984], Latent Dirichlet Allocation [Blei et al. 2003] or Expectation Maximisation [Dempster et al. 1977]. A ground truth is multi label if it allows more than one category label for each document. Any combination of single or multi label clusterings or ground truths can be used for evaluation. However, it is only reasonable to compare approaches using the same combination of single or multi label clustering and ground truths. Multi label approaches are less restrictive than single label approaches as documents can exist in more than one category. There is redundancy in the data whether it is a clustering or a ground truth. A ground truth can be viewed as a clustering and compared to another ground truth to measure how well the ground truths fit each other.

5. DOCUMENT REPRESENTATIONS

There are many different techniques that affect the generation of representations for clustering. By pre-processing documents, using new representations for different sources of information, and improving the representation of knowledge for learning, both the quality of clusters discovered and the run-time cost of producing clusters can be improved.

When representing natural language, different components of the language lead to different representations. Popular models include the bag-of-words approach [Steinbach et al. 2000], n-grams [Miao et al. 2005], part of speech tagging [Allan and Raghavan 2002], and semantic spaces [Lund and Burgess 1996].

Before any term based representation can be built, what constitutes a term must be determined via a number of factors such as term tokenization, conversion of words to their stems [Porter 2006], removal of stop words [Fox 1989], and conversion of characters such as removing accents. Additionally, automated techniques that remove irrelevant boilerplate content from documents, such as navigation structure on a web page, are used [Yi et al. 2003].

There are many approaches to weighting terms in the bag-of-words model. Whissell and Clarke [2013] investigate different weighting strategies for document clustering. TF-IDF [Salton and Buckley 1988] and BM25 [Robertson et al. 1995; Whissell and Clarke 2011] are two popular approaches to weighting terms in the bag-of-words model.

Links between documents can be represented for use in document clustering. There are algorithms that work directly with the graph structure in an iterative fashion to produce clusters [Van Dongen 2008]. Other approaches embed the graph structure in a


document representation such as vector space representations [Riesen et al. 2007; Luo et al. 2003]. The representations are then used with general clustering algorithms. The links can also be weighted by popular algorithms from ad hoc information retrieval such as PageRank [Page et al. 1999] and HITS [Kleinberg 1999].

Other internal structural information about documents can be represented implicitly or explicitly. One effective implicit approach is to double the weight of title words in a bag-of-words representation [Cohen and Singer 1999]. Other approaches explicitly represent structure, such as those that represent the XML structure of documents. For example, Kutty et al. [2011] represent both content and structure explicitly using a Tensor Space Model. Tagarelli and Greco [2006] combine structure and content together using tree mining and modified similarity measures.

There are many dimensionality reduction techniques aimed at reducing the number of dimensions required to represent a document in a vector space model. They all aim to preserve the information available in documents. However, some particular approaches such as Latent Semantic Analysis [Deerwester et al. 1990], based upon the Singular Value Decomposition, have been suggested to improve document representations by finding "concepts" in text. Popular approaches include Principal Component Analysis [Bingham and Mannila 2001], Singular Value Decomposition [Golub and Kahan 1965], Non-negative Matrix Factorization [Xu et al. 2003], the Wavelet transform [Murtagh et al. 2000], the Discrete Cosine Transform [Fox 2005], Semantic Hashing [Salakhutdinov and Hinton 2009], Locality Sensitive Hashing [Indyk and Motwani 1998], Random Indexing [Sahlgren 2005], and Random Manhattan Indexing [QasemiZadeh and Handschuh 2014].

With the rise of the social web, many web pages now have social annotations such as tags. These can be introduced into a document representation to further disambiguate the topic of a document [Ramage et al. 2009].

Various other multimedia sources of information can be represented in documents such as audio [Lu et al. 2002], video [Yeung et al. 1998] and images [Liew et al. 2000]. The processing of these signals constitutes many research fields in themselves, but there have been approaches incorporating this information into the document clustering process [Barnard and Forsyth 2001].

Documents can be mapped onto ontologies to provide a conceptual framework for the representation of meaning in documents. This has been used to improve the quality of document clusters [Hotho et al. 2003].

Additionally, incorporating information from other category systems such as the Wikipedia has been shown to improve the quality of clusters [Hu et al. 2009].

There are many approaches trying to use a single additional source to improve document clustering via improved representation. However, there seem to be few approaches combining multiple sources and comparing how they contribute to the identification of clusters.

6. AD HOC INFORMATION RETRIEVAL

Ad hoc information retrieval is concerned with the retrieval of documents given a user query. The phrase ad hoc is of Latin origin and literally means "for this". Users create a query specific to their information need and retrieval systems return relevant documents. Much of modern IR research has been concerned with ranked retrieval of documents. Each document is assigned an estimate of relevance for a given query and the results are displayed in descending order of relevance.

Retrieval systems or search engines provide automated ad hoc retrieval. The systems are typically implemented in software, but there have been hardware based retrieval systems [Weeks et al. 2002]. The inverted file is the data structure that underpins most modern retrieval systems [Manning et al. 2008]. Like an index at the back of


a text book, an inverted file maps words to the instances where they appear. The inverted file maps each term in the collection vocabulary to a postings list that identifies documents containing the term and the frequency with which it occurs. Various ranking functions re-weight query and document scores using the inverted file to improve ad hoc retrieval effectiveness. Popular approaches include TF-IDF [Salton and Buckley 1988], Language Models [Zhai and Lafferty 2004; Macdonald and Ounis 2006], Divergence from Randomness [Amati and van Rijsbergen 2002] and Okapi BM25 [Robertson and Jones 1997]. However, signature file based approaches have been popular in the past [Faloutsos and Christodoulakis 1984] and have recently reappeared in the mainstream IR literature using random projections or random indexing [Geva and De Vries 2011], which is a contribution of this paper. Signature files use an n bit binary string to represent documents and queries which are compared using the Hamming distance. The motivation of this approach is to exploit efficient bit-wise instructions on modern CPUs.

The study of retrieval models, most commonly as different similarity functions, focusses on the quality aspects of information retrieval systems: how the quality of results, with respect to human judgement of a search engine, can be improved. In contrast, indexing implementations focus on efficiency, and approximate indexing makes a trade-off between quality of search results and efficiency.

Document signatures have been missing from mainstream publications about search engines and the Information Retrieval field in general for many years. Zobel et al. clearly demonstrate the inferior performance of traditional document signatures in their paper, "Inverted Files Versus Signature Files for Text Indexing" [Zobel 1998]. They conclude that signature files or document signatures are slower, offer less functionality, and require more storage. Witten et al. [1999] come to the same conclusion. Traditional signature based indexes are not competitive with inverted files. They offer no distinct advantages over inverted indexes.

Traditional bit slice signature files [Faloutsos and Christodoulakis 1984] use efficient bit-wise operators. This approach is presented in an ad hoc manner and is motivated by efficient bit-wise processing without respect to Information Retrieval theory. Each document is allocated a signature of N bits. Each term is assigned a random signature where only n << N bits are set. These are combined using the bit-wise OR operation. This is a Bloom filter [Bloom 1970] for the set of terms in a document. To avoid collisions resulting in errors, the difference between n and N must be very large. In contrast, the TopSig signatures [Geva and De Vries 2011] presented in this paper compress the same representation that underlies the inverted index, the document-by-term matrix. It first begins with full precision vectors and compresses them via random projections and numeric quantization. This is founded in the theory of preservation of the topological relationships of low dimensional embeddings of vectors presented in the Johnson-Lindenstrauss lemma [Johnson and Lindenstrauss 1984].
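
An illustrative reduction in the spirit of the approach described above (a random projection followed by sign quantisation), not the exact TopSig indexing algorithm: each sparse term vector is projected onto d random ±1 axes and only the sign of each coordinate is kept.

    import random
    import zlib

    def signature(term_weights: dict[str, float], d: int = 64) -> int:
        """Quantise a random projection of a sparse term vector to d bits."""
        sig = 0
        for i in range(d):
            coord = 0.0
            for term, weight in term_weights.items():
                # Stable pseudo-random +/-1 entry per (term, axis) pair.
                seed = zlib.crc32(f"{term}:{i}".encode())
                coord += weight * random.Random(seed).choice((-1.0, 1.0))
            if coord >= 0:
                sig |= 1 << i
        return sig

    doc = {"signature": 2.0, "file": 1.0, "retrieval": 1.5}
    print(f"{signature(doc):064b}")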

Other recent approaches to similarity search [Zhang et al. 2010] have focussed on mapping documents to N bit strings for comparison using the Hamming distance. However, they have only focussed on short strings for document-to-document search for tasks such as near duplicate detection or "find other documents like these" queries. They do not investigate the use of such structures for ad hoc information retrieval or clustering. Additionally, they use machine learning approaches to find a better mapping for signatures, which may be computationally prohibitive for web scale document collections containing billions of documents.


6.1. Collection Distribution and Selection

A distributed information retrieval system consists of many different machines connected via a data communications network working together as one to provide search services for documents stored on each of the machines.

In a distributed retrieval setting, several different information resources can be stored in different retrieval systems and indexes. These resources are combined to provide a single query interface. This often results in an uncooperative distributed system. Uncooperative systems only allow use of the standard query interface provided by each system. All necessary information about a system that is acting as part of the distributed system has to be gained via these interfaces. The collection selection approach has to build an appropriate model by querying the systems individually.

Alternatively, a large collection such as the World Wide Web must be spread over more than one machine so that it can be stored and processed. A large collection such as this is often processed in the setting of a cooperative distributed information retrieval system. Full access to the indexes of each system is possible. This enables use of global statistics and any other approaches that require more than the use of the standard query interface.

Collection distribution is the strategy that determines which documents are assigned to which machine. Common approaches include random [Kulkarni and Callan 2010; Ding et al. 2011], source [Xu and Croft 1999] and topic based allocation [Kulkarni and Callan 2010; Xu and Croft 1999]. A common criticism of topic based allocation is that automated methods for document clustering are not scalable for large collections.

Collection selection is the process that determines which machines to search in a distributed system given a query. There are many approaches to collection selection, such as SHiRE [Kulkarni et al. 2012], gGlOSS [Gravano and Garcia-Molina 1999] and Cue-Validity Variance [Yuwono and Lee 1997], that use collection statistics to summarize the collection stored on a particular machine.

6.2. Evaluation

Information retrieval evaluation allows comparative evaluation of retrieval systems. System effectiveness is quantified by one of two approaches: user based evaluations measure user satisfaction with the system, whereas system based evaluations determine how well a system can rank documents with respect to the relevance judgments of users. The system based approach has become the most popular due to its repeatability in a laboratory setting and the difficulty and cost of performing user based evaluations [Voorhees 2002].

The Cranfield paradigm [Cleverdon 1967] is an experimental approach to the evaluation of retrieval systems. It has been adapted over the years and provides the theory that supports modern information retrieval evaluation efforts such as CLEF [Forner et al. 2011], NTCIR [Kando et al. 1999], INEX [Arvola et al. 2011] and TREC [Clarke et al. 2010]. It is a system based approach that evaluates how well different systems rank relevant documents. For systems to be compared, the same set of information needs and documents has to be used. A test collection consists of documents, statements of information need, and relevance judgments [Voorhees 2002]. Relevance judgments are often binary. A document is considered relevant if any of its contents can contribute to the satisfaction of the specified information need. Information needs are also referred to as topics and contain a textual description of the information need, including guidelines as to what may or may not be considered relevant. Typically, only the keyword based query of a topic is given to a retrieval system.

Many different measures exist for evaluating the distribution of relevant documents in ranked lists. The simplest measure is precision at n (P@n), defined as the fraction of documents that are relevant among the first n documents in the ranked list. A strong justification for the use of P@n is that it reflects a real use case [Zobel et al. 2009]. Most users only view the first few pages of results presented by a retrieval system. On the web, users may not even view the entire first page of results. Users reformulate their queries rather than continue searching the result list. Zobel et al. [2009] argue that recall does not reflect a user's need except in specific circumstances, such as medical and legal retrieval, that require exhaustive searching. Furthermore, achieving total recall requires complete knowledge of the collection. The mean average precision (MAP) measure takes into account both precision and recall. Average precision is the area under the precision versus recall curve for a single query, and MAP is its mean over all queries. Precision is the fraction of relevant documents at a given recall level, and a recall level is the fraction of returned relevant documents in the set of all relevant documents. As recall increases, precision tends to drop, since systems place a higher density of relevant results early in the ranked list; this is precisely the goal of a search engine.
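For concreteness, both measures can be computed from a list of binary relevance flags in rank order. The following Python sketch is illustrative only; the function names and input format are not taken from any evaluation toolkit discussed here. MAP is then simply the mean of average precision over all queries.

```python
def precision_at_n(ranked_rel, n):
    """P@n: fraction of the first n ranked documents that are relevant.

    ranked_rel is a list of 0/1 relevance flags in rank order."""
    return sum(ranked_rel[:n]) / n

def average_precision(ranked_rel, total_relevant):
    """Mean of the precision values at each rank holding a relevant document."""
    hits, precisions = 0, []
    for rank, rel in enumerate(ranked_rel, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / total_relevant if total_relevant else 0.0

# Relevant documents at ranks 1, 3 and 6; three relevant documents in total.
run = [1, 0, 1, 0, 0, 1]
print(precision_at_n(run, 5))      # 0.4
print(average_precision(run, 3))   # (1/1 + 2/3 + 3/6) / 3 ~= 0.72
```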

The original Cranfield experiments required complete relevance judgments for the entire collection. This approach rules out large collections due to the necessity of exhaustive assessment. On large scale collections with millions to billions of documents it is not possible to evaluate the relevance of every single document to every query. Therefore, an approach called pooling was developed [Sparck Jones and van Rijsbergen 1975]. Each system in the experiment returns a ranked list of the first 1,000 to 2,000 documents in decreasing order of estimated relevance. A proportion of the most relevant documents, often around 100, is taken from each system and pooled together, and duplicates are removed, since there is often a large overlap between the result lists of different systems. These pooled results are presented to the assessors in an unranked order. The assessors then determine relevance on this relatively small pool.

There has been much debate and experimentation to determine the effect of pooling. As test collections are designed to be reused, new approaches may return relevant documents that were not represented in the pool that was assessed. These are un-judged relevant documents. Zobel [1998] found that un-judged runs do retrieve such documents, but that this does not unfairly affect a system's performance on TREC collections. However, thought is required when designing an information retrieval evaluation experiment. The pool depth and diversity may need to change based on the type and size of the collection. Furthermore, tests such as those suggested by Zobel [1998] can be carried out to determine the reliability and reusability of a test collection.

Saracevic [2008] performed a meta-analysis of judge consistency and concluded that consistency changes dramatically depending on the expertise of the assessors. Assessors with higher expertise agreed more often on the relevance of a document. Consistency also changed from experiment to experiment using the same assessors. Cranfield based evaluation of information retrieval has been scrutinized heavily over the years. However, assessor disagreement does not undermine this type of evaluation, because the rank order of retrieval systems rarely changes when using assessments from different judges. This stability only applies when averaging retrieval system performance over a set of queries; retrieval quality for a single query can change drastically given a different assessor.

7. CLUSTER HYPOTHESIS

The cluster hypothesis connects ad hoc information retrieval and document clustering. Manning et al. [2008] state that “documents in the same cluster behave similarly with respect to relevance to information needs”. If a document from a cluster is relevant to a query, then it is likely that other documents in the same cluster are also relevant to the same query. This is due to the clustering algorithm grouping documents with similar terms together. There may also be some higher order correlation due to effects caused by the distributional hypothesis [Harris 1954] and the limitation of analysis to the size of a document; i.e., words are defined by the company they keep. It should be noted that the relevance of a document is estimated in an unsupervised manner in an operational retrieval system. While our evaluation of document clustering uses relevance judgments from humans, these are not always available, or are noisy when derived from click-through data.

The cluster hypothesis has been analyzed in three different forms. Originally the cluster hypothesis was tested at the collection level by clustering all documents in the collection. It was then extended to cluster only search results, to improve the display and ranking of documents. Finally, it has been extended to the sub-document level to deal with the multi-topic nature of longer documents.

7.1. Measuring the Cluster Hypothesis

There have been direct and indirect approaches to measuring the cluster hypothesis. Direct approaches measure similarity between documents to determine if relevant documents are more similar to each other than to non-relevant documents. Indirect approaches use other IR tasks, such as ad hoc retrieval or document clustering, to evaluate the hypothesis. If the cluster hypothesis is true for ad hoc search, then only a small fraction of the collection needs to be searched while the quality of search results is not degraded, and is potentially improved. The same idea applies to the spread of relevant documents over a clustering of a collection: if the hypothesis is true, then the relevant documents will only appear in a few clusters.

Voorhees [1985] proposed the nearest neighbor (NN) test. It analyzes the five nearest neighbors of each relevant document according to a similarity measure and determines the fraction of relevant documents located nearby other relevant documents. The NN test is comparable across different test collections and similarity measures, making it useful for comparative evaluations. Voorhees [1985] states that the choice of five nearest neighbors was arbitrary; if the number of neighbors is increased, the test becomes less local in nature. The NN test overcame problems with the original cluster hypothesis test suggested by Jardine and van Rijsbergen [1971]. The original test plots the similarity of all relevant–relevant and relevant–non-relevant document pairs separately using a histogram approach. It does so for each query, and the results are then averaged over all queries. A separation between the distributions indicates that the cluster hypothesis holds. The criticism of this approach is that there will always be many more relevant–non-relevant pairs than relevant–relevant pairs. Voorhees [1985] found that these two measures give very different pictures of the cluster hypothesis. However, van Rijsbergen and Jones [1973] found that the original measure was useful for explaining the different performance of retrieval systems on different collections.
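One plausible reading of the NN test is sketched below, under the assumption that each relevant document is scored by the fraction of its five nearest neighbours that are also relevant, and that the scores are averaged over all relevant documents; the exact aggregation in Voorhees [1985] may differ in detail, and all names here are hypothetical.

```python
def nn_test(relevant_ids, neighbours, k=5):
    """Average, over relevant documents, of the fraction of each document's
    k nearest neighbours that are also relevant.

    neighbours maps a document id to its neighbour ids, closest first."""
    relevant = set(relevant_ids)
    scores = [
        sum(1 for n in neighbours[doc][:k] if n in relevant) / k
        for doc in relevant
    ]
    return sum(scores) / len(scores)
```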

Smucker and Allan [2009] introduce a global measure of the cluster hypothesis based on relevant document networks and measures from network analysis. The NN test is a local measure of clustering, as it only inspects the five nearest neighbors of each relevant document. The proposed measure builds a directed, weighted, fully connected graph between all relevant documents, where vertices are documents and edges represent the rank of the destination document when the source document is used as a query. This represents all relationships between relevant documents, not just those that fall within the five nearest neighbors as in the NN test. The normalized mean reciprocal distance (nMRD) from small world network analysis [Latora and Marchiori 2001] is used to measure global efficiency based upon all shortest path distances between pairs of documents in the network. Each measure of the cluster hypothesis allows for a different view, allowing for multiple comparisons across multiple approaches to ranking.

In the following sections, we review the measurement of the cluster hypothesis via indirect measures.

7.2. Cluster Hypothesis in Ranked Retrieval

Many retrieval systems return a ranked list of documents in descending order of estimated relevance. Clustering provides an alternative organization of retrieved results that aims to reduce the cognitive load of analyzing them, by grouping the documents into topical clusters. The cluster hypothesis implies that most relevant documents will appear in a small number of clusters, so a user can easily identify these clusters and find most of the relevant results. Alternatively, clusters are used to improve ranked results, through techniques such as cluster-based pseudo-relevance feedback and re-ranking.

Early studies in clustering ad hoc retrieval results by Croft [1980] and Voorhees [1986] performed static clustering of the entire collection. More recent studies found that clustering the results of each query is more effective for improving the quality of results [Hearst and Pedersen 1996; Leuski 2001; Tombros et al. 2002; Lee et al. 2008]. However, clustering an entire document collection allows for increased efficiency, and its discussion has reappeared recently in the literature [Nayak et al. 2010; De Vries et al. 2011; Kulkarni and Callan 2010].

Evaluation of the cluster hypothesis changed significantly after Hearst and Pedersen [1996] confirmed that the cluster hypothesis can improve result ranking significantly, by clustering the results of each query rather than entire collections. Hearst and Pedersen [1996] revise the cluster hypothesis: if two documents are similar with respect to one query, they are not necessarily similar with respect to another. As documents have very high dimensionality, the definition of nearest neighbors changes depending on the query. Their system dynamically clusters retrieval results for each query using the interactive Scatter/Gather approach, which displays five clusters of the top n documents retrieved by a system. A user selects one or more clusters, indicating that only those documents are to be viewed. This process can be applied recursively, performing the Scatter/Gather process again on the selected subset. The experiments using this system were performed on the TREC/Tipster collection using the TREC-4 queries. The top ranked clusters always contained at least 50 percent of the relevant documents retrieved. The results retrieved by the Scatter/Gather system were shown to be statistically significantly better than not using the system according to the t-test.

Crestani and Wu [2006] investigate the cluster hypothesis in a distributed information retrieval environment. Even though many issues introduce noise when presenting results from heterogeneous systems in a distributed environment, they demonstrate that clustering is still an effective approach for displaying search results to users.

Lee et al. [2008] present a cluster-based resampling method for pseudo-relevance feedback. Previous attempts to use clusters for pseudo-relevance feedback had not resulted in statistically significant increases in retrieval effectiveness. The improvement is achieved by repeatedly selecting dominant documents, motivated by boosting from machine learning and statistics, which repeatedly selects hard examples to move the decision boundary toward them. The method uses overlapping or soft clusters, where documents can exist in more than one cluster; a dominant document is one that participates in many clusters. Repeatedly sampling dominant documents can emphasize the topics in a query, and thus this approach clusters the top 100 documents in the ranked list. The most relevant terms are selected from the most relevant clusters and are used to expand the original query. It was most effective for large document collections where there are likely to be many topics present in the results for a query.

Re-ranking of the result list using clusters has been found to be effective using vector space approaches [Lee et al. 2001; Lee et al. 2004], language models [Liu and Croft 2004] and local score regularization [Diaz 2005]. In these approaches, the original ranked list is combined with the ranking of clusters, rather than using just the original query-to-document score.

Kulkarni and Callan [2010] investigate the use of topical clusters for efficient and effective retrieval of documents. This work finds similar results to those discovered at the INEX XML Mining track [Nayak et al. 2010; De Vries et al. 2011] when clustering large document collections: relevant results for a given query cluster tightly and fall within only a few topical document clusters. Kulkarni and Callan solve the scalability problem of clustering through sampling, using k-means with the Kullback-Leibler divergence similarity measure. This work also implements and evaluates a collection selection approach called ReDDE and finds that topical clusters outperform random partitioning and that as little as one percent of a collection needs to be searched to maintain early precision.

7.3. Sub-document Cluster Hypothesis in Ranked Retrieval

Lamprier et al. [2008] investigate the use of sub-document segments or passages to cluster documents in information retrieval systems. They extend the notion of the cluster hypothesis to the sub-document level, where different segments of a document can have different topics that are relevant to different queries. If the cluster hypothesis holds at the sub-document level, then relevant passages will tend to cluster together. The approach proposed by Lamprier et al. [2008] returns whole documents but uses passages to cluster multi-topic documents more effectively.

Lamprier et al. [2008] suggest several different approaches to the segmentation of documents into passages. The authors chose an approach called SegGen [Lamprier et al. 2007], an evolutionary search algorithm that minimizes similarity between segments. They dismissed using arbitrary regions of overlapping text, even though this was shown to be effective for passage retrieval by Kaszkiel and Zobel [2001]. The reasons given are the added complexity and that only disjoint subsets of the documents are appropriate in this case.

Lamprier et al. [2008] performed experiments using sub-document clustering on ad hoc retrieval results with the aim of improving their quality. The authors did not consider sub-document clustering of the entire collection, because earlier studies suggested this was not effective for whole document clustering. They found that sub-document clustering allowed better thematic clustering that placed more relevant results together than whole document clustering.

8. USING AD HOC RELEVANCE JUDGMENTS TO EVALUATE DOCUMENT CLUSTERING

In Section 4.1 we introduced some of the issues with determining a ground truth set of category labels for large document collections. As described in Section 6.2, the information retrieval evaluation community has already dealt with the problem of assessor load in ad hoc relevance evaluation via the use of pooling. Pooling reduces assessor load, and the topics evaluated are specific and well defined. By using ad hoc relevance judgments to assess document clustering, instead of category labels for every document in a collection, the issue of assessor load is alleviated. Many categories that exist in ground truths for document collections are lofty and ill-defined ideas such as “Arts”. What Arts is and is not could be debated ad infinitum.

The cluster hypothesis suggests that documents that cluster together tend to have relevance to similar queries. The evaluation strategy we are proposing aims to evaluate the utility of clustering in collection selection. The goal of clustering is to minimise the spread of relevant results of ad hoc queries over a clustering solution. The purpose of clustering in this context is to determine the distribution of a collection over multiple machines. We have a dual optimisation problem: it is desirable to maximise the number of clusters while minimising the spread of relevant results of ad hoc queries over the clusters. Search efficiency can be increased by distributing clusters over more machines. However, it is not possible to produce clusters that split the collection so as to satisfy all conceivable ad hoc queries. A good clustering solution is therefore expected to optimise the distribution such that, for most ad hoc queries, most of the results can be found in a small set of clusters. The goal of collection selection is then to rank the clusters to identify the order in which they should be searched to satisfy any given query.

Guyon et al. [2009] argue that the context of clustering needs to be taken into account during evaluation. The evaluation we are proposing tests the cluster hypothesis in an information retrieval specific setting: clustering is intended to facilitate document distribution and collection selection for ad hoc retrieval, and it is tested in that setting. This differs greatly from evaluations where authors assign categories to documents and the categories are then used as the ground truth for the evaluation of clustering. Guyon et al. [2009] argue, and we agree, that such ground truth based evaluations are unsound. This is particularly true in an information retrieval setting, where the number of potential topics is virtually unconstrained. It is a near impossible task to compare alternative clustering possibilities by inspecting large numbers of documents in clusters. In contrast, the evaluation of topics represented as queries in an ad hoc retrieval system has been shown to be stable across different assessors [Saracevic 2008]. These relevance judgments have been the backbone of ad hoc information retrieval system evaluations for many years. They have also been exposed to criticism and review by many of the top researchers in the field. By exploiting this high quality, human generated information, we can have great confidence that we are testing clustering in the context of its use: the clustering of documents in an information retrieval setting.

A ground truth based evaluation treats clustering as if it were a classification problem. If you want to predict known categories for a document, then it is highly likely that supervised learning approaches will give higher classification accuracy. In contrast, the ad hoc retrieval evaluation we are proposing treats clustering as a clustering problem. It is directly motivated by the cluster hypothesis from information retrieval.

As each query represents a topic, albeit a very focussed one, we argue that this evaluation is also sound as a general evaluation of document clustering. If relevant documents are not coherent with respect to the cluster hypothesis, then they are probably not of use for any situation where document clustering is required. The cluster hypothesis refers to the $\sum_{n=1}^{v} \binom{v}{n}$ possible queries whose ranked results are clusters themselves, where $v$ is the size of the vocabulary. The cluster ends where the relevant results end in the ranked list. It supports the notion that ad hoc retrieval is ad hoc clustering of results given a query.

The evaluation of document clustering using ad hoc information retrieval can be viewed as similar to an evaluation using a multi-label category based ground truth: a document can be relevant to more than one query. However, the class imbalance is much greater when using relevance judgments; there are many more non-relevant than relevant documents for any given query. Furthermore, if each query were treated as a topic and relevant documents were assigned a class label if and only if they are relevant to a query, the entire collection would not be covered. For example, the 52 queries from the INEX 2010 ad hoc track only contain 5,471 relevant documents for the entire 2,660,190 document INEX XML Wikipedia collection.

9. NORMALISED CUMULATIVE CLUSTER GAIN

The NCCG evaluation measure has been used for the evaluation of document clustering at INEX [Nayak et al. 2010; De Vries et al. 2011]. It is motivated by van Rijsbergen's cluster hypothesis [Jardine and van Rijsbergen 1971]. If the hypothesis holds true, then relevant documents will appear in a small number of clusters. A document clustering solution can therefore be evaluated by measuring the spread of relevant documents over clusters for a given set of queries.

NCCG is calculated using manual result assessments from ad hoc retrieval evaluation. Evaluations of ad hoc retrieval occur in forums such as INEX [Arvola et al. 2011], CLEF [Forner et al. 2011] and TREC [Clarke et al. 2010]. The manual query assessments are called relevance judgments and have been used to evaluate ad hoc retrieval of documents. The process involves defining a query based on an information need, a retrieval system returning results for the query, and humans judging whether the results returned by the system are relevant to the information need.

The NCCG measure tests a clustering solution to determine the quality of clusters relative to optimal collection selection. Collection selection involves splitting a collection into subsets and recommending which subsets need to be searched for a given query. This allows a retrieval system to search fewer documents, resulting in improved runtime performance over searching the entire collection. The NCCG measure has complete knowledge of which documents are relevant to queries and orders clusters in descending order by the number of relevant documents they contain. We call this measure an “oracle” because of this complete knowledge of relevant documents. A working retrieval system does not have this property, so the measure represents an upper bound on collection selection performance.

Better clustering solutions in this context will tend to group together relevant results for previously unseen ad hoc queries. Real ad hoc retrieval queries and their manual assessment results are utilised in this evaluation. This approach evaluates clustering solutions relative to a very specific objective: clustering a large document collection in an optimal manner in order to satisfy queries while minimising the search space. The measure used for evaluating the collection selection is called Normalised Cumulative Cluster Gain (NCCG) [Nayak et al. 2010].

The Cumulative Gain of a Cluster (CCG) is defined by the number of relevant documents in a cluster,

$$CCG(c, t) = \sum_{i=1}^{n} Rel_i.$$

A sorted vector $CG$ is created for a clustering solution, $c$, and a topic, $t$, where each element represents the CCG of a cluster. It is normalised by the ideal gain vector,

$$SplitScore(t, c) = \frac{\sum^{|CG|} \mathrm{cumsum}(CG)}{n_r^2}, \qquad (4)$$

where $n_r$ is the total number of relevant documents for the topic, $t$. The worst possible split places one relevant document in each cluster, represented by the vector $CG_1$,

$$MinSplitScore(t, c) = \frac{\sum^{|CG_1|} \mathrm{cumsum}(CG_1)}{n_r^2}. \qquad (5)$$

NCCG is calculated using the previous functions,

$$NCCG(t, c) = \frac{SplitScore(t, c) - MinSplitScore(t, c)}{1 - MinSplitScore(t, c)}. \qquad (6)$$

It is then averaged across all topics.
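As printed, Equations 4 to 6 leave the length of the gain vector implicit. One consistent reading, assumed in the Python sketch below, is that the cumulative sum is evaluated over $n_r$ positions, so that the ideal solution (all relevant documents in one cluster) scores exactly 1 and the worst split scores 0 after rescaling. This is a sketch of that reading, not the reference implementation used at INEX.

```python
def split_score(cluster_gains, n_rel):
    """Equations 4 and 5: sum of the cumulative sum of the descending
    cluster gain vector, evaluated over n_rel positions, divided by
    n_rel squared. Assumes n_rel > 1."""
    gains = sorted(cluster_gains, reverse=True)[:n_rel]
    gains += [0] * (n_rel - len(gains))       # pad to n_rel positions
    running, total = 0, 0
    for gain in gains:
        running += gain
        total += running
    return total / n_rel ** 2

def nccg(cluster_gains, n_rel):
    """Equation 6: rescale so the worst split (one relevant document per
    cluster) scores 0 and the ideal split scores 1."""
    worst = split_score([1] * n_rel, n_rel)
    return (split_score(cluster_gains, n_rel) - worst) / (1 - worst)

# Five relevant documents spread over two clusters as 3 and 2.
print(nccg([3, 2], 5))   # 0.8; nccg([5], 5) == 1.0, nccg([1] * 5, 5) == 0.0
```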

10. DIVERGENCE FROM A RANDOM BASELINE

Divergence from a random baseline is a technique for the evaluation of document clustering. It ensures that cluster quality measures are performing work, preventing ineffective clusterings that provide no useful result from receiving high scores. These concepts were defined and analyzed in previous work by De Vries et al. [2012], using intrinsic and extrinsic evaluations of clustering, including the use of document categories and ad hoc relevance judgments.

An ineffective clustering is one that achieves a high score according to a measure of document cluster quality but provides no value as a clustering solution. Divergence from a random baseline addresses ineffective clusterings in evaluation. It introduces a notion of work performed by a clustering, where ineffective cases appear to perform no useful learning.

10.1. Ineffective Clustering

An ineffective clustering produces a high score according to an evaluation measure but does not represent any inherent value as a clustering solution.

The Purity evaluation measure has an obvious ineffective case. If each cluster contains one document, then it is 100% pure with respect to the ground truth, as a single document is always the majority of its cluster. Since the goal of clustering is to produce groups of documents or to summarize the collection, this is obviously flawed, as it does neither. The same applies to the Entropy measure: the probability of a label for a singleton cluster is 100%, resulting in the best possible Entropy score of zero.

The NCCG measure is ineffective when one cluster contains almost all the documents and every other cluster contains a single document. The NCCG measure orders clusters by the number of relevant documents they contain. A large cluster containing most documents will almost always be ranked first. Therefore, almost all relevant documents will exist in one cluster, achieving almost the highest score possible.

10.2. Work Performed by a Clustering

To overcome the ineffective clusterings described in the previous section, we introduce the concept of work performed by a clustering approach. Work is defined as an increase in quality of a clustering over a simple approach that ignores the documents being clustered. A useful clustering performs work beyond an approach that is purely random and ignores document content. If a random approach that performs no useful learning performs equally to an approach that attempts to learn from the data, it would appear that nothing has been achieved by analyzing the data. We suggest that an ineffective clustering performs no useful learning.

10.3. Divergence from a Random Baseline

Many measures of cluster quality can be made to give high scores to clustering solutions that are not of high quality, simply by changing the number of clusters or the number of documents in each cluster.

Measures that can be misled by creating an ineffective clustering can be adjusted by subtracting the score of a randomly generated clustering with the same number of clusters and the same number of documents in each cluster. The random baseline distributes documents into buckets of the same size as the clusters found by the clustering algorithm. Apart from the random assignment of documents to clusters, the random baseline appears the same as the real solution. Therefore, each clustering evaluated requires a random baseline that is specific to that clustering. The baseline is created by shuffling the documents uniformly at random and splitting them into clusters of the same size as those in the clustering being measured. The score for the random baseline clustering is subtracted from the score of the matching clustering being measured.

The divergence from a random baseline approach can be applied to any measure of cluster quality, whether intrinsic or extrinsic. However, it does require an existing measure of cluster quality. It is not a measure by itself, but an approach to ensure a clustering is doing something sensible. Although we have highlighted its use for document clustering evaluation, it can be used for any clustering evaluation.

There are two issues at play here. Firstly, different distributions of cluster sizes can lead to arbitrarily high scores. Secondly, it must be determined whether the clustering algorithm is effectively learning with respect to a measure of quality. Divergence from a random baseline takes care of ineffective solutions in either case. If the internal ordering of clusters is no better than random noise, then a solution achieves a score of zero. A negative score can occur, as the random baseline scores a positive value using most measures on most data sets; it is possible for a clustering to score worse than the baseline. For example, a clustering approach could maximize the dissimilarity of documents in clusters, creating a solution where the most dissimilar documents are placed together and resulting in a worse score than random assignment. The random assignment does not bias the clustering towards or away from the measure of quality. If a clustering approach is learning something with respect to the measure of quality, then it is expected that it will be biased towards it. Alternatively, if we reverse the optimization process, it should be biased away from it.

Let $\omega = \{w_1, w_2, \ldots, w_K\}$ be the set of clusters for the document collection $D$ and $\xi = \{c_1, c_2, \ldots, c_J\}$ be the set of categories. Each cluster and each category is a subset of the document collection, $\forall c \in \xi, w \in \omega : c, w \subset D$. We define the probability of a category given a cluster in the baseline as

$$P_b(c_j \mid w_k) = \frac{|c_j|}{\sum_i |c_i|}.$$

The probability of a category given a cluster in the baseline depends only on the size of the categories. The baseline is a uniformly randomly shuffled list of documents that has been split into clusters matching the cluster size distribution of the solution being evaluated. Thus, each cluster in the baseline contains random uniform noise; it is not biased by the document representation. Categories are therefore expected to occur at a rate proportional to their size. For example, if there are three categories $A, B, C$ containing 10, 20 and 30 documents, each cluster in the baseline is expected to contain approximately $\frac{10}{60} A$, $\frac{20}{60} B$ and $\frac{30}{60} C$. This reflects only the size distribution of the categories.

We let any measure of cluster quality be interpreted as a probability. Although this is not formally the case for all measures, it serves as a reasonable explanation. We define the probability of a category in a cluster given the ground truth, $P_s(c_j \mid w_k)$, as any measure of cluster quality.

The Purity measure assigns an actual probability to each cluster when there is a single label ground truth. The probabilities within a cluster sum to one, $\sum_j \frac{|c_j \cap w_k|}{|w_k|} = 1$, and the category with the largest maximum likelihood estimate determines the score of each cluster,

$$P_{\mathrm{Purity}}(c_j \mid w_k) = \max_{c_j} \frac{|c_j \cap w_k|}{|w_k|}.$$

This is the proportion of the cluster that has the majority category label. It also represents the process of using clustering for classification with labeled data, where an unseen sample is labeled with the majority category label of the closest cluster. We define $d$ as a document in $D$. The ground truth is restricted to being single label, where a document $d$ has only one label in one category, $\forall d \in D,\ c_i \in \xi,\ c_j \in \xi : d \in c_i \land d \notin c_j \land c_i \neq c_j$.

The adjusted measure is the difference between submission and baseline. We define the adjusted probability of a category given a cluster as $P_a(c_j \mid w_k) = P_s(c_j \mid w_k) - P_b(c_j \mid w_k)$.

An alternative formal view of divergence from a random baseline can be defined by a quality function, $m : \mathcal{P}(\mathcal{P}(\mathbb{Z} \times \mathbb{Z})) \to \mathbb{R}$, that takes a clustering $s$, a set of sets of (document, category label) pairs, and returns a real number indicating the quality of the clustering. Examples of such cluster quality functions are Entropy, F1, NCCG, Negentropy, NMI and Purity. There exists a function, $r : \mathcal{P}(\mathcal{P}(\mathbb{Z} \times \mathbb{Z})) \to \mathcal{P}(\mathcal{P}(\mathbb{Z} \times \mathbb{Z}))$, that generates a random baseline, $b$, given a clustering solution, $s$. The baseline has the same number of clusters as the clustering solution, $|b| = |s|$, and for every cluster in the original clustering, $s$, and the baseline, $b$, the corresponding clusters contain the same number of documents, $\forall k : |s_k| = |b_k|$. The adjusted measure, $m_a : \mathcal{P}(\mathcal{P}(\mathbb{Z} \times \mathbb{Z})) \to \mathbb{R}$, becomes $m_a(s) = m(s) - m(r(s))$.
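This formal view translates directly into code. The sketch below, with hypothetical names, generates the size-matched baseline $r(s)$ and computes $m_a(s) = m(s) - m(r(s))$, using Purity as the example quality function $m$; each cluster is a list of (document id, category label) pairs as in the definition above.

```python
import random
from collections import Counter

def random_baseline(clusters, seed=0):
    """The function r: shuffle all documents uniformly at random, then
    split them into clusters matching the size distribution of the solution."""
    docs = [d for cluster in clusters for d in cluster]
    random.Random(seed).shuffle(docs)
    baseline, start = [], 0
    for cluster in clusters:
        baseline.append(docs[start:start + len(cluster)])
        start += len(cluster)
    return baseline

def purity(clusters):
    """Fraction of documents that carry the majority label of their cluster."""
    total = sum(len(c) for c in clusters)
    majority = sum(max(Counter(label for _, label in c).values()) for c in clusters)
    return majority / total

def adjusted(measure, clusters):
    """The adjusted measure m_a(s) = m(s) - m(r(s))."""
    return measure(clusters) - measure(random_baseline(clusters))

solution = [[(1, "A"), (2, "A")], [(3, "B"), (4, "A")]]
print(purity(solution), adjusted(purity, solution))
```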

10.4. Conclusion

In this section, we introduced problems encountered in the evaluation of document clustering: the concept of ineffective clustering and the notion of work. The divergence from a random baseline approach deals with these corner cases and increases confidence that a clustering approach is achieving meaningful learning with respect to any view of cluster quality. It is applicable to any clustering evaluation, although it was only discussed here in the context of document clustering. It allowed problematic clusterings to be differentiated from genuine solutions at the INEX 2010 XML Mining track. For a more detailed analysis, please consult the prior publication by De Vries et al. [2012]. Additionally, the evaluation software that implements the divergence from a random baseline approach is available on the Machine Learning Open Source Software website (http://www.mloss.org/software/view/468/).

11. K-TREE AND TOPSIG

We introduce a new method for collection distribution that uses K-tree and TopSig to perform efficient and scalable clustering of large document collections into large numbers of clusters. Modifications to the K-tree algorithm and data structure have been made to work with the binary signatures produced by TopSig.

The previous work of Kulkarni and Callan [2010] only investigates clustering for collection distribution with relatively few document clusters: 100 clusters for the 50 million document ClueWeb09 Category B collection and 1,000 clusters for the 500 million document ClueWeb09 Category A collection. We present an approach that can cluster 50 million documents into approximately 140,000 clusters in ten hours. This is completed using a single thread of Java code; there has been no parallelisation or optimisation using native code to take full advantage of modern multi-core processors and other low level features.

We first introduce TopSig and K-tree, and then explain how we combined them to produce an effective large scale clustering approach.

11.1. TopSig

TopSig (http://topsig.googlecode.com) [Geva and De Vries 2011] offers a radically different approach to the construction of file signatures when compared to traditional file signatures. Traditional file signatures [Faloutsos and Christodoulakis 1984] have been shown to be inferior to approaches using inverted indexes, both in terms of the time and the space required to process and store the index [Witten et al. 1999; Zobel 1998]. However, TopSig overcomes previous criticisms aimed at file signatures by taking a principled approach using the vector space model, dimensionality reduction and numeric quantization. Previous approaches to file signatures were constructed in an ad hoc fashion by combining random binary term signatures using a bitwise OR, forming a Bloom filter [Bloom 1970] for the terms contained in documents. In contrast, TopSig randomly indexes a weighted term-by-document matrix and then quantizes it. TopSig is competitive with state of the art probabilistic and language modelling approaches to retrieval at early precision, and with state of the art clustering approaches [Geva and De Vries 2011].

Let $D = \{d_1, d_2, \ldots, d_n\}$ be a collection of $n$ document signatures, $D \subset \{+1, -1\}^d$, $|D| = n$. Let $F = \{f_1, f_2, \ldots, f_n\}$ be the same document collection, where each document is represented by a $v$-dimensional real valued vector, $F \subset \mathbb{R}^v$, $|F| = n$, and $v$ is the size of the vocabulary of the document collection. $F$ is the term-by-document matrix in the full space of the collection vocabulary, which underlies most modern retrieval systems.

TopSig indexes documents using a mapping function, $m : \mathbb{R}^v \to \{+1, -1\}^d$, that maps a document from the original $v$-dimensional continuous real valued term space to a $d$-dimensional discrete binary valued space. The index is constructed by applying the mapping to every document, $D = \{m(f) : f \in F\}$. The mapping function creates a sparse random ternary index vector of $d$ dimensions for each term in the document, with +1 and −1 values in a few random positions and the majority of positions containing 0 values. These randomly generated codes are almost orthogonal to each other and have been shown to provide comparable quality to orthogonal approaches such as principal component analysis [Bingham and Mannila 2001]. The index vector is multiplied by the term weight and added to a $d$-dimensional real valued vector that represents the document. Once all the terms in a document have been processed, this reduced dimensionality document vector is quantized to a $d$-dimensional binary vector by thresholding each dimension to 1 if it is greater than 0, and to 0 otherwise; the 1 and 0 values in the binary vector represent +1 and −1. The mapping function can be applied to each document independently, meaning that new documents can be indexed in isolation without having to update the existing index. This is a key advantage of random indexing [Sahlgren 2005] over other dimensionality reduction techniques such as latent semantic analysis [Deerwester et al. 1990], which requires global analysis of the term-by-document matrix using the singular value decomposition [Golub and Kahan 1965].
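The indexing step can be sketched in a few lines of Python. The parameters below are toy values (real TopSig signatures are, for example, 4096 bits), the term weights stand in for the collection's term weighting (e.g. TF-IDF or BM25 weights), and seeding the generator with the term string is just one way to make the random index vectors reproducible; none of this is taken from the TopSig implementation itself.

```python
import random

def term_index_vector(term, dims=64, nnz=8):
    """Sparse ternary index vector for a term: +1 or -1 in a few random
    positions, 0 elsewhere, generated deterministically from the term."""
    rng = random.Random(term)
    positions = rng.sample(range(dims), nnz)
    return {p: rng.choice((+1, -1)) for p in positions}

def topsig_signature(term_weights, dims=64):
    """Accumulate weighted index vectors, then quantise each dimension by
    sign: bit 1 encodes +1 (sum > 0) and bit 0 encodes -1."""
    accumulator = [0.0] * dims
    for term, weight in term_weights.items():
        for position, sign in term_index_vector(term, dims).items():
            accumulator[position] += sign * weight
    signature = 0
    for i, value in enumerate(accumulator):
        if value > 0:
            signature |= 1 << i
    return signature

print(bin(topsig_signature({"cluster": 2.1, "hypothesis": 1.3, "trec": 0.7})))
```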

The indexing process of TopSig is similar to that of SimHash [Charikar 2002]. However, TopSig uses signatures an order of magnitude longer than SimHash, and it uses much sparser random codes. The search process for ad hoc retrieval also differs: TopSig searches in the subspace of the query and applies relevance feedback. In contrast, SimHash has predominantly been applied to near duplicate detection, where relatively short signatures are used to find the few nearest neighbours of a document.

The binary vectors in $D$ provide a faithful representation of the original document vectors in $F$: the topological relationships in the original space are preserved in the reduced dimensionality space. This is supported by the Johnson-Lindenstrauss lemma [Johnson and Lindenstrauss 1984], which states that if points in a high-dimensional space are projected into a randomly chosen subspace of sufficiently high dimensionality, then the distances between the points are approximately preserved. It also states that the number of dimensions required to reproduce the topology is asymptotically logarithmic in the number of points.

TopSig is advantageous for increasing the computational efficiency of document-to-document comparisons. Examples include clustering, classification, filtering, relevance feedback, near duplicate detection and explicit semantic analysis. TopSig has been shown to provide a one to two order of magnitude increase in processing speed for document clustering with the k-means algorithm over traditional sparse vector representations [Geva and De Vries 2011]. That work introduced a modified k-means algorithm for working with binary document signatures, which is used when combining K-tree with TopSig, as K-tree uses k-means to perform splits in its tree structure.

11.2. K-tree

K-tree (http://ktree.sf.net) [De Vries et al. 2009] is a height balanced cluster tree. It was first introduced in the context of signal processing by Geva [2000]. It has been applied to document clustering [De Vries and Geva 2009; De Vries et al. 2010], image processing and computer vision [Girgensohn et al. 2011; Atsumi 2010; Atsumi 2012a; Atsumi 2012b; Atsumi 2013] and movement recognition from gesture data. The algorithm is particularly suitable for clustering large collections due to its low complexity. It is a hybrid of the B+-tree and the k-means algorithm. The B+-tree algorithm is modified to work with multi-dimensional vectors, and k-means is used to perform node splits in the tree. K-tree is also related to Tree Structured Vector Quantization (TSVQ) [Gersho and Gray 1993]. TSVQ recursively splits the data set, in a top-down fashion, using k-means. TSVQ does not generally produce balanced trees.

K-tree achieves its efficiency through execution of the high cost k-means step over very small subsets of the data. The number of vectors clustered during any step in the K-tree algorithm is determined by the tree order, and it is independent of collection size. The collection can be updated efficiently while maintaining clustering properties through the use of a nearest neighbour search tree that directs new vectors to the appropriate leaf node.

The K-tree forms a hierarchy of clusters. This hierarchy supports multi-granular clustering, where generalisation or specialisation is observed as the tree is traversed from a leaf towards the root, or vice versa. The granularity of clusters can be decided at run-time by selecting clusters that meet criteria such as distortion or cluster size.

11.3. Combining K-tree and TopSig

K-tree has previously been combined only with floating point vectors, in the Random Indexing (RI) K-tree [De Vries et al. 2009]. Both RI and TopSig produce dense vectors from sparse document vectors. The advantage of TopSig is that it exploits bit level parallelism by processing 64 dimensions of the vector in each operation on a 64-bit CPU.

Both k-means and K-tree perform similar operations: assigning nearest neighbours to centroids, and updating the centroids as means of their nearest neighbours. This corresponds to the expectation and maximisation steps of the EM algorithm [Dempster et al. 1977]. When using vectors that are not packed binary signatures, the nearest neighbour search that associates vectors with centroids dominates the execution time of the algorithms; the updating of centroids as means is relatively quick. When using binary signatures this phenomenon is reversed. When calculating the mean of a set of binary signatures, each bit is unpacked and added to an integer valued vector, one element per dimension. Once all examples have been inspected, the result is packed back into a binary signature by taking the sign of each element of the integer valued vector. This transformation of a set of binary vectors into an integer valued vector is understandably much slower than comparison of vectors using the Hamming distance, due to the larger intermediate representation. This poses a problem when combining K-tree and TopSig: the K-tree algorithm updates all centroids along the insertion path for every vector inserted into the tree, so updates of means occur approximately as frequently as nearest neighbour comparisons. The problem does not affect k-means with binary signatures in such a significant way, as there are relatively few updates of means compared to nearest neighbour comparisons, $k \ll n$, where $k$ is the number of clusters and $n$ is the number of vectors being clustered. Therefore, to ensure the scalability of K-tree when combined with TopSig, centroids in the tree have a delayed update mechanism. Upon insertion, centroids are only updated when a split happens with k-means, or when 1,000 vectors have been inserted below the centroid. The choice of 1,000 is arbitrary but has worked effectively for document clustering.
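To illustrate why centroid updates are the expensive operation, the sketch below (hypothetical names; signatures packed into Python integers) contrasts the two steps: Hamming distance is a single XOR followed by a population count, while the mean of a set of signatures must unpack every bit into an integer accumulator before quantising back to a packed signature.

```python
def hamming(a, b):
    """Hamming distance between two packed binary signatures."""
    return bin(a ^ b).count("1")

def signature_mean(signatures, dims):
    """Mean of binary signatures: unpack each bit as +1/-1, sum per
    dimension, then quantise the sums back into a packed signature."""
    sums = [0] * dims
    for signature in signatures:
        for i in range(dims):
            sums[i] += 1 if (signature >> i) & 1 else -1
    mean = 0
    for i, value in enumerate(sums):
        if value > 0:
            mean |= 1 << i
    return mean

print(hamming(0b1010, 0b0110))                                # 2
print(bin(signature_mean([0b1010, 0b1011, 0b0010], dims=4)))  # 0b1010
```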

12. CBM625: CLUSTER BEST MATCH 625

In this section, we introduce a new collection selection approach called Cluster Best Match 625 (CBM625). It is based upon Okapi BM25 [Robertson et al. 1995] but combines document weights to rank clusters in the collection selection process. The name is derived directly from Best Match 25: the document weights are squared, and $25^2 = 625$, so BM25 squared becomes BM625; the prefix C indicates a cluster ranking process.

Okapi BM25 has consistently performed well in the evaluation of ad hoc information retrieval at TREC, INEX, NTCIR and CLEF. It is considered to be state of the art among document ranking functions. Therefore, we have built upon BM25 to create the collection selection approach CBM625.

BM25 is a probabilistic model where each term in each document is assigned an estimate of relevance to a query. A matrix, $D_{t \times n}$, represents each document in a collection, where $t$ is the number of terms and $n$ is the number of documents. This is the inverted index that underlies most modern retrieval systems. We have used a modified implementation of Okapi BM25 from the ATIRE search engine (http://www.atire.org) [Trotman et al. 2010], which is defined in Equation 7.

$$RSV_d = \sum_{t \in q} \log\left(\frac{n}{df_t}\right) \frac{(k_1 + 1)\, tf_{td}}{k_1\left((1 - b) + b\,\frac{L_d}{L_{avg}}\right) + tf_{td}} \qquad (7)$$
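A sketch of Equation 7 in Python follows, with hypothetical data structures: doc_tf maps a term to its frequency within the document, and df maps a term to its document frequency in the collection. The defaults for k1 and b are the values trained below; this mirrors the ATIRE variant of BM25 only at the level of the formula, not its implementation.

```python
import math

def bm25_weight(tf_td, df_t, n_docs, L_d, L_avg, k1=0.9, b=0.4):
    """Per-term BM25 weight from Equation 7."""
    idf = math.log(n_docs / df_t)
    return idf * ((k1 + 1) * tf_td) / (k1 * ((1 - b) + b * (L_d / L_avg)) + tf_td)

def rsv_document(query_terms, doc_tf, df, n_docs, L_d, L_avg):
    """RSV_d: BM25 weights summed over query terms present in the document."""
    return sum(
        bm25_weight(doc_tf[t], df[t], n_docs, L_d, L_avg)
        for t in query_terms if t in doc_tf
    )
```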

In Equation 7, $n$ is the total number of documents, $df_t$ and $tf_{td}$ are the number of documents containing the term $t$ and the frequency of the term in document $d$, and $L_d$ and $L_{avg}$ are the length of document $d$ and the average length of all documents. The empirical parameters $k_1$ and $b$ have been set to 0.9 and 0.4 respectively, by training on the INEX 2008 ad hoc queries from the previous year, which use a smaller 659,338 document Wikipedia collection from 2006. These parameter settings have been used for all test collections.

When ranking clusters, these estimates of relevance from BM25 need to be combined in some manner; i.e., a cluster representative must be built. Therefore, another, clustered, inverted index, $C_{t \times k}$, is built, where $t$ is the number of terms and $k$ is the number of clusters in the collection. We have found that using squared document weights is the most effective approach for combining BM25 document weights to rank clusters. Squaring document weights has the effect that heavily weighted documents have a higher impact on the ranking of a cluster; i.e., $2^2 \ll 10^2$. We initially experimented with a summation of all weights in the cluster, as each weight for each term in $D$ is an estimate of relevance of that term to a query containing the term. However, using squared weights proved more effective on the Wikipedia collection, where more relevant documents were then contained in highly ranked clusters. A justification for why this increases retrieval quality is that heavily weighted documents have a larger impact on the top of the ranked list. A document with a heavy weight is more important to that particular query, and therefore should be given higher importance. We also experimented with representing each cluster as the mean of all documents it contains. This reduced performance, requiring more clusters to be searched to find relevant documents. An explanation for this phenomenon is that taking the mean normalises the BM25 weights by the number of documents in a cluster. As BM25 already has length normalisation, we propose that normalising by cluster size destroys the initial document length normalisation of BM25. CBM625 is defined in Equation 8, where $RSV_c$ is the Retrieval Status Value assigned to clusters, used to rank clusters in descending order of relevance. The terms in Equation 8 are the same as those used in Equation 7.

$$RSV_c = \sum_{d \in c} \sum_{t \in q} \left( \log\left(\frac{n}{df_t}\right) \frac{(k_1 + 1)\, tf_{td}}{k_1\left((1 - b) + b\,\frac{L_d}{L_{avg}}\right) + tf_{td}} \right)^{2} \qquad (8)$$
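Equation 8 reuses the per-term weight from the BM25 sketch above, squaring it before accumulation. In practice the weights would be read from the clustered inverted index $C$ rather than recomputed per document; the structure below is purely illustrative.

```python
def rsv_cluster(query_terms, cluster_docs, df, n_docs, L_avg):
    """RSV_c: squared per-term BM25 weights summed over every document in
    the cluster. cluster_docs holds (term frequency map, document length)
    pairs; bm25_weight is the function from the Equation 7 sketch."""
    score = 0.0
    for doc_tf, L_d in cluster_docs:
        for t in query_terms:
            if t in doc_tf:
                score += bm25_weight(doc_tf[t], df[t], n_docs, L_d, L_avg) ** 2
    return score
```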

12.1. The Ranking Process

Ranking documents using collection selection becomes a two step process. First, the clustered index, $C$, is consulted to rank the clusters of the collection. These can be any clusters or any grouping of documents; for example, random assignments of documents to clusters or source based groupings can be used. Once the clusters have been ranked, a cutoff for when to stop searching clusters must be chosen: we simply search the $f$ most highly ranked clusters. The first step in the ranking process thus generates a list of clusters to be searched. In a distributed system, the separate indexes for each cluster are searched using standard BM25, and the results are combined to produce a single ranking. As we are able to index all of the test collections on a single machine, we instead filter the BM25 index to the set of documents contained in the first $f$ clusters for each query. This simulates searching each of the $f$ most highly ranked clusters per query.
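The two step process can be summarised as follows; score_cluster stands in for Equation 8 and rank_bm25 for an ordinary BM25 search restricted to a candidate set, both hypothetical callables rather than the interfaces of ATIRE.

```python
def search_with_selection(query_terms, clusters, f, score_cluster, rank_bm25):
    """Rank clusters by their CBM625 score, then run standard BM25 over
    the documents contained in the f highest ranked clusters only.

    clusters is a list of document id sets."""
    ranked = sorted(clusters, key=lambda c: score_cluster(query_terms, c),
                    reverse=True)
    candidates = set().union(*ranked[:f])
    return rank_bm25(query_terms, candidates)
```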

13. EVALUATION AND RESULTS

We have evaluated the collection distribution, selection and search approach based upon TopSig, K-tree and CBM625 using the INEX 2009 Wikipedia collection and the ClueWeb09 Category B collection. The INEX Wikipedia collection contains 2.67 million documents with a vocabulary of 2.1 million terms. The mean document length in the Wikipedia is 360 terms; the shortest document has 1 term and the longest has 38,740 terms. We used all 52 queries from INEX 2010 [Arvola et al. 2011] for which there are relevance judgments. The ClueWeb09 Category B collection consists of 50 million documents and a vocabulary of 96.1 million terms; the mean document length is 918 terms. We used queries 1-50 from the TREC 2009 Web Track [Clarke et al. 2009].

All experiments were conducted in the Queensland University of Technology Big Data Lab on a dual socket Xeon E5-2665 system. It has 16 CPU cores, 32 hardware threads and 256GB of main memory. Each CPU package has 20MB of L3 cache and 8 cores connected in a ring architecture, with a base frequency of 2.4GHz that rises to 3.1GHz when a single thread is running. It has become clear in this study that the root node of a search tree should be sized to fit the cache of the processors. This means that the search tree can have a root node containing many more centroids than the rest of the tree, which improves cache hits and also has the advantage of dealing with the curse of dimensionality investigated by De Vries and Geva [2012]. However, we have left detailed investigation of the implications of these phenomena to later research. Note that at least 128GB of main memory is required to complete these experiments.

We have released a package containing all the software from the experiments, including ATIRE, TopSig, K-tree, CBM625 and all the scripts used to conduct the experiments. It is available for download from the K-tree project on SourceForge (http://sourceforge.net/projects/ktree/files/docclust_ir/docclust_ir.tar.gz).

13.1. Experiments using the INEX 2009 Wikipedia

In this section, we conduct several experiments to determine the effect of different parameters and clustering approaches on the INEX Wikipedia collection. These results are used to justify our choice of approach on the ClueWeb document collection, where experimentation is not particularly practical due to much longer processing times.

We have conducted an experiment to determine if there is any difference between using k-means and K-tree clusters for ranking with CBM625. K-tree has been shown to produce lower quality clusters, both in terms of internal quality, via distortion as the root mean squared error of the clustering, and in terms of external quality, via the traditional classes-to-clusters approach. However, when producing a large number of clusters on a collection containing millions of documents, there is a huge difference in execution time. Interestingly, as the K-tree order becomes smaller, producing more clusters, the execution time is reduced, because the clustering process that splits nodes in the tree converges more quickly. Additionally, we present TopSig K-tree for the first time in this paper, which further increases the computational efficiency of document clustering. 1559 clusters of the Wikipedia collection were produced using k-means and K-tree with 4096 bit TopSig signatures. The k-means implementation works directly with the bit signature representations of TopSig, where both centroids and document vectors are bit signatures compared using the Hamming distance. Clustering using this modified k-means and 4096 bit signatures has previously been shown to produce no statistically significant difference in cluster quality compared to state of the art clustering approaches using sparse real valued vectors, while providing a one to two order of magnitude decrease in execution time [Geva and De Vries 2011]. Both of these 1559 cluster solutions were ranked using the CBM625 approach, where only the first f highest ranked clusters were searched. 104 points were sampled along this curve, every 15 clusters, starting with searching only the highest ranked cluster; i.e., the first 1, 15, 30, 45, 60, ..., 1545 clusters. Exhaustive search of all clusters produces the same MAP score of 0.38 as reported by Trotman et al. [2010], which is to be expected as we are using the same BM25 index exported from ATIRE. Figure 1 demonstrates that clusters produced by k-means and K-tree result in very little difference in retrieval behaviour when using the CBM625 ranking approach. The x axis is the average fraction of the 2.67 million documents searched when searching the first f clusters, where a value of 1 indicates the whole collection was searched. The y axis is the fraction of the maximum MAP, where a value of 1 is equivalent to exhaustive search. Therefore, when ranking clusters using CBM625 there is no advantage to using k-means over K-tree, while there is a large computational advantage in clustering time to using K-tree.

We have conducted an experiment to determine the effect of the number of clusters on the distribution of relevant documents when these clusters are ranked by CBM625. Varying the number of clusters in the CBM625 cluster ranking process changes its properties. When every document is in its own cluster it becomes similar to the BM25 ranking process, except that the term-document weights are squared. In this case, every document is represented in the posting lists produced by CBM625. Note that CBM625 produces an extra inverted index that is consulted to determine which documents to search using the standard BM25 ranking process. As the number of clusters decreases, each element in the posting list represents more and more documents, with the extreme case of only having one cluster, where all documents are represented by one element in each posting list, always forcing exhaustive search. So, as the number of clusters increases, the number of documents summarised in each position in the posting lists of CBM625 decreases. We have clustered the Wikipedia into 100 and 1559 clusters using TopSig k-means as in the previous experiment.
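The following sketch illustrates this two-index arrangement. The function names are hypothetical; the squared-weight accumulation follows the CBM625 description above, and the returned cluster ranking determines which documents are then searched with the standard BM25 ranking process.

    from collections import defaultdict

    def build_cluster_index(doc_index, doc_cluster):
        # Collapse a term -> {doc: BM25 weight} inverted index into a
        # term -> {cluster: summed squared weight} inverted index.
        cluster_index = defaultdict(lambda: defaultdict(float))
        for term, postings in doc_index.items():
            for doc, w in postings.items():
                cluster_index[term][doc_cluster[doc]] += w * w
        return cluster_index

    def rank_clusters(query_terms, cluster_index):
        # Score clusters by accumulating their squared weights for the
        # query terms and return them best first; only documents in the
        # first f returned clusters are searched.
        scores = defaultdict(float)
        for term in query_terms:
            for cluster, w in cluster_index[term].items():
                scores[cluster] += w
        return sorted(scores, key=scores.get, reverse=True)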



Fig. 1. Wikipedia – k-means versus K-tree clusters. [Plot of Fraction of Maximum MAP against Average Fraction of Collection Searched; series: TopSig k-means, TopSig K-tree.]

We have also produced clusters that do not consult document content and uniformly randomly assign documents to each of the clusters. This is similar to the divergence from random baseline approach, except that the cluster size distribution is not matched. However, the TopSig k-means clustering approach was shown not to have any issues caused by cluster size distribution when evaluated using document categories or ad hoc relevance judgements at the INEX 2010 XML Mining track [De Vries et al. 2011]. For the 1559 clusters, retrieval performance was sampled every 15 clusters starting with 1 cluster, as in the previous experiment. For 100 clusters, retrieval performance was sampled after every cluster. Figure 2 demonstrates that using more clusters allows more relevant documents to be ranked in earlier clusters. By increasing the number of clusters, the CBM625 ranking process is able to find clusters containing relevant documents and rank them higher. Both of the clusterings allow relevant documents to be ranked earlier than the uniform random baseline. This indicates that useful learning is taking place by clustering documents with respect to the ranking of relevant documents. This further confirms the cluster hypothesis.
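A minimal sketch of this content-free baseline, with an illustrative function name:

    import numpy as np

    def uniform_random_clusters(n_docs, k, seed=0):
        # Each document is assigned to one of k clusters uniformly at
        # random, so any gain a real clustering shows over this curve is
        # attributable to learning from document content.
        return np.random.default_rng(seed).integers(0, k, size=n_docs)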

We have conducted an experiment to determine the effect of using the sum of squared BM25 weights to represent clusters in the ranking process. As in the previous experiments, we sampled 104 points from the 1559 clusters starting with 1 cluster. Clusters were represented as a sum of BM25 weights and as a sum of squared BM25 weights. Figure 3 highlights that quality is improved for early recall. Therefore, we conducted a further experiment where each of the first 1 through 100 clusters of the 1559 clusters was selected in the search process. We measure the difference between the sum and sum of squares approaches using P@10 in Figure 4, P@30 in Figure 5 and P@100 in Figure 6. As can be seen in all of the graphs of early precision, the sum of squares approach finds more relevant documents earlier than using the sum of BM25 weights.



Fig. 2. Wikipedia – Varying the number of clusters. [Plot of Fraction of Maximum MAP against Average Fraction of Collection Searched; series: TopSig k-means 1559 clusters, TopSig k-means 100 clusters, uniform random 1559 clusters, uniform random 100 clusters.]

Fig. 3. Wikipedia – MAP: BM25 sum versus sum of squares. [Plot of Fraction of Maximum MAP against Average Fraction of Collection Searched; series: sum squared, sum.]

13.2. Experiments using ClueWeb 2009 Category B

In this section, we conduct an experiment that has the same experimental setup as the experiment performed by Kulkarni and Callan [2010] using the ClueWeb collection.

We have indexed the 50 million document collection using both TopSig and ATIRE. We have used BM25 with ATIRE to produce a quantized index. It has been used with exactly the same settings as the previous experiments on the Wikipedia, where the tuning parameters are k1 = 0.9 and b = 0.4. TopSig was used to create 4096 bit document signatures.



Fig. 4. Wikipedia – P@10: BM25 sum versus sum of squares. [Plot of Fraction of Maximum P@10 against Average Fraction of Collection Searched; series: sum squared, sum.]

Fig. 5. Wikipedia – P@30: BM25 sum versus sum of squares. [Plot of Fraction of Maximum P@30 against Average Fraction of Collection Searched; series: sum squared, sum.]

These signatures were used to construct a TopSig K-tree with a node size m = 1000 in 10 hours. This created 139,641 document clusters of the collection by selecting the lowest level "codebook" clusters from the K-tree. The K-tree was 3 levels deep, where the first level is the root of the tree, the second level is the codebook clusters and the third level is the 50 million document signatures inserted into the tree. We evaluated P@10, P@20 and P@30 when only searching the first f clusters as determined by CBM625. We have evaluated the efficiency of only searching the first f clusters in the same manner as Kulkarni and Callan [2010].
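The following sketch illustrates extracting these codebook clusters from such a tree. The node layout, with records as (key, child) pairs and child set to None in leaves, is an assumption made for illustration and does not reproduce the actual K-tree code.

    def codebook_clusters(node):
        # Walk the tree and return its lowest-level ("codebook") clusters:
        # each leaf node holds the signatures of one cluster, so we collect
        # the leaves hanging off the lowest internal level.
        if all(child is None for _, child in node.records):
            return [node]      # a leaf is itself one cluster of signatures
        clusters = []
        for _, child in node.records:
            clusters.extend(codebook_clusters(child))
        return clusters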



Fig. 6. Wikipedia – P@100: BM25 sum versus sum of squares. [Plot of Fraction of Maximum P@100 against Average Fraction of Collection Searched; series: sum squared, sum.]

This is determined by the number of documents searched in the first f clusters when averaged across all 50 queries, giving the average number of documents searched per query. We have performed significance tests for each of the rankings produced by CBM625 by comparing it to the exhaustive search produced by ATIRE BM25. This is performed using a two-tailed paired t-test between the scores for each of the 50 queries for both CBM625 and ATIRE BM25. The results are displayed in Table I, where the bold values highlight a p-value greater than 0.05, which indicates there is no statistically significant difference between the truncated CBM625 ranking and the exhaustive ATIRE BM25 ranking. The column labelled "Clusters" indicates the first f clusters searched by CBM625. The column labelled "Documents" indicates the average number of documents searched when averaged over all 50 queries. Interestingly, ATIRE produced P@10, P@20 and P@30 scores of 0.43, 0.44 and 0.44, which were higher than those produced by Indri in the experiments by Kulkarni and Callan, where all of the scores were 0.3. However, as we have not produced rankings with Indri we are not able to include them in this evaluation. However, the same behaviour is observed, where only a fraction of the collection has to be searched to achieve the same early precision. The previous results of Kulkarni and Callan [2010] clustered the collection into 100 clusters using k-means and solved the scalability problem of clustering 50 million documents by using sampling. Their approach searched the first 1 of 100 clusters to produce a ranking with no statistically significant difference to exhaustive search while searching 1.3% of documents in the collection. Our approach searched the first 8 of 139,641 clusters to produce a ranking with no statistically significant difference to exhaustive search while searching 0.1% of documents in the collection. This is a 13-fold decrease in the average number of documents searched per query!
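A sketch of this test using SciPy; the two per-query score vectors are assumed to be aligned by query:

    from scipy.stats import ttest_rel

    def no_significant_difference(truncated_scores, exhaustive_scores, alpha=0.05):
        # Two-tailed paired t-test over per-query effectiveness scores
        # (e.g. 50 P@10 values for truncated CBM625 vs exhaustive BM25).
        # p > alpha means the truncated ranking is statistically
        # indistinguishable from exhaustive search.
        statistic, p_value = ttest_rel(truncated_scores, exhaustive_scores)
        return p_value > alpha, p_value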

As with the previous experiments on the Wikipedia, we have graphed P@10, P@20 and P@30 when varying the number of clusters. The x axis is the average fraction of the collection searched and the y axis is the fraction of the highest P@n score achieved. Figure 7 highlights this for P@10, Figure 8 for P@20 and Figure 9 for P@30. As can be seen in the graphs, early precision rises very quickly and only 0.1% of the collection has to be searched to retain early precision.



Table I. ClueWeb 2009 Category B

Approach     Clusters  Documents   P@10  P@20  P@30
CBM625       1         15073.7     0.15  0.14  0.13
CBM625       2         22959       0.28  0.25  0.24
CBM625       3         30624.5     0.35  0.32  0.31
CBM625       4         33514.7     0.36  0.34  0.33
CBM625       5         37340.8     0.38  0.35  0.34
CBM625       6         41610.4     0.40  0.39  0.37
CBM625       7         45559.7     0.41  0.40  0.39
CBM625       8         49359.7     0.45  0.42  0.41
CBM625       9         54047.4     0.44  0.42  0.41
CBM625       10        58189.7     0.45  0.43  0.42
CBM625       20        88368.9     0.43  0.43  0.42
CBM625       30        118113      0.42  0.42  0.42
CBM625       40        142540      0.44  0.43  0.42
CBM625       50        166020      0.43  0.43  0.42
CBM625       60        188790      0.43  0.44  0.43
CBM625       70        207586      0.44  0.43  0.43
CBM625       80        229300      0.44  0.44  0.43
CBM625       90        247519      0.44  0.43  0.43
CBM625       100       266513      0.42  0.44  0.44
CBM625       200       436066      0.44  0.42  0.42
CBM625       300       587127      0.43  0.42  0.43
CBM625       400       733771      0.44  0.44  0.43
CBM625       500       874275      0.45  0.44  0.43
CBM625       600       1.01E+06    0.42  0.43  0.42
CBM625       700       1.14E+06    0.41  0.42  0.41
CBM625       800       1.27E+06    0.41  0.43  0.41
CBM625       900       1.40E+06    0.42  0.42  0.40
CBM625       1000      1.52E+06    0.41  0.41  0.40
ATIRE BM25   NA        50220423    0.43  0.44  0.44

Fig. 7. ClueWeb 2009 Category B – P@10. [Plot of Fraction of Maximum P@10 against Average Fraction of Collection Searched.]



Fig. 8. ClueWeb 2009 Category B – P@20. [Plot of Fraction of Maximum P@20 against Average Fraction of Collection Searched.]

Fig. 9. ClueWeb 2009 Category B – P@30. [Plot of Fraction of Maximum P@30 against Average Fraction of Collection Searched.]

14. DISCUSSION

In this paper, we have presented theoretical evaluation strategies for clustering based on the upper bound performance of an "oracle" collection selection approach. The results from this evaluation at INEX indicated that actual retrieval systems can take advantage of a more fine grained clustering to better exploit the cluster hypothesis and reduce the number of documents that need to be searched by a retrieval system. This has been demonstrated on two collections, the 2.67 million document INEX 2009 XML Wikipedia collection and the 50 million document ClueWeb09 Category B collection.



Both were evaluated using relevance judgments from human assessors using reusable information retrieval experiments from INEX and TREC. The TopSig K-tree solved the scalability problem of clustering 50 million documents in 10 hours, which, as far as we know, has not been reported in the literature, whether in clustering a collection this large without sampling, clustering it into such a large number of clusters, or simply clustering all 50 million documents. The CBM625 approach allowed us to rank clusters effectively and efficiently.

We have also introduced using ad hoc relevance judgments as an alternative to class or category based evaluation of document clustering. The primary advantage of this approach comes from the pooling used in information retrieval system evaluation, which reduces assessor load. Only several thousand documents need to be assessed, whereas the entire collection must be labelled for category based evaluations. Furthermore, this ad hoc retrieval based evaluation is motivated by a real use case: it minimises the resources used in a retrieval system. We have confirmed the theoretical "oracle" results of the evaluation at INEX by implementing a cluster based search engine.

We have been able to replicate the results of cluster based search on large scale document collections presented by Kulkarni and Callan [2010]. We confirmed that cluster based search indeed works using different indexing, clustering and ranking approaches. Furthermore, we were able to decrease the number of documents searched by the retrieval system by 13 times.

Standard compression [Trotman 2003] and impact ordering [Anh et al. 2001] of posting lists for both the cluster, C_{t×k}, and document, D_{t×n}, inverted indexes can be implemented to further improve the efficiency of retrieval. As the cluster scores in the ranked list of RSV_c values are much more heavily skewed towards the head of the list than when using RSV_d values, we suggest that stopping early via impact ordering will be highly effective for CBM625. Impact ordering with CBM625 may be able to terminate earlier than impact ordering with BM25, relative to the length of the posting list. Most weight in the cluster inverted index for a given query is contained in a few clusters. This is also confirmed by earlier experiments, where only the 8 most highly ranked clusters of the approximately 140,000 clusters for ClueWeb09 Category B need to be searched to achieve early precision with no statistically significant difference from exhaustive ranking using BM25. Additionally, the shard cutoff estimation presented by Kulkarni et al. [2012], as an extension to the original approaches in their cluster based search paper [Kulkarni and Callan 2010], can be implemented to further improve efficiency. Their shard cutoff estimation improves results by another 50%, whereas our approach improves results by 1300%. However, shard cutoff estimation is likely to further improve our results. We suggest that, as the clusters are smaller and finer grained, shard cutoff estimation may work more effectively, as it is not possible to stop half way through larger clusters.
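A sketch of the kind of early termination we have in mind follows. The fixed score-mass stopping rule is our simplification chosen for illustration, not the method of Anh et al. [2001].

    def rank_clusters_impact_ordered(postings_by_term, f, mass=0.95):
        # Each term's cluster postings are pre-sorted by decreasing impact,
        # so processing can stop once a fixed fraction of each list's score
        # mass has been consumed; tail impacts cannot change the head of
        # the cluster ranking by much.
        scores = {}
        for postings in postings_by_term:
            total = sum(w for _, w in postings)
            consumed = 0.0
            for cluster, w in postings:       # descending impact order
                scores[cluster] = scores.get(cluster, 0.0) + w
                consumed += w
                if consumed >= mass * total:
                    break                     # skip the tail of this list
        return sorted(scores, key=scores.get, reverse=True)[:f]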

We suggest that most of the performance improvement we have observed is contributed by the fine grained clustering produced by TopSig K-tree. With sampling approaches such as those used by Kulkarni et al. [2012], there are simply not enough documents available to provide adequate statistics to produce a clustering with this many clusters. It is the scalability of TopSig K-tree that allows us to produce these results. The effect of the number of clusters can be seen in Figure 2, where increasing the number of clusters from 100 to 1559 on the Wikipedia collection allows clusters containing relevant documents to be ranked much earlier in the CBM625 approach.

Apart from TopSig K-tree, there are other factors different in our approach that may have contributed to the improved results. The use of the TopSig document signatures themselves may provide different clusters to other approaches by using the log likelihood weighting described by Geva and De Vries [2011]. The effectiveness of this weighting for document clustering has not been studied in detail.



The ranking process, CBM625, may have contributed to the improved results. In Figures 4 to 6 we demonstrated that it increases early precision in highly ranked clusters on the Wikipedia collection. Additionally, the use of ATIRE rather than Indri may have contributed to the results. However, the approach we have described is able to drastically decrease the number of documents required to be searched compared to that of Kulkarni and Callan [2010].

Additionally, the TopSig K-tree has other unique properties that make it particularly appealing for use with clustering of document collections for information retrieval. Document collections are rarely static in practice and undergo continual updates. The K-tree algorithm allows insertions into and deletions from the tree in an online and dynamic fashion. New documents can be inserted without having to perform global analysis. The average case analysis of K-tree [De Vries 2010a] gives an O(log n) time complexity for insertion of a single document. These properties allow it to efficiently cluster volatile document collections. Furthermore, although it has not been discussed in this paper, the tree structure of the K-tree can be used to further increase search efficiency by recursively searching the clusters in the tree using the CBM625 approach. Alternatively, a signature based representation may be found that is effective for ranking clusters in collection selection, eliminating the need for inverted indexes altogether. K-tree provides a multi-granular view of clustering, where the root node contains the most coarse and largest clusters. As the tree is traversed, the clusters become more and more fine grained at lower levels. Furthermore, locating an entire sub-tree on a single machine places semantically similar documents together, and these are likely to be relevant to similar queries due to the cluster hypothesis. In a distributed system, the higher levels of a K-tree may partition the collection onto many separate machines. Then each sub-tree stored on each machine can still be used to increase search efficiency on that machine by exploiting the fine grained clusters at lower levels in the K-tree. This improves throughput by two means. Firstly, it improves throughput by spreading the load over many machines. Secondly, it improves throughput by increasing the efficiency of each machine in the parallel and distributed system.
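A sketch of such a recursive search, reusing the node layout assumed earlier; the scoring function and beam width are illustrative placeholders rather than a specified part of our method:

    def recursive_cluster_search(node, score, beam=2):
        # Hypothetical recursive collection selection over a K-tree: score
        # the records in a node (e.g. with CBM625 over their clusters),
        # descend only into the best few children, and return the leaf
        # clusters finally selected for document-level search.
        if all(child is None for _, child in node.records):
            return [node]
        ranked = sorted(node.records, key=lambda rec: score(rec), reverse=True)
        selected = []
        for _, child in ranked[:beam]:
            selected.extend(recursive_cluster_search(child, score, beam))
        return selected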

REFERENCES

James Allan and Hema Raghavan. 2002. Using part-of-speech patterns to reduce query ambiguity. In Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 307–314.
G. Amati and C.J. van Rijsbergen. 2002. Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Transactions on Information Systems (TOIS) 20, 4 (Oct. 2002), 357–389. DOI:http://dx.doi.org/10.1145/582415.582416
V.N. Anh, O. de Kretser, and A. Moffat. 2001. Vector-space ranking with effective early termination. In Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 35–42.
D. Arthur, B. Manthey, and H. Roglin. 2009. k-Means has polynomial smoothed complexity. Annual IEEE Symposium on Foundations of Computer Science (2009), 405–414. DOI:http://dx.doi.org/10.1109/FOCS.2009.14
P. Arvola, S. Geva, J. Kamps, R. Schenkel, A. Trotman, and J. Vainio. 2011. Overview of the INEX 2010 ad hoc track. Comparative Evaluation of Focused Retrieval (2011), 1–32.
M. Atsumi. 2010. Probabilistic learning of visual object composition from attended segments. In Advances in Visual Computing. Springer, 696–705.
M. Atsumi. 2012a. Learning visual categories based on probabilistic latent component models with semi-supervised labeling. GSTF Journal on Computing (JoC) 2, 1 (2012).
M. Atsumi. 2012b. Visual categorization based on learning contextual probabilistic latent component tree. In Artificial Neural Networks and Machine Learning – ICANN 2012. Springer, 419–426.
M. Atsumi. 2013. Attention-guided organized perception and learning of object categories based on probabilistic latent variable models. Journal of Intelligent Learning Systems and Applications 5 (2013), 123–133.
A. Banerjee, S. Merugu, I.S. Dhillon, and J. Ghosh. 2005. Clustering with Bregman divergences. The Journal of Machine Learning Research 6 (2005), 1705–1749.
Kobus Barnard and David Forsyth. 2001. Learning the semantics of words and pictures. In Eighth IEEE International Conference on Computer Vision, 2001. ICCV 2001. Proceedings., Vol. 2. IEEE, 408–415.
J.C. Bezdek, R. Ehrlich, and others. 1984. FCM: The fuzzy c-means clustering algorithm. Computers & Geosciences 10, 2-3 (1984), 191–203.
Christophe Biernacki, Gilles Celeux, and Gerard Govaert. 2000. Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Transactions on Pattern Analysis and Machine Intelligence 22, 7 (2000), 719–725.
E. Bingham and H. Mannila. 2001. Random projection in dimensionality reduction: applications to image and text data. In KDD 2001. ACM, 245–250. DOI:http://dx.doi.org/10.1145/502512.502546
D.M. Blei, A.Y. Ng, and M.I. Jordan. 2003. Latent dirichlet allocation. The Journal of Machine Learning Research 3 (2003), 993–1022.
B.H. Bloom. 1970. Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13, 7 (1970), 422–426.
M.S. Charikar. 2002. Similarity estimation techniques from rounding algorithms. In STOC. ACM, New York, NY, USA, 380–388. DOI:http://dx.doi.org/10.1145/509907.509965
C.L. Clarke, N. Craswell, and I. Soboroff. 2009. Overview of the TREC 2009 web track. Technical Report. DTIC Document.
C.L.A. Clarke, N. Craswell, I. Soboroff, and G.V. Cormack. 2010. Overview of the TREC 2010 Web track. Technical Report. DTIC Document.
C. Cleverdon. 1967. The Cranfield tests on index language devices. In Aslib proceedings, Vol. 19. MCB UP Ltd, 173–194.
William W Cohen and Yoram Singer. 1999. Context-sensitive learning methods for text categorization. ACM Transactions on Information Systems (TOIS) 17, 2 (1999), 141–173.
F. Crestani and S. Wu. 2006. Testing the cluster hypothesis in distributed information retrieval. Information Processing & Management 42, 5 (2006), 1137–1150.
W.B. Croft. 1980. A model of cluster searching based on classification. Information Systems 5, 3 (1980), 189–195.
C.M. De Vries. 2010a. Application of K-tree to document clustering. Master's thesis. School of Computer Science, Queensland University of Technology, Brisbane, Australia.
C.M. De Vries, L. De Vine, and S. Geva. 2009. Random Indexing K-tree. ADCS 2009 (2009), 43.
C.M. De Vries and S. Geva. 2009. Document clustering with K-tree. Advances in Focused Retrieval: 7th International Workshop of the Initiative for the Evaluation of XML Retrieval, INEX 2008, Dagstuhl Castle, Germany, December 15-18, 2008. Revised and Selected Papers (2009), 420–431. DOI:http://dx.doi.org/10.1007/978-3-642-03761-0_43
C.M. De Vries and S. Geva. 2012. Pairwise similarity of TopSig document signatures. In Proceedings of the Seventeenth Australasian Document Computing Symposium. ACM, 128–134.
C.M. De Vries, S. Geva, and L. De Vine. 2010. Clustering with random indexing K-tree and XML structure. In Focused Retrieval and Evaluation. Springer, 407–415.
C.M. De Vries, S. Geva, and A. Trotman. 2012. Document clustering evaluation: Divergence from a random baseline. Workshop "Information Retrieval 2012" (IR-2012), 12-14 September, 2012, Technical University of Dortmund, Dortmund, Germany. (2012).
C.M. De Vries, R. Nayak, S. Kutty, S. Geva, and A. Tagarelli. 2011. Overview of the INEX 2010 XML mining track: Clustering and classification of XML documents. Comparative Evaluation of Focused Retrieval (2011), 363–376.
Christopher Michael De Vries. 2010b. Application of K-tree to document clustering. (2010).
S. Deerwester, S.T. Dumais, G.W. Furnas, T.K. Landauer, and R. Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science 41, 6 (1990), 391–407.
A.P. Dempster, N.M. Laird, and D.B. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological) (1977), 1–38.
L. Denoyer and P. Gallinari. 2008. Report on the XML mining track at INEX 2007: categorization and clustering of XML documents. In ACM SIGIR Forum, Vol. 42. ACM, 22–28.
L. Denoyer and P. Gallinari. 2009. Overview of the INEX 2008 XML Mining Track. Advances in Focused Retrieval: 7th International Workshop of the Initiative for the Evaluation of XML Retrieval, INEX 2008, Dagstuhl Castle, Germany, December 15-18, 2008. Revised and Selected Papers (2009), 401–411. DOI:http://dx.doi.org/10.1007/978-3-642-03761-0_41
L. Denoyer, P. Gallinari, and A.M. Vercoustre. 2007. Report on the XML mining track at INEX 2005 and INEX 2006. INEX 2006 (2007), 432–443.
F. Diaz. 2005. Regularizing ad hoc retrieval scores. In Proceedings of the 14th ACM international conference on Information and knowledge management. ACM, 672–679.
C. Ding and X. He. 2004. K-means clustering via principal component analysis. In Proceedings of the twenty-first international conference on Machine learning. ACM, 29.
S. Ding, S. Gollapudi, S. Ieong, K. Kenthapadi, and A. Ntoulas. 2011. Indexing strategies for graceful degradation of search quality. In Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval. ACM, 575–584.
R.O. Duda, P.E. Hart, and D.G. Stork. 2001. Pattern Classification, 2nd Edition. Wiley-Interscience.
V. Estivill-Castro. 2002. Why so many clustering algorithms: a position paper. ACM SIGKDD Explorations Newsletter 4, 1 (2002), 65–75.
C. Faloutsos and S. Christodoulakis. 1984. Signature files: an access method for documents and its analytical performance evaluation. ACM Transactions on Information Systems (TOIS) 2, 4 (October 1984), 267–288. DOI:http://dx.doi.org/10.1145/2275.357411
P. Forner, J. Gonzalo, J. Kekalainen, M. Lalmas, and M. De Rijke. 2011. Multilingual and Multimodal Information Access Evaluation: Proceedings of the Second International Conference of the Cross-language Evaluation Forum, CLEF 2011, Amsterdam, The Netherlands, September 19-22, 2011. Vol. 6941. Springer-Verlag New York Inc.
Christopher Fox. 1989. A stop list for general text. In ACM SIGIR Forum, Vol. 24. ACM, 19–21.
T.W. Fox. 2005. Document vector compression and its application in document clustering. Canadian Conference on Electrical and Computer Engineering (May 2005), 2029–2032. DOI:http://dx.doi.org/10.1109/CCECE.2005.1557384
N. Fuhr, M. Lechtenfeld, B. Stein, and T. Gollub. 2011. The optimum clustering framework: implementing the cluster hypothesis. Information Retrieval (2011), 1–23.
A. Gersho and R.M. Gray. 1993. Vector Quantization and Signal Compression. Kluwer Academic Publishers.
S. Geva. 2000. K-tree: a height balanced tree structured vector quantizer. In Proceedings of the 2000 IEEE Signal Processing Society Workshop Neural Networks for Signal Processing X, 2000., Vol. 1. IEEE, 271–280.
S. Geva and C.M. De Vries. 2011. TOPSIG: topology preserving document signatures. In Proceedings of the 20th ACM international conference on Information and knowledge management (CIKM '11). ACM, New York, NY, USA, 333–338. DOI:http://dx.doi.org/10.1145/2063576.2063629
A. Girgensohn, F. Shipman, and L. Wilcox. 2011. Adaptive clustering and interactive visualizations to support the selection of video clips. In Proceedings of the 1st ACM International Conference on Multimedia Retrieval. ACM, 34.
G. Golub and W. Kahan. 1965. Calculating the singular values and pseudo-inverse of a matrix. Journal of the Society for Industrial and Applied Mathematics: Series B, Numerical Analysis 2, 2 (1965), 205–224.
L. Gravano and H. Garcia-Molina. 1999. Generalizing GlOSS to vector-space databases and broker hierarchies. (1999).
I. Guyon, U. von Luxburg, and R.C. Williamson. 2009. Clustering: Science or art. In NIPS Workshop on Clustering Theory.
Z.S. Harris. 1954. Distributional structure. Word (1954).
M.A. Hearst and J.O. Pedersen. 1996. Reexamining the cluster hypothesis: scatter/gather on retrieval results. In Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 76–84.
A. Hotho, S. Staab, and G. Stumme. 2003. Ontologies improve text document clustering. Third IEEE International Conference on Data Mining, 2003. ICDM 2003. (Nov. 2003), 541–544.
Xiaohua Hu, Xiaodan Zhang, Caimei Lu, Eun K Park, and Xiaohua Zhou. 2009. Exploiting Wikipedia as external knowledge for document clustering. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 389–396.
Piotr Indyk and Rajeev Motwani. 1998. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the thirtieth annual ACM symposium on Theory of computing. ACM, 604–613.
N. Jardine and C.J. van Rijsbergen. 1971. The use of hierarchic clustering in information retrieval. Information Storage and Retrieval 7, 5 (1971), 217–240.
W.B. Johnson and J. Lindenstrauss. 1984. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics 26, 189-206 (1984), 1–1.
N. Kando, K. Kuriyama, T. Nozue, K. Eguchi, H. Kato, and S. Hidaka. 1999. Overview of IR tasks at the first NTCIR workshop. In Proceedings of the first NTCIR workshop on research in Japanese text retrieval and term recognition. 11–44.
G. Karypis, E.H. Han, and V. Kumar. 1999. Chameleon: Hierarchical clustering using dynamic modeling. Computer 32, 8 (1999), 68–75.
M. Kaszkiel and J. Zobel. 2001. Effective ranking with arbitrary passages. Journal of the American Society for Information Science and Technology 52, 4 (2001), 344–364.
Jon M. Kleinberg. 1999. Authoritative sources in a hyperlinked environment. J. ACM 46, 5 (1999), 604–632. DOI:http://dx.doi.org/10.1145/324133.324140
A. Kulkarni and J. Callan. 2010. Document allocation policies for selective searching of distributed indexes. In Proceedings of the 19th ACM international conference on Information and knowledge management. ACM, 449–458.
A. Kulkarni, A.S. Tigelaar, D. Hiemstra, and J. Callan. 2012. Shard ranking and cutoff estimation for topically partitioned collections. In Proceedings of the 21st ACM international conference on Information and knowledge management. ACM, 555–564.
Sangeetha Kutty, Richi Nayak, and Yuefeng Li. 2011. XML documents clustering using a tensor space model. In Advances in Knowledge Discovery and Data Mining. Springer, 488–499.
A. Kyriakopoulou and T. Kalamboukis. 2007. Using clustering to enhance text classification. In SIGIR '07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, New York, NY, USA, 805–806. DOI:http://dx.doi.org/10.1145/1277741.1277918
S. Lamprier, T. Amghar, B. Levrat, and F. Saubion. 2007. SegGen: A genetic algorithm for linear text segmentation. In International Joint Conference on Artificial Intelligence (IJCAI'07).
S. Lamprier, T. Amghar, B. Levrat, and F. Saubion. 2008. Using text segmentation to enhance the cluster hypothesis. Artificial Intelligence: Methodology, Systems, and Applications (2008), 69–82.
Leah S Larkey, Margaret E Connell, and Jamie Callan. 2000. Collection selection and results merging with topically organized US patents and TREC data. In Proceedings of the ninth international conference on Information and knowledge management. ACM, 282–289.
V. Latora and M. Marchiori. 2001. Efficient behavior of small-world networks. Physical Review Letters 87, 19 (2001), 198701.
K.S. Lee, W.B. Croft, and J. Allan. 2008. A cluster-based resampling method for pseudo-relevance feedback. In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 235–242.
K.S. Lee, K. Kageura, and K.S. Choi. 2004. Implicit ambiguity resolution using incremental clustering in cross-language information retrieval. Information Processing & Management 40, 1 (2004), 145–159.
K.S. Lee, Y.C. Park, and K.S. Choi. 2001. Re-ranking model based on document clusters. Information Processing & Management 37, 1 (2001), 1–14.
A. Leuski. 2001. Evaluating document clustering for interactive information retrieval. In CIKM '01: Proceedings of the tenth international conference on Information and knowledge management. ACM, New York, NY, USA, 33–40. DOI:http://dx.doi.org/10.1145/502585.502592
AWC Liew, SH Leung, and WH Lau. 2000. Fuzzy image clustering incorporating spatial continuity. IEEE Proceedings - Vision, Image and Signal Processing 147, 2 (2000), 185–192.
X. Liu and W.B. Croft. 2004. Cluster-based retrieval using language models. In Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 186–193.
S. Lloyd. 1982. Least squares quantization in PCM. IEEE Transactions on Information Theory 28, 2 (March 1982), 129–137.
Lie Lu, Hong-Jiang Zhang, and Hao Jiang. 2002. Content analysis for audio classification and segmentation. IEEE Transactions on Speech and Audio Processing 10, 7 (2002), 504–516.
Kevin Lund and Curt Burgess. 1996. Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instruments, & Computers 28, 2 (1996), 203–208.
Bin Luo, Richard C Wilson, and Edwin R Hancock. 2003. Spectral embedding of graphs. Pattern Recognition 36, 10 (2003), 2213–2230.
C. Macdonald and I. Ounis. 2006. Voting for candidates: adapting data fusion techniques for an expert search task. In CIKM 2006. ACM, 387–396. DOI:http://dx.doi.org/10.1145/1183614.1183671
C.D. Manning, P. Raghavan, and H. Schutze. 2008. An Introduction to Information Retrieval. (2008).
Yingbo Miao, Vlado Keselj, and Evangelos Milios. 2005. Document clustering using character N-grams: a comparative evaluation with term-based and word-based clustering. In Proceedings of the 14th ACM international conference on Information and knowledge management. ACM, 357–358.
Fionn Murtagh, Jean-Luc Starck, and Michael W Berry. 2000. Overcoming the curse of dimensionality in clustering by means of the wavelet transform. Comput. J. 43, 2 (2000), 107–120.
R. Nayak, C.M. De Vries, S. Kutty, S. Geva, L. Denoyer, and P. Gallinari. 2010. Overview of the INEX 2009 XML mining track: Clustering and classification of XML documents. Focused Retrieval and Evaluation (2010), 366–378.
R. Nayak, R. Witt, and A. Tonev. 2002. Data mining and XML documents. (2002).
Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. 1999. The PageRank citation ranking: Bringing order to the web. (1999).
Dan Pelleg, Andrew W Moore, and others. 2000. X-means: Extending K-means with efficient estimation of the number of clusters. In ICML. 727–734.
M.F. Porter. 2006. An algorithm for suffix stripping. Program: Electronic Library and Information Systems 40, 3 (2006), 211–218.
William H Press. 2007. Numerical Recipes 3rd Edition: The Art of Scientific Computing. Cambridge University Press.
Diego Puppin, Fabrizio Silvestri, and Domenico Laforenza. 2006. Query-driven document partitioning and collection selection. In Proceedings of the 1st international conference on Scalable information systems. ACM, 34.
Behrang QasemiZadeh and Siegfried Handschuh. 2014. Random Manhattan Indexing. In 25th International Workshop on Database and Expert Systems Applications.
Daniel Ramage, Paul Heymann, Christopher D Manning, and Hector Garcia-Molina. 2009. Clustering the tagged web. In Proceedings of the Second ACM International Conference on Web Search and Data Mining. ACM, 54–63.
Kaspar Riesen, Michel Neuhaus, and Horst Bunke. 2007. Graph embedding in vector spaces by means of prototype selection. In Graph-Based Representations in Pattern Recognition. Springer, 383–393.
S.E. Robertson and K.S. Jones. 1997. Simple, proven approaches to text retrieval. Update (1997).
S.E. Robertson, S. Walker, S. Jones, M.M. Hancock-Beaulieu, and M. Gatford. 1995. Okapi at TREC-3. In Proceedings of the third Text REtrieval Conference.
M. Sahlgren. 2005. An introduction to random indexing. In TKE 2005.
Ruslan Salakhutdinov and Geoffrey Hinton. 2009. Semantic hashing. International Journal of Approximate Reasoning 50, 7 (2009), 969–978.
G. Salton and C. Buckley. 1988. Term-weighting approaches in automatic text retrieval. Information Processing & Management 24, 5 (1988), 513–523.
G. Salton, A. Wong, and C.S. Yang. 1975. A vector space model for automatic indexing. Commun. ACM 18, 11 (1975), 613–620.
T. Saracevic. 2008. Effects of inconsistent relevance judgments on information retrieval test results: A historical perspective. (2008).
S.Z. Selim and M.A. Ismail. 1984. K-means-type algorithms: a generalized convergence theorem and characterization of local optimality. IEEE Transactions on Pattern Analysis and Machine Intelligence 1 (1984), 81–87.
M. Smucker and J. Allan. 2009. A new measure of the cluster hypothesis. Advances in Information Retrieval Theory (2009), 281–288.
K. Sparck Jones and C. van Rijsbergen. 1975. Report on the need for and provision of an "ideal" information retrieval test collection. British Library Research and Development Report 5266, Computer Laboratory, University of Cambridge (1975).
M. Steinbach, G. Karypis, V. Kumar, and others. 2000. A comparison of document clustering techniques. In KDD workshop on text mining, Vol. 400. Boston, 525–526.
Andrea Tagarelli and Sergio Greco. 2006. Toward semantic XML clustering. In SDM. SIAM, 188–199.
A.H. Tan. 1999. Text mining: The state of the art and the challenges. In Proceedings of the PAKDD 1999 Workshop on Knowledge Discovery from Advanced Databases. 65–70.
A. Tombros, R. Villa, and C.J. Van Rijsbergen. 2002. The effectiveness of query-specific hierarchic clustering in information retrieval. Information Processing & Management 38, 4 (2002), 559–582.
A. Trotman. 2003. Compressing inverted files. Information Retrieval 6, 1 (2003), 5–19.
A. Trotman, X.F. Jia, and S. Geva. 2010. Fast and effective focused retrieval. Focused Retrieval and Evaluation (2010), 229–241.
Stijn Van Dongen. 2008. Graph clustering via a discrete uncoupling process. SIAM J. Matrix Anal. Appl. 30, 1 (2008), 121–141.
C.J. van Rijsbergen and K.S. Jones. 1973. A test for the separation of relevant and non-relevant documents in experimental retrieval collections. Journal of Documentation 29, 3 (1973), 251–257.
E.M. Voorhees. 1985. The cluster hypothesis revisited. In Proceedings of the 8th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 188–196.
E.M. Voorhees. 1986. The effectiveness and efficiency of agglomerative hierarchic clustering in document retrieval. Ph.D. Dissertation.
E.M. Voorhees. 2002. The philosophy of information retrieval evaluation. In Evaluation of cross-language information retrieval systems. Springer, 143–170.
M. Weeks, V.J. Hodge, and J. Austin. 2002. A hardware-accelerated novel IR system. In 10th Euromicro Workshop on Parallel, Distributed and Network-based Processing, 2002. Proceedings. IEEE, 283–289.
J.S. Whissell and C.L.A. Clarke. 2011. Improving document clustering using Okapi BM25 feature weighting. Information Retrieval (2011), 1–22.
John S Whissell and Charles LA Clarke. 2013. Effective measures for inter-document similarity. In Proceedings of the 22nd ACM international conference on Conference on information & knowledge management. ACM, 1361–1370.
I.H. Witten, A. Moffat, and T.C. Bell. 1999. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann.
X. Wu, V. Kumar, J. Ross Quinlan, J. Ghosh, Q. Yang, H. Motoda, G.J. McLachlan, A. Ng, B. Liu, P.S. Yu, and others. 2008. Top 10 algorithms in data mining. Knowledge and Information Systems 14, 1 (2008), 1–37.
J. Xu and W.B. Croft. 1999. Cluster-based language models for distributed retrieval. In Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 254–261.
W. Xu, X. Liu, and Y. Gong. 2003. Document clustering based on non-negative matrix factorization. In SIGIR '03: Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, New York, NY, USA, 267–273. DOI:http://dx.doi.org/10.1145/860435.860485
Minerva Yeung, Boon-Lock Yeo, and Bede Liu. 1998. Segmentation of video by clustering and graph analysis. Computer Vision and Image Understanding 71, 1 (1998), 94–109.
Lan Yi, Bing Liu, and Xiaoli Li. 2003. Eliminating noisy information in web pages for data mining. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 296–305.
B. Yuwono and D.L. Lee. 1997. Server ranking for distributed text retrieval systems on the internet. In Proceedings of the Fifth International Conference on Database Systems for Advanced Applications (DASFAA). 41–50.
C. Zhai and J. Lafferty. 2004. A study of smoothing methods for language models applied to information retrieval. ACM Transactions on Information Systems (TOIS) 22, 2 (April 2004), 179–214. DOI:http://dx.doi.org/10.1145/984321.984322
D. Zhang, J. Wang, D. Cai, and J. Lu. 2010. Self-taught hashing for fast similarity search. In SIGIR 2010. ACM, 18–25. DOI:http://dx.doi.org/10.1145/1835449.1835455
G.K. Zipf. 1949. Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology. Addison-Wesley Press.
J. Zobel. 1998. How reliable are the results of large-scale information retrieval experiments?. In Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 307–314.
J. Zobel, A. Moffat, and L.A.F. Park. 2009. Against recall: is it persistence, cardinality, density, coverage, or totality? SIGIR Forum 43, 1 (June 2009), 3–8. DOI:http://dx.doi.org/10.1145/1670598.1670600


Chapter 16
The EM-tree Algorithm

I contributed the majority of this paper. I proposed and wrote the algorithm and convergence proof, and Lance and Shlomo aided this process. Lance made significant contributions to the implementation of the software.1 I wrote the manuscript, wrote the proof, wrote software, designed and conducted experiments, and performed data analysis.

16.1 Algorithms

A new algorithm called the EM-tree algorithm is introduced in this paper. It iteratively optimizes an entire m-way nearest neighbor search tree. It is proven to converge. Furthermore, it provides efficiency advantages over the TopSig K-tree when using TopSig bit vectors because means are updated less often.

1 https://github.com/cmdevries/LMW-tree


Journal of Machine Learning Research 1 (2013) 1-48 Submitted 9/13; Published 00/00

The EM-tree Algorithm

Christopher M. De Vries [email protected]
Electrical Engineering and Computer Science
Queensland University of Technology
Brisbane, Australia

Lance De Vine [email protected]
Electrical Engineering and Computer Science
Queensland University of Technology
Brisbane, Australia

Shlomo Geva [email protected]
Electrical Engineering and Computer Science
Queensland University of Technology
Brisbane, Australia

Editor:

Abstract

This paper introduces the EM-tree algorithm. The algorithm iteratively optimizes the m-way nearest neighbor search tree in an Expectation Maximization like fashion. Both the data structure and algorithm are formally defined. A convergence proof for the algorithm is presented. The algorithm is validated experimentally using large scale document collections. The documents are indexed using the TopSig approach, where documents are mapped onto binary vectors for efficiency in processing. The EM-tree produces lower distortion clustering solutions in less time than previous large scale clustering approaches. The algorithm also has advantages over other clustering algorithms when used in the streaming setting, where document vectors are streamed sequentially from disk. Furthermore, due to the nature of the optimization process, the EM-tree is particularly amenable to parallel and distributed implementations for large scale learning.

Keywords: Clustering, Large Scale Learning, Random Projection, Document Signatures, Document Clustering

1. Introduction

This paper introduces the EM-tree algorithm. It is a tree structured clustering algorithm that optimizes a m-way nearest neighbor search tree in an Expectation Maximization like fashion. The structure of the tree adapts to the data with a pruning process that removes empty branches of the tree between each iteration. The data is assigned to leaves in the tree according to the current Expectation of the parameters. After the Expectation step, any empty branches are pruned from the tree. The parameters are then Maximized given the current assignment. The EM-tree algorithm is also similar to the k-means algorithm, which is well known to be a special case of the Expectation Maximization algorithm (Dempster et al., 1977) with hard cluster assignment and isotropic Gaussian distributions. What differentiates EM-tree from previous tree structured clustering algorithms is that the entire tree is optimized at each iteration, followed by a pruning process that removes empty branches.

Several different clustering algorithms can store their data in a m-way tree. The k-means algorithm (Lloyd, 1982) can be constructed as a two level m-way tree. If the intermediate levels of the tree are ignored in the m-way tree, then the EM-tree algorithm becomes exactly the same as k-means at the root level. We show this formally later in the paper and it becomes the basis for a convergence proof of the EM-tree algorithm. Other hierarchical clustering algorithms that work with vectorial data, such as TSVQ (Gersho and Gray, 1993), agglomerative hierarchical clustering (Steinbach et al., 2000) and K-tree (Geva, 2000; De Vries et al., 2009),1 produce m-way tree structures as well.

The EM-tree algorithm was first devised to improve point assignment to leaves in the K-tree algorithm. Early in the construction process, the K-tree makes decisions about where to place points within the tree without entire knowledge of the data set, as vectors are presented one by one. After some time it became apparent to the authors that the EM-tree algorithm could be applied to any m-way tree, including a randomly initialized tree. At this point the algorithm became totally independent from K-tree. Initially it was not clear whether the algorithm would converge or not. After some thought it became apparent that k-means is in fact happening in the root node of the tree. This led to the inductive proof in Theorem 4 demonstrating that the algorithm will eventually converge.

Firstly, in Sections 2 and 3, we introduce the k-means problem and the k-means algorithm that finds local approximations of the globally optimal solution. We then introduce and define the m-way nearest neighbor search tree data structure in Section 4. Section 5 introduces and defines the EM-tree algorithm. A convergence proof for the EM-tree algorithm is contained in Section 6. The TSVQ algorithm is defined in Section 7 and it is shown that it is not exactly equivalent to the EM-tree algorithm.

The paper then changes from a theoretical focus to perform experiments in document clustering to provide support for the theoretical claims. The experimental setup for the remaining sections of the paper is described in Section 8. Section 9 evaluates the convergence properties of the EM-tree algorithm. In Section 10, the EM-tree is compared to the K-tree, TSVQ and k-means algorithms. The EM-tree algorithm is compared to the K-tree algorithm on a large scale document collection containing 50 million documents in Section 11. A streaming variant of the EM-tree is discussed in Section 12, along with how this provides advantages over TSVQ. The implications of parallel and distributed versions of the EM-tree are discussed in Section 13.

Finally, future work is discussed in Section 14 and the paper is concluded in Section 15.

2. The k-means problem

The k-means problem is a fundamental problem in geometry that minimizes the squared Euclidean distance between k centroids and a set of points X. What follows is a formal definition of the problem.

1. http://ktree.sf.net


Let C be a set of k centroids in d dimensions, {c_1, . . . , c_k}, where C ⊂ R^d and |C| = k. Let s be the squared Euclidean distance between two points,

    s(x, y) = ||x − y||^2,

where x, y ∈ R^d. The nearest function produces the nearest centroid, c ∈ C, for a point, x ∈ X, under the squared Euclidean distance function, s. Note that ties are broken in a consistent manner by taking the first member of the set if there are two or more centroids at equal similarity, hence the 1 subscript,

    nearest(x, C) = {c : ∃c ∈ C ∧ ∀d ∈ C ∧ s(c, x) ≤ s(d, x) ∧ c ≠ d}_1.

The function neighbors produces the nearest neighbor set for a centroid, c ∈ C, where

    neighbors(c, X, C) = {x ∈ X : nearest(x, C) = c}.

The k-means problem is to find the set C that minimizes the squared Euclidean distance, s, between each point in X and its nearest centroid in C, where

    C = argmin_C Σ_{c ∈ C} Σ_{x ∈ neighbors(c, X, C)} s(c, x).

Finding the globally optimal solution to the k-means problem is known to be NP-hard (Aloise et al., 2009). Therefore, the k-means algorithm (Lloyd, 1982) is used to approximate the optimal solution by converging to a local optimum (Selim and Ismail, 1984).

3. The k-means algorithm

The k-means algorithm is one of the most popular unsupervised learning algorithms due to its simplicity, effectiveness and linear time complexity when the number of iterations is limited. However, it was only recently proven, using a smoothed analysis, that the k-means algorithm has polynomial time complexity when the number of iterations is not limited (Arthur et al., 2009). It has been listed as one of the top 10 algorithms in data mining (Wu et al., 2008) and is used in many different applications.

It is often used to initialize more expensive algorithms such as Gaussian mixtures, Radial Basis Functions, Learning Vector Quantization and some Hidden Markov Models. Bottou and Bengio (1995) suggest that k-means performs most of the optimization in a small fraction of the time when it seeds more expensive algorithms.

The k-means algorithm is defined in pseudocode in the procedure kmeans. The algorithm can be cast as gradient descent, probabilistic Expectation Maximization, and Newton's method (Bottou and Bengio, 1995).

kmeans(k, X)
1  select k points uniformly at random from X to initialize the centroid set C
2  while C has changed between iterations
3      assign each vector in X to its nearest neighbor in the centroid set C
4      recalculate each centroid in C as the mean of all its nearest neighbors
5  return C


At each iteration of k-means the centroid set is updated. What follows is a formal definition of the process, which will be used for later proofs. The function update produces the updated centroid set created at each iteration of k-means,

    update(C, X) = {c ∈ C : mean(neighbors(c, X, C))}.

4. m-way nearest neighbor search tree

A m-way nearest neighbor search tree is a recursive data structure that indexes a set of npoints in d dimensions where, X = {x1, . . . , xn}, X ⊂ Rd and |X| = n. From here on out inthe text it will be referred to as a m-way tree which is not to be confused with the m-waytree from the B-tree data structure. Let N be the set of all tree nodes. Each node is asequence of records which are (vector, child node) pairs. For example, if root is the rootnode of a tree containing 4 records,

root = 〈record1, record2, record3, record4〉,

then, root1, returns the first pair in the sequence, record1, and, key(root1), returns thevector that is the key and, child(root1), returns the child node associated with the key ofrecord1,

root1 = record1 = (key ∈ Rd, child ∈ N).

The length of a node is denoted by bars, |root| = 4. There are internal and leaf nodes in am-way tree. Leaf nodes, L, contain the data inserted into the tree with empty child nodeswhere, L ⊂ N , and,

∀leaf ∈ L : ∀record ∈ leaf : key(record) ∈ X ∧ |child(record)| = 0 ∧ 1 ≤ |record| ≤ |X|.

Note that leaf nodes are not restricted by the tree order and can contain the entire datasetbut must not be empty. Internal nodes, I, have non-empty child nodes and contain between1 and m records where, I ⊂ N , and,

∀internal ∈ I : ∀record ∈ internal : |child(record)| 6= 0 ∧ 1 ≤ |record| ≤ m.

The leaves function takes any node and returns all data points associated with that node,i.e., all descendant points in leaf nodes, where, leaves(node) ⊂ X,

leaves(node) = {point : ∀record ∈ node :
                   point = key(record)            if node ∈ L
                   point ∈ leaves(child(record))  if node ∈ I}.

All internal nodes contain keys that are means of points in X, where,

∀internal ∈ I : ∀record ∈ internal : key(record) = mean(leaves(child(record))).

Figure 1 illustrates how each key in an internal node is the mean of all keys in descendant leaf nodes. Figure 2 illustrates how k-means can be constructed as a two level m-way tree, where the centroids are in the root of the tree, and the data points being clustered are in the leaves.
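
The recursive definition above maps directly onto a small data structure. The following Python sketch is illustrative only; the class name Node and the use of None for an empty child are our own assumptions, not notation from the paper.

class Node:
    # a node is a sequence of (key, child) records; leaf records have no child
    def __init__(self, records=None):
        self.records = records if records is not None else []

    def is_leaf(self):
        # the tree is height balanced, so checking the first record suffices
        return bool(self.records) and self.records[0][1] is None

def leaves(node):
    # all data points indexed below this node, i.e. the descendant leaf keys
    points = []
    for key, child in node.records:
        if child is None:              # leaf record: the key is a point in X
            points.append(key)
        else:                          # internal record: recurse into the child
            points.extend(leaves(child))
    return points

The invariant that every internal key equals mean(leaves(child)) can then be checked or restored by walking the tree, which is what the update procedure of the EM-tree does.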


Figure 1: An m-way tree illustrating means

Figure 2: A two level m-way tree illustrating the structure of the k-means algorithm


5. The EM-tree algorithm

The EM-tree algorithm proceeds in a similar manner to most iterative optimization algorithms. The m-way tree model is initialized and it is then iteratively optimized until convergence. However, the EM-tree algorithm has an additional step during the iterative optimization process that removes sections of the model that no longer have data associated with them. This is the pruning process that removes empty branches from the m-way tree. Empty branches may result from the re-insertion step, which is described below. By a process of insertion (Expectation), update (Maximization) and pruning, the m-way tree model adapts to the data as the clusters converge.

The EM-tree places further restrictions on the definition of the m-way tree. The points contained in each child tree associated with each key are the nearest neighbors of the associated key,

∀internal ∈ I : ∀record ∈ internal : leaves(child(record)) = neighbors(key(record), Y, C).

The invariant above is broken during the optimization process but holds once the tree has converged. The optimization process in EM-tree assigns points to leaves using the nearest neighbor rule during the Expectation step. However, the updating of means in the tree during the Maximization step breaks the nearest neighbor assignment, and points must be reassigned to the new means in the tree.

The EM-tree algorithm can take any m-way tree as input and optimize it. For example, the tree produced by the K-tree algorithm can be taken and further optimized to correct decisions that were made without full knowledge of the entire data set. Additionally, the algorithm can be applied to any subtree of an m-way tree. In a setting where a changing data set is being clustered, branches of the tree affected by insertions and deletions can be restructured independently of the rest of the tree. Another possible use case is to build the tree to convergence using a sufficiently large sample of a large collection. The remaining data can then be inserted into the tree, and either left as is, or a small number of iterations can adapt the tree to the remaining data.

The procedure seed initializes the EM-tree algorithm, where m is the order of the tree, node is the current tree node, Y is the set of points to choose centroids from, where Y ⊂ X, and depth is the number of levels deep the tree will be. Note that this seeding procedure produces a height balanced tree, as all leaves are at the same depth. It is possible to use other criteria for stopping the recursion, such as the number of points in Y, though this does not guarantee a height balanced tree.


seed(m, depth, Y)

1   node = 〈〉
2   if depth == 1
3       for point ∈ Y
4           append (point, 〈〉) to node
5       return node
6   else
7       C is the centroid set
8       C = m points selected uniformly at random from Y
9       for centroid ∈ C
10          neighbors = neighbors(centroid, Y, C)
11          record = (mean(neighbors), seed(m, depth − 1, neighbors))
12          append record to node
13      return node

The insert procedure inserts a set of vectors, Y, into a height balanced m-way tree whose root is node, removing all existing points it indexes. Points are inserted by following the nearest neighbor search path, where at each node in the tree the branch with the nearest key is followed.

insert(node, Y)

1   // the tree is height balanced so only the first record needs to be checked for children
2   if |child(node1)| == 0
3       // insert all points in Y into this leaf node
4       node = 〈〉
5       for point ∈ Y
6           append (point, 〈〉) to node
7   else
8       // not a leaf node, so keep recursing and splitting Y according to the nearest keys
9       C = {}
10      for record ∈ node
11          insert key(record) into C
12      for record ∈ node
13          insert(child(record), neighbors(key(record), Y, C))

The procedure update updates the means in the tree according to the current assignment of data points in the leaves.

update(node)

1   if |child(node1)| == 0
2       return
3   else
4       for record ∈ node
5           update(child(record))
6           key(record) = mean(leaves(child(record)))


The prune procedure removes any means with no associated data points from the tree. This is completed bottom up, where the means directly above the leaves are removed first. If any higher level nodes are left containing no means, then they are removed as well in the next pass. Once the m-way tree has been updated by the update procedure, and the data points have been reinserted via the insert procedure, branches of the tree may contain no points. These empty branches are removed, which allows the tree structure to adapt to the data.

prune(node, depth)

1   // means start one level higher than depth
2   depth = depth − 1
3   while depth ≠ 0
4       prunelevel(node, depth)
5       depth = depth − 1

prunelevel(node, depth)

1   if depth == 1
2       for record ∈ node
3           if |child(record)| == 0
4               remove record from node
5   else
6       for record ∈ node
7           prunelevel(child(record), depth − 1)

The emtree procedure initializes and iteratively optimizes an EM-tree of order m that is depth levels deep.

emtree(m, depth, X)

1   root = seed(m, depth, X)
2   converged = false
3   while not converged
4       root′ = root
5       insert(root′, X)
6       prune(root′, depth)
7       update(root′)
8       if root == root′
9           converged = true
10      else
11          root = root′
12  return root
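
Putting the pieces together, the following Python sketch implements the EM-tree iteration over the hypothetical Node class sketched in Section 4, reusing the dist2, mean and leaves helpers from the earlier sketches. It is a sketch under those assumptions, not the LMW-tree implementation; seed is omitted, and exact key equality is used as the convergence test, mirroring the pseudocode.

def em_insert(node, Y):
    # Expectation: push every point in Y down its nearest neighbor search path
    if node.is_leaf():
        node.records = [(y, None) for y in Y]     # replace the old leaf contents
    else:
        keys = [k for k, _ in node.records]
        parts = [[] for _ in node.records]
        for y in Y:
            parts[min(range(len(keys)), key=lambda j: dist2(keys[j], y))].append(y)
        for (k, child), part in zip(node.records, parts):
            em_insert(child, part)

def em_prune(node):
    # remove branches that received no points, bottom up
    if not node.is_leaf():
        for _, child in node.records:
            em_prune(child)
        node.records = [(k, c) for k, c in node.records if c.records]

def em_update(node):
    # Maximization: recompute each internal key as the mean of its subtree
    if not node.is_leaf():
        for i, (k, child) in enumerate(node.records):
            em_update(child)
            node.records[i] = (mean(leaves(child)), child)

def all_keys(node):
    # every key in the tree, used as a snapshot for the convergence test
    keys = [k for k, _ in node.records]
    for _, child in node.records:
        if child is not None:
            keys.extend(all_keys(child))
    return keys

def emtree_optimize(root, X):
    # iterate insert, prune and update until the tree no longer changes
    while True:
        before = all_keys(root)
        em_insert(root, X)
        em_prune(root)
        em_update(root)
        if all_keys(root) == before:
            return root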

6. Convergence proof

The proofs outlined in this section demonstrate that the EM-tree algorithm will converge. They show that the k-means algorithm is being performed at the root node of the tree, and then at successively lower internal nodes of the tree after the nodes above have stabilized.


Until the parent nodes have stabilized, the input vectors of a given node can change between iterations. The k-means update procedure does happen for non-root nodes, but it is not exactly the same as the k-means algorithm because the input vectors being optimized can change between iterations. However, at some point, the input vectors for each node do stabilize. From this point onwards, the optimization process for any tree node becomes exactly the same as the k-means algorithm seeded with the means at the point the input vectors stabilize.

Definition 1 Internal nodes, I, in an EM-tree index a subset of all points in X, ∀node ∈ I : leaves(node) ⊂ X, where the root node indexes all points in X. The subset indexed by a particular node, Y, is a subset of all points, Y ⊂ X. The centroid set, C, of an internal node is all the vectors contained in the keys of the internal node,

C = {key(record) : record ∈ internal}.

Each internal node is a set of cluster means. The points contained in each child tree are the nearest neighbors of each point in the centroid set,

∀internal ∈ I : ∀record ∈ internal : leaves(child(record)) = neighbors(key(record), Y, C). (1)

All the keys are means of the leaves contained in the associated subtree,

∀internal ∈ I : ∀record ∈ internal : key(record) = mean(leaves(child(record))). (2)

Lemma 2 At each iteration of the EM-tree algorithm the k-means Expectation and Maximization step, update(C, Y) = {mean(neighbors(c, Y, C)) : c ∈ C}, is performed at each internal node in the m-way tree, ∀internal ∈ I : {mean(neighbors(c, Y, C)) : c ∈ C}.

Proof By Definition 1, the means in the EM-tree are updated in the same manner as k-means. The leaves in a child node are the nearest neighbors of the corresponding key in the node, which are cluster means.

∀internal ∈ I : ∀record ∈ internal : key(record) = mean(leaves(child(record)))

By substitution of Equation 1 into Equation 2,

∀internal ∈ I : ∀record ∈ internal : key(record) = mean(neighbors(key(record), Y, C)). (3)

Each key contained in a given internal node has a one to one correspondence with the centroids of the k-means algorithm, as in Definition 1, C = {key(record) : record ∈ internal}. Therefore, by substituting c for key(record) in Equation 3, it is reduced to the following.

∀internal ∈ I : {mean(neighbors(c, Y, C)) : c ∈ C}

This takes exactly the same form as the update(C, Y) function that performs the update of k-means. All internal nodes perform update(C, Y) and therefore the same operations as k-means.


Note that Lemma 2 does not prove the equivalence of the EM-tree algorithm to the entire k-means algorithm, only to the update procedure applied at each iteration. For complete equivalence to the k-means algorithm, the subset of data indexed by a given node, Y, has to be fixed between each iteration. This is not the case for all internal nodes, as is proven later.

Definition 3 The nodes N in an m-way tree change between iterations of the EM-tree algorithm. Therefore, N represents the nodes at the current iteration and N′ the nodes at the previous iteration.

Theorem 4 The points contained in any node, n ∈ N, in the m-way tree will converge when using the EM-tree algorithm. In a finite amount of time the following will hold, N′ = N.

Proof By structural induction on the definition of n ∈ N. The induction hypothesis is that after a finite amount of time

P (N) ::= N ′ = N.

Base case (k-means is performed in the root node):
It has previously been proven that the k-means algorithm converges (Selim and Ismail, 1984). Lemma 2 states that the k-means update procedure is performed at any node in the tree. It says nothing about the set of points being included in the optimization process at each iteration. By Definition 1, the EM-tree indexes a fixed set of points, X, so the root node is guaranteed to converge as it indexes these points over all iterations. This set of points, X, being indexed will never change between iterations. The root node is performing k-means on the entire data set, X, until convergence by repeated application of the update(C, Y) function. The set of vectors indexed by the child trees of the root node will stabilize by the convergence of k-means happening in the root node. Therefore, the set of indexed vectors for each child tree, Y ⊂ X, will no longer change between iterations.

Constructor case:
Assume the induction hypothesis that any node, n ∈ N, in the tree converges, so that the keys in n are fixed and the nearest neighbors assigned to the child trees become fixed. Consider the set of all the children of node n in the tree, N+1 = {child(record) : record ∈ n}. The set of data points indexed by each child in this set, ∀child ∈ N+1 : leaves(child), is now fixed and does not change between iterations of the EM-tree algorithm. Lemma 2 proves that any tree node performs the k-means operations, and the node will converge once the set of input points has become fixed. This means all child nodes, N+1, are also guaranteed to converge.

This proves that P(N) holds as required for the constructor case. By structural induction we conclude that P(N) holds for all nodes, n ∈ N.


7. Tree Structured Vector Quantization

TSVQ (Gersho and Gray, 1993) applies k-means to an entire data set until convergence, splitting it into m clusters. It then applies k-means recursively on each subset. This recursively splits the data into m clusters until a stopping criterion for the recursive application of k-means has been met. Like the EM-tree, this process also forms an m-way tree. As Lemma 2 indicates, the k-means update rule is applied at each iteration of the optimization process in the EM-tree algorithm. Therefore, it is easy to incorrectly conclude that the EM-tree algorithm is the same as TSVQ. Firstly, EM-tree has a pruning process that can remove clusters contained in internal nodes. However, this can also happen with TSVQ, as k-means can produce cluster centroids with no data points associated with them. In this case, these clusters are usually dropped from the solution, and therefore in the TSVQ algorithm not all splits are guaranteed to be m-way.

TSVQ(m, depth, Y)

1   node = 〈〉
2   if depth == 1
3       for point ∈ Y
4           append (point, 〈〉) to node
5       return node
6   else
7       C is the centroid set
8       C = kmeans(m, Y)
9       for centroid ∈ C
10          neighbors = neighbors(centroid, Y, C)
11          record = (centroid, TSVQ(m, depth − 1, neighbors))
12          append record to node
13      return node

Theorem 5 The EM-tree algorithm is not equivalent to the TSVQ algorithm.

Proof The only node guaranteed to have the same input set of vectors to optimize at each iteration of the EM-tree algorithm is the root node, which has all vectors in the data set, X, as input. The set of points for non-root nodes, Y, can change between iterations because the clusters defined in nodes higher in the tree have not yet converged. This is not the same as the k-means performed in TSVQ. No recursive relationships between clusters are created until k-means has completely converged in the TSVQ algorithm. The initial point set for k-means used in TSVQ is always fixed and does not change between iterations.

The observation to be made is that the basic difference between TSVQ and EM-tree is that TSVQ optimizes level by level, while EM-tree optimizes the entire tree in each iteration. This results in a further difference between the two algorithms regarding the required number of passes over the data, which is discussed in Section 12.


8. Experimental Setup

The following sections of this paper test the proposed theoretical results by measuring the properties of the EM-tree algorithm when clustering document collections.

We have used the INEX 2010 XML Mining collection (De Vries et al., 2011), which is a 144,265 document subset of the INEX 2009 Wikipedia collection. The INEX Wikipedia collection has been used for the evaluation of ad hoc retrieval in the INEX evaluation forum (Arvola et al., 2011). The 2010 XML Mining subset was determined by the reference search run from the ad hoc track, and its documents were used for evaluation of the clustering and classification tasks in the INEX 2010 XML Mining track. INEX is a collaborative forum where researchers from different organizations take different approaches to the same task.

For larger scale clustering we have used the ClueWeb09 Category B collection, consisting of 50 million documents and a vocabulary of 96.1 million terms. The ClueWeb09 Category B collection was introduced for evaluation at the TREC 2009 Web Track (Clarke et al., 2009). It is a subset of the 1 billion document ClueWeb09 collection.

Each of the documents has been represented using bit vectors produced by TopSig. TopSig (Geva and De Vries, 2011) creates dense binary document vectors using random projections and numeric quantization. The bit vectors are compared using the Hamming distance measure. The TopSig paper introduced a variant of k-means that works directly with the binary representations produced by TopSig and has been shown to lead to a one to two order of magnitude increase in efficiency over traditional sparse vector approaches to document clustering. All centroids are also bit vectors, where the centroid is calculated by taking the most frequently occurring bit in each dimension. If a given bit is set in more than half of the vectors associated with the centroid, then it is also set in the centroid. In the TopSig model, the 0 and 1 values represent -1 and +1 values respectively. Therefore, this is also equivalent to accumulating all the vectors and taking their sign, to map the values back to -1 and +1. The TopSig search engine (http://topsig.googlecode.com) was used to produce 4096-bit vectors for each document. The signature density was set to 21 and the log likelihood document weighting (Geva and De Vries, 2011) was used. For both document collections, the term statistics from the Wikipedia were used for the log likelihood document weighting. When indexing the ClueWeb collection, terms not contained in the Wikipedia are assumed to only appear in the document in which they occur. This allows terms without statistics in the Wikipedia to be used by the log likelihood weighting function.
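
As an illustration of the bit vector operations described above, the following Python sketch computes the Hamming distance and a majority-vote centroid for signatures stored as Python integers. The function names and the tie-breaking choice (a tied bit is left unset) are our own assumptions.

def hamming(a, b):
    # number of differing bits between two signatures stored as integers
    return bin(a ^ b).count("1")

def bit_centroid(signatures, width):
    # treat bit value 0 as -1 and 1 as +1, accumulate per dimension, take the sign
    counts = [0] * width
    for s in signatures:
        for i in range(width):
            counts[i] += 1 if (s >> i) & 1 else -1
    centroid = 0
    for i, c in enumerate(counts):
        if c > 0:                  # more +1s than -1s in this dimension
            centroid |= 1 << i
    return centroid

Note the asymmetry: distance comparisons stay in the packed representation, while updating a centroid requires unpacking every associated signature into the integer accumulator. This is why delayed centroid updates matter for the K-tree in Section 10.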

The signatures produced by TopSig and the software used to create them are available from the K-tree SourceForge page (http://sf.net/projects/ktree/files/docclust_ir/). The implementations of the K-tree, k-means, TSVQ and EM-tree algorithms used in this paper are available from the LMW-tree GitHub repository (http://github.com/cmdevries/LMW-tree/releases).

All experiments have been conducted on a system with 2 × 8 core 2GHz AMD Opteron 6128 CPUs and a total of 64GB of main memory. All experiments were single threaded.



Figure 3: INEX 2010 XML Mining collection – Convergence. Average RMSE (y-axis) versus number of iterations, 0 to 20 (x-axis), for TSVQ and EM-tree.

9. Convergence properties

We have tested the convergence properties of the EM-tree algorithm in comparison to TSVQ, as it is conceptually quite similar but not exactly the same algorithm. We used the 144,265 document INEX 2010 XML Mining collection described in the previous section. Documents were represented as 4096-bit vectors. Both algorithms were used to create an order 10 tree with a depth of 3. There are therefore two internal levels of the tree, with the first level containing 10 clusters and the second 100. Occasionally, the EM-tree would prune 1 cluster from the second level, resulting in 99 clusters. The number of iterations of optimization in the TSVQ algorithm is controlled by limiting k-means to the desired number of iterations. EM-tree proceeds by iteratively optimizing the entire tree, and the error can be measured after each iteration.

Figure 3 highlights the convergence properties of the two algorithms as the number of iterations of optimization is increased along the x-axis. Each algorithm has been executed 20 times at each iteration limit with a different random starting condition. The distortion in terms of RMSE was measured at the most fine grained level, which contains 100 clusters. Iteration 0 represents the state when no centroids have been updated. This is the distortion of using randomly selected points as centroids, where no optimization has been performed; the nearest neighbors are associated with the randomly selected points from the data set. Iteration 1 is the state after the centroids have been updated once, once per level for TSVQ and once for the entire tree for EM-tree. Iteration 2 is after they have been updated twice, and so on. The y-axis represents the Root Mean Squared Error (RMSE) between the 100 cluster centroids and the associated TopSig document bit vectors. As the distance measure is the Hamming distance, this represents the expected number of bits difference between a document and its nearest centroid.
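
As we read it, the reported distortion is the root mean squared distance from each document signature to its nearest centroid. A small sketch with a hypothetical function name, assuming the hamming function from the Section 8 sketch:

import math

def rmse(signatures, centroids):
    # root mean squared Hamming distance to the nearest centroid
    total = 0
    for s in signatures:
        d = min(hamming(c, s) for c in centroids)
        total += d * d
    return math.sqrt(total / len(signatures))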


Iterations   TSVQ Average RMSE   EM-tree Average RMSE   p-value
 0           1957.40 ± 3.03      1957.62 ± 2.28         0.79
 1           1808.60 ± 5.65      1810.07 ± 5.56         0.41
 2           1720.61 ± 4.39      1734.64 ± 3.81         3.93 × 10^−13
 3           1689.68 ± 3.04      1712.35 ± 4.72         3.15 × 10^−20
 4           1677.99 ± 3.84      1703.09 ± 4.66         1.18 × 10^−20
 5           1672.83 ± 1.96      1696.83 ± 3.67         1.11 × 10^−25
 6           1671.05 ± 3.07      1697.11 ± 4.68         2.29 × 10^−22
 7           1669.95 ± 3.49      1692.51 ± 3.89         3.09 × 10^−21
 8           1668.43 ± 2.30      1690.46 ± 6.15         1.48 × 10^−17
 9           1667.98 ± 2.56      1687.55 ± 3.21         1.00 × 10^−22
10           1666.97 ± 2.53      1687.33 ± 5.05         1.37 × 10^−18
11           1665.35 ± 2.94      1686.37 ± 4.68         2.31 × 10^−19
12           1666.05 ± 2.88      1683.58 ± 2.41         2.11 × 10^−22
13           1664.97 ± 1.80      1681.41 ± 3.45         6.66 × 10^−21
14           1665.02 ± 2.21      1681.09 ± 4.64         1.53 × 10^−16
15           1663.53 ± 2.18      1679.33 ± 3.76         1.06 × 10^−18
16           1664.40 ± 2.47      1679.38 ± 3.59         7.47 × 10^−18
17           1664.51 ± 3.07      1677.82 ± 2.93         1.38 × 10^−16
18           1663.28 ± 2.20      1677.85 ± 3.54         4.02 × 10^−18
19           1663.57 ± 2.56      1676.30 ± 4.15         4.07 × 10^−14
20           1663.30 ± 2.94      1675.86 ± 3.45         6.41 × 10^−15

Table 1: INEX 2010 XML Mining Collection – Convergence

The graph highlights that the EM-tree algorithm has the general behavior of convergence as the number of iterations increases. Note that these algorithms perform a different number of passes over the data; this is discussed when using the algorithms in a streaming setting in Section 12.

Table 1 contains the RMSE for each algorithm at each iteration. These numbers are the values for the plots in Figure 3. Additionally, there is a p-value for a two tailed t-test between the two algorithms. There is no statistically significant difference between EM-tree and TSVQ for iterations 0 and 1, where p > 0.05. For iterations 2 and higher, TSVQ converges to lower distortion, and therefore statistically significantly better solutions than EM-tree, where p < 0.05. We suggest that changing the input vectors between iterations for non-root nodes introduces noise into the optimization process for EM-tree. However, there are differences in the way EM-tree performs optimization that can lead to better solutions in a streaming setting or when combined with sampling. These two settings are discussed in Sections 11 and 12.

In Table 1, the RMSE does not monotonically decrease for either algorithm. In the case of TSVQ, when changing the maximum number of iterations, the splits determined in a given level of the tree are different. Therefore, the input vectors to lower levels of the tree are not the same, as the maximum number of iterations for k-means has changed.


Therefore, it is possible to converge to a higher RMSE solution. The same applies to EM-tree, as discussed in Theorem 5, because the set of input vectors to any non-root node changes between iterations.

10. Comparison to other Algorithms

We have performed experiments to compare the EM-tree to K-tree, k-means and TSVQ as the number of clusters varies. This allows visualization of the trends of RMSE and runtime as the number of clusters is changed. These experiments used the 144,265 document INEX 2010 XML Mining collection described earlier. All algorithms used 4096-bit TopSig document vectors.

The K-tree algorithm builds an m-way tree as data arrives. The size of tree nodes is controlled by the tree order. Once a node becomes full, it is split in 2 using the k-means algorithm and the 2 centroids are promoted to create a new node. The K-tree was run with tree orders of 1000, 750, 500, 400, 300, 250, 200, 150, 100 and 75 to produce a different number of clusters. Note that as the tree order becomes smaller, K-tree produces more clusters.

The computational advantages of K-tree over TSVQ are no longer apparent when using the TopSig bit vector representations. When using real valued vectors the K-tree has been shown to be faster than TSVQ (Geva, 2000). The centroids contained in the search path of a vector inserted into the tree are updated upon every insertion. For bit vector representations, this is relatively slow compared to using integer or real valued vectors. For a centroid to be updated, all the associated nearest neighbors must be unpacked from the bit representation and added to an integer valued vector. In contrast, the nearest neighbor comparison using the Hamming distance works directly with the bit vector without unpacking each bit into a larger representation. Therefore, K-tree becomes slower than TSVQ or EM-tree because updating the centroids dominates the computational cost. However, delaying updates of the centroids reduces this cost. In these experiments, the delayed update K-tree only updates the centroids along the insertion path of every 1000th vector inserted into the tree. Note that this does not change the computational complexity of the algorithm; it reduces the large constant overheads associated with updating means when using bit vectors. The delayed update K-tree was run with tree orders of 1000, 750, 500, 400, 300, 250, 185, 138, 85, 65 to approximately match the number of clusters produced by the K-tree with updates for every insertion.

The TSVQ algorithm was used to build trees that are 3 levels deep. The tree orders were varied to match the number of clusters produced by K-tree. The tree orders used were 15, 17, 21, 24, 28, 31, 36, 42, 53 and 63.

The EM-tree algorithm was used to build trees that are 3 levels deep. The tree orders were varied to match the number of clusters produced by K-tree. Note that the tree orders are different to TSVQ because the EM-tree prunes clusters during optimization. The tree orders used were 15, 18, 23, 26, 33, 39, 46, 61, 86 and 107.

The k-means algorithm was used to produce approximately the same number of clusters as K-tree. The numbers of clusters produced were 230, 310, 470, 590, 810, 1010, 1300, 1800, 2900 and 4000.


All algorithms were limited to 10 iterations of optimization for a fair comparison. This includes the k-means algorithm used in K-tree for splitting nodes. All algorithms were run 20 times with different starting conditions. All values in Figures 4 and 5 are average values over the 20 runs.

Figure 4 displays the RMSE as the number of clusters changes. Only a small difference in RMSE of around 6 bits is required for a statistically significant difference between results. This was observed when testing RMSE in the streaming setting in Table 3. For all of the points in Figure 4, the variance of RMSE is low, with a standard deviation between 0.5 and 5 bits for any algorithm. This is the same variance as reported in Table 3. The only exception was with K-tree when producing 4000 clusters, with a standard deviation of 10 bits, and 16 bits for the delayed update K-tree. Therefore, the differences in RMSE in Figure 4 are likely to be statistically significant, as they are greater than 6 bits.

The RMSE of the K-tree algorithm does not always decrease as the number of clusters increases. This is due to decreasing the tree order to achieve the desired number of clusters. The first increase in RMSE, while also increasing the number of clusters, is when the K-tree goes from a 2 level tree to a 3 level tree between 500 and 750 clusters. The deeper tree introduces more inaccuracies to the nearest neighbor search and therefore the error increases even though there are more clusters. The other increase is from 3000 to 4000 clusters, where the tree order becomes smaller. We suggest that once the tree order becomes too small, in this case going from 100 to 75, there are not enough vectors to perform meaningful learning with k-means. In fact, the decrease in distortion as the tree order increases does not behave like the other algorithms at all. It is much flatter, as can be seen in Figure 4. We suggest that this is due to the tree order being decreased further and further, and the effect this has on the ability of k-means to perform meaningful splits of the data.

The K-tree algorithm without delayed updates is not appropriate for clustering TopSig document vectors, due to the high runtime caused by many updates of centroids. The delayed update K-tree addresses this problem and provides the best runtimes, but at the cost of higher RMSE. It also appears that a small tree order adversely affects the K-tree algorithm. We suggest this is because the k-means algorithm does not have enough information to effectively learn how to split tree nodes in two.

For in memory clustering of TopSig document signatures, the TSVQ algorithm appears to be the best choice for a single shot approach to learning when the best trade-off between RMSE and runtime is required. It is almost the fastest algorithm, with only the delayed update K-tree being marginally faster for large numbers of clusters, but the delayed update K-tree has much higher RMSE. The k-means algorithm provides the lowest RMSE solution but has a relatively high computational cost. However, the EM-tree and K-tree algorithms have other advantages. Both support incremental clustering of the data. In the case of the K-tree, it is incrementally updated for every vector inserted into the tree. The EM-tree algorithm can take any m-way tree as input and optimize it. Therefore, every now and then, 1 or 2 iterations of the algorithm can be run to adapt the tree to newly inserted data. TSVQ has none of these properties. Once the tree is built, it is final. Vectors can be inserted into the m-way tree produced by TSVQ, but the algorithm has no means to update it. However, an m-way tree can be built using a sample of the data and TSVQ. Then the EM-tree can be used to perform 1 or 2 iterations of optimization over the whole data set, based on the better optimized tree found by TSVQ. This is addressed in Section 11 when clustering a large scale document collection of 50 million documents.


Figure 4: INEX 2010 XML Mining Collection – RMSE vs Number of Clusters. Average RMSE (y-axis) versus average number of clusters (x-axis) for k-means, TSVQ, EM-tree, K-tree and K-tree with delayed updates.

Figure 5: INEX 2010 XML Mining Collection – Runtime vs Number of Clusters. Average runtime in seconds (y-axis) versus average number of clusters (x-axis) for the same five algorithms.

EM-tree also has advantages over TSVQ in the streaming setting, as described in Section 12.


Algorithm                                  Clusters   RMSE      Runtime
K-tree with delayed updates                103,935    1296.76   16 hours
EM-tree initialized with sampled TSVQ      104,021    1242.38   9.5 hours

Table 2: ClueWeb 2009 Category B – 104,000 Clusters

11. Comparison to K-tree on ClueWeb

All of the previous experiments were conducted on the relatively small 144,265 document INEX 2010 XML Mining collection. This was to enable reasonable running times with multiple runs of the algorithms, and matching of the number of clusters by trying different tree orders. Clustering the 50 million document ClueWeb collection can take upwards of 10 hours. Therefore, the experiments in this section contain only single runs of the algorithms.

The TopSig K-tree with delayed updates has been used in a paper submitted for peer review (De Vries et al., 2013). It was used to cluster the 50 million document ClueWeb collection into 140,000 clusters. These clusters were used for ad hoc retrieval, where only the 8 most highly ranked clusters have to be searched to retain search quality. This allowed 13 times fewer documents to be retrieved than the previous best known result. Only 0.1% of all documents were contained in the 8 most highly ranked clusters when averaged across all 50 queries used for evaluation. Complete clustering of 50 million documents into hundreds of thousands of clusters has not been reported in the literature prior to this result, let alone by a single threaded, non-distributed algorithm. Other approaches have used sampling to address the scalability of clustering 50 million documents (Kulkarni and Callan, 2010). The clusters found using the sample were used to associate documents to the cluster centroids. However, the centroids were not updated, and therefore out of vocabulary terms were encountered when ranking clusters. The approach presented here performs complete clustering of the ClueWeb collection, where centroids are updated and iteratively optimized for all documents in the collection.

The EM-tree algorithm can be initialized with any m-way tree. In all previous experiments the tree has been initialized using random seeding. In the following experiments we initialized the EM-tree using an m-way tree produced by TSVQ from a sample. We randomly sampled 2 million documents from the 50 million document ClueWeb collection and built an m-way tree using TSVQ limited to 5 iterations of optimization. The choice of 5 iterations was influenced by the results in Table 1, where on average 97% of the optimization had occurred by the 5th iteration when using the smaller INEX 2010 XML Mining collection. The resulting m-way tree was used to initialize the EM-tree algorithm, which was run for 2 iterations. The tree order was 331 and the tree was 3 levels deep.

We built a K-tree with delayed updates for every 1000th vector on the 50 million document ClueWeb collection. All vectors were reinserted into the tree once the tree had been built. The tree order was 1000 and the resulting tree was 3 levels deep.

Table 2 contains the results of the experiments. The EM-tree initialized with TSVQ is able to produce lower RMSE solutions in less time than the delayed update K-tree. As previous results were stable across different random initializations, we conclude that the same properties hold here, and therefore it is highly likely that a difference of 56 bits in RMSE is statistically significant.


Running each algorithm 20 times is impractical in this case, as the experiments would take approximately 20 days to complete. Because the EM-tree algorithm can be initialized with any m-way tree, a sample based tree built with TSVQ has provided lower distortion solutions than the previous approach based upon the K-tree with delayed updates.

12. Streaming EM-tree

We now consider the properties of the EM-tree algorithm in comparison to the TSVQ algorithm when the data set is too large to fit in main memory and is streamed sequentially from disk for each pass over the data. The goal in this setting is to minimize the number of passes over the data. Furthermore, random access to the data set is not available and each example is presented one at a time. Note that the K-tree is not considered in this situation. While the K-tree algorithm performs incremental learning, where examples are presented one by one, it is not streaming because it has to revisit previous examples from the stream when a leaf node becomes full and is split in two.

The main distinction between the EM-tree and TSVQ algorithms is that EM-tree optimizes the entire tree structure at each iteration while TSVQ optimizes the tree structure at each level before proceeding to the next. Therefore, we perform an analysis where the number of iterations of optimization is fixed. We then determine the number of passes over the data required by each algorithm.

Both algorithms require the data set to be streamed for the comparison of points to the centroids, and for the recalculation of the means. As the algorithms are presented in prior sections, two passes over the data set are required to perform these operations. However, if an extra vector is kept for each centroid, it can be used as an accumulator when calculating the means of the vectors. Once all points in the data set have been streamed, each accumulator can be divided by the associated number of vectors. Therefore, one iteration of the optimization process of k-means in TSVQ, or of the insert and update procedures in EM-tree, can be performed in a single pass over the data.
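
A minimal sketch of this single pass trick follows, shown for one flat set of centroids and real valued points; in the EM-tree the same accumulators are kept at every node along the insertion path. The function name, the dist2 helper from the k-means sketch, and the handling of empty clusters (keep the old centroid) are assumptions.

def single_pass_update(centroids, stream, dim):
    # one pass: nearest neighbor assignment and mean recalculation together
    acc = [[0.0] * dim for _ in centroids]      # one accumulator vector per centroid
    cnt = [0] * len(centroids)
    for x in stream:                            # each example is seen exactly once
        i = min(range(len(centroids)), key=lambda j: dist2(centroids[j], x))
        for d in range(dim):
            acc[i][d] += x[d]
        cnt[i] += 1
    new = []
    for i in range(len(centroids)):
        if cnt[i]:
            new.append(tuple(a / cnt[i] for a in acc[i]))   # divide once at stream end
        else:
            new.append(centroids[i])
    return new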

The seed procedure introduced in Section 5 performs a pass over the entire data set for each level in the tree. It is TSVQ limited to 1 iteration; i.e. m points are selected, their nearest neighbors are associated and the centroids are updated. Therefore, we have defined a single pass approach to seeding the m-way tree structure in the following SinglePassSeed pseudocode, where a split occurs in the tree after s vectors have been seen by a given leaf. The algorithm proceeds by creating a tree containing a linked list of centroids for the first m points from the stream. In the streaming setting the points are not stored in the tree, so only the cluster centroids are created. Points are then inserted into the tree one at a time, taking a point to be a new centroid once a leaf has seen s insertions.

Alternatively, a tree can be built using TSVQ on a sample of the data that fits in memory. This can then be used to initialize the EM-tree algorithm.


SinglePassSeed(m, depth, Y, s)

1   for each of the first m points from the stream
2       create a branch that is depth − 1 levels deep containing 1 node at each level
3       use the point for the 1 centroid required in each node
4   for the remaining points in the stream
5       insert the point into the tree following the search path
6       if the lowest level centroid has seen s insertions
7           make the point a new centroid
8           if the node is full
9               keep traversing up the tree until a non-full node is reached
10              if all nodes are full
11                  discard the point
12              else
13                  create a new branch to tree depth − 1
14                  use the point as the centroid in each node in the new branch
15      update the centroids along the insertion path

Consider two d level m-way trees where all internal nodes are full, one built using the EM-tree algorithm and the other using TSVQ. Because all the internal nodes are full, both will have the same number of clusters. We define the number of iterations of optimization as i. As described earlier, a single iteration requires the entire data set to be streamed from disk. In the case of TSVQ, each non-leaf level requires the entire data set to be streamed from disk i times by the k-means algorithm. Therefore, in a d level tree produced by TSVQ, (d − 1)i passes over the data will be performed. For an iteration of optimization in the EM-tree algorithm, each vector is read and then inserted into the tree while updating the accumulators along the search path, requiring i passes over the data. For a d = 5 level m-way tree performing i = 2 iterations of optimization, TSVQ requires 8 passes over the data while EM-tree requires only 2. As the tree depth, d, increases, the disparity in the relationship increases. This says nothing about the convergence properties for the number of passes over the data. However, in Figure 3 it can be seen that 80% of the optimization happens after 2 iterations for both TSVQ and EM-tree.
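
The pass counts above reduce to a one line calculation; a trivial sketch with a hypothetical function name:

def passes(d, i):
    # passes over the data for i iterations of optimization of a d level tree
    return {"TSVQ": (d - 1) * i, "EM-tree": i}

# passes(5, 2) == {"TSVQ": 8, "EM-tree": 2}, matching the example in the text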

Figure 6 compares EM-tree and TSVQ in the streaming setting. The INEX 2010 XML Mining collection described earlier was used. For the purpose of these experiments we do not stream the data from disk but simulate it by streaming data from main memory. Four level trees were produced by both EM-tree and TSVQ to give approximately the same number of clusters. The distortion of the lowest level, most fine grained clusters in the tree was used. EM-tree was run with tree orders of 4, 5, 6, 7, 8, 10, and 12. TSVQ was run with tree orders of 4, 5, 6, 7, 8, 9, and 10. TSVQ was run using 2 iterations of k-means when recursively splitting the data set; the nearest neighbors are associated with centroids twice, and the centroids are updated twice. For both algorithms, 2 iterations perform approximately 80% of the optimization from the initial randomly initialized state. This can be seen in Figure 3, where 80% of the decrease in distortion occurs by iteration 2. Running TSVQ to create a 4 level tree, where k-means is run for 2 iterations, requires 6 passes over the data; i.e. 2 passes over the data for each non-leaf level in the tree. Therefore, EM-tree was run for 6 iterations to perform the same number of passes over the data in the simulation.


Figure 6: INEX 2010 XML Mining Collection – Streaming EM-tree vs TSVQ. Average RMSE (y-axis) versus average number of clusters (x-axis).

                 TSVQ (6 passes over data)        EM-tree (6 passes over data)
Average Clusters   Average RMSE    Average Clusters   Average RMSE      p-value
  64               1760.57 ± 4.6     64               1733.63 ± 3.7     5.13 × 10^−22
 125               1728.07 ± 4.2    124.7             1700.38 ± 3.6     1.89 × 10^−23
 216               1702.82 ± 4.0    213.4             1680.73 ± 4.2     8.43 × 10^−21
 343               1677.70 ± 5.1    331.6             1660.51 ± 4.4     9.24 × 10^−14
 512               1654.34 ± 3.9    463.25            1648.53 ± 2.6     2.38 × 10^−6
 729               1632.82 ± 4.9    762.25            1621.31 ± 2.7     3.79 × 10^−11
1000               1613.36 ± 5.3   1047.05            1600.51 ± 2.5     6.21 × 10^−12

Table 3: INEX 2010 XML Mining Collection – Streaming EM-tree vs TSVQ

Note that each tree was built 20 times, each time from a different random initial state, to measure the variance. The result is that EM-tree produces lower distortion solutions in 6 iterations than TSVQ does in 2 iterations, while both algorithms still perform 6 passes over the data.

Table 3 contains the numerical values for the plots in Figure 6. It also includes the p-value for a two tailed t-test between the 20 samples of RMSE for each algorithm. In all cases, EM-tree produces distortion statistically significantly lower than TSVQ, with a p-value < 0.05 and almost equal to 0. This is even the case when EM-tree produces a slightly smaller number of clusters, which slightly disadvantages it, because a smaller number of clusters produces a higher distortion solution. This can be seen in the downward trend in distortion in Figure 6 as the number of clusters increases. Note that the exact number of clusters produced by EM-tree can not be controlled because of the pruning process.


The same applies to TSVQ, as k-means does not always produce k clusters; some centroids can have no points associated with them.

In conclusion, the EM-tree performs fewer passes over the data for each iteration of optimization. This allows the algorithm to do something that TSVQ cannot: it can perform only 1 or 2 passes over the data, whereas TSVQ, for a 4 level tree, requires 3 or 6 passes over the data.

13. Parallel and Distributed EM-tree

The EM-tree algorithm is particularly amenable to simple parallel and distributed implementations due to the nature of the optimization process. Unlike K-tree or TSVQ, the entire tree is frozen at each iteration. The data points can be inserted according to the nearest neighbor search path while updating the accumulators along the way. Once all of the points have been reassigned to leaves, the accumulators can be synchronized between all threads or compute nodes co-operating in the algorithm. Each thread or compute node can work independently without any communication with other threads or compute nodes. The only time communication is needed is when the tree is synchronized after inserting all points.
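
The synchronization step amounts to summing the per-worker accumulators. A minimal sketch with hypothetical names, where each worker has produced an (acc, cnt) pair for its shard using the single pass scheme of Section 12:

def merge_partials(partials):
    # partials: one (acc, cnt) pair per thread or compute node, each produced
    # independently; this merge is the only communication in an iteration
    acc = [row[:] for row in partials[0][0]]    # copy so worker state is not aliased
    cnt = list(partials[0][1])
    for a, c in partials[1:]:
        for i in range(len(cnt)):
            cnt[i] += c[i]
            for d in range(len(acc[i])):
                acc[i][d] += a[i][d]
    return acc, cnt

The new means are then obtained by dividing each merged accumulator by its count, exactly as in the streaming sketch.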

The K-tree algorithm updates the model for every vector that is inserted into the tree. Additionally, this can cause nodes in the tree to split. All of these operations must be synchronized and require communication between threads or compute nodes.

In a distributed implementation, the data set is likely to be very large and streamed from disk. In this situation, as highlighted in Section 12, the EM-tree algorithm performs 1 pass over the data per iteration, whereas TSVQ performs as many passes as there are non-leaf levels in the tree.

14. Future Work

The experiments in this paper tested a single threaded implementation of the EM-tree algorithm. In future work, the parallel, distributed and streaming EM-tree can be implemented and tested on real data. This implementation will likely be able to cluster collections containing billions of documents in under 24 hours.

Further studies could investigate the effect of having different order nodes at different levels of the m-way tree. The intuition is that the root node would benefit from containing more centroids. If there are too few points in the root node, a large fraction of the data set may be equidistant from all centroids due to the curse of dimensionality. In high dimensional spaces, such as those of documents, only the local neighborhood allows differentiation. These properties were analyzed by De Vries and Geva (2012) for TopSig document vectors.

15. Conclusion

The novel EM-tree clustering algorithm was presented. The m-way nearest neighbor search tree data structure and the algorithm have been formally defined. Furthermore, other geometric clustering algorithms such as k-means, TSVQ, agglomerative hierarchical clustering and K-tree can be represented using the m-way tree. It was proven that the EM-tree algorithm converges, and this was also validated experimentally.


The algorithm can produce lower distortion clusterings in the streaming setting than TSVQ, where the goal is to minimize the number of passes over the data set. Furthermore, the EM-tree can be initialized with a tree built using a sample and the TSVQ algorithm. This provided lower distortion clusterings in less time than the previous best known large scale approach to clustering 50 million documents into hundreds of thousands of clusters using the K-tree with delayed updates. Clustering at this scale has applications for search engines, where the clusters can be used to reduce the number of documents that are required to be searched, by only searching the most relevant document clusters for a given query.

References

D. Aloise, A. Deshpande, P. Hansen, and P. Popat. NP-hardness of Euclidean sum-of-squares clustering. Machine Learning, 75(2):245–248, 2009.

D. Arthur, B. Manthey, and H. Roglin. k-means has polynomial smoothed complexity. Annual IEEE Symposium on Foundations of Computer Science, 0:405–414, 2009. ISSN 0272-5428. doi: 10.1109/FOCS.2009.14.

P. Arvola, S. Geva, J. Kamps, R. Schenkel, A. Trotman, and J. Vainio. Overview of the INEX 2010 ad hoc track. Comparative Evaluation of Focused Retrieval, pages 1–32, 2011.

L. Bottou and Y. Bengio. Convergence properties of the K-means algorithms. In Advances in Neural Information Processing Systems 7, 1995.

C.L. Clarke, N. Craswell, and I. Soboroff. Overview of the TREC 2009 web track. Technical report, DTIC Document, 2009.

C.M. De Vries and S. Geva. Pairwise similarity of TopSig document signatures. In Proceedings of the Seventeenth Australasian Document Computing Symposium, pages 128–134. ACM, 2012.

C.M. De Vries, L. De Vine, and S. Geva. Random indexing K-tree. ADCS 2009, page 43, 2009.

C.M. De Vries, R. Nayak, S. Kutty, S. Geva, and A. Tagarelli. Overview of the INEX 2010 XML mining track: Clustering and classification of XML documents. Comparative Evaluation of Focused Retrieval, pages 363–376, 2011.

C.M. De Vries, S. Geva, and A. Trotman. Distributed information retrieval: Collection distribution, selection and the cluster hypothesis for evaluation of document clustering. Submitted for review July 2013, 2013.

A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), pages 1–38, 1977.

A. Gersho and R.M. Gray. Vector quantization and signal compression. Kluwer Academic Publishers, 1993.

S. Geva. K-tree: a height balanced tree structured vector quantizer. In Neural Networks for Signal Processing X: Proceedings of the 2000 IEEE Signal Processing Society Workshop, volume 1, pages 271–280. IEEE, 2000.

S. Geva and C.M. De Vries. TopSig: topology preserving document signatures. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, CIKM '11, pages 333–338, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0717-8. doi: 10.1145/2063576.2063629.

A. Kulkarni and J. Callan. Document allocation policies for selective searching of distributed indexes. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management, pages 449–458. ACM, 2010.

S. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, March 1982. ISSN 0018-9448.

S.Z. Selim and M.A. Ismail. K-means-type algorithms: a generalized convergence theorem and characterization of local optimality. IEEE Transactions on Pattern Analysis and Machine Intelligence, (1):81–87, 1984.

M. Steinbach, G. Karypis, V. Kumar, et al. A comparison of document clustering techniques. 2000.

X. Wu, V. Kumar, J. Ross Quinlan, J. Ghosh, Q. Yang, H. Motoda, G.J. McLachlan, A. Ng, B. Liu, P.S. Yu, et al. Top 10 algorithms in data mining. Knowledge and Information Systems, 14(1):1–37, 2008.


Chapter 17

Conclusions, Implications and Future Directions

This thesis presented new approaches to knowledge representation, learning algorithms, evaluation and information retrieval with respect to document clustering. The primary focus was on the efficiency and scalability of the algorithms, which allowed a 13 fold reduction in the number of documents that have to be searched in a cluster-based search engine compared to the best known approach on large scale document collections.

The TopSig representation was introduced and used for improving the efficiency of document clustering by an order of magnitude with no reduction in cluster quality [90]. Additionally, further gains in efficiency can be made by sacrificing quality through shorter document signatures. This required the development of learning algorithms that work directly with the binary representation of TopSig document signatures.

Two algorithms based upon the m-way nearest neighbor search tree were presented for further increases in the efficiency of clustering. The TopSig K-tree [61] allowed document clustering at a scale not previously reported in the literature, by clustering 50 million documents into 140,000 clusters. It did this in a single thread of execution. The EM-tree algorithm [55] allowed further increases in single threaded efficiency. Additionally, the EM-tree algorithm has properties that make it suitable for further scaling via streaming, distributed and parallel implementations of the algorithm. Both of these algorithms allow a further trade-off between cluster quality and efficiency. However, without the scalability of these approaches to produce hundreds of thousands of clusters, the improvements seen in CBM625 would not have been realized.

The results presented suggest that clustering of the entire searchable Internet of approximately 45 billion documents, into a hierarchy of millions of clusters, is a tractable problem. The implications and applications of clustering at such a scale are unknown. However, it is easy to see how these scalable hierarchical algorithms can be used to build a machine learned hierarchical browsable catalog of the entire searchable Internet. Additionally, the clusters found by these scalable approaches may be able to reduce the cardinality of a data set that can be fed into more expensive learning algorithms to fine tune the clustering.

Category based evaluations require labeling of an entire document collection with multiple categories. When hundreds of thousands of categories exist for a general purpose knowledge repository such as the Wikipedia or the Internet, this becomes a daunting problem. There have been reports of labeling document collections causing "terrible memories" [152] for assessors, even though the document collection consisted of less than a million documents, had hundreds of categories, had a tool to help automate classification based on rules, and was labeling short news wire articles. The problem of large scale assessment has already been addressed via the use of pooling in ad hoc information retrieval evaluation. Additionally, queries used for evaluation are often specific and well defined, unlike some of the broad, lofty categories used for document categorization.

This thesis highlights that the cluster hypothesis holds for very fine grained document clusters when they are used for ad hoc retrieval [61]. This was further reinforced by the analysis of the topology of document signatures [58], which indicated that only the local neighborhood of the representation allows differentiation of meaning. Therefore, document clusters that are made too large are likely to contain many equidistant documents. It was shown that fine grained clusters are more effective for cluster-based search.

The novel evaluation of document clustering at INEX [171, 62] using ad hoc relevance judgements produced several findings. The use of ad hoc relevance judgements overcomes the short-falls of using categories. Category based evaluation of document clustering only has use for classification. However, ad hoc retrieval based evaluation has a use case for cluster-based retrieval. We suggest that evaluating document clustering in the situation of its use is more sound than a classification based evaluation. If you want to do classification, build a classifier. Document clustering is driven by the idea of placing similar items together. Just because two documents exist in a category does not necessarily mean they are similar. The cluster hypothesis is directly driven by the similarity between documents, where documents that are similar to each other are likely to be relevant to the same information need.

This thesis investigates the Divergence from a Random Baseline approach

to the evaluation of clustering [55]. It allows for the differentiation of ineffective


clusterings that perform no useful learning with respect to a given measure of cluster quality. This enabled the identification of ineffective clusterings submitted to the INEX 2010 XML Mining track.
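As a minimal sketch of the idea, the Python fragment below scores a clustering as its quality minus the quality of a random baseline that has exactly the same cluster size distribution; purity stands in here for the quality measure, and the example is illustrative rather than the evaluation code used at INEX.

    import numpy as np

    def purity(labels, clusters):
        # Fraction of documents that carry the majority label of their cluster.
        total = 0
        for c in np.unique(clusters):
            _, counts = np.unique(labels[clusters == c], return_counts=True)
            total += counts.max()
        return total / len(labels)

    def divergence_from_random_baseline(labels, clusters, measure, seed=0):
        # Permuting the assignments keeps the cluster size distribution
        # intact but destroys any learned structure.
        baseline = np.random.default_rng(seed).permutation(clusters)
        return measure(labels, clusters) - measure(labels, baseline)

    labels = np.array([0, 0, 0, 1, 1, 1])
    clusters = np.array([0, 0, 0, 1, 1, 1])  # a perfect clustering
    print(divergence_from_random_baseline(labels, clusters, purity))

An ineffective clustering scores near zero because it does no better than its size-matched random counterpart.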

17.1 Future Directions

Due to the broad and inter-related nature of the topics in this thesis, there are

a multitude of possible directions that future research may take. The following

material highlights some possibilities that I think are interesting. It is by no

means an exhaustive list.

Clustering billions of documents into fine grained clusters is an exciting avenue for the future of this research. The results already indicate that streaming, parallel and distributed implementations of the EM-tree algorithm are straightforward. Once clusters at this scale are available, there are many new possibilities for their use in different applications.

One possible application of fine grained clusters is the sparsification of pairwise distance matrices to include only the local neighborhood of the objects being clustered, as in the sketch below.
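A minimal sketch of this idea follows; it stores pairwise distances only for same-cluster pairs, so storage grows with the cluster sizes rather than quadratically with the collection. The dictionary representation and the Euclidean distance are assumptions made for brevity.

    import numpy as np

    def sparse_local_distances(X, assignments):
        # Compute distances only between objects in the same cluster;
        # cross-cluster pairs are omitted entirely.
        entries = {}
        for c in np.unique(assignments):
            ids = np.flatnonzero(assignments == c)
            for a, i in enumerate(ids):
                for j in ids[a + 1:]:
                    entries[(int(i), int(j))] = float(np.linalg.norm(X[i] - X[j]))
        return entries

    X = np.random.default_rng(0).normal(size=(8, 3))
    assignments = np.array([0, 0, 0, 1, 1, 2, 2, 2])
    D = sparse_local_distances(X, assignments)
    print(len(D), "stored distances instead of", 8 * 7 // 2)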

Large scale clustering can be applied to other domains such as feature learning for image classification. Recent results [43] used k-means to cluster 57 million images into 150,000 clusters; the approach outlined in the paper uses 30 machines and takes 3 days to train. This is a new domain to which the algorithms in this thesis can be applied.

Analysis of documents by breaking them into separate sub-document pieces could prove useful for document clustering and cluster-based retrieval. The cluster hypothesis could be stronger at the sub-document level. This also allows documents to be allocated to one or more clusters without using fuzzy clustering algorithms; that is, separate parts of a document can belong to different clusters, as illustrated in the sketch below. The primary motivation for this approach is dealing with multi-topic documents.
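As an illustrative sketch, the fragment below assigns each passage of a document to its nearest cluster centroid; a document then belongs to the union of its passages' clusters. The passage segmentation and the passage vectors are assumed inputs.

    import numpy as np

    def document_memberships(passages_per_doc, passage_vectors, centroids):
        # Nearest centroid for every passage vector.
        dists = np.linalg.norm(
            passage_vectors[:, None, :] - centroids[None, :, :], axis=-1)
        nearest = dists.argmin(axis=1)
        # A document inherits the clusters of all of its passages.
        return [sorted({int(nearest[i]) for i in ids}) for ids in passages_per_doc]

    vectors = np.array([[0.0, 0], [0, 0.1], [5, 5], [5.1, 5]])  # 4 passages
    passages_per_doc = [[0, 2], [1], [3]]  # document 0 spans two topics
    centroids = np.array([[0.0, 0], [5, 5]])
    print(document_memberships(passages_per_doc, vectors, centroids))
    # document 0 belongs to both clusters without any fuzzy clustering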

The tree based algorithms presented in this thesis could be adapted to pro-

duce soft clusters where an example can be associated with more than one

cluster at any given level in the tree.

Automatically learning all the parameters for building a cluster tree would be useful. The learning approaches could then be applied to a large document collection to produce the tree that best fits the data according to some measure of fitness. This requires determining the appropriate signature length for each cluster in the tree, and the best tree structure in terms of splits and tree depth.

Different document representations can be used with document signatures.


It would be interesting to investigate the effect of different representations on document clustering with signatures. For example, simply using term frequencies with signatures may prove effective for document clustering; a sketch of this idea follows.
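As a sketch of the simplest such representation, the fragment below builds signatures from raw term frequencies by random projection, in the spirit of TopSig [90]: each term receives a fixed random index vector, a document is the frequency-weighted sum of its terms' vectors, and the signature keeps only the sign bits. The dense ±1 index vectors and the tiny vocabulary are simplifying assumptions.

    import numpy as np

    def tf_signatures(term_freqs, dim=64, seed=0):
        vocab = sorted({t for doc in term_freqs for t in doc})
        rng = np.random.default_rng(seed)
        index_vectors = {t: rng.choice([-1.0, 1.0], size=dim) for t in vocab}
        sigs = []
        for doc in term_freqs:
            v = sum(tf * index_vectors[t] for t, tf in doc.items())
            sigs.append((v > 0).astype(np.uint8))  # one bit per dimension
        return np.array(sigs)

    def hamming(a, b):
        return int(np.count_nonzero(a != b))

    docs = [{"cluster": 3, "tree": 1},
            {"cluster": 3, "search": 1},
            {"query": 4}]
    S = tf_signatures(docs)
    # Documents dominated by the same term collide; the unrelated one is far.
    print(hamming(S[0], S[1]), "<", hamming(S[0], S[2]))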

The cluster-based retrieval experiments indicated that squaring BM25 document weights proved more effective with respect to the cluster hypothesis. The query was also compared to documents using the inner product rather than the cosine of the angle between the query and documents. Therefore, squaring BM25 weights and using the inner product as the similarity measure may well prove effective for improving the quality of document clusters. Furthermore, inner product based compressed representations such as Reflexive Random Indexing may prove more effective for representing documents based on BM25 in document signatures.
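A minimal sketch of this weighting scheme follows. It computes Okapi BM25 term weights [190] per document, raises them to a power, and scores a query by inner product; the toy collection and the parameter values (k1 = 1.2, b = 0.75) are illustrative assumptions rather than the tuned settings used in the experiments.

    import math
    from collections import Counter

    def bm25_weights(docs, k1=1.2, b=0.75):
        # Per-document BM25 term weights over a list of tokenized documents.
        N = len(docs)
        avgdl = sum(len(d) for d in docs) / N
        df = Counter(t for d in docs for t in set(d))
        weights = []
        for d in docs:
            w = {}
            for t, f in Counter(d).items():
                idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
                w[t] = idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(d) / avgdl))
            weights.append(w)
        return weights

    def score(query_terms, doc_weights, power=2):
        # Inner product of the query with element-wise powered document weights.
        return sum(doc_weights.get(t, 0.0) ** power for t in query_terms)

    docs = [["clustering", "document", "tree"],
            ["search", "engine", "index"],
            ["clustering", "clustering", "search"]]
    W = bm25_weights(docs)
    print(sorted(range(3), key=lambda i: score(["clustering"], W[i]), reverse=True))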


Bibliography

[1] D. Achlioptas. Database-friendly random projections: Johnson-

Lindenstrauss with binary coins. Journal of Computer and System Sci-

ences, 66(4):671–687, 2003.

[2] N. Ailon, R. Jaiswal, and C. Monteleoni. Streaming k-means approxima-

tion. Advances in Neural Information Processing Systems, 2009.

[3] James Allan and Hema Raghavan. Using part-of-speech patterns to reduce

query ambiguity. In Proceedings of the 25th annual international ACM

SIGIR conference on Research and development in information retrieval,

pages 307–314. ACM, 2002.

[4] G. Amati and C.J. van Rijsbergen. Probabilistic models of information re-

trieval based on measuring the divergence from randomness. ACM Trans-

actions on Information Systems (TOIS), 20:357–389, October 2002.

[5] V.N. Anh, O. de Kretser, and A. Moffat. Vector-space ranking with ef-

fective early termination. In Proceedings of the 24th annual international

ACM SIGIR conference on Research and development in information re-

trieval, pages 35–42. ACM, 2001.

[6] D. Arthur, B. Manthey, and H. Roglin. k-means has polynomial smoothed

complexity. Annual IEEE Symposium on Foundations of Computer Sci-

ence, 0:405–414, 2009.

[7] D. Arthur and S. Vassilvitskii. k-means++: the advantages of careful

seeding. In SODA ’07: Proceedings of the Eighteenth Annual ACM-SIAM

Symposium on Discrete Algorithms, pages 1027–1035, Philadelphia, PA,

USA, 2007. Society for Industrial and Applied Mathematics.


[8] P. Arvola, S. Geva, J. Kamps, R. Schenkel, A. Trotman, and J. Vainio.

Overview of the INEX 2010 ad hoc track. Comparative Evaluation of

Focused Retrieval, pages 1–32, 2011.

[9] M. Atsumi. Probabilistic learning of visual object composition from

attended segments. In Advances in Visual Computing, pages 696–705.

Springer, 2010.

[10] M. Atsumi. Learning visual categories based on probabilistic latent com-

ponent models with semi-supervised labeling. GSTF Journal on Comput-

ing (JoC), 2(1), 2012.

[11] M. Atsumi. Visual categorization based on learning contextual proba-

bilistic latent component tree. In Artificial Neural Networks and Machine

Learning – ICANN 2012, pages 419–426. Springer, 2012.

[12] M. Atsumi. Attention-guided organized perception and learning of ob-

ject categories based on probabilistic latent variable models. Journal of

Intelligent Learning Systems and Applications, 5:123–133, 2013.

[13] S. Bader and F. Maire. Ents - a fast and adaptive indexing system for

codebooks. ICONIP ’02: Proceedings of the 9th International Conference

on Neural Information Processing, 4:1837–1841, November 2002.

[14] A. Banerjee, S. Merugu, I.S. Dhillon, and J. Ghosh. Clustering with Bregman divergences. The Journal of Machine Learning Research, 6:1705–1749, 2005.

[15] M. Banko and E. Brill. Scaling to very very large corpora for natural

language disambiguation. In ACL ’01: Proceedings of the 39th Annual

Meeting on Association for Computational Linguistics, pages 26–33, Mor-

ristown, NJ, USA, 2001. Association for Computational Linguistics.

[16] A.L. Barabasi and R. Albert. Emergence of scaling in random networks.

Science, 286(5439):509, 1999.

[17] R. Baraniuk, M. Davenport, R. DeVore, and M. Wakin. A simple proof

of the restricted isometry property for random matrices. Constructive

Approximation, 28(3):253–263, 2008.

[18] Kobus Barnard and David Forsyth. Learning the semantics of words and

pictures. In Eighth IEEE International Conference on Computer Vision,

2001. ICCV 2001. Proceedings., volume 2, pages 408–415. IEEE, 2001.

[19] Rudolf Bayer and Edward M. McCreight. Organization and maintenance

of large ordered indices. Acta Informatica, 1:173–189, 1972.


[20] P. Berkhin. A survey of clustering data mining techniques. Grouping

Multidimensional Data, pages 25–71, 2006.

[21] J.C. Bezdek, R. Ehrlich, et al. FCM: The fuzzy c-means clustering algorithm. Computers & Geosciences, 10(2-3):191–203, 1984.

[22] Christophe Biernacki, Gilles Celeux, and Gerard Govaert. Assessing a mix-

ture model for clustering with the integrated completed likelihood. IEEE

Transactions on Pattern Analysis and Machine Intelligence, 22(7):719–

725, 2000.

[23] E. Bingham and H. Mannila. Random projection in dimensionality reduc-

tion: applications to image and text data. In KDD 2001, pages 245–250.

ACM, 2001.

[24] H. Bisgin and H.N. Dalfes. Parallel clustering algorithms with application

to climatology. In Geophysical Research Abstracts, volume 10, 2008.

[25] D.M. Blei, A.Y. Ng, and M.I. Jordan. Latent Dirichlet allocation. The Journal of Machine Learning Research, 3:993–1022, 2003.

[26] B.H. Bloom. Space/time trade-offs in hash coding with allowable errors.

Communications of the ACM, 13(7):422–426, 1970.

[27] Fazli Can and Esen A. Ozkarahan. Concepts and effectiveness of the

cover-coefficient-based clustering methodology for text databases. ACM

Transactions on Database Systems, 15(4):483–517, 1990.

[28] R.F. Cancho and R.V. Sole. The small world of human language. Pro-

ceedings of the Royal Society of London. Series B: Biological Sciences,

268(1482):2261, 2001.

[29] Rebecca J. Cathey, Eric C. Jensen, Steven M. Beitzel, Ophir Frieder, and

David Grossman. Exploiting parallelism to support scalable hierarchical

clustering. Journal of the American Society for Information Science and

Technology, 58(8):1207–1221, 2007.

[30] M.S. Charikar. Similarity estimation techniques from rounding algorithms.

In STOC, pages 380–388, New York, NY, USA, 2002. ACM.

[31] Brant W. Chee and Bruce Schatz. Document clustering using small world

communities. In JCDL ’07: Proceedings of the 7th ACM/IEEE-CS joint

conference on Digital libraries, pages 53–62, New York, NY, USA, 2007.

ACM.


[32] David Cheng, Ravi Kannan, Santosh Vempala, and Grant Wang. A divide-

and-merge methodology for clustering. ACM Transactions on Database

Systems, 31(4):1499–1525, 2006.

[33] Y. Chi, X. Song, D. Zhou, K. Hino, and B.L. Tseng. Evolutionary spectral

clustering by incorporating temporal smoothness. In Proceedings of the

13th ACM SIGKDD international conference on Knowledge discovery and

data mining, page 162. ACM, 2007.

[34] B. Chidlovskii. Semi-supervised Categorization of Wikipedia collection

by Label Expansion. In Workshop of the INitiative for the Evaluation of

XML Retrieval. Springer, 2008.

[35] Tzi-Dar Chiueh, Tser-Tzi Tang, and Liang-Gee Chen. Vector quantiza-

tion using tree-structured self-organizing feature maps. IEEE Journal on

Selected Areas in Communications, 12(9):1594–1599, Dec 1994.

[36] N. Chomsky. Modular approaches to the study of the mind. San Diego

State Univ, 1984.

[37] P.A. Chou, T. Lookabaugh, and R.M. Gray. Entropy-constrained vector quantization. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37(1):31–42, January 1989.

[38] C.L. Clarke, N. Craswell, and I. Soboroff. Overview of the TREC 2009

web track. Technical report, DTIC Document, 2009.

[39] C.L.A. Clarke, N. Craswell, I. Soboroff, and G.V. Cormack. Overview of

the TREC 2010 web track. Technical report, DTIC Document, 2010.

[40] C.L.A. Clarke, N. Craswell, I. Soboroff, and E.M. Voorhees. Overview of

the TREC 2011 web track. Technical report, DTIC Document, 2011.

[41] C.L.A. Clarke, N. Craswell, and E.M. Voorhees. Overview of the TREC

2012 web track. Technical report, DTIC Document, 2012.

[42] C. Cleverdon. The Cranfield tests on index language devices. In Aslib Proceedings, volume 19, pages 173–194. MCB UP Ltd, 1967.

[43] A. Coates, A. Karpathy, and A. Ng. Emergence of object-selective fea-

tures in unsupervised feature learning. In Advances in Neural Information

Processing Systems 25, pages 2690–2698, 2012.

[44] William W Cohen and Yoram Singer. Context-sensitive learning meth-

ods for text categorization. ACM Transactions on Information Systems

(TOIS), 17(2):141–173, 1999.


[45] D. Comer. The ubiquitous B-tree. ACM Computing Surveys (CSUR), 11(2):121–137, 1979.

[46] G.V. Cormack, M.D. Smucker, and C.L.A. Clarke. Efficient and effective

spam filtering and re-ranking for large web datasets. Information retrieval,

14(5):441–465, 2011.

[47] F. Crestani and S. Wu. Testing the cluster hypothesis in distributed

information retrieval. Information Processing & Management, 42(5):1137–

1150, 2006.

[48] W.B. Croft. A model of cluster searching based on classification. Infor-

mation systems, 5(3):189–195, 1980.

[49] S. Dasgupta and Y. Freund. Random projection trees and low dimensional

manifolds. In STOC ’08: Proceedings of the 40th annual ACM symposium

on Theory of computing, pages 537–546, New York, NY, USA, 2008. ACM.

[50] S. Dasgupta and A. Gupta. An elementary proof of the Johnson-

Lindenstrauss lemma. Random Structures & Algorithms, 22(1):60–65,

2002.

[51] L.M. de Campos, J.M. Fernandez-Luna, J.F. Huete, and A.E. Romero. Probabilistic methods for link-based classification at INEX 2008. In Workshop of the INitiative for the Evaluation of XML Retrieval. Springer, 2008.

[52] Christopher Michael De Vries. Application of k-tree to document cluster-

ing. 2010.

[53] C.M. De Vries. Application of k-tree to document clustering. Master’s

thesis, School of Computer Science, Queensland University of Technology,

Brisbane, Australia, 2010.

[54] C.M. De Vries, L. De Vine, and S. Geva. Random indexing K-tree. In ADCS 2009, page 43, 2009.

[55] C.M. De Vries, L. De Vine, and S. Geva. The EM-tree algorithm. Submitted to the Journal of Machine Learning Research, September 2013.

[56] C.M. De Vries and S. Geva. Document clustering with K-tree. Advances

in Focused Retrieval: 7th International Workshop of the Initiative for the

Evaluation of XML Retrieval, INEX 2008, Dagstuhl Castle, Germany,

December 15-18, 2008. Revised and Selected Papers, pages 420–431, 2009.


[57] C.M. De Vries and S. Geva. K-tree: large scale document clustering. In

Proceedings of the 32nd international ACM SIGIR conference on Research

and development in information retrieval, pages 718–719. ACM, 2009.

[58] C.M. De Vries and S. Geva. Pairwise similarity of TopSig document signatures. In Proceedings of the Seventeenth Australasian Document Computing Symposium, pages 128–134. ACM, 2012.

[59] C.M. De Vries, S. Geva, and L. De Vine. Clustering with random indexing K-tree and XML structure. In Focused Retrieval and Evaluation, pages 407–415. Springer, 2010.

[60] C.M. De Vries, S. Geva, and A. Trotman. Document clustering evaluation: Divergence from a random baseline. In Workshop "Information Retrieval 2012" (IR-2012), Technical University of Dortmund, Dortmund, Germany, September 2012.

[61] C.M. De Vries, S. Geva, and A. Trotman. Distributed information retrieval: Collection distribution, selection and the cluster hypothesis for evaluation of document clustering. Submitted to the ACM Transactions on Information Systems, July 2013.

[62] C.M. De Vries, R. Nayak, S. Kutty, S. Geva, and A. Tagarelli. Overview of

the INEX 2010 XML mining track: Clustering and classification of XML

documents. Comparative Evaluation of Focused Retrieval, pages 363–376,

2011.

[63] S. Deerwester, S.T. Dumais, G.W. Furnas, T.K. Landauer, and R. Harsh-

man. Indexing by latent semantic analysis. Journal of the American

Society for Information Science, 41(6):391–407, 1990.

[64] A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), pages 1–38, 1977.

[65] L. Denoyer and P. Gallinari. The Wikipedia XML Corpus. SIGIR Forum,

2006.

[66] L. Denoyer and P. Gallinari. Report on the XML mining track at INEX 2007: categorization and clustering of XML documents. In ACM SIGIR Forum, volume 42, pages 22–28. ACM, 2008.

[67] L. Denoyer and P. Gallinari. Overview of the INEX 2008 XML mining track. Advances in Focused Retrieval: 7th International Workshop of the


Initiative for the Evaluation of XML Retrieval, INEX 2008, Dagstuhl Cas-

tle, Germany, December 15-18, 2008. Revised and Selected Papers, pages

401–411, 2009.

[68] L. Denoyer, P. Gallinari, and A.M. Vercoustre. Report on the XML mining

track at INEX 2005 and INEX 2006. INEX 2006, pages 432–443, 2007.

[69] I.S. Dhillon, Y. Guan, and B. Kulis. Kernel k-means: spectral clustering and normalized cuts. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, page 556, 2004.

[70] I.S. Dhillon, S. Mallela, and D.S. Modha. Information-theoretic co-

clustering. In Proceedings of the ninth ACM SIGKDD international con-

ference on Knowledge discovery and data mining, pages 89–98. ACM New

York, NY, USA, 2003.

[71] F. Diaz. Regularizing ad hoc retrieval scores. In Proceedings of the 14th

ACM international conference on Information and knowledge manage-

ment, pages 672–679. ACM, 2005.

[72] C. Ding and X. He. K-means clustering via principal component analysis.

In Proceedings of the twenty-first international conference on Machine

learning, page 29. ACM, 2004.

[73] S. Ding, S. Gollapudi, S. Ieong, K. Kenthapadi, and A. Ntoulas. Indexing strategies for graceful degradation of search quality. In Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval, pages 575–584. ACM, 2011.

[74] R.O. Duda, P.E. Hart, and D.G. Stork. Pattern classification, 2nd Edition.

Wiley-Interscience, 2001.

[75] M. Ester, H.P. Kriegel, J. Sander, and X. Xu. A density-based algorithm

for discovering clusters in large spatial databases with noise. In Proceedings

of KDD, volume 96, pages 226–231, 1996.

[76] R.M. Esteves and C. Rong. Using Mahout for clustering Wikipedia's latest articles: a comparison between k-means and fuzzy c-means in the cloud. In IEEE CloudCom, pages 565–569. IEEE, 2011.

[77] V. Estivill-Castro. Why so many clustering algorithms: a position paper.

ACM SIGKDD Explorations Newsletter, 4(1):65–75, 2002.

[78] C. Faloutsos and S. Christodoulakis. Signature files: an access method for

documents and its analytical performance evaluation. ACM Transactions

on Information Systems (TOIS), 2:267–288, October 1984.


[79] I. Farber, S. Gunnemann, H.P. Kriegel, P. Kroger, E. Muller, E. Schubert,

T. Seidl, and A. Zimek. On using class-labels in evaluation of clusterings.

In MultiClust: 1st International Workshop on Discovering, Summarizing

and Using Multiple Clusterings Held in Conjunction with KDD, 2010.

[80] M. Feldman, R. Lempel, O. Somekh, and K. Vornovitsky. On the im-

pact of random index-partitioning on index compression. Arxiv preprint

arXiv:1107.5661, 2011.

[81] P. Forner, J. Gonzalo, J. Kekäläinen, M. Lalmas, and M. De Rijke. Multilingual and Multimodal Information Access Evaluation: Proceedings of the Second International Conference of the Cross-Language Evaluation Forum, CLEF 2011, Amsterdam, The Netherlands, September 19-22, 2011, volume 6941. Springer-Verlag New York Inc, 2011.

[82] Christopher Fox. A stop list for general text. In ACM SIGIR Forum,

volume 24, pages 19–21. ACM, 1989.

[83] T.W. Fox. Document vector compression and its application in document

clustering. Canadian Conference on Electrical and Computer Engineering,

pages 2029–2032, May 2005.

[84] Chris Fraley and Adrian E. Raftery. How many clusters? Which clustering method? Answers via model-based cluster analysis. The Computer Journal, 41(8):578–588, 1998.

[85] Annalisa Franco, Alessandra Lumini, and Dario Maio. MKL-tree: an index structure for high-dimensional vector spaces. Multimedia Systems, 12(6):533–550, May 2007.

[86] N. Fuhr, M. Lechtenfeld, B. Stein, and T. Gollub. The optimum clustering

framework: implementing the cluster hypothesis. Information Retrieval,

pages 1–23, 2011.

[87] A. Gersho and R.M. Gray. Vector quantization and signal compression.

Kluwer Academic Publishers, 1993.

[88] M. Gery, C. Largeron, and C. Moulin. UJM at INEX 2008 XML mining

track. In Advances in Focused Retrieval, page 452. Springer, 2009.

[89] S. Geva. K-tree: a height balanced tree structured vector quantizer. In

Proceedings of the 2000 IEEE Signal Processing Society Workshop Neural

Networks for Signal Processing X, 2000., volume 1, pages 271–280. IEEE,

2000.


[90] S. Geva and C.M. De Vries. TopSig: topology preserving document signatures. In Proceedings of the 20th ACM international conference on Information and knowledge management, CIKM '11, pages 333–338, New York, NY, USA, 2011. ACM.

[91] A. Girgensohn, F. Shipman, and L. Wilcox. Adaptive clustering and in-

teractive visualizations to support the selection of video clips. In Proceed-

ings of the 1st ACM International Conference on Multimedia Retrieval,

page 34. ACM, 2011.

[92] G. Golub and W. Kahan. Calculating the singular values and pseudo-

inverse of a matrix. Journal of the Society for Industrial and Applied

Mathematics: Series B, Numerical Analysis, 2(2):205–224, 1965.

[93] G.H. Golub and C. Reinsch. Singular value decomposition and least

squares solutions. Numerische Mathematik, 14(5):403–420, 1970.

[94] L. Gravano and H. Garcia-Molina. Generalizing GlOSS to vector-space databases and broker hierarchies. 1999.

[95] U. Großekathöfer, S. Geva, T. Hermann, and S. Kopp. Learning hierarchical prototypes of motion time series for interactive systems. In Machine Learning for Interactive Systems: Bridging the Gap Between Language, Motor, page 37, 2012.

[96] S. Guha, A. Meyerson, N. Mishra, R. Motwani, and L. O’Callaghan. Clus-

tering data streams: Theory and practice. IEEE Transactions on Knowl-

edge and Data Engineering, pages 515–528, 2003.

[97] Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim. CURE: an efficient clustering algorithm for large databases. Information Systems, 26(1):35–58, March 2001.

[98] I. Guyon, U. von Luxburg, and R.C. Williamson. Clustering: Science or art? In NIPS Workshop on Clustering Theory, 2009.

[99] A. Halevy, P. Norvig, and F. Pereira. The unreasonable effectiveness of

data. IEEE Intelligent Systems, 24(2):8–12, 2009.

[100] K.M. Hammouda and M.S. Kamel. Efficient phrase-based document indexing for web document clustering. IEEE Transactions on Knowledge and Data Engineering, 16(10):1279–1296, October 2004.

[101] Z.S. Harris. Distributional structure. Word, 1954.


[102] M.A. Hearst and J.O. Pedersen. Reexamining the cluster hypothesis:

scatter/gather on retrieval results. In Proceedings of the 19th annual in-

ternational ACM SIGIR conference on Research and development in in-

formation retrieval, pages 76–84. ACM, 1996.

[103] Laurie J. Heyer, Semyon Kruglyak, and Shibu Yooseph. Exploring ex-

pression data: Identification and analysis of coexpressed genes. Genome

Research, 9(11):1106–1115, 1999.

[104] Thomas Hofmann. Probabilistic latent semantic indexing. In SIGIR

’99: Proceedings of the 22nd annual international ACM SIGIR confer-

ence on Research and development in information retrieval, pages 50–57,

New York, NY, USA, 1999. ACM.

[105] A. Hotho, S. Staab, and G. Stumme. Ontologies improve text document clustering. In Third IEEE International Conference on Data Mining (ICDM 2003), pages 541–544, November 2003.

[106] Xiaohua Hu, Xiaodan Zhang, Caimei Lu, Eun K Park, and Xiaohua Zhou. Exploiting Wikipedia as external knowledge for document clustering. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 389–396. ACM, 2009.

[107] A. Huang. Similarity measures for text document clustering. In Pro-

ceedings of the Sixth New Zealand Computer Science Research Student

Conference (NZCSRSC2008), Christchurch, New Zealand, pages 49–56,

2008.

[108] Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: to-

wards removing the curse of dimensionality. In Proceedings of the thirtieth

annual ACM symposium on Theory of computing, pages 604–613. ACM,

1998.

[109] H. V. Jagadish, Beng Chin Ooi, Kian-Lee Tan, Cui Yu, and Rui Zhang. iDistance: an adaptive B+-tree based indexing method for nearest neighbor search. ACM Transactions on Database Systems, 30(2):364–397, 2005.

[110] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review.

ACM Computing Surveys (CSUR), 31(3):264–323, 1999.

[111] N. Jardine and C.J. van Rijsbergen. The use of hierarchic clustering in

information retrieval. Information storage and retrieval, 7(5):217–240,

1971.


[112] R. Jin, C. Kou, R. Liu, and Y. Li. Efficient parallel spectral cluster-

ing algorithm design for large data sets under cloud computing environ-

ment. Journal of Cloud Computing: Advances, Systems and Applications,

2(1):18, 2013.

[113] Thorsten Joachims. Text categorization with support vector machines: Learning with many relevant features. In Machine Learning: ECML-98, pages 137–142. Springer, 1998.

[114] W.B. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings

into a Hilbert space. Contemporary mathematics, 26(189-206):1–1, 1984.

[115] N. Kando, K. Kuriyama, T. Nozue, K. Eguchi, H. Kato, and S. Hidaka. Overview of IR tasks at the first NTCIR workshop. In Proceedings of the first NTCIR workshop on research in Japanese text retrieval and term recognition, pages 11–44, 1999.

[116] P. Kanerva. The spatter code for encoding concepts at many levels. In

ICANN94, Proceedings of the International Conference on Artificial Neu-

ral Networks, 1994.

[117] T. Kanungo, D.M. Mount, N.S. Netanyahu, C.D. Piatko, R. Silverman,

and A.Y. Wu. An efficient k-means clustering algorithm: analysis and

implementation. IEEE Transactions on Pattern Analysis and Machine

Intelligence, 24(7):881–892, Jul 2002.

[118] R. Kaptein and J. Kamps. Using Links to Classify Wikipedia Pages. In

Advances in Focused Retrieval, page 435. Springer, 2009.

[119] G. Karypis. CLUTO-A Clustering Toolkit. 2002.

[120] G. Karypis, E.H. Han, and V. Kumar. Chameleon: Hierarchical clustering

using dynamic modeling. Computer, 32(8):68–75, 1999.

[121] M. Kaszkiel and J. Zobel. Effective ranking with arbitrary passages.

Journal of the American Society for Information Science and Technol-

ogy, 52(4):344–364, 2001.

[122] L. Kaufman and P.J. Rousseeuw. Clustering by means of medoids. Technical report, Technische Hogeschool, Delft (Netherlands), Dept. of Mathematics and Informatics, 1987.

[123] L. Kaufman and P.J. Rousseeuw. Finding groups in data. An introduction

to cluster analysis. 1990.


[124] M.G. Kendall. A new measure of rank correlation. Biometrika,

30(1/2):81–93, 1938.

[125] R.W. Klein and R.C. Dubes. Experiments in projection and clustering by

simulated annealing. Pattern Recognition, 22(2):213–220, 1989.

[126] Jon M. Kleinberg. Authoritative sources in a hyperlinked environment.

Journal of the ACM, 46(5):604–632, 1999.

[127] S. Kotsiantis and PE Pintelas. Recent advances in clustering: a brief

survey. WSEAS Transactions on Information Science and Applications,

1(1):73–81, 2004.

[128] A. Kulkarni and J. Callan. Document allocation policies for selective

searching of distributed indexes. In Proceedings of the 19th ACM inter-

national conference on Information and knowledge management, pages

449–458. ACM, 2010.

[129] A. Kulkarni and J. Callan. Topic-based index partitions for efficient and

effective selective search. In SIGIR 2010 Workshop on Large-Scale Dis-

tributed Information Retrieval, 2010.

[130] A. Kulkarni, A.S. Tigelaar, D. Hiemstra, and J. Callan. Shard ranking

and cutoff estimation for topically partitioned collections. In Proceedings

of the 21st ACM international conference on Information and knowledge

management, pages 555–564. ACM, 2012.

[131] A. Kumar. G-tree: a new data structure for organizing multidimensional

data. IEEE Transactions on Knowledge and Data Engineering, 6(2):341–

347, Apr 1994.

[132] A. Kumar and S. Mukherjee. Verification and validation of MapReduce program model for parallel k-means algorithm on Hadoop cluster. International Journal of Computer Applications, 72, 2013.

[133] S. Kutty, T. Tran, R. Nayak, and Y. Li. Clustering XML Documents

Using Frequent Subtrees. In Advances in Focused Retrieval, page 445.

Springer, 2009.

[134] Sangeetha Kutty, Richi Nayak, and Yuefeng Li. XML documents clustering using a tensor space model. In Advances in Knowledge Discovery and Data Mining, pages 488–499. Springer, 2011.

[135] I.O. Kyrgyzov, O.O. Kyrgyzov, H. Maître, and M. Campedel. Kernel MDL to determine the number of clusters. Lecture Notes in Computer Science, 4571:203, 2007.


[136] A. Kyriakopoulou and T. Kalamboukis. Using clustering to enhance text

classification. In SIGIR ’07: Proceedings of the 30th annual international

ACM SIGIR conference on Research and development in information re-

trieval, pages 805–806, New York, NY, USA, 2007. ACM.

[137] S. Lamprier, T. Amghar, B. Levrat, and F. Saubion. Seggen: A genetic

algorithm for linear text segmentation. In International Joint Conference

on Artificial Intelligence (IJCAI07), 2007.

[138] S. Lamprier, T. Amghar, B. Levrat, and F. Saubion. Using Text Segmen-

tation to Enhance the Cluster Hypothesis. Artificial Intelligence: Method-

ology, Systems, and Applications, pages 69–82, 2008.

[139] S. Lamrous and M. Taileb. Divisive hierarchical k-means. International

Conference on Computational Intelligence for Modelling, Control and Au-

tomation and International Conference on Intelligent Agents, Web Tech-

nologies and Internet Commerce, pages 18–18, November 2006.

[140] T.K. Landauer and S.T. Dumais. A solution to Plato's problem: the latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2):211–240, 1997.

[141] T.K. Landauer, P.W. Foltz, and D. Laham. An introduction to latent

semantic analysis. Discourse Processes, 25(2-3):259–284, 1998.

[142] Leah S Larkey, Margaret E Connell, and Jamie Callan. Collection selection

and results merging with topically organized us patents and trec data.

In Proceedings of the ninth international conference on Information and

knowledge management, pages 282–289. ACM, 2000.

[143] V. Latora and M. Marchiori. Efficient behavior of small-world networks.

Physical Review Letters, 87(19):198701, 2001.

[144] Q.V. Le, M.A. Ranzato, R. Monga, M. Devin, K. Chen, G.S. Corrado,

J. Dean, and A.Y. Ng. Building high-level features using large scale un-

supervised learning. arXiv preprint arXiv:1112.6209, 2011.

[145] K.S. Lee, W.B. Croft, and J. Allan. A cluster-based resampling method for

pseudo-relevance feedback. In Proceedings of the 31st annual international

ACM SIGIR conference on Research and development in information re-

trieval, pages 235–242. ACM, 2008.

[146] K.S. Lee, K. Kageura, and K.S. Choi. Implicit ambiguity resolution using

incremental clustering in cross-language information retrieval. Informa-

tion processing & management, 40(1):145–159, 2004.


[147] K.S. Lee, Y.C. Park, and K.S. Choi. Re-ranking model based on document

clusters. Information processing & management, 37(1):1–14, 2001.

[148] Mingwei Leng, Haitao Tang, and Xiaoyun Chen. An efficient k-means

clustering algorithm based on influence factors. Eighth ACIS International

Conference on Software Engineering, Artificial Intelligence, Networking,

and Parallel/Distributed Computing, 2:815–820, 2007.

[149] A.V. Leouski and W.B. Croft. An evaluation of techniques for clustering

search results. 1996.

[150] A. Leuski. Evaluating document clustering for interactive information

retrieval. In CIKM ’01: Proceedings of the tenth international conference

on Information and knowledge management, pages 33–40, New York, NY,

USA, 2001. ACM.

[151] D.D. Lewis. Representation and learning in information retrieval. PhD

thesis, 1992.

[152] D.D. Lewis, Y. Yang, T.G. Rose, and F. Li. RCV1: A new benchmark col-

lection for text categorization research. The Journal of Machine Learning

Research, 5:361–397, 2004.

[153] AWC Liew, SH Leung, and WH Lau. Fuzzy image clustering incorporating

spatial continuity. IEEE Proceedings-Vision, Image and Signal Processing,

147(2):185–192, 2000.

[154] C.J. Lin. Projected Gradient Methods for Nonnegative Matrix Factoriza-

tion. Neural Computation, 19(10):2756–2779, 2007.

[155] T. Liu, C. Rosenberg, and H.A. Rowley. Clustering billions of images with

large scale nearest neighbor search. In WACV’07, pages 28–28. IEEE,

2007.

[156] X. Liu and W.B. Croft. Cluster-based retrieval using language models. In

Proceedings of the 27th annual international ACM SIGIR conference on

Research and development in information retrieval, pages 186–193. ACM,

2004.

[157] S. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, March 1982.

[158] Lie Lu, Hong-Jiang Zhang, and Hao Jiang. Content analysis for audio

classification and segmentation. IEEE Transactions on Speech and Audio

Processing, 10(7):504–516, 2002.


[159] Kevin Lund and Curt Burgess. Producing high-dimensional semantic

spaces from lexical co-occurrence. Behavior Research Methods, Instru-

ments, & Computers, 28(2):203–208, 1996.

[160] Bin Luo, Richard C Wilson, and Edwin R Hancock. Spectral embedding

of graphs. Pattern recognition, 36(10):2213–2230, 2003.

[161] C. Macdonald and I. Ounis. Voting for candidates: adapting data fusion

techniques for an expert search task. In CIKM 2006, pages 387–396. ACM,

2006.

[162] Gurmeet Singh Manku, Arvind Jain, and Anish Das Sarma. Detecting

near-duplicates for web crawling. In Proceedings of the 16th international

conference on World Wide Web, WWW ’07, pages 141–150, New York,

NY, USA, 2007. ACM.

[163] C.D. Manning, P. Raghavan, and H. Schütze. An introduction to information retrieval. Cambridge University Press, 2008.

[164] S.T. March and G.F. Smith. Design and natural science research on infor-

mation technology. Decision Support Systems, 15(4):251–266, December

1995.

[165] Laura A. Mather. A linear algebra measure of cluster quality. Journal of

the American Society for Information Science, 51(7):602–613, 2000.

[166] Yutaka Matsuo, Takeshi Sakaki, Koki Uchiyama, and Mitsuru Ishizuka.

Graph-based word clustering using a web search engine. In EMNLP ’06:

Proceedings of the 2006 Conference on Empirical Methods in Natural Lan-

guage Processing, pages 542–550, Morristown, NJ, USA, 2006. Association

for Computational Linguistics.

[167] Yingbo Miao, Vlado Keselj, and Evangelos Milios. Document clustering

using character n-grams: a comparative evaluation with term-based and

word-based clustering. In Proceedings of the 14th ACM international con-

ference on Information and knowledge management, pages 357–358. ACM,

2005.

[168] B.L. Milenova and M.M. Campos. O-cluster: scalable clustering of large

high dimensional data sets. ICDM ’02: IEEE International Conference

on Data Mining, pages 290–297, 2002.

[169] A. Moffat and J. Zobel. Rank-biased precision for measurement of retrieval

effectiveness. ACM Transactions on Information Systems (TOIS), 27:2:1–

2:27, December 2008.


[170] Fionn Murtagh, Jean-Luc Starck, and Michael W Berry. Overcoming the

curse of dimensionality in clustering by means of the wavelet transform.

The Computer Journal, 43(2):107–120, 2000.

[171] R. Nayak, C.M. De Vries, S. Kutty, S. Geva, L. Denoyer, and P. Gallinari.

Overview of the INEX 2009 XML mining track: Clustering and classifica-

tion of XML documents. Focused Retrieval and Evaluation, pages 366–378,

2010.

[172] R. Nayak, R. Witt, and A. Tonev. Data mining and XML documents.

2002.

[173] R.T. Ng and J. Han. CLARANS: a method for clustering objects for spatial data mining. IEEE Transactions on Knowledge and Data Engineering, 14(5):1003–1016, 2002.

[174] Zheng-Yu Niu, Dong-Hong Ji, and Chew-Lim Tan. Document clustering

based on cluster validation. In CIKM ’04: Proceedings of the thirteenth

ACM international conference on Information and knowledge manage-

ment, pages 501–506, New York, NY, USA, 2004. ACM.

[175] T. Ozyer and R. Alhajj. Achieving natural clustering by validating re-

sults of iterative evolutionary clustering approach. 3rd International IEEE

Conference on Intelligent Systems, pages 488–493, September 2006.

[176] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The

pagerank citation ranking: Bringing order to the web. 1999.

[177] O. Papapetrou, W. Siberski, F. Leitritz, and W. Nejdl. Exploiting distribution skew for scalable P2P text clustering. In DBISP2P, pages 1–12, 2008.

[178] Lance Parsons, Ehtesham Haque, and Huan Liu. Subspace clustering for

high dimensional data: a review. SIGKDD Explorer Newsletter, 6(1):90–

105, 2004.

[179] Dan Pelleg, Andrew W Moore, et al. X-means: Extending k-means with

efficient estimation of the number of clusters. In ICML, pages 727–734,

2000.

[180] T.A. Plate. Distributed representations and nested compositional struc-

ture. PhD thesis, 1994.

[181] M.F. Porter. An algorithm for suffix stripping. Program: Electronic Li-

brary and Information Systems, 40(3):211–218, 2006.


[182] William H Press. Numerical recipes 3rd edition: The art of scientific

computing. Cambridge university press, 2007.

[183] Diego Puppin, Fabrizio Silvestri, and Domenico Laforenza. Query-driven

document partitioning and collection selection. In Proceedings of the 1st

international conference on Scalable information systems, page 34. ACM,

2006.

[184] Behrang QasemiZadeh and Siegfried Handschuh. Random Manhattan indexing. In 25th International Workshop on Database and Expert Systems Applications, 2014.

[185] Daniel Ramage, Paul Heymann, Christopher D Manning, and Hector

Garcia-Molina. Clustering the tagged web. In Proceedings of the Second

ACM International Conference on Web Search and Data Mining, pages

54–63. ACM, 2009.

[186] T.S. Rappaport. Wireless communications: principles and practice, vol-

ume 2. Prentice Hall Inc., 1996.

[187] Kaspar Riesen, Michel Neuhaus, and Horst Bunke. Graph embedding in

vector spaces by means of prototype selection. In Graph-Based Represen-

tations in Pattern Recognition, pages 383–393. Springer, 2007.

[188] Robert M. Losee, Jr. and Lewis Church. Are two document clusters better than one? The cluster performance question for information retrieval. Journal of the American Society for Information Science and Technology, 56(1):106–108, 2005.

[189] S.E. Robertson and K.S. Jones. Simple, proven approaches to text re-

trieval. Update, 1997.

[190] S.E. Robertson, S. Walker, S. Jones, M.M. Hancock-Beaulieu, and M. Gatford. Okapi at TREC-3. In Proceedings of the Third Text REtrieval Conference, 1995.

[191] K. Rose. Deterministic annealing for clustering, compression, classifica-

tion, regression, and related optimization problems. Proceedings of the

IEEE, 86(11):2210–2239, November 1998.

[192] K. Rose, D. Miller, and A. Gersho. Entropy-constrained tree-structured

vector quantizer design. IEEE Transactions on Image Processing,

5(2):393–398, Feb 1996.


[193] P.J. Rousseeuw. Silhouettes: A graphical aid to the interpretation and

validation of cluster analysis. Journal of computational and applied math-

ematics, 20(1):53–65, 1987.

[194] M. Sahlgren. An introduction to random indexing. In TKE 2005, 2005.

[195] Ruslan Salakhutdinov and Geoffrey Hinton. Semantic hashing. Interna-

tional Journal of Approximate Reasoning, 50(7):969–978, 2009.

[196] G. Salton and C. Buckley. Term-weighting approaches in automatic text

retrieval. Information processing & management, 24(5):513–523, 1988.

[197] G. Salton, E.A. Fox, and H. Wu. Extended boolean information retrieval.

Communications of the ACM, 26(11):1022–1036, 1983.

[198] G. Salton, A. Wong, and C.S. Yang. A vector space model for automatic

indexing. Communications of the ACM, 18(11):613–620, 1975.

[199] S. Salvador and P. Chan. Determining the number of clusters/segments in hierarchical clustering/segmentation algorithms. In 16th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2004), pages 576–584, November 2004.

[200] T. Saracevic. Effects of inconsistent relevance judgments on information

retrieval test results: A historical perspective. 2008.

[201] R. Schenkel, F. Suchanek, and G. Kasneci. YAWN: A semantically annotated Wikipedia XML corpus. In A. Kemper, H. Schöning, T. Rose, M. Jarke, T. Seidl, C. Quix, and C. Brochhaus, editors, BTW 2007, pages 277–291, 2007.

[202] F. Sebastiani. Machine learning in automated text categorization. ACM

computing surveys (CSUR), 34(1):1–47, 2002.

[203] S.Z. Selim and M.A. Ismail. K-means-type algorithms: a generalized con-

vergence theorem and characterization of local optimality. IEEE Trans-

actions on Pattern Analysis and Machine Intelligence, (1):81–87, 1984.

[204] C.E. Shannon and Warren Weaver. The mathematical theory of commu-

nication. University of Illinois Press, 1949.

[205] G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: a wavelet-based clustering approach for spatial data in very large databases. The VLDB Journal, 8(3-4):289–304, 2000.

[206] Heng Tao Shen, Xiaofang Zhou, and Aoying Zhou. An adaptive and

dynamic dimensionality reduction method for high-dimensional indexing.

The VLDB Journal, 16(2):219–234, 2007.


[207] M. Smucker and J. Allan. A new measure of the cluster hypothesis. Ad-

vances in Information Retrieval Theory, pages 281–288, 2009.

[208] Y. Song, W.Y. Chen, H. Bai, C.J. Lin, and E.Y. Chang. Parallel Spectral

Clustering. 2008.

[209] K. Sparck Jones and C. van Rijsbergen. Report on the need for and provision of an "ideal" information retrieval test collection. British Library Research and Development Report 5266, Computer Laboratory, University of Cambridge, 1975.

[210] M. Steinbach, G. Karypis, V. Kumar, et al. A comparison of document

clustering techniques. In KDD workshop on text mining, volume 400,

pages 525–526. Boston, 2000.

[211] F.M. Suchanek, G. Kasneci, and G. Weikum. YAGO: a core of semantic knowledge. In Proceedings of the 16th international conference on World Wide Web, pages 697–706. ACM, 2007.

[212] Mihai Surdeanu, Jordi Turmo, and Alicia Ageno. A hybrid unsuper-

vised approach for document clustering. In KDD ’05: Proceedings of the

eleventh ACM SIGKDD international conference on Knowledge discovery

in data mining, pages 685–690, New York, NY, USA, 2005. ACM.

[213] Andrea Tagarelli and Sergio Greco. Toward semantic XML clustering. In SDM, pages 188–199. SIAM, 2006.

[214] A.H. Tan. Text mining: The state of the art and the challenges. In Proceedings of the PAKDD 1999 Workshop on Knowledge Discovery from Advanced Databases, pages 65–70, 1999.

[215] A. Tombros, R. Villa, and C.J. Van Rijsbergen. The effectiveness of query-

specific hierarchic clustering in information retrieval. Information process-

ing & management, 38(4):559–582, 2002.

[216] Simon Tong and Daphne Koller. Support vector machine active learn-

ing with applications to text classification. Journal of Machine Learning

Research, 2:45–66, 2002.

[217] M. Tovar, A. Cruz, B. Vazquez, D. Pinto, and D. Vilarino. An iterative

clustering method for the XML-mining task of the INEX 2010. INEX

2010, 2011.

[218] T. Tran, S. Kutty, and R. Nayak. Utilizing the Structure and Content

Information for XML Document Clustering. In Advances in Focused Re-

trieval, page 468. Springer, 2009.


[219] A. Trotman. Compressing inverted files. Information Retrieval, 6(1):5–19,

2003.

[220] A. Trotman, X.F. Jia, and S. Geva. Fast and effective focused retrieval.

Focused Retrieval and Evaluation, pages 229–241, 2010.

[221] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large Margin

Methods for Structured and Interdependent Output Variables. Journal of

Machine Learning Research, 6:1453–1484, 2005.

[222] A. Turpin and F. Scholer. User performance versus precision measures for

simple search tasks. In SIGIR 2006, pages 11–18. ACM, 2006.

[223] Stijn Van Dongen. Graph clustering via a discrete uncoupling process.

SIAM Journal on Matrix Analysis and Applications, 30(1):121–141, 2008.

[224] C. J. van Rijsbergen. Information Retrieval. Butterworths, 1979.

[225] C.J. van Rijsbergen and K.S. Jones. A test for the separation of relevant

and non-relevant documents in experimental retrieval collections. Journal

of Documentation, 29(3):251–257, 1973.

[226] V.N. Vapnik. The nature of statistical learning theory. Springer Verlag,

2000.

[227] E.M. Voorhees. The cluster hypothesis revisited. In Proceedings of the

8th annual international ACM SIGIR conference on Research and devel-

opment in information retrieval, pages 188–196. ACM, 1985.

[228] E.M. Voorhees. The effectiveness and efficiency of agglomerative hierar-

chic clustering in document retrieval. PhD thesis, 1986.

[229] E.M. Voorhees. The philosophy of information retrieval evaluation. In

Evaluation of cross-language information retrieval systems, pages 143–

170. Springer, 2002.

[230] R.S. Wallace and T. Kanade. Finding natural clusters having minimum

description length. 10th International Conference on Pattern Recognition,

1:438–442, June 1990.

[231] S. Wang, F. Liang, and J. Yang. PKU at INEX 2010 XML mining track. INEX 2010, 2010.

[232] D.J. Watts and S.H. Strogatz. Collective dynamics of ’small-world’ net-

works. Nature, (393):440–442, 1998.


[233] M. Weeks, V.J. Hodge, and J. Austin. A hardware-accelerated novel IR system. In Proceedings of the 10th Euromicro Workshop on Parallel, Distributed and Network-based Processing, pages 283–289. IEEE, 2002.

[234] John S. Whissell and Charles L.A. Clarke. Effective measures for inter-document similarity. In Proceedings of the 22nd ACM international conference on Information & Knowledge Management, pages 1361–1370. ACM, 2013.

[235] J.S. Whissell and C.L.A. Clarke. Improving document clustering using Okapi BM25 feature weighting. Information Retrieval, pages 1–22, 2011.

[236] P. Willett. Recent trends in hierarchic document clustering: a critical

review. Information Processing & Management, 24(5):577–597, 1988.

[237] I.H. Witten, A. Moffat, and T.C. Bell. Managing gigabytes: compressing

and indexing documents and images. Morgan Kaufmann, 1999.

[238] J. Woodcock and J. Davies. Using Z: specification, refinement, and proof,

volume 39. Prentice Hall, 1996.

[239] X. Wu, V. Kumar, J. Ross Quinlan, J. Ghosh, Q. Yang, H. Motoda, G.J.

McLachlan, A. Ng, B. Liu, P.S. Yu, et al. Top 10 algorithms in data

mining. Knowledge and Information Systems, 14(1):1–37, 2008.

[240] J. Xu and W.B. Croft. Cluster-based language models for distributed

retrieval. In Proceedings of the 22nd annual international ACM SIGIR

conference on Research and development in information retrieval, pages

254–261. ACM, 1999.

[241] W. Xu, X. Liu, and Y. Gong. Document clustering based on non-negative matrix factorization. In SIGIR '03: Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, pages 267–273, New York, NY, USA, 2003. ACM.

[242] Minerva Yeung, Boon-Lock Yeo, and Bede Liu. Segmentation of video by

clustering and graph analysis. Computer vision and image understanding,

71(1):94–109, 1998.

[243] Lan Yi, Bing Liu, and Xiaoli Li. Eliminating noisy information in web

pages for data mining. In Proceedings of the ninth ACM SIGKDD in-

ternational conference on Knowledge discovery and data mining, pages

296–305. ACM, 2003.


[244] Illhoi Yoo and Xiaohua Hu. A comprehensive comparison study of doc-

ument clustering for a biomedical digital library medline. In JCDL ’06:

Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Li-

braries, pages 220–229, New York, NY, USA, 2006. ACM.

[245] Sangho Yoon and R.M. Gray. Clustering and finding the number of clus-

ters by unsupervised learning of mixture models using vector quantization.

IEEE International Conference on Acoustics, Speech and Signal Process-

ing, 3:III–1081–III–1084, April 2007.

[246] B. Yuwono and D.L. Lee. Server ranking for distributed text retrieval sys-

tems on the internet. In Proceedings of the Fifth International Conference

on Database Systems for Advanced Applications (DASFAA), pages 41–50,

1997.

[247] C. Zhai and J. Lafferty. A study of smoothing methods for language mod-

els applied to information retrieval. ACM Transactions on Information

Systems (TOIS), 22:179–214, April 2004.

[248] Bin Zhang and S.N. Srihari. Fast k-nearest neighbor classification using

cluster-based trees. IEEE Transactions on Pattern Analysis and Machine

Intelligence, 26(4):525–528, April 2004.

[249] D. Zhang, J. Wang, D. Cai, and J. Lu. Self-taught hashing for fast simi-

larity search. In SIGIR 2010, pages 18–25. ACM, 2010.

[250] S. Zhang, M. Hagenbuchner, A.C. Tsoi, and A. Sperduti. Self Organizing

Maps for the clustering of large sets of labeled graphs. In Workshop of

the INitiative for the Evaluation of XML Retrieval. Springer, 2008.

[251] Tian Zhang, Raghu Ramakrishnan, and Miron Livny. BIRCH: an efficient data clustering method for very large databases. Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, 25(2):103–114, 1996.

[252] Y. Zhao and G. Karypis. Criterion functions for document clustering: Experiments and analysis. Technical report, University of Minnesota, Department of Computer Science / Army HPC Research Center, 2002.

[253] Bing Zhou, Jun-Yi Shen, and Qin-Ke Peng. Parcle: a parallel clustering algorithm for cluster system. In 2003 International Conference on Machine Learning and Cybernetics, volume 1, pages 4–8, November 2003.

[254] G.K. Zipf. Human behavior and the principle of least effort: An introduc-

tion to human ecology. Addison-Wesley Press, 1949.


[255] J. Zobel. How reliable are the results of large-scale information retrieval

experiments? In Proceedings of the 21st annual international ACM SIGIR

conference on Research and development in information retrieval, pages

307–314. ACM, 1998.

[256] J. Zobel, A. Moffat, and L.A.F. Park. Against recall: is it persistence,

cardinality, density, coverage, or totality? SIGIR Forum, 43:3–8, June

2009.

[257] J. Zobel, A. Moffat, and K. Ramamohanarao. Inverted files versus sig-

nature files for text indexing. ACM Transactions on Database Systems,

23:453–490, December 1998.
