An efficient document clustering algorithm and its
application to a document browser
Hideki Tanaka a,*, Tadashi Kumano b, Noriyoshi Uratani b, Terumasa Ehara b
a ATR Interpreting Telecommunications Research Laboratories, 2-2 Hikaridai, Seika-cho, Souraku-gun, Kyoto 619-0288, Japan
b NHK Science and Technical Research Laboratories, 1-10-11 Kinuta, Setagaya-ku, Tokyo 157-8510, Japan
Abstract
We present an efficient document clustering algorithm that uses a term frequency vector for each
document instead of a huge proximity matrix. The algorithm has the following features: (1) it
requires a relatively small amount of memory and runs fast, (2) it produces a hierarchy in the form of a
document classification tree and (3) the hierarchy obtained by the algorithm explicitly reveals a
collection structure. We confirm these features, and thus the algorithm's feasibility, through
clustering experiments on two collections of Japanese documents of 83,099 and 14,701 documents,
respectively. We also introduce an application of this algorithm to a document
browser. This browser is used in our Japanese-to-English translation aid system. The browsing module
of the system consists of a huge database of Japanese news articles and their English translations. The
Japanese article collection is clustered into a hierarchy by our method. Since each node in the hierarchy
corresponds to a topic in the collection, we can use the hierarchy to directly access articles by topic. A
user can learn general translation knowledge of each topic by browsing the Japanese articles and their
English translations. We also discuss techniques for presenting a large tree-formed hierarchy on a
computer screen. © 1999 Elsevier Science Ltd. All rights reserved.
Keywords: Document clustering; Document retrieval; Automatic document organization
1. Motivation
Document clustering has long received keen attention from those concerned with document
retrieval and some of the many papers which have been published are those of (Jardine & van
Information Processing and Management 35 (1999) 541–557
0306-4573/99/$ - see front matter © 1999 Elsevier Science Ltd. All rights reserved.
PII: S0306-4573(98)00056-9
* Corresponding author. Tel.: +81-774-95-1325; fax: +81-774-95-1308.
E-mail address: [email protected] (H. Tanaka).
Rijsbergen, 1971; Croft, 1980; Griffiths, Robinson, & Willett, 1984; van Rijsbergen, 1986), all of
whom studied document clustering techniques for retrieval purposes. Traditionally,
interest in this field has mainly concentrated on whether the clustering of documents helps
increase the `effectiveness' and `efficiency' of retrieval.
Expectations of increased effectiveness have been based on the `cluster hypothesis' (Salton &
McGill, 1983), which asserts that closely associated documents tend to be relevant to the same
queries. According to a voluminous survey (Willett, 1988), however, there seems to be no
strong evidence that this hypothesis generally holds.
Expectations of increased efficiency have come from the fact that the retrieval space is
reduced by the clustering of documents. This is particularly attractive when the number of
documents is large. Full-search strategies such as the similarity-ranking linear search method
and the K-NN method (Weiss & Kulikowski, 1990) have often been reported to be superior to
clustering approaches in terms of retrieval accuracy, but they obviously require quite intensive
computations.
Recently, another aspect of clustering has drawn attention: clustering offers a
friendly user interface for a document browser. In the Scatter/Gather system (Cutting, Karger,
Pedersen, & Tukey, 1992; Rao et al., 1995; Hearst & Pedersen, 1996), a light clustering method
called `Buckshot' was proposed for summarizing collections on the fly. In this case, each
summary is used in user–machine interactions during document browsing sessions. We believe
this aspect to be as important as effectiveness and efficiency, since we are constantly faced
with an ever increasing number of unstructured documents from various sources.
The clustering algorithms used in the above research fall within the agglomerative hierarchic
group, such as single linkage, complete linkage, group average and Ward's method (Kaufman
& Rousseeuw, 1990; Everitt, 1993). One of the attractive features of this group is its automatic
nature; it produces a document hierarchy automatically without requiring any outside
information such as the number of clusters.
These algorithms, however, have quadratic space and time complexities. This
becomes problematic in the clustering of a large document collection. Our exploration with
Ward's method has revealed that the space complexity becomes more problematic than the
time complexity as the document collection size increases; the system begins
memory swapping upon hitting the memory limit. This is, of course, due to the quadratic
size of the proximity matrix which the agglomerative method uses¹.
In this paper, we present an efficient document clustering algorithm from the divisive
hierarchic family. Unfortunately, this family has been ``largely ignored'' (Kaufman &
Rousseeuw, 1990), ``far less popular'' (Everitt, 1993) and considered ``of less use for
applications in document retrieval'' (Willett, 1988).
We, however, propose a variation based on the `topic binder hypothesis'. This algorithm
uses a term frequency vector for each document instead of a proximity matrix. The
¹ For the high space complexity of Ward's method, an algorithm using the NN-chain (nearest neighbor chain) was proposed, which reduces the space complexity to O(N) with the time complexity unchanged at O(N²) (Murtagh, 1983).
vectors are far smaller than the proximity matrix and thus the method can be applied to a
larger document collection than can the agglomerative method.
The algorithm also has the following features, which the agglomerative clustering method
does not possess.
- The cluster hierarchy explicitly shows the structure of the document collection.
- Each cluster can be explicitly explained by a set of terms and their frequencies.
We present a document browser which takes advantage of these features.
In this paper, we first explain the `topic binder hypothesis' and then present the basic
algorithm. Next we show some implementation techniques for the task of document clustering
and test the algorithm's feasibility by clustering two large Japanese document collections whose
sizes are 83,099 and 14,701 documents. Finally, we present a document browser which uses this
clustering algorithm and conclude the paper.
2. Topic binder hypothesis
A document set which shares a term (or terms) appearing uniquely in the set treats the same
topic and thus the documents in the set resemble each other. We call such a term a `topic
binding term' or a `topic binder'. This is the `topic binder hypothesis' on which our clustering
algorithm is based.
Let us show an example. In a Japanese news database covering the early part of 1997, the
articles containing the term `Cerpa'² more than twice are very likely about the guerrilla attack
on the Japanese Embassy in Peru. Thus the term `Cerpa' strongly binds the articles to that
particular topic. Viewed differently, a topic binder has a strong `set discrimination
capability'.
Our clustering algorithm starts with the whole document collection, recursively finds the
strongest topic binder and its frequency threshold, and partitions the collection into two
subsets. In Sections 3 and 4 we explain the algorithm in detail.
3. Definitions
3.1. Data
Suppose there are N objects to be clustered and each object has M discrete measurements of
variables³. Following the terminology in Kaufman and Rousseeuw (1990), we call this N×M
matrix an objects-by-variables matrix. The proposed clustering algorithm takes this objects-by-variables matrix as input. An example matrix is shown in Table 1.
² Nestor Cerpa Cartolini: the leader of the Peruvian guerrilla group that attacked the Japanese Embassy in Peru at the end of 1996 and held hostages for 4 months.
³ Continuous measurements can also be used without changing the algorithm.
Here, we define some notations concerning an array⁴. The jth element of an array A is
represented by A[j]. The sth to eth elements of A are referred to as A[s, e]. The average value
of the elements in A[s, e] is represented by \overline{A[s, e]}.
If there are n objects, a variable i has n measurements, which are stored in an n-dimensional array X^(i). All of the measurements in X^(i) are then represented by X^(i)[1, n].
3.2. Variation of measurements
The variation, or sum of squares, of the measurements X^(i)[1, n] is given by

t(X^(i)[1, n]) = \sum_{j=1}^{n} ( X^(i)[j] - \overline{X^(i)[1, n]} )^2    (1)
If the measurements are distributed near their average, the variation will be small; we can thus
use this quantity to measure the coherence among the measurements.
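As an illustration (our own code, not from the paper), the variation of Eq. (1) can be computed directly; the measurements below are those of variable 1 in Table 1.

```python
def variation(xs):
    """Sum of squared deviations from the mean: t(X^(i)[1, n]) of Eq. (1)."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs)

# Variable 1 in Table 1 has measurements 1, 1, 2, 2, 3, 3 (mean 2),
# so the variation is 1 + 1 + 0 + 0 + 1 + 1 = 4.
print(variation([1, 1, 2, 2, 3, 3]))  # 4.0
```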
3.3. Binary partition of measurements
The binary partition of the measurements X^(i)[1, n] consists of the following procedures:

- Sort X^(i)[1, n] in ascending order.
- Divide X^(i)[1, n] into two subsets X^(i)[1, p] and X^(i)[p+1, n] using a partition point p with 1 ≤ p < n.
3.4. Variation reduction
When the measurements X^(i)[1, n] are divided into two subsets, the variation is reduced by
b(X^(i)[1, n], p), where
Table 1
Input data

Object   Variable 1   Variable 2
a        1            1
b        1            3
c        2            1
d        2            1
e        3            3
f        3            3
⁴ In this paper, we use the terms `collection', `array' and `set' interchangeably.
b(X^(i)[1, n], p) = t(X^(i)[1, n]) - t(X^(i)[1, p]) - t(X^(i)[p+1, n])    (2)
This quantity measures how much coherence is gained by the binary partition and thus
indicates how well the two sets are discriminated. As mentioned in Section 2, a topic binder's
capability is measured in terms of its set discrimination ability; we use the variation reduction
to gauge it.
The variation reduction is clearly non-negative and is bounded by the initial variation, so we
have
0 \le b(X^(i)[1, n], p) \le t(X^(i)[1, n])    (3)
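A direct, unoptimized search over all partition points follows Eq. (2) immediately; `best_partition` is our name, not the paper's.

```python
def variation(xs):
    """t(X[1, n]): sum of squared deviations from the mean (Eq. 1)."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs)

def best_partition(xs):
    """Try every split 1 <= p < n of the sorted measurements `xs` and
    return (best p, maximum variation reduction b) as in Eq. (2)."""
    total = variation(xs)
    best_p, best_b = None, -1.0
    for p in range(1, len(xs)):
        b = total - variation(xs[:p]) - variation(xs[p:])
        if b > best_b:
            best_p, best_b = p, b
    return best_p, best_b

# Variable 1 of Table 1: splitting [1, 1, 2, 2, 3, 3] after p = 2
# removes all within-subset variation on the low side.
print(best_partition([1, 1, 2, 2, 3, 3]))  # (2, 3.0)
```

The returned reduction always satisfies the bound of Eq. (3).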
4. Clustering algorithm
This section presents the basic form of the clustering algorithm. Some implementation
techniques which improve the document clustering efficiency will be introduced in Section 5.
4.1. Basic algorithm
Here again, we assume that there are N objects, each of which has M variables with their
measurements. We thus have an N×M matrix as input.
Begin
Set n = N.
Make the initial cluster with the n objects.
1. Stop condition check
Return if n ≤ C, where C is the given maximum cluster size.
2. Optimal variable selection
For each variable i (1 ≤ i ≤ M), do the following.
(a) Sort X^(i)[1, n] in ascending order.
(b) Try all binary partitions of X^(i)[1, n] and record each variation reduction
b(X^(i)[1, n], p), where 1 ≤ p < n.
3. Object partition and tree generation
Find the variable i_b and the partition point p_b^(i_b) that give the maximum variation
reduction.
Return if the maximum variation reduction is zero. Otherwise, set the partition
threshold at (X^(i_b)[p_b^(i_b)] + X^(i_b)[p_b^(i_b) + 1]) / 2.
For each object, if the measurement for the variable i_b is greater than the partition
threshold, assign the object to group 1; otherwise, assign it to group 2. Spawn two child nodes
under the present node and assign group 1 to the left node and group 2 to the right
node.
4. Recursion
Repeat the process from 1 recursively with the object set in group 1 and then with the
object set in group 2.
End
We term i_b the optimal variable and p_b^(i_b) the optimal partition point.
The quantity b(X^(i)[1, n], p) for each p with 1 ≤ p < n is efficiently calculated with the set of
difference equations in Appendix A.
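The whole procedure can be sketched as follows (a naive implementation under our own naming; the paper's difference-equation speed-up is not applied here). The input is the objects-by-variables matrix as a list of rows.

```python
def variation(xs):
    """Sum of squared deviations from the mean (Eq. 1)."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs)

def divisive_cluster(rows, max_cluster_size=2):
    """Steps 1-4 of the basic algorithm; returns a binary tree of nested
    (left, right) tuples whose leaves are lists of objects (rows)."""
    n = len(rows)
    if n <= max_cluster_size:                         # 1. stop condition
        return rows
    best_b, best_i, threshold = 0.0, None, None
    for i in range(len(rows[0])):                     # 2. variable selection
        xs = sorted(row[i] for row in rows)
        total = variation(xs)
        for p in range(1, n):
            b = total - variation(xs[:p]) - variation(xs[p:])
            if b > best_b:
                best_b, best_i = b, i
                threshold = (xs[p - 1] + xs[p]) / 2
    if best_i is None:                                # zero reduction: stop
        return rows
    group1 = [r for r in rows if r[best_i] > threshold]    # 3. partition
    group2 = [r for r in rows if r[best_i] <= threshold]
    if not group1 or not group2:                      # degenerate split (ties)
        return rows
    return (divisive_cluster(group1, max_cluster_size),    # 4. recursion
            divisive_cluster(group2, max_cluster_size))

# On the data of Table 1, the first split selects variable 2 with
# threshold 2, separating objects b, e, f from a, c, d.
print(divisive_cluster([[1, 1], [1, 3], [2, 1], [2, 1], [3, 3], [3, 3]]))
```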
4.2. Theoretical complexity
The proposed clustering algorithm has the following space and time complexities.
- Space complexity: the agglomerative hierarchic clustering algorithms take as input a
proximity (or dissimilarity) matrix, whose space complexity is O(N²) for N objects. Our
algorithm uses an objects-by-variables matrix. The number of variables M is usually
constant, so we obtain a space complexity of O(N). As a consequence, our algorithm can
be applied to a larger object set than can the agglomerative methods.
- Time complexity: the time complexity of our algorithm varies depending on the shape of the
obtained tree, and we evaluate two extreme cases in terms of the number of partition tests.
In this evaluation, we assume the partitioning continues until the cluster size becomes less than
or equal to 2.
First, we assume a perfect binary tree. In this case, the algorithm first tries M(N−1)
partitions, then M·2·(N/2−1) partitions and so on, yielding a total of M(N log₂ N − 3N/2 + 1) partition tests. Since M is constant, we obtain the time complexity O(N log₂ N). The
second case assumes that one object is split off at each partition. In this case, the jth
(1 ≤ j ≤ N−2) partition requires (N − j − 1)·M partition tests, yielding the complexity O(N²).
We assume that the practical complexity lies between these two extreme cases, O(N log₂ N) and
O(N²), and is therefore lower than the time complexity of the agglomerative methods.
If we applied a data sampling method before the clustering, we could obtain an even lower time
complexity with an agglomerative method (Cutting et al., 1992). However, this method would
not produce coherent results, because the cluster hierarchy obtained would depend on the
sampled data (Fig. 1).
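For reference, the perfect-binary-tree total M(N log₂ N − 3N/2 + 1) quoted above can be derived by summing partition tests level by level (our derivation, assuming N is a power of two and that clusters of size 2 are not split further):

```latex
\sum_{k=0}^{\log_2 N - 2} M \cdot 2^k \left( \frac{N}{2^k} - 1 \right)
  = M \left[ N (\log_2 N - 1) - \left( 2^{\log_2 N - 1} - 1 \right) \right]
  = M \left( N \log_2 N - \frac{3N}{2} + 1 \right)
```

Level k holds 2^k clusters of size N/2^k, each needing M(N/2^k − 1) partition tests; the deepest partitioned level consists of clusters of size 4.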
5. Efficient implementation for document clustering
Let us consider a document clustering task. We have a collection of documents (size N) and
we take all terms appearing in the collection as variables (size M) and their frequencies as
measurements. In this situation, the naive implementation of the algorithm in Section 4 is not
appropriate.
The N×M matrix for this task becomes quite large. In fact, the number of terms M grows
rapidly with the collection size N until N reaches a nontrivial figure. The objects-by-variables
matrix then becomes far larger than the proximity matrix, making our algorithm
unattractive in terms of both memory consumption and speed.
The memory required for storing the matrix can be reduced effectively by taking
advantage of the matrix's sparseness. We use a vector that stores only the nonzero
measurements for each document; the N×M matrix can be recovered on the fly from
such vectors.
There are also two techniques for reducing the high time complexity.
5.1. Use of histogram-style array
The variable i (a term) initially has the measurements X^(i)[1, N] (term frequencies across the whole
collection), and (N−1) partition tests are done to evaluate the variable i in the optimal variable
selection. Thus, N governs the number of partition tests.
This number can be greatly reduced for the document clustering task because (1) the
measurements in X^(i)[1, N] are distributed within a rather limited range and (2) the best partition
point falls between different frequencies.
We therefore use a histogram-style array to store X^(i)[1, N]. This array is indexed by a
measurement value and stores that measurement's frequency as its value. The histogram array size K
is bounded by the number of distinct measurements in X^(i)[1, N], which is far smaller than N. In fact, in
the clustering experiment in Section 6, which treated 83,099 documents (this corresponds to N),
the average size of K was 2.4, with a maximum of 31.
With the histogram array, we can omit the sorting step in the variable selection
process. The variation reduction can still be calculated efficiently with a set of difference equations
similar to those in Appendix A.
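A sketch of this histogram-style evaluation (our own code and naming; the incremental left/right sums stand in for the difference equations, using the identity t = Σx² − (Σx)²/n):

```python
from collections import Counter

def best_partition_histogram(freqs):
    """Evaluate all binary partitions of one variable from a histogram-style
    array `freqs` (measurement value -> number of documents with that value).
    Only the K distinct values are scanned, not all n measurements.
    Returns (partition threshold, maximum variation reduction b)."""
    values = sorted(freqs)
    n = sum(freqs.values())
    total_sum = sum(v * c for v, c in freqs.items())
    total_sq = sum(v * v * c for v, c in freqs.items())
    total_var = total_sq - total_sum ** 2 / n        # t() via sum of squares
    best_thr, best_b = None, 0.0
    left_n = left_sum = left_sq = 0
    for v, next_v in zip(values, values[1:]):        # split between values
        c = freqs[v]
        left_n, left_sum, left_sq = left_n + c, left_sum + v * c, left_sq + v * v * c
        left_var = left_sq - left_sum ** 2 / left_n
        right_n = n - left_n
        right_var = (total_sq - left_sq) - (total_sum - left_sum) ** 2 / right_n
        b = total_var - left_var - right_var
        if b > best_b:
            best_thr, best_b = (v + next_v) / 2, b
    return best_thr, best_b

# The same variable as in Table 1, now as a histogram with K = 3 entries:
print(best_partition_histogram(Counter([1, 1, 2, 2, 3, 3])))  # (1.5, 3.0)
```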
Fig. 1. Dendrogram for Table 1.
5.2. Variable skip rule
Although the practical size of N is reduced drastically, we still face a large M. However,
there is a simple way to reduce this size: we do not have to evaluate every variable i (1 ≤ i ≤ M)
when searching for the maximum b(X^(i)[1, n], p). Here, we call the set of all M variables (terms)
the term list.
We can skip some variables in the search. To do this, we sort the term list in descending
order of each term's variation before clustering, so that

t(X^(i)[1, N]) \ge t(X^(i+1)[1, N])    (4)
We then evaluate the variables according to the order in the sorted term list. Here, we can
prove that the following rule holds.
Variable skip rule.
Suppose the best partition point for the variable i (with measurements X^(i)[1, n]) is obtained
as p_b^(i). If

t(X^(i+1)[1, N]) \le b(X^(i)[1, n], p_b^(i))    (5)

holds, no further evaluation is necessary for the variables j (i < j ≤ M). The best solution obtained
through variable i is the optimal one.
We omit the proof here because of space limitations, but it can be shown with some simple
derivations from Eq. (3).
The variable skip rule takes advantage of the variation gap between two adjacent terms. The
gap tends to be large in the early partitioning stages, where the document set size n is large. The rule
therefore reduces the extent of the most computation-intensive part of the partitioning.
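A sketch of how the rule prunes the scan, with our own helper names. `collection_variations` holds t(X^(i)[1, N]) for the sorted term list; since a term's variation on a subcluster cannot exceed its collection-level variation, and b is bounded by the variation (Eq. 3), the scan can stop once the next term's collection variation cannot beat the best reduction found so far:

```python
def variation(xs):
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs)

def best_reduction(xs):
    """Maximum variation reduction b over all splits of one variable."""
    xs = sorted(xs)
    total = variation(xs)
    return max(total - variation(xs[:p]) - variation(xs[p:])
               for p in range(1, len(xs)))

def select_variable(columns, collection_variations):
    """`columns`: the current cluster's measurements per term, ordered so
    that `collection_variations` is non-increasing (Eq. 4).  Returns the
    index of the optimal variable and its variation reduction."""
    best_i, best_b = None, 0.0
    for i, xs in enumerate(columns):
        if collection_variations[i] <= best_b:
            break          # variable skip rule: no later term can win
        b = best_reduction(xs)
        if b > best_b:
            best_i, best_b = i, b
    return best_i, best_b

# Term 0 already yields b = 4; term 1's collection variation (3) cannot
# exceed that, so terms 1 and 2 are skipped entirely.
print(select_variable([[1, 1, 3, 3], [1, 2, 2, 3], [2, 2, 2, 2]], [10, 3, 0]))  # (0, 4.0)
```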
The effect of using the histogram array of Section 5.1 and the variable skip rule in the
optimal variable selection process is data-dependent. We show their effects empirically in
Section 6.
6. Experiment
6.1. Document collection
We collected a number of scripts of Japanese news articles compiled by NHK (the Japan
Broadcasting Corporation). According to a survey by Kumano (Kumano, Tanaka, Kim, &
Uratani, 1996), an average NHK news article contains 5.2 sentences and 88.9 Japanese
characters. Some of the Japanese scripts had English translations for bilingual broadcasts.
We made two document collections from the collected scripts.
- Collection 1: articles covering the period from March 1995 to February 1997. This collection
contained 83,099 articles.
- Collection 2: articles in collection 1 that were accompanied by English translations. This
collection contained 14,701 articles.
We analyzed the two collections with our in-house morphological analyzer allowing nouns of
22 types to be keywords and made a term list for each collection. Each term list included
nouns of 17 types. Table 2 shows the most frequent parts of speech for each collection.
The `Kata-unk' in Table 2 means `unknown word written in Katakana form'. Most
words in this category were non-Japanese persons' names and non-Japanese place names.
The `Verbal-noun' corresponds to nouns that denote actions, like negotiation, prohibition,
start, stop and so on.
These six noun types shown in Table 2 occupy more than 90% of each collection, and these
parts of speech will appear frequently in the hierarchy as topic binders.
6.2. Complexity evaluation
Table 3 shows the input information and the result statistics. We used the stop condition
`cluster size ≤ 20' in these experiments. This threshold was chosen for a practical reason.
Table 2
Dominant parts of speech in term lists

               Collection 1                          Collection 2
pos            freq.    accumulation (%)   pos            freq.    accumulation (%)
Kata-unk       21,353   26.0               normal noun    11,066   34.3
Normal noun    20,367   50.9               Kata-unk        6,476   54.6
Place-name      9,561   62.5               verbal-noun     4,375   68.2
Person's name   8,907   73.4               place-name      2,928   77.3
Family-name     8,270   83.5               family-name     2,281   84.4
Verbal-noun     7,389   92.5               person's name   1,963   90.5
*               *       *                  *               *       *
Total          81,995   100                total          32,138   100
Table 3
Result statistics

Collection           1             2
Size                 83,099        14,701
Term list            81,995        32,138
Effective terms      14,348        4,217
Partition tests      3.31×10⁷      4.61×10⁶
Comparisons (Ward)   9.56×10¹³     5.29×10¹¹
Execution time       50 h 49 min   3 h 11 min
We applied this algorithm to a document browser, which will be described in Section 7. An
exceedingly small cluster size, like `2', would result in too many leaves, which would greatly degrade
the readability of the tree and the browser. We judged that a cluster size of 20 is appropriate
here⁵. The row `Term list' shows the size of the term list, and the row `Effective terms' shows
the average number of terms (variables) actually evaluated in the clustering, which indicates the effect
of the variable skip rule.
By comparing `Term list' and `Effective terms', we observe that the rule reduced the
effective size of the term list to about 17% for collection 1 and 13% for collection 2.
The row `Partition tests' shows the total number of partition tests. The row `Comparisons
(Ward)' shows the number of elements in the proximity matrix that would have been evaluated
if an agglomerative clustering algorithm like Ward's method were used⁶.
The large difference between `Partition tests' and `Comparisons (Ward)' indicates the
superiority of our algorithm over Ward's method in terms of time
complexity. In fact, Ward's method, implemented at the same coding level as our
algorithm, could not finish processing collection 2: it took about 5 h to calculate the initial
proximity matrix, and the clustering had run for more than 90 h before we decided to stop
it.
In addition, Ward's method required about 800 MB to cluster collection 2, whereas the
proposed method required only about 260 MB to cluster the far larger collection 1. These
experiments therefore confirmed our method's superiority over the agglomerative method in
terms of speed and memory consumption.
6.3. Qualitative evaluation
Fig. 2 shows a part of the hierarchy obtained with collection 2. Each node in the hierarchy
has a topic binder and its partition threshold. All articles containing the topic binder above the
threshold are grouped in the left branch. The remaining articles are grouped in the right
branch and are tested by the next topic binder. As a result, the algorithm produces a right-
branching and therefore a right-deep tree. Table 4 lists the English titles for the articles
clustered in leaves A, B and C in Fig. 2. Similar articles are grouped properly in Table 4.
We observed that similar titles were properly clustered, especially in the shallow left leaves of
the tree, where documents were positively clustered as containing the topic binders above the
threshold value. These observations indicate that the topic binder hypothesis works
reasonably well. To further verify the hypothesis and the quality of the hierarchy, we should
conduct formal evaluations. A simple comparison between our hierarchy and those obtained
by other methods would be informative, since it could reveal the differences objectively.
⁵ Changing this condition to `cluster size ≤ 2' increased the computation time by 15% for collection 1 and 19% for collection 2. This increase was not very big, since the additional clustering at each leaf required the evaluation of very few partition points (maximum 20).
⁶ We assumed a naive implementation. Such an implementation requires i(i−1)/2 comparisons for selecting two clusters (objects) from i objects. Since i varies from N to 1, the total number of comparisons becomes (N−1)N(N+1)/6.
The absolute quality of our cluster hierarchy, on the other hand, is hard to evaluate. An
evaluation in terms of an application seems best. We can thereby evaluate the cluster hierarchy
by means of the application's goodness. The news article browser in the next section took
advantage of our clustering hierarchy, and this application can be used for such an evaluation.
7. Application to document browser
7.1. Features of cluster hierarchy
The cluster hierarchy obtained with our algorithm has the following features.
Fig. 2. Part of the cluster hierarchy for collection 2.

- Collection structure revelation. The topic binders on the right-most branch in Fig. 2 are
considered to be the `main keywords' of the whole collection. The first topic binder
Table 4
Cluster example

Car (thres=2.5)              Aviation (thres=2.0)   Semiconductor (thres=3.0)
Japan-US: auto               Japan US aviation      Japan-US (semiconductor)
Trade Ministers Canada, EU   US-JAL-sanction        US semiconductor
Brittan-Hashimoto            Japan-US (AIR)         Japan-EU
Japan-US auto hearing        film-aviation          US-Japan-chips
Auto dispute                 Kansai                 Federal semiconductor
Japan-US auto talks          Japan US aviation      Japan-US semiconductor
represents the most conspicuous topic in the entire collection, and articles belonging to that
topic are grouped under the left branch. The remaining collection, whose size is usually far
larger than that of the collection belonging to the first topic, is grouped under the right branch. The
remainder still contains many topics, so the best topic binder there will be the second
most conspicuous topic in the entire collection. Through such recursive processing along the right-most
branch, the topic binders there form a list of keywords that roughly partitions the
whole document collection. The list therefore shows us the overall topical structure of the
entire document collection. The topic binders on the left branches, by contrast, are `sub-keywords'
that partition the collection under a main keyword in detail and thus show the
micro-structure.
- Cluster explanation by individual words. We can understand each cluster's (leaf's) feature
from the conjunction of inequalities of topic binders and their partition thresholds. For example,
the feature of leaf A in Fig. 2 is given by (Japan > 2.5 ∩ America > 2.5 ∩ Japan US > 2.5 ∩
car > 2.5). We can easily guess the topic of leaf A from these inequalities.
These two features are unique to our clustering algorithm and are not available with the
agglomerative method. They are useful, as one example, for constructing a document browser.
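As an illustration (hypothetical code, not part of the system), such a conjunction of inequalities can serve directly as a membership test for a leaf:

```python
# Hypothetical sketch: a leaf is described by the topic binders and partition
# thresholds accumulated along the left branches leading to it.
def leaf_matches(doc_freqs, constraints):
    """doc_freqs: term -> frequency in the document.
    constraints: (term, threshold) pairs; the document belongs to the leaf
    iff every binder's frequency exceeds its threshold."""
    return all(doc_freqs.get(term, 0) > thr for term, thr in constraints)

# The feature of leaf A in Fig. 2, as read off above:
leaf_a = [("Japan", 2.5), ("America", 2.5), ("Japan US", 2.5), ("car", 2.5)]
print(leaf_matches({"Japan": 4, "America": 3, "Japan US": 3, "car": 5}, leaf_a))  # True
```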
7.2. Browser
We have applied our clustering algorithm to an article browser in a translation aid system
(Kumano, Tanaka, Uratani, & Ehara, 1997). The system consists of (1) a voluminous database
that contains Japanese articles with English translations, (2) a flexible expression retrieval
system and (3) a document browser. The system helps users translate Japanese news articles
into English by showing past translation examples.
Those who want to know how to translate specific Japanese vocabulary and
expressions can use the similarity-based expression retrieval system (Tanaka, 1997) to find
their English translations in the database.
The retrieval part basically uses a keyword retrieval method. The system first extracts
keywords from a Japanese input expression with a morphological analyzer and then retrieves
sentences containing the keywords under an AND condition. Here, similarity is measured by
the number of keywords the input and a sentence have in common, and the results are
presented in descending order of similarity.
We found, however, that such a simple use of this AND retrieval technique resulted in many
spurious search results. The main reason is rooted in the nature of the Japanese news articles
in our database. The sentences in the database are long (about 90 Japanese characters per
sentence) and an identical keyword often appears at several positions even in a single sentence.
Such multiple keywords became noise in the retrieval process.
To solve this problem, we added two constraints to the search condition and the similarity
measure: the word order and the positions of the keywords. The order of the input keywords is
preserved during the search, and the difference in keyword positions is also taken into account.
Although these are simple heuristics, we found them quite robust and effective.
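A minimal sketch of the word-order constraint (our own formulation; the position-difference weighting is omitted): the input keywords must occur in the sentence in the same order, i.e. as a subsequence.

```python
def ordered_match(keywords, sentence_tokens):
    """True iff `keywords` appear in `sentence_tokens` in the given order
    (as a subsequence); a plain AND search would ignore the order."""
    tokens = iter(sentence_tokens)
    return all(k in tokens for k in keywords)

sentence = ["Japan", "US", "auto", "talks", "resume"]
print(ordered_match(["auto", "talks"], sentence))   # True
print(ordered_match(["talks", "auto"], sentence))   # False
```

Testing membership against a single iterator consumes it, which is what enforces the ordering.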
In contrast to the expression retrieval system, the browser o�ers general translation
knowledge for each news topic. Suppose that we have a user who is interested in the general
translation strategy for articles on the topic `semiconductor dispute'; this can be learned by
looking at the Japanese and English articles in leaf C in Fig. 2. We have many topics that are
almost always translated in the same way⁷; such topic-based article browsing will be
informative to translators, especially to trainees in news translation.
The cluster hierarchy obtained with our algorithm can be directly applied to such an article
browser, since each node holds a topic binder and the corresponding articles. A user may
therefore browse the whole article set by specifying nodes in the tree. Based on this notion,
we have constructed an article browser by using the Japanese articles in collection 2 and their
English translations.
The hierarchy, however, is a very large binary tree, and it is difficult to present all its topic
binders to a user. We therefore employ the following strategies, taking advantage of our
hierarchy's features.
- Folded tree: at the beginning of a browsing session, the system presents the user with the
nodes on the right-most edge and their direct left child nodes. Fig. 3 shows the actual screen
image of an initial browsing session. As mentioned before, the nodes on the right-most
edge show the overall structure of the whole article collection. The user is therefore able to
grasp the general content of the collection.
- Sub-tree expansion: when a user finds an interesting topic binder node in the folded tree, he
or she clicks the node. The system then expands the tree under the clicked node and shows
the details of the hierarchy. At this time, the main keywords (topic binders on the right-most
edge) are left unerased so that the user does not get lost in the article collection. An
expanded tree is shown in Fig. 4.
In the middle section of the window, article titles under a specified topic binder are shown. The
left window displays the article titles corresponding to the left branch of the specified topic
binder, and the right window displays the titles in the right branch. With these titles, the ancestor
(upper) topic binders and the main keywords, a user can confirm his or her position in the
collection and freely navigate through it.
When a user clicks on a title, another window, containing the Japanese article body and the
English article body, appears, and the user can read and compare them.
The browser we constructed with collection 2 contained 14,701 Japanese articles. The
number of main keywords in this hierarchy reached 416. This figure seems too large for
a set of main keywords. However, the first 87 main keywords covered about 90%
(13,222) of the articles. The remaining 10% of the articles were on rather isolated topics or were very
short notices; they did not provide useful translation examples, and we could virtually ignore
them. We therefore consider having 87 main keywords for 13,222 articles acceptable.
The cluster tree for collection 1 (83,099 articles) produced 690 main keywords but, again, the first
119 main keywords covered 90% of the total collection. This figure is also acceptable
considering the collection size⁸.
⁷ To mention a few such topics: stock market reports, weather forecasts, traffic accidents, monthly reviews of car and air conditioner sales.
⁸ We do not use this collection for the browser, since most of its articles do not have English translations.
For some purposes, ignoring 10% of the document set (the remainder) will not be
permissible. Each document must be placed at a proper position in a tree. We can think of two
possible ways to handle this situation.
One is to treat the remainder as one group under the main keyword `miscellaneous' and
build a subtree from it with a different clustering method. Clustering methods based on a
proximity matrix can be used here, since the collection size is small enough. These methods
will have a better chance of producing a balanced tree than ours, since they
measure similarity with a word vector instead of a single word. The nodes in such a subtree
will carry a combination of several important keywords (extracted from the word vectors of each
cluster) instead of one topic binder.
The other method is to spread the remainder out over the main part of the tree. This could be done by comparing the word vectors between each document in the remainder and each cluster (leaf). The obvious drawback of this method is that it violates the topic binder's threshold rule, which would lead to `contamination' of the clusters and could confuse someone using the browser. Some consideration of document presentation would be required.
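The second approach can be sketched as a nearest-cluster assignment by cosine similarity between word vectors. This is only an illustration of the idea, not the authors' implementation; all function names are ours, and documents and clusters are assumed to be represented as plain term-frequency dictionaries.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors (dicts)."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    nu = sqrt(sum(x * x for x in u.values()))
    nv = sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def spread_remainder(remainder, clusters):
    """Attach each leftover document to the most similar leaf cluster.

    `remainder` is a list of (doc_id, vector) pairs; `clusters` maps a
    leaf label to its aggregate word vector.  Note that this deliberately
    ignores the topic binder's frequency threshold, which is exactly the
    `contamination' risk discussed above.
    """
    assignment = {}
    for doc_id, vec in remainder:
        best = max(clusters, key=lambda c: cosine(vec, clusters[c]))
        assignment[doc_id] = best
    return assignment
```

For example, a short stock-market notice left in the remainder would be attached to the finance-related leaf whose aggregate vector it most resembles, even if it never reaches that leaf's topic binder threshold.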
The problem of efficiently visualizing a gigantic tree has recently been gathering attention, since this kind of structure appears in many kinds of data and data sizes keep growing. Some general techniques for displaying gigantic trees, such as the `hyperbolic browser' (Lamping, Rao, & Pirolli, 1995) and the `cone tree' (Robertson, Mackinlay, & Card, 1991), have been proposed, and their superiority in utilizing display space has been reported.
In our visualization, we used a kind of `pruning and growing' technique (Robertson et al., 1991) that simply hides and expands the substructures of a tree in accordance with the user's operations. This was effective because the nodes in our tree have different degrees of importance in showing the collection structure: we could take advantage of the important nodes (main keywords) to indicate the relative position of each node.
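The pruning-and-growing idea can be sketched as follows; the class and function names are ours, not from the paper. Each node carries an expanded flag, main-keyword nodes start expanded so the collection structure is visible at once, and a traversal lists only the labels a user currently sees.

```python
class TreeNode:
    """A node in the cluster tree; `main` marks a main-keyword node."""
    def __init__(self, label, main=False, children=None):
        self.label = label
        self.main = main
        self.children = children or []
        self.expanded = main  # main keywords start expanded, the rest folded

    def toggle(self):
        """Prune (hide) or grow (show) this node's substructure."""
        self.expanded = not self.expanded

def visible_labels(node, depth=0, out=None):
    """Collect the labels a user would currently see, depth-first,
    indenting by tree depth."""
    if out is None:
        out = []
    out.append("  " * depth + node.label)
    if node.expanded:
        for child in node.children:
            visible_labels(child, depth + 1, out)
    return out
```

Clicking a node in the browser would call `toggle` and redraw via `visible_labels`, which is how a large tree stays scannable: only the main keywords and the branches the user has grown are on screen.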
Fig. 3. Folded tree.
However, if the tree grows much larger, a more sophisticated display technique will become desirable, since a user would otherwise have to scroll through the display many times even to scan the main keywords. This is one of our future work topics. We are also planning an evaluation of the browser, which is another important future task.
We are now preparing to put this browser into practical use at a broadcasting station. The intended users are news translators and trainees in news translation. As mentioned at the beginning of this section, we expect them to effectively learn the topic-dependent translation knowledge.
8. Conclusions
We have proposed a fast clustering algorithm based on the topic binder hypothesis. When applied to a document clustering task, it produces a classification tree in which each internal node represents a `topic binding term', the two edges spanning downward test the topic binding term's `frequency threshold', and the leaves hold documents.
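The tree described here can be sketched as a simple routing structure. This is a minimal illustration under our own naming conventions, not the authors' code: an internal node is a (term, threshold, low-branch, high-branch) tuple, a leaf is a list of documents, and routing compares the document's frequency of the topic binding term against the threshold.

```python
def classify(doc_freq, node):
    """Route a document down the classification tree.

    `doc_freq` maps terms to their frequencies in the document.  At each
    internal node (term, threshold, low, high) the document follows the
    high branch when it contains the topic binding term at least
    `threshold` times, and the low branch otherwise; a leaf (a list) is
    returned as the document's cluster.
    """
    while isinstance(node, tuple):
        term, threshold, low, high = node
        node = high if doc_freq.get(term, 0) >= threshold else low
    return node
```

Because every edge is just a term-frequency test, each cluster is directly explained by the conjunction of tests on the path from the root, which is the property the browser exploits.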
Our method requires a relatively small amount of memory since it uses term vectors instead of a proximity matrix. We evaluated the time complexity of our method both theoretically and empirically; these evaluations revealed the superiority of our method over the conventional agglomerative methods.
We conducted a preliminary investigation of the clustered hierarchy and received the impression that our method performed reasonable clustering. We would like to conduct more formal evaluations of clustering quality.
Finally, we introduced a news article browser which directly uses the hierarchy obtained with our clustering algorithm. This browser takes advantage of two features of the hierarchy: (1) the hierarchy explicitly reveals the document collection structure and (2) each cluster can be directly explained with a combination of keywords and their frequency thresholds.
Fig. 4. Expanded tree.
As we mentioned at the beginning, the family of divisive hierarchic clustering algorithms has attracted less attention than the agglomerative hierarchic family. We believe the work reported here has demonstrated that the former family is useful.
Acknowledgements
The work described here was mainly done while the first author was with NHK Science and Technical Research Laboratories. The authors would like to thank the President of ATR Interpreting Telecommunications Research Laboratories, Dr. Seichi Yamamoto, and the Department Head, Dr. Akio Yokoo, for allowing us to publish this jointly authored paper. The authors would also like to thank the Director-General of NHK Science and Technical Research Laboratories, Dr. Takehiko Yoshino, the Deputy Director-General, Dr. Kazumasa Enami, and the Director, Dr. Haruo Isono, for their continuous support of our work.
Appendix A. Difference equation

The variation reduction is equal to the variation between sets, and we obtain

b(\bar{X}[1,n], p) = p\,(\bar{X}[1,n] - \bar{X}[1,p])^2 + (n-p)\,(\bar{X}[1,n] - \bar{X}[p+1,n])^2. \quad (A.1)

Here, we omitted the suffix i. The term \bar{X}[1,n] is constant and can be calculated before the partition test. The terms \bar{X}[1,p] and \bar{X}[p+1,n], where 1 \le p < n, can be calculated with the following difference equations:

\bar{X}[1,p] = \frac{\bar{X}[1,p-1]\,(p-1) + X[p]}{p} \quad (A.2)

\bar{X}[1,0] = 0 \quad (A.3)

\bar{X}[p+1,n] = \frac{\bar{X}[p,n]\,(n-(p-1)) - X[p]}{n-p} \quad (A.4)
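The difference equations above can be exercised numerically. The following sketch (function names are ours) computes the prefix means \bar{X}[1,p] via Eqs. (A.2)-(A.3), the suffix means \bar{X}[p+1,n] via Eq. (A.4), and the variation reduction b via Eq. (A.1); each mean is obtained in constant time from its predecessor, which is what makes the partition test fast.

```python
def prefix_means(x):
    """X-bar[1,p] for p = 0..n via Eq. (A.2), seeded by Eq. (A.3).
    x is 1-indexed conceptually: x[p-1] holds X[p]."""
    n = len(x)
    means = [0.0]  # X-bar[1,0] = 0   (A.3)
    for p in range(1, n + 1):
        means.append((means[p - 1] * (p - 1) + x[p - 1]) / p)  # (A.2)
    return means

def suffix_means(x):
    """X-bar[p+1,n] for p = 0..n-1 via Eq. (A.4), seeded with X-bar[1,n]."""
    n = len(x)
    means = [sum(x) / n]  # p = 0: X-bar[1,n]
    for p in range(1, n):
        prev = means[p - 1]  # X-bar[p,n]
        means.append((prev * (n - (p - 1)) - x[p - 1]) / (n - p))  # (A.4)
    return means

def variation_reduction(x, p):
    """b(X-bar[1,n], p) from Eq. (A.1)."""
    n = len(x)
    total = sum(x) / n          # X-bar[1,n], constant over all p
    left = prefix_means(x)[p]   # X-bar[1,p]
    right = suffix_means(x)[p]  # X-bar[p+1,n]
    return p * (total - left) ** 2 + (n - p) * (total - right) ** 2
```

As a sanity check, for x = [1, 2, 3, 4] and p = 2 the two halves have means 1.5 and 3.5 around the overall mean 2.5, so b = 2(1)^2 + 2(1)^2 = 4, agreeing with the between-set variation computed directly.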
References

Croft, W. B. (1980). A model of cluster searching based on classification. Information Systems, 5, 189–195.
Cutting, D. R., Karger, D. R., Pedersen, J. O., & Tukey, J. W. (1992). Scatter/Gather: a cluster-based approach to browsing large document collections. In Proc. SIGIR 92 (pp. 318–329).
Everitt, B. S. (1993). Cluster analysis. Arnold.
Griffiths, A., Robinson, L. A., & Willett, P. (1984). Hierarchic agglomerative clustering methods for automatic document classification. Journal of Documentation, 40(3), 175–205.
Hearst, M. A., & Pedersen, J. O. (1996). Reexamining the cluster hypothesis: Scatter/Gather on retrieval results. In Proc. SIGIR 96 (pp. 76–83).
Jardine, N., & van Rijsbergen, C. J. (1971). The use of hierarchic clustering in information retrieval. Information Storage and Retrieval, 7, 217–240.
Kaufman, L., & Rousseeuw, P. J. (1990). Finding groups in data. A Wiley-Interscience Publication.
Kumano, T., Tanaka, H., Kim, Y. B., & Uratani, N. (1996). Primary investigation on NHK's Japanese and English news database. In Proc. of the Second Annual Meeting of the Association for Natural Language Processing (pp. 41–44) (in Japanese).
Kumano, T., Tanaka, H., Uratani, N., & Ehara, T. (1997). Translation example browser for news articles. In Proc. of the Third Annual Meeting of the Association for Natural Language Processing (pp. 529–532) (in Japanese).
Lamping, J., Rao, R., & Pirolli, P. (1995). A focus+context technique based on hyperbolic geometry for visualizing large hierarchies. In Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (pp. 401–408).
Murtagh, F. (1983). A survey of recent advances in hierarchical clustering algorithms. The Computer Journal, 26(4), 354–359.
Rao, R., Pedersen, J. O., Hearst, M. A., Mackinlay, J. D., Card, S. K., Masinter, L., Halvorsen, P., & Robertson, G. G. (1995). Rich interaction in the digital library. Communications of the ACM, 38(4), 29–39.
Robertson, G., Mackinlay, J., & Card, S. (1991). Cone trees: animated 3D visualizations of hierarchical information. In Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (pp. 189–194).
Salton, G., & McGill, M. J. (1983). Introduction to modern information retrieval. McGraw-Hill.
Tanaka, H. (1997). An efficient way of gauging similarity between long Japanese expressions. IPSJ SIG Notes, 97(85), 69–74 (in Japanese).
van Rijsbergen, C. J. (1986). Further experiments with hierarchic clustering in document retrieval. Information Storage and Retrieval, 29(12), 1213–1228.
Weiss, S. M., & Kulikowski, C. (1990). Computer systems that learn. Morgan Kaufmann.
Willett, P. (1988). Recent trends in hierarchic document clustering: a critical review. Information Processing and Management, 24(5), 577–597.