an efficient document clustering algorithm and its application to a document browser

17

Upload: hideki-tanaka

Post on 16-Sep-2016

212 views

Category:

Documents


0 download

TRANSCRIPT

An efficient document clustering algorithm and its application to a document browser

Hideki Tanaka a,*, Tadashi Kumano b, Noriyoshi Uratani b, Terumasa Ehara b

a ATR Interpreting Telecommunications Research Laboratories, 2-2 Hikaridai, Seika-cho, Souraku-gun, Kyoto 619-0288, Japan
b NHK Science and Technical Research Laboratories, 1-10-11 Kinuta, Setagaya-ku, Tokyo 157-8510, Japan

* Corresponding author. Tel.: +81-774-95-1325; fax: +81-774-95-1308. E-mail address: [email protected] (H. Tanaka).

Abstract

We present an efficient document clustering algorithm that uses a term frequency vector for each document instead of a huge proximity matrix. The algorithm has the following features: (1) it requires a relatively small amount of memory and runs fast, (2) it produces a hierarchy in the form of a document classification tree and (3) the hierarchy obtained by the algorithm explicitly reveals a collection structure. We confirm these features and thus show the algorithm's feasibility through clustering experiments in which we use two collections of Japanese documents, the sizes of which are 83,099 and 14,701 documents. We also introduce an application of this algorithm to a document browser. This browser is used in our Japanese-to-English translation aid system. The browsing module of the system consists of a huge database of Japanese news articles and their English translations. The Japanese article collection is clustered into a hierarchy by our method. Since each node in the hierarchy corresponds to a topic in the collection, we can use the hierarchy to directly access articles by topic. A user can learn general translation knowledge of each topic by browsing the Japanese articles and their English translations. We also discuss techniques of presenting a large tree-formed hierarchy on a computer screen. © 1999 Elsevier Science Ltd. All rights reserved.

Keywords: Document clustering; Document retrieval; Automatic document organization

1. Motivation

Document clustering has long received keen attention from those concerned with document retrieval, and some of the many papers which have been published are those of Jardine and van Rijsbergen (1971), Croft (1980), Griffiths, Robinson, and Willet (1984) and van Rijsbergen (1986), all of whom studied document clustering techniques for retrieval purposes. Traditionally, interest in this field has mainly concentrated on whether the clustering of documents helps increase the `effectiveness' and `efficiency' of retrieval.

Expectations of increased effectiveness have been based on the `cluster hypothesis' (Salton & McGill, 1983), which asserts that closely associated documents tend to be relevant to the same queries. According to a voluminous survey (Willett, 1988), however, there seems to be no strong evidence that this hypothesis generally holds.

Expectations of increased efficiency have come from the fact that the retrieval space is reduced by the clustering of documents. This is particularly attractive when the number of documents is large. Full-search strategies such as the similarity-ranking linear search method and the k-NN method (Weiss & Kulikowski, 1990) have often been reported to be superior to clustering approaches in retrieval accuracy, but they obviously require quite intensive computations.

Recently, another aspect of clustering has been focused on: clustering offers a friendly user interface for a document browser. In the Scatter/Gather system (Cutting, Karger, Pedersen, & Tukey, 1992; Rao et al., 1995; Hearst & Pedersen, 1996), a light clustering method called `Buckshot' was proposed for summarizing collections on the fly. In this case, each summary is used in user–machine interactions during document browsing sessions. We believe this aspect to be as important as effectiveness and efficiency, since we are constantly faced with an ever increasing number of unstructured documents from various sources.

The clustering algorithms used in the above research fall within the agglomerative hierarchic group: single linkage, complete linkage, group average and Ward's method (Kaufman & Rousseeuw, 1990; Everitt, 1993). One of the attractive features of this group is its automatic nature; it produces a document hierarchy automatically, without requiring any outside information such as the number of clusters.

This group of algorithms, however, has quadratic space and time complexities. This becomes problematic in the clustering of a large document collection. Our exploration with Ward's method has revealed that the space complexity becomes more problematic than the time complexity as the document collection size is increased; the system begins memory swapping upon hitting the memory limit. This is, of course, due to the quadratic size of the proximity matrix which the agglomerative method uses [1].

In this paper, we present an efficient document clustering algorithm within the divisive hierarchic family. Unfortunately, this family has been ``largely ignored'' (Kaufman & Rousseeuw, 1990), called ``far less popular'' (Everitt, 1993) and considered ``of less use for applications in document retrieval'' (Willett, 1988).

We, however, propose a variation based on `the topic binder hypothesis'. This algorithm uses a term frequency vector for each document instead of a proximity matrix. The vectors are far smaller than the proximity matrix, and thus the method can be applied to a larger document collection than can the agglomerative method.

[1] For the high space complexity problem in Ward's method, an algorithm which uses the NN-chain (nearest-neighbor chain) was proposed; it reduced the space complexity to O(N) while the time complexity remained O(N^2) (Murtagh, 1983).

The algorithm also has the following features, which the agglomerative clustering method does not possess:

• The cluster hierarchy explicitly shows the structure of the document collection.
• Each cluster can be explicitly explained by a set of terms and their frequencies.

We thus show a document browser which takes advantage of these features.

In this paper, we first explain the `topic binder hypothesis' and then present the basic algorithm. Next, we show some implementation techniques for the task of document clustering and test the algorithm's feasibility by clustering two large Japanese document collections whose sizes are 83,099 and 14,701 documents. Finally, we present a document browser which uses this clustering algorithm and conclude the paper.

2. Topic binder hypothesis

A document set which shares a term (or terms) appearing uniquely in the set treats the same

topic and thus the documents in the set resemble each other. We call such a term a `topic

binding term' or a `topic binder'. This is the `topic binder hypothesis' on which our clustering

algorithm is based.

Let us show an example. In a Japanese news database covering the early part of 1997, the articles containing the term `Cerpa' [2] more than twice are very likely about the guerrilla attack on the Japanese Embassy in Peru. The term `Cerpa' thus strongly binds the articles to that particular topic. Viewed differently, the topic binder has a strong `set discrimination capability'.

[2] Nestor Cerpa Cartolini: the leader of the Peruvian guerrilla group that attacked the Japanese Embassy in Peru at the end of 1996 and held hostages for 4 months.

Our clustering algorithm starts with the whole document collection, recursively finds the strongest topic binder and its frequency threshold, and partitions the collection into two subsets. In Sections 3 and 4 we explain the algorithm in detail.

3. Definitions

3.1. Data

Suppose there are N objects to be clustered and each object has M discrete measurements of variables [3]. Following the terminology in Kaufman and Rousseeuw (1990), we call this N × M matrix an objects-by-variables matrix. The proposed clustering algorithm takes this objects-by-variables matrix as input. An example matrix is shown in Table 1.

Table 1
Input data

Object    Variable 1    Variable 2
a         1             1
b         1             3
c         2             1
d         2             1
e         3             3
f         3             3

[3] Continuous measurements can also be used without changing the algorithm.

Here, we define some notation concerning an array [4]. The jth element of an array A is represented by A[j]. The sth to eth elements of A are referred to by A[s, e]. The average value of A[s, e] is represented by Ā[s, e].

If there are n objects, a variable i will have n measurements, and they are stored in an n-dimensional array X^(i). All of the measurements in X^(i) will then be represented by X^(i)[1, n].

[4] In this paper, we use the terms `collection', `array' and `set' interchangeably.

3.2. Variation of measurements

The variation, or the sum of squares, of the measurements X^(i)[1, n] is given by

$$t(X^{(i)}[1, n]) = \sum_{j=1}^{n} \left( X^{(i)}[j] - \bar{X}^{(i)}[1, n] \right)^2. \quad (1)$$

If the measurements are distributed near their average, the variation will be small; we can thus use this quantity to measure the coherence among the measurements.
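As a concrete rendering of Eq. (1), here is a minimal sketch in Python (ours, not the authors' code):

```python
def variation(x):
    """t(X[1, n]) of Eq. (1): sum of squared deviations from the mean."""
    if len(x) < 2:
        return 0.0
    mean = sum(x) / len(x)
    return sum((v - mean) ** 2 for v in x)
```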

3.3. Binary partition of measurements

The binary partition of the measurements X^(i)[1, n] includes the following procedures:

• Sort X^(i)[1, n] in ascending order.
• Divide X^(i)[1, n] into two subsets X^(i)[1, p] and X^(i)[p+1, n] using p with 1 ≤ p < n.

3.4. Variation reduction

When the measurements X^(i)[1, n] are divided into two subsets, the variation is reduced by b(X^(i)[1, n], p), where

$$b(X^{(i)}[1, n], p) = t(X^{(i)}[1, n]) - t(X^{(i)}[1, p]) - t(X^{(i)}[p+1, n]). \quad (2)$$

This quantity measures how much coherence is obtained by the binary partition and thus

indicates how well the sets are discriminated. As mentioned in Section 2, a topic binder's

capability is measured in terms of the set discrimination ability; we use the variation reduction

to gauge it.

The variation reduction is clearly non-negative and is bounded by the initial variation, so we have

$$0 \le b(X^{(i)}[1, n], p) \le t(X^{(i)}[1, n]). \quad (3)$$
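A direct, unoptimized rendering of Eq. (2), reusing the variation sketch above (Appendix A gives the difference equations that make this sweep efficient):

```python
def variation_reduction(x_sorted, p):
    """b(X[1, n], p) of Eq. (2) for sorted measurements and 1 <= p < n."""
    return (variation(x_sorted)
            - variation(x_sorted[:p])
            - variation(x_sorted[p:]))

# Example: the best split of [1, 1, 2, 2, 3, 3] is after p = 2 (or p = 4).
x = sorted([1, 1, 2, 2, 3, 3])
best_p = max(range(1, len(x)), key=lambda p: variation_reduction(x, p))
```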

4. Clustering algorithm

This section presents the basic form of the clustering algorithm. Some implementation techniques which improve the document clustering efficiency will be introduced in Section 5.

4.1. Basic algorithm

Here again, we assume that there are N objects, each of which has M variables with their measurements. We thus have an N × M matrix for the input.

Begin
Set n = N.
Make the initial cluster with n objects.

1. Stop condition check
   Return if n ≤ C, where C is the given maximum cluster size.

2. Optimal variable selection
   For each variable i (1 ≤ i ≤ M), do the following:
   (a) Sort X^(i)[1, n] in ascending order.
   (b) Try all binary partitions on X^(i)[1, n] and record each variation reduction b(X^(i)[1, n], p), where 1 ≤ p < n.

3. Object partition and tree generation
   Find the variable i_b and the partition point p_b^(i_b) that give the maximum variation reduction.
   Return if the maximum variation reduction is zero. Otherwise, set the partition threshold at (X^(i_b)[p_b^(i_b)] + X^(i_b)[p_b^(i_b) + 1]) / 2.
   For each object, if the measurement for the variable i_b is greater than the partition threshold, assign the object to group 1; otherwise, to group 2. Spawn two child nodes under the present node and assign group 1 to the left node and group 2 to the right node.

4. Recursion
   Repeat the process from step 1 recursively with the object set in group 1 and then with the object set in group 2.

End

We term i_b the optimal variable and p_b^(i_b) the optimal partition point.

The quantity b(X^(i)[1, n], p) for each p within 1 ≤ p < n is efficiently calculated with the set of difference equations in Appendix A.
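Putting the four steps together, here is a naive sketch (ours; it reuses variation_reduction from the sketch in Section 3.4 rather than the difference equations, so it favors clarity over speed):

```python
def cluster(objects, C=20):
    """Recursive binary partitioning (the basic algorithm of Section 4.1).
    `objects` is a list of M-dimensional measurement vectors. Returns a
    leaf (a list of objects) or a node (variable, threshold, left, right)."""
    n = len(objects)
    if n <= C:                                   # 1. stop condition check
        return objects
    M = len(objects[0])
    best_b, best_i, best_thr = 0.0, None, None
    for i in range(M):                           # 2. optimal variable selection
        xs = sorted(o[i] for o in objects)
        for p in range(1, n):                    # try every binary partition
            if xs[p - 1] == xs[p]:
                continue  # best splits fall between distinct values (Section 5.1)
            b = variation_reduction(xs, p)
            if b > best_b:
                best_b, best_i = b, i
                best_thr = (xs[p - 1] + xs[p]) / 2.0
    if best_i is None:                           # maximum reduction is zero
        return objects
    group1 = [o for o in objects if o[best_i] > best_thr]   # left node
    group2 = [o for o in objects if o[best_i] <= best_thr]  # right node
    # 3-4. spawn two child nodes and recurse on each group
    return (best_i, best_thr, cluster(group1, C), cluster(group2, C))
```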

4.2. Theoretical complexity

The proposed clustering algorithm has the following space and time complexities.

• Space complexity: the agglomerative hierarchic clustering algorithms use as input a proximity (or dissimilarity) matrix whose space complexity is O(N^2) for N objects. Our algorithm uses an objects-by-variables matrix. The number of variables M is usually constant, and we obtain a space complexity of O(N). As a consequence, our algorithm can be applied to a larger object set than can the agglomerative methods.

• Time complexity: the time complexity of our algorithm varies depending on the shape of the obtained tree, and we evaluate two extreme cases in terms of the number of partition tests. In this evaluation, we assume the partition continues until the cluster size becomes less than or equal to 2. First, we assume a perfect binary tree. In this case, the algorithm first tries M × (N − 1) partitions, then M × 2 × (N/2 − 1) partitions and so on, yielding a total of M(N log_2 N − 3N/2 + 1) partition tests (a level-by-level check is sketched below). Since M is constant, we obtain the time complexity O(N log_2 N). The second case assumes that one object is cut out at each partition. In this case, the jth (1 ≤ j ≤ N − 2) partition tries (N − j − 1) × M partition tests, yielding the complexity O(N^2).
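As a check (our derivation, assuming N is a power of two): at level k there are 2^k clusters of size N/2^k, each requiring N/2^k − 1 partition tests per variable, and size-2 clusters are not split, so

$$\sum_{k=0}^{\log_2 N - 2} M \, 2^k \left( \frac{N}{2^k} - 1 \right) = M \sum_{k=0}^{\log_2 N - 2} \left( N - 2^k \right) = M \left[ N (\log_2 N - 1) - \left( \frac{N}{2} - 1 \right) \right] = M \left( N \log_2 N - \frac{3N}{2} + 1 \right).$$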

We assume that the practical complexity lies between these two extreme cases, O(N log_2 N) and O(N^2), and is therefore lower than the time complexity of the agglomerative methods.

If we applied a data sampling method before the clustering, we could obtain an even lower time complexity with an agglomerative method (Cutting et al., 1992). However, this method would not produce coherent results, because the cluster hierarchy obtained would depend on the sampled data (Fig. 1).

[Fig. 1. Dendrogram for Table 1.]

5. Efficient implementation for document clustering

Let us consider a document clustering task. We have a collection of documents (size N), and we take all terms appearing in the collection as variables (size M) and their frequencies as measurements. In this situation, the naive implementation of the algorithm in Section 4 is not appropriate.

The N × M matrix for this task becomes quite large. In fact, the number of terms M grows rapidly with the collection size N until N reaches a nontrivial figure. The objects-by-variables matrix then becomes far larger than the proximity matrix. This makes our algorithm unattractive in terms of memory consumption and also speed.

The memory required for storing the matrix can be reduced effectively by taking advantage of the sparseness of the matrix. We use a vector that stores only the nonzero measurements for each document. The N × M matrix can be recovered on the fly from such vectors.
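As a minimal sketch of this representation (ours, not from the paper), a per-document mapping stores only the nonzero term frequencies:

```python
from collections import Counter

def term_vector(tokens):
    """Sparse document vector: term -> frequency; zero entries are absent."""
    return Counter(tokens)

doc = term_vector(["Cerpa", "Peru", "embassy", "Cerpa"])
assert doc["Cerpa"] == 2 and doc["hostage"] == 0
# Reading a missing term yields 0, so the dense N x M objects-by-variables
# matrix never has to be materialized.
```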

There are two techniques for reducing the high time complexity.

5.1. Use of histogram-style array

The variable i (a term) initially has elements X^(i)[1, N] (term frequencies across the whole collection), and N − 1 partition tests are done to evaluate the variable i in the optimal variable selection. Thus, N governs the number of partition tests.

This number can be greatly reduced for the document clustering task because (1) the measurements in X^(i)[1, N] are distributed within a rather limited range and (2) the best partition point falls between different frequencies.

We therefore use a histogram-style array to store X^(i)[1, N]. As its index, this array has a measurement value, and as its value it has that measurement's count. The histogram array size K is bounded by the number of distinct measurement values in X^(i)[1, N], which is far smaller than N. In fact, in the clustering experiment in Section 6, which treated 83,099 documents (this corresponds to N), the average size of K was 2.4, with a maximum of 31.

With the use of the histogram array, we can omit the sorting step in the variable selection process. The variation reduction can be calculated efficiently with a set of difference equations similar to those in Appendix A.
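The paper's exact difference equations for this case are not reproduced here; the following sketch (ours) shows one equivalent O(K) scan over a histogram-style array, using the identity t(X) = Σx² − (Σx)²/n for the variation:

```python
def best_split_from_histogram(hist):
    """hist maps a frequency value to its count (the histogram-style array).
    Scans the K - 1 candidate split points between adjacent distinct values,
    keeping running sums so each step is O(1)."""
    values = sorted(hist)                          # K distinct values, K << n
    n = sum(hist.values())
    s1 = sum(c * v for v, c in hist.items())       # sum of measurements
    s2 = sum(c * v * v for v, c in hist.items())   # sum of squares
    t_all = s2 - s1 * s1 / n
    best_b, best_thr = 0.0, None
    n_lo = s1_lo = s2_lo = 0
    for k in range(len(values) - 1):
        v, c = values[k], hist[values[k]]
        n_lo += c
        s1_lo += c * v
        s2_lo += c * v * v
        t_lo = s2_lo - s1_lo * s1_lo / n_lo
        n_hi, s1_hi, s2_hi = n - n_lo, s1 - s1_lo, s2 - s2_lo
        t_hi = s2_hi - s1_hi * s1_hi / n_hi
        b = t_all - t_lo - t_hi                    # Eq. (2)
        if b > best_b:
            best_b = b
            best_thr = (values[k] + values[k + 1]) / 2.0
    return best_b, best_thr
```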


5.2. Variable skip rule

Although the practical size of N is reduced drastically, we still face a large M. However, there is a simple way to reduce the size: we do not have to evaluate each variable i (1 ≤ i ≤ M) in searching for the maximum b(X^(i)[1, n], p). Here, we call the set of all M variables (terms) the term list.

We can skip some variables in the search. To do this, we sort the term list in descending order by each term's variation before clustering, so that

$$t(X^{(i)}[1, N]) \ge t(X^{(i+1)}[1, N]). \quad (4)$$

We then evaluate the variables according to the order in the sorted term list. Here, we can prove that the following rule holds.

Variable skip rule. Suppose the best partition point for the variable i (with measurements X^(i)[1, n]) is obtained as p_b^(i). If

$$t(X^{(i+1)}[1, N]) \le b(X^{(i)}[1, n], p_b^{(i)}) \quad (5)$$

holds, no further evaluation is necessary on variables j (i < j ≤ M). The best solution obtained through variable i is the optimal one.

Here, we omit the proof because of space limitations, but the rule can be proved with some simple derivations from Eq. (3).

The variable skip rule takes advantage of the variation gap between two adjacent terms. The gap tends to be large in the early partitioning stages, where the cluster size n is large. The rule therefore cuts down the most computation-intensive part of the partitioning.
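A sketch of the selection loop with the skip rule folded in (ours, reusing Counter and best_split_from_histogram from the sketches above). It stops against the best reduction found so far; this is a slightly stronger condition than Eq. (5) but remains sound, since every later b(X^(j)[1, n], p) is bounded by t(X^(j)[1, N]) ≤ t(X^(i+1)[1, N]):

```python
def select_optimal_variable(sorted_terms, cluster_freqs, total_variation):
    """sorted_terms: term ids pre-sorted by Eq. (4), descending t(X^(i)[1, N]);
    cluster_freqs[i]: frequencies of term i over the current cluster;
    total_variation[i]: t(X^(i)[1, N]) over the whole collection."""
    best_b, best_term, best_thr = 0.0, None, None
    for rank, i in enumerate(sorted_terms):
        b, thr = best_split_from_histogram(Counter(cluster_freqs[i]))
        if b > best_b:
            best_b, best_term, best_thr = b, i, thr
        # Variable skip rule: no later term can beat best_b.
        nxt = rank + 1
        if nxt < len(sorted_terms) and total_variation[sorted_terms[nxt]] <= best_b:
            break
    return best_term, best_thr
```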

The effect of using the histogram array of Section 5.1 and the variable skip rule in the optimal variable selection process is data-dependent. We will show their effects empirically in Section 6.

6. Experiment

6.1. Document collection

We collected a number of scripts of Japanese news articles compiled by NHK (the Japan

Broadcasting Corporation). According to a survey by Kumano (Kumano, Tanaka, Kim, &

Uratani, 1996), an average NHK news article contains 5.2 sentences and 88.9 Japanese

characters. Some of the Japanese scripts had English translations for bilingual broadcasts.

We made two document collections from the collected scripts.

• Collection 1: articles covering the period from March 1995 to February 1997. This collection contained 83,099 articles.
• Collection 2: articles in collection 1 that were accompanied by English translations. This collection contained 14,701 articles.

We analyzed the two collections with our in-house morphological analyzer, allowing nouns of 22 types to be keywords, and made a term list for each collection. Each term list included nouns of 17 types. Table 2 shows the most frequent parts of speech for each collection.

`Kata-unk' in Table 2 means `unknown word written in Katakana form'. Most words in this category were non-Japanese persons' names and non-Japanese place names. `Verbal-noun' corresponds to nouns that denote actions, like negotiation, prohibition, start, stop and so on.

The six noun types shown in Table 2 occupy more than 90% of each collection, and these parts of speech appear frequently in the hierarchy as topic binders.

6.2. Complexity evaluation

Table 3 shows the input information and the result statistics. We used the stop condition `cluster size ≤ 20' in these experiments. This threshold was determined for a practical reason.

Table 2
Dominant parts of speech in term lists

Collection 1                             Collection 2
pos            freq.   accum. (%)        pos            freq.   accum. (%)
Kata-unk       21,353  26.0              Normal noun    11,066  34.3
Normal noun    20,367  50.9              Kata-unk        6,476  54.6
Place-name      9,561  62.5              Verbal-noun     4,375  68.2
Person's name   8,907  73.4              Place-name      2,928  77.3
Family-name     8,270  83.5              Family-name     2,281  84.4
Verbal-noun     7,389  92.5              Person's name   1,963  90.5
...                                      ...
Total          81,995  100               Total          32,138  100

Table 3
Result statistics

                     Collection 1     Collection 2
Size                 83,099           14,701
Term list            81,995           32,138
Effective terms      14,348           4,217
Partition tests      3.31 × 10^7      4.61 × 10^6
Comparisons (Ward)   9.56 × 10^13     5.29 × 10^11
Execution time       50 h 49 min      3 h 11 min


We applied this algorithm to a document browser, which will be described in Section 7. An exceedingly small cluster size, like `2', will result in too many leaves, which will greatly degrade the readability of the tree and the browser. We judged that a cluster size of 20 is appropriate here [5]. The row `Term list' shows the size of the term list, and the row `Effective terms' shows the average number of terms (variables) evaluated in the clustering, which indicates the effect of the variable skip rule.

By comparing `Term list' and `Effective terms', we observed that the rule reduced the virtual size of the term list to about 17% for collection 1 and 13% for collection 2.

The row `Partition tests' shows the total number of partition tests. The row `Comparisons (Ward)' shows the number of elements in the proximity matrix that would have been evaluated if an agglomerative clustering algorithm like Ward's method were used [6].

A clear difference can be seen between the numbers of `Partition tests' and `Comparisons (Ward)', and it indicates the superiority of our algorithm over Ward's method in terms of time complexity. In fact, Ward's method, implemented at the same coding level as our algorithm, could not finish processing even collection 2: it took about 5 h to calculate the initial proximity matrix, and the process had continued for more than 90 h before we decided to stop it.

In addition, the Ward method took up about 800 MB to cluster collection 2, whereas the proposed method took up only about 260 MB to cluster the far larger collection 1. These experiments therefore confirmed our method's superiority to the agglomerative method in terms of speed and memory consumption.

6.3. Qualitative evaluation

Fig. 2 shows a part of the hierarchy obtained with collection 2. Each node in the hierarchy has a topic binder and its partition threshold. All articles containing the topic binder above the threshold are grouped in the left branch. The remaining articles are grouped in the right branch and are tested by the next topic binder. As a result, the algorithm produces a right-branching and therefore right-deep tree. Table 4 lists the English titles for the articles clustered in leaves A, B and C in Fig. 2. Similar articles are grouped properly in Table 4.

[Fig. 2. Part of the cluster hierarchy for collection 2.]

Table 4
Cluster example

Car (thres=2.5)              Aviation (thres=2.0)    Semiconductor (thres=3.0)
Japan-US: auto               Japan US aviation       Japan-US (semiconductor)
Trade Ministers Canada, EU   US-JAL-sanction         US semiconductor
Brittan-Hashimoto            Japan-US (AIR)          Japan-EU
Japan-US auto hearing        film-aviation           US-Japan-chips
Auto dispute                 Kansai                  Federal semiconductor
Japan-US auto talks          Japan US aviation       Japan-US semiconductor

We observed that similar titles were properly clustered, especially in the shallow left leaves of the tree, where documents were positively clustered as containing the topic binders above the threshold values. These observations indicated that the topic binder hypothesis worked reasonably well. To further validate this hypothesis and the quality of the hierarchy, we should conduct formal evaluations. A simple comparison between our hierarchy and those obtained by other methods would be informative, since this can reveal the differences objectively.

[5] Changing this condition into `cluster size ≤ 2' increased the computation time by 15% for collection 1 and 19% for collection 2. This increase was not very big, since the additional clustering at each leaf required the evaluation of very few partition points (a maximum of 20).

[6] We assumed a naive implementation. Such an implementation requires i(i − 1)/2 comparisons for selecting two clusters (objects) from i objects. Since i varies from N to 1, the total number of comparisons becomes (N − 1)N(N + 1)/6.

The absolute quality of our cluster hierarchy, on the other hand, is hard to evaluate. An

evaluation in terms of an application seems best. We can thereby evaluate the cluster hierarchy

by means of the application's goodness. The news article browser in the next section took

advantage of our clustering hierarchy, and this application can be used for such an evaluation.

7. Application to document browser

7.1. Features of cluster hierarchy

The cluster hierarchy obtained with our algorithm has the following features.

• Collection structure revelation. The topic binders on the right-most branch in Fig. 2 are considered to be the `main keywords' of the whole collection. The first topic binder represents the most conspicuous topic in the entire collection, and articles belonging to the topic are grouped under the left branch. The remaining collection, whose size is usually far larger than the collection belonging to the first topic, is grouped under the right branch. The remainder still contains many topics. Therefore, the best topic binder here will be the second most conspicuous topic in the entire collection. With such recursive processes on the right-most branch, the topic binders there form a list of keywords that roughly partition the whole document collection. The list therefore shows us the overall topical structure of the entire document collection. The topic binders on the left branches, on the contrary, are `sub-keywords' that partition the collection under a main keyword in detail and thus show the micro-structure.

• Cluster explanation by individual words. We can understand each cluster's (leaf's) feature from the conjunction of inequalities of topic binders and their partition thresholds. For example, the feature of leaf A in Fig. 2 is given by (Japan > 2.5 ∩ America > 2.5 ∩ Japan US > 2.5 ∩ car > 2.5). We can easily guess the topic of leaf A from these inequalities (a toy membership test follows below).
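As an illustration (ours; `doc' is a sparse term-to-frequency mapping as in Section 5), the membership test for leaf A is just this conjunction:

```python
def in_leaf_A(doc):
    """Feature of leaf A in Fig. 2: every topic binder on the path
    must exceed its partition threshold."""
    constraints = {"Japan": 2.5, "America": 2.5, "Japan US": 2.5, "car": 2.5}
    return all(doc.get(term, 0) > thr for term, thr in constraints.items())
```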

These two features are unique to our clustering algorithm and are not available with the

agglomerative method. They are useful, as one example, for constructing a document browser.

7.2. Browser

We have applied our clustering algorithm to an article browser in a translation aid system (Kumano, Tanaka, Uratani, & Ehara, 1997). The system consists of (1) a voluminous database that contains Japanese articles with English translations, (2) a flexible expression retrieval system and (3) a document browser. The system helps users to translate Japanese news articles into English by showing past translation examples.

Those who want to know how to translate specific Japanese vocabulary words and expressions can use the similarity-based expression retrieval system (Tanaka, 1997) and find their English translations in the database.

The retrieval part basically uses a keyword retrieval method. The system first extracts keywords from a Japanese input expression with a morphological analyzer and then retrieves sentences containing the keywords under the AND condition. Here, the similarity is measured by the number of common keywords between the input and the sentences, and the results are presented in descending order of similarity.

We found, however, that such a simple use of this AND retrieval technique resulted in many

spurious search results. The main reason is rooted in the nature of the Japanese news articles

in our database. The sentences in the database are long (about 90 Japanese characters per

sentence) and an identical keyword often appears at several positions even in a single sentence.

Such multiple keywords became noise in the retrieval process.

To solve this problem, we added two constraints to the search condition and the similarity: word order and positions of keywords. The order of the input keywords is observed through the search. The difference in the keyword positions is also considered. Although these are simple heuristics, we found them quite robust and effective.
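The paper does not spell out the exact procedure, but the word-order constraint can be read as subsequence matching; a plausible sketch (our assumption, not the authors' code):

```python
def matches_in_order(query_keywords, sentence_tokens):
    """AND retrieval with the word-order constraint: every query keyword
    must occur in the sentence, in the same order as in the input.
    (The position-difference term of the similarity is omitted here.)"""
    pos = 0
    for kw in query_keywords:
        try:
            pos = sentence_tokens.index(kw, pos) + 1
        except ValueError:
            return False  # keyword missing or out of order
    return True
```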

In contrast to the expression retrieval system, the browser offers general translation knowledge for each news topic. Suppose that we have a user who is interested in the general translation strategy for articles on the topic `semiconductor dispute'; this can be learned by looking at the Japanese and English articles in leaf C in Fig. 2. We have many topics that are almost always translated in the same way [7]; such topic-based article browsing will be informative to translators, especially to trainees in news translation.

The cluster hierarchy obtained with our algorithm can be directly applied to such an article browser, since each node represents a topic binder and holds the corresponding articles. Therefore, a user may browse the whole article set by specifying nodes in the tree. Based on this notion, we have constructed an article browser using the Japanese articles in collection 2 and their English translations.

The hierarchy, however, is a very large binary tree, and it is difficult to present all topic binders to a user. Therefore, we employ the following strategies, taking advantage of our hierarchy's features.

• Folded tree: at the beginning of a browsing session, the system presents the user with the nodes on the right-most edge and their direct left child nodes. Fig. 3 shows the actual screen image of an initial browsing session. As we mentioned before, the nodes on the right-most edge show the overall structure of the whole article collection. The user is therefore able to grasp the general content of the collection.

• Sub-tree expansion: when a user finds an interesting topic binder node in the folded tree, he or she clicks the node. The system then expands the tree under the clicked node and shows the details of the hierarchy. At this time, the main keywords (topic binders on the right-most edge) are left unerased so that the user does not get lost in the article collection. An expanded tree is shown in Fig. 4.

[Fig. 4. Expanded tree.]

In the middle section of the window, article titles under a specified topic binder are shown. The left window displays the article titles corresponding to the left branch of the specified topic binder, and the right window displays the titles in the right branch. With these titles, the ancestor (upper) topic binders and the main keywords, a user can confirm his or her position in the collection and freely navigate through it.

When a user clicks on a title, another window with the Japanese article body and the English article body will appear, and the user can read and compare them.

The browser we constructed with collection 2 contained 14,701 Japanese articles. The number of main keywords in this hierarchy reached 416. This figure seems too large for the number of main keywords. However, the first 87 main keywords contained about 90% (13,222) of the articles. The remaining 10% of the articles were of rather isolated topics or very short notices; they did not provide useful translation examples and we could virtually ignore them. We therefore consider that having 87 main keywords for 13,222 articles is acceptable.

The cluster tree for collection 1 (83,099 articles) produced 690 main keywords but, again, the first 119 main keywords contained 90% of the total collection. This figure is also acceptable considering the collection size [8].

[7] To mention a few topics: stock market reports, weather forecasts, traffic accidents, monthly reviews of car and air conditioner sales.

[8] We do not use this collection for the browser, since most of the articles do not have English translations.

For some purposes, ignoring 10% of the document set (the remainder) will not be

permissible. Each document must be placed at a proper position in a tree. We can think of two

possible ways to handle this situation.

One way is to treat the remainder as one group under the main keyword `miscellaneous' and build a subtree from it with a different clustering method. Clustering methods based on a proximity matrix can be used here, since the collection size is small enough. These clustering methods will have a better chance of producing a more balanced tree than ours, since they measure similarity with a word vector instead of a single word. The nodes in the subtree will have a combination of some important keywords (extracted from the word vectors of each cluster) instead of one topic binder.

The other method is to spread the remainder over the main part of the tree. This could be done by comparing the word vectors between each document in the remainder and each cluster (leaf). The obvious drawback of this method is the violation of the topic binder's threshold rule, which would lead to `contamination' of the clusters and could confuse someone using the browser. Some consideration of document presentation would be required.

The problem of efficiently visualizing a gigantic tree has recently been gathering attention, since this kind of structure appears in various data and data sizes are getting larger and larger. Some general techniques for displaying gigantic trees, like the `hyperbolic browser' (Lamping, Rao, & Pirolli, 1995) and the `cone tree' (Robertson, Mackinlay, & Card, 1991), have been proposed, and their superiority in utilizing display space has been reported.

In our visualization, we used a kind of `pruning and growing' technique (Robertson et al., 1991) that simply hides and expands the substructures of a tree in accordance with the user's operations. This was effectively applied, since the nodes in our tree have different degrees of importance in showing the collection structure. We could therefore take advantage of the important nodes (main keywords) to indicate the relative position of each node.

[Fig. 3. Folded tree.]

However, if the tree size grows much larger, some sophisticated display technique will

become desirable since a user must scroll through the display many times even to scan the

main keywords. This is one of our future work topics. We are also planning some evaluations

of the browser, which is another important future task.

We are now preparing to put this browser into practical use at a broadcasting station. The intended users are news translators and trainees in news translation. As mentioned at the beginning of this section, we expect them to effectively learn the topic-dependent translation knowledge.

8. Conclusions

We have proposed a fast clustering algorithm based on the topic binder hypothesis. When it is applied to a document clustering task, it will produce a classification tree in which each internal node represents a `topic binding term'. The two edges spanning downward test the topic binding term's `frequency threshold', and the leaves hold documents.

Our method requires a relatively small amount of memory since it uses term vectors instead

of a proximity matrix. We evaluated the time complexity of our method theoretically and

empirically. These evaluations revealed the superiority of our method to the conventional

agglomerative methods.

We conducted a preliminary investigation on the clustered hierarchy and received the

impression that our method performed reasonable clustering. We would like to conduct more

formal evaluations concerning quality.

Finally, we introduced a news article browser which directly uses the hierarchy obtained with our clustering algorithm. This browser takes advantage of the two features of the hierarchy: (1) the hierarchy explicitly reveals the document collection structure and (2) each cluster can be directly explained with a combination of keywords and their frequency thresholds.

As we mentioned at the beginning, the group of divisive hierarchic clustering algorithms has

attracted less attention than the agglomerative hierarchic group. We believe our work reported

here has demonstrated that the former family is useful.

Acknowledgements

The work described here was mainly done when the first author was with NHK Science and Technical Research Laboratories. The authors would like to thank the President of ATR Interpreting Telecommunications Research Laboratories, Dr. Seichi Yamamoto, and the Department Head, Dr. Akio Yokoo, for allowing us to publish this jointly named paper. The authors would also like to thank the Director-General of NHK Science and Technical Research Laboratories, Dr. Takehiko Yoshino, the Deputy Director-General, Dr. Kazumasa Enami, and the Director, Dr. Haruo Isono, for their continuous support of our work.

Appendix A. Difference equations

The variation reduction is equal to the variation between the sets, and we obtain

$$b(X[1, n], p) = p \left( \bar{X}[1, n] - \bar{X}[1, p] \right)^2 + (n - p) \left( \bar{X}[1, n] - \bar{X}[p+1, n] \right)^2. \quad (A.1)$$

Here, we omitted the suffix i. The term X̄[1, n] is constant and can be calculated before the partition test. The terms X̄[1, p] and X̄[p+1, n], where 1 ≤ p < n, can be calculated with the following difference equations:

$$\bar{X}[1, p] = \frac{\bar{X}[1, p-1] \, (p - 1) + X[p]}{p}, \quad (A.2)$$

$$\bar{X}[1, 0] = 0, \quad (A.3)$$

$$\bar{X}[p+1, n] = \frac{\bar{X}[p, n] \, (n - (p - 1)) - X[p]}{n - p}. \quad (A.4)$$
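A sketch (ours) of how Eqs. (A.1)–(A.4) make the sweep over all p run in O(n) after sorting:

```python
def all_variation_reductions(x_sorted):
    """Evaluate b(X[1, n], p) for every 1 <= p < n with the running means
    of Eqs. (A.2)-(A.4); each step costs O(1)."""
    n = len(x_sorted)
    mean_all = sum(x_sorted) / n        # constant, precomputed
    mean_lo = 0.0                       # mean of X[1, 0] (Eq. A.3)
    mean_hi = mean_all                  # mean of X[1, n] before any removal
    b = []
    for p in range(1, n):
        # Eq. (A.2): fold X[p] into the left mean
        mean_lo = (mean_lo * (p - 1) + x_sorted[p - 1]) / p
        # Eq. (A.4): remove X[p] from the right mean
        mean_hi = (mean_hi * (n - (p - 1)) - x_sorted[p - 1]) / (n - p)
        # Eq. (A.1): between-set variation
        b.append(p * (mean_all - mean_lo) ** 2
                 + (n - p) * (mean_all - mean_hi) ** 2)
    return b

# Example: all_variation_reductions(sorted([1, 1, 2, 2, 3, 3]))
# peaks at p = 2 and p = 4 with b = 3.0.
```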

References

Croft, W. B. (1980). A model of cluster searching based on classification. Information Systems, 5, 189–195.

Cutting, D. R., Karger, D. R., Pedersen, J. O., & Tukey, J. W. (1992). Scatter/Gather: a cluster-based approach to browsing large document collections. In Proc. SIGIR 92 (pp. 318–329).

Everitt, B. S. (1993). Cluster analysis. Arnold.

Griffiths, A., Robinson, L. A., & Willet, P. (1984). Hierarchic agglomerative clustering methods for automatic document classification. Journal of Documentation, 40(3), 175–205.

Hearst, M. A., & Pedersen, J. O. (1996). Reexamining the cluster hypothesis: Scatter/Gather on retrieval results. In Proc. SIGIR 96 (pp. 76–83).

Jardine, N., & van Rijsbergen, C. J. (1971). The use of hierarchic clustering in information retrieval. Information Storage and Retrieval, 7, 217–240.

Kaufman, L., & Rousseeuw, P. J. (1990). Finding groups in data. A Wiley-Interscience Publication.

Kumano, T., Tanaka, H., Kim, Y. B., & Uratani, N. (1996). Primary investigation on NHK's Japanese and English news database. In Proc. of the Second Annual Meeting of the Association for Natural Language Processing (pp. 41–44) (in Japanese).

Kumano, T., Tanaka, H., Uratani, N., & Ehara, T. (1997). Translation example browser for news articles. In Proc. of the Third Annual Meeting of the Association for Natural Language Processing (pp. 529–532) (in Japanese).

Lamping, J., Rao, R., & Pirolli, P. (1995). A focus+context technique based on hyperbolic geometry for visualizing large hierarchies. In Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (pp. 401–408).

Murtagh, F. (1983). A survey of recent advances in hierarchical clustering algorithms. The Computer Journal, 26(4), 354–359.

Rao, R., Pedersen, J. O., Hearst, M. A., Mackinlay, J. D., Card, S. K., Masinter, L., Halvorsen, P., & Robertson, G. G. (1995). Rich interaction in the digital library. Communications of the ACM, 38(4), 29–39.

Robertson, G., Mackinlay, J., & Card, S. (1991). Cone trees: animated 3D visualizations of hierarchical information. In Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (pp. 189–194).

Salton, G., & McGill, M. J. (1983). Introduction to modern information retrieval. McGraw-Hill.

Tanaka, H. (1997). An efficient way of gauging similarity between long Japanese expressions. IPSJ SIG Notes, 97(85), 69–74 (in Japanese).

van Rijsbergen, C. J. (1986). Further experiments with hierarchic clustering in document retrieval. Information Storage and Retrieval, 29(12), 1213–1228.

Weiss, S. M., & Kulikowski, C. (1990). Computer systems that learn. Morgan Kaufmann.

Willett, P. (1988). Recent trends in hierarchic document clustering: a critical review. Information Processing and Management, 24(5), 577–597.