
Topic Hierarchy Generation for Text Segments: A Practical Web-based Approach

SHUI-LUNG CHUANG

Institute of Information Science, Academia Sinica

and

LEE-FENG CHIEN

Institute of Information Science, Academia Sinica

Department of Information Management, National Taiwan University

It is crucial in many information systems to organize short text segments, such as keywords in documents and queries from users, into a well-formed topic hierarchy. In this paper, we address the problem of generating topic hierarchies for diverse text segments with a general and practical approach that uses the Web as an additional knowledge source. Unlike long documents, short text segments typically do not contain enough information to extract reliable features. This work investigates the possibility of using highly ranked search-result snippets to enrich the representation of text segments. A hierarchical clustering algorithm is then designed to create the hierarchical topic structure of the text segments. Text segments with close concepts are grouped together in a cluster, and relevant clusters are linked at the same or nearby levels. Unlike traditional clustering algorithms, which tend to produce cluster hierarchies with a very unnatural shape, the proposed algorithm produces a more natural and comprehensive tree hierarchy. Extensive experiments were conducted on text segments from different domains, including subject terms, people names, paper titles, and natural language questions. The experimental results show the potential of the proposed approach, which provides a basis for in-depth analysis of text segments on a larger scale and is believed to be able to benefit many information systems.

Categories and Subject Descriptors: H.3 [Information Storage and Retrieval]: Miscellaneous

General Terms: Algorithms, Experimentation, Performance

Additional Key Words and Phrases: Topic Hierarchy Generation, Text Segment, Hierarchical Clustering, Partitioning, Search-Result Snippet, Text Data Mining

1. INTRODUCTION

It is crucial in many information systems to organize short text segments, such as keywords in documents and queries from users, into a well-formed topic hierarchy. For example, deriving a topic hierarchy (or concept hierarchy) of terms from a set of documents in an information retrieval system could provide a comprehensive form to present those documents [Sanderson and Croft 1999].

This work was supported in part by the following grants: NSC 93-2752-E-001-001-PAE, 93-2422-H-001-0004, and 93-2213-E-001-025.
Authors' address: Institute of Information Science, Academia Sinica, 128 Academia Road, Section 2, Nankang, Taipei 115, Taiwan; email: {slchuang,lfchien}@iis.sinica.edu.tw
Permission to make digital/hard copy of all or part of this material without fee for personal or classroom use is granted provided that the copies are not made or distributed for profit or commercial advantage, the ACM copyright/server notice, the title of the publication, and its date appear, and notice is given that copying is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists requires prior specific permission and/or a fee.
© 2005 ACM 0000-0000/2005/0000-0001 $5.00


Example text segments:
  What effects will an Iraq war have on oil prices?
  Find the cheapest ACER TravelMate without a CD-Rom drive
  PDAs, Tablet computers, and Pocket PCs
  When is the SIGIR 2003 paper submission deadline?
  What is the difference between PHS, WAP, and GPRS?
  The two final proposals to rebuild the World Trade Center
  The newest product in the IBM Thinkpad X Series
  Where can I find the interview with Saddam Hussein?
  The cellular phone with the GSM-1900 system
  The website of the KDD 2003 conference

Target topic hierarchy:
  (3C)
    (Notebook): Find the cheapest ACER TravelMate without a CD-Rom drive; The newest product in the IBM Thinkpad X Series
    (Mobile): PDAs, Tablet computers, and Pocket PCs
      (Handy): The cellular phone with the GSM-1900 system; What is the difference between PHS, WAP, and GPRS?
  (World)
    (WTC): The two final proposals to rebuild the World Trade Center
    (Iraq War): What effects will an Iraq war have on oil prices?; Where can I find the interview with Saddam Hussein?
  (Conf.): The website of the KDD 2003 conference; When is the SIGIR 2003 paper submission deadline?

Fig. 1. Example text segments and the topic hierarchy we seek to generate.

A similar need for the automatic generation of topic hierarchies could occur in a question answering system. Many enterprise Web sites let users pose natural language questions and search for manually prepared answers. To make the preparation of answers to frequently asked questions (FAQ) more efficient, it is desirable that queries on similar topics be clustered automatically. In this paper, we address the problem of generating topic hierarchies for diverse text segments and present a general and practical approach to this problem that uses the Web as an additional knowledge source. Here, a text segment is defined as a meaningful word string that is often short in length but represents a specific concept in a certain subject domain, such as a keyword in a document set or a natural language query from a user. Text segments are of many types, including words, phrases, named entities, natural language queries, news events, product names, paper or book titles, etc.

The key idea of the proposed approach is to apply a clustering technique to automatically create the hierarchical topic structure of text segments. In the structure, text segments with close concepts are grouped to form the basic clusters, and similar basic clusters recursively form super clusters that characterize the associations between the composed clusters. Relevant clusters are also linked at the same or nearby levels as far as possible. Each cluster, therefore, represents a certain topic class of its composed text segments. Figure 1 shows an illustrative example demonstrating the idea of the proposed approach. In the figure, there is a set of example text segments, e.g., natural language queries to a search engine, and a topic hierarchy that we seek to generate automatically from those example queries. With the auto-generated topic hierarchy, users' search topic classes would be easier to observe and analyze.

Clustering short text segments is a difficult problem given that, unlike long documents, short text segments typically do not contain enough information from which to extract reliable features.


[Fig. 2. An abstract diagram showing the concept behind the proposed approach: text segments are submitted to search engines; context extraction over the returned results feeds the hierarchical clustering stage, in which HAC-based binary-tree hierarchy generation is followed by min-max partitioning to produce the topic hierarchy.]

For long documents, similarities can be estimated based on the common words they contain. A few words (usually new words or proper nouns) in a document that are unknown to the classifier might not cause serious classification errors. However, the similarity between two text segments is difficult to judge in the same way because text segments are usually short and do not share enough textual overlap. Thus, one of the most challenging issues in this problem is to acquire proper features to characterize the text segments. For text segments extracted from documents, such as key terms, the source documents can be used to characterize them. However, in many cases, such as when dealing with search-engine query strings, there may not be sufficient relevant documents to represent the target text segments. A lack of domain-specific corpora for describing text segments is the usual case in practice. Therefore, relying on a predetermined corpus cannot be a general approach to this problem.

Fortunately, the Web, as the largest and most accessible data repository in the world, provides an alternative way to deal with this difficulty; that is, it provides a general way to supplement the insufficient information of various text segments with its rich resources. Many search engines constantly crawl Web resources and retrieve relevant Web pages for large numbers of free-text queries, including single terms and longer word strings. In the proposed approach, we incorporate the search-result snippets returned by search engines into the process of acquiring features for text segments. A query relaxation technique is also developed to obtain adequate snippets for long text segments, which tend to retrieve fewer search results. The overall concept of the proposed approach is shown in Figure 2.

In addition to this feature acquisition method, a hierarchical clustering algorithm was developed for creating the hierarchical topic structure of text segments. Unlike traditional clustering algorithms, which tend to produce clusters and hierarchies with a very unnatural shape, the algorithm is designed to produce a more natural and comprehensive hierarchy structure. In the literature, many algorithms for hierarchical data clustering have been developed. However, they are mostly binary-tree-based; i.e., they generate binary-tree hierarchies [Willet 1988]. Only a few deal with multi-way-tree hierarchies, e.g., model-based hierarchical clustering [Vaithyanathan and Dom 2000], but these suffer from high computational cost or a need for predetermined constants on the number of branches or thresholds on similarity scores. Our initial intention in discovering topic hierarchies for text segments is to provide humans with a basis for in-depth analysis of text segments. The broad and shallow multi-way-tree representation, instead of the narrow and deep binary-tree one, is believed to be easier and more suitable for humans to browse, interpret, and analyze in depth. Motivated by this, we developed a hierarchical clustering algorithm, called HAC+P: an extension of the hierarchical agglomerative clustering algorithm (HAC), which builds a binary hierarchy in a bottom-up fashion, followed by a top-down hierarchical partitioning technique, named min-max partitioning, that partitions the binary hierarchy into a natural and comprehensive multi-way-tree hierarchy.



Extensive experiments have been conducted on text segments from different domains, including subject terms, people names, paper titles, and natural language questions. The promising results show the potential of the proposed approach in clustering similar text segments and creating natural topic structures; this is believed to be able to benefit the design of information systems in many ways, such as text summarization, query clustering, and thesaurus construction. In the rest of this paper, we first examine the possibility of using search-result snippets for feature extraction of text segments and introduce the data representation model used in this study. Then the proposed hierarchical clustering algorithm is presented in detail, followed by the experiments and their results. Further, the query relaxation technique and more experiments are introduced. A user-study-based evaluation of the results of our approach is also conducted. Finally, we provide further discussion, review related work, and draw conclusions.

2. FEATURE EXTRACTION USING SEARCH-RESULT SNIPPETS

Compared with general documents, text segments are much shorter and typically do not contain enough information to extract adequate and reliable features. To assist the relevance judgment between text segments, additional knowledge sources should be exploited. It is helpful to understand the process by which a human expert determines the meaning(s) of a text segment beyond his/her knowledge. From our observations, when facing an unknown text segment, humans may refer to the various contexts in which it occurs in documents, from which the meaning(s) of the segment can be inferred. The proposed approach is, therefore, designed to simulate such human behavior.

Our basic idea is to exploit the Web, the largest and most ubiquitously accessible data repository in the world. Adequate contexts of a text segment, e.g., the sentences neighboring the given text segment, can be retrieved from large numbers of Web pages. This idea is somewhat analogous to determining the sense of a word by means of its context words extracted from a predetermined document corpus in linguistic analysis [Manning and Schutze 1999]. However, there are differences. With a document corpus of limited size and domains, a conventional approach normally extracts all possible context words of the given word from the corpus. The situation is different when using the Web as the corpus. The number of matched contexts on the Web might be huge. A practical approach should adopt only those considered relevant to the intended domain(s) of the given text segment. From this perspective, the proposed approach favors the contexts obtained from pages relevant to the given text segment.

We found it convenient to implement our idea using existing search engines. A text segment can be treated as a query with a certain search request,


and its contexts are then obtained directly from the highly ranked search-result snippets, e.g., the titles and descriptions of search-result entries and the texts surrounding the matched terms. This is analogous to the technique of pseudo-relevance feedback, which improves retrieval performance with expansion terms extracted from the top-ranked documents [Buckley et al. 1992]. Mostly, this scenario works fine for short text segments. However, some long text segments might be too specific to match exact text strings and obtain effective search results directly via search engines; besides, the returned snippets might not be sufficient. For this reason, a specific query processing technique, named query relaxation (refer to Section 5), was developed to obtain adequate relevant search results for long text segments through a bootstrapping process of search requests to search engines. Below, we first introduce the text representation model used in this study.

2.1 Representation Model

We adopt the vector-space model as our data representation. Suppose that, for each text segment p, we collect up to N_max search-result entries, denoted as D_p. Each text segment can then be converted into a bag of feature terms by applying normal text processing techniques, e.g., removing stop words¹ and stemming, to the contents of D_p. Let T be the feature term vocabulary, and let t_i be the i-th term in T. With simple processing, a text segment p can be represented as a term vector v_p in a |T|-dimensional space, where v_{p,i} is the weight of t_i in v_p. The term weights in this work are determined according to one of the conventional tf-idf term weighting schemes [Salton and Buckley 1988], in which each term weight v_{p,i} is defined as

    v_{p,i} = (1 + log_2 f_{p,i}) × log_2(n / n_i),    (1)

where f_{p,i} is the frequency of t_i in v_p's corresponding feature term bag, n is the total number of text segments, and n_i is the number of text segments that contain t_i in their corresponding bags of feature terms. The similarity between a pair of text segments is computed as the cosine of the angle between the corresponding vectors, i.e.,

    sim(v_a, v_b) = cos(v_a, v_b).

Further, we define the average similarity between two sets of vectors, C_i and C_j, as the average of all pairwise similarities among the vectors in C_i and C_j:

    sim_A(C_i, C_j) = (1 / (|C_i| |C_j|)) Σ_{v_a ∈ C_i} Σ_{v_b ∈ C_j} sim(v_a, v_b).
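As a concrete reference for this representation, the following is a minimal Python sketch: tokenization of collected snippets into term bags, the tf-idf weighting of Equation (1), the cosine similarity sim, and the average similarity sim_A. The stop-word set is a tiny placeholder (a full list such as the Smart one would be used in practice), and the snippet strings are assumed to have been collected beforehand.

    import math
    import re
    from collections import Counter

    # Placeholder stop-word set; in practice, use a full list such as the Smart one.
    STOP_WORDS = {"the", "a", "an", "of", "and", "or", "in", "for", "to", "is", "are"}

    def to_term_bag(snippets):
        """Convert a list of snippet strings into a bag (multiset) of feature terms."""
        tokens = re.findall(r"[a-z0-9]+", " ".join(snippets).lower())
        return Counter(t for t in tokens if t not in STOP_WORDS)

    def tfidf_vectors(term_bags):
        """Turn one term bag per segment into sparse tf-idf vectors (Equation 1)."""
        n = len(term_bags)
        df = Counter(t for bag in term_bags for t in bag)   # n_i: segments containing t_i
        vectors = []
        for bag in term_bags:
            v = {t: (1 + math.log2(f)) * math.log2(n / df[t]) for t, f in bag.items()}
            vectors.append({t: w for t, w in v.items() if w > 0})  # drop zero-idf terms
        return vectors

    def sim(va, vb):
        """Cosine similarity between two sparse term vectors."""
        dot = sum(w * vb.get(t, 0.0) for t, w in va.items())
        na = math.sqrt(sum(w * w for w in va.values()))
        nb = math.sqrt(sum(w * w for w in vb.values()))
        return dot / (na * nb) if na and nb else 0.0

    def sim_a(ci, cj):
        """Average of all pairwise similarities between two sets of vectors."""
        return sum(sim(a, b) for a in ci for b in cj) / (len(ci) * len(cj))

    # Usage: vectors = tfidf_vectors([to_term_bag(s) for s in snippet_lists])
    #        then sim(vectors[i], vectors[j]) gives pairwise segment similarity.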

It should be noticed that our purpose in using search-result snippets is not to fulfill the search request, but mainly to obtain adequate information that reflects the feature-distribution characteristics of a text segment's intended topic domain(s), and thereby to aid the relevance measurement among text segments. It is not strictly required that the extracted contexts be obtained from the most relevant pages. In fact, as the experimental results in Section 5 show, the ranking of search results by search engines did not affect the clustering performance much; the proposed approach is, therefore, believed to be robust and not highly dependent on the employed search engines. Before we move to the clustering algorithm, let us examine the feasibility of using search-result snippets through a preliminary observation.

¹In this work, stop words are determined by a stop-word list; the one we used is obtained from the Smart system, available at ftp://ftp.cs.cornell.edu/pub/smart/.


Fig. 3. The top ten search-result snippets of text segment "PDA, Tablet computers, and Pocket PCs" (titles are shown in bold in the original; each description follows its title):

1. Laptop-Notebook-Handheld-Palm-size-PDA-Pocket-PC-Companion- ...
   ... makes carrying cases for Handheld PC, Palm-size PC, PDA, Pocket PC, PC Companion, Laptop, Notebook, and Tablet PC computers - as well as Cellphones, Pagers ...
2. Carrying cases for Portable Electronics - Cellular Phones, Laptop ...
   Since there is a flood of new notebook, PDA computers, and cellular phones constantly coming to market, The Pouch, Inc., must do a lot of research, in order to ...
3. PocketLOOX - Fujitsu Siemens Computers
   ... PDA & Tablet PC notebooks personal computers thin clients broadband solutions workstations intel based servers UNIX servers BS2000/OSD servers ...
4. pen tablet pc - Fujitsu Siemens Computers
   Fujitsu Siemens Computers .com, HOME SEARCH CONTACT COUNTRIES JOBS SITEMAP, ... PDA & Tablet PC notebooks personal computers thin clients broadband solutions ...
5. PDA Street - The PDA Network for Handheld Computers, PDA Software ...
   ... you can really fit in any pocket? ... a PDA Talk about PDAs PDA News Windows ... REX PocketMail Smartphones Tablet Computers Other Gadgets. ...
6. PDAStreet: News
   ... Zire Stands Out in Sluggish PDA Stats As enterprise users back off buying ... has released version 1.1 of mcBank, its financial manager software for the Pocket PC. ...
7. Fujitsu Siemens Computers - Serwis FSC - PRODUKTY - PDA & Tablet ...
   ... Nowe urzadzenie kieszonkowe Fujitsu Siemens Computers to krok w przyszlosc ... male i lekkie urzadzenie z oprogramowaniem Microsoft Pocket PC 2002 zapewnia ... [Polish: the new pocket device from Fujitsu Siemens Computers is a step into the future ... the small and light device with Microsoft Pocket PC 2002 software provides ...]
8. Yahoo! News Full Coverage - Technology - Handheld and Palmtop ...
   ... more. Audio. -, PDA Tech Tips from ... , Sales of handheld computers fall - All Things Considered/NPR (Aug 7, 2001). more. ... , Is Microsoft's Tablet PC innovative enough? ...
9. Hardware
   ... Computers, Computers, Copiers & Fax Machines, Copiers & Fax Machines. ... Handheld/Palm PC/PDA/Pocket PC/Tablet PC, Hard Drives, ...
10. Medical Pocket PC - Medical Resources for the Pocket PC
   ... devices on the WLAN, including Tablet PCs ... considerations involved in implementing handheld computers into residency ... Dictionary, 27th edition for PDA gives you ...


2.2 An Illustrative Example

Figure 3 shows Google's² top ten search-result snippets for the text segment "PDA, Tablet computers, and Pocket PCs," where the titles and descriptions are presented in bold and normal faces, respectively. Obviously, these snippets, composed of only fragmented texts, do not directly provide an answer to a comparison of PDAs, tablet computers, and pocket PCs, nor to any question about these three products that users might want answered. However, the snippets do contain many words related to the corresponding text segment. To give readers a clearer picture of this phenomenon, Figure 4 lists the eighty words that occur most frequently in the top 100 search-result snippets, together with their frequency counts. All stop words have been removed, and the remaining words are converted to lower case but not stemmed (e.g., plural and singular forms are temporarily treated as different in this preliminary observation). In the list, the words with extremely high frequency are those occurring in the text segment itself, i.e., "pc," "pda," "computers," "tablet," and "pocket." Besides these, many words can be found that are highly related to the corresponding text segment: for example, the manufacturers and brand names, e.g., "fujitsu," "palm," "compaq," "microsoft," etc.; the related products, e.g., "pen," "notebook," "phone," "laptop," etc.; the product characteristics, e.g., "handheld," "mobile," "wireless," "digital," "portable," etc.; and other software- or hardware-related terms, e.g., "windows," "ce," "xp," "pocketmail," etc. Intuitively, all these related terms give a higher chance of making an association between the corresponding text segment and other segments with similar topics, e.g., "The newest product in the IBM Thinkpad X Series" and "The cellular phone with the GSM-1900 system" (refer to Figure 1). Of course, a very few terms, such as "based," are less related, i.e., noisy terms, but they seem not to hurt much. In conclusion for this example, most of the highly frequent terms found in the snippets reasonably characterize the corresponding text segment.

²http://www.google.com/


Fig. 4. The word and frequency list of the 80 most frequent words in the top 100 search-result snippets of text segment "PDA, Tablet computers, and Pocket PCs":

pc 157, pda 127, computers 113, tablet 108, pocket 89, handheld 29, news 25, fujitsu 22, siemens 19, pen 18, palm 16, handhelds 15, new 15, com 15, mobile 15, compaq 14, microsoft 14, pcs 13, software 13, windows 13, pdas 12, wireless 11, computer 9, notebooks 9, systems 9, toshiba 9, digital 8, notebook 8, servers 8, accessories 7, ce 7, computing 7, gadgets 7, phone 7, 256mb 6, epoc 6, internet 6, laptop 6, loox 6, personal 6, talk 6, 0ghz 5, 30gb 5, buy 5, devices 5, hardware 5, network 5, online 5, os 5, pocketmail 5, smartphones 5, tc1000t 5, technology 5, companion 5, ipaq 5, magazine 5, medical 5, prices 5, product 5, available 4, based 4, desktop 4, edition 4, portable 4, printers 4, products 4, rex 4, shareware 4, submit 4, thin 4, warnty 4, xp 4, xpp 4, components 4, dell 4, drives 4, infocater 4, ink 4, resources 4, rim 4

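For instance, a frequency list like the one in Figure 4 can be reproduced in a few lines. This is a sketch under stated assumptions: snippets holds the collected snippet strings and stop_words is a stop-word set; both names are illustrative.

    import re
    from collections import Counter

    def top_terms(snippets, stop_words, k=80):
        """Return the k most frequent non-stop-word tokens in the snippet strings."""
        tokens = re.findall(r"[a-z0-9]+", " ".join(snippets).lower())
        counts = Counter(t for t in tokens if t not in stop_words)
        return counts.most_common(k)   # list of (word, frequency) pairs, as in Fig. 4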

To further examine whether using search-result snippets for feature extraction can help the relevance judgment between text segments and reflect their topic similarity, we selected five of the text segments listed in Figure 1, shown as p1-p5 in Figure 5, as the testing samples³. For each text segment, we collected up to 100 search-result snippets and selected the 80 most frequent words as the segment's feature terms⁴. Then we computed the term weights of those five segments' features according to Equation 1 and kept the feature terms with nonzero weight. Figure 5 lists the corresponding information: the head column lists the feature terms, and each inner cell lists the weight of a feature term with respect to the segment indicated in the head row. The matrix has been arranged so as to clearly reveal the relationships between the text segments and their highly correlated features.

Suppose that the hierarchy shown in Figure 1 is the target topical relationship between the testing text segments that we want to discover. The similarity scores between all pairs of testing text segments are listed as follows:

    sim(p1,p2)=.091  sim(p1,p3)=.037  sim(p1,p4)=.011  sim(p1,p5)=.008
    sim(p2,p3)=.068  sim(p2,p4)=.014  sim(p2,p5)=.023  sim(p3,p4)=.041
    sim(p3,p5)=.020  sim(p4,p5)=.178

From the data, segments p4 and p5 have the highest similarity score, indicating that they are the most related under the similarity measure strategy used. This result truly reflects the fact (as shown in Figure 1) that they belong to the same topic, namely, a topic about mobile phone systems, which can be clearly distinguished from the other three text segments. The pair with the second-highest similarity score is p1 and p2, which can be identified as similar under a topic about notebooks. The pair with the third-highest score is p2 and p3, indicating that p3 is more related to the notebook topic (to which p1 and p2 belong) than to the mobile phone system topic (to which p4 and p5 belong). The discovered relationships between the testing text segments are almost the same as those in the topic hierarchy shown in Figure 1. This result supports our approach of extracting features from search-result snippets to reveal the topic similarity between text segments.

³The testing text segments were chosen because they are related to the topic of technology and consumer electronics products but still have certain differences that can be distinguished.
⁴Again, the feature terms are shown without stemming in order to provide readers a more understandable form of those terms. Notice that the eighty most frequent words are chosen mainly for the purpose of illustration. Our approach uses all of the words found in the retrieved search-result snippets, except stop words.


[Fig. 5. The weight matrix of feature terms with respect to the example text segments: for each feature term with nonzero weight, the matrix lists its weight (per Equation 1) in each of the segments p1-p5. The matrix is arranged so that the term blocks highly correlated with particular segments stand out, e.g., ibm, thinkpad, and series for p1/p2; computer, notebook, and laptop for p1-p3; and gsm, cellular, wap, and phones for p4/p5.]

Text Patterns:
p1  Find the cheapest ACER TravelMate without a CD-Rom drive
p2  The newest product in the IBM Thinkpad X Series
p3  PDAs, Tablet computers, and Pocket PCs
p4  The cellular phone with the GSM-1900 system
p5  What is the difference between PHS, WAP, and GPRS?


From the observations, several questions need to be addressed before using search-result snippets can be justified as a general approach to feature extraction, e.g.:

—How many snippets are appropriate to obtain adequate features?

—Does the ranking of snippets by search engines seriously affect the quality of retrieved features?

—If a text segment has several different meanings, can highly ranked search-result snippets reveal all of its meanings?

—How should we deal with text segments that retrieve too few or even no search-result snippets?

Some of these questions will be answered through experiments, and some will be clarified further in the discussion section. Overall, the above preliminary observation shows that the idea of using search-result snippets for feature extraction has tremendous potential. Of course, such an approach certainly has disadvantages, and both its strengths and weaknesses will be discussed in Section 7. In conclusion, with regard to the problem addressed in this paper, the clustering hypothesis can be intuitively stated as follows: two text segments are clustered together because they retrieve similar context contents.



3. HIERARCHICAL CLUSTERING ALGORITHM: HAC+P

The purpose of clustering in our approach is to generate a cluster hierarchy for organizing text segments. The hierarchical clustering problem has been studied extensively in the literature, and many different clustering algorithms exist. They are mainly of two major types: agglomerative and divisive. We have adopted HAC as the backbone mechanism and developed a new algorithm, called HAC+P, for our clustering problem. The algorithm consists of two phases: HAC-based clustering to construct a binary-tree cluster hierarchy, and min-max partitioning to generate a natural and comprehensive multi-way-tree hierarchy structure from the binary one. The algorithmic procedure is formally shown in Figure 6, and the details are introduced in the following subsections.

3.1 HAC-Based Binary-Tree Hierarchy Generation

An HAC algorithm operates on a set of objects with a matrix of inter-object similarities and builds a binary-tree cluster hierarchy in a bottom-up fashion [Mirkin 1996]. Let v_1, v_2, ..., v_n be the input object vectors, and let C_1, C_2, ..., C_n be the corresponding singleton clusters. In the HAC clustering process, at each iteration step, the two most-similar clusters are merged to form a new one, and the whole process halts when there exists only one un-merged cluster, i.e., the root node of the binary-tree hierarchy. Let C_{n+i} be the new cluster created at the i-th step. The output binary-tree hierarchy can be unambiguously expressed as a list, C_1, ..., C_n, C_{n+1}, ..., C_{2n-1}, with two functions, left(C_{n+i}) and right(C_{n+i}), 1 ≤ i < n, indicating the left and right children of the internal cluster node C_{n+i}, respectively.

The core of an HAC algorithm is a specific function used to measure the similarity between any pair of clusters C_i and C_j (steps 8 and 11 in Figure 6). Here, we consider four well-known inter-cluster similarity functions:

(SL) the single-linkage function, defined as the largest similarity between two objects in the two clusters:

    sim_SL(C_i, C_j) = max_{v_a ∈ C_i, v_b ∈ C_j} sim(v_a, v_b);

(CL) the complete-linkage function, defined as the smallest similarity between two objects in the two clusters:

    sim_CL(C_i, C_j) = min_{v_a ∈ C_i, v_b ∈ C_j} sim(v_a, v_b);

(AL) the average-linkage function, defined as the average of all similarities among the objects in the two clusters:

    sim_AL(C_i, C_j) = sim_A(C_i, C_j);

(CE) the centroid function, defined as the similarity between the centroids of the two clusters:

    sim_CE(C_i, C_j) = sim(c_i, c_j),

where c_i and c_j are the centroids of C_i and C_j, respectively, and, for a cluster C_l, the k-th feature weight of its centroid c_l is defined as c_{l,k} = Σ_{v_i ∈ C_l} v_{i,k} / |C_l|.


HAC+P(v_1, ..., v_n)
  v_i, 1 ≤ i ≤ n: the vectors of the objects
 1:  C_1, ..., C_{2n-1} ← GenerateHACBinaryTree(v_1, ..., v_n)
 2:  return MinMaxPartition(1, C_1, ..., C_{2n-1})

GenerateHACBinaryTree(v_1, ..., v_n)
  v_i, 1 ≤ i ≤ n: the vectors of the objects
 3:  for all v_i, 1 ≤ i ≤ n do
 4:    C_i ← {v_i}
 5:    f(i) ← true    {f: whether a cluster can be merged}
 6:  calculate the pairwise cluster similarity matrix
 7:  for all 1 ≤ i < n do
 8:    choose the most-similar pair {C_a, C_b} with f(a) ∧ f(b) ≡ true
 9:    C_{n+i} ← C_a ∪ C_b;  left(C_{n+i}) ← C_a;  right(C_{n+i}) ← C_b
10:    f(n+i) ← true;  f(a) ← false;  f(b) ← false
11:    update the similarity matrix with the new cluster C_{n+i}
12:  return C_1, ..., C_{2n-1} together with the functions left and right

MinMaxPartition(d, C_1, ..., C_n, C_{n+1}, ..., C_{2n-1})
  d: the current depth
  C_i, 1 ≤ i ≤ 2n-1: the binary-tree hierarchy
13:  if n < ε ∨ d > ρ then
14:    return C_1, C_2, ..., C_n
15:  minq ← ∞;  bestcut ← 0
16:  for all cut levels l, 1 ≤ l < n do
17:    q ← Q(LC(l)) / N(LC(l))
18:    if minq > q then
19:      minq ← q;  bestcut ← l
20:  for all C_i ∈ LC(bestcut) do
21:    children(C_i) ← MinMaxPartition(d + 1, CH(C_i))
22:  return LC(bestcut)

Fig. 6. The HAC+P algorithm.


Usually, the clusters produced by the single-linkage method are isolated but not cohesive, and there may be some undesirably elongated clusters. At the other extreme, the complete-linkage method produces cohesive clusters that may not be isolated. The average-linkage method represents a compromise between the two extremes. The centroid method is another commonly used similarity measurement approach, distinct from the linkage-based ones. A comparison of these methods is made in a later section.
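To illustrate how these similarity functions plug into the merge loop of steps 7-11 in Figure 6, here is a Python sketch of the binary-tree construction with a selectable linkage. It is a deliberately simple version for clarity (a production implementation would cache and update the similarity matrix, as in step 11); sim and sim_a are the functions sketched in Section 2.1.

    def linkage(name):
        """Return an inter-cluster similarity function over lists of term vectors."""
        if name == "SL":   # single linkage: largest pairwise similarity
            return lambda ci, cj: max(sim(a, b) for a in ci for b in cj)
        if name == "CL":   # complete linkage: smallest pairwise similarity
            return lambda ci, cj: min(sim(a, b) for a in ci for b in cj)
        if name == "AL":   # average linkage
            return sim_a
        if name == "CE":   # centroid similarity
            def centroid(c):
                terms = {t for v in c for t in v}
                return {t: sum(v.get(t, 0.0) for v in c) / len(c) for t in terms}
            return lambda ci, cj: sim(centroid(ci), centroid(cj))
        raise ValueError(f"unknown linkage: {name}")

    def hac_binary_tree(vectors, method="AL"):
        """Bottom-up HAC: returns clusters C_1..C_{2n-1} and a child-link map."""
        score = linkage(method)
        clusters = [[v] for v in vectors]        # C_1..C_n: singleton clusters
        children = {}                            # merged index -> (left, right)
        active = set(range(len(clusters)))       # f(i) = true in Figure 6
        while len(active) > 1:
            a, b = max(((i, j) for i in active for j in active if i < j),
                       key=lambda p: score(clusters[p[0]], clusters[p[1]]))
            children[len(clusters)] = (a, b)     # left(C_{n+i}), right(C_{n+i})
            clusters.append(clusters[a] + clusters[b])
            active -= {a, b}
            active.add(len(clusters) - 1)
        return clusters, children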

3.2 Min-Max Partitioning

The HAC algorithm produces a binary-tree cluster hierarchy. However, the proposed approach aims to produce a natural and comprehensive hierarchical organization like that of Yahoo!, in which there are 13-15 major categories and each category contains an appropriate number of sub-categories, and so on. This broad and shallow multi-way-tree representation, instead of the narrow and deep binary-tree one, is easier and more suitable for humans to browse, interpret, and analyze in depth.

To generate a multi-way-tree hierarchy from a binary-tree representation, a top-down approach is used that first decomposes the hierarchy into several sub-hierarchies and then recursively applies the same decomposition procedure to each sub-hierarchy.


[Fig. 7. An illustrative example for cluster partitioning. (A) A binary-tree hierarchy over leaf clusters C1-C5 with internal clusters C6-C9, where the cut levels l = 1, ..., 4 are marked between adjacent internal clusters. (B) The hierarchy after cutting at level 2, which makes {C5, C6, C7} the first-level clusters.]

Our idea is to determine a suitable level at which to cut the binary-tree hierarchy so as to create the most appropriate sub-hierarchies; that is, the sub-hierarchies with the best balance of cluster quality and cluster-number preference over those produced by cutting at the other levels. By recursively decomposing the sub-hierarchies, a new multi-way-tree hierarchy can be constructed.

The problem of cutting at a suitable level can be taken as that of determining between which pair of adjacent clusters {C_{n+i-1}, C_{n+i}}, 1 ≤ i < n, in the binary-tree hierarchy C_1, ..., C_{2n-1} to put the partition point. Let the level between {C_{2n-2}, C_{2n-1}} be 1, the level between {C_{2n-3}, C_{2n-2}} be 2, and so on, such that the level between {C_{n+i-1}, C_{n+i}} is n - i (refer to Figure 7). For the purpose of further illustration, we let LC(l) be the set of clusters produced after cutting the binary-tree hierarchy at level l, and let CH(C_i) be the cluster hierarchy rooted at node C_i, i.e., CH(C_i) = C^i_1, ..., C^i_{n_i}, C^i_{n_i+1}, ..., C^i_{2n_i-1}, where C^i_1, ..., C^i_{n_i} are the leaf, singleton clusters, C^i_{n_i+1}, ..., C^i_{2n_i-1} are the internal, merged clusters, and C^i_{2n_i-1} = C_i. For example, in Figure 7(A), LC(1) is {C7, C8}, LC(2) is {C5, C6, C7}, and CH(C8) is {C3, C4, C5, C6, C8}. Suppose that the best cut level is chosen as 2; the first-level clusters of the generated hierarchy are then LC(2) = {C5, C6, C7} (refer to Figure 7(B)). If the sub-hierarchy CH(C7) is then partitioned and the chosen level is 1, the second-level clusters include C1 and C2. Readers should notice that all the above information, e.g., the values of the functions LC and CH, can be collected while the HAC clustering process proceeds, so it is available without much extra computational effort. In the following, we describe the two criteria used to determine the best cut level.

Cluster Set Quality. The generally accepted requirement for "natural" clusters is that they must be cohesive and isolated from the other clusters. Our criterion for determining a proper cut level, given a binary-tree hierarchy of clusters, heuristically satisfies this requirement. Let the inter-similarity between two clusters C_i and C_j be defined as the average of all pairwise similarities among the objects in C_i and C_j, i.e., sim_A(C_i, C_j), and let the intra-similarity within a cluster C_i be defined as the average of all pairwise similarities within C_i, i.e., sim_A(C_i, C_i). Our partitioning approach finds a particular level that minimizes the inter-similarities among the clusters produced at that level and maximizes the intra-similarities of all those clusters; this is why the approach is named min-max partitioning. Let C be a set of clusters; our quality measurement of C, based on its cohesion and isolation, is defined as

    Q(C) = (1 / |C|) Σ_{C_i ∈ C} sim_A(C_i, C̄_i) / sim_A(C_i, C_i),

where C̄_i = ∪_{k ≠ i} C_k is the complement of C_i. Note that the smaller the Q(C) value is, the better the quality of the given set of clusters C.



Cluster-Number Preference. Usually, a partition with neither too few nor too many clusters is preferable. Given n objects, there is at least one cluster and at most n clusters. To obtain a natural and comprehensive hierarchy structure, we expect the number of clusters at each level to be appropriate for humans, but a proper number is difficult to anticipate automatically because we have no idea how many meaningful groups exist among the objects. To make the generated hierarchy structure adaptable to the personal preference of each individual who constructs the taxonomy, a parameter N_clus, the default expected number of generated clusters at each layer, is assumed to be given by the corresponding taxonomy constructor. Notice that N_clus could be a constant value or a function. Then, a simplified gamma distribution function is used to measure the degree of preference for the number of clusters. It is defined as follows:

    f(x) = (1 / (α! β^α)) x^{α-1} e^{-x/β};    N(C) = f(|C|),

where |C| is the number of target clusters in C, α is a positive integer, and the constraint (α − 1)β = N_clus is required to ensure f(N_clus) ≥ f(x) for 0 < x ≤ n. This constraint can be derived from the fact that f(x) is unimodal, so its maximum can be computed by differentiating f(x) to get f′(x) and solving f′(x) = 0. The detailed derivation is omitted here. The two parameters α and β allow us to tune the smoothness of the preference function; they are empirically set as α = 3 and β = N_clus/2 in our study. In this work, we empirically define N_clus as the square root of the number of objects in each partitioning step. These parameters are unavoidably subjective to the individual taxonomy constructor; therefore, no further experimental comparison of various values is made. Figure 8 depicts the curves of this cluster-number preference function f(x) with respect to different numbers of generated clusters, for object-set sizes n = 100, 200, and 400. Note that the function favors cluster numbers close to N_clus.

Finally, to partition the given hierarchy, we estimate the quality and the cluster-number preference of all possible cluster sets produced at each level. The suitable cut level is chosen as the level l with the minimum Q(LC(l))/N(LC(l)) value (refer to steps 17–19 in Figure 6). The detailed partitioning procedure is shown in Figure 6. To avoid performing the partitioning procedure on a cluster with too few objects, or making the resulting hierarchy too deep, two constants ε and ρ are provided to restrict, respectively, the size of a cluster to be further processed and the depth of the generated hierarchy (refer to steps 13–14 in Figure 6).
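The following Python sketch shows how the cut-level search and recursive partitioning can be realized, assuming the clusters/children structure from the HAC sketch in Section 3.1 and sim_a from Section 2.1; the size and depth guards mirror ε and ρ of Figure 6, and the default N_clus is the square-root heuristic described above.

    import math

    def quality(cluster_sets):
        """Q(C): average ratio of inter- to intra-similarity; smaller is better."""
        total = 0.0
        for i, ci in enumerate(cluster_sets):
            rest = [v for j, cj in enumerate(cluster_sets) if j != i for v in cj]
            total += sim_a(ci, rest) / sim_a(ci, ci)
        return total / len(cluster_sets)

    def preference(k, n_clus, alpha=3):
        """N(C): simplified gamma preference, peaked near n_clus ((alpha-1)*beta = n_clus)."""
        beta = n_clus / (alpha - 1)
        return k ** (alpha - 1) * math.exp(-k / beta) / (math.factorial(alpha) * beta ** alpha)

    def cut_levels(clusters, children, root):
        """Yield LC(l) for l = 1, 2, ... by splitting the most recently merged node."""
        frontier = [root]
        while True:
            node = max(frontier)        # highest index = last merged within the subtree
            if node not in children:
                return                  # all frontier nodes are leaves
            frontier.remove(node)
            frontier.extend(children[node])
            yield list(frontier)

    def min_max_partition(clusters, children, root, depth=1, eps=4, rho=3):
        """Steps 13-22 of Figure 6: recursively split at the best cut level."""
        if root not in children or len(clusters[root]) < eps or depth > rho:
            return {"members": clusters[root], "children": []}
        n_clus = math.sqrt(len(clusters[root]))     # empirical default from the paper
        best, best_score = None, float("inf")
        for frontier in cut_levels(clusters, children, root):
            sets = [clusters[i] for i in frontier]
            score = quality(sets) / preference(len(sets), n_clus)
            if score < best_score:
                best, best_score = frontier, score
        return {"members": clusters[root],
                "children": [min_max_partition(clusters, children, c, depth + 1, eps, rho)
                             for c in best]}

Under these assumptions, min_max_partition(clusters, children, len(clusters) - 1) turns the binary tree produced by the HAC sketch into a multi-way tree.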


[Fig. 8. The cluster-number preference function f(x) plotted against the number of generated clusters, for n = 100, 200, and 400; each curve peaks near the corresponding N_clus.]

In the literature, several criteria for determining the number of clusters have been suggested [Milligan and Cooper 1985], but they are typically based on predetermined constants, e.g., the number of final clusters or a threshold on similarity scores. Relying on predefined constants is harmful in practice because these criteria are very sensitive to the data set and hard to determine properly. Another branch of clustering methods, called model-based clustering [Vaithyanathan and Dom 2000], has the advantage of offering a way to estimate the number of groups present in the data. However, such methods suffer from high computational cost and the risk of a wrong assumption about the data model. Our approach combines an objective function that measures the quality of the generated clusters with a heuristically acceptable preference function on the number of clusters. It can automatically determine a reasonable cluster number based on the given data set while keeping the efficiency advantage of non-parametric clustering methods.

3.3 Cluster Naming

To provide users with a more comprehensible hierarchy of clusters, the internal nodes, i.e., the nodes between the root and the leaves, should be labeled with concise names. Although labeling clusters is essential, only a few works have really dealt with it [Muller et al. 1999; Lawrie et al. 2001; Glover et al. 2002]. In Muller et al. [1999], the labels of a cluster were chosen as the n most frequent terms in the cluster. Lawrie et al. [2001] extracted salient words and phrases of the instances in a cluster from retrieved documents and organized them hierarchically using a type of co-occurrence known as subsumption. Glover et al. [2002] inferred hierarchical relationships and descriptions by employing a statistical model they created to distinguish between the parent, self, and child features in a set of documents.

Naming a cluster is a rather intellectual and challenging task. As mentioned, this work focuses on how to link clusters with close concepts and how to decide the appropriate levels in the hierarchy at which to position them. We use a hierarchical clustering technique to put similar instances together in a cluster and relevant clusters at the same or nearby levels. Cluster naming is not fully investigated at the current stage of our study. In this work, we simply take the most frequent co-occurring feature terms from the composed instances to name a cluster. Even so, as illustrated in Figure 12, such a primitive approach still provides an easy way for users to understand the concepts of the generated cluster hierarchy.
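As a minimal sketch of this naming heuristic (the paper specifies only "most frequent co-occurring feature terms," so labeling by the terms shared by the most member segments is one reasonable reading, not the definitive implementation):

    from collections import Counter

    def name_cluster(term_bags, top_k=3):
        """Label a cluster with the feature terms shared by the most member segments."""
        shared = Counter(t for bag in term_bags for t in set(bag))
        return [term for term, _ in shared.most_common(top_k)]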


Table I. The information of the paper data set.

Conference   # Paper titles      Conference     # Paper titles
AAAI'02      29                  SIGCOMM'02     25
ACL'02       65                  SIGGRAPH'02    67
JCDL'02      69                  SIGIR'02       44


4. EXPERIMENTS

Extensive experiments have been conducted to test the feasibility and performance of the proposed approach on text segments from different domains, including category names in a popular Web directory, famous people's names, academic paper titles, and natural language questions.

To have a standard basis for performance evaluation, we collected the following experimental data:

YahooCS. The category names in the top three levels of the Yahoo! Computer Science (CS) directory were collected. There were 36, 177, and 278 category names in the first, second, and third levels, respectively. These category names were short in length and specifically expressed key concepts in the CS domain; therefore, they could play the role of typical text segments and were used as the major experimental data in our study. Notice that some category names could be placed in multiple categories; i.e., they had multi-class information.

People. Named entities are an important concern in information extraction. To assess the performance of our approach on this kind of data, we collected the people names listed in the Yahoo! People/Scientist directory. In this data set, there were 250 famous people distributed over nine science fields, e.g., mathematicians and physicists.

Paper. For people in research communities, papers are a major search need, and their titles or author names are often used as queries submitted to search engines. Therefore, the paper titles from several well-known conferences were collected for the test; Table I lists the information of this data set.

QuizNLQ. Besides the data described above, we also collected a data set of natural language questions from a Web site that provides quiz problems⁵. In our collection, there were 163 questions distributed over seven categories. Figure 9 lists several examples. With this data set, we want to determine whether the proposed approach is helpful in clustering common NLQs using the retrieved relevant contexts, even though no advanced natural language understanding techniques are applied.

⁵http://www.coolquiz.com/trivia/


Questions About People
  What is a Booger made of?
  What is a FART and why does it smell?
  How come tears come out of our eyes when we cry?
  What makes people sneeze?
  Why do my feet smell?

Questions About Insects and Animals
  How do houseflies land upside down on the ceiling?
  Why do some animals hibernate in the winter?
  What is honey? How do honey bees make honey?
  Why and how do cats purr?
  How does catnip work? Why is it only for cats?

Questions About Inventions and Machines
  Why do golf balls have dimples?
  Did Thomas Edison really invent the light bulb?
  How a newspaper tears?
  How did coins get their names?
  How do mirrors work?

Questions About The Environment, Space and Science
  How does a helium balloon float?
  Why do bubbles attract?
  What would happen if there was no dust?
  What are sunspots?
  Why are there 5,280 feet to a mile?

Questions About Food, Drinks and Snacks
  Why is Milk white?
  Why is it called a "hamburger" when there is no ham in it?
  What are hot dogs made of?
  Why do doughnuts have holes?
  Why do Onions make us cry?

Fig. 9. Examples of the testing natural language questions.

Notice that all the target instances in the above data sets have class information, e.g., the conference of a paper and the field(s) of a person. The class information is taken as external information for evaluating the clustering results.

We adopted the F-measure [Larsen and Aone 1999] as the evaluation metric for the generated cluster hierarchy. The F-measure of cluster j with respect to class i is defined as

    F_{i,j} = 2 R_{i,j} P_{i,j} / (R_{i,j} + P_{i,j}),

where R_{i,j} and P_{i,j} are recall and precision, defined as n_{i,j}/n_i and n_{i,j}/n_j, respectively, where n_{i,j} is the number of members of class i in cluster j, n_j is the number of members in cluster j, and n_i is the number of members of class i. For the entire cluster hierarchy, the F-measure of any class is the maximum value it attains at any node in the tree, and an overall F-measure is computed by taking the weighted average of all the per-class F-measure values as follows:

    F = Σ_i (n_i / n) max_j {F_{i,j}},    (2)

where the maximum is taken over all clusters at all levels, n is the total number of instances, and n_i is the number of instances in class i.
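In code, the hierarchy-level F-measure of Equation (2) can be computed as in the sketch below, given each instance's class label and the member set of every cluster node at every level of the tree.

    def hierarchy_f_measure(classes, cluster_nodes):
        """Overall F-measure of Equation (2).

        classes: dict mapping instance id -> class label
        cluster_nodes: iterable of sets of instance ids (all nodes, all levels)
        """
        n = len(classes)
        by_class = {}
        for inst, label in classes.items():
            by_class.setdefault(label, set()).add(inst)
        overall = 0.0
        for members in by_class.values():           # one class i at a time
            best = 0.0
            for node in cluster_nodes:
                n_ij = len(members & node)
                if n_ij == 0:
                    continue
                recall = n_ij / len(members)        # R_ij = n_ij / n_i
                precision = n_ij / len(node)        # P_ij = n_ij / n_j
                best = max(best, 2 * recall * precision / (recall + precision))
            overall += (len(members) / n) * best    # weighted by n_i / n
        return overall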

For comparison, a k-means method was modified to make it hierarchical (HKMeans) and used as the reference method. The basic k-means algorithm consists of the following steps:

(1) Randomly select k instances as the initial elements of the k clusters.

(2) Assign each instance to its most-similar cluster.

(3) Repeat Step 2 until the clusters do not change or the predetermined number of iterations is used up.

Table II. The F-measure results of clustering Yahoo! CS category names using various clustering methods and various similarity measure strategies.

        HAC      HAC+P    HAC+P (ρ = 2)    HKMeans (ρ = 2, k = √n)
AL      .8626    .8385    .8197            .5873
CL      .8324    .8116    .7838            .2892
SL      .6848    .6529    .2716            N/A
CE      .4177    .4080    .2389            .5873
Random (HKMeans, average of twenty trials): .2893

In HKMeans, all the instances are first clustered into k clusters using k-means, and then the same k-means procedure is recursively applied to each cluster until the specified depth ρ is reached. The similarity between an instance and a cluster is measured using not only the conventional centroid method but also the average- and complete-linkage methods described previously. The single-linkage method is not suitable for k-means because, under the single-linkage similarity measure, an instance's most-similar cluster is the one that contains the instance itself; therefore, the resulting clusters depend entirely on the k initial clusters. Because of the random selection of the k initial clusters, the resulting clusters vary in quality; some sets of initial clusters lead to poor convergence rates or poor cluster quality.
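A compact sketch of the HKMeans baseline under these choices; the centroid-similarity variant is shown, sim is the cosine function from Section 2.1, and k is set to the nearest integer to √n at each recursion step (the iteration cap and minimum cluster size here are illustrative assumptions).

    import math
    import random

    def kmeans(vectors, k, iters=20):
        """Basic k-means using cosine similarity to cluster centroids."""
        def centroid(c):
            terms = {t for v in c for t in v}
            return {t: sum(v.get(t, 0.0) for v in c) / len(c) for t in terms}
        clusters = [[v] for v in random.sample(vectors, k)]      # step 1: random seeds
        for _ in range(iters):                                   # step 3: bounded iterations
            cents = [centroid(c) for c in clusters]
            buckets = [[] for _ in cents]
            for v in vectors:                                    # step 2: most-similar cluster
                buckets[max(range(len(cents)), key=lambda i: sim(v, cents[i]))].append(v)
            clusters = [b for b in buckets if b]                 # drop emptied clusters
        return clusters

    def hkmeans(vectors, depth=1, rho=2, min_size=4):
        """Recursively apply k-means with k = round(sqrt(n)) down to depth rho."""
        if depth > rho or len(vectors) < min_size:
            return {"members": vectors, "children": []}
        k = max(2, round(math.sqrt(len(vectors))))
        return {"members": vectors,
                "children": [hkmeans(c, depth + 1, rho, min_size)
                             for c in kmeans(vectors, k)]}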

Table II shows the resulting F-measure values for clustering the 177 Yahoo! CS second-level category names, with the 36 first-level categories as the target classes, using various clustering methods and various similarity measure strategies. The F-measure values of the binary-tree hierarchies produced by conventional HAC are provided as upper-bound reference values for the other HAC variants⁶. HAC+P is HAC with the partitioning procedure, and ρ = 2 means that the depth of the resulting hierarchy was constrained to be at most 2. The parameter k in HKMeans was dynamically set to the nearest integer to √n, where n was the number of instances to be clustered at each step. Besides the various similarity measure strategies, a random clustering experiment was also performed with HKMeans; that is, at each step, each instance was randomly assigned to one of the k clusters. The reported F-measure score of random clustering is an average over twenty trials and is provided as a reference score reflecting the goodness of the cluster hierarchies produced by the various clustering and similarity measure methods. The result obtained using HKMeans with complete-linkage was poor because the specified maximum number of iterations was used up without converging to a set of stable clusters. Overall, HKMeans did not perform very well in this task.

From the experimental results shown in Table II, we found that the average- and complete-linkage methods performed much better than the single-linkage and centroid ones under the F-measure metric, and that the average-linkage method was even slightly better. The incorporation of partitioning and the constraint on hierarchy depth caused only a very small decrease in the F-measure.

^6 Since every cluster node in the hierarchy produced by HAC+P is a node in the hierarchy produced by HAC, the F-measure value of HAC is definitely the upper bound for variants of HAC+P according to the definition of F-measure in Equation 2.


CS:Linguistics (II) (recall=.23, precision=1)
  Aphasia [CS:Linguistics]; Speechreading [CS:Linguistics]; Slang [CS:Linguistics]; Etymology [CS:Linguistics]; Words and Wordplay [CS:Linguistics]

CS:Linguistics (I) (recall=.77, precision=1)
  Metaphor; Philosophy of Language; Semiotics; Linguists; Writing Systems; Language Policy; Languages; Lexicography; Translation and Interpretation; Phonetics and Phonology; Dialectology; Sociolinguistics; Computational Linguistics; Natural Language Processing [CS:AI, CS:Linguistics]; Language Acquisition; Psycholinguistics (all [CS:Linguistics] unless noted)

CS:MobileComputing (recall=.57, precision=.80)
  Tablet Computers and Webpads [CS:MobileComputing]; Laptop Computers [CS:MobileComputing]; Personal Digital Assistants (PDAs) [CS:MobileComputing]; Personal Area Networks [CS:Networks]; Wearable Computing [CS:MobileComputing]

CS:Security (recall=.70, precision=.93)
  Kerberos [CS:Security]; S/KEY [CS:Security]
  Privacy; Cryptography; Encryption Policy; PGP - Pretty Good Privacy; Digital Signatures; RSA (all [CS:Security])
  Anonymous Mailers [CS:Networks, CS:Security]; Email [CS:Networks]; Encrypted Email Providers [CS:Security]
  Firewalls [CS:Security]; Viruses and Worms [CS:Security]
  Hacking [CS:Security]; Software Piracy [CS:Security]

CS:AI (recall=.67, precision=.60)
  Turing Machines [CS:Modeling]; Turing Test [CS:AI]; Higher-Order Logic Theorem Provers [CS:FormalMethods]; Set Theory [CS:LogicProgramming]; Simulated Annealing [CS:Algorithms]; Information Retrieval [CS:Lib&InfoSci]; Inference [CS:AI]; Ontology [CS:AI]; Artificial Life [CS:AI]; Cellular Automata [CS:AI, CS:Modeling]; Expert Systems [CS:AI]; Machine Learning [CS:AI]; Neural Networks [CS:AI, CS:NeuralNetworks]; Fuzzy Logic [CS:AI]; Genetic Algorithms [CS:Algorithms]

Fig. 10. Selected examples for clustering Yahoo! CS category names.

To provide readers with a more comprehensive and clearer understanding of the clustering results, Figure 10 shows selected clusters from the result of HAC+P with the average-linkage method and the two-level depth constraint; each block contains a generated first-level cluster, and each row contains a second-level sub-cluster. For each cluster, the corresponding Yahoo! category name (annotated manually) and the achieved recall/precision rates are presented in the headline row, and the clustered text segments are attached with their correct class names. It is easy to see that the segments grouped in the same clusters are highly relevant. Although there are some text segments whose corresponding classes do not match the generated clusters, their meanings are semantically related to the automatically assigned clusters.

Let us now look at aspects of the generated hierarchy structure that the F-measure metric does not capture. To give a clearer picture of the generated hierarchical cluster structure, Table III lists some statistics of the hierarchy structures generated by clustering the Yahoo! CS category names using the average-linkage similarity measure. The reported information includes the depth of the hierarchy, the total number of clusters generated at all levels, and the average number of child clusters over all clusters, a metric that reflects a hierarchy's degree of broadness. Evidently, the decrease in the F-measure value caused by partitioning and the constrained hierarchy depth was very small (refer to Table II), but the generated structures were considered more natural and more helpful for observing the facts they contain. To give readers a more comprehensive view of the resulting structure, Figure 11 shows a subset of the generated binary-tree hierarchy produced by the conventional HAC algorithm with the average-linkage measure, and Figure 12 shows the corresponding multi-way-tree hierarchy generated using HAC+P with the two-level depth constraint. (The five most frequent feature terms from the composed instances are chosen as the labels of the internal nodes in Figure 12.) Readers can thus compare the two structures.


[The tree diagram of Figure 11, a deep binary tree over the Linguistics, AI, Mobile Computing, and Security category names, is not reproducible in plain text.]

Fig. 11. Example binary-tree hierarchy.

Table III. Some statistics about the generated hierarchy structures (average-linkage measure).

                              HAC    HAC+P   HAC+P(ρ = 2)
  Depth                       17     6       2
  # Clusters at all levels    352    278     76
  Avg. # Child Clusters       2      2.70    5.43

Although it is hard to measure the goodness of a generated hierarchy structure quantitatively, we believe that the multi-way-tree representation is more natural and easier for humans to browse and interpret than a deep and narrow binary-tree hierarchy.

Table IV shows the results of clustering people names, paper titles, and natural language questions using the average-linkage similarity measure. The proposed approach still achieved very good performance on these data under the F-measure metric. The performance of clustering natural language questions was somewhat poorer than the others, although the achieved F-measure scores were still very good. This was mainly because the domains of our test natural language questions were very diverse (refer to the examples in Figure 9), so questions of the same domain rarely shared common features. During the experiment, we also found that a subset of the paper titles were too specific to obtain adequate search results: among the 299 paper titles in total, 37 obtained fewer than 20 search-result entries. This prevented those paper titles from acquiring adequate features for being clustered with relevant instances. From our analysis, this situation arises when dealing with long text segments because of the limitations of existing Web search engines. In the next section, we present a specific query processing technique to overcome this difficulty.


[The tree diagram of Figure 12 is not reproducible in plain text. The internal-node labels recoverable from the diagram were: "computers, personal, wearable, networks, laptop"; "email, encryption, privacy, software, security"; "words, aphasia, slang, etymology, speechreading"; "language, linguists, computational, philosophy, systems"; and "turing, machines, computer, logic, program".]

Fig. 12. Example multi-way-tree hierarchy. The words near each internal node are the five most frequent feature terms among the composed instances and are chosen as the labels of the corresponding internal node.

Table IV. The F-measure results of clustering people names, paper titles, and natural language questions using the average-linkage method.

             HAC     HAC+P   HAC+P(ρ = 2)   HKMeans(ρ = 2, k = √n)
  People     .7554   .7250   .7250          .5074
  Paper      .7657   .7075   .7045          .5361
  QuizNLQ    .6954   .6235   .6235          .4619


5. QUERY RELAXATION FOR LONG TEXT SEGMENTS

Sometimes there exist text segments that are too specific to obtain adequate search results from current keyword-matching-based search engines. Our purpose in using search-result snippets is not to fulfill search requests, but mainly to extract features for text segments and to estimate the frequency distribution of features in the segments' intended topic(s). Insufficient snippets would make the obtained information sparse and, hence, unreliable, and may hurt the relevance measurement among text segments.

The situation of retrieving inadequate search results mostly occurs when dealing with long text segments, e.g., paper titles and natural language queries. Compared with a short text segment, a long one contains more information, e.g., more terms, and it is rather difficult to obtain documents exactly matching all of the terms. However, since a long text segment contains more information, not all terms in the segment are equally informative of its intended topic(s). This motivates our invention of a query processing technique, named query relaxation, to acquire more relevant feature information for long text segments through a bootstrapping process of search requests to search engines.

To clarify the idea of query relaxation, let us take the title of this paper as an example of a long text segment: "Topic Hierarchy Generation for Text Segments: A Practical Web-based Approach." Suppose that one needs to select a subset of terms as the query that best represents the topical concept of this segment; one would most probably select the terms of "Topic Hierarchy Generation for Text Segments" rather than "A Practical Web-based Approach," because the former seems to be the major theme of the paper. If one needs to further reduce the sub-segment "Topic Hierarchy Generation for Text Segments," then "Topic Hierarchy Generation" seems better than "Text Segments." Of course, this selection is not always the most feasible one; it depends on the circumstances under which the decision is made. The idea demonstrated here is that the textual part of a long text segment can be reduced, or relaxed. The reduced segment represents a concept that is still close to (or, usually, broader than) the main topical concept of the original segment, and it can usually retrieve more search results because it holds fewer terms. For example, if we cannot get adequate search results for "Topic Hierarchy Generation for Text Segments: A Practical Web-based Approach," the results for "Topic Hierarchy Generation for Text Segments" may be borrowed to augment the representation of the original text segment. If they are still not enough, those for "Topic Hierarchy Generation" can be further added. Although the augmented search results may contain material about topic hierarchy generation for documents rather than for text segments, their topics are close, and the corresponding features and feature distributions would be similar. This idea is somewhat analogous to hierarchical shrinkage, which smooths the parameter estimates of a data-sparse child with those of its parent in order to obtain more robust estimates [McCallum et al. 1998].

Thus the precise goal of our query relaxation is to reduce a text segment or, in other words, to relax the query formed by the given text segment when submitted to search engines, so as to obtain more search results. Our assumption is that a sub-segment, i.e., one formed from a subset of the terms of the original segment, represents a broader or at least equal concept of the original segment.

The above example suggests a possible approach in an inclusion manner, that is, to select the subset of most informative terms from the given text segment. Instead of following such an inclusion manner, we design our approach in an exclusion manner. That is, when the search results of the given text segment are not adequate, a single term is removed from the segment, and the remaining terms form a new query to search engines. The term to be removed is determined by comparing the information retrieved via the sub-segments with and without that term. The newly retrieved search results are then added to the set of original search results. (Of course, overlapping entries are detected and not added again.) This relaxation process is repeated until the obtained information is considered adequate. The whole process is formalized as follows.


q  = Polynomial-Time Reinforcement Learning of Near-Optimal Policies
q1 = Reinforcement Learning Near-Optimal Policies
q2 = Reinforcement Learning Policies

q  = Named Entity Recognition using an HMM-based Chunck Tagger
q1 = Named Entity Recognition HMM-based Tagger
q2 = Named Entity Recognition Tagger

q  = A digital library of conversational expressions: helping profoundly disabled users communicate
q1 = digital library conversational expressions helping disabled users communicate
q2 = digital library conversational expressions helping users communicate

Fig. 13. Examples of paper titles and their relaxed versions.

Suppose that the candidate text segment is $p$. Let $p^0 = p$; query relaxation removes a term from $p^k$, $k \ge 0$, to form a new text segment $p^{k+1}$, which is then submitted to search engines to obtain the corresponding search-result snippets $D_{p^{k+1}}$. The accumulated information for $p$ is set as $D_p = \bigcup_{0 \le i \le k+1} D_{p^i}$. Let $N_{min}$ be a constant expressing the minimum required number of search-result entries. A text segment $p^k$, $k \ge 0$, is relaxed when $|D_p| < N_{min}$. The number of iterations of the relaxation process depends on whether adequate information has been obtained, i.e., $|D_p| \ge N_{min}$, and on whether the text segment is still long enough for a term to be removed.
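A minimal Python sketch of this loop follows (the naming is our own; search_snippets stands in for a call to a real search engine, and choose_term_to_remove, sketched later, implements the term-selection step described next):

    def relax_and_collect(terms, search_snippets, choose_term_to_remove,
                          n_min=20, n_max=100):
        # Accumulate D_p, relaxing the query one term at a time while the
        # results remain inadequate (|D_p| < N_min) and terms remain.
        snippets = list(search_snippets(terms, limit=n_max))
        terms = list(terms)
        while len(snippets) < n_min and len(terms) > 1:
            terms.remove(choose_term_to_remove(terms, search_snippets))
            for s in search_snippets(terms, limit=n_max):
                if s not in snippets:      # overlapping entries are not added
                    snippets.append(s)
        return snippets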

Regarding the term to be removed in an iteration, we use a similarity comparison among all possible candidates to decide it. Suppose that the given text segment $p^k$ has $m$ terms $p^k_1, \ldots, p^k_m$. There are $m$ candidate sub-segments with a length of $m-1$ terms. For example, the segment "topic hierarchy generation" has three candidate sub-segments after removing one term: "hierarchy generation," "topic generation," and "topic hierarchy." (We assume the order of terms in the segment does not matter.) These $m$ sub-segments are then represented as feature vectors using the corresponding search-result snippets. For a term that is more informative, the search-result snippets retrieved by the sub-segments containing that term will be more cohesive than the snippets retrieved by the sub-segments without it; such a term should not be removed early. Let $V_{p^k_i}$ be the set of feature vectors of the sub-segments containing term $p^k_i$. A score that estimates the informativeness of a term in text segment $p^k$ is assigned to each term $p^k_i$ as the average of all pairwise similarities among the sub-segments containing $p^k_i$, i.e. (refer to Section 2 for $sim_A$),

$$s(p^k_i) = sim_A(V_{p^k_i}, V_{p^k_i}).$$

A term with the minimum $s$ score most likely plays the role of a modifier and is considered for removal. Figure 13 shows several examples of paper titles with their relaxed versions obtained using the above approach. These paper titles were copied directly from the programs available on the corresponding conferences' Web sites, in which there are accidentally some typos, e.g., "Chunck" in the second example. Such typos are removed at a very early stage of the relaxation process, which is a positive by-product of this approach. For the constants used in this study, we set Nmax = 100 and Nmin = 20 by default; a comparison of various values is made in a later experiment.
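The term-selection step might be sketched as follows (again with our own naming and assumptions: feature vectors are plain term-frequency counters, terms are assumed distinct, and the average pairwise cosine similarity stands in for the paper's $sim_A$):

    import math
    from collections import Counter

    def feature_vector(snippets):
        # simple term-frequency vector over the snippet words (illustrative)
        return Counter(w for s in snippets for w in s.lower().split())

    def cosine(a, b):
        dot = sum(v * b.get(t, 0) for t, v in a.items())
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    def choose_term_to_remove(terms, search_snippets):
        # One candidate sub-segment per term, leaving that term out
        # (assumes the terms are distinct).
        subsegs = [[t for t in terms if t != u] for u in terms]
        vecs = [feature_vector(search_snippets(s, limit=100)) for s in subsegs]
        scores = {}
        for i, u in enumerate(terms):
            containing = [vecs[j] for j in range(len(terms)) if j != i]
            pairs = [(a, b) for x, a in enumerate(containing)
                            for b in containing[x + 1:]]
            # s(u): average pairwise similarity of sub-segments containing u
            scores[u] = sum(cosine(a, b) for a, b in pairs) / max(1, len(pairs))
        return min(scores, key=scores.get)   # least informative term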

Notice that query relaxation is, in essence, a sort of fuzzy search, which allows matches on documents containing only a subset of the query words. However, our query relaxation is somewhat different from the fuzzy search mechanisms that current real-world search engines provide.


Table V. The F-measure results of clustering paper titles with (+QR) and without (-QR) query relaxation using the average-linkage method.

  Paper   HAC     HAC+P   HAC+P(ρ = 2)
  -QR     .7657   .7075   .7045
  +QR     .8259   .7860   .7860

For handling a repository as large as the Web, and considering the response time of each query, it is very difficult for real-world search engines to provide any sophisticated fuzzy search mechanism beyond the simple boolean "or" operation. In the boolean "or" operation, all query words are treated as equally important. However, the search-result snippets we obtain to enrich the representation of a text segment should preserve a topical meaning as close to that of the original segment as possible. With our query relaxation, the term not to be matched is selected by comparing the information retrieved with and without the term, which goes well beyond the "or" operation on query words.

Of course, there probably exist short text segments without adequate search results for describing themselves, although we did not encounter such a situation in our experiments. If such a case happens, there may be a need for thesauri or query expansion techniques, as commonly used in the field of information retrieval. There have been several feasible approaches in the literature, e.g., Xu and Croft [1996]. In order to keep our study focused, we do not address this special problem at the current stage of research.

More Experiments

Now let us re-examine the experiment on clustering paper titles. Originally, 37 of the 299 paper titles obtained fewer than 20 search-result entries. After one iteration of query relaxation, only five paper titles remained with fewer than 20 search-result entries. Table V shows the resulting F-measure scores of clustering paper titles using the average-linkage method. Two sets of scores are reported: one achieved without query relaxation and the other achieved by applying one iteration of query relaxation. The experimental results show that the extra snippets brought in by query relaxation help the clustering of long text segments achieve better performance. The only weakness is that query relaxation requires costly computation and more network accesses; this is a trade-off for achieving better performance.

Two other experiments were conducted to examine the effects of other factors that may affect the accuracy of text-segment clustering. The first examined the effect of the ranking of search results returned by search engines and the number of snippets needed to achieve good performance. Table VI shows the F-measure results of clustering Yahoo! CS category names using HAC+P(ρ = 2) with the average-linkage measure on various ranges of search-result snippets. The second row contains the ranges of search-result snippets used; for example, 1–25 indicates that the snippets between the 1st and 25th entries of the returned search results are used, 51–100 indicates that the snippets between the 51st and 100th entries are used, and so on.


Table VI. The F-measure results of clustering Yahoo! CS category names using various numbers and ranges of search-result snippets.

  Number of Snippets          25                      50             100
  Range of Snippets     1–25   26–50  51–75  76–100   1–50   51–100  1–100
  HAC+P(ρ = 2)          .7079  .7189  .7055  .7217    .7788  .7574   .8197

Table VII. The F-measure results of clustering paper titles under various combinations of constant values.

  Paper        Nmin = 0   Nmin = 20   Nmin = 50   Nmin = 100
  Nmax = 20    .8248      .8350       N/A         N/A
  Nmax = 50    .7934      .7989       .8570       N/A
  Nmax = 100   .7045      .7860       .7640       .7179

From the resulting scores, it seemed that the ranking of search results by search engines did not affect the performance much; e.g., 1–25 and 1–50 achieved almost the same performance as 26–50 and 51–100, respectively. That is, the result entries, not being highly dependent on rank, provided equally effective information for clustering text segments. Also, more snippets seemed to help achieve better performance.

The second experiment was conducted to examine the performance under the various constants used in our study, i.e., Nmax and Nmin. Table VII shows the F-measure results of clustering paper titles using HAC+P(ρ = 2) with the average-linkage measure under various constant values. Notice that Nmin = 0 means query relaxation was not applied. The results show that query relaxation did aid the clustering of paper titles: all trials with query relaxation, i.e., the cells with Nmin > 0, achieved better performance than those without. However, augmenting more search results (i.e., larger Nmin values) did not guarantee better performance. The reason might be that relaxing more, i.e., adding more search results, brings in more irrelevant material and therefore degrades the improvement. From the results, the performance varied considerably with different combinations of these constant values. A point worth noticing is that larger Nmax values did not yield better performance. This contradicts the results attained in the previous experiment (Table VI), in which more snippets achieved better performance, and may be due to the different characteristics of the test instances. Yahoo! CS category names can be treated as subject terms, and more information helped distinguish them from other target instances because most of the search results returned by search engines were relevant. The situation was different for paper titles, however: too many search-result snippets seemed to bring a lot of noisy information into the task. In conclusion, the optimal number of search-result snippets depends on the characteristics of the target text segments, and it is not easy to find one generally acceptable principle for choosing the values of these constants.

Overall, the experimental results strongly support the feasibility of our approach. In the following sections, we provide further evaluation through user studies, more discussion of the overall approach, and some observations from the experiments.


Table VIII. User Study I: Manually evaluating the hierarchies according to various qualitative measures.

                       Yahoo! CS   HAC    HAC+P(ρ = 2)   HKMeans(ρ = 2, k = √n)
  Cohesiveness         6.5         5.5    5.5            2.5
  Isolation            6.0         4.0    4.5            3.0
  Hierarchy            4.5         2.0    3.0            2.5
  Navigation Balance   5.5         2.0    4.5            3.0
  Readability          6.0         3.0    4.0            3.0

6. USER EVALUATION

As the field of automatic topic hierarchy generation or taxonomy creation is relatively new in IR research, there is no formal way of evaluating a topic hierarchy to date [Sanderson and Croft 1999; Suan N. M. 2004]. In addition to using the F-measure as a basic metric, it is helpful to conduct user studies, i.e., to ask humans to judge whether they accept the auto-generated classifications. A user evaluation consisting of two different tests was therefore carried out, as described below.

Test I: Comprehension Test

The first test was conducted to compare the quality of the auto-generated cluster hierarchies with that of manual hierarchies. Several qualitative measures were to be assigned by the volunteers. These measures were collected by reviewing previous works that attempted to make hierarchies more comprehensible [Sanderson and Croft 1999; Suan N. M. 2004; Muller et al. 1999].

Cohesiveness. Judge whether the instances clustered together are semantically similar.

Isolation. Judge whether the auto-generated clusters at the same level are distinguishable and their concepts do not subsume one another.

Hierarchy. Judge whether the generated topic hierarchy is traversed from broader concepts at the higher levels to narrower concepts at the lower levels.

Navigation Balance. Judge whether the fan-out at each level of the hierarchy is appropriate.

Readability. Judge whether the concepts of the clusters at all levels are easy to recognize from the composed clusters and instances.

For the test, a total of five volunteers were requested to do the evaluation; two of them were librarians, and three were students with computer science backgrounds. For each measure, the volunteers were asked to assign a numeric value ranging from 0 to 7 to indicate its degree of effectiveness; a value of 0 meant very poor, and a value of 7 meant very good. Table VIII shows the average scores assigned by the volunteers when evaluating the original Yahoo! CS hierarchy and the hierarchies generated by applying the various automatic methods to the Yahoo! CS data set.

It was no surprise that the Yahoo! CS hierarchy obtained the highest scores on all measures. However, it was encouraging that the HAC+P method performed better than the other automatic methods on almost every measure, and its overall performance was very promising compared with that achieved by manual classification.


This is consistent with the result obtained using the F-measure as the metric. The result also showed the merit of the HAC-based methods: they performed very well in grouping strongly relevant terms and achieved a high score on the cohesiveness measure, especially for the smaller clusters. Unfortunately, this merit did not always hold for the HKMeans method. From observing the resulting hierarchies, HKMeans produced some casual errors; e.g., it clustered "Natural Language Processing" together with "Constraint Programming" rather than with "Computational Linguistics." Meanwhile, the experiment also revealed some drawbacks of manual classification. For example, in Yahoo!'s classification, the subject terms at the same level are alphabetically ordered, which might not provide good readability when looking for classes with close semantics. For example, the terms "Linux" and "Unix" were not easily found at first glance, since they were separated by a number of other OS names, such as "Mach" and "Macintosh OS," in the classification.

Nevertheless, the above user evaluation revealed two major weaknesses of the automatic methods. First, the automatic methods did not perform well enough on the isolation measure. That was because not all similar clusters could be automatically merged into the same larger one, which might cause some confusion in browsing and further decrease the isolation score. Second, there was an unsatisfactory and more challenging measure, namely the hierarchy value, which could not be revealed using the F-measure. Clustering is an approach to grouping instances based on their similarities; in our work, it is good at finding related or similar instances, but it has a limitation in determining the broader or narrower relationships between instances. Thus, determining the different levels of granularity or the subsumption relationships among topics could not be performed very well by our current approach. These qualitative measures could not be evaluated without some subjectivity, though we have confidence in the results obtained.

Test II: Usability Test

Topic hierarchy creation is rather intellectual work, which is very difficult even for humans. An automatic approach to grouping similar or related instances would provide a lot of help for humans in the further construction of a real and complete topic hierarchy. The purpose of the second test was to determine whether the proposed approach helps human experts reduce the time needed to construct a topic hierarchy and improve the classification accuracy. We asked four more computer science students, who did not participate in the first test, to form two separate groups and construct the Yahoo! CS hierarchy manually. The first group performed the task from scratch, and the second group used the cluster hierarchy generated by HAC+P as a reference. These volunteers were informed of the exact number of levels and classes of Yahoo!'s original CS hierarchy, but were not asked to construct theirs at the same scale. We recorded the time they spent and calculated the F-measures of the hierarchies they constructed against Yahoo!'s. Table IX shows the average time spent and the F-measure scores.

From this table, the usefulness of the proposed approach in helping topic hierarchy creation can be seen. Manual construction really took time.


Table IX. User Study II: Average time spent and F-measure scores for the manual hierarchies constructed from scratch and with references.

               Group I: From scratch   Group II: Referring to the generated hierarchy
  Time Spent   65 min                  35 min
  F-measure    .6792                   .8008

It was a bit surprising that the manually constructed topic hierarchies did not achieve very high F-measure values, which indicates that different people have different expectations of a topic hierarchy. In addition to the bias of human preference, we found the lack of domain knowledge to be another factor affecting the performance of manual classification. The test terms covered very broad topic areas in computer science, and some of them were genuinely unknown to the volunteers. Therefore, the auto-generated cluster hierarchies can not only reduce the time spent but also improve the classification accuracy when a human expert performs post-editing.

The usefulness can be further observed when dealing with a large-scale task, such as organizing a Web directory or a taxonomy to classify users' queries. We applied the proposed approach to construct a topic hierarchy for a set of 1,000 popular queries collected from the log of a real-world search engine [Chuang and Chien 2002]. A topic hierarchy of these 1,000 queries was built in 39 minutes in a PC environment. There is no doubt that no human expert could achieve the same performance within such a short period. This demonstrates the extensibility and scalability of the proposed approach. Nevertheless, the proposed approach merely provides an initial step; it does not address all the issues of this challenging problem.

7. MORE DISCUSSIONS

A core of the proposed approach is the use of Web search-result snippets returned by real-world search engines to enrich the representation of the target text segments. This yields several apparent advantages. First, huge amounts of available documents on the Web have been indexed, so there is a higher chance of getting sufficient and balanced information to characterize a text segment. Second, as the Web is always refreshing and growing, it rarely lacks information about text segments, even those concerning new products and breaking news events. Third, as a result of recent advances in search technologies, highly ranked documents usually contain documents of interest and can be treated as an approximation of the text segments' topic domains. Therefore, we do not have to laboriously prepare domain-specific corpora for describing text segments; existing search engines reduce our load.

Unfortunately, there are also several weaknesses. Using search-result snippets requires a lot of Web accesses; normally, one interaction with the backend search engines is needed for each text segment. However, this is a necessary cost of any approach that uses the Web as a corpus. Besides, the dynamic and ever-changing nature of the Web becomes a drawback when considering the issue of reproducibility: the search results for a text segment may not be the same in every trial of retrieval. This is certainly a problem, but also a trade-off.


As a result of recent advances in search technologies, we believe that the performance of every trial will be consistent even if not exactly the same. In addition, Web contents are usually heterogeneous and noisy, and need careful treatment. Many information retrieval problems, such as text categorization, have met big challenges when dealing with Web contents by directly applying traditionally well-performing methods [Chakrabarti et al. 1998]. However, the situation is a little different when using search-result snippets. With the presentation schemes of most search engines, the neighboring contents surrounding a matched query in a Web page are selectively shown in the returned snippet. Therefore, features are extracted from the corresponding text segment's contexts instead of from whole Web pages. The observations in Section 2 also show the promise of using search-result snippets as the source of features; i.e., search-result snippets carry many related terms that can feasibly be treated as features.

We have shown that highly ranked search-result snippets do contain related terms for a text segment. Now the question is: do the features extracted from highly ranked search-result snippets sufficiently characterize a given text segment? For a text segment with only a single specific meaning, the retrieved highly ranked snippets are believed to represent that meaning. But for a text segment with several meanings, e.g., the classic "jaguar" query, the highly ranked snippets may contain too little or even no relevant information for some of the meanings; i.e., some meanings cannot be revealed. To examine this phenomenon, take the following six words as a clustering example:

jaguar tiger lion toyota bmw nissan

After applying our approach, the resulting clusters are {lion, tiger} and {jaguar, bmw, toyota, nissan}. Obviously, "jaguar" is characterized as closer to cars than to cat animals. Now consider the clustering of the following six words:

jaguar tiger lion sheep rabit cow

The resulting clusters are {rabit}, {sheep, cow}, and {lion, jaguar, tiger}. In this case, "jaguar" is correctly associated with the other wild cat animals and distinguished from the tame animals. Obviously, the clustering result depends on the other instances of concern. From the above two examples, the search-result snippets of "jaguar" seem to contain mostly material about "jaguar" as a car, and less, though not zero, about it as an animal. It is impractical to get all retrieved snippets for a text segment; relying on highly ranked snippets is certainly practical but may suffer from the obstacle described above. In particular, search engines have their own preferences in ranking search results; for example, some engines may prefer to rank highly Web pages that belong to organizations or come from famous Web sites. In such situations, search engines introduce a bias into our determination of the topic similarity between text segments through the use of search-result snippets. Actually, this problem is not serious when dealing with domain-specific text segments, such as the computer-science subject terms and paper titles in our experiments, since they represent requests in specific domains. But it may limit the power of our approach in dealing with common text segments, e.g., common words found in a dictionary.

Besides the issue of acquiring features from search-result snippets, there are also challenges to our clustering algorithm in dealing with polysemous text segments.


For example, "natural language processing" belongs to both the AI and Linguistics categories, and "Newton" was not only a physicist but also a mathematician. The proposed approach is a hard-clustering algorithm, in which one instance is placed in only one cluster. Clustering text segments with multiple topic domains into multiple clusters needs further exploration. A possible approach is to apply categorization techniques to the clustered text segments in a later stage so as to associate them with other related clusters [Chuang and Chien 2003].

Readers may wonder why we do not simply use HAC as the clustering algorithm, since HAC attains the highest scores. Although the F-measure is a widely accepted metric, it has weaknesses in measuring the quality of clustering results, especially when the generated structure is of concern. According to the definition in Equation 2, one class is mapped to only one cluster node in the hierarchy when scoring the F-measure value; this is mainly a retrieval perspective and does not account for the structural information of the hierarchy. From the examples shown in Figures 11 and 12, the multi-way-tree presentation of text segments seems more comprehensible than the binary-tree one. Of course, such a conclusion is subjective and should be justified through user tests, which depend on the applications in which the hierarchies are used. There seems to be no commonly accepted metric for comparing the quality of structures.

So far, we have discussed many aspects of using search-result snippets in Sections 2 and 7. The strengths show the great potential of our approach, while the weaknesses show its limitations as a general approach for processing segments with multiple meanings or rare segments with very few search-result snippets. We retain a positive perspective on our approach: the way we incorporate search-result snippets can substantially boost the applicability of various statistics-based information processing techniques, e.g., categorization and clustering, when dealing with items that lack information for describing themselves.

8. RELATED WORK

To our knowledge, there is little work directly addressing the problem we considerin this paper. Below we review some works that are considered relevant to ours.

Web and Text Mining. Our research is related to Web and text mining [Chakrabarti2002; Hearst 1999]. Feldman and Dagan [1995] created the Knowledge Discoveryin Texts (KDT) system, the precursor to text mining, to find segments in conceptdistributions in textual data. A variety of related studies have focused on differ-ent subjects, such as the automatic extraction of terms or phrases [Ahonen et al.1999], extending knowledge bases such as WordNet using Web corpora [Moldovanand Girju 2001], the discovery of rules for the extraction of specific informationsegments [Soderland 1997], and ontology construction based on semi-structureddata [Agirre et al. 2000]. Different from these previous works, the proposed ap-proach is designed to organize text segments by mining search-result pages anddiscovering topic hierarchies.

Sanderson and Croft [1999] proposed a feasible approach to automatically constructing a hierarchical organization of terms from a set of documents that can then provide an overview of those documents. Instead of using clustering techniques, they analyzed subsumption term pairs to order the terms from general to specific, from which a concept hierarchy was derived.


The problem we consider differs from theirs in several aspects. In our study, the text segments of concern may not be associated with a set of documents. Deriving the hierarchy of terms according to the terms' source documents might have advantages; e.g., the derived hierarchy should mostly reflect the domain concepts those terms are used to describe. Unfortunately, a domain-specific corpus is not available in some real tasks. Our approach instead uses the Web as a global corpus to discover the similarity relationships between text segments and then derives the hierarchical organization of the text segments. Also, the text segments we consider are not restricted to terms from documents; they are more general and include longer strings, such as paper or book titles, natural language questions, and free-form query strings.

Document Clustering and Taxonomy Generation. Taxonomy-generation tools create a Yahoo!-like directory structure for navigating the contents of a portal or intranet. Some tools provide automatic methods for initial taxonomy construction based on the hierarchical clustering of documents [Li et al. 2003; Sullivan 2002; Vaithyanathan and Dom 2000]. It is challenging to create a topic hierarchy for a group of documents: document clustering varies in accuracy, documents normally contain diverse subjects, and the generated categories might not comprehensively reflect users' preferences. Sometimes, further topic-term extraction techniques, such as that of Lawrie et al. [2001], are required to enhance the readability of the generated document hierarchy. Unlike a document, a text segment often represents a specific meaning and is more understandable. In some applications, the need to generate topic hierarchies for text segments is even greater than that for documents, although the challenge is almost the same. However, compared with a general document, a text segment is much shorter and contains too few words in itself. While natural language processing techniques are not mature enough to understand the semantics of a text segment in a non-controlled subject domain, some extra information must be borrowed to enrich the representation of a text segment. Fortunately, the rich resources on the Web provide a general way to make up for this insufficiency, greatly facilitating the clustering of text segments using various statistics-based methods.

Some research works have proposed approaches to exploiting existing document taxonomies and enriching them using various classification techniques [Agrawal and Srikant 2001; Chakrabarti et al. 1998; Koller and Sahami 1997]. The main concern of such works is the fact that hierarchy creation is a difficult task for a human, so automatically creating a meaningful and useful hierarchy seems very hard. Therefore, a practical approach should exploit existing taxonomies, such as the Yahoo! directory, the Dewey Decimal system, or the ACM Computing Classification System. However, in many cases there is no existing taxonomy, or the existing one does not match the problem domain of interest; creating a taxonomy from scratch therefore becomes the real-world case. In conclusion, creating taxonomies and enriching them, e.g., Chuang and Chien [2003], complement each other. In this work, we only address the problem of constructing topic taxonomies for text segments.

Word/Term and Query Clustering. A number of approaches have been developed in computational linguistics for clustering words with similar functions. These approaches mainly rely on some sort of context analysis over a tagged corpus [Brown et al. 1991; Johansson et al. 1986; Pereira et al. 1993].


Similarly, search-result snippets can be treated as a source of contextual information for text segments; this source, however, is not restricted to a corpus of limited size and domains. Besides, a text segment can be longer than a word and usually represents a domain-specific request, so the traditional analysis of the neighboring context words of a target word in natural language processing does not apply straightforwardly.

Some research on clustering indexed words, referred to as term clustering in some works, is to a certain degree related to ours [Baker and McCallum 1998; Dhillon et al. 2002; Slonim and Tishby 2000]. Most of these works dealt with automatically clustering indexed words and used the clusters to assist information retrieval systems in improving classification accuracy or retrieval precision and recall. Such word clustering usually exists as an accompaniment to document clustering or classification; the construction of topic hierarchies for words was not the main subject of those investigations. In addition, text segments are not necessarily associated with documents, which makes text-segment clustering significantly different from traditional word clustering.

In recent years, the great diversity of Web users' search interests has led researchers to investigate the problem of query clustering, which is related to our problem because a query is one type of text segment. Beeferman and Berger [2000] proposed a query clustering method based on "click-through data," a collection of user transactions with queries and their corresponding clicked URLs. Wen et al. [2002] developed a similar method that combined the indexed terms of the clicked pages to estimate the similarity between queries. However, the number of distinct URLs is often huge in a Web search service, which causes many similar queries not to be grouped together because of a lack of common clicked URLs. Besides, all the above works only considered flat, un-nested clustering of queries. Flat clustering of queries is appropriate for term-suggestion routines or for finding relevant queries; however, to enable humans to browse and do deeper analysis, hierarchical organization seems more feasible, especially when dealing with large numbers of queries. In order to have adequate information to characterize the intended search interests of most users' queries, Chuang and Chien [2002] exploited the search-result snippets retrieved by queries to estimate the similarity between queries and designed an effective algorithm for the hierarchical clustering of queries. This paper follows that previous work and extends the approach, with a more refined hierarchical clustering algorithm and some novel auxiliary processing techniques, to a broader set of text segments than just queries.

9. CONCLUDING REMARKS

This paper has proposed a practical Web-based approach to organizing text segments into a hierarchical structure of topic classes. Although the hierarchical clustering of text segments is in essence very difficult, the Web provides an alternative way to deal with this problem. With the huge amounts of available documents on the Web indexed by real-world search engines, most text segments can obtain adequate topic-relevant contextual information. Hence, the approach is designed to be combined with the search processes of real-world search engines, extracting features from the retrieved highly ranked search-result snippets for each text segment.


Also, a clustering algorithm for generating a natural multi-way-tree cluster hierarchy has been developed. The algorithm is a hierarchical agglomerative clustering algorithm that first generates a binary-tree hierarchy, followed by a hierarchical cluster partitioning technique that turns it into a multi-way-tree hierarchy. The partitioning technique is based on a min-max objective function measuring the quality of clusters, combined with a heuristically acceptable preference function on the number of clusters, to ensure that the produced hierarchy is natural for humans. Extensive experiments were conducted on different domains of text segments, including subject terms, people names, paper titles, and natural language questions. The obtained experimental results have shown the feasibility of our approach, which we believe useful in various Web information applications. Future work will investigate the applicability of our approach to more types of text segments; for example, dealing with polysemous text segments is not well explored at the current stage of our study. In addition, a more sophisticated cluster naming technique is another pressing need, in order to provide users with a more comprehensible resulting topic hierarchy.

ACKNOWLEDGMENTS

The authors would like to thank the associate editor and the anonymous reviewers. Their valuable comments and suggestions greatly improved the quality of this paper.


Received May 2003; revised February 2004; accepted February 2005
