exploring the use of concept spaces to improve medical information retrieval · 2011. 5. 15. ·...

17
Ž . Decision Support Systems 30 2000 171–186 www.elsevier.comrlocaterdsw Exploring the use of concept spaces to improve medical information retrieval Andrea L. Houston a, ) , Hsinchun Chen b,1 , Bruce R. Schatz c,2 , Susan M. Hubbard d,3 , Robin R. Sewell e,4 , Tobun D. Ng f,5 a ISDS Department, College of Business Administration, Louisiana State UniÕersity, 3178B4 CEBA, Baton Rouge, LA 70803, USA b Management Information Systems Department, UniÕersity of Arizona, Tucson, AZ 85721, USA c ( ) Community Architecture for Network Information Systems CANIS Laboratory, UniÕersity of Illinois, Urbana–Champaign, Urbana, IL 61820, USA d Director of the International Cancer Information Center, National Cancer Institute, Bethesda, MD 20852, USA e School of Library Science, UniÕersity of Arizona, Tucson, AZ 85721, USA f School of Computer Science, Carnegie Mellon UniÕersity, Pittsburgh, PA 15213, USA Abstract This research investigated the application of techniques successfully used in previous information retrieval research, to the more challenging area of medical informatics. It was performed on a biomedical document collection testbed, Ž . CANCERLIT, provided by the National Cancer Institute NCI , which contains information on all types of cancer therapy. The quality or usefulness of terms suggested by three different thesauri, one based on MeSH terms, one based solely on Ž . terms from the document collection, and one based on the Unified Medical Language System UMLS Metathesaurus, was explored with the ultimate goal of improving CANCERLIT information search and retrieval. Researchers affiliated with the University of Arizona Cancer Center evaluated lists of related terms suggested by different thesauri for 12 different directed searches in the CANCERLIT testbed. The preliminary results indicated that among the thesauri, there were no statistically significant differences in either term recall or precision. Surprisingly, there was almost no overlap of relevant terms suggested by the different thesauri for a given search. This suggests that recall could be significantly improved by using a combined thesaurus approach. q 2000 Elsevier Science B.V. All rights reserved. Keywords: Information retrieval; Medical informatics; Medical information retrieval; Concept space; MeSH terms; UMLS Metathesaurus ) Corresponding author. Tel.: q 1-225-578-2503; fax: q 1-225- 578-2511. Ž . E-mail addresses: [email protected] A.L. Houston , Ž . Ž [email protected] H. Chen , [email protected] B.R. . Ž . Schatz , [email protected] S.M. Hubbard , [email protected] Ž . Ž . R.R. Sewell , [email protected] T.D. Ng . 1 Tel.: q 1-520-621-4153. 2 Tel.: q 1-217-244-0651. 3 Tel.: q 1-301-496-9096. 4 Tel.: q 1-520-621-2748. 5 Tel.: q 1-412-268-4499. 1. Introduction Medicine is a dynamic field incorporating numer- ous specialties, each with its own preferred terminol- ogy. This diversity of vocabularies can be an obsta- cle for medical professionals requiring access to w x current medical information 16 . While advances in medical database technology have improved infor- mation accessibility, retrieval speed, and searching flexibility, they have not resolved the problems of 0167-9236r00r$ - see front matter q 2000 Elsevier Science B.V. All rights reserved. Ž . PII: S0167-9236 00 00097-X

Upload: others

Post on 09-Oct-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Exploring the use of concept spaces to improve medical information retrieval · 2011. 5. 15. · Decision Support Systems 30 2000 171–186 . Exploring the use of concept spaces to

Ž .Decision Support Systems 30 2000 171–186www.elsevier.comrlocaterdsw

Exploring the use of concept spaces to improve medicalinformation retrieval

Andrea L. Houston a,), Hsinchun Chen b,1, Bruce R. Schatz c,2, Susan M. Hubbard d,3,Robin R. Sewell e,4, Tobun D. Ng f,5

a ISDS Department, College of Business Administration, Louisiana State UniÕersity, 3178B4 CEBA, Baton Rouge, LA 70803, USAb Management Information Systems Department, UniÕersity of Arizona, Tucson, AZ 85721, USA

c ( )Community Architecture for Network Information Systems CANIS Laboratory, UniÕersity of Illinois, Urbana–Champaign,Urbana, IL 61820, USA

d Director of the International Cancer Information Center, National Cancer Institute, Bethesda, MD 20852, USAe School of Library Science, UniÕersity of Arizona, Tucson, AZ 85721, USA

f School of Computer Science, Carnegie Mellon UniÕersity, Pittsburgh, PA 15213, USA

Abstract

This research investigated the application of techniques successfully used in previous information retrieval research, tothe more challenging area of medical informatics. It was performed on a biomedical document collection testbed,

Ž .CANCERLIT, provided by the National Cancer Institute NCI , which contains information on all types of cancer therapy.The quality or usefulness of terms suggested by three different thesauri, one based on MeSH terms, one based solely on

Ž .terms from the document collection, and one based on the Unified Medical Language System UMLS Metathesaurus, wasexplored with the ultimate goal of improving CANCERLIT information search and retrieval.

Researchers affiliated with the University of Arizona Cancer Center evaluated lists of related terms suggested by differentthesauri for 12 different directed searches in the CANCERLIT testbed. The preliminary results indicated that among thethesauri, there were no statistically significant differences in either term recall or precision. Surprisingly, there was almost nooverlap of relevant terms suggested by the different thesauri for a given search. This suggests that recall could besignificantly improved by using a combined thesaurus approach. q 2000 Elsevier Science B.V. All rights reserved.

Keywords: Information retrieval; Medical informatics; Medical information retrieval; Concept space; MeSH terms; UMLS Metathesaurus

) Corresponding author. Tel.: q1-225-578-2503; fax: q1-225-578-2511.

Ž .E-mail addresses: [email protected] A.L. Houston ,Ž . Ž[email protected] H. Chen , [email protected] B.R.

. Ž .Schatz , [email protected] S.M. Hubbard , [email protected]Ž . Ž .R.R. Sewell , [email protected] T.D. Ng .

1 Tel.: q1-520-621-4153.2 Tel.: q1-217-244-0651.3 Tel.: q1-301-496-9096.4 Tel.: q1-520-621-2748.5 Tel.: q1-412-268-4499.

1. Introduction

Medicine is a dynamic field incorporating numer-ous specialties, each with its own preferred terminol-ogy. This diversity of vocabularies can be an obsta-cle for medical professionals requiring access to

w xcurrent medical information 16 . While advances inmedical database technology have improved infor-mation accessibility, retrieval speed, and searchingflexibility, they have not resolved the problems of

0167-9236r00r$ - see front matter q 2000 Elsevier Science B.V. All rights reserved.Ž .PII: S0167-9236 00 00097-X

Page 2: Exploring the use of concept spaces to improve medical information retrieval · 2011. 5. 15. · Decision Support Systems 30 2000 171–186 . Exploring the use of concept spaces to

Report Documentation Page Form ApprovedOMB No. 0704-0188

Public reporting burden for the collection of information is estimated to average 1 hour per response, including the time for reviewing instructions, searching existing data sources, gathering andmaintaining the data needed, and completing and reviewing the collection of information. Send comments regarding this burden estimate or any other aspect of this collection of information,including suggestions for reducing this burden, to Washington Headquarters Services, Directorate for Information Operations and Reports, 1215 Jefferson Davis Highway, Suite 1204, ArlingtonVA 22202-4302. Respondents should be aware that notwithstanding any other provision of law, no person shall be subject to a penalty for failing to comply with a collection of information if itdoes not display a currently valid OMB control number.

1. REPORT DATE 2000 2. REPORT TYPE

3. DATES COVERED 00-00-2000 to 00-00-2000

4. TITLE AND SUBTITLE Exploring the use of concept spaces to improve medical information retrieval

5a. CONTRACT NUMBER

5b. GRANT NUMBER

5c. PROGRAM ELEMENT NUMBER

6. AUTHOR(S) 5d. PROJECT NUMBER

5e. TASK NUMBER

5f. WORK UNIT NUMBER

7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES) University of Arizona ,Management Information Systems Department ,Tucson,AZ,85721

8. PERFORMING ORGANIZATIONREPORT NUMBER

9. SPONSORING/MONITORING AGENCY NAME(S) AND ADDRESS(ES) 10. SPONSOR/MONITOR’S ACRONYM(S)

11. SPONSOR/MONITOR’S REPORT NUMBER(S)

12. DISTRIBUTION/AVAILABILITY STATEMENT Approved for public release; distribution unlimited

13. SUPPLEMENTARY NOTES

14. ABSTRACT This research investigated the application of techniques successfully used in previous information retrievalresearch, to the more challenging area of medical informatics. It was performed on a biomedical documentcollection testbed CANCERLIT, provided by the National Cancer Institute NCI., which containsinformation on all types of cancer therapy. The quality or usefulness of terms suggested by three differentthesauri, one based on MeSH terms, one based solely on terms from the document collection, and onebased on the Unified Medical Language System UMLS.Metathesaurus, was explored with the ultimate goalof improving CANCERLIT information search and retrieval. Researchers affiliated with the University ofArizona Cancer Center evaluated lists of related terms suggested by different thesauri for 12 differentdirected searches in the CANCERLIT testbed. The preliminary results indicated that among the thesauri,there were no statistically significant differences in either term recall or precision. Surprisingly, there wasalmost no overlap of relevant terms suggested by the different thesauri for a given search. This suggeststhat recall could be significantly improved by using a combined thesaurus approach.

15. SUBJECT TERMS

16. SECURITY CLASSIFICATION OF: 17. LIMITATION OF ABSTRACT Same as

Report (SAR)

18. NUMBEROF PAGES

16

19a. NAME OFRESPONSIBLE PERSON

a. REPORT unclassified

b. ABSTRACT unclassified

c. THIS PAGE unclassified

Standard Form 298 (Rev. 8-98) Prescribed by ANSI Std Z39-18

Page 3: Exploring the use of concept spaces to improve medical information retrieval · 2011. 5. 15. · Decision Support Systems 30 2000 171–186 . Exploring the use of concept spaces to

( )A.L. Houston et al.rDecision Support Systems 30 2000 171–186172

vocabulary differences among biomedical specialties,variations in indexing and classification systems, norvariations in information accessing systems. Medicalinformation retrieval, as a specialized case of infor-mation retrieval, is subject to classic informationretrieval problems such as: Ainformation overloadBw x Ž w x.3 , the Avocabulary problemB semantic barrier 17 ,synonymy and polysemy. It also has some interesting

w xproblems of its own 12,14,15 .The community of medical information users is

extremely varied in its level of biomedical expertise,its familiarity with various biomedical indexing vo-cabularies and its information usage requirements.For example, biomedical expertise ranges from pa-tients and families encountering terms for the firsttime, to specialists in focused research areas who areconsidered experts. Compounding this problem is thefact that there is no single commonly acceptedbiomedical indexing vocabulary. This lack of aninformation standard and the existence of thousandsof different medical databases containing informa-tion that can be formatted, indexed and stored in avariety of different ways make it difficult, if notimpossible, to locate and exchange medical informa-tion. Users requiring information from a variety ofmedical sources may have to learn several differentinformation retrieval systems and several differentindexing vocabularies to locate the information theyneed.

Depending on medical information usage require-ments, the goals of indexing vocabularies may con-flict. For example, biomedical research information,databases of clinical studies or drug trials, and medi-cal insurance databases all need to have data orga-

Ž .nized or summarized by categories generalization .However, primary care professionals dealing withindividual patient records require a detailed, preciseand expressive vocabulary that can accurately de-

w xscribe patient information 11,13 . Patient recordscan be a composite of every potential data formatŽ .numeric, free text, tables, graphs, images and audio .Patient record information systems, therefore, require

Ža standard vocabulary that can specialize the direct.opposite of generalization and accommodate a mas-

sive quantity of highly variable and volatile informa-tion, thereby increasing medical information systemchallenges.

We are investigating improving medical informa-tion retrieval by building on techniques successfully

Žapplied to other information retrieval domains e.g.WormrFly Genome, the Internet, and a large scien-

.tific abstract collection . Previous research demon-strated that the creation of automatically generated

Ž .concept spaces thesauri is an efficient, effectivetechnique to improve document precision and recallin directed searches of large information spaces. The

w xWormrFly genome research 5 indicated that acombined thesaurus approach could improve recallwithout sacrificing precision. Currently, we are in-vestigating augmenting automatically generated con-cept space terms with terms from existing medical

Ž .thesauri: Medical Subject Headings MeSH and theŽ .Unified Medical Language System UMLS

Metathesaurus, both developed and maintained byŽ .the National Library of Medicine NLM .

2. Literature review

There are four major approaches to textual medi-cal information retrieval. They are: keyword index-

Ž .ing and retrieval traditional method , statisticallyŽ .based methods Salton-based syntactic techniques ,

Žrelevance feedback using searcher feedback to im-. Žprove future searches , and semantic methods in-

cluding extensions to Salton’s techniques and Natu-.ral Language Processing .

2.1. Human indexing and keyword search

The most common example of the keyword ap-Žproach is human indexing using a standard set of

pre-determined subject terms that domain experts.assign to documents . This approach relies entirely

on indexer expertise in both subject domain andw xstandard indexing vocabularies. Previous research 2

demonstrated that different, well-trained indexers of-ten assign different terms to the same documentŽ .synonymy and that an indexer may use differentterms for the same document at different times.Meanwhile, different users tend to use different terms

Ž .to seek identical information polysemy . Furnas etw xal. 10 , showed that the probability of two people

using the same term to classify an object was lessthan 20%.

Page 4: Exploring the use of concept spaces to improve medical information retrieval · 2011. 5. 15. · Decision Support Systems 30 2000 171–186 . Exploring the use of concept spaces to

( )A.L. Houston et al.rDecision Support Systems 30 2000 171–186 173

Because of these discrepancies, an exact matchbetween searcher terms and indexer terms is un-likely, resulting in poor document recall and preci-sion. Individual keywords are simply not adequatediscriminators of semantic content. Furthermore,manual indexing is too time-consuming for process-ing large volumes of information. Human indexingmethods need to be either supplemented or replacedby more effective and efficient techniques, hence,the major research effort in developing automaticindexing techniques.

2.2. Statistical techniques — Õector space and the-saurus

Other approaches to addressing the vocabularyproblem include using either a thesaurus or a vector

w x Žspace document representation 19 . Thesauri mainlyused to expand users’ queries by translating queryterms into alternative terms that match document

.indexes can be generated manually or automatically.w xSrinivasan 21 presents a medical informatics exam-

ple that evaluates different query expansion strate-gies using a MEDLINE testbed. Most automaticallygenerated thesauri are syntactically based techniquesthat use vector space document representation andword statistical co-occurrence analysis. Many incor-porate other statistical techniques such as clusteranalysis, co-occurrence efficiency analysis, and fac-tor analysis. Relationships are then represented inmathematical matrices.

Most statistical methods concentrate on solvingsynonymy by adding associative terms to keywordindexes. A major disadvantage is that some addedterms have meanings that are different from theintended meanings, resulting in rapid degradation of

w xinformation retrieval precision. Cimino et al. 7 hasan interesting discussion on the problems of auto-mated medical information translation using thesauri.

ŽAutomatically generated neural-like thesauri con-.cept spaces provide an alternative to traditional

thesauri. In a neural knowledge base, conceptsŽ .terms are represented as nodes, and relationships asweighted links. The associative memory feature ofthis thesaurus type allows a new paradigm forknowledge discovery and document searching using

Ž .spreading activation algorithms e.g. Hopfield net .

2.3. ReleÕance feedback

Relevance feedback is a method that automatesthe intellectual process of evaluating the results of aninitial search to improve future searches. It can beused with both vector queries and Boolean searches.

w xSalton and Buckley 20 discuss a variety of proce-dures for relevance feedback and outline three bene-fits: automatic expansion of queries, gradual ad-vancement towards the subject, and selective keyterm emphasis. Their experiments demonstrated that

Žrelevance feedback can be very effective 90% preci-w x.sion 20 and its usefulness has been well-docu-

mented in TIPSTER conferences. The disadvantagesof this technique include the facts that concerns overprocessing speed and information storage outweighthe benefits, and that it cannot improve effectivequeries.

2.4. Semantic approaches

Another approach is to index documents semanti-cally, allowing users to search using conceptualmeanings instead of keywords. Multi-dimensionalsemantic space techniques attempt to enhance infor-

Ž .mation retrieval by placing documents vectors bymeaning in a designated space. The most representa-tive multi-dimensional semantic space techniques are

Ž .Metric Similarity Modeling MSM and Latent Se-Ž .mantic Indexing LSI .

MSM represents both queries and documents withvectors in a multi-dimensional semantic space using

w xtechniques from Multi-Dimensional Scaling 1 . Doc-ument vectors are computed using standard statisticaltechniques and then placed in a multi-dimensionalsemantic space, their positions determined by simi-larity constraints. One disadvantage of MSM is thatit can only be used when external sources exist to

Ždetermine similarity constraints i.e. co-citation anal-ysis, relevance feedback or document classificationinformation — Library of Congress Subject Head-

.ings or Compendex Classification Codes .w xLSI 8 is an optimal method of MSM. It repre-

sents documents, queries and terms as vectors in amatrix determined by multi-dimensional SingularValue Decomposition. LSI takes advantage of thefact that semantic relations exist within a document

Page 5: Exploring the use of concept spaces to improve medical information retrieval · 2011. 5. 15. · Decision Support Systems 30 2000 171–186 . Exploring the use of concept spaces to

( )A.L. Houston et al.rDecision Support Systems 30 2000 171–186174

and attempts to place similar documents close toeach other in a multi-dimensional space. Chute et al.w x6 have applied the technique to medical informat-

w xics. Deerwester et al. 8 have tested LSI on twoŽstandard document collections MED and CISI —

.chosen because relevance judgments already existedwith promising results in both document recall andprecision. LSI proved equal to or better than eithersimple term matching or SMART, and better thanVoorhees’ term disambiguation process. Unfortu-nately, the meanings of these mathematically derivedsemantic techniques are difficult to understand andcomputationally cumbersome. Their usefulness insuggesting meaningful indexing or searching termshas not been validated on a large real-world collec-tion.

2.5. UMLS

In 1986, NLM began developing the UMLS toaddress medical vocabulary problems by Aimprovingthe ability of computer programs to ‘understand’biomedical meaning in user inquiries and then usingthis understanding to retrieve and integrate relevant

w xmachine-readable informationB 16 . The UMLS hasfour components: the Metathesaurus, the SpecialistLexicon, the Semantic Net, and the InformationSources Map. The Metathesaurus is the largest andmost complex component, incorporating 589 000names for 253 000 concepts from more than 30biomedical vocabularies, thesauri, and classification

Žsystems including MeSH, SNOMED, COSTAR,ICD-9CM, and Dorland’s Illustrated Medical Dictio-

.nary, 27th edn. . The Metathesaurus is not intendedto serve as an Aoverarching classification systemB orcontrolled vocabulary, but to facilitate translationand interpretation of biomedical terminology acrossvocabularies.

NLM makes UMLS copies available to resear-chers for the development of biomedically relatedexpert systems, automatic indexing and classificationtools, and tools to index patient records. Research in

Žthis area includes the SAPHIRE project Oregon.Health Sciences University , the Internet Grateful

ŽMed interface for MEDLARS databases COACH. Ž .Browser , the Interactive Query Workstation IQW ,

Žthe InterMed Vocabulary Server project Stanford,.Columbia, Harvard and the University of Utah , and

the Group Health Cooperative of Puget Sound.Metathesaurus use has been shown to significantly

Ž . Žincrease 60–88% document recall rate over MeSH. w xterms 18 .

3. CANCERLIT experiment

ŽCANCERLIT contains bibliographic records pre-.dominantly abstracts from biomedical journals on

research related to cancer biology, etiology, screen-ing, prevention, and treatment published between1963 and today. Approximately 200 core journalsaccount for the majority of the collection. Additionalcitations come from journals, scientific meeting pro-ceedings, books, dissertations, technical reports, andother publications. The National Cancer InstituteŽ .NCI and NLM share processing costs, therefore,many CANCERLIT citations are cross-indexed inMEDLINE. CANCERLIT is updated monthly to en-sure that the most current published cancer research

Ž .results are available see acknowledgments . Thereare more than 1.2 million records in the completecollection, which NCI estimates increases annuallyby approximately 90 000 abstracts. The record for-mat includes the following fields: authors and ad-dresses, MeSH headings indexing the document, andthe document’s source, title, and abstract. More de-tailed information is available from NCI.

3.1. CANCERLIT concept space

Our CANCERLIT testbed contains 2 months ofŽ .CANCERLIT data May and June 1996 . The ap-

proximately 10 000 abstracts take up 40 MB ofmemory, and took roughly 1 h to create on an HP9000 workstation. We have since expanded thetestbed to include the last 5 years of CANCERLIT

Ž .documents seeai.bpa.arizona.edurCancerLit .Two different options were used to create the

prototype CANCERLIT concept space. A MeSH-based CANCERLIT concept space was created usingexisting MeSH terms that index each document, andan Automatic Indexing CANCERLIT concept spacewas generated using automatic indexing techniquesŽword identification, stop wording, stemming, term-

Page 6: Exploring the use of concept spaces to improve medical information retrieval · 2011. 5. 15. · Decision Support Systems 30 2000 171–186 . Exploring the use of concept spaces to

( )A.L. Houston et al.rDecision Support Systems 30 2000 171–186 175

) .phrase formation, and tf idf term weighing . Con-Ž .cept space automatic thesaurus creation is a stan-

dard process that can be applied to a variety ofdifferent kinds of textual information. Users withdifferent levels of expertise have successfully usedthe output in different subject domains. Fig. 1 illus-trates the complete process as it might be applied tomedical knowledge spaces. A brief overview of theprocess is described below.

3.1.1. Document collectionIn any automatic thesaurus building effort, the

first task is to identify the collection of documentsthat will serve as the basis of the thesaurus. We useda 2-month collection of CANCERLIT documents.

3.1.2. Automatic indexingThe purpose of this step is to automatically iden-

tify each document’s content. We used a Salton-based

Ž .Fig. 1. The Medical Knowledge Representation System MKRS Architecture.

Page 7: Exploring the use of concept spaces to improve medical information retrieval · 2011. 5. 15. · Decision Support Systems 30 2000 171–186 . Exploring the use of concept spaces to

( )A.L. Houston et al.rDecision Support Systems 30 2000 171–186176

w xtechnique 19 that identifies document subject de-scriptors and computes descriptor frequency for theentire collection. Then, a stop-word list eliminates

Žnon-content bearing words e.g. AtheB, AaB, AonB,.AinB , and a stemming algorithm identifies the re-

maining word stems. A document term-frequencyrequirement removes AnoiseB.

3.1.3. Co-occurrence analysisŽ .The importance of each descriptor term in repre-

senting document content varies. Using term fre-quency and inverse document frequency, the clusteranalysis step assigns weights to each document termto represent term importance. Term frequency indi-cates how often a particular term occurs in the entire

Žcollection. Inverse document frequency indicating.term specificity allows terms to have different

Ž .strengths importance based on specificity. A termcan be a one-, two-, or three-word phrase. Fig. 2describes the frequency computation. Cluster analy-sis is then used to convert raw data indexes andweights into a matrix indicating term similarityrdis-similarity using a distance computation based on

w xChen and Lynch’s 4 asymmetric ACluster Func-tionB which represents term association better than

Ž .the cosine function see Fig. 3 for more detail . Anet-like concept space of terms and weighted rela-tionships is then created, using the cluster function.

3.1.4. AssociatiÕe retrieÕalThe Hopfield algorithm is ideal for concept-based

information retrieval. Each term in the network-like

thesaurus is treated as a neuron and the asymmetricweight between any two terms is the unidirectional,weighted connection between neurons. With user-supplied terms as input patterns, the Hopfield algo-

Žrithm activates term neighbors strongly associated.terms , combines weights from all associated neigh-Ž .bors by adding collective association strengths , and

Ž .repeats this process until convergence see Fig. 4 .Fig. 5 illustrates a CANCERLIT session. The user

entered the term Abreast cancerB and the prototype’scombined MeSHrAutomatic Index concept spacesuggests a list of related terms. Next, the user selectsthe term AFamily HistoryB from the list of relatedterms, which combines it with the original termAbreast cancerB to narrow the search. The prototypefinds 506 documents that contain the two terms andthe user selects the first document to review.

3.2. Experimental design

The primary goal in this preliminary phase was toevaluate the usefulness of suggested terms from threedifferent thesauri:

v MeSH concept space — a thesaurus based onNLM’s controlled medical information retrievalcontrolled vocabulary: MeSH.

v Auto Index concept space — our automaticallygenerated thesaurus based exclusively on termscontained in the collection’s documents.

v Internet Grateful Med — the most commonlycited on-line tool based on the UMLS Metathe-saurus.

Fig. 2. Frequency computation.

Page 8: Exploring the use of concept spaces to improve medical information retrieval · 2011. 5. 15. · Decision Support Systems 30 2000 171–186 . Exploring the use of concept spaces to

( )A.L. Houston et al.rDecision Support Systems 30 2000 171–186 177

Fig. 3. Cluster analysis computations.

Our subjects were five cancer researchers affili-ated with the University of Arizona Cancer Center,and a veterinarian. Phase one of the experimentinvolved evaluating term usefulness during adirected search of the CANCERLIT testbed. Twelvesearches were performed on each of the three the-

Ž .sauri 36 total searches . First, we demonstrated eachthesaurus using a subject-provided term. Then, eachsubject was asked to provide one or two terms to

begin a document search, and to suggest five relatedterms for each search term.

During phase two, a subject-supplied search termwas entered into a thesaurus and subjects evaluated

Žthe top 40 thesaurus-suggested terms ArelevantB or.Anot relevantB . This step was repeated for the other

two thesauri. The entire process was then repeatedŽ .for other subject-supplied term s . The order in which

we searched the thesauri was pre-assigned by subject

Fig. 4. Hopfield net algorithm.

Page 9: Exploring the use of concept spaces to improve medical information retrieval · 2011. 5. 15. · Decision Support Systems 30 2000 171–186 . Exploring the use of concept spaces to

( )A.L. Houston et al.rDecision Support Systems 30 2000 171–186178

Ž . Ž .Fig. 5. CANCERLIT session: User entered Abreast cancerB in 1 . CANCERLIT suggests related terms in 2 . User selects AFamilyŽ . Ž .HistoryB. System locates 506 documents related to Abreast cancerB and AFamily HistoryB in 3 . User chooses to read first document in 4 .

number in a random fashion. We recorded the verbalprotocols of the searching and evaluation process.

Subjects were allowed to access documents linked tothe thesaurus-suggested terms, but we did not evalu-

Page 10: Exploring the use of concept spaces to improve medical information retrieval · 2011. 5. 15. · Decision Support Systems 30 2000 171–186 . Exploring the use of concept spaces to

( )A.L. Houston et al.rDecision Support Systems 30 2000 171–186 179

ate that part of the search. Next, we solicited feed-back on the usefulness of the three thesauri, theusers’ searching experiences with CANCERLIT,MEDLINE, and Internet Grateful Med, and our userinterface. Precision and recall for phase two wascomputed.

3.3. Precision and recall

Precision and recall were calculated as follows:Ž . Ž .1 Avery relevantB terms — 1 point; 2 Apossibly

Ž .relevantB terms — 0.5 points; and 3 Anot relevantBterms — zero points. For each search, all relevant

Žterms from each thesaurus were combined eliminat-.ing duplicates for a total relevant score. Term recall

for each thesaurus was calculated by dividing therelevant score for that thesaurus by the total relevantscore for all thesauri. Term precision for each the-saurus was calculated by dividing the total relevantscore for the thesaurus by the total number of termssuggested by the thesaurus.

Fig. 6 illustrates Minitab’s one-way ANOVA testfor term recall for each thesaurus, and for the variousthesauri combinations. Fig. 7 illustrates the same

information for term precision. There were no statis-tically significant differences among the three the-sauri in term recall or precision, indicating that termssuggested by our tool are comparable to terms sug-

Žgested by Internet Grateful Med UMLS Metathe-.saurus-based and MeSH indexing terms. Based on

subject qualitative feedback, we believe this is par-Žtially due to the prototype’s size 2 months worth of

.data vs. the entire MEDLINE collection . We werepleased that our tool could perform at a comparablelevel based on such limited input.

An interesting result from this research is the lackof duplicate relevant terms suggested by the threedifferent thesauri. In previous research, the mostrelevant terms typically were suggested by all the-sauri. We were surprised at the lack of term overlapŽfor example, four searches had no overlappingterms, and four searches had only one overlapping

.term , which suggests that a combined approachwould probably be more useful to searchers. Subjectsconfirmed this in their verbal feedback, and there is

w xsupporting evidence in the literature 5,9,21 . Ourdata also support the literature’s premise that the-saurus combination can increase recall without sacri-ficing precision.

Fig. 6. Recall comparison by term source.

Page 11: Exploring the use of concept spaces to improve medical information retrieval · 2011. 5. 15. · Decision Support Systems 30 2000 171–186 . Exploring the use of concept spaces to

( )A.L. Houston et al.rDecision Support Systems 30 2000 171–186180

Fig. 7. Precision comparison by term source.

In general, our subjects liked the Automatic In-dexing concept space best. They subjectively felt thatit came up with the most interesting and most rele-vant terms the majority of the time. Instances when it

Ž .did not i.e. AWiskott–Aldrich syndromeB were ex-Ž .plained by the subjects as follows: AThat is a very

specific and narrow topic. It is likely that it wasn’tmentioned in just 2 months of CANCERLIT ab-stracts, which is why your system can’t find it.BSubjects were very impressed by the quality of ourconcept space based on only 2 months of data, andmost of them requested that we contact them whenthe larger collection’s concept spaces become avail-able.

3.4. QualitatiÕe eÕaluation

Figs. 8–10 illustrate a search using the subject-Ž .supplied term AapoptosisB a type of cell death for

each of the three thesauri. The MeSH thesaurussuggested 40 related terms, of which, 10 terms were

Ž .considered useful two extremely useful , five mod-Žerately useful, and 25 not useful eight were too

.general . The Automatic Indexing thesaurus also

suggested 40 terms, of which 26 terms were ratedŽ .useful three extremely useful , three moderately

Ž .useful, and 10 not useful five were too general .ŽInternet Grateful Med suggested nine terms one

.useful, one moderately useful and seven not useful .ŽThere were three duplicate terms all duplication

occurred between the MeSH and Automatic Indexing.thesauri .

Most of our subjects were familiar with MeSHterms and some had previously used Internet Grate-ful Med or MEDLINE. Currently, most of them useOVID for their reference searching. One subjectsuggested extending the concept space to include allof MEDLINE instead of restricting it to CANCER-LIT. Another subject, who had spent time at NIH,and had extensive experience with both InternetGrateful Med and MeSH terms, suggested that aMeSH-based thesaurusrAutomatic Indexing conceptspace combination would be more effective. Weshowed him a combined MeSHrAutomatic Indexingconcept space and told him that future plans includeincorporating the UMLS Metathesaurus.

Interestingly, we had difficulty getting subjects tosuggest five specific relevant terms before the search

Page 12: Exploring the use of concept spaces to improve medical information retrieval · 2011. 5. 15. · Decision Support Systems 30 2000 171–186 . Exploring the use of concept spaces to

( )A.L. Houston et al.rDecision Support Systems 30 2000 171–186 181

Fig. 8. MeSH only concept space: terms related to apoptosis.

Page 13: Exploring the use of concept spaces to improve medical information retrieval · 2011. 5. 15. · Decision Support Systems 30 2000 171–186 . Exploring the use of concept spaces to

( )A.L. Houston et al.rDecision Support Systems 30 2000 171–186182

Fig. 9. Automatic Indexing only concept space: terms related to apoptosis.

Page 14: Exploring the use of concept spaces to improve medical information retrieval · 2011. 5. 15. · Decision Support Systems 30 2000 171–186 . Exploring the use of concept spaces to

( )A.L. Houston et al.rDecision Support Systems 30 2000 171–186 183

Fig. 10. Internet Grateful Med: terms related to apoptosis.

process began. Subjects were more comfortable sug-Žgesting categories of information e.g. related drugs,

. Žtreatment regimes as opposed to specific terms e.g..a specific drug or treatment . Later, during thesauri-

suggested term evaluation, subjects frequently said,AThat term is not identical to the one that I sug-

Žgested, but it means the same thing.B Nicely illus-.trating synonymy .

4. Conclusions and future directions

Different users with different goals approach largeinformation spaces in different ways. We focused onmedical researchers and a highly technical,research-based biomedical document collectionŽ .CANCERLIT . This type of medical informationuser is a very technical, extremely focused expertwho is intimately familiar with a particular section ofthe information space. Our subjects were interestedin very narrow, directed searches. Due to their busyschedules, they had no interest in browsing or ex-ploring the collection. Based on their qualitativefeedback, an automated thesaurus or concept spaceapproach to indexing the CANCERLIT collectionwas preferred for information retrieval over the useof currently existing biomedical thesauri. We feelthat this result is consistent with other research onconcept space use in information retrieval.

We were especially encouraged by the precisionand recall performance of the Automatic Indexing

Žthesaurus no statistical differences between it and.the other two existing biomedical thesauri , since it

was based on a very limited number of CANCER-Ž .LIT documents only 2 months’ worth of data . We

believe it will be significantly better when a largerset of documents serves as its basis.

For this type of user, already familiar withbiomedical terminology, a combined concept spacethat augments automatic indexing terms with termsfrom existing biomedical thesauri could potentiallyimprove information retrieval. To this end, we are inthe process of creating a set of concept spaces for theCANCERLIT collection that include MeSH termsand UMLS terms. Future plans may include incorpo-rating the UMLS Semantic Network. An importantadvantage of including the UMLS information is thatwe may be able to use it to address the generaliza-tionrspecialization criticism of statistical techniques.Statistical techniques do not take into account theterm’s part of speech or level of abstraction becauseterms are analyzed statistically and syntactically, notsemantically. The UMLS products capture aparentrchild relationship between concepts and wemay be able to use this feature to generalize and toorganize terms by level of abstraction.

Other important future enhancements would be toallow the searcher to select what concept space touse and to allow a searcher to dynamically add termsof interest to the thesaurus for future indexing andretrieval. Ideally, future interfaces will allow subjectsto interactively weigh both individual terms and termsource to improve their searches.

Another common criticism of the concept spacetechnique is that because it is syntactic, not seman-tic, it ‘analyzes’ terms Aout of contextB. To address

Page 15: Exploring the use of concept spaces to improve medical information retrieval · 2011. 5. 15. · Decision Support Systems 30 2000 171–186 . Exploring the use of concept spaces to

( )A.L. Houston et al.rDecision Support Systems 30 2000 171–186184

this concern we are investigating incorporating aNatural Language parser at the front-end of ourconcept space analysis, allowing term analysis in thecontext of the noun or verb phrase in which theyoccur. We are currently investigating only nounphrase parsing. Qualitative feedback indicated thatprecise terms were especially important to manymedical information users including primary care

Ž .professionals e.g. doctors, nurses and medical re-search specialists. We believe that the precision andquality of our terms can be improved using NaturalLanguage Processing techniques, which identify keynoun phrases in documents. An Arizona NounPhraser has been developed and implemented against

Žboth of our biomedical document testbeds TOXLINE.from NLM and CANCERLIT from NCI . Future

research will involve investigating the impacts ofArizona Noun Phraser usage on usability and infor-mation retrieval quality.

Novice users and others unfamiliar with the CAN-CERLIT collection andror biomedical terms mayprefer to browse or explore as opposed to performingnarrow directed searches. It is likely that this type of

Žuser will prefer other types of tools for example, a.Kohonen-based category map over concept spaces

and existing biomedical thesauri. Future researchwith the CANCERLIT collection will need to in-clude a larger and more varied group of subjects andinformation retrieval tools.

Finally, medical information already contains bothstatic and moving images. Several image indexingand retrieval techniques have been applied to medi-cal image databases. Indeed any complete medicalinformatics system must address image indexing andretrieval, which are especially important to peoplewho use patient record medical information. Our labis currently investigating image indexing and re-

Ž .trieval using visual thesauri and visual SOMs andimage similarity analysis on a GIS collection. Futureresearch on medical information retrieval, in particu-lar, patient record medical information, will includecombining image indexing and textual informationindexing and retrieval techniques.

The results from our current experimentation onthe CANCERLIT and TOXLINE testbeds are pre-liminary, but encouraging. In our ongoing effort inthe Illinois Digital Library Initiative project, we arein the process of fine-tuning these techniques and

exploring other general-purpose artificial intelligenceand mathematical pattern analysis techniques for var-ious digital library and medical information retrievaland analysis applications.

Acknowledgements

This project was funded primarily by a grant fromthe NCI AInformation Analysis and Visualization for

Ž .Cancer LiteratureB 1996–1997 , a grant from theNLM, ASemantic Retrieval for Toxicology and Haz-

Ž .ardous Substance DatabasesB 1996–1997 , anNSFrCISE AIntelligent Internet Categorization and

Ž .SearchB project 1995–1998 , and the NSFrARPArNASA Illinois Digital Library Initiative project,

Ž .ABuilding the InterspaceB 1994–1998 .We would like to thank the following individuals

for their generous donation of time and theirthoughtful evaluations and suggestions: Dr. MargaretBriel, Dr. Sherri Chow, Dr. Bernie Futcher, Dr. JesseMartinez, and Dr. Luke Whitecell.

The CANCERLIT collection is available throughmany sources including:

w x Ž .1. CANCERLIT database online Bethesda MD :NCI, 1997. Available through the NLM, Bethesda,MD, and

w x2. CANCERLIT Internet URL www.ovid.com.Available through Ovid Technologies, New York,NY.

References

w x1 B.T. Bartell, G.W. Cottrell, R.K. Belew, Representing docu-ments using an explicit model of their similarities, Journal of

Ž . Ž .the American Society for Information Science 46 4 1995254–271.

w x2 M.J. Bates, Subject access in online catalogs: A designmodel, Journal of the American Society for Information

Ž . Ž .Science 37 6 1986 357–376, November.w x3 D.C. Blair, M.E. Maron, An evaluation of retrieval effective-

ness for a full-text document-retrieval system, Communica-Ž . Ž .tions of the ACM 28 3 1985 289–299.

w x4 H. Chen, K.J. Lynch, Automatic construction of networks ofconcepts characterizing document databases, IEEE Transac-

Page 16: Exploring the use of concept spaces to improve medical information retrieval · 2011. 5. 15. · Decision Support Systems 30 2000 171–186 . Exploring the use of concept spaces to

( )A.L. Houston et al.rDecision Support Systems 30 2000 171–186 185

Ž . Ž .tions of Systems, Man and Cybernetics 22 5 1992 885–902, SeptemberrOctober.

w x5 H. Chen, J. Martinez, D.T. Ng, B.R. Schatz, A concept spaceapproach to addressing the vocabulary problem in scientificinformation retrieval: An experiment of the Worm Commu-nity System, Journal of the American Society for Information

Ž . Ž .Science 48 1 1997 157–170, February.w x6 G.C. Chute, Y. Yang, D.A. Evans, Latent semantic indexing

of medical diagnoses using UMLS semantic structures, Pro-ceedings of the 15th SCAMC, Washington, DC, 1991, pp.185–189.

w x7 J.J. Cimino, S.B. Johnson, P. Peng, A. Aguirre, From ICD9-CM to MeSH using the UMLS: A how-to guide, In Proceed-ings — the Annual Symposium on Computer Applications inMedical Care, 1994, pp. 730–734.

w x8 S. Deerwester, S.T. Dumais, G.W. Furnas, T.K. Landauer, R.Harshman, Indexing by latent semantic analysis, Journal of

Ž . Ž .the American Society for Information Science 41 6 1990391–407, September.

w x9 F.C. Ekmekcioglu, A.M. Robertson, P. Willett, Effectivenessof query expansion in ranked-output document retrieval sys-

Ž .tems, Journal of Information Science 18 1992 139–147.w x10 G.W. Furnas, T.K. Landauer, L.M. Gomez, S.T. Dumais,

The vocabulary problem in human–system communication,Ž . Ž .Communications of the ACM 30 11 1987 964–971,

November.w x11 P.N. Gorman, Information needs of physicians, Journal of the

Ž . Ž .American Society for Information Science 46 10 1995729–736.

w x12 J.N. Guidi, E.A. Fox, Information retrieval and genomics —Ž .An introduction, Computers in Biology and Medicine 26 3

Ž .1996 179, May.w x13 W.R. Hersh, The electronic medical record: Promises and

problems, Journal of the American Society for InformationŽ . Ž .Science 46 10 1995 772–776.

w x14 W.R. Hersh, Information Retrieval: A Health Care Perspec-tive, Springer, New York, NY, 1995.

w x15 S.M. Huff, J.J. Cimino, Medical data dictionaries and theiruse in medical information system development, in: U.

Ž .Prokosch, J. Dudeck Eds. , Hospital Information Systems:Design and Development Characteristics; Impact and FutureArchitecture, Elsevier, 1995, pp. 53–75.

w x16 D.A.B. Lindberg, B.L. Humphreys, A.T. McCray, The Uni-fied Medical Language System, Methods of Information in

Ž .Medicine 32 1993 281–291.w x17 S. Nadis, Computation cracks semantic barrier between

Ž .databases, Science 272 1996 1419, 7 June.w x18 P.W. Richwine, R. Lilly, A study of MeSH and UMLS for

subject searching in an online catalog, Bulletin of the Medi-Ž . Ž .cal Library Association 81 2 1993 229–233, April.

w x19 G. Salton, Automatic Text Processing, Addison-Wesley Pub-lishing, Reading, MA, 1989.

w x20 G. Salton, C. Buckley, Improving retrieval performance by

relevance feedback, Journal of the American Society forŽ . Ž .Information Science 41 4 1990 288–297.

w x21 P. Srinivasan, Optimal document-indexing vocabulary forŽ .MEDLINE, Information Processing & Management 32 4

Ž .1996 431–443.

Dr. Andrea Houston is an Assistant Professor in the InformationSystems and Decision Sciences Department at Louisiana StateUniversity. She received her PhD from the Department of Man-agement Information Systems at The University of Arizona in

Ž .1998, her MBA from the University of New Hampshire 1981Ž .and her BA 1976 from the University of Pennsylvania. She

worked in industry as a project leader and systems analyst forover 15 years before returning to academics. Her research interestsinclude looking at information retrieval issues, medical informat-ics, digital libraries and electronic publishing, human factors inHumanrComputer Interaction, and Natural Language Processing.She is also interested in software team development and organiza-tional memory. She is a member of ACM, IEEE, AIS and ASIS.

Dr. Hsinchun Chen is the McClelland Professor of ManagementInformation Systems at the University of Arizona and head of theUArMIS Artificial Intelligence Lab. He received the PhD degreein Information Systems from New York University in 1989. He isauthor of more than 70 articles covering semantic retrieval, searchalgorithms, knowledge discovery, and collaborative computing inleading information technology publications. He serves on theeditorial board of Journal of the American Society for InformationScience and Decision Support Systems. He is an expert in digitallibrary and knowledge management research whose work hasbeen featured in various scientific and information technologiespublications including Science, Business Week, NCSA AccessMagazine, WEBster, and HPCWire.

Susan M Hubbard, RN, is the director of the International CancerŽ .Information Center since 1984 and an Associate Director of the

National Cancer Institute. She received a BS from the HonorsCollege of the University of Connecticut and a Master’s in Public

Ž .Administration from the American University 1993 . As theDirector of NCI’s International Cancer Information Center, shedirects the NCI’s efforts to identify, implement, and evaluatestate-of-the-art communication technologies to maximize the po-tential impact of advances in cancer research on health care. Shedirected the development of PDQ, NCI’s primary mechanism forcommunicating state-of-the-art information about cancer. The De-

Ž .partment of Health and Human Services DHHS designated ICICas a Areinvention laboratoryB in 1994 under Vice President Albert

Ž .Gore’s National Performance Review NPR . The InternationalCancer Information Center has also received the NPR AHammerAwardB and the Department of Health and Human Service’sContinuous Improvement Program Award. Ms. Hubbard has pub-lished extensively in nursing and medical journals and texts.

Page 17: Exploring the use of concept spaces to improve medical information retrieval · 2011. 5. 15. · Decision Support Systems 30 2000 171–186 . Exploring the use of concept spaces to

( )A.L. Houston et al.rDecision Support Systems 30 2000 171–186186

Bruce R. Schatz is Director of the Community Architecture forŽ .Network Information SYSTEMS CANIS Laboratory at the Uni-

versity of Illinois at Urbana–Champaign. He is the PrincipalInvestigator of the Digital Libraries Initiative project and theDARPA Information Management Program, which performs re-search in information systems building analysis environments to

Ž .support community repositories Interspace , and in informationscience performing large-scale experiments in semantic retrievalfor vocabulary switching. Dr. Schatz holds faculty appointmentsin Library and Information Science, Computer Science, Neuro-science, and Health Information Sciences. He is also a SeniorResearch Scientist at the National Center for Supercomputing

Ž .Applications NCSA , serving as the scientific advisor for digitallibraries and information systems. He has served in this role since1989, including the period during which NCSA developed Mo-saic.

Robin Sewell received her Doctor of Veterinary Medicine degreeŽ .from Washington State University 1986 and her MLA from the

Ž .University of Arizona 1997 . She currently serves as the AI Lab’sProgram Coordinator and a Research Specialist. Her interests arein Medical Informatics and the National Library of Medicine’sUnified Medical Language System.

Dorbin Ng received the PhD in 2000 from the Department ofManagement Information Systems at the University of Arizona,from which he also received a BS in Business Administration

Ž .majoring in Management Information Systems and Finance 1990Ž .and a MS in MIS 1993 . He is currently a Systems Scientist in

the Computer Science Department at Carnegie Mellon Universityworking on the Informedia Digital Video Library Project. Hisresearch interests include digital libraries, intelligence and multi-media information retrieval, semantic interoperability for informa-tion analysis and knowledge management environment, large-scaleknowledge discovery using high-performance supercomputers,search engine and user interface development in Internet, neural-network computing, and collaborative computing.