passage retrieval based hidden knowledge discovery from biomedical literature



Expert Systems with Applications 38 (2011) 9958–9964

Contents lists available at ScienceDirect

Expert Systems with Applications

journal homepage: www.elsevier.com/locate/eswa

Passage retrieval based hidden knowledge discovery from biomedical literature

Ran Chen, Hongfei Lin, Zhihao Yang ⇑

Department of Computer Science and Engineering, Dalian University of Technology, Dalian 116023, China

Article info

Keywords: Concept retrieval; Passage retrieval; Knowledge discovery

0957-4174/$ - see front matter © 2011 Elsevier Ltd. All rights reserved.
doi:10.1016/j.eswa.2011.02.034

⇑ Corresponding author. Address: Department of Computer Science and Engineering, Dalian University of Technology, No. 2 LingGong Road, ShaHeKou district, Dalian 116023, China. Tel.: +86 0411 84706009 3926; fax: +86 0411 84706550.

E-mail address: [email protected] (Z. Yang).

Abstract

Biomedical literature is growing at a double-exponential pace, and automatic extraction of implicit biological relationships from this literature helps build biomedical hypotheses that can be explored further experimentally. This paper presents a passage retrieval based method that explores hidden connections in MEDLINE records. In this method, MeSH concepts are retrieved from sentence-level windows and are therefore more relevant to the starting term. The method is tested in open discovery on three classical implicit connections: Alzheimer’s disease and indomethacin, migraine and magnesium, and schizophrenia and calcium-independent phospholipase A2. In our experiments, three computational methods for scoring and ranking the MeSH terms are explored: z-score, TFIDF (term frequency-inverse document frequency) and PMI (pointwise mutual information). Experimental results show that this method can significantly improve hidden knowledge discovery performance.

© 2011 Elsevier Ltd. All rights reserved.

1. Introduction

Biomedical literature is growing at a double-exponential pace, and new relationships are often implicit in existing information. However, because people’s reading capacity is limited and retrieval systems are incomplete, these implicit relationships across different fields are difficult to discover. Automated knowledge extraction techniques are therefore becoming increasingly necessary to exploit the huge amounts of data stored in information systems (Martin & Romaric, 1998). For example, text mining, also known as text data mining or knowledge discovery from textual databases, focuses on the computerized exploration of large amounts of literature and on the discovery of interesting relationships within them.

In the biomedical field, researchers usually use MEDLINE documents as their primary corpus for text mining studies. MEDLINE is the US National Library of Medicine’s (NLM) premier bibliographic database; it contains over 17 million references, with a concentration on biomedicine in the life sciences, dating back to 1950 (http://www.nlm.nih.gov/pubs/factsheets/pubmed.html). It is a representative resource for modern biomedical research and a store of valuable literature.

Knowledge discovery from biomedical literature databases has been widely studied. Don R. Swanson was the first to hypothesize that the medical literature might contain many undiscovered connections (Swanson, 1987, 1989). He proposed a simple “A influences B, and B influences C, therefore A may influence C” model for detecting linkages between concepts. The words that co-occur in titles with A are extracted, and this list contains all possible B-word candidates for the A-word. Each of the remaining B-word candidates is then searched in MEDLINE, which yields a list of all possible C-words. Swanson and Smalheiser also developed ARROWSMITH, an interactive system based on complementary literatures that helps specialists discover novel and interesting connections. However, using title words as the basis for detecting complementary literatures is both the strength and the weakness of ARROWSMITH (Swanson & Smalheiser, 1994, 1997): it performs no automatic linkage detection in abstracts or subject headings, and considerable manual interaction is needed.

Other researchers further improved Swanson’s method. Gordon and Lindsay demonstrated that the connection between fish oil and Raynaud’s syndrome found by Swanson could be replicated using an automated method (Gordon & Lindsay, 1996; Lindsay & Gordon, 1999). They used a variety of information-retrieval methods to weight terms, including term frequency and inverse document frequency, and evaluated performance using the terms Swanson had discovered as the gold standard. Weeber et al. investigated hypotheses on diet and drugs using NLP technology and developed an expert-oriented universal discovery tool, the DAD-system (Weeber, Klein, Aronson, & Mork, 2000). They used UMLS semantic types (http://mmtx.nlm.nih.gov) to prune redundant linking terms. Compared with previous work, their approach is more automated, although expert help is still needed in some important steps. Srinivasan presented open and closed text mining algorithms built within the discovery framework established by Swanson and Smalheiser (Srinivasan, 2004). Unlike previous work, Srinivasan’s work mined concept connections based on MeSH terms and UMLS semantic types. She carried out experiments on five open and five closed discovery problems. However, she did not rank the discovered knowledge.

Most recently, Yetisgen-Yildiz and Pratt developed the LitLinker system, which also uses MeSH concepts and UMLS semantic types (Yetisgen-Yildiz & Pratt, 2006). In addition, they grouped the semantic types and chose linking terms and target terms within the same semantic groups. From one starting term, LitLinker completes the whole text mining process automatically. The information retrieval metrics of precision and recall were introduced to evaluate LitLinker’s performance. However, the precision values that LitLinker achieves are rather low.

This paper presents a passage retrieval based hidden knowledge discovery method in which sentence-level windows are used to extract MeSH concepts. In previous work, linking terms were extracted from the whole MEDLINE record, whereas in our method they are extracted from sentence-level windows and are therefore more relevant to the starting term. Experimental results show that our method can significantly improve hidden knowledge discovery performance.

The remainder of this paper is organized as follows. Section 2 introduces our passage retrieval based hidden knowledge discovery method. Section 3 discusses the experimental results on three sets of experiments: Alzheimer’s disease and indomethacin, migraine and magnesium, and schizophrenia and calcium-independent phospholipase A2. Section 4 draws the conclusions.

2. Methods

Our knowledge discovery method is described in Fig. 1. Knowledge discovery begins with a starting term. Next, a set of terms directly correlated with the starting term (called linking terms) is found using a text mining process. These terms are scored, pruned and ranked. The same text mining process is then used to identify a set of terms correlated with each linking term (called target terms). Finally, all possible target term candidates are obtained for the starting term. In previous work, the linking terms and target terms were extracted from the whole MEDLINE record, whereas in our method they are extracted from sentence-level windows and are therefore more relevant to the starting term.

Fig. 1. The process of open knowledge discovery.

2.1. Definition of the window

In our method, the linking terms and target terms are extracted from sentence-level windows, whose definition comprises the following two conditions.

First, a window is made up of several complete sentences. Overlapping window based retrieval methods (of either fixed or variable length) usually ignore sentence completeness. However, sentences should convey a single idea, and paragraphs should be about one topic (Kaszkiel & Zobel, 2001), so returning complete sentences is preferable. In our method, as shown in Fig. 2, the document is segmented into several windows: window 1, window 2, window 3, etc. Each window has a relatively fixed length but must be made up of complete sentences.

Second, a window should not cross paragraph boundaries. Different paragraphs are usually about different topics, so the last sentence(s) of one paragraph and the first sentence(s) of the next paragraph should not be included in the same window.
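The two conditions above can be illustrated with a minimal sketch. It is not the authors’ implementation: the function names, the regular-expression sentence splitter, the blank-line paragraph separator and the `max_words` target length are all assumptions made for illustration.

```python
import re


def split_sentences(paragraph):
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [p for p in parts if p]


def build_windows(document, max_words=40):
    """Group whole sentences into windows that never cross a paragraph.

    Paragraphs are assumed to be separated by a blank line; max_words is a
    hypothetical target for the 'relatively fixed' window length.
    """
    windows = []
    for paragraph in document.split("\n\n"):
        current, length = [], 0
        for sentence in split_sentences(paragraph):
            n = len(sentence.split())
            # Close the window if adding this sentence would exceed the target.
            if current and length + n > max_words:
                windows.append(" ".join(current))
                current, length = [], 0
            current.append(sentence)
            length += n
        if current:
            # Flush at the paragraph boundary so windows never span paragraphs.
            windows.append(" ".join(current))
    return windows
```

A sentence longer than `max_words` still forms a window on its own, so sentence completeness takes priority over the length target in this sketch.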

2.2. Index building

To locate positions containing the starting term rapidly, an index is built for the documents in the original set. In our method, Indri (http://www.lemurproject.org/indri/) is used to index and retrieve the MeSH terms. No stop word list is used, since MeSH terms are usually compound noun phrases such as Dental Care for Disabled. In our experiments, two indexes are built: a MeSH concept index and a passage index. A search on the starting term then returns all documents in which it appears (set A). The knowledge discovery process begins by analyzing the MeSH terms in set A, which are considered to be directly related to the starting term.

Fig. 2. Definition of the window.

Table 1. Semantic groups selected for our experiments.

| Linking term | Target term |
|---|---|
| Disorders | Chemicals and drugs |
| Physiology | Genes and molecular sequence |
| Anatomy | |
| Genes and molecular sequence | |
| Chemicals and drugs | |

2.3. MeSH concept extraction

In concept retrieval, the returned results are the MeSH concepts assigned to the MEDLINE records by trained indexers at NLM. In passage retrieval, the returned results are the sentence-level windows; a maximum matching method is then used to extract MeSH terms from these windows with a MeSH dictionary.
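As a rough illustration of dictionary-based maximum matching, the sketch below scans a window left to right and always prefers the longest dictionary phrase at each position. The tokenization, the `max_len` phrase limit and the function name are assumptions; the paper does not specify these details.

```python
def extract_mesh_terms(window, mesh_dictionary, max_len=6):
    """Greedy longest-match lookup of dictionary phrases in a text window.

    mesh_dictionary is any iterable of MeSH term strings; max_len is a
    hypothetical upper bound on phrase length in tokens.
    """
    vocab = {term.lower() for term in mesh_dictionary}
    # Crude tokenization (assumption): lowercase, drop commas and periods.
    tokens = window.lower().replace(",", " ").replace(".", " ").split()
    found, i = [], 0
    while i < len(tokens):
        match_len = 0
        # Try the longest candidate phrase first, then shrink.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if " ".join(tokens[i:i + n]) in vocab:
                match_len = n
                break
        if match_len:
            found.append(" ".join(tokens[i:i + match_len]))
            i += match_len  # skip past the matched phrase
        else:
            i += 1
    return found
```

With a dictionary containing both Dental Care and Dental Care for Disabled, the longer phrase wins, which is the reason stop words such as "for" must remain in the text.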

2.4. Implicit relationships discovery

2.4.1. Original linking terms for set A

Set A is composed of document segments containing the article title ID and all MeSH terms that co-occur with A (see Fig. 1). Depending on how fully the topic it denotes is discussed in the document, each MeSH term is assigned a “minor” or “major” weight. The major MeSH terms represent the main content of the document and are marked with “*” (e.g. the MeSH term string “*Cloning, Organism/ethics/legislation & jurisprudence”) (Mork & Aronson, 2007). Because the problem addressed here focuses on the major MeSH terms, which express the main idea of a document, the MeSH terms marked with “*” are extracted. In addition, special characters such as spaces, semicolons, “*” and “&” are removed, and the Porter stemmer is used to strip suffixes (Porter, 1980). The resulting set is the original linking term set.
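The filtering step can be sketched as follows. This is a simplified illustration under stated assumptions: entries are strings in the form shown above, only the descriptor before the first “/” is kept, and the Porter stemming and full special-character cleanup described in the text are omitted. The function name is hypothetical.

```python
def major_mesh_headings(mesh_strings):
    """Return descriptors of entries flagged as major topics with '*'."""
    headings = []
    for entry in mesh_strings:
        if "*" not in entry:
            continue  # minor topic: not marked with '*', so skipped
        descriptor = entry.split("/")[0]  # drop subheadings after the first '/'
        descriptor = descriptor.replace("*", "").strip()
        headings.append(descriptor.lower())
    return headings
```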

2.4.2. Linking terms ranking

It is important to find the most meaningful linking terms for discovering the target terms. Our goal is to find new linkages between a disease and chemicals, drugs and genes. However, MeSH terms that are too broad or too close to the starting term reduce the efficiency of finding the final target terms. For instance, only a small portion of the documents containing migraine is useful for the knowledge discovery process, although the total number of these documents is about 9000. It is therefore unreasonable to select linking terms according to concept co-occurrence alone.

In our method, UMLS (http://www.nlm.nih.gov/research/umls/about_umls.html) is used to filter and prune the candidate linking terms through their semantic types. UMLS is a standard medical knowledge source developed by the National Library of Medicine. It includes some 900,000 biomedical concepts, and every MeSH term is assigned to one or more semantic types. The semantic types have been classified by experts into a smaller number of semantic groups. In earlier work, fifteen high-level semantic groups were established to help reduce the conceptual complexity of the large domain covered by the UMLS. These groupings of semantic types (the semantic groups) have proved useful in a number of applications, including improved visualization and display of the knowledge in a particular domain (Bodenreider & McCray, 2003). In our method, some semantic groups (shown in Table 1) are used to filter the terms: MeSH terms that do not belong to these semantic groups are removed.

Various metrics have been used to assign a score to a term that reveals to what extent it qualifies as a linking term or target term. Three computational methods are compared in our experiments: z-score (Yetisgen-Yildiz & Pratt, 2006), TFIDF (Liu, 2005) and PMI. The z-score is a measure of how far a given value is from the mean, expressed as a number of standard deviations. Its computation starts from the probability of a MeSH term in set A, defined as follows:

$$P_A^m = \frac{F_A^m}{D_A} \qquad (1)$$

where $P_A^m$ is the probability of MeSH term $m$ in set $A$, $F_A^m$ is the number of documents containing $m$ in set $A$, and $D_A$ is the total number of documents in set $A$.

The mean probability of MeSH term $m$ objectively reflects its mean frequency across the whole literature. It is defined as follows:

$$\bar{P}^m = \frac{\sum_{A=1}^{N^m} P_A^m}{N^m} \qquad (2)$$

where $N^m$ is the total number of documents in which MeSH term $m$ occurs. The standard deviation is used to measure the spread of the probability distribution:

$$\sigma^m = \sqrt{\frac{1}{N^m - 1} \sum_{A=1}^{N^m} \left( P_A^m - \bar{P}^m \right)^2} \qquad (3)$$

The standard deviation is a statistic that shows how closely the probabilities cluster around the mean probability. When the probabilities are tightly bunched together, the standard deviation is small. When the probabilities are spread apart, the standard deviation is large. In this paper, $\sigma^m$ reflects the fluctuation and distribution of MeSH term $m$ in set $A$.

The z-score of MeSH term $m$ in set $A$ is defined as:

$$z_A^m = \frac{P_A^m - \bar{P}^m}{\sigma^m} \qquad (4)$$

It compares the specific distribution of MeSH term $m$ in set $A$ with the term distribution in the background set. The z-score is calculated for each candidate linking term, and a term is chosen if its score is greater than or equal to the threshold. These final linking terms comprise set B.
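Eqs. (1)–(4) can be chained into a short sketch. How the probabilities entering the mean of Eq. (2) are collected is not fully specified in the text, so the `background_probs` argument (one probability of the term per set) is an assumption, as are the function and parameter names.

```python
import math


def z_score(docs_with_term, docs_in_set, background_probs):
    """z-score of a MeSH term in set A, following Eqs. (1)-(4).

    background_probs (assumption): the probabilities P_A^m over which the
    mean and standard deviation are taken; it needs at least two values
    that are not all equal, otherwise the standard deviation is undefined
    or zero.
    """
    p_a = docs_with_term / docs_in_set                      # Eq. (1)
    n = len(background_probs)
    mean = sum(background_probs) / n                        # Eq. (2)
    variance = sum((p - mean) ** 2 for p in background_probs) / (n - 1)
    sigma = math.sqrt(variance)                             # Eq. (3)
    return (p_a - mean) / sigma                             # Eq. (4)
```

A term would then be kept as a linking term when the returned value is greater than or equal to the chosen threshold.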

The TFIDF is defined as follows:

$$\mathrm{TFIDF}_A^m = \log\left( \mathrm{TF}_A^m + 1 \right) \times \mathrm{IDF}^m \qquad (5)$$

where $\mathrm{TF}_A^m$ (term frequency) is the number of documents in set $A$ in which the MeSH term $m$ occurs, which reflects the importance of $m$ to the set. When the set is larger, the term frequency of $m$ will be higher, which yields a higher TFIDF value for the same IDF. The logarithm is applied so that terms occurring in more documents are not unfairly given more weight (Liu, 2005). $\mathrm{IDF}^m$ (inverse document frequency) is defined as follows:

$$\mathrm{IDF}^m = \log \frac{N}{\mathrm{DF}^m} \qquad (6)$$

where $\mathrm{DF}^m$ is the number of documents in the background set in which the MeSH term $m$ occurs, and $N$ is the total number of documents in the background set. Ranking the terms by their TFIDF scores emphasizes terms with a high term frequency and also eliminates the influence of common words with high frequency.

All TFIDF scores are then divided by the highest TFIDF value, so that every score is normalized to a weight between 0 and 1. The formula is defined as follows:

$$\mathrm{weight}_A^m = \frac{\mathrm{TFIDF}_A^m}{\mathrm{Highest}\left( \mathrm{TFIDF}_A^m \right)} \qquad (7)$$

We choose the linking terms whose TFIDF scores are higher than or equal to the threshold to comprise set B.
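A minimal sketch of Eqs. (5)–(7) followed by threshold selection is given below. The logarithm base is not stated in the text, so the natural logarithm is assumed; the dictionary-based inputs and the function names are also assumptions made for illustration.

```python
import math


def tfidf_weights(tf_in_a, df_background, n_background):
    """Normalized TFIDF weights per term, following Eqs. (5)-(7).

    tf_in_a: term -> number of documents in set A containing the term.
    df_background: term -> number of background documents containing it.
    Assumes at least one term has a non-zero TFIDF score.
    """
    scores = {}
    for term, tf in tf_in_a.items():
        idf = math.log(n_background / df_background[term])  # Eq. (6)
        scores[term] = math.log(tf + 1) * idf               # Eq. (5)
    top = max(scores.values())
    return {term: s / top for term, s in scores.items()}    # Eq. (7)


def select_linking_terms(weights, threshold):
    """Keep terms whose weight is greater than or equal to the threshold."""
    return sorted(t for t, w in weights.items() if w >= threshold)
```

In this sketch a term that occurs in every background document receives an IDF of zero and therefore a weight of zero, which illustrates how very common terms are suppressed.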

PMI is another measure of the degree of statistical dependence between two items. It has been applied to several natural language processing problems, including word clustering and word sense disambiguation (Read, 2004). It is defined as follows:

$$\mathrm{PMI}(m_1, m_2) = \log_2 \left( \frac{P(m_1, m_2)}{P(m_1)\,P(m_2)} \right) \qquad (8)$$

The probabilities of $m_1$ and $m_2$ can be estimated simply by counting the number of occurrences of each term. The co-occurrence probability of $m_1$ and $m_2$ can be estimated by counting the number of times the two terms appear together within a specified number of documents.
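The count-based estimate of Eq. (8) can be sketched as follows, assuming document-level counts over a collection of `total_docs` documents; the parameter names are illustrative.

```python
import math


def pmi(count_m1, count_m2, count_both, total_docs):
    """Pointwise mutual information from document counts, following Eq. (8).

    count_both must be positive; the logarithm is undefined for terms that
    never co-occur.
    """
    p1 = count_m1 / total_docs
    p2 = count_m2 / total_docs
    p12 = count_both / total_docs
    return math.log2(p12 / (p1 * p2))
```

A positive value means the two terms co-occur more often than independence would predict, and a value near zero means they co-occur about as often as chance.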

2.4.3. Discovery and selection of the target terms

In the next step, every linking term in set B is used as a starting term to retrieve candidate target terms. On the one hand, each candidate target term is associated with the linking terms that connect to it. On the other hand, a candidate target term must not be among the linking terms of set B. In addition, the linking term count (LTC) threshold introduced by the LitLinker system is used to filter the target terms: candidate terms with fewer connections to the linking terms are filtered out. The candidate target terms with higher frequency scores and more connections are thus selected as the final target terms (set C). These terms have the strongest potential and most meaningful evidence of a connection with the starting term.
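The LTC filter described above can be sketched as follows. The input mapping, the function name and the choice of "at least `ltc` linking terms" as the cutoff are assumptions; the text does not state whether the comparison is strict.

```python
from collections import defaultdict


def select_target_terms(linking_to_candidates, starting_term, ltc):
    """Keep candidate target terms supported by at least ltc linking terms.

    linking_to_candidates: linking term (set B) -> candidate terms retrieved
    when that linking term is used as the starting term.
    """
    linking_terms = set(linking_to_candidates)
    support = defaultdict(set)
    for link, candidates in linking_to_candidates.items():
        for cand in candidates:
            # A target term may be neither the starting term nor a linking term.
            if cand == starting_term or cand in linking_terms:
                continue
            support[cand].add(link)
    return {c: sorted(links) for c, links in support.items() if len(links) >= ltc}
```

The returned mapping also records which linking terms support each target term, which is the evidence a reader would inspect for a proposed connection.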

3. Experimental results

3.1. Experimental dataset

Our experimental dataset is the whole set of MEDLINE citations published before September 2005. To test our discovery results, we select the citations published from January 2004 to September 2005 as the test corpus (the same dataset used to test the LitLinker system, so that the results are comparable). Each citation in MEDLINE includes the article title, abstract, authors’ names, MeSH terms, affiliations, publication date, journal name and other information. In corpus preprocessing, the article title ID, MeSH terms, their semantic types and the publication date are extracted. MeSH terms are used to retrieve and analyze the co-occurrence term frequency. Their semantic types are used to filter out terms that are too broad to serve as target terms (e.g. human) or too close to the starting term.

The MEDLINE citation corpus is divided into two parts: the first part consists of the citations published before January 2004, and the second part consists of the citations published from January 2004 to September 2005. Three starting terms are used in our experiments: Alzheimer’s disease, Migraine disorders and Schizophrenia (which have also been used to evaluate the LitLinker system).

3.2. Evaluation

Don R. Swanson has done much work on discovering implicit relations in MEDLINE using text mining. For example, he found the linkages between fish oil and Raynaud’s syndrome, migraine and magnesium, etc. Since there is no universally accepted evaluation procedure, Swanson’s discoveries have become the gold standard for evaluation. Other researchers usually demonstrate the effectiveness of their methods by replicating Swanson’s discoveries.

The LitLinker system presents a new evaluation method in which the results are evaluated with precision and recall metrics. Our experimental results are also evaluated using these metrics, which are defined as follows:

$$\text{Precision:} \quad P_i = \frac{|T_i \cap G_i|}{|T_i|} \qquad (9)$$

$$\text{Recall:} \quad R_i = \frac{|T_i \cap G_i|}{|G_i|} \qquad (10)$$

where $T_i$ is the set of target terms generated from the corpus published before January 2004 with the starting term $i$, and $G_i$ is the gold standard, i.e. the terms found in the second part of the corpus that meet two conditions. First, there exists a co-occurrence between the linking term and the starting term $i$. Second, the linking term does not appear in the first part of the corpus (the MEDLINE citations published before January 2004).

In addition, another common metric in information retrieval, the F-score (the weighted harmonic mean of precision and recall, defined as F = (2PR)/(P + R), where P denotes precision and R recall), is introduced in our experiments to evaluate the overall performance.
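The three metrics can be computed from the target term set and the gold standard as in the following sketch. The function name and the convention of returning zero for empty sets are assumptions.

```python
def evaluate(target_terms, gold_standard):
    """Precision (Eq. 9), recall (Eq. 10) and F-score for one starting term."""
    t, g = set(target_terms), set(gold_standard)
    hits = len(t & g)  # size of the intersection of T_i and G_i
    precision = hits / len(t) if t else 0.0
    recall = hits / len(g) if g else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score
```

As a check on the formula, a precision of 0.188 and a recall of 0.27 give an F-score of about 0.222, which matches the PMI column (LTC = 5) of Table 3.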

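Table 2. Summary of results for Alzheimer’s disease (concept retrieval).

| | TFIDF, LTC = 3 | TFIDF, LTC = 5 | PMI, LTC = 3 | PMI, LTC = 5 | z-Score, LTC = 3 | z-Score, LTC = 5 | LitLinker (z-score), LTC = 3 | LitLinker (z-score), LTC = 5 |
|---|---|---|---|---|---|---|---|---|
| No. of linking terms | 212 | 212 | 212 | 212 | 212 | 212 | 212 | 212 |
| No. of target terms (T) | 6999 | 6667 | 448 | 432 | 131 | 83 | 600 | 250 |
| No. of terms in gold standard (G) | 1094 | 1094 | 1094 | 1094 | 1094 | 1094 | 173 | 173 |
| Precision | 0.17 | 0.18 | 0.12 | 0.13 | 0.13 | 0.15 | 0.062 | 0.064 |
| Recall | 0.07 | 0.087 | 0.08 | 0.051 | 0.085 | 0.083 | 0.214 | 0.092 |
| F-score | 0.099 | 0.117 | 0.096 | 0.073 | 0.103 | 0.107 | 0.096 | 0.075 |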

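Table 3. Summary of results for Alzheimer’s disease (passage retrieval).

| | TFIDF, LTC = 3 | TFIDF, LTC = 5 | PMI, LTC = 3 | PMI, LTC = 5 | z-Score, LTC = 3 | z-Score, LTC = 5 |
|---|---|---|---|---|---|---|
| No. of linking terms | 212 | 212 | 212 | 212 | 212 | 212 |
| No. of target terms | 1700 | 1655 | 1201 | 560 | 41 | 41 |
| No. of terms in gold standard | 386 | 386 | 386 | 386 | 386 | 386 |
| Precision | 0.089 | 0.09 | 0.143 | 0.188 | 0.22 | 0.22 |
| Recall | 0.39 | 0.386 | 0.44 | 0.27 | 0.023 | 0.023 |
| F-score | 0.148 | 0.148 | 0.216 | 0.222 | 0.042 | 0.042 |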

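Table 4. Summary of results for migraine (concept retrieval).

| | TFIDF, LTC = 3 | TFIDF, LTC = 5 | PMI, LTC = 3 | PMI, LTC = 5 | z-Score, LTC = 3 | z-Score, LTC = 5 | LitLinker (z-score), LTC = 3 | LitLinker (z-score), LTC = 5 |
|---|---|---|---|---|---|---|---|---|
| No. of linking terms | 250 | 250 | 250 | 250 | 250 | 250 | 250 | 250 |
| No. of target terms | 337 | 350 | 1251 | 772 | 30 | 24 | 1230 | 734 |
| No. of terms in gold standard | 110 | 110 | 110 | 110 | 110 | 110 | 69 | 69 |
| Precision | 0.03 | 0.03 | 0.02 | 0.02 | 0.20 | 0.19 | 0.026 | 0.029 |
| Recall | 0.14 | 0.12 | 0.11 | 0.10 | 0.09 | 0.08 | 0.464 | 0.304 |
| F-score | 0.049 | 0.048 | 0.034 | 0.033 | 0.124 | 0.113 | 0.049 | 0.053 |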

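Table 5. Summary of results for migraine (passage retrieval).

| | TFIDF, LTC = 3 | TFIDF, LTC = 5 | PMI, LTC = 3 | PMI, LTC = 5 | z-Score, LTC = 3 | z-Score, LTC = 5 |
|---|---|---|---|---|---|---|
| No. of linking terms | 24 | 24 | 24 | 24 | 24 | 24 |
| No. of target terms | 13758 | 12427 | 9962 | 8687 | 1498 | 1498 |
| No. of terms in gold standard | 1519 | 1519 | 1519 | 1519 | 1519 | 1519 |
| Precision | 0.107 | 0.12 | 0.11 | 0.13 | 0.89 | 0.89 |
| Recall | 0.985 | 0.981 | 0.761 | 0.758 | 0.033 | 0.033 |
| F-score | 0.193 | 0.214 | 0.192 | 0.222 | 0.064 | 0.064 |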

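Table 6. Summary of results for schizophrenia (concept retrieval).

| | TFIDF, LTC = 3 | TFIDF, LTC = 5 | PMI, LTC = 3 | PMI, LTC = 5 | z-Score, LTC = 3 | z-Score, LTC = 5 | LitLinker (z-score), LTC = 3 | LitLinker (z-score), LTC = 5 |
|---|---|---|---|---|---|---|---|---|
| No. of linking terms | 211 | 211 | 211 | 211 | 211 | 211 | 211 | 211 |
| No. of target terms | 23 | 21 | 79 | 54 | 27 | 18 | 317 | 124 |
| No. of terms in gold standard | 715 | 715 | 715 | 715 | 715 | 715 | 161 | 161 |
| Precision | 0.12 | 0.15 | 0.07 | 0.07 | 0.11 | 0.14 | 0.076 | 0.064 |
| Recall | 0.04 | 0.03 | 0.08 | 0.05 | 0.05 | 0.04 | 0.149 | 0.05 |
| F-score | 0.06 | 0.05 | 0.075 | 0.058 | 0.069 | 0.062 | 0.101 | 0.056 |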


3.3. Results and discussion

Three sets of experiments are carried out: Alzheimer’s disease and indomethacin, migraine and magnesium, and schizophrenia and calcium-independent phospholipase A2. In these experiments, three computational methods are compared for scoring and ranking the results: z-score, TFIDF and PMI. The results are discussed in detail below.

Table 7. Summary of results for schizophrenia (passage retrieval).

| | TFIDF, LTC = 3 | TFIDF, LTC = 5 | PMI, LTC = 3 | PMI, LTC = 5 | z-Score, LTC = 3 | z-Score, LTC = 5 |
|---|---|---|---|---|---|---|
| No. of linking terms | 212 | 212 | 212 | 212 | 212 | 212 |
| No. of target terms | 2428 | 2401 | 1792 | 1042 | 51 | 51 |
| No. of terms in gold standard | 367 | 367 | 367 | 367 | 367 | 367 |
| Precision | 0.07 | 0.075 | 0.11 | 0.15 | 0.20 | 0.20 |
| Recall | 0.50 | 0.49 | 0.567 | 0.436 | 0.027 | 0.027 |
| F-score | 0.123 | 0.130 | 0.184 | 0.223 | 0.048 | 0.048 |

3.3.1. Alzheimer’s disease and indomethacin

In this set of experiments, we evaluate pertinence using 212 linking terms extracted from the retrieval results with the starting term Alzheimer’s disease. Meanwhile, 1094 terms are obtained from the test corpus as the gold standard. For each method, precision and recall values are computed.

Table 2 shows the comparative results of our three methods and the z-score method of the LitLinker system in the concept retrieval experiments. Since our experiments use different gold standards from the LitLinker system, the results cannot be compared directly. Among our three methods, the TFIDF method achieves the best performance (a precision of 18%, a recall of 8.7% and an F-score of 11.7%, all obtained when LTC is 5), and its overall performance (F-score) is higher than that of the LitLinker system (9.6% when LTC is 3).

Table 3 shows the results of the three methods in the passage retrieval experiments. The z-score method achieves the highest precision, while the other two methods have higher recall. In terms of F-score, the TFIDF and PMI methods perform much better than in the concept retrieval experiments. The F-score of the PMI method (22.2% when LTC is 5) is the best of the three and is much higher than that of the LitLinker system (9.6% when LTC is 3). The reason is that, in our method, the linking terms and target terms are extracted from sentence-level windows (instead of from the whole MEDLINE record) and are therefore more relevant to the starting term. However, the F-scores of the z-score method are worse than in the concept retrieval experiments because of its rather low recall: with the z-score method only 41 target terms are returned, leading to a high precision of 22% and a low recall of 2.3%.

3.3.2. Migraine and Magnesium

In this set of experiments, we evaluate pertinence using a set of 250 linking terms in the concept retrieval experiment and 24 linking terms in the passage retrieval experiment with the starting term migraine. These 24 linking terms are all that we could extract from MEDLINE.

Tables 4 and 5 show the comparative results of the three methods in the concept retrieval experiment and the passage retrieval experiment, respectively. In both experiments the TFIDF method achieves better recall than the other two methods, and it performs best in the passage retrieval experiment (98.1%). The reason is that, in the TFIDF method, a MeSH term that appears only in some special documents is meaningful for any document in which it appears. In addition, since the word distribution in the background set is not taken into account, the TFIDF method can obtain more target terms.

The z-score method uses statistical distributions over the documents that contain the MeSH term, rather than a single distribution, and achieves the best precision in the passage retrieval setting (89%). Unlike the TFIDF method, the z-score requires calculating the mean probability and the standard deviation. Because it considers the documents whether or not they contain the MeSH term, the resulting term distribution removes the bias toward high or low frequency.

As in the experiments on Alzheimer’s disease and indomethacin, the F-scores of the TFIDF and PMI methods are much better than in the concept retrieval experiments, while those of the z-score method get worse. The F-score of the PMI method (22.2% when LTC is 5) is again the best of the three and much higher than that of the LitLinker system (5.3% when LTC is 5).

3.3.3. Schizophrenia and Calcium-independent phospholipase A2

In this set of experiments, we evaluate pertinence using a set of 211 linking terms in the concept retrieval experiment and the passage retrieval experiment with the starting term schizophrenia.

Table 6 shows the comparative results of the three methods in the concept retrieval experiment. Since the number of returned target terms is too small, all three methods achieve very low recall (3%–5%), leading to F-scores (5%–7.5%) inferior to that of the LitLinker system (10.1% when LTC is 3).

Table 7 shows the results in the passage retrieval experiment. Among the three methods, z-score performs best in precision. One interesting observation is that the best recall is obtained by the PMI method (56.7% when LTC is 3), which is higher than that of the other methods and is not consistent with the rest of the experiments. As in the two previous experiments, the F-scores of the TFIDF and PMI methods are much better than those in the concept retrieval experiments, while those of the z-score method get worse. The F-score of the PMI method (22.3% when LTC is 5) is again the best of the three and much better than that of the LitLinker system (10.1% when LTC is 3).
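Since PMI gives the best F-scores in the passage retrieval setting, a small sketch may help show how it interacts with sentence-level windows. It assumes the standard definition PMI(x, y) = log2(p(x, y) / (p(x) p(y))), with probabilities estimated from sentence counts. The toy sentences, the `pmi` helper, and the representation of each sentence as a set of terms are assumptions for illustration, not the data or implementation used in this paper.

```python
import math


def pmi(start: str, term: str, sentences: list[set[str]]) -> float:
    """PMI of `start` and `term`, estimated from sentence-level co-occurrence."""
    n = len(sentences)
    n_start = sum(start in s for s in sentences)
    n_term = sum(term in s for s in sentences)
    n_both = sum(start in s and term in s for s in sentences)
    if n_both == 0:
        return float("-inf")  # never seen together in one sentence
    return math.log2((n_both / n) / ((n_start / n) * (n_term / n)))


# Toy sentence windows, each reduced to the set of concepts it mentions.
sentences = [
    {"migraine", "serotonin"},
    {"migraine", "serotonin", "platelet"},
    {"migraine", "stress"},
    {"platelet", "aspirin"},
]
for term in ("serotonin", "platelet", "aspirin"):
    print(term, pmi("migraine", term, sentences))
```

In this toy setting, a term that shares sentences with the starting term more often than chance receives a positive score, while a term that never appears in the same sentence is excluded even if it occurs elsewhere in the same record.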

4. Conclusions

This paper presents a passage retrieval based method to explore hidden connections in MEDLINE records. In our method, the MeSH concepts are retrieved from sentence-level windows and are therefore more relevant to the starting term. We conducted three sets of Swanson’s experiments: Alzheimer’s disease and indomethacin, Migraine disorders and Magnesium, and Schizophrenia and Calcium-independent phospholipase A2. In these experiments, three computational methods for scoring and ranking the MeSH terms are compared: z-score, TFIDF and PMI. In the passage retrieval experiments, because the linking terms and target terms come from sentence-level windows, the TFIDF and PMI methods achieve much better performance than in the concept retrieval experiments. In addition, the F-scores of the PMI method are the best of the three and much better than those of the LitLinker system. These results show that our passage retrieval based method can significantly improve the performance of hidden knowledge discovery in biomedical literature.

Acknowledgments

This work is supported by grants from the Natural Science Foundation of China (No. 60373095 and 60673039) and the National High Tech Research and Development Plan of China (2006AA01Z151).

9964 R. Chen et al. / Expert Systems with Applications 38 (2011) 9958–9964

References

Bodenreider, O., & McCray, A. T. (2003). Exploring semantic groups through visual approaches. Journal of Biomedical Informatics, 36(6), 414–432.

Gordon, M. D., & Lindsay, R. K. (1996). Toward discovery support systems: a replication, re-examination, and extension of Swanson’s work on literature-based discovery of a connection between Raynaud’s and fish oil. Journal of the American Society for Information Science, 47(2), 116–128.

Kaszkiel, M., & Zobel, J. (2001). Effective ranking with arbitrary passages. Journal of the American Society for Information Science and Technology, 52(4), 344–364.

Lindsay, R. K., & Gordon, M. D. (1999). Literature-based discovery by lexical statistics. Journal of the American Society for Information Science, 50(7), 574–587.

Liu, Y. (2005). Text mining biomedical literature for genomic knowledge discovery. Doctor’s Thesis, Georgia Institute of Technology, Atlanta, GA.

Martin, A. R., & Romaric, B. (1998). Text Mining-Knowledge Extraction from semi-structured Textual Data. In Proceedings of the 6th conference of the international federation of classification societies (pp. 473–480). Rome, Italy.

Mork, J. G., & Aronson, A. R. (2007). Automatic indexing of specialized documents: using generic vs. domain-specific document representations. In Proceedings of BioNLP 2007: Biological, translational, and clinical language processing (pp. 183–190). Prague, Czech Republic.

Porter, M. F. (1980). An algorithm for suffix stripping. Program, 14(3), 130–137.

Read, J. (2004). Recognising affect in text using Pointwise-Mutual Information. Master’s Thesis, University of Sussex, UK.

Swanson, D. R. (1987). Two medical literatures that are logically but not bibliographically connected. Journal of the American Society for Information Science, 38(4), 228–233.

Swanson, D. R. (1989). Online search for logically related non-interactive medical literatures: A systematic trial-and-error strategy. Journal of the American Society for Information Science, 40(5), 356–358.

Swanson, D. R., & Smalheiser, N. R. (1994). Assessing a gap in the biomedical literature: magnesium deficiency and neurologic disease. Neuroscience Research Communications, 15(1), 1–9.

Swanson, D. R., & Smalheiser, N. R. (1997). An interactive system for finding complementary literatures: a stimulus to scientific discovery. Journal of Artificial Intelligence, 91(2), 183–203.

Srinivasan, P. (2004). Text mining: generating hypotheses from MEDLINE. Journal of the American Society for Information Science and Technology, 55(5), 396–413.

Weeber, M., Klein, H., Aronson, A. R., Mork, J. G., et al. (2000). Text-based discovery in biomedicine: The architecture of the DAD-system. In Proceedings of the 2000 AMIA annual fall symposium (pp. 903–907). Los Angeles, CA.

Yetisgen-Yildiz, M., & Pratt, W. (2006). Using statistical and knowledge-based approaches for literature-based discovery. Journal of Biomedical Informatics, 39(6), 600–611.