Conference Review

Extracting information automatically from biological literature

A presentation for the ESF workshop ‘Proteomics: Focus on Protein Interactions’

Christian Blaschke, Robert Hoffmann, Juan Carlos Oliveros and Alfonso Valencia*
Protein Design Group, CNB-CSIC, Madrid E-28049, Spain

* Correspondence to: A. Valencia, Protein Design Group, National Center for Biotechnology, CNB-CSIC, Cantoblanco, Madrid E-28049, Spain. E-mail: [email protected]

Received: 4 July 2001
Accepted: 27 July 2001

Keywords: information in text sources; genomics; proteomics

Comparative and Functional Genomics. Comp Funct Genom 2001; 2: 310–313. DOI: 10.1002/cfg.102
Copyright © 2001 John Wiley & Sons, Ltd.

Information contained in sources of biological text

In the past few decades, biologists have generated a large amount of data that has been published mainly in biological journals. It is now important to be able to recover as much as possible of this information as it constitutes a precious source of additional information for helping to understand the new genomics and proteomics data. More than 10 million abstracts of such papers are contained in the Medline collection and are available on the World Wide Web via PubMed [10], and this collection will expand considerably once journals become freely available on the Web (PubMed Central [15], E-BioSci [7]).

In parallel with these plain text information sources, basic molecular biology data has been stored in various semi-structured repositories, such as protein and gene sequence databases, and more recently in databases of protein structures, protein interactions, transcription factors, point mutations, metabolic pathways and many others.

There is a commonly recognized need for linking and complementing the information contained in these databases with the information stored in the literature, a task that right now requires detailed work by scientists and in some cases database users.

Ways of extracting information automatically from text

Three main types of systems are being developed:

• Statistical methods. These are based on the frequency of occurrence of words in a large text corpus that has been previously organized in line with some form of external knowledge (for example, groups of genes with similar expression patterns or proteins that belong to the same protein family). Significant patterns detected, and the information associated with them, are used to characterize the corresponding groups of genes or proteins.

• Computational linguistics methods. These methods use parsers and grammars to extract syntactic information and internal dependencies within individual sentences. This approach is quite general and can be applied to different knowledge domains after careful adaptation to the specific problems of the field. It is important to realize that there is still no guarantee that this adaptation can be successfully achieved for the field of molecular biology.

• Frame-based approaches. A third type of approach combines features of the two previous methods with a set of previously defined templates for possible textual relationships, called frames. In common with computational linguistics methods, this approach can make use of syntactic information although it can also work without prior parsing and tagging of the text. As in the case of statistical methods, it uses scoring schemes that depend on the number of occurrences of particles in large collections of text.

Overview of publications on automated information extraction as applied to molecular biology

Statistical approaches

• Extraction of keywords from Medline abstracts in order to qualify the function of previously classified protein families [2].

• Assistance in the annotation of experimental results obtained from DNA expression arrays [4].

• Distribution of extracted terms in order to classify them in relation to articles linked to the OMIM database of human disease [1].

• Statistical and linguistic techniques for building knowledge bases and domain-specific thesauri [11].

• Classification of protein sub-cellular localization [6].

Detection of protein names in biomedical texts

• Combination of syntactic information and morphological differences in comparison with the surrounding text [8,13].

Protein-protein interaction detection systems

• Co-occurrence of protein names within the same abstract, and whether this implies a functional relationship or not [19,9].

• Use of frame-based methods to extract large sets of protein-protein interactions [3].

• Evaluation of the degree of relationship between experimental protein interactions (DIP database, [21]) and underlying text sources [5].

• Restricted application of linguistic methods and various grammars to reduced systems in order to demonstrate potential applicability [14,16,17,18,20,12].
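As an illustration of the co-occurrence idea listed above, the following minimal sketch counts how many abstracts mention each pair of names. It is not taken from any of the cited systems: the function name, the whitespace tokenization and the placeholder protein names (ProtA, ProtB, ProtC) are assumptions made only for this example, and real protein names would need far more careful detection.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(abstracts, protein_names):
    """Count the abstracts in which each unordered pair of names co-occurs."""
    names = {n.lower() for n in protein_names}
    counts = Counter()
    for text in abstracts:
        # Crude tokenization (assumption): lowercase, strip commas and full stops.
        tokens = set(text.lower().replace(",", " ").replace(".", " ").split())
        present = sorted(names & tokens)
        for a, b in combinations(present, 2):
            counts[(a, b)] += 1
    return counts


# Toy abstracts with placeholder names, invented for illustration only.
abstracts = [
    "ProtA binds ProtB in late G1.",
    "ProtB and ProtA form an active complex.",
    "ProtC inhibits ProtA activity.",
]
pairs = cooccurrence_counts(abstracts, ["ProtA", "ProtB", "ProtC"])
```

A raw count of this kind says nothing about whether the two proteins are functionally related, which is the question the studies above address.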

Commercial tools and systems

A number of companies have announced the commercialization of basic tools such as part-of-speech taggers and parsers, databases with manually organized sets of text as well as various information extraction systems for the detection and classification of information. Among others, there are general purpose tools from IBM and Xerox, and more specialized systems from companies such as Autonomy, SAS, Ingenuity, Semio, SRA and Temis.

Two systems for automated information extraction

We describe here two previously published systems for which detailed evaluation of the results is available. The first represents a typical statistical approach, based on the extraction of keywords from pre-organized text groups, while the second is an example of a hybrid approach based on a set of pre-defined frames related to protein interactions.


Geisha (Gene Expression Information System for Human Analysis [4]) is conceptually similar to other statistical approaches, such as that previously developed by Andrade and Valencia [2] for the assignment of functional keywords to protein families. The Geisha system involves the annotation of function for groups of genes that show similar expression patterns in DNA array experiments. First the system uses the groups of genes as a framework for clustering the related literature. In a second step it estimates the frequency of relevant words in the various literature clusters, and then in a third step these frequencies are compared in order to assess their statistical relevance (in the form of Z-scores). A similar procedure is applied to the extraction of complete sentences specific to the various gene clusters.
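The second and third steps can be sketched as follows. This is a minimal illustration and not the published Geisha implementation: the text does not give the exact Z-score formula, so the sketch assumes that a word's frequency in one cluster is compared with the mean and standard deviation of its frequencies across all clusters. The function names and the toy clusters are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def word_frequencies(documents):
    """Fraction of the documents in one cluster that contain each word."""
    counts = Counter()
    for doc in documents:
        counts.update(set(doc.lower().split()))
    return {word: c / len(documents) for word, c in counts.items()}


def z_scores(clusters):
    """Z-score of each word's frequency in a cluster relative to all clusters.

    Assumed definition: z = (f_cluster - mean over clusters) / std over clusters.
    """
    freqs = {name: word_frequencies(docs) for name, docs in clusters.items()}
    vocabulary = set().union(*freqs.values())
    scores = {name: {} for name in clusters}
    for word in vocabulary:
        values = [freqs[name].get(word, 0.0) for name in clusters]
        mu, sigma = mean(values), pstdev(values)
        for name in clusters:
            f = freqs[name].get(word, 0.0)
            # A word that is equally frequent everywhere carries no signal.
            scores[name][word] = (f - mu) / sigma if sigma > 0 else 0.0
    return scores


# Toy literature clusters, invented for illustration only.
clusters = {
    "cluster_1": ["ribosome biogenesis protein", "ribosome subunit protein"],
    "cluster_2": ["heat shock protein", "shock response protein"],
    "cluster_3": ["glucose transport protein", "hexose transport protein"],
}
scores = z_scores(clusters)
```

In this toy data the word "protein" occurs in every document and receives a score of zero, while "ribosome" scores positively only in the first cluster.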

Since biological information is often expressed in composite terms such as ‘DNA polymerase’ and ‘RNA polymerase’, these constructions are detected by analyzing the frequency of these co-occurrences in comparison to the expected frequency of the individual component words.
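One simple way to make this comparison concrete is the ratio between the observed frequency of a word pair and the frequency expected if the two words occurred independently. The sketch below assumes that particular ratio, which the text does not specify, and uses hypothetical function names and toy sentences.

```python
from collections import Counter


def composite_term_ratios(sentences):
    """Observed/expected frequency ratio for every adjacent word pair.

    Expected frequency assumes the two words occur independently, i.e. it is
    the product of their individual relative frequencies.
    """
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    ratios = {}
    for (w1, w2), count in bigrams.items():
        observed = count / n_bi
        expected = (unigrams[w1] / n_uni) * (unigrams[w2] / n_uni)
        ratios[(w1, w2)] = observed / expected
    return ratios


# Toy sentences, invented for illustration only.
sentences = [
    "dna polymerase copies the template",
    "rna polymerase reads the template",
    "the dna polymerase binds the primer",
]
ratios = composite_term_ratios(sentences)
```

A ratio well above 1 suggests that the pair behaves as a unit. On a corpus this small, pairs seen only once also score highly, so a minimum count threshold would be needed in practice.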

The results of the Geisha system have been extensively compared to the annotations provided by databases and human experts, showing how in many cases Geisha was able to extract relevant or alternative information to that provided by other sources.


Suiseki was designed as an integrated system for the extraction of protein interactions from Medline abstracts [5]. The system includes some features of computational linguistics methods (such as text tagging) and some from statistical methods (such as the use of statistics relating to word occurrence), together with a collection of pre-established frames that capture possible ways of expressing interactions in biological text.

The steps followed by Suiseki are: (a) download of Medline abstracts (or local access); (b) part-of-speech tagging for the detection of protein names; (c) detection of protein name synonyms; (d) determination of verbs indicating a relationship (interaction keywords); (e) extraction of protein-protein interactions with a minimal set of nineteen frames; (f) exclusion of negative interactions with a set of specific frames; (g) use of the frame scores (proportional to their accuracy in describing interactions), and the number of sentences matching each frame, to obtain scores for each interaction; (h) finally, the interactions are stored in a database of relationships and represented via a dynamic web interface that allows simultaneous manipulation of the underlying information (names and interactions), access to the text sources (sentences corresponding to the interactions and Medline data) and manipulation by human experts.
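Step (g) can be pictured with the following minimal sketch. The text does not state how frame scores and sentence counts are combined, so the sketch assumes the simplest option: each matching sentence contributes the score of the frame it matched, and contributions are summed per protein pair. The frame identifiers, score values and protein names are hypothetical.

```python
from collections import defaultdict


def score_interactions(matches, frame_scores):
    """Sum frame scores over all sentence matches for each unordered pair.

    `matches` holds one (protein_a, protein_b, frame_id) tuple per matching
    sentence; summation is an assumed combination rule, not the published one.
    """
    totals = defaultdict(float)
    for protein_a, protein_b, frame_id in matches:
        pair = tuple(sorted((protein_a, protein_b)))
        totals[pair] += frame_scores[frame_id]
    return dict(totals)


# Hypothetical frame scores and matches, invented for illustration only.
frame_scores = {"general": 0.4, "complex_of": 0.9}
matches = [
    ("ProtA", "ProtB", "general"),
    ("ProtB", "ProtA", "complex_of"),
    ("ProtA", "ProtC", "general"),
]
interaction_scores = score_interactions(matches, frame_scores)
```

Under this rule an interaction supported by several sentences, or by a more accurate frame, ranks above one supported by a single match of a general frame.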

The frames used by the Suiseki system include general patterns such as ‘protein A -particle indicating interaction- protein B’, where various particles can indicate interaction (e.g. binds, phosphorylates), as well as more specialized patterns (e.g. ‘phosphorylation/binding/ ... of protein A ... by protein B’; ‘complex of protein A and protein B’) that can be more accurate than general patterns, at the expense of covering a smaller number of cases.
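The three example patterns can be approximated with regular expressions, as in the minimal sketch below. This is not the Suiseki frame set: it assumes that protein names are already known, and the function names, frame identifiers and word lists are hypothetical.

```python
import re


def build_frames(protein_names, verbs, nouns):
    """Compile three illustrative frames as regular expressions."""
    # Longer names first, so that a longer name is preferred over its prefix.
    ordered = sorted(protein_names, key=len, reverse=True)
    name = "|".join(re.escape(n) for n in ordered)
    verb = "|".join(re.escape(v) for v in verbs)
    noun = "|".join(re.escape(n) for n in nouns)
    return {
        # 'protein A -particle indicating interaction- protein B'
        "a_verb_b": re.compile(
            rf"\b(?P<a>{name})\b\s+(?:{verb})\s+\b(?P<b>{name})\b"
        ),
        # 'phosphorylation/binding ... of protein A ... by protein B'
        "noun_of_a_by_b": re.compile(
            rf"\b(?:{noun})\s+of\s+(?P<a>{name})\b.*?\bby\s+(?P<b>{name})\b"
        ),
        # 'complex of protein A and protein B'
        "complex_of_a_and_b": re.compile(
            rf"\bcomplex\s+of\s+(?P<a>{name})\s+and\s+(?P<b>{name})\b"
        ),
    }


def extract_interactions(sentence, frames):
    """Return (protein A, protein B, frame id) for every frame match."""
    found = []
    for frame_id, pattern in frames.items():
        for match in pattern.finditer(sentence):
            found.append((match.group("a"), match.group("b"), frame_id))
    return found


# Placeholder names and word lists, invented for illustration only.
frames = build_frames(
    ["ProtA", "ProtB", "ProtC"],
    verbs=["binds", "phosphorylates"],
    nouns=["phosphorylation", "binding"],
)
hits = extract_interactions("We observed phosphorylation of ProtC by ProtA.", frames)
```

The specialized frames require more surrounding context than the general one, which reflects the trade-off described above: they match fewer sentences but are less likely to match by accident.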

The Suiseki system requires minimal user intervention, and can easily be applied to large collections of written text.

To evaluate their systems, various authors have employed sets of sentences that have previously been analyzed manually. More recently, it has become clear that evaluating the number of known interactions that can be retrieved automatically could provide more biologically realistic information. In particular, the detection of gene and protein names remains the main problem in this field, and evaluation using a set of sentences tends to avoid this problem by simply assuming that the names are already known. In contrast, evaluations carried out using large collections of known interactions directly involve the problem of relating the names used in databases to those used in the literature. For example, in the case of the interactions held in the DIP database, more than 40% of the entries contain names that could not be found in any Medline entry [5], and this gives some indication of the severity of the protein name problem.
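The kind of measurement behind the DIP figure can be sketched as follows. The code is a minimal illustration, not the procedure used in [5]: it assumes exact, case-insensitive token matching, and the function names, database entries and abstracts are hypothetical.

```python
import re


def tokenize(text):
    """Lowercased alphanumeric tokens (hyphens kept), as a crude assumption."""
    return set(re.findall(r"[a-z0-9\-]+", text.lower()))


def fraction_with_missing_names(db_entries, abstracts):
    """Fraction of database entries with at least one name absent from all abstracts."""
    vocabulary = set()
    for text in abstracts:
        vocabulary |= tokenize(text)
    missing = sum(
        1
        for entry in db_entries
        if any(name.lower() not in vocabulary for name in entry)
    )
    return missing / len(db_entries)


# Placeholder database entries and abstracts, invented for illustration only.
db_entries = [("ProtA", "ProtB"), ("ProtA", "YXX001"), ("ProtC", "ProtD")]
abstracts = ["ProtA binds ProtB.", "ProtC is nuclear."]
missing_fraction = fraction_with_missing_names(db_entries, abstracts)
```

Exact matching of this kind overestimates the problem whenever the literature uses a synonym or spelling variant of the database name, which is why synonym detection is a separate step in systems such as Suiseki.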

In the case of the Suiseki system, for random sets of sentences the correct interactions were extracted in 30% of cases, but more importantly, for the most frequent interactions as many as 80% of the interactions extracted were correct.


