The Emerging Landscape of Bioinformatics Software Systems



0018-9162/02/$17.00 © 2002 IEEE July 2002 41

GUEST EDITORS’ INTRODUCTION

Fusing computing and biology expertise, bioinformatics software provides a powerful tool for organizing and mining the vast amounts of data genetics researchers are accumulating.

Lenwood S. Heath
Naren Ramakrishnan
Virginia Tech

Biology has increasingly become a data-driven science. The emergence of high-throughput data acquisition technologies for investigating biological phenomena has been an important factor in this process. For example, the recent completion of the human, rice, and flowering plant Arabidopsis thaliana genome maps signifies a major milestone in data acquisition.

The development of novel algorithms and databases to catalog, organize, harness, and mine the increasing amount of data such research efforts generate has also been important. The potential for scientists to infer significant biological knowledge computationally from a desktop is both appealing and real.

In this issue, we gather perspectives, articles, and reports to showcase the emergence of bioinformatics software as a discipline in its own right. The “Molecular Biology for Computer Scientists” sidebar provides some pertinent background information.

BIOINFORMATICS SOFTWARE SYSTEMS THEMES

As life scientists and computational scientists interact to create useful bioinformatics software systems, several themes or lessons recur. We identify seven themes:

• the nature of biological data;
• data storage, analysis, and retrieval;
• computational modeling and simulation;
• biologically meaningful information integration;
• data mining;
• image processing and visualization; and
• closing the loop.

Each of this issue’s cover features touches on one or several of these themes.

The nature of biological data

The life sciences literature presents biological results based on carefully collected, analyzed, and vetted data that researchers can use with high confidence. Everyday bioinformatics, however, deals with raw data collected from recently completed experiments in the form of images, charts, or numbers and with sequence data collected from a wide variety of online databases. We should regard such data with some skepticism.

Any significant collection of raw experimental data includes experimental errors, whether systematic or random. Obtaining statistically meaningful data requires careful experimental design and replication of results. On the other hand, experiments are expensive in terms of professional labor, reagents, equipment, and time. As a consequence, biological data is always incomplete.


For science to progress, we must combine inductive reasoning based on existing biological information with new experimental results. Indeed, some biological data is inherently unknowable, such as the genomes of most extinct species. Bioinformatics software system developers must always be aware that some uncertainty exists in any results the system generates. Thus, characterizing or quantifying the uncertainty is worth consideration.
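One common way to quantify uncertainty in replicated measurements is to report a mean together with its standard error. The sketch below is only an illustration under that assumption; the function name and the replicate values are hypothetical and do not come from any system described in this issue.

```python
import math
import statistics


def mean_with_standard_error(replicates):
    """Return the sample mean and the standard error of the mean."""
    mean = statistics.mean(replicates)
    # Sample standard deviation divided by sqrt(n) gives the standard error.
    sem = statistics.stdev(replicates) / math.sqrt(len(replicates))
    return mean, sem


# Hypothetical replicate measurements of one quantity.
mean, sem = mean_with_standard_error([2.0, 4.0, 6.0])
print(f"{mean:.2f} +/- {sem:.2f}")
```

Reporting the pair rather than the mean alone lets downstream analysis weigh noisy measurements less heavily.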

Data storage, analysis, and retrieval

The high-volume, data-driven nature of modern experimental biology has led to the creation of many databases that contain genomes, protein sequences, gene expression data, and other data types. Researchers often key their retrieval of information from such databases primarily on one characteristic, such as the nucleotide or amino acid sequence, organism, gene annotation, or protein name.

Answering queries often involves some form of data analysis, such as statistical significance testing, clustering, or sequence homology search. The Basic Local Alignment Search Tool (BLAST) is typically the first bioinformatics tool a biologist uses when examining a new DNA or protein sequence. BLAST compares the new sequence to all sequences in the database to find the most similar known sequences.
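The idea of ranking database sequences by similarity to a query can be sketched with a much simpler stand-in than BLAST itself: counting shared k-mers (short substrings). This is not the BLAST algorithm, only a toy illustration; the database contents and function names are hypothetical.

```python
def kmers(seq, k=3):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def rank_by_shared_kmers(query, database, k=3):
    """Rank database entries by how many k-mers they share with the query."""
    query_kmers = kmers(query, k)
    scores = {
        name: len(query_kmers & kmers(seq, k))
        for name, seq in database.items()
    }
    # Highest score first; ties keep insertion order because sorted() is stable.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy database of DNA sequences.
database = {
    "seq_a": "ATGGCAACCGGT",
    "seq_b": "TTTTCCCCGGGG",
    "seq_c": "ATGGCTACCGAT",
}
ranking = rank_by_shared_kmers("ATGGCAACC", database)
print(ranking)
```

A production search tool would also score alignments and estimate statistical significance, which this sketch omits.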

Computational modeling and simulation

Alongside the generation of experimental data, computational simulation plays a central role in understanding many biological processes. For example, researchers can study processes such as cell division by modeling reaction networks as a set of simultaneous differential equations.¹ They can then use tools from numerical and scientific computing to address questions such as “At what rate does this enzyme catalyze cell division?” The 1 March 2002 issue of Science showcases major developments in systems biology, many of which rely on simulation.
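As a minimal sketch of treating a reaction network as simultaneous differential equations, consider a hypothetical reversible reaction A ⇌ B with dA/dt = -k_f·A + k_r·B and dB/dt = k_f·A - k_r·B, integrated with the forward Euler method. The reaction, rate constants, and step size are assumptions chosen for illustration; they are not taken from the cell-division models cited above.

```python
def simulate(a0, b0, k_forward, k_reverse, dt=0.001, steps=20000):
    """Integrate A <-> B with forward Euler and return final concentrations."""
    a, b = a0, b0
    for _ in range(steps):
        # Net flux from A to B at the current concentrations.
        flux = k_forward * a - k_reverse * b
        a -= flux * dt
        b += flux * dt
    return a, b


# Hypothetical parameters: all material starts as A.
a_final, b_final = simulate(a0=1.0, b0=0.0, k_forward=2.0, k_reverse=1.0)
print(a_final, b_final)
```

At equilibrium the forward and reverse fluxes balance, so with these rate constants the system settles near A = 1/3 and B = 2/3. Realistic networks are usually stiff and larger, and would call for an adaptive solver rather than fixed-step Euler.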

Molecular Biology for Computer Scientists

Every living organism can be properly classified into one of three domains:¹ Bacteria, Archaea, and Eucarya. A eukaryote (a member of the Eucarya domain) is a single- or multicellular organism, each cell of which has a bounding, semipermeable membrane and internal membrane-bound components, called organelles. The nucleus is one such organelle. It contains the organism’s primary genetic information and the mechanism for genetic replication.

Plants and animals are eukaryotes. In contrast, every organism belonging to the Bacteria or Archaea domains is a prokaryote, a single-celled organism bounded by a semipermeable membrane that contains no internal organelles. In particular, a prokaryote has no nucleus. The first prokaryotic cells appeared on earth 3.5 billion years ago. Bacteria and Archaea evolved separately from a common ancestor approximately 3 billion years ago. Eucarya evolved from Archaea about 1.8 billion years ago.

Genomes and transcription

Chromosomes, large molecules of double-stranded deoxyribonucleic acid (DNA) and protein,²,³ carry an organism’s genetic information. Information on a DNA molecule takes the form of a linear code of four bases: A for adenine, C for cytosine, G for guanine, and T for thymine.

Pairs of complementary bases form weak intermolecular chemical bonds with each other. A complements T, while C complements G. In a chromosome, the two DNA strands each carry the same information, but one strand has the bases that complement the other strand’s bases, in reverse order. For example, once reversed and complemented, the code ATGGCAACC becomes GGTTGCCAT, and the chemical bonding can be represented as follows:

DNA strand 1    5' A T G G C A A C C 3'
DNA strand 2    3' T A C C G T T G G 5'

DNA strands are oriented from a 5' end to a 3' end, so the codes on the two strands read ATGGCAACC and GGTTGCCAT.

An organism’s genome consists of the entire DNA contents for all the chromosomes in any of its cells. A chromosome contains the organism’s genes, which are particular subsequences of the linear code. Each gene includes regulatory sequences and an open reading frame. The ORF is the sequence within the gene that, in a process called transcription, is copied from the ORF on the coding strand of DNA to molecules called messenger RNA (mRNA).

Ribonucleic acid (RNA), a molecule similar to DNA, can also carry a linear code of four bases: A, C, G, and U for uracil. U replaces T as the base that complements A. Transcription occurs when an mRNA molecule is manufactured from the noncoding strand of DNA, as in this example:

mRNA                5' A U G G C A A C C 3'
Noncoding strand    3' T A C C G T T G G 5'

The mRNA molecule therefore carries an exact copy of the ORF from the coding strand. Post-transcriptional processing may splice out some subsequences of the mRNA molecule, called introns. Eukaryotes and Archaea have introns, but Bacteria do not.

The protein factory

An mRNA molecule travels out of the nucleus into the cytosol, where the molecule’s code determines how proteins are manufactured. The mRNA molecule’s sequence is decoded to specify a sequence of 20 amino acids that form the protein’s primary structure. A codon, a sequence of three bases drawn from A, C, G, and U, codes for an amino acid according to the genetic code.
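The sidebar's strand-pairing and transcription rules can be sketched in a few lines, using its example sequence ATGGCAACC. The function names are illustrative choices, and the sketch assumes input strings contain only the uppercase letters A, C, G, and T.

```python
# Base-pairing rule from the sidebar: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(dna):
    """Return the code on the opposite strand, read 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(dna))


def transcribe(coding_strand):
    """Return the mRNA code: the coding strand with U in place of T."""
    return coding_strand.replace("T", "U")


coding = "ATGGCAACC"
other_strand = reverse_complement(coding)
mrna = transcribe(coding)
print(other_strand, mrna)
```

Transcribing directly from the coding strand is a shortcut: the cell builds mRNA by complementing the noncoding strand, which yields the same code.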


Biologically meaningful information integration

A tremendous quantity of valuable heterogeneous biological information can be found online, and its continued rapid growth is a certainty. Examples of information sources include genomic sequences, gene sequences, expressed sequence tags, protein sequences, microarray experiment images and raw data sets, 2D protein gels, protein domains, and the literature of genetics, biochemistry, and molecular biology.

Researchers cannot answer the cutting-edge questions in biology with information from just two or three of these sources. A global database that integrates all these sources for all purposes is a pipe dream, given that we cannot predict in advance the myriad needs of biologists for accessing the information. However, resources dedicated to restricted domains, for example, BioCarta (http://www.biocarta.com/), an encyclopedia of regulatory pathways, and CyanoBase (http://www.kazusa.or.jp/cyano), a Web resource for cyanobacterial research, have recently become available. Thus, bioinformatics software systems increasingly face the task of integrating information from diverse sources. Further, their developers must address the challenge of obtaining biologically meaningful and useful results from those many sources.

Data mining

Bioinformatics systems benefit from the use of data mining strategies to locate interesting and pertinent relationships within massive amounts of information. For example, data mining methods can ascertain and summarize the set of genes responding to a certain level of stress in an organism. Researchers can use graphical models such as Bayesian networks and relational algorithms such as inductive logic programming to mine such gene sets and model a gene expression network. Even a cursory glance through

For example, AUG codes for the amino acid methionine (M), GCA codes for alanine (A), and ACC codes for threonine (T). Hence, the example mRNA fragment AUGGCAACC codes for the amino acid sequence MAT.
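The codon-by-codon decoding just described can be sketched as a table lookup. The table below is deliberately partial: it holds only the three codons named in the sidebar, whereas the full genetic code has 64 entries. The function name is an illustrative choice.

```python
# Partial codon table: only the three codons named in the sidebar.
CODON_TABLE = {"AUG": "M", "GCA": "A", "ACC": "T"}


def translate(mrna):
    """Decode an mRNA string three bases at a time into amino acid letters."""
    protein = []
    # Ignore any trailing bases that do not form a complete codon.
    usable_length = len(mrna) - len(mrna) % 3
    for start in range(0, usable_length, 3):
        codon = mrna[start:start + 3]
        if codon not in CODON_TABLE:
            raise KeyError(f"codon {codon} is not in this partial table")
        protein.append(CODON_TABLE[codon])
    return "".join(protein)


peptide = translate("AUGGCAACC")
print(peptide)
```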

The translation process uses an mRNA molecule’s genetic code to manufacture a protein molecule. Proteins are molecules essential to cell function. A protein may be an enzyme that catalyzes a biochemical reaction, a cell structural component, a transducer that responds to the environment, or a regulatory element that empowers or inhibits some process within the cell such as metabolism, transcription, or translation. Once manufactured, a protein folds into a three-dimensional shape whose geometric and chemical properties determine the protein’s function.

Not all genes are active within a cell at any given time. In particular, transcription factors, molecules that differentially bind to the genes’ regulatory sequences, control or regulate a gene’s transcription. A gene that has been transcribed into mRNA is said to be expressed. Within a cell, hundreds or thousands of genes may be expressed at any time. In principle, to determine whether a gene is currently expressed and to what extent, researchers can measure how many mRNA molecules with the same coding sequence as the gene there are at any given time.

Evolving technologies

Two rapidly evolving technologies account for much of the burgeoning supply of basic biological data.⁴

First, sequencing can determine the sequence of bases in a DNA molecule that is a few hundred to a few thousand bases long. Researchers can cut a large DNA molecule such as a chromosome into overlapping fragments of manageable sizes, sequence all the fragments, and reconstruct the original molecule computationally. Sequencing and the associated computations are now so sophisticated that we can reconstruct a small genome a few million bases long in a few days.
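The reconstruction step can be illustrated with a greedy overlap merge: repeatedly join the two fragments that share the longest suffix-to-prefix overlap. This is a toy sketch under idealized assumptions (error-free fragments, no repeats); it is not the method of any particular assembler, and the fragments and function names are hypothetical.

```python
def overlap(left, right, min_len=3):
    """Length of the longest suffix of left that equals a prefix of right."""
    for size in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(fragments, min_len=3):
    """Merge fragments pairwise by largest overlap until none overlap."""
    frags = list(fragments)
    while len(frags) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, left in enumerate(frags):
            for j, right in enumerate(frags):
                if i != j:
                    size = overlap(left, right, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # Remaining fragments do not overlap; leave them separate.
        merged = frags[best_i] + frags[best_j][best_size:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)
    return frags


# Hypothetical overlapping fragments of ATGGCAACCTGATTCGGA, out of order.
contigs = greedy_assemble(["CTGATTCGGA", "ATGGCAACC", "GCAACCTGATT"])
print(contigs)
```

Real genomes contain repeats and sequencing errors, so greedy merging alone can produce misassemblies.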

Second, microarrays provide a method for identifying the expression of many or even all of an organism’s genes in parallel at any given time.⁵ Biologists can now obtain a snapshot or pattern of gene expression corresponding to the organism’s response to a particular experimental condition. Unlike a genome, which provides only static sequence information, microarray experiments produce gene expression patterns that provide dynamic information about cell function. This information is essential to investigating complex interactions within the cell.

Through mutations in genomes, new species of organisms evolve. If we regard the first species of Earth’s one-celled organisms as the root of a binary tree, all the species that currently exist or ever have existed form an evolutionary or phylogenetic tree. Because two species in close proximity on an evolutionary tree have genomes that are close in sequence, researchers can, and indeed must, study DNA and protein sequences to reconstruct phylogenetic relationships among species.
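One simple way to make "close in sequence" concrete is the Hamming distance, the number of positions at which two aligned, equal-length sequences differ. The sketch below computes pairwise distances and picks the closest pair. The species names and sequences are hypothetical, and real phylogenetic methods use far richer evolutionary models.

```python
from itertools import combinations


def hamming(a, b):
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


# Hypothetical aligned sequences for three species.
sequences = {
    "species_1": "ATGGCAACC",
    "species_2": "ATGGCTACC",
    "species_3": "TTGACTAGC",
}
distances = {
    (first, second): hamming(sequences[first], sequences[second])
    for first, second in combinations(sorted(sequences), 2)
}
closest = min(distances, key=distances.get)
print(distances, closest)
```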

References

1. C.R. Woese, O. Kandler, and M.L. Wheelis, “Towards a Natural System of Organisms: Proposal for the Domains Archaea, Bacteria, and Eucarya,” Proc. Nat’l Academy of Sciences USA, Nat’l Academy of Sciences, Washington, D.C., vol. 87, no. 12, 1990, pp. 4576-4579.
2. B. Alberts et al., Essential Cell Biology, Garland, New York, 1998.
3. G.M. Cooper, The Cell: A Molecular Approach, 2nd ed., ASM Press, Washington, DC, 2000.
4. G. Gibson and S.V. Muse, A Primer of Genome Science, Sinauer, Sunderland, Mass., 2002.
5. L.S. Heath et al., “Studying the Functional Genomics of Stress Responses in Loblolly Pine with the Expresso Microarray Experiment Management System,” Comparative and Functional Genomics, vol. 3, no. 3, June 2002, pp. 226-243.


the literature in journals such as Bioinformatics reveals the persistent role of data mining in experimental biology. Integrating data mining within the context of experimental investigations is central to bioinformatics software.

Image processing and visualization

Many results in experimental biology first appear in image form: photos of organisms, cells, or gels, or microarray scans. As the quantity of these results grows, automatic extraction of features and meaning from experimental images becomes critical.

At the other end of the data pipeline, naive 2D or 3D visualizations alone are inadequate for exploring bioinformatic data. Biologists need a visual environment that facilitates exploring high-dimensional data dependent on many parameters.

Closing the loop

Bioinformatics software must support the strongly iterative and interactive nature of biology research. Biologists typically revise and redesign experiments based on results from previous experiments. Providing feedback to earlier stages of an experiment based on downstream data (for example, to reorganize a microarray layout or alter dye concentrations) is central to improving the efficiency of biological investigation.

IN THIS ISSUE

The best bioinformatics software systems address the problems in bioinformatics within a context of

• insights into the computational complexities of those problems, and
• sophisticated knowledge of biology and current experimental technologies.

In this spirit, Mihai Pop, Steven L. Salzberg, and Martin Shumway present a tantalizing view of the development of sequence assembly systems for entire genomes. These systems must recognize the strengths and drawbacks of experimental technology and the challenging nature of real genomes, such as the existence of long tandem repeats.

In “Genome Sequence Assembly: Algorithmic Issues and Practical Considerations,” the authors describe the techniques brought to bear on the computational issues of sequence assembly, including the theory of computation as it applies to NP-hardness, graph theory, and combinatorial algorithms; quality-assessment statistics; and heuristics that support tradeoffs among genome coverage, the number of misassemblies, and the number of contigs (a contig is a group of overlapping regions of a genome). Sequence assembly is typical of most bioinformatics problems in that the correct answer is perhaps either unknowable or only obtainable by paying the high cost of manual intervention.

The Assembling the Tree of Life (ATOL) project challenges biologists and computer scientists to use sequence and other biological information to determine the evolutionary relationships among existing species. ATOL further challenges researchers to represent these relationships in a tremendous evolutionary, or phylogenetic, tree. In “Toward New Software for Computational Phylogenetics,” Bernard M.E. Moret, Li-San Wang, and Tandy Warnow present significant insights into the nature of ATOL’s computational challenges and describe their progress in addressing those challenges.

Algorithms for reconstructing phylogenetic trees from sequences must scale to very large sequence sets. Moret and colleagues have thus developed disk-covering algorithms as a means for accomplishing scaling. High-performance algorithm engineering uses algorithmic and implementation savvy to produce highly efficient applications for challenging computational problems. The authors’ Genome Rearrangement Analysis using Parsimony and other Phylogenetic Algorithms (GRAPPA) is a system that provides an excellent example of high-performance engineering and can teach valuable lessons to bioinformaticians.

In “BioSig: An Imaging Bioinformatic System for Studying Phenomics,” Bahram Parvin and colleagues describe a system for archiving and interpreting microscopic images of small groups of cells drawn from mice and treated with ionizing radiation. Part of interpreting such an image involves segregating organelles, such as the nuclei, from the remainder of the image. Even smaller features, such as the chromatin (the chromosomes and associated proteins present in the nucleus when the cell is not reproducing), should also be identified as distinct from noise. The BioSig system for cell phenomics (the visual characteristics of a control or treated cell) can ultimately lead to predicting the effects of ionizing radiation in other mouse, and human, organs.

Meeting long-term bioinformatics goals requires reconciling information collected from studying biological phenomena at multiple scales and using multiple modes of investigation. For example, researchers can study biological processes at the DNA, mRNA, protein, enzyme, pathway, reaction network, or physiology levels. Each level gives a different view of the underlying mechanism, but together they help establish the basis for answering biological questions.

In “A Random Walk Down the Genomes: DNA Evolution in Valis,” Bud Mishra and colleagues present the Valis system, which prototypes bioinformatics applications, and describe its use for studying cellular events in relation to DNA sequence evolution. This article, which comes closest to our closing-the-loop theme, also describes a sophisticated modeling of sequence evolution. The authors cover the software design in detail and also describe their system’s modeling and computational capabilities. Valis-like systems have the potential to model large-scale genomic processes, a Grand Challenge for bioinformatics.

BioSig and Valis constitute integrated problem-solving environments² for bioinformatics applications. These software systems provide all the computational facilities necessary for solving a target class of problems.

From performing functional studies on genes or gene families in isolation, research has progressed to studying all the genes in a given organism simultaneously. Microarray bioinformatics has aided in this massive parallelization of experimental biology.

In “Interactively Exploring Hierarchical Clustering Results,” Jinwook Seo and Bernard Shneiderman present an interactive visualization system for investigating data from microarray experiments. The authors introduce the basics of microarray technology for nonspecialists and describe how to achieve user-driven interactive data exploration in relation to a hierarchical clustering algorithm. Interactive exploration and visualization will become increasingly important as the dimensionality and diversity of the underlying data increase.
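For readers unfamiliar with hierarchical clustering, the following is a minimal agglomerative sketch over toy expression profiles. The choice of single linkage with Euclidean distance is an assumption made for illustration, as are the gene names and values; it is not a description of the algorithm or data used in the article by Seo and Shneiderman.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def single_linkage(profiles):
    """Agglomerative clustering; returns the merges in the order performed."""
    clusters = [frozenset([name]) for name in profiles]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                dist = min(
                    euclidean(profiles[a], profiles[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        dist, i, j = best
        merges.append((sorted(clusters[i]), sorted(clusters[j]), dist))
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Hypothetical expression levels for four genes under three conditions.
profiles = {
    "gene_a": [1.0, 2.0, 3.0],
    "gene_b": [1.1, 2.1, 2.9],
    "gene_c": [5.0, 0.5, 0.2],
    "gene_d": [5.2, 0.4, 0.1],
}
merges = single_linkage(profiles)
for left, right, dist in merges:
    print(left, right, round(dist, 3))
```

The recorded merge order and distances are what a dendrogram draws, which is the structure an interactive tool lets the user cut at different heights.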

In addition to these theme articles, this issue of Computer also features two perspective articles on bioinformatics. In “Computers Are from Mars, Organisms Are from Venus,” Junhyong Kim examines the new relationships between biology and computer science. Kim identifies benefits that each field can draw from the other, as well as the challenges in interdisciplinary research. The essay is an excellent starting point for new researchers, and it provides the necessary background to appreciate the five theme articles.

In “The Blueprint for Life?” Dror G. Feitelson and Millet Treinin challenge the assertion that DNA encodes everything needed to understand life. The authors examine how information is interpreted, transported, and communicated in biological systems, and conclude that there is more about these processes than is encoded in DNA.

We received 20 submissions for this special issue, which we forwarded to reviewers from both biology and computer science backgrounds. We thank these reviewers for their timely responses and the authors for responding quickly to the reviewers’ comments and suggestions.

Because this issue does not aim to provide comprehensive coverage of bioinformatics software, it does not address some important biological problems such as pathway modeling.³ Likewise, it excludes several novel computational techniques, especially from data mining. The content does, however, reveal a glimpse into the richness of bioinformatics software, providing a snapshot of its continuing evolution.

References

1. J.J. Tyson, K. Chen, and B. Novak, “Network Dynamics and Cell Physiology,” Nature Reviews Molecular Cell Biology, Dec. 2001, pp. 908-916.
2. E. Gallopoulos, E.N. Houstis, and J.R. Rice, “Computer as Thinker/Doer: Problem-Solving Environments for Computational Science,” IEEE Computational Science and Engineering, vol. 1, no. 2, 1994, pp. 11-23.
3. P.D. Karp, “Pathway Databases: A Case Study in Computational Symbolic Theories,” Science, vol. 293, 2001, pp. 2040-2044.

Lenwood S. Heath is an associate professor of computer science at Virginia Tech. His research interests include algorithms, theoretical computer science, graph theory, bioinformatics, computational biology, and symbolic computation. Heath received a PhD in computer science from the University of North Carolina at Chapel Hill. He is a member of the IEEE, the ACM, SIGACT, and SIAM. Contact him at [email protected].

Naren Ramakrishnan is an assistant professor of computer science at Virginia Tech. His research interests include problem-solving environments, mining scientific data, and personalization. Ramakrishnan received a PhD in computer sciences from Purdue University. He is a member of the IEEE Computer Society, the ACM, and the AAAI. Contact him at [email protected].
