Biocomputing: impact of the genomic revolution
Rebecca Lawrence, News & Features Editor



The annual meeting of the Pacific Symposium on Biocomputing, held in Mauna Lani (HI, USA), was greatly oversubscribed this year owing to the growing importance of computational biology and bioinformatics in the post-genomic era. The meeting covered a wide range of topics, from DNA and protein structure, protein–DNA interactions and expression, protein evolution, and human genome variation to phylogenetics, high-performance computing, natural language processing and bioethics. This report provides an overview of the main talks relating to drug discovery. The full papers presented at the meeting are available in the symposium book [1].

The keynote speaker, David Haussler, provided an overview of the effort to produce the working draft of the human genome sequence. He admitted that there will be some cross-contamination present in the sequence, as all the data that pass the NCBI filters are used in the final assembly. He also pointed out that the bioinformatics tools used are not perfect and that misalignments have created some artefactual duplication. He reiterated the plan to complete the sequence by 2003, and said that the next steps will be to identify the complete set of human genes, explore gene regulation, investigate human, mammalian and vertebrate diversity, and connect genomic data to clinical data.

Comparing sequences
A range of different computational approaches for sequence comparison was discussed. William Martins (Department of Electrical and Computer Engineering, University of Delaware, Newark, DE, USA) and colleagues have developed a dynamic programming algorithm that uses a rigorous mathematical approach based on a multi-threading system. This system can be scaled up and actually performs better with longer sequences, providing the capability to compare whole genomes. A further advantage of this method is that it enables the visual display of the matching sections of the genomes.
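
The talk concentrated on the multithreaded implementation and its scaling rather than on a new recurrence; for orientation, the sketch below gives a minimal serial Python version of the classic dynamic programming recurrence that such systems accelerate, keeping only the current and previous rows of the score matrix. The function name and scoring values are illustrative and are not taken from the Martins et al. system.

```python
# Minimal serial sketch of the dynamic programming recurrence that
# multithreaded whole-genome comparison systems parallelise. The scoring
# values and function name are illustrative, not taken from the talk.

def nw_score(a: str, b: str, match=1, mismatch=-1, gap=-2) -> int:
    """Needleman–Wunsch global alignment score between sequences a and b."""
    prev = [j * gap for j in range(len(b) + 1)]        # row 0: all-gap prefixes
    for i in range(1, len(a) + 1):
        curr = [i * gap]                               # column 0: all-gap prefix
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag,                      # a[i-1] aligned with b[j-1]
                            prev[j] + gap,             # a[i-1] aligned with a gap
                            curr[j - 1] + gap))        # b[j-1] aligned with a gap
        prev = curr
    return prev[-1]

print(nw_score("GATTACA", "GCATGCA"))                  # small worked example
```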

A new algorithm based on Monte Carlo multiple sequence alignment techniques was used by C. Guda (San Diego Supercomputer Center, University of California San Diego, La Jolla, CA, USA) and colleagues. Four different types of moves were designed to generate random changes in the alignment: shifting (shifts residues in one randomly chosen structure); expanding (expands the alignment block by acquiring one residue); shrinking (shrinks the alignment block by removing one residue); and splitting-and-shrinking (splits longer blocks in two and then shrinks one of the new blocks). Changes in distance-based scores were examined for each trial move, and changes were accepted only if the alignment score increased. This process was repeated until the alignment scores converged. Each step examines changes of only one residue at a time, and many attendees were concerned that more global changes, which could have significantly improved the score, might be missed.
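
The search strategy can be pictured with the short Python sketch below, in which the move operators, the distance-based scoring function and the convergence test are placeholders standing in for the structure-based details of the Guda et al. method.

```python
import random

# Hedged sketch of the accept-only-if-improved Monte Carlo loop described
# above. The move operators and the scoring function are placeholders: the
# real method (Guda et al.) works on structure-based alignment blocks and
# distance-based scores, which are only stubbed here.

def monte_carlo_align(alignment, score, moves, max_steps=10000, patience=500):
    """Repeatedly propose random single-residue moves; keep a move only if
    it improves the alignment score, and stop once no accepted move has
    been seen for `patience` consecutive trials."""
    best = score(alignment)
    stale = 0
    for _ in range(max_steps):
        move = random.choice(moves)        # shift / expand / shrink / split-and-shrink
        candidate = move(alignment)        # each move changes one residue at a time
        cand_score = score(candidate)
        if cand_score > best:              # accept only improving moves
            alignment, best, stale = candidate, cand_score, 0
        else:
            stale += 1
            if stale >= patience:          # crude convergence test
                break
    return alignment, best
```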

The use of Kestrel single-instruction multiple-data (SIMD) parallel processor methods for sequence analysis was discussed by Richard Hughey (Department of Computer Engineering, University of California Santa Cruz, CA, USA). Three-point pharmacophore fingerprinting, which examines the distances between each pair of atoms (or atom groups) within every possible triplet, produces trillions of atom-triplet calculations for large libraries and is therefore impractical to carry out on a serial processor. Hence, Hughey’s team have developed a parallel processing method that simultaneously processes multiple conformations for molecular fingerprinting. When tested on data produced by Affymetrix, the method was found to be 35-fold faster than standard molecular fingerprinting. The group are now trying to optimize the system further to enable storage of all the information on a single chip.
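
The serial sketch below illustrates why the calculation explodes: every triplet of feature points contributes a fingerprint key built from its three binned pairwise distances. The bin width, key scheme and input format (3D feature coordinates for a single conformation) are assumptions made for illustration and are not details of the Kestrel implementation, which runs this kind of loop across many conformations in parallel on SIMD hardware.

```python
from itertools import combinations
from math import dist

# Illustrative serial sketch of three-point pharmacophore fingerprinting:
# each triplet of feature points yields one key made of its three binned
# pairwise distances. Bin width, key scheme and input format are assumptions.

def triplet_fingerprint(points, bin_width=1.0):
    keys = set()
    for a, b, c in combinations(points, 3):
        # Bin the three pairwise distances and sort them so the key does not
        # depend on the order in which the triplet was enumerated.
        d = sorted(int(dist(p, q) / bin_width) for p, q in ((a, b), (a, c), (b, c)))
        keys.add(tuple(d))
    return keys

conformation = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (0.0, 2.1, 0.0), (1.0, 1.0, 1.0)]
print(triplet_fingerprint(conformation))
```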

Gene expression
A method that uses microarray techniques to analyse whole-genome gene expression was discussed by Xiaole Liu (Stanford Medical Informatics, Stanford University, Stanford, CA, USA). Liu and colleagues used a Gibbs sampling technique (BioProspector), which holds one sequence out of the motif, scores each possible segment of that sequence against the current motif, samples a segment and then updates the motif. This process is repeated until the complete motif is produced.

The original Gibbs sampler method relies on each sequence containing only one copy of the motif. Because some input sequences contain no copies of the motif while others contain many, the group has used two thresholds in the sampler. Every segment that scores above the high threshold is automatically added to the alignment. The low threshold starts at zero and is gradually increased; only the highest-scoring segment between the two thresholds is added to the final alignment.
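
A minimal sketch of the core Gibbs sampling step, for the simple case of exactly one motif copy per sequence, is given below; the two-threshold extension and BioProspector's actual scoring and pseudocount choices are not reproduced here.

```python
import random
from collections import Counter

# Minimal sketch of one-copy-per-sequence Gibbs motif sampling. The
# pseudocount and sampling scheme are illustrative, not BioProspector's.

def gibbs_motif(seqs, w, iters=2000, alphabet="ACGT"):
    starts = [random.randrange(len(s) - w + 1) for s in seqs]
    for _ in range(iters):
        i = random.randrange(len(seqs))                       # hold one sequence out
        others = [seqs[j][starts[j]:starts[j] + w] for j in range(len(seqs)) if j != i]
        pwm = [Counter(col) for col in zip(*others)]          # motif from the rest

        def seg_score(seg):
            p = 1.0
            for k, base in enumerate(seg):
                p *= (pwm[k][base] + 1) / (len(others) + len(alphabet))
            return p

        # Score every segment of the held-out sequence, sample one in
        # proportion to its score, and update that sequence's motif start.
        positions = range(len(seqs[i]) - w + 1)
        weights = [seg_score(seqs[i][p:p + w]) for p in positions]
        starts[i] = random.choices(list(positions), weights=weights)[0]
    return [s[p:p + w] for s, p in zip(seqs, starts)]
```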

To understand the mechanism of a class of genes, it is important to keep track not only of which motifs occur but also of how often they occur, their order and spacing, and the promoter classification. With this in mind, Bill Grundy (Department of Computer Science, Columbia Genome Center, Columbia University, New York, NY, USA) and colleagues have devised a supervised learning method for promoter region-based classification of genes. This involves discovering conserved motifs in training-set sequences, building a motif-based hidden Markov model, computing the gradients of the model likelihood with respect to its parameters for the training-set sequences, and then training a classifier on these gradients. Using this method for two classes of genes from the budding yeast, Saccharomyces cerevisiae, has enabled them to classify the promoters involved correctly.

Protein structure and evolution
This session was introduced by Richard Goldstein (Chemistry Department and Biophysics Research Division, University of Michigan, Ann Arbor, MI, USA), who highlighted that one of the key post-genomic steps will be to discover the properties of proteins and how these are determined by amino acid properties. One method is to change one amino acid at a time and examine how the protein structure changes, but this method cannot be used to learn the basis of the changes. The alternative is to examine variation and selection between individuals and then work backwards to elucidate the selective pressures acting on these proteins.

Goldstein’s team have developed a substitution model that considers the amino acid residues observed at each position and constructs a separate substitution matrix for every location. As the ‘rules’ that these locations follow are unknown, sites have to be assigned probabilistically. The resulting Hidden States Model can identify which locations in a protein have similar substitution rates. The main drawback of this approach is the large number of adjustable parameters it produces, which have to be determined simultaneously.

Their studies found that fast-changing sites experienced hydrophobic selective pressure on the outside of the protein but hydrophilic pressure on the inside, whereas slow-changing sites showed the reverse. Furthermore, for all secondary structures, hydrophobicity was the dominant factor for exposed residues, whereas the bulk index was dominant for buried residues. The strongest contributions of hydrophobicity came from exposed α-helices, followed by exposed β-sheets, and then turns and coils. It was suggested that it would also be interesting to look at groups of residues rather than just one residue at a time.

Keith Dunker (School of Molecular Biosciences, Washington State University, Pullman, WA, USA) described the compilation of three primary and one derivative database of intrinsically disordered proteins to investigate the determinants of protein order and disorder. The amino acid compositions of the four combined disordered databases were compared with those of a database of ordered proteins. The disordered databases were found to be significantly depleted in eight amino acids and significantly enriched in a further eight, suggesting that there are sets of order-promoting and disorder-promoting amino acids.
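
The composition comparison itself is simple to express; the sketch below (with placeholder inputs) computes the fractional amino acid composition of a disordered and an ordered sequence set and the relative enrichment or depletion of each residue type. A real analysis would also test whether each difference is statistically significant.

```python
from collections import Counter

# Illustrative sketch of the composition comparison described above; the
# input sequence sets are placeholders, not the databases from the talk.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(seqs):
    """Fractional amino acid composition of a set of sequences."""
    counts = Counter("".join(seqs))
    total = sum(counts[a] for a in AMINO_ACIDS)
    return {a: counts[a] / total for a in AMINO_ACIDS}

def enrichment(disordered, ordered):
    """Relative enrichment (positive) or depletion (negative) of each residue
    type in the disordered set compared with the ordered set."""
    dis, ordr = composition(disordered), composition(ordered)
    return {a: (dis[a] - ordr[a]) / ordr[a] for a in AMINO_ACIDS if ordr[a] > 0}
```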

The group also identified the top ten attributes of the amino acids that correlated as little as possible with each other and that promoted order or disorder. Of these ten properties, the most important is the 14 Å contact number. Of the other nine, one is the coordination number, four are associated with hydrophobicity, one is associated with polarity and three are related to the propensity of the amino acids to form β-strands. The team is now working on determining whether there are any relationships between protein function and types of protein disorder.

Linking genotypes to clinical phenotypes
Francisco de la Vega (Applied Biosystems, Foster City, CA, USA) gave an overview of this field and of the technologies and data required. He suggested that future challenges include technological reproducibility and quality control at high throughput, the accurate analysis of data, and good data integration. Peter Park (Department of Biostatistics, Harvard School of Public Health, Boston, MA, USA) then discussed the use of microarray data for predicting clinical phenotypes, the difficulties in knowing which genes to use, and the unusual statistical problem of there being more variables than subjects. His group has therefore developed a nonparametric scoring algorithm, which uses ranks rather than the actual expression values and is less sensitive to measurement errors. By scoring each gene on the basis of samples of known classes, a small set of informative genes can be isolated from the huge amount of expression data generated across the many genes measured.
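
As a simple stand-in for such a rank-based score (the exact statistic used by Park and colleagues is not reproduced here), the sketch below pools one gene's expression values from two known classes, ranks them, and sums the ranks of one class; because ranks replace raw intensities, a single outlying measurement can shift the score by only a few rank positions.

```python
# Hedged illustration of rank-based gene scoring: a Wilcoxon-style rank sum
# is used as a simple stand-in, not as the group's actual algorithm.

def rank_score(class_a, class_b):
    """Rank-sum score for one gene, given expression values for two classes."""
    pooled = sorted(class_a + class_b)
    positions = {}
    for idx, v in enumerate(pooled, start=1):      # collect positions of ties
        positions.setdefault(v, []).append(idx)
    rank_of = {v: sum(p) / len(p) for v, p in positions.items()}  # mid-ranks
    return sum(rank_of[v] for v in class_a)

# Genes whose class A samples rank consistently high (or low) get extreme
# scores and are candidates for the small informative gene set.
print(rank_score([8.1, 7.9, 8.4], [5.2, 6.0, 5.7]))
```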

Another problem that often arises is the difficulty in determining what counts as a significant difference in gene expression analyses. Atul Butte (Children’s Hospital Informatics Program, Boston, MA, USA) commented on the arbitrary threshold differences used by researchers and the problems of comparing data between microarray chips. Butte’s group tested this by comparing RNA data from four patients with glucose intolerance between two sets of Affymetrix Hu35K microarrays. Initially, they found seemingly high reproducibility between the two sets of chips, with correlation coefficients for intrapatient repeated measurements of 0.76–0.84. However, they found poor reproducibility when they examined interpatient fold differences, with correlation coefficients ranging from 0.01 to 0.09. In one instance, they found that a gene was expressed tenfold higher or tenfold lower in patient 1 compared with patient 3, depending on which set of chips was being used. His explanation was that the measurements are not ‘stable’: noise changes the expression measurements because the numbers are so small. He suggested that this could be overcome by removing ‘noisy’ patients and by using a smaller range of insignificance. However, he concluded that the best strategy would differ for a small academic group, where money is a significant factor, compared with a large pharmaceutical company, where there are sufficient funds to follow up false positives.
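
A minimal sketch of the two reproducibility checks is shown below, assuming the two replicate chip sets are stored as genes × patients NumPy arrays of expression values; the variable names, data layout and use of log2 ratios are assumptions for illustration rather than details of the group's analysis.

```python
import numpy as np

# Sketch of the two checks described above: correlation of repeated
# measurements per patient, and correlation of patient-versus-patient fold
# differences computed independently on each replicate chip set.

def intrapatient_correlations(chips_a, chips_b):
    """Correlation of repeated measurements for each patient (column)."""
    return [np.corrcoef(chips_a[:, p], chips_b[:, p])[0, 1]
            for p in range(chips_a.shape[1])]

def fold_difference_correlation(chips_a, chips_b, p1, p2, eps=1e-9):
    """Correlation between the patient p1/p2 fold differences measured on the
    two chip sets; low values mean the fold changes do not reproduce."""
    fold_a = np.log2((chips_a[:, p1] + eps) / (chips_a[:, p2] + eps))
    fold_b = np.log2((chips_b[:, p1] + eps) / (chips_b[:, p2] + eps))
    return np.corrcoef(fold_a, fold_b)[0, 1]
```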

Pharmacogenomic database
Russ Altman (Stanford Medical Informatics, Stanford University, Stanford, CA, USA) provided an overview of the pharmacogenomic database (PharmGKB; http://www.pharmgkb.org/) being set up, in parallel with a private pharmacogenomic effort, to provide a publicly available repository of pharmacogenomic information. The database will be run by Teri Klein (Stanford Medical Informatics) and, when complete, will include genomic information, alleles, clinical phenotypes, molecular and cellular phenotypes, molecules, drug response systems, drugs and the impact of the environment. Groups that do not have sufficient bioinformatics resources can submit gene sequence information and the database group will then do the necessary bioinformatics for them. The groups currently involved in the project are working on asthma, cancer, transporter molecules, oestrogen-related genes, depression and Phase II metabolism enzymes.

Bioethics
An interesting discussion session was held on bioethics. Pierre Baldi (Department of Information and Computer Science and Department of Biological Chemistry, University of California Irvine, CA, USA) gave an overview of some of the recent developments in computing and in science. These included the development of computers that will soon exceed the computational and storage power of the human brain; connectivity that approaches science fiction; and genetically engineered animals such as tigons and ligers (part lion, part tiger), and glowing mice and plants. He suggested that the challenges for the future will be to identify what it actually means to be human, and to deal with the blurring of the silicon/carbon boundary.

Larry Hunter (University of Colorado, Boulder, CO, USA) highlighted the problems of deciphering ‘good’ from ‘bad’ genes. Obvious ‘good’ genes might be those for disease resistance, physical health or mental ability, whereas ‘bad’ genes might be those for cancer, heart disease, and mental and physical diseases and disabilities. However, the boundaries are more complicated: with the globin gene, for example, the variant protects against malaria when heterozygous but produces sickle cell anaemia when homozygous. Other considerations include how problematic some ‘bad’ genes really are; for example, poor eyesight might take 20 years to develop, by which time a non-genetic intervention might be available, meaning that genetic intervention is not required.

However, James Sikela (University of Colorado Health Sciences Center, Boulder, CO, USA) pointed out that all courses of action have consequences, including doing nothing at all about the genetic revolution. The challenges are to anticipate the consequences, build consensus on goals and devise effective strategies. He concluded the session by reminding everyone that we all have a responsibility to weigh up carefully all sides of our actions relating to the use of genetic information before we act.

Acknowledgements
I would like to thank Keith Dunker for his valuable comments on this article.

Reference
1 Altman, R.B. et al., eds (2001) Pacific Symposium on Biocomputing 2001, World Scientific Publishing Co.