bioinformatics as an integrative science jaap heringa faculty of sciences faculty of earth and life...

Post on 19-Dec-2015

216 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

Bioinformatics as an integrative science

Jaap Heringa

Faculty of Sciences

Faculty of Earth and Life Sciences

Integrative Bioinformatics Institute VU (IBIVU)

heringa@cs.vu.nl, www.cs.vu.nl/~ibivu, Tel. +31-20-4447649

Gathering knowledge

• Anatomy, architecture

• Dynamics, mechanics

• Informatics(Cybernetics – Wiener, 1948) (Cybernetics has been defined as the science of control in machines and animals, and hence it applies to technological, animal and environmental systems)

• Genomics, bioinformatics

Rembrandt, 1632

Newton, 1726

MathematicsStatistics

Computer ScienceInformatics

BiologyMolecular biology

Medicine

Chemistry

Physics

Bioinformatics

Bioinformatics

Bioinformatics“Studying informational processes in biological systems”

(Hogeweg Utrecht; early 1970s)

Applying algorithms and mathematical formalisms in biology (genomics) USA started but now everywhere

Taking care of the computational infrastructure and data management everywhere

Is a supporting science everywhere

“Information technology applied to the management and analysis of biological data” (Attwood and Parry-Smith)

The Human Genome -- 26 June 2000

Dinner discussion: Integrative Bioinformatics & Genomics VUDinner discussion: Integrative Bioinformatics & Genomics VU

metabolomemetabolome

proteomeproteome

genomegenome

transcriptometranscriptome

physiomephysiome

Genomics

A gene codes for a protein

Protein

mRNA

DNA

transcription

translation

CCTGAGCCAACTATTGATGAA

PEPTIDE

CCUGAGCCAACUAUUGAUGAA

4-letter alphabet

20-letter alphabet

Humans havespliced genes…

DNA makes RNA makes Protein

Remarks•Proteins can use different combinations of exons =>

alternative splicing

•The human factor VIII gene (whose mutations cause hemophilia A) is spread over ~186,000 bp. It consists of 26 exons ranging in size from 69 to 3,106 bp, and its 25 introns range in size from 207 to 32,400 bp. The complete gene is thus ~9 kb of exon and ~177 kb of intron.

•The biggest human gene yet is for dystrophin. It has > 30 exons and is spread over 2.4 million bp.

•Single Nucleotide Polymorphism (SNP) data important for health

Microarray with about20K genes…

Proteomics

• X-ray crystallography• NMR• Mass spectrometry data • Structural genomics: solving and

categorising all existing protein folds (3D structures)

• Protein-protein interactions • Protein-ligand interactions (drug design)

Metabolic networks

Glycolysis and

Gluconeogenesis

Kegg database (Japan)

Physiome

• Metabolomics + all other little things in the cell

• Ions, protons, etc.

Algorithms in bioinformatics• string algorithms• dynamic programming• machine learning (NN, k-NN, SVM, GA, ..)• Markov chain models• hidden Markov models• Markov Chain Monte Carlo (MCMC) algorithms• stochastic context free grammars• EM algorithms• Gibbs sampling• clustering• tree algorithms• text analysis• hybrid/combinatorial techniques and more…

Free University initiativesIntegrative Bioinformatics Institute VU (IBIVU)

•Centre for Research on BioComplex Systems (CRBCS) – Systems Biology

•Centre for Neurobiology and Cognitive Research (CNCR)

•VU Medical Centre (Microarray, CGH data)

IBIVU supporting Dutch initiatives•BioRange: Pan-Dutch bioinformatics proposal (65M Euro)

•Centre for Medical Systems Biology (Leiden, A’dam, R’dam)

•Ecogenomics (A’dam, Wageningen, Nat. Inst. For Health and Environment (RIVM))

•BioASP: streamline/stimulate bioinformatics teaching across The Netherlands

Dutch Centres of Excellence

•Cancer Genomics Consortium [DCGP]

•Center for Biosystem Genomics [CBSG], focuses on plant genomics (potato, tomato) 

•Kluyver Centre for Genomics of Industrial Fermentation [Kluyver]

•Center for Medical Systems Biology [CMSB], focuses on multifactorial disease

•Netherlands Proteomics Centre for proteomics as an emerging horizontal genomics discipline

Dutch academic/industrial initiatives• Nutrigenomics exploration into the prevention and care of

nutrional inroads in vascular disease, diabetes, hypertension and obesity

• Interaction between the immune system and food; a functional genomics approach to celiac disease

• Mechanisms of life-threatening virus disease and new leads for treatment and vaccines

• Genomics of host – respiratory virus interactions: towards novel intervention strategies;

• Ecogenomics: Functioning of ecosystems targeted at sustainable environmentally friendly and healthy products (ecology, toxicology and sustainable innovation)

In vitro

Life-support functions soil

Biol. response array

Eco-toxicology

In situ

Metagenome array

Technologydevelopment

Research questions

Bio-informatics-

Technology platform

Assessing the Living Soil

Ecogenomics

PROJECTCoordinator

Task force

EPIDEMIOLOGYDorret Boomsma(VU/mc)*Cornelia van Duyn(EMC)

Populations EMC : van Duyn, Hofman, OostraVU/mc : Boomsma, Boers, Dijkmans, Heine, Hoogendijk, van der Knaap, Meier, Pena, PinedoLUMC : Slagboom, Bertina, Breedveld, Breuning, Cornelisse, Devilee, vDissel, Ferrari, Huizinga, Roosendaal, Roos, van der Velde, Westendorp, ZitmanGenotyping LUMC : Slagboom, Sandkuijl, den DunnenEMC : Oostra, HeutinkVU/mc : Boomsma, (Heutink)

SYSTEMS BIOLOGYJan vd Greef (TNO/UL)*Cor Verweij (VU/mc) 

Arraying LUMC: den Dunnen, Boer, FoddeVU/mc: Verweij, Ylstra, BrakenhoffEMC: Oostra

Proteomics LUMC: Koning, Deelder, den Dunnen, van der MaarelUL: Overkleeft, Abrahams, VU/mc: Smit, Li, van Kooyk

Metabolomics

UL: Verduijn Lunel, van de Geer, VerheijTNO: van der Greef, Havekes, te KoppeleVU/mc: Jakobs

TECHNOLOGYHuub de Groot (UL) 

Molecular interactions

UL: Abrahams, Brouwer, IJzerman, van BoomLUMC: Tanke, Raap, Deelder, den DunnenVU/mc: Leurs, Irth

In vivo imaging

UL: de Groot, KokLUMC: Reiber, van Buchem, de Roos, Poelmann, LowickVU/mc: Witter, Bal, LammertsmaEMC: van Duyn, van Swieten

MODEL SYSTEMSRune Frants (LUMC) 

Mouse / RatZebrafishDrosophilaYeast

EMC: OostraLUMC : Verbeek, Fodde, deKloet, Verrijzer, Noordermeer, MullendersTNO: Havekes; UL: Spaink, Brouwer, SchmidtVU/mc: Verhage, Smit, Vandenbroucke-Grauls

CLINICAL APPLICATIONSCornelis Melief (LUMC)  

Cells, vaccines

LUMC : Melief, Goulmy, Falkenburg, Ottenhoff; Spaan, de VriesVU/mc: van Kooiyk; Meijer,Pinedo

Viral LUMC: Spaan, Wiertz, HoebenVU/mc: Gerritsen, Curiel

Methodologies,Pharmaceuticals

UL: IJzerman, Mulder, van BoomLUMC: Huizinga, Breedveld, Breuning, van Deutekom, Ferrari, Fodde, Frants, Jukema, de Kloet, Ottenhoff, van der Velde, ZitmanVU/mc: Maassen, Dijkmans, Leurs, Meijer, PinedoEMC : Stricker

CENTRAL PROJECTCoordinator / Elements

DATA INTEGRATION,ANALYSIS AND LOGISTICSNN

Central Information ManagementTFBI / BIG-VU / EBB - Rosetta Resolver® - LIMS integration/ /interfacing Biostatistics van Houwelingen, Eijlers, Boer, Sandkuijl (LUMC); van der Vaart, de Gunst, Boers (VU/mc); Houwing-Duiistermaat (EMC), van de Geer (UL)

Bioinformatics Boer, Svensson, Gorbalenya (LUMC), Heringa, vBeek (VU/mc), Stijnen, van der Lei, Mons (EMC), Kok (UL)

BioASP Interface ism: Vriend/TellegenGRID – Virtual Laboratory NWO- BMI FLEXwork van Ommen, Boer, Svensson ism: - Stiekema (Wag) - Herzberger (Ams) - Vriend (Nijm)

Medical Systems Biology

• Integrate data sources

• Integrate methods

• Integrate data through method integration (biological model)

Integrative bioinformaticsData integration

Bioinformatics tool

Data

Algorithm

BiologicalInterpretation

(model)

tool

“Nothing in Biology makes sense except in the light of evolution” (Theodosius Dobzhansky (1900-1975))

“Nothing in Bioinformatics makes sense except in the light of Biology”

Bioinformatics

Pair-wise sequence alignment(more than just string matching)

MDAGSTVILCFVGMDAASTILCGS

Amino Acid Exchange

Matrix

Gap penalties (open,extension)

Search matrix

MDAGSTVILCFVG-MDAAST-ILC--GS

EvolutionGlobal dynamic programming

Data

Algorithm

BiologicalInterpretation

(model)

tool

Integrative bioinformaticsData integration

Integrative bioinformaticsData integration

Data 1 Data 2 Data 3

Integrative bioinformaticsData integration

Data 1

Algorithm 1

BiologicalInterpretation

(model) 1

tool

Algorithm 2

BiologicalInterpretation

(model) 2

Algorithm 3

BiologicalInterpretation

(model) 3

Data 2 Data 3

“The solution includes an infrastructure or data pipeline involving: •a general portal•virtual lab technology (virtual LIMS)•‘petabase’ data handling facilities•methods, software and ‘tools’ to integrate data and extract knowledge from data in the user domain.

This infrastructure calls for •a central facilitation unit providing large storage and computing facilities to run central software packages with user interfaces”

•Could Gridlab do this?

Integrative bioinformaticsData integration

Integrating Primary and Predicted Secondary Structure data for

Multiple Alignment

Using secondary structure in multiple alignment

“Structure more conserved than sequence”

•10 years SS prediction method development: Q3 += 3%

•10 years MA method development: difference in Q3 can be >30%

Protein structure hierarchical levels

VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH

PRIMARY STRUCTURE (amino acid sequence)

QUATERNARY STRUCTURE (oligomers)

SECONDARY STRUCTURE (helices, strands)

TERTIARY STRUCTURE (fold)

Protein structure hierarchical levels

VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH

PRIMARY STRUCTURE (amino acid sequence)

QUATERNARY STRUCTURE (oligomers)

SECONDARY STRUCTURE (helices, strands)

TERTIARY STRUCTURE (fold)

Protein structure hierarchical levels

VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH

PRIMARY STRUCTURE (amino acid sequence)

QUATERNARY STRUCTURE (oligomers)

SECONDARY STRUCTURE (helices, strands)

TERTIARY STRUCTURE (fold)

Flavodoxin-cheY: Praline alignment (prepro=1500)

1fx1 -PKALIVYGSTTGNT-EYTAETIARQLANAG-YEVDSRDAASVEAGGLFEGFDLVLLGCSTWGDDSI------ELQDDFIPLF-DSLEETGAQGRKVACF

FLAV_DESDE MSKVLIVFGSSTGNT-ESIaQKLEELIAAGG-HEVTLLNAADASAENLADGYDAVLFgCSAWGMEDL------EMQDDFLSLF-EEFNRFGLAGRKVAAf

FLAV_DESVH MPKALIVYGSTTGNT-EYTaETIARELADAG-YEVDSRDAASVEAGGLFEGFDLVLLgCSTWGDDSI------ELQDDFIPLF-DSLEETGAQGRKVACf

FLAV_DESSA MSKSLIVYGSTTGNT-ETAaEYVAEAFENKE-IDVELKNVTDVSVADLGNGYDIVLFgCSTWGEEEI------ELQDDFIPLY-DSLENADLKGKKVSVf

FLAV_DESGI MPKALIVYGSTTGNT-EGVaEAIAKTLNSEG-METTVVNVADVTAPGLAEGYDVVLLgCSTWGDDEI------ELQEDFVPLY-EDLDRAGLKDKKVGVf

2fcr --KIGIFFSTSTGNT-TEVADFIGKTLGA---KADAPIDVDDVTDPQALKDYDLLFLGAPTWNTG----ADTERSGTSWDEFLYDKLPEVDMKDLPVAIF

FLAV_AZOVI -AKIGLFFGSNTGKT-RKVaKSIKKRFDDET-MSDA-LNVNRVS-AEDFAQYQFLILgTPTLGEGELPGLSSDCENESWEEFL-PKIEGLDFSGKTVALf

FLAV_ENTAG MATIGIFFGSDTGQT-RKVaKLIHQKLDG---IADAPLDVRRAT-REQFLSYPVLLLgTPTLGDGELPGVEAGSQYDSWQEFT-NTLSEADLTGKTVALf

FLAV_ANASP SKKIGLFYGTQTGKT-ESVaEIIRDEFGN---DVVTLHDVSQAE-VTDLNDYQYLIIgCPTWNIGEL--------QSDWEGLY-SELDDVDFNGKLVAYf

FLAV_ECOLI -AITGIFFGSDTGNT-ENIaKMIQKQLGK---DVADVHDIAKSS-KEDLEAYDILLLgIPTWYYGE--------AQCDWDDFF-PTLEEIDFNGKLVALf

4fxn -MK--IVYWSGTGNT-EKMAELIAKGIIESG-KDVNTINVSDVNIDELL-NEDILILGCSAMGDEVL-------EESEFEPFI-EEIS-TKISGKKVALF

FLAV_MEGEL MVE--IVYWSGTGNT-EAMaNEIEAAVKAAG-ADVESVRFEDTNVDDVA-SKDVILLgCPAMGSEEL-------EDSVVEPFF-TDLA-PKLKGKKVGLf

FLAV_CLOAB -MKISILYSSKTGKT-ERVaKLIEEGVKRSGNIEVKTMNLDAVD-KKFLQESEGIIFgTPTYYAN---------ISWEMKKWI-DESSEFNLEGKLGAAf

3chy ADKELKFLVVDDFSTMRRIVRNLLKELGFN--NVEEAEDGVDALNKLQAGGYGFVI---SDWNMPNM----------DGLELL-KTIRADGAMSALPVLM

T1fx1 GCGDS-SY-EYFCGA-VDAIEEKLKNLGAEIVQD---------------------GLRIDGD--PRAARDDIVGWAHDVRGAI--------

FLAV_DESDE ASGDQ-EY-EHFCGA-VPAIEERAKELgATIIAE---------------------GLKMEGD--ASNDPEAVASfAEDVLKQL--------

FLAV_DESVH GCGDS-SY-EYFCGA-VDAIEEKLKNLgAEIVQD---------------------GLRIDGD--PRAARDDIVGwAHDVRGAI--------

FLAV_DESSA GCGDS-DY-TYFCGA-VDAIEEKLEKMgAVVIGD---------------------SLKIDGD--PE--RDEIVSwGSGIADKI--------

FLAV_DESGI GCGDS-SY-TYFCGA-VDVIEKKAEELgATLVAS---------------------SLKIDGE--PD--SAEVLDwAREVLARV--------

2fcr GLGDAEGYPDNFCDA-IEEIHDCFAKQGAKPVGFSNPDDYDYEESKS-VRDGKFLGLPLDMVNDQIPMEKRVAGWVEAVVSETGV------

FLAV_AZOVI GLGDQVGYPENYLDA-LGELYSFFKDRgAKIVGSWSTDGYEFESSEA-VVDGKFVGLALDLDNQSGKTDERVAAwLAQIAPEFGLS--L--

FLAV_ENTAG GLGDQLNYSKNFVSA-MRILYDLVIARgACVVGNWPREGYKFSFSAALLENNEFVGLPLDQENQYDLTEERIDSwLEKLKPAV-L------

FLAV_ANASP GTGDQIGYADNFQDA-IGILEEKISQRgGKTVGYWSTDGYDFNDSKA-LRNGKFVGLALDEDNQSDLTDDRIKSwVAQLKSEFGL------

FLAV_ECOLI GCGDQEDYAEYFCDA-LGTIRDIIEPRgATIVGHWPTAGYHFEASKGLADDDHFVGLAIDEDRQPELTAERVEKwVKQISEELHLDEILNA

4fxn G-----SY-GWGDGKWMRDFEERMNGYGCVVVET---------------------PLIVQNE--PDEAEQDCIEFGKKIANI---------

FLAV_MEGEL G-----SY-GWGSGEWMDAWKQRTEDTgATVIGT----------------------AIVNEM--PDNA-PECKElGEAAAKA---------

FLAV_CLOAB STANSIAGGSDIA---LLTILNHLMVKgMLVYSG----GVAFGKPKTHLGYVHINEIQENEDENARIfGERiANkVKQIF-----------

3chy VTAEAKK--ENIIAA---------AQAGAS-------------------------GYVV-----KPFTAATLEEKLNKIFEKLGM------

GIteration 0 SP= 136944.00 AvSP= 10.675 SId= 4009 AvSId= 0.313

Flavodoxin-cheY NJ tree

Secondary structure-induced alignment iteration

3chy-AA SEQUENCE|| AA |ADKELKFLVVDDFSTMRRIVRNLLKELGFNNVEEAEDGVDALNKLQAGGYGFVISDWNMP|

3chy-ITERATION-0|| PHD | EEEEEEE HHHHHHHHHHHHHHHHH E HHHHHHHHHH HHHEEE |

3chy-ITERATION-1|| PHD | EEEEEEEE HHHHHHHHHHHHHHH HHHHHHHH EEEEEE |

3chy-ITERATION-2|| PHD | EEEEEEEE HHHHHHHHHHHHHH HHHHHHHHH EEEEEE |

3chy-ITERATION-3|| PHD | EEEEEEEE HHHHHHHHHHHHHH EEE HHHHHH EEEEE |

3chy-ITERATION-4|| PHD | EEEEEEEE HHHHHHHHHHHHHH HHHHHHH EEEEE |

3chy-ITERATION-5|| PHD | EEEEEEEE HHHHHHHHHHHHHH EEE HHHHHH EEEEE |

3chy-ITERATION-6|| PHD | EEEEEEEE HHHHHHHHHHHHHH HHHHHHHH EEEEEE |

3chy-ITERATION-7|| PHD | EEEEEEEE HHHHHHHHHHHHHH EEE HHHHHH EEEEE |

3chy-ITERATION-8|| PHD | EEEEEEEE HHHHHHHHHHHHHH HHHHHHH EEEEEE |

3chy-ITERATION-9|| PHD | EEEEEEEE HHHHHHHHHHHHHH HHHHHHHHHH EEEEE |

3chy-AA SEQUENCE|| AA |NMDGLELLKTIRADGAMSALPVLMVTAEAKKENIIAAAQAGASGYVVKPFTAATLEEKLNKIFEKLGM|

3chy-ITERATION-0|| PHD | HHHHHHEEEEEE HHHHHHHHHHHHHHHHH HHHHHHHHHHHHHH |

3chy-ITERATION-1|| PHD | HHHHHHEEEEEE HHH HHHHHHHHHHHHHHHHHH EEE HHHHHHHHHHHHHH |

3chy-ITERATION-2|| PHD | HHHHHHEEEEEE HHHHHHHHHHHHHHHHHH EEE HHHHHHHHHHHHHH |

3chy-ITERATION-3|| PHD | HHHHHHHHHHHH HHHHHHHHHHHHHHHHHH EEE HHHHHHHHHHHHHH |

3chy-ITERATION-4|| PHD | HHHHH EEEEE HHHHHHHHHHHHHHHHH EEE HHHHHHHHHHHHHH |

3chy-ITERATION-5|| PHD | HHHHHHHH EEEEE HHHHHHHHHHHHHHHH EEE HHHHHHHHHHHHHH |

3chy-ITERATION-6|| PHD | HHHHHHHH EEEEE HHHHHHHHHHHHHHHH EEEE HHHHHHHHHHHHHH |

3chy-ITERATION-7|| PHD | HHHHHHHH EEEEEE HHHHHHHHHHHHHHHH EEE HHHHHHHHHHHHHH |

3chy-ITERATION-8|| PHD | HHHHHHHH EEEEE HHHHHHHHHHHHHHHH EEE HHHHHHHHHHHHHH |

3chy-ITERATION-9|| PHD | HHHHHHHH EEEEE HHHHHHHHHHHHHHH EEEE HHHHHHHHHHHHHH |

Flavodoxin-cheY multiple alignment/ secondary structure iteration

cheY SSEs

Integrating secondary structure prediction and multiple alignment

• Low key example

• But difficult

• How to scale up?

• Need new formalisms and technology

SnapDRAGON

Richard A. George

George R.A. and Heringa, J. (2002) J. Mol. Biol., 316, 839-851.

 

Integrating protein multiple alignment, secondary and tertiary structure

prediction to predict structural domains in sequence data

The DEATH Domain• Present in a variety of Eukaryotic proteins involved with cell death.• Six helices enclose a tightly packed hydrophobic core.• Some DEATH domains form homotypic and heterotypic dimers.

http

://w

ww

.msh

ri.o

n.ca

/paw

son

Pyruvate kinasePhosphotransferase

barrel regulatory domain

barrel catalytic substrate binding domain

nucleotide binding domain

1 continuous + 2 discontinuous domains

Structural domain organisation can be nasty…

Protein structure hierarchical levels

VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH

PRIMARY STRUCTURE (amino acid sequence)

QUATERNARY STRUCTURE

SECONDARY STRUCTURE (helices, strands)

TERTIARY STRUCTURE (fold)

Protein structure hierarchical levels

VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH

PRIMARY STRUCTURE (amino acid sequence)

QUATERNARY STRUCTURE

SECONDARY STRUCTURE (helices, strands)

TERTIARY STRUCTURE (fold)

Protein structure hierarchical levels

VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH

PRIMARY STRUCTURE (amino acid sequence)

QUATERNARY STRUCTURE

SECONDARY STRUCTURE (helices, strands)

TERTIARY STRUCTURE (fold)

Protein structure hierarchical levels

VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH

PRIMARY STRUCTURE (amino acid sequence)

QUATERNARY STRUCTURE

SECONDARY STRUCTURE (helices, strands)

TERTIARY STRUCTURE (fold)

•The C distance matrix is divided into smaller clusters.

•Seperately, each cluster is embedded into a local centroid.

•The final predicted structure is generated from full embedding of the multiple centroids and their corresponding local structures.

3NN

NN

C distancematrix

Targetmatrix

N

CCHHHCCEEE

Multiple alignment

Predicted secondary structure

100 randomisedinitial matrices

100 predictions

Input data

SnapDRAGON

Domains in structures assigned using method by Taylor (1997)

Domain boundary positions of each model against sequence

Summed and Smoothed Boundaries (Biased window protocol)

SnapDRAGON

1

2

3

SnapDRAGON

•Predicting domain boundaries for single average size protein could take hours on 128-node cluster computer with simplified significance testing.

•How to do scale up to structural genomics? 30,000 human proteins of 1 hr each gives 3.5 years..

What we still cannot do well• “Give us sequence, we do rest” failed so far; e.g., number of

human genes• Gene prediction bad, RNA genes missed• Protein structure/function prediction unsolved; we have no clue

about function of 50% of human genes• No theory of gene regulation• Cannot well predict post-translational modification• Many (database) solutions not generic• We have no E=mc2 so need to keep all data• Integrating methods and data• Understand biologically

Future Bioinformatics Research Topics

• Integration of knowledge– We have some formalisms (ontologies,

distributed databases) but we need to develop many completely new formalisms and new technologies beyond what we have now

Conclusions

• Getting important integrative Bioinformatics/Systems Biology applications onto the Grid through Gridlab can be significant

• Bioinformatics and genomics are getting clinical. Gridlab could play an important role

The end. Thanks

Future Bioinformatics Research Topics

Keywords morning session

• Integration of knowledge– Information transfer from one object to another– What are the rules– From genotype to phenoypes, current algorithms and

ontologies not sufficient– Biological interpretation needs context– DB maintenance is dynamic process, most info is static– Need resources– Environment should allow student to make method in 3

hours

• Genomics– Identifying genetic elements still bad– Collect easy primary biological facts– Gene pred, struct pred, functional all unsolved– Genetic “parts” list is uncomplete and scanty– Many omics “unknowme”

• Genomics– Hypothesis driven versus systematic approaches

– Need databases,algorithms, biol knowledge

– Data structures not suitable for complexity

– Solutions such as Ensembl not generic

– Need technologies beyond ontologies

– Need new formalisms to be able to do “vertical genomics”

• Systems Biology– Very promising area

• Health

• Pharmaceuticals

• Biotechnology

• Environment

• (Medical) Systems Biology– Diego di Bernardo– Ilias Jakovidis– Very promising area

Summary

• How can Europe regain ground

Hans Werner Mewes

• DNA contains all• Identifying genetic elements still bad• Collect easy primary biological facts• Gene pred, struct pred, functional all

unsolved• Genetic “parts” list is uncomplete and

scanty• Many omics “unknowme”

• Hypothesis driven versus systematic approaches

• Need databases,algorithms, biol knowledge

• Data structures not suitable for complexity

• Solutions such as Ensembl not generic

• Need technologies beyond ontologies

• Information transfer from one object to another• What are the rules• From genotype to phenoypes, current algorithms not

sufficient• Biological interpretation needs context• DB maintenance is dynamic process, most info is

static• Need resources• Environment should allow student to make method in

3 hours

Diego di Bernardo

• TIGEM: disease genes• Bioinformatics and comp biol not at a par• 81 of genome “genomics&databases” and 19%

“genomics&algorithms• Important topics: regulation, network, digital signal

processing HMMs• Problems : algorithms not biological and no

experimental verification• Bioinformatics helps design biological experimnents• Richard Durbin: value of physics and engineering

• Computational tools for discovery of novel objects

Ilias Jakovidis

• Medical informatics

• Health telematics

• eHealth

• Medical ontologies didn’t help Paul Schofield at all (tried with NCBI-big mess)

• Middleware includes ontologies so covers biology (IBM!)

• Language engineering• Natural language in medicine, computerize

medical community• Biomedical informatics: applications in

healthcare, how to get to clinical?• Synergy between medicine and biology

informatics• Alphonso, med will dominate, lot of money with

unclear methods

• Medical info has worked coherently, how can we do that? How can we change?

• Mewes: Bioinf has achieved usage, not med. Bioinf is entering cliniques.

Gunnar von Heijne

• Databases should be funded• Start problem for 5 years: and then what?• With infrastructure this problem is less, so funding

is relatively OK.• Technology development should not become

dominant• Most biologist are small scale hypothesis driven• Marketing problem

• From 19 bioinf nethods , 15 are European in genomics

• Validation is not always key (Alfonso)

• EMBOSS project European wide, for algorithm driven research. EMBOSS is longstanding. But could not get funding from EC (no funding category)

Alfonso Valencia

• Often, 1 bioinformatician for everything• Need of integration/collaboration

– Social, technical barriers

• People should realise that Bioinf and Bioinformaticians are very different

• Integrated (med) system– Underfunded (1 postdoc)– Difficult to develop– lack of standards and repositories– Difficult to interact with biologist– All these things essential

• 3-4 good bioinf groups in Spain• Make virtual institute for bioinformatics• There are few large groups with national

funding • There are few large groups with European

funding• There are many small groups with weak

institutional funding

• Create framework valid for biology• Interaction reduces overhead• System access for biologists, point to the

right expert• Create new science beyond current needs• This does not compete with basic needs• Support strong European areas (eg. protein

interaction)

• Bioinformatics is a new discipline• Who solves the problem, who is interested in solving

it, and not always who qualifies to solving it (engineers,..)

• Example “information extraction in molecular biology”: after years no real progress made.

• Systems Biology: what to do and how (no linear path), but we have opportunity to develop knowing

• Experimental validation: methods debug databases. Many proteins (90%?) have never seen an experiment

• Should bioinf talk to biology or vice versa?

Jean-Marie Claverie

• 1951 first protein sequence (insulin)• Field has come of age, so outsiders shouldn’t tell

us what to do and how• Bioinformatics is part of the foundation• Clear difference in application of informatics or

bioinformatics• Future will be different• Give us sequence, we do rest failed! Number of

human genes is example.

• Gene finding: standard genes good, RNA genes missed, no theory of gene transcription

• We have no E=mc2 so need to keep data• Computational biology is same as systems biology• Good integration: E. Coli Bioinf-project, find all

genes in small bacterium. Inclusive project. Now good consortium.

• Bioinformatics becomes invisible for biologists (Blast).

Howard Bilofsky

• PRISM forum

• Provide challenges for (bio)informatics

• Drives Bioinf,omics,.. techniques

top related