Post on 19-Dec-2015
Bioinformatics as an integrative science
Jaap Heringa
Faculty of Sciences
Faculty of Earth and Life Sciences
Integrative Bioinformatics Institute VU (IBIVU)
[email protected], www.cs.vu.nl/~ibivu, Tel. +31-20-4447649
Gathering knowledge
• Anatomy, architecture
• Dynamics, mechanics
• Informatics (Cybernetics – Wiener, 1948). Cybernetics has been defined as the science of control in machines and animals, and hence it applies to technological, animal and environmental systems
• Genomics, bioinformatics
Rembrandt, 1632
Newton, 1726
Mathematics / Statistics
Computer Science / Informatics
Biology / Molecular biology
Medicine
Chemistry
Physics
Bioinformatics
Bioinformatics
Bioinformatics: “Studying informational processes in biological systems”
(Hogeweg Utrecht; early 1970s)
Applying algorithms and mathematical formalisms in biology (genomics) USA started but now everywhere
Taking care of the computational infrastructure and data management everywhere
Is a supporting science everywhere
“Information technology applied to the management and analysis of biological data” (Attwood and Parry-Smith)
The Human Genome -- 26 June 2000
Dinner discussion: Integrative Bioinformatics & Genomics VU
metabolome
proteome
genome
transcriptome
physiome
Genomics
A gene codes for a protein
Protein
mRNA
DNA
transcription
translation
CCTGAGCCAACTATTGATGAA
PEPTIDE
CCUGAGCCAACUAUUGAUGAA
4-letter alphabet
20-letter alphabet
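The slide's example can be reproduced in a few lines. This is a minimal sketch, assuming the DNA shown is the coding strand (so the mRNA is the same text with T replaced by U) and using the standard genetic code; the function names `transcribe` and `translate` are illustrative, not from the talk.

```python
# Standard genetic code, written in the conventional T, C, A, G ordering
# of the three codon positions (64 entries, '*' marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(dna: str) -> str:
    """Coding-strand DNA -> mRNA: same letters, with T replaced by U."""
    return dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Read the mRNA three letters at a time and look up each codon."""
    dna_like = mrna.upper().replace("U", "T")
    protein = []
    for pos in range(0, len(dna_like) - 2, 3):
        residue = CODON_TABLE[dna_like[pos:pos + 3]]
        if residue == "*":  # a stop codon ends the peptide
            break
        protein.append(residue)
    return "".join(protein)


dna = "CCTGAGCCAACTATTGATGAA"
mrna = transcribe(dna)
print(mrna)             # CCUGAGCCAACUAUUGAUGAA
print(translate(mrna))  # PEPTIDE
```

The 21 bases form 7 codons, and the one-letter codes of the 7 resulting amino acids spell the word on the slide.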
Humans have spliced genes…
DNA makes RNA makes Protein
Remarks
•Proteins can use different combinations of exons => alternative splicing
•The human factor VIII gene (whose mutations cause hemophilia A) is spread over ~186,000 bp. It consists of 26 exons ranging in size from 69 to 3,106 bp, and its 25 introns range in size from 207 to 32,400 bp. The complete gene is thus ~9 kb of exon and ~177 kb of intron.
•The biggest human gene yet is for dystrophin. It has > 30 exons and is spread over 2.4 million bp.
•Single Nucleotide Polymorphism (SNP) data important for health
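The factor VIII figures quoted above imply that only a small part of the gene is coding. A short check of that arithmetic, using only the rounded numbers from the slide (the variable names are illustrative):

```python
# Rounded figures for the human factor VIII gene, as quoted on the slide.
gene_span_bp = 186_000
exon_bp = 9_000
intron_bp = 177_000
n_exons, n_introns = 26, 25

# Exons plus introns should account for the whole gene span.
assert exon_bp + intron_bp == gene_span_bp

exon_fraction = exon_bp / gene_span_bp
mean_exon_bp = exon_bp / n_exons
mean_intron_bp = intron_bp / n_introns

print(f"exon fraction: {exon_fraction:.1%}")        # 4.8%
print(f"mean exon length: {mean_exon_bp:.0f} bp")
print(f"mean intron length: {mean_intron_bp:.0f} bp")
```

So fewer than 5% of the bases in this gene end up in the mature mRNA.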
Microarray with about 20K genes…
Proteomics
• X-ray crystallography
• NMR
• Mass spectrometry data
• Structural genomics: solving and categorising all existing protein folds (3D structures)
• Protein-protein interactions
• Protein-ligand interactions (drug design)
Metabolic networks
Glycolysis and Gluconeogenesis
KEGG database (Japan)
Physiome
• Metabolomics + all other little things in the cell
• Ions, protons, etc.
Algorithms in bioinformatics
• string algorithms
• dynamic programming
• machine learning (NN, k-NN, SVM, GA, ..)
• Markov chain models
• hidden Markov models
• Markov Chain Monte Carlo (MCMC) algorithms
• stochastic context free grammars
• EM algorithms
• Gibbs sampling
• clustering
• tree algorithms
• text analysis
• hybrid/combinatorial techniques and more…
Free University initiatives
Integrative Bioinformatics Institute VU (IBIVU)
•Centre for Research on BioComplex Systems (CRBCS) – Systems Biology
•Centre for Neurobiology and Cognitive Research (CNCR)
•VU Medical Centre (Microarray, CGH data)
IBIVU supporting Dutch initiatives
•BioRange: Pan-Dutch bioinformatics proposal (65M Euro)
•Centre for Medical Systems Biology (Leiden, A’dam, R’dam)
•Ecogenomics (A’dam, Wageningen, Nat. Inst. For Health and Environment (RIVM))
•BioASP: streamline/stimulate bioinformatics teaching across The Netherlands
Dutch Centres of Excellence
•Cancer Genomics Consortium [DCGP]
•Center for Biosystem Genomics [CBSG], focuses on plant genomics (potato, tomato)
•Kluyver Centre for Genomics of Industrial Fermentation [Kluyver]
•Center for Medical Systems Biology [CMSB], focuses on multifactorial disease
•Netherlands Proteomics Centre for proteomics as an emerging horizontal genomics discipline
Dutch academic/industrial initiatives
• Nutrigenomics exploration into the prevention and care of nutritional inroads in vascular disease, diabetes, hypertension and obesity
• Interaction between the immune system and food; a functional genomics approach to celiac disease
• Mechanisms of life-threatening virus disease and new leads for treatment and vaccines
• Genomics of host – respiratory virus interactions: towards novel intervention strategies
• Ecogenomics: Functioning of ecosystems targeted at sustainable environmentally friendly and healthy products (ecology, toxicology and sustainable innovation)
[Figure: Ecogenomics, “Assessing the Living Soil”. Labels: In vitro; In situ; Life-support functions soil; Biol. response array; Eco-toxicology; Metagenome array; Technology development; Research questions; Bioinformatics; Technology platform]
PROJECT / Coordinator / Task force
EPIDEMIOLOGY
Coordinators: Dorret Boomsma (VU/mc)*, Cornelia van Duyn (EMC)
Populations
EMC: van Duyn, Hofman, Oostra
VU/mc: Boomsma, Boers, Dijkmans, Heine, Hoogendijk, van der Knaap, Meier, Pena, Pinedo
LUMC: Slagboom, Bertina, Breedveld, Breuning, Cornelisse, Devilee, vDissel, Ferrari, Huizinga, Roosendaal, Roos, van der Velde, Westendorp, Zitman
Genotyping
LUMC: Slagboom, Sandkuijl, den Dunnen
EMC: Oostra, Heutink
VU/mc: Boomsma, (Heutink)
SYSTEMS BIOLOGY
Coordinators: Jan vd Greef (TNO/UL)*, Cor Verweij (VU/mc)
Arraying
LUMC: den Dunnen, Boer, Fodde
VU/mc: Verweij, Ylstra, Brakenhoff
EMC: Oostra
Proteomics
LUMC: Koning, Deelder, den Dunnen, van der Maarel
UL: Overkleeft, Abrahams
VU/mc: Smit, Li, van Kooyk
Metabolomics
UL: Verduijn Lunel, van de Geer, Verheij
TNO: van der Greef, Havekes, te Koppele
VU/mc: Jakobs
TECHNOLOGY
Coordinator: Huub de Groot (UL)
Molecular interactions
UL: Abrahams, Brouwer, IJzerman, van Boom
LUMC: Tanke, Raap, Deelder, den Dunnen
VU/mc: Leurs, Irth
In vivo imaging
UL: de Groot, Kok
LUMC: Reiber, van Buchem, de Roos, Poelmann, Lowick
VU/mc: Witter, Bal, Lammertsma
EMC: van Duyn, van Swieten
MODEL SYSTEMS
Coordinator: Rune Frants (LUMC)
Mouse / Rat, Zebrafish, Drosophila, Yeast
EMC: Oostra
LUMC: Verbeek, Fodde, deKloet, Verrijzer, Noordermeer, Mullenders
TNO: Havekes; UL: Spaink, Brouwer, Schmidt
VU/mc: Verhage, Smit, Vandenbroucke-Grauls
CLINICAL APPLICATIONS
Coordinator: Cornelis Melief (LUMC)
Cells, vaccines
LUMC: Melief, Goulmy, Falkenburg, Ottenhoff; Spaan, de Vries
VU/mc: van Kooiyk; Meijer, Pinedo
Viral
LUMC: Spaan, Wiertz, Hoeben
VU/mc: Gerritsen, Curiel
Methodologies, Pharmaceuticals
UL: IJzerman, Mulder, van Boom
LUMC: Huizinga, Breedveld, Breuning, van Deutekom, Ferrari, Fodde, Frants, Jukema, de Kloet, Ottenhoff, van der Velde, Zitman
VU/mc: Maassen, Dijkmans, Leurs, Meijer, Pinedo
EMC: Stricker
CENTRAL PROJECT / Coordinator / Elements
DATA INTEGRATION, ANALYSIS AND LOGISTICS
Coordinator: NN
Central Information Management: TFBI / BIG-VU / EBB - Rosetta Resolver® - LIMS integration / interfacing
Biostatistics: van Houwelingen, Eijlers, Boer, Sandkuijl (LUMC); van der Vaart, de Gunst, Boers (VU/mc); Houwing-Duiistermaat (EMC), van de Geer (UL)
Bioinformatics: Boer, Svensson, Gorbalenya (LUMC), Heringa, vBeek (VU/mc), Stijnen, van der Lei, Mons (EMC), Kok (UL)
BioASP Interface, in collaboration with: Vriend/Tellegen
GRID – Virtual Laboratory
NWO-BMI FLEXwork: van Ommen, Boer, Svensson, in collaboration with: Stiekema (Wag), Herzberger (Ams), Vriend (Nijm)
Medical Systems Biology
• Integrate data sources
• Integrate methods
• Integrate data through method integration (biological model)
Integrative bioinformatics: Data integration
[Diagram labels: Bioinformatics tool; Data; Algorithm; Biological Interpretation (model)]
“Nothing in Biology makes sense except in the light of evolution” (Theodosius Dobzhansky (1900-1975))
“Nothing in Bioinformatics makes sense except in the light of Biology”
Bioinformatics
Pair-wise sequence alignment (more than just string matching)
Sequences: MDAGSTVILCFVG and MDAASTILCGS
Amino Acid Exchange Matrix
Gap penalties (open, extension)
Search matrix
Alignment:
MDAGSTVILCFVG-
MDAAST-ILC--GS
Evolution
Global dynamic programming
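Global dynamic programming for two sequences can be sketched as follows. This is a simplified illustration, not the method behind the slide: it assumes identity scoring (match/mismatch) in place of an amino acid exchange matrix, and a single linear gap penalty in place of separate open and extension penalties. The function name `global_align` and the default scores are assumptions chosen for the example.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch style global alignment with a linear gap penalty."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning the prefixes a[:i] and b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    # Trace back from the bottom-right cell to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if (i > 0 and j > 0 and a[i - 1] == b[j - 1]) else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


best, row1, row2 = global_align("MDAGSTVILCFVG", "MDAASTILCGS")
print(best)
print(row1)
print(row2)
```

With a real exchange matrix and affine gap penalties the resulting alignment may differ from the one this sketch prints; the table-filling and traceback steps stay the same.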
Integrative bioinformatics: Data integration
[Diagram labels: Data 1, Data 2, Data 3; tool; Algorithm 1, 2, 3; Biological Interpretation (model) 1, 2, 3]
“The solution includes an infrastructure or data pipeline involving:
•a general portal
•virtual lab technology (virtual LIMS)
•‘petabase’ data handling facilities
•methods, software and ‘tools’ to integrate data and extract knowledge from data in the user domain.
This infrastructure calls for
•a central facilitation unit providing large storage and computing facilities to run central software packages with user interfaces”
•Could Gridlab do this?
Integrative bioinformatics: Data integration
Integrating Primary and Predicted Secondary Structure data for Multiple Alignment
Using secondary structure in multiple alignment
“Structure more conserved than sequence”
•10 years of secondary structure (SS) prediction method development: Q3 improved by 3%
•10 years of multiple alignment (MA) method development: difference in Q3 can be >30%
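Q3 is the usual three-state accuracy of a secondary structure prediction: the percentage of residues whose predicted state (helix H, strand E, or coil C) matches the observed one. A minimal sketch, with a made-up prediction used purely for illustration:

```python
def q3(predicted: str, observed: str) -> float:
    """Percentage of positions where the predicted state equals the observed state."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed strings must have equal length")
    if not observed:
        raise ValueError("empty input")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


observed = "CCHHHCCEEE"   # hypothetical reference states
predicted = "CCHHHHCEEC"  # hypothetical prediction, wrong at two positions
print(q3(predicted, observed))  # 80.0
```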
Protein structure hierarchical levels
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH
PRIMARY STRUCTURE (amino acid sequence)
QUATERNARY STRUCTURE (oligomers)
SECONDARY STRUCTURE (helices, strands)
TERTIARY STRUCTURE (fold)
Flavodoxin-cheY: Praline alignment (prepro=1500)
1fx1 -PKALIVYGSTTGNT-EYTAETIARQLANAG-YEVDSRDAASVEAGGLFEGFDLVLLGCSTWGDDSI------ELQDDFIPLF-DSLEETGAQGRKVACF
FLAV_DESDE MSKVLIVFGSSTGNT-ESIaQKLEELIAAGG-HEVTLLNAADASAENLADGYDAVLFgCSAWGMEDL------EMQDDFLSLF-EEFNRFGLAGRKVAAf
FLAV_DESVH MPKALIVYGSTTGNT-EYTaETIARELADAG-YEVDSRDAASVEAGGLFEGFDLVLLgCSTWGDDSI------ELQDDFIPLF-DSLEETGAQGRKVACf
FLAV_DESSA MSKSLIVYGSTTGNT-ETAaEYVAEAFENKE-IDVELKNVTDVSVADLGNGYDIVLFgCSTWGEEEI------ELQDDFIPLY-DSLENADLKGKKVSVf
FLAV_DESGI MPKALIVYGSTTGNT-EGVaEAIAKTLNSEG-METTVVNVADVTAPGLAEGYDVVLLgCSTWGDDEI------ELQEDFVPLY-EDLDRAGLKDKKVGVf
2fcr --KIGIFFSTSTGNT-TEVADFIGKTLGA---KADAPIDVDDVTDPQALKDYDLLFLGAPTWNTG----ADTERSGTSWDEFLYDKLPEVDMKDLPVAIF
FLAV_AZOVI -AKIGLFFGSNTGKT-RKVaKSIKKRFDDET-MSDA-LNVNRVS-AEDFAQYQFLILgTPTLGEGELPGLSSDCENESWEEFL-PKIEGLDFSGKTVALf
FLAV_ENTAG MATIGIFFGSDTGQT-RKVaKLIHQKLDG---IADAPLDVRRAT-REQFLSYPVLLLgTPTLGDGELPGVEAGSQYDSWQEFT-NTLSEADLTGKTVALf
FLAV_ANASP SKKIGLFYGTQTGKT-ESVaEIIRDEFGN---DVVTLHDVSQAE-VTDLNDYQYLIIgCPTWNIGEL--------QSDWEGLY-SELDDVDFNGKLVAYf
FLAV_ECOLI -AITGIFFGSDTGNT-ENIaKMIQKQLGK---DVADVHDIAKSS-KEDLEAYDILLLgIPTWYYGE--------AQCDWDDFF-PTLEEIDFNGKLVALf
4fxn -MK--IVYWSGTGNT-EKMAELIAKGIIESG-KDVNTINVSDVNIDELL-NEDILILGCSAMGDEVL-------EESEFEPFI-EEIS-TKISGKKVALF
FLAV_MEGEL MVE--IVYWSGTGNT-EAMaNEIEAAVKAAG-ADVESVRFEDTNVDDVA-SKDVILLgCPAMGSEEL-------EDSVVEPFF-TDLA-PKLKGKKVGLf
FLAV_CLOAB -MKISILYSSKTGKT-ERVaKLIEEGVKRSGNIEVKTMNLDAVD-KKFLQESEGIIFgTPTYYAN---------ISWEMKKWI-DESSEFNLEGKLGAAf
3chy ADKELKFLVVDDFSTMRRIVRNLLKELGFN--NVEEAEDGVDALNKLQAGGYGFVI---SDWNMPNM----------DGLELL-KTIRADGAMSALPVLM
1fx1 GCGDS-SY-EYFCGA-VDAIEEKLKNLGAEIVQD---------------------GLRIDGD--PRAARDDIVGWAHDVRGAI--------
FLAV_DESDE ASGDQ-EY-EHFCGA-VPAIEERAKELgATIIAE---------------------GLKMEGD--ASNDPEAVASfAEDVLKQL--------
FLAV_DESVH GCGDS-SY-EYFCGA-VDAIEEKLKNLgAEIVQD---------------------GLRIDGD--PRAARDDIVGwAHDVRGAI--------
FLAV_DESSA GCGDS-DY-TYFCGA-VDAIEEKLEKMgAVVIGD---------------------SLKIDGD--PE--RDEIVSwGSGIADKI--------
FLAV_DESGI GCGDS-SY-TYFCGA-VDVIEKKAEELgATLVAS---------------------SLKIDGE--PD--SAEVLDwAREVLARV--------
2fcr GLGDAEGYPDNFCDA-IEEIHDCFAKQGAKPVGFSNPDDYDYEESKS-VRDGKFLGLPLDMVNDQIPMEKRVAGWVEAVVSETGV------
FLAV_AZOVI GLGDQVGYPENYLDA-LGELYSFFKDRgAKIVGSWSTDGYEFESSEA-VVDGKFVGLALDLDNQSGKTDERVAAwLAQIAPEFGLS--L--
FLAV_ENTAG GLGDQLNYSKNFVSA-MRILYDLVIARgACVVGNWPREGYKFSFSAALLENNEFVGLPLDQENQYDLTEERIDSwLEKLKPAV-L------
FLAV_ANASP GTGDQIGYADNFQDA-IGILEEKISQRgGKTVGYWSTDGYDFNDSKA-LRNGKFVGLALDEDNQSDLTDDRIKSwVAQLKSEFGL------
FLAV_ECOLI GCGDQEDYAEYFCDA-LGTIRDIIEPRgATIVGHWPTAGYHFEASKGLADDDHFVGLAIDEDRQPELTAERVEKwVKQISEELHLDEILNA
4fxn G-----SY-GWGDGKWMRDFEERMNGYGCVVVET---------------------PLIVQNE--PDEAEQDCIEFGKKIANI---------
FLAV_MEGEL G-----SY-GWGSGEWMDAWKQRTEDTgATVIGT----------------------AIVNEM--PDNA-PECKElGEAAAKA---------
FLAV_CLOAB STANSIAGGSDIA---LLTILNHLMVKgMLVYSG----GVAFGKPKTHLGYVHINEIQENEDENARIfGERiANkVKQIF-----------
3chy VTAEAKK--ENIIAA---------AQAGAS-------------------------GYVV-----KPFTAATLEEKLNKIFEKLGM------
Iteration 0 SP= 136944.00 AvSP= 10.675 SId= 4009 AvSId= 0.313
Flavodoxin-cheY NJ tree
Secondary structure-induced alignment iteration
3chy-AA SEQUENCE|| AA |ADKELKFLVVDDFSTMRRIVRNLLKELGFNNVEEAEDGVDALNKLQAGGYGFVISDWNMP|
3chy-ITERATION-0|| PHD | EEEEEEE HHHHHHHHHHHHHHHHH E HHHHHHHHHH HHHEEE |
3chy-ITERATION-1|| PHD | EEEEEEEE HHHHHHHHHHHHHHH HHHHHHHH EEEEEE |
3chy-ITERATION-2|| PHD | EEEEEEEE HHHHHHHHHHHHHH HHHHHHHHH EEEEEE |
3chy-ITERATION-3|| PHD | EEEEEEEE HHHHHHHHHHHHHH EEE HHHHHH EEEEE |
3chy-ITERATION-4|| PHD | EEEEEEEE HHHHHHHHHHHHHH HHHHHHH EEEEE |
3chy-ITERATION-5|| PHD | EEEEEEEE HHHHHHHHHHHHHH EEE HHHHHH EEEEE |
3chy-ITERATION-6|| PHD | EEEEEEEE HHHHHHHHHHHHHH HHHHHHHH EEEEEE |
3chy-ITERATION-7|| PHD | EEEEEEEE HHHHHHHHHHHHHH EEE HHHHHH EEEEE |
3chy-ITERATION-8|| PHD | EEEEEEEE HHHHHHHHHHHHHH HHHHHHH EEEEEE |
3chy-ITERATION-9|| PHD | EEEEEEEE HHHHHHHHHHHHHH HHHHHHHHHH EEEEE |
3chy-AA SEQUENCE|| AA |NMDGLELLKTIRADGAMSALPVLMVTAEAKKENIIAAAQAGASGYVVKPFTAATLEEKLNKIFEKLGM|
3chy-ITERATION-0|| PHD | HHHHHHEEEEEE HHHHHHHHHHHHHHHHH HHHHHHHHHHHHHH |
3chy-ITERATION-1|| PHD | HHHHHHEEEEEE HHH HHHHHHHHHHHHHHHHHH EEE HHHHHHHHHHHHHH |
3chy-ITERATION-2|| PHD | HHHHHHEEEEEE HHHHHHHHHHHHHHHHHH EEE HHHHHHHHHHHHHH |
3chy-ITERATION-3|| PHD | HHHHHHHHHHHH HHHHHHHHHHHHHHHHHH EEE HHHHHHHHHHHHHH |
3chy-ITERATION-4|| PHD | HHHHH EEEEE HHHHHHHHHHHHHHHHH EEE HHHHHHHHHHHHHH |
3chy-ITERATION-5|| PHD | HHHHHHHH EEEEE HHHHHHHHHHHHHHHH EEE HHHHHHHHHHHHHH |
3chy-ITERATION-6|| PHD | HHHHHHHH EEEEE HHHHHHHHHHHHHHHH EEEE HHHHHHHHHHHHHH |
3chy-ITERATION-7|| PHD | HHHHHHHH EEEEEE HHHHHHHHHHHHHHHH EEE HHHHHHHHHHHHHH |
3chy-ITERATION-8|| PHD | HHHHHHHH EEEEE HHHHHHHHHHHHHHHH EEE HHHHHHHHHHHHHH |
3chy-ITERATION-9|| PHD | HHHHHHHH EEEEE HHHHHHHHHHHHHHH EEEE HHHHHHHHHHHHHH |
Flavodoxin-cheY multiple alignment/ secondary structure iteration
cheY SSEs
Integrating secondary structure prediction and multiple alignment
• Low key example
• But difficult
• How to scale up?
• Need new formalisms and technology
SnapDRAGON
Richard A. George
George R.A. and Heringa, J. (2002) J. Mol. Biol., 316, 839-851.
Integrating protein multiple alignment, secondary and tertiary structure prediction to predict structural domains in sequence data
The DEATH Domain
• Present in a variety of Eukaryotic proteins involved with cell death.
• Six helices enclose a tightly packed hydrophobic core.
• Some DEATH domains form homotypic and heterotypic dimers.
http://www.mshri.on.ca/pawson
Pyruvate kinase
Phosphotransferase
β barrel regulatory domain
α/β barrel catalytic substrate binding domain
nucleotide binding domain
1 continuous + 2 discontinuous domains
Structural domain organisation can be nasty…
Protein structure hierarchical levels
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH
PRIMARY STRUCTURE (amino acid sequence)
QUATERNARY STRUCTURE
SECONDARY STRUCTURE (helices, strands)
TERTIARY STRUCTURE (fold)
•The Cα distance matrix is divided into smaller clusters.
•Separately, each cluster is embedded into a local centroid.
•The final predicted structure is generated from full embedding of the multiple centroids and their corresponding local structures.
[Figure: SnapDRAGON input data. Labels: Multiple alignment; Predicted secondary structure (CCHHHCCEEE); Cα distance matrix; Target matrix; 100 randomised initial matrices; 100 predictions]
SnapDRAGON
Domains in structures assigned using method by Taylor (1997)
Domain boundary positions of each model against sequence
Summed and Smoothed Boundaries (Biased window protocol)
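The last step, summing the boundary positions found in the individual models and smoothing the result, can be sketched as below. This is an illustration only: it uses a plain moving average, not the biased window protocol named on the slide, and the function names, the 0-based positions and the toy boundary lists are all assumptions.

```python
def smoothed_boundary_profile(boundaries_per_model, seq_len, half_window=1):
    """Sum boundary calls per sequence position, then smooth with a moving average."""
    counts = [0] * seq_len
    for boundaries in boundaries_per_model:
        for pos in boundaries:  # 0-based sequence positions
            counts[pos] += 1
    profile = []
    for i in range(seq_len):
        lo = max(0, i - half_window)
        hi = min(seq_len, i + half_window + 1)
        profile.append(sum(counts[lo:hi]) / (hi - lo))
    return counts, profile


def strongest_boundary(profile):
    """Position with the highest smoothed score (first one wins on ties)."""
    return max(range(len(profile)), key=profile.__getitem__)


# Toy input: domain boundaries assigned in five hypothetical models of a 30-residue chain.
models = [[10], [11], [9], [10], [25]]
counts, profile = smoothed_boundary_profile(models, seq_len=30)
print(strongest_boundary(profile))  # 10
```

Smoothing lets boundary calls that fall a residue or two apart in different models reinforce each other, while an isolated call (position 25 here) stays weak.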
SnapDRAGON
•Predicting domain boundaries for a single average-size protein could take hours on a 128-node cluster computer with simplified significance testing.
•How to scale up to structural genomics? 30,000 human proteins at 1 hr each gives about 3.4 years.
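The scale-up estimate is simple arithmetic and easy to check. The sketch below assumes one cluster running around the clock with no downtime; the variable names are illustrative.

```python
proteins = 30_000
hours_per_protein = 1

total_hours = proteins * hours_per_protein
hours_per_year = 24 * 365
years = total_hours / hours_per_year

print(round(years, 2))  # 3.42, i.e. roughly three and a half years of continuous computing
```

Since proteins can be processed independently, running on k such clusters would divide this figure by k.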
What we still cannot do well
• “Give us sequence, we do rest” failed so far; e.g., number of human genes
• Gene prediction bad, RNA genes missed
• Protein structure/function prediction unsolved; we have no clue about function of 50% of human genes
• No theory of gene regulation
• Cannot well predict post-translational modification
• Many (database) solutions not generic
• We have no E=mc² so need to keep all data
• Integrating methods and data
• Understand biologically
Future Bioinformatics Research Topics
• Integration of knowledge
– We have some formalisms (ontologies, distributed databases) but we need to develop many completely new formalisms and new technologies beyond what we have now
Conclusions
• Getting important integrative Bioinformatics/Systems Biology applications onto the Grid through Gridlab can be significant
• Bioinformatics and genomics are getting clinical. Gridlab could play an important role
The end. Thanks
Future Bioinformatics Research Topics
Keywords morning session
• Integration of knowledge
– Information transfer from one object to another
– What are the rules
– From genotype to phenotypes, current algorithms and ontologies not sufficient
– Biological interpretation needs context
– DB maintenance is dynamic process, most info is static
– Need resources
– Environment should allow student to make method in 3 hours
• Genomics
– Identifying genetic elements still bad
– Collect easy primary biological facts
– Gene pred, struct pred, functional all unsolved
– Genetic “parts” list is incomplete and scanty
– Many omics “unknowme”
• Genomics– Hypothesis driven versus systematic approaches
– Need databases,algorithms, biol knowledge
– Data structures not suitable for complexity
– Solutions such as Ensembl not generic
– Need technologies beyond ontologies
– Need new formalisms to be able to do “vertical genomics”
• Systems Biology
– Very promising area
• Health
• Pharmaceuticals
• Biotechnology
• Environment
• (Medical) Systems Biology
– Diego di Bernardo
– Ilias Jakovidis
– Very promising area
Summary
• How can Europe regain ground
Hans Werner Mewes
• DNA contains all
• Identifying genetic elements still bad
• Collect easy primary biological facts
• Gene pred, struct pred, functional all unsolved
• Genetic “parts” list is incomplete and scanty
• Many omics “unknowme”
• Hypothesis driven versus systematic approaches
• Need databases,algorithms, biol knowledge
• Data structures not suitable for complexity
• Solutions such as Ensembl not generic
• Need technologies beyond ontologies
• Information transfer from one object to another
• What are the rules
• From genotype to phenotypes, current algorithms not sufficient
• Biological interpretation needs context
• DB maintenance is dynamic process, most info is static
• Need resources
• Environment should allow student to make method in 3 hours
Diego di Bernardo
• TIGEM: disease genes
• Bioinformatics and comp biol not on a par
• 81% of genome “genomics & databases” and 19% “genomics & algorithms”
• Important topics: regulation, network, digital signal processing, HMMs
• Problems: algorithms not biological and no experimental verification
• Bioinformatics helps design biological experiments
• Richard Durbin: value of physics and engineering
• Computational tools for discovery of novel objects
Ilias Jakovidis
• Medical informatics
• Health telematics
• eHealth
• Medical ontologies didn’t help Paul Schofield at all (tried with NCBI-big mess)
• Middleware includes ontologies so covers biology (IBM!)
• Language engineering
• Natural language in medicine, computerize medical community
• Biomedical informatics: applications in healthcare, how to get to clinical?
• Synergy between medicine and biology informatics
• Alfonso, med will dominate, lot of money with unclear methods
• Medical info has worked coherently, how can we do that? How can we change?
• Mewes: Bioinf has achieved usage, not med. Bioinf is entering clinics.
Gunnar von Heijne
• Databases should be funded
• Start problem for 5 years: and then what?
• With infrastructure this problem is less, so funding is relatively OK.
• Technology development should not become dominant
• Most biologists are small scale hypothesis driven
• Marketing problem
• From 19 bioinf methods, 15 are European in genomics
• Validation is not always key (Alfonso)
• EMBOSS project European wide, for algorithm driven research. EMBOSS is longstanding. But could not get funding from EC (no funding category)
Alfonso Valencia
• Often, 1 bioinformatician for everything
• Need of integration/collaboration
– Social, technical barriers
• People should realise that Bioinf and Bioinformaticians are very different
• Integrated (med) system
– Underfunded (1 postdoc)
– Difficult to develop
– Lack of standards and repositories
– Difficult to interact with biologist
– All these things essential
• 3-4 good bioinf groups in Spain
• Make virtual institute for bioinformatics
• There are few large groups with national funding
• There are few large groups with European funding
• There are many small groups with weak institutional funding
• Create framework valid for biology
• Interaction reduces overhead
• System access for biologists, point to the right expert
• Create new science beyond current needs
• This does not compete with basic needs
• Support strong European areas (e.g. protein interaction)
• Bioinformatics is a new discipline
• Who solves the problem, who is interested in solving it, and not always who qualifies to solve it (engineers, ...)
• Example “information extraction in molecular biology”: after years no real progress made.
• Systems Biology: what to do and how (no linear path), but we have opportunity to develop knowing
• Experimental validation: methods debug databases. Many proteins (90%?) have never seen an experiment
• Should bioinf talk to biology or vice versa?
Jean-Marie Claverie
• 1951 first protein sequence (insulin)
• Field has come of age, so outsiders shouldn’t tell us what to do and how
• Bioinformatics is part of the foundation
• Clear difference in application of informatics or bioinformatics
• Future will be different
• “Give us sequence, we do rest” failed! Number of human genes is example.
• Gene finding: standard genes good, RNA genes missed, no theory of gene transcription
• We have no E=mc² so need to keep data
• Computational biology is same as systems biology
• Good integration: E. coli Bioinf-project, find all genes in small bacterium. Inclusive project. Now good consortium.
• Bioinformatics becomes invisible for biologists (Blast).
Howard Bilofsky
• PRISM forum
• Provide challenges for (bio)informatics
• Drives Bioinf, omics, ... techniques