Feb. 25, 2004 World University Network - Worldwide Broadcast
The Future of Bioinformatics (with examples from structural bioinformatics)
Philip E. Bourne
The University of California San Diego
[email protected]
http://www.sdsc.edu/pb/Talks
Outline
• Bioinformatics thus far
• Today – a growth discipline
• Drivers
  • Data
  • Complexity – biological and data
  • The interface to medical informatics and systems biology
• Challenges
  • The devil is in the details
  • Quality control
  • Fundamentals versus relevance to biology
"You can observe a lot just by watching."
Bioinformatics Thus Far – Pre 1970
Bioinformatics (2003) 19 2176-2190
1945 Biochemical Pathways – Horowitz
1953 Structure of DNA – W&C
1969 Genetic Variation
1953 Game Theory – von Neumann and Morgenstern
1959 Grammars – Chomsky
1962 Information Theory – Shannon & Weaver
1966 Cellular Automata – von Neumann
1962 Molecular Homology – Florkin
1965 Evolutionary Patterns – Pauling
1966 Molecular Modeling – Levinthal
1967 Phylogenetic Trees – Fitch
1969 Properties – Ptitsyn
1970 Dynamic Programming – N&W
Bioinformatics Thus Far – 1970’s
Problem Definition
• Improved Sequence Alignments – Sankoff
• Structural Patterns and Properties – Richards
• Smith Waterman Algorithm
• Exons/Introns – Gilbert
• Structure Prediction – Levitt; Chou and Fasman; Scheraga
• Public Resources – Dayhoff, PDB
Bioinformatics Thus Far – 1980’s
Computational Biology Emerges
• Domains recognized – Rashin
• Tree of Life Emerges
• FASTA – Lipman & Pearson
• Profiles – Gribskov
• Reductionism begins – Thornton, Sander
• Neural nets – Hopfield
• Molecular computing – Conrad
• Nanotechnology – Drexler
• Clustering – Shepard
• Relational Databases
• Networks – EMBLnet, BIONET
Bioinformatics Thus Far – 1990’s
Bioinformatics and Biotechnology Emerge
• Human Genome Project
• Internet/Web
So What is Bioinformatics Today?
• A relatively new term for a scientific endeavor that has been around much longer
• Medical informatics preceded it, and defined some of the foundations?
• A scientific endeavor driven out of a paradigm shift in which biology became a data-driven science
• A scientific endeavor that has gained from fundamental developments in computer and information science, e.g., algorithms, ontologies, Bayesian networks, neural networks, text mining …
• A growth discipline…
"Do you mean now?" -- When asked for the time.
Bioinformatics - A Vice Chancellor’s View
[Figure: data growth and biological complexity against year (90, 95, 00, 05). Flow: Biological Experiment → Data → Information → Knowledge → Discovery, with the activities Collect, Characterize, Compare, Model, Infer. Complexity levels: Sequence, Structure, Assembly, Sub-cellular, Cellular, Organ, Higher-life. Other axes: Computing Power, Sequencing Technology, Data (1 to 100000), # People/Web Site (10^6, 10^2, 1). Plotted milestones: Human Genome Project; E. coli, C. elegans and yeast genomes; 1 small genome/mo.; ESTs; gene chips; virus structure; ribosome; model metabolic pathway of E. coli; genetic circuits; neuronal modeling; cardiac modeling; brain mapping; human genome. (C) Copyright Phil Bourne 1998]
Growth in Bioinformatics as Measured by ISMB Attendance
[Figure: ISMB attendance by year; the labeled point is 1500 attendees at ISMB 2002, Edmonton, Canada. Source: http://www.iscb.org/history.shtml]
Growth in the Journal Bioinformatics
[Figure: two charts for the journal Bioinformatics, 1997-2003: submissions per year (axis 0 to 1400) and impact factor (axis 0 to 5).]
Drivers – Data Growth and Data Complexity
Consider Macromolecular Structure as an example
Bourne Bioinformatics Editorial 1999 15(9):715
“Over the next 5 years there will be an estimated 10 major structural genomics efforts each yielding 200 structures per year. While these efforts will deplete regular structure determination efforts, improvements in technology and a general expansion of the field will continue to yield 50 structures per week worldwide outside of the structural genomics initiatives.”
Net result 35,000 structures by 2005
"You can observe a lot just by watching."
There were 11,000 structures at the time of this prediction
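The arithmetic behind the prediction can be reproduced from the figures quoted on this slide. The sketch below is a back-of-the-envelope check, not the editorial's own calculation; the 52-week year and the variable names are assumptions.

```python
# Figures quoted on the slide (1999 editorial).
structures_at_prediction = 11_000   # structures in the PDB when the prediction was made
years = 5
sg_efforts = 10                     # major structural genomics efforts
sg_structures_per_effort = 200      # structures per effort per year
other_structures_per_week = 50      # outside the structural genomics initiatives
weeks_per_year = 52                 # assumption

sg_per_year = sg_efforts * sg_structures_per_effort              # 2,000
other_per_year = other_structures_per_week * weeks_per_year      # 2,600
projected = structures_at_prediction + years * (sg_per_year + other_per_year)

print(projected)  # 34000
```

Under these assumptions the total comes to 34,000, which the slide rounds to 35,000.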
PDB Growth Curve
Approx. 24,000 structures today
In 2003 approx. 5,000 structures were deposited
History
Predictions Can Be Good
A Data Centric View of the Future
• Data complexity
• High throughput data collection
• Database versus literature
• Bioinformatics as data driver
• Data representation
• Data integration
"If you come to a fork in the road, take it."
Numbers and Complexity
Complexity is increasing
[Figure: (a) myoglobin (b) hemoglobin (c) lysozyme (d) transfer RNA (e) antibodies (f) viruses (g) actin (h) the nucleosome (i) myosin (j) ribosome. Courtesy of David Goodsell, TSRI]
Complexity - The Ribosome: A Nanomachine
"The ribosome, together with its accessories, is probably the most sophisticated machine ever made." R. Garrett (1999) Nature 400
• Translates mRNA into protein
• Molecular Mass: 2.6 million
• Maximum Dimension ~25 nm
• 2/3 RNA – performs catalysis
• 1/3 protein – outer scaffold for the RNA
[Figure: ribosome with protein, mRNA and the 30S and 50S subunits labeled. Figure from J. Frank, Wadsworth Center, NY]
High Throughput - The Structural Genomics Pipeline (X-ray Crystallography)
Basic Steps
• Target Selection
• Crystallomics: Isolation, Expression, Purification, Crystallization
• Data Collection
• Structure Solution
• Structure Refinement
• Functional Annotation
• Publish
Bioinformatics Throughout the Process
• Bioinformatics: distant homologs, domain recognition
• Automation; Bioinformatics: empirical rules
• Automation; better sources
• Software integration; decision support
• MAD phasing; automated fitting
• Bioinformatics: alignments, protein-protein interactions, protein-ligand interactions, motif recognition
• No?
An Aside on the Future of Publishing
Full Description Captured as the Paper/Database is Written/Deposited
Does away with ...
"… the p53 core domain structure consists of a β sandwich that serves as a scaffold for two large loops and a loop-sheet-helix motif ..." ---- Science Vol. 265, p346
1TSR: corresponding structure from the PDB
? Oops!
β sandwich? Where? Large loop? Which one?? Loop-sheet-helix???
BioEditor - A DTD Driven Domain Specific Editor
http://bioeditor.sdsc.edu
Bourne et al. 2004 Pacific Symposium on Biocomputing
http://www-smi.stanford.edu/projects/helix/psb04/bourne.doc
Structural Genomics Targets and their Status from http://targetdb.rcsb.org
The Data - Bioinformatics Cycle
Result – Computation and Experiment Become More Synergistic
[Figure: cycle between Data and Bioinformatics, labeled "Turn Data into Knowledge" and "Turn Knowledge into New Data Requirements"]
Deuterium Exchange Mass Spec to Predict Structure
[Figure: workflow with the labels DXMS; COREX; Target Protein; Sequence; Structure Templates (X-ray or NMR; CASP: homology, threading, ab initio, others); Amino Acid vs. Stability profile; Profile Match Method; Best Structure(s)]
Biological Representation
• The Gene Ontology changes everything
  • Molecular function
  • Biochemical process
  • Cellular location
  • DAG – machine usable
• The number of papers referencing the Gene Ontology has increased dramatically in the last year
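What "DAG – machine usable" buys in practice: because an ontology is a directed acyclic graph, a program can walk from a specific term up to every more general term it implies, even when a term has more than one parent. The sketch below uses a toy graph; the term names and edges are illustrative assumptions and are not copied from the Gene Ontology.

```python
from typing import Dict, List, Set

# Toy "is_a" graph: child term -> parent terms. Illustrative only, not real GO content.
PARENTS: Dict[str, List[str]] = {
    "protein kinase activity": ["kinase activity", "protein-modifying activity"],
    "kinase activity": ["catalytic activity"],
    "protein-modifying activity": ["catalytic activity"],
    "catalytic activity": ["molecular function"],
    "molecular function": [],
}


def ancestors(term: str, parents: Dict[str, List[str]]) -> Set[str]:
    """Return every term reachable by following parent links upward from `term`."""
    seen: Set[str] = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


print(sorted(ancestors("protein kinase activity", PARENTS)))
```

A gene product annotated with the most specific term is then automatically retrievable by a query on any of its ancestors, which is what makes annotation against a DAG more useful to software than free text.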
Biological Data Representation Future
• Tools to construct ontologies from free text?
• Ontologies for details of function, protein-protein interaction, protocols, complete pathway information
Data Integration
Web Services – the holy grail of interoperability?
Web Services
• It's not CORBA – biologists can do it
• Easy to implement
• Platform independent
• Driver to force data providers to define and publish a detailed API
• Compelling - introduces the prospect of global workflow
Perl Web Services Client Example
A small Perl program to access all PubMed abstracts containing the word ‘ferritin’
use SOAP::Lite;
# Call the remote pubmedAbstractQuery method with the keyword given on the command line
$ids_ref = SOAP::Lite
  -> uri('http://server.location.edu/pdbWebServices')
  -> proxy('http://server.location.edu/pdbWebServices')
  -> pubmedAbstractQuery($ARGV[0])
  -> result;
# The result is an array reference; dereference it and print the IDs
@ids = @{$ids_ref};
print "@ids\n";
Mycomputer(1)% web_service.pl ferritin
1AEW 1AQO 1BCF 1BFR 1BG7 1DPS 1EUM 1FHA 1JGC 1JI5 1JIG 1MFR 1QGH 1RCC 1RCD 1RCE 1RCG 1RCI 1RYT 2FHA
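For readers who do not use Perl, the request that such a client puts on the wire can be sketched with the Python standard library. This is a minimal illustration of a SOAP 1.1 envelope only: the server URL is the slide's placeholder, the parameter element name `keyword` is an assumption, and sending the request and decoding the reply are left out.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"      # SOAP 1.1 envelope namespace
SERVICE_NS = "http://server.location.edu/pdbWebServices"    # placeholder URL from the slide


def build_request(keyword: str) -> bytes:
    """Build a SOAP 1.1 envelope that calls pubmedAbstractQuery with one argument."""
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}pubmedAbstractQuery")
    arg = ET.SubElement(call, "keyword")  # hypothetical parameter name
    arg.text = keyword
    return ET.tostring(envelope, encoding="utf-8")


print(build_request("ferritin").decode("utf-8"))
```

The method call is just an XML element named after the remote function, which is much of why the slide can describe web services as easy to implement and platform independent.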
A Biological Complexity Perspective
[Figure: scientific research & discovery across levels of biological organization (Organisms; Organs; Cells; Macromolecules/Biopolymers; Atoms & Molecules). Each level is shown with a representative discipline (Anatomy; Physiology; Cell Biology; Proteomics, Genomics; Medicinal Chemistry), example units and a representative technology. Remaining labels: MRI; Heart; Neuron; Structure, Sequence; Protease Inhibitor; Electron Microscopy; Migratory Sensors; Ventricular Modeling; X-ray Crystallography; Protein Docking. Side labels: Technologies, Training, Infrastructure. Marker: You Are Here]
The Post-Genomic Era
Genomes → Gene Products → Structure & Function → Pathways & Physiology
The “New” Central Dogma
~ Scientific Challenges - Deciphering the genome, mapping the genotype-phenotype relationships, dissecting organismic function, engineering organisms with altered functionality, figuring out complex traits and polymorphism, understanding physiology.
~ Algorithmic Challenges - Comparisons of whole and partial genomes, metrics for similarity and homology, metabolic reconstruction, dissecting pathways, and whole cell modeling.
~ Computational Challenges - Creation of the informatics infrastructure; creation, annotation, curation and dissemination of databases; development of parallel computational methods.
Interaction Networks
A Protein Interaction Map of Drosophila melanogaster
L. Giot, et al. Science, Vol. 302, Issue 5651, 1727-1736, December 5, 2003
Phenomena in biological systems may be organized in several layers.
• Populations
  • Ecological Communities
  • Populations of a Species
• Physiology and Organisms
  • Integrative physiology, Homeostasis
  • Organs, Tissues
  • Cells
• Pathways and Information Transfer
  • Integrated metabolism, regulatory, developmental pathways
  • Simple pathways for information transfer, regulation, development
  • Simple metabolic pathways for creating & using other molecules
• Biological Macromolecules and Structures
  • Biomolecular Assemblies; ligand-receptor complexes
  • Molecules and Structures created by genes, gene products
  • Gene Products: RNAs; Proteins
  • Genes and Genomes
• Physics and Chemistry
  • e.g. Physical Chemistry, Organic Chemistry, Information theory, Constraints of self-assembling adaptive systems
Each system layer builds from lower system layers & acquires new emergent properties
[Figure: the same layers (Populations; Physiology and Organisms; Pathways and Information Transfer; Biological Macromolecules and Structures; Physics and Chemistry) shown beside a diagram labeled New Emergent Properties, with the boxes Physics and Chemistry; Genes and Genomes; Biomolecular Structure & Function; Biochemical Pathways & Processes; Developmental & Physiological Processes; Tissue & Organismal Physiology; Ecological Processes & Populations]
The Next Response
• Translational medicine
• Personalized medicine
• Merger of medical, chem and bioinformatics
• Training in cooperative in silico and experimental research
• Centers that reflect that training, i.e. different to NCBI or EBI
"Think! How the hell are you gonna think and hit at the same time?"
Statement of the Director, NIGMS, before the House Appropriations Subcommittee on Labor, HHS, Education, Thursday, February 25, 1999
Near Term Challenges
Better Resources and Algorithms
Current Data Resources and Algorithms are Challenged by Biological Complexity
• Our understanding of biological complexity is not reflected in the current generation of biological data resources
• Hence these resources do not enable the next generation
• Algorithms are often limited since complexity implies variation
• Consider an example - the protein kinase-like superfamily
The SCOP Classification Hierarchy
[Figure: the SCOP hierarchy. Courtesy Steven Brenner]
SCOP Root
Class: α/β, α+β
Fold: TIM β/α-barrel; NAD(P)-binding Rossmann; Cellulases
Superfamily: TIM; PLP-binding barrel; (Trans)glycosidases
Family: β-Glucanase; α-Amylase (N); β-Amylase
PDB Domains: 1e43 (a:1-393) d1e43a2; 1e3x (a:1-393) d1e3xa2; 1e3z (a:1-393) d1e3za2
Side labels: Related by homology; Determined by structure
An Example of a Structural Superfamily: The Protein Kinase-Like Superfamily
Superfamily: not all members are eukaryotic or protein kinases; some homologues discovered in bacteria phosphorylate antibiotics, others phosphorylate lipids
Typical Kinase Core (c-Src, PDB ID: 2SRC)
SCOP grouping for kinases
1) Class: Alpha+Beta
2) Fold: Protein Kinase Catalytic Core
3) Superfamily: Protein Kinase Catalytic Core
4) Families:
a) Ser/Thr Kinases
b) Tyr Kinases
c) Atypical Kinases
d) Antibiotic Kinases
e) Lipid Kinases
Evolution of the Kinase Superfamily: Comparison of Three Superfamily Members
•A: Casein kinase 1 (PDB ID: 1CSN)
•B: Aminoglycoside kinase (PDB ID: 1J7L)
•C: Phosphatidylinositol 3-kinase (PDB ID: 1E8X).
•D: The previous three structures with only their shared region superposed (1CSN: light blue, 1J7L: red, 1E8X: yellow).
•The three kinases share a minimal core required for ATP binding and phosphotransfer.
Our Algorithms Need to Continue to Evolve
Consider structure comparison and alignment of the diverse protein kinases
An Example of Manual vs. Automated with Combinatorial Extension (CE)
•The manual alignment can be used to better understand the limitations of our automated method
•Alignment of helix C of two tyrosine kinases
•Insulin Receptor Kinase (PDB ID: 1IR3)
•c-Src (PDB ID: 2SRC)
•Can be aligned with 40% identity, 3.0 Å RMSD
•In Src, the C-helix is displaced and rotated outward
•Rotation pushes the N-terminal end of the helix out very far from the N-terminal end of IRK
•CE gaps a part of this (yellow), splitting the helix and aligning part of IRK helix C with the loop leading to helix C in Src
Orange: IRK, Blue: c-Src, Yellow: CE gap region
An Example of Manual vs. Automated with CE
•A closer look:
•The CE alignment puts closer C-alpha positions together but does not respect helical relationships
•Hand alignment respects helix, aligns more distant C-alpha positions
CE alignment Hand alignment
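The quantity being traded off here is the root-mean-square deviation (RMSD) over paired C-alpha positions. The sketch below computes it for a given pairing, assuming the two structures have already been superposed; it does not perform the superposition and is not the CE algorithm. The coordinates are made up for illustration.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def rmsd(a: Sequence[Point], b: Sequence[Point]) -> float:
    """RMSD in Å between paired coordinates: a[i] is aligned with b[i]."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = sum(
        (x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
        for (x1, y1, z1), (x2, y2, z2) in zip(a, b)
    )
    return math.sqrt(total / len(a))


# Toy, already-superposed C-alpha positions (Å); the values are invented.
helix_a = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
helix_b = [(0.0, 0.0, 4.0), (1.5, 0.0, 4.0), (3.0, 0.0, 4.0)]
print(rmsd(helix_a, helix_b))  # 4.0
```

An optimizer that only minimizes this number will prefer whichever residue pairing brings atoms closest, which is how an automated method can split a displaced helix that a hand alignment keeps intact.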
Improving CEfam: Multiple Alignments with CE
•Example with strands 1 and 2 of kinase superfamily
•A: original
•B: optimal parameters
•C: manual
•Parameters also improved results with other protein superfamilies in visual analysis
•Just as sequence alignments are benchmarked against structure alignments, structure alignments should be benchmarked to manual results
•Improvement in optimization is now being folded into the next generation of CE
Near Term Challenges - Quality Control
Consider an example
The definition of domains from 3-D structure
The 3D Domain Assignment Problem
A domain is a fundamental structural, functional and evolutionary unit of a protein:
• Compact
• Stable
• Has a hydrophobic core
• Folds independently
• Performs a specific function
• Can be re-shuffled and put together in different combinations
• Evolution works at the level of the domain
Exact assignment of domains remains a difficult and unresolved problem.
There is no complete agreement among experts on domain assignment given a protein structure.
Expert methods agree on 80% of all existing manual assignments; the remaining 20% represent “difficult” cases
Expert assignment #1
Expert assignment #2
Expert assignment #3
Manual vs. automatic consensuses: do they overlap?
• Manual and automatic consensus agree: 328 chains (77.3% of chains with consensus)
• Automatic consensus only: 46 chains (10.9% of chains with consensus)
• Manual consensus only: 47 chains (11.1% of chains with consensus)
• Automatic consensus and manual consensus disagree: 3 chains (0.7% of chains with consensus)
Chains with manual consensus: 375 (80% of entire dataset)
Chains with automatic consensus: 374 (80% of entire dataset)
Chains with consensus (automatic or manual): 424 (90.6% of entire dataset)
Veretnik et al. 2003 JMB submitted
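The four categories on this slide can be checked against the stated total with a few lines of arithmetic. The sketch below uses only the counts and percentages shown above; the dataset size is not given on the slide, so the value derived from "424 = 90.6%" is an inference.

```python
# Chain counts and percentages as shown on the slide.
slide_counts = {
    "manual and automatic consensus agree": 328,
    "automatic consensus only": 46,
    "manual consensus only": 47,
    "manual and automatic consensus disagree": 3,
}
slide_percent = {
    "manual and automatic consensus agree": 77.3,
    "automatic consensus only": 10.9,
    "manual consensus only": 11.1,
    "manual and automatic consensus disagree": 0.7,
}

total = sum(slide_counts.values())  # chains with any consensus
for name, count in slide_counts.items():
    share = 100 * count / total
    print(f"{name}: {count} chains, {share:.1f}% (slide: {slide_percent[name]}%)")

# Inferred, not stated on the slide: dataset size if 424 chains are 90.6% of it.
implied_dataset = total / 0.906
print(total, round(implied_dataset))
```

The counts sum to the 424 chains with consensus, and each recomputed share is within 0.1 percentage point of the figure on the slide.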
1cjaa (actin-fragmin kinase, slime mold): an unusual kinase [complex interface]
Assignments shown, left to right: 1 domain; 1 domain + unassigned; 4 domains
Methods shown, left to right: DALI; CATH; SCOP, PDP, DomainParser
typical kinase
Near Term Challenges – High Throughput
integrated Genomic Annotation Pipeline - iGAP
[Flowchart; boxes listed in the order extracted]
Inputs: sequence info (deduced protein sequences; NR, PFAM) and structure info (SCOP, PDB, used to build FOLDLIB)
• Deduced protein sequences
• Prediction of: signal peptides (SignalP, PSORT); transmembrane (TMHMM, PSORT); coiled coils (COILS); low complexity regions (SEG)
• Structural assignment of domains by PSI-BLAST on FOLDLIB
• Structural assignment of domains by 123D on FOLDLIB
• Create PSI-BLAST profiles for protein sequences
• Store assigned regions in the DB
• Functional assignment by PFAM, NR, PSIPred assignments
• Domain location prediction by sequence
Arrow label (appears twice): Only sequences w/out A-prediction
Building FOLDLIB:
• PDB chains
• SCOP domains
• PDP domains
• CE matches PDB vs. SCOP
• 90% sequence non-identical
• minimum size 25 aa
• coverage (90%, gaps <30, ends <30)
integrated Genomic Annotation Pipeline - iGAP
[Figure: the same iGAP flowchart, annotated with scale and cost]
~800 genomes @ 10k-20k per ≈ 10^7 ORFs
FOLDLIB: 10^4 entries
CPU-year labels on the pipeline steps: 4; 228; 3; 9; 252; 3
Li, et al., (2003) Genome Biology
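The scale estimate on this slide is a one-line calculation. The sketch below reproduces it from the slide's own round figures (about 800 genomes, 10k to 20k open reading frames each); the variable names are illustrative.

```python
import math

genomes = 800
orfs_per_genome_low = 10_000
orfs_per_genome_high = 20_000

low = genomes * orfs_per_genome_low     # 8,000,000
high = genomes * orfs_per_genome_high   # 16,000,000

print(low, high)
print(round(math.log10(low), 1), round(math.log10(high), 1))  # 6.9 7.2
```

Both ends of the range sit at roughly 10^7 ORFs, which is why annotating them all is a problem measured in CPU years.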
Towards Workflows and the Grid
[Figure: iGAP described in XML (Executables, Parameters, Input, Output, Resources) and handed to APST (Data Manager, Compute Manager, Scheduler, Grid Resource Information). Resources: Storage, Compute. Grid Middleware labels: MDS/NWS/Ganglia; SSH/GRAM/GASS; PBS/LoadLeveler/Condor; SCP/GASS/SRB/FTP]
THE EOL GRID CONSORTIUM
[Figure: EOL at the center, linked to Industrial Partners (IBM, Ceres); Titech, Japan; SDSC (Blue Horizon, The EOL Cluster, Sun Enterprise Server); BII, Singapore; Encyclopedia Proteomics Inc.]
Near Term Challenges –
We need to overcome the “high noon” problem
High Noon – A Working Definition
[Figure: clock showing 12:00]
The cost:benefit ratio of entry to bioinformatics tools and resources is too high for the majority of biologists
Thus, those who could gain and contribute most from the services provided are not users
One Approach - MBT
Java toolkit for developing custom molecular visualization applications
High-quality interactive rendering of:
• sequence
• structure
• function
http://mbt.sdsc.edu
Feb. 25, 2004Feb. 25, 2004 World University Network - Worldwide BroadcastWorld University Network - Worldwide Broadcast
MBT Functionality
Provides:
Data loading
  Local files (PDB, mmCIF, FASTA, etc.)
  Compressed files (zip, gzip)
  Remote (http, ftp, OpenMMS?, EJB?)
Efficient data access
  Raw data
  Derived data (StructureMap)
Visualization (plug-in viewers)
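As a rough illustration of the "local files" and "compressed files" loading paths listed above, the following Python sketch reads ATOM records from a PDB-format file that may or may not be gzip-compressed. It is not MBT code (MBT is a Java toolkit); the function names `open_text` and `read_atom_records` are hypothetical, and the only assumptions are the gzip magic bytes and the fixed-column layout of PDB ATOM records.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def open_text(path):
    """Open a plain or gzip-compressed text file, detected by its magic bytes."""
    with open(path, "rb") as handle:
        magic = handle.read(2)
    if magic == GZIP_MAGIC:
        return gzip.open(path, "rt", encoding="ascii")
    return open(path, "r", encoding="ascii")


def read_atom_records(path):
    """Return (atom name, residue name, chain ID) for each ATOM line."""
    records = []
    with open_text(path) as handle:
        for line in handle:
            if line.startswith("ATOM"):
                # PDB fixed columns: atom name 13-16, residue name 18-20, chain ID 22.
                atom_name = line[12:16].strip()
                residue_name = line[17:20].strip()
                chain_id = line[21]
                records.append((atom_name, residue_name, chain_id))
    return records
```

Detecting compression from the file contents rather than the file extension means the same call works for `entry.pdb` and `entry.pdb.gz`; zip archives and remote sources would need separate handling.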
MBT Architecture
Future - The Structure Should Be the User Interface
Ligand - What other entries contain this?
Chain - What other entries have chains with >90% sequence identity?
Residue - What is the environment of this residue?
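The chain query above depends on a definition of sequence identity. A minimal Python sketch follows, assuming the two chains have already been aligned to equal length with `-` marking gaps, and that identity is counted over the positions where neither sequence has a gap. Other definitions exist (for example, dividing by the shorter sequence length), and the function names here are hypothetical rather than part of any existing tool.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over aligned positions where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    compared = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # skip gap positions (an assumption of this sketch)
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def chains_above_threshold(query, candidates, threshold=90.0):
    """Names of candidate chains whose identity to the query exceeds the threshold."""
    return [
        name
        for name, sequence in candidates.items()
        if percent_identity(query, sequence) > threshold
    ]
```

In practice the alignment step dominates the cost, so a query like this over a whole archive would normally rely on precomputed sequence clusters rather than pairwise comparison at request time.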
Ongoing and Longer-Term Challenges
Outstanding Problems in Sequence Analysis & Comparison
Exon recognition
Protein coding gene modeling
Protein/EST alignment
Large scale sequence comparison and alignment
Synteny recognition
Polymorphism / variation detection
Regulatory pattern recognition
Repetitive DNA characterization
RNA gene modeling
Exemplar Bioinformatics Problems
1. Full genome comparisons
2. Rapid assessment of polymorphic variations
3. Complete construction of orthologous and paralogous groups
4. Structure resolution of large assemblies/complexes
5. Dynamical simulation of realistic systems
6. Rapid structural/topological clustering of proteins
7. Protein folding
8. Computer simulation of membrane insertion
9. Simulation of cellular pathways / sensitivity analysis of pathway stoichiometry and kinetics
Bringing the Data View and the Complexity View Together to Define the Bioinformatics “Engineering” Challenge

Easy access to any type of biological data across databases
Ability to go across databases and types of data
Rapidly infer knowledge from new genome sequences
Find relationships between sequence, structure and function of gene products
Relate genotype to phenotype in species
Access and apply polymorphism data seamlessly

A single computer interface (Web browser?)
Computer platform independence
Total opaqueness of format differences
Compute in a point-and-click mode
Seamless access to files, file uploads and downloads
Multimedia capabilities on the interface
Ability to integrate new tools/databases painlessly
Acknowledgements
To all those who have chosen bioinformatics as a career and make the field so rich
Particularly those who do so for lesser rewards – the data providers and annotators
My group for the fun we had discussing this topic
http://rinkworks.com/said/yogiberra.shtml
"I didn't really say everything I said."