introduction to bioinformatics
DESCRIPTION
Introduction to Bioinformatics. ChBi406/506. Ozlem Keskin For today’s lectures Many slides from gersteinlab.org/courses/452 And Bioinformatics and Functional Genomics by Jonathan Pevsner (ISBN 0-471-21004-8). EMails. Ozlem Keskin [email protected] - PowerPoint PPT PresentationTRANSCRIPT
Introduction to BioinformaticsChBi406/506
Ozlem Keskin
For today’s lectures Many slides from gersteinlab.org/courses/452
And
Bioinformatics and Functional Genomics
by Jonathan Pevsner (ISBN 0-471-21004-8).
• People with very diverse backgrounds in
biology, chemical engineering (MS/BS)
• People with diverse backgrounds in computer
science -please visit Attila Hoca’s office!
• Most people have a favorite gene, protein, or disease
Who is taking this course?
• To provide an introduction to bioinformatics with a focus on the National Center for Biotechnology Information (NCBI) and EBI
• To focus on the analysis of DNA, RNA and proteins
• To introduce you to the analysis of genomes
• To combine theory and practice to help you solve research problems
What are the goals of the course?
Themes throughout the course
Textbooks
Web sites
Literature references
Gene/protein families
Computer labs
Textbook
The course textbook is J. Pevsner, Bioinformatics and Functional Genomics (Wiley, 2009).
Several other bioinformatics texts are available:Baxevanis and OuelletteMountDurbin et al.Lesk
In our library you will find (e-book)Bioinformatics [electronic resource] : sequence and genome analysis / David W. Mount.ImprintCold Spring Harbor, N.Y. : Cold Spring Harbor Laboratory Press, c2001.
Bioinformatics : a practical guide to the analysis of genes and proteins / editedby Andreas D. Baxevanis, B.F. Francis Ouellette.ImprintHoboken, N.J. : John Wiley, 2005. (SOON)
Themes throughout the course: Literature references
You are encouraged to read original sourcearticles. Although articles are not required,they will enhance your understanding of thematerial.
You can obtain articles through PubMedand Web of Science.
Web sites
The course website is reached via:http://pevsnerlab.kennedykrieger.org/bioinfo_course.htm(or Google “pevsnerlab” courses)This site contains the powerpoints for each lecture.
The textbook website is:http://www.bioinfbook.orgThis has 1000 URLs, organized by chapterThis site also contains the same powerpoints.
You will also find the lecture slides at F-folder.
Grading
Midterm 30%Final 35% HWs 20%Project 15%
(might change, the course will evolve)
Themes throughout the course:
gene/protein familiesWe will use beta globin and retinol-binding protein 4
(RBP4) as model genes/proteins throughout the course.
Globins including hemoglobin and myoglobin carry
oxygen. RBP4 is a member of the lipocalin family. It is a
small, abundant carrier protein. We will study globins and
lipocalins in a variety of contexts including
• --sequence alignment
• --gene expression
• --protein structure
• --phylogeny
• --homologs in various species
The HIV-1 pol gene encodes three proteins
Aspartylprotease
Reversetranscriptase
Integrase
PR RT IN
Outline for today (chapters 1 and 2)
Definition of bioinformatics
Overview of the NCBI website
Accessing information about DNA and proteins
--Definition of an accession number
--Four ways to find information on proteins and DNA
Access to biomedical literature
BioinformaticsBiological
DataComputer
Calculations+
• Interface of biology and computers
• Analysis of proteins, genes and genomes using computer algorithms and computer databases
• Genomics is the analysis of genomes. The tools of bioinformatics are used to make sense of the billions of base pairs of DNA that are sequenced by genomics projects.
Protein coordinates, DNA array data, annotated gene sequencesBiological information is being generated now days in parallel.We can easily run 10,000 simultaneous experiments on a single DNA microarray.
To cope with this much data we really need computers.So Bioinformatics is that field that combines biology and computers.
What is bioinformatics?
Where does Bioinformatics come from?
Data from the Human Genome Project has fueled the development of new bioinformatics methods
HGP
What is Bioinformatics?
• (Molecular) Bio - informatics• One idea for a definition?
Bioinformatics is conceptualizing biology in terms of molecules (in the sense of physical-chemistry) and then applying “informatics” techniques (derived from disciplines such as applied math, CS, and statistics) to understand and organize the information associated with these molecules, on a large-scale.
• Interface of biology and computers
Analysis of proteins, genes and genomes using computer algorithms and computer databases
Genomics is the analysis of genomes. The tools of bioinformatics are used to make sense of the billions of base pairs of DNA that are sequenced by genomics projects.•
Top ten challenges for bioinformatics
[1] Precise models of where and when transcription will occur in a genome (initiation and termination)
[2] Precise, predictive models of alternative RNA splicing
[3] Precise models of signal transduction pathways; ability to predict cellular responses to external stimuli
[4] Determining protein:DNA, protein:RNA, protein:protein recognition codes
[5] Accurate ab initio protein structure prediction
Top ten challenges for bioinformatics
[6] Rational design of small molecule inhibitors of proteins
[7] Mechanistic understanding of protein evolution
[8] Mechanistic understanding of speciation
[9] Development of effective gene ontologies: systematic ways to describe gene and protein function
[10] Education: development of bioinformatics curricula
Source: Ewan Birney, Chris Burge, Jim Fickett
Simulating the cell
On bioinformatics
“Science is about building causal relations between natural phenomena (for instance, between a mutation in a gene and a disease). The development of instruments to increase our capacity to observe natural phenomena has, therefore, played a crucial role in the development of science - the microscope being the paradigmatic example in biology. With the human genome, the natural world takes an unprecedented turn: it is better described as a sequence of symbols. Besides high-throughput machines such as sequencers and DNA chip readers, the computer and the associated software becomes the instrument to observe it, and the discipline of bioinformatics flourishes.
On bioinformatics
However, as the separation between us (the observers) and the phenomena observed increases (from organism to cell to genome, for instance), instruments may capture phenomena only indirectly, through the footprints they leave. Instruments therefore need to be calibrated: the distance between the reality and the observation (through the instrument) needs to be accounted for. This issue of Genome Biology is about calibrating instruments to observe gene sequences; more specifically, computer programs to identify human genes in the sequence of the human genome.”
Martin Reese and Roderic Guigó, Genome Biology 2006 7(Suppl I):S1,introducing EGASP, the Encyclopedia of DNA Elements (ENCODE) Genome Annotation Assessment Project
Tool-users
Tool-makers
bioinformatics
public healthinformatics
medicalinformatics
infrastructure
databases algorithms
Three perspectives on bioinformatics
The cell
The organism
The tree of life
Page 4
After Pace NR (1997) Science 276:734
Page 6
Time ofdevelopment
Body region, physiology, pharmacology, pathology
Page 5
DNA RNA phenotypeprotein
Page 5
DNA RNA phenotypeprotein
Growth of GenBank
Year
Bas
e p
airs
of
DN
A (
bil
lio
ns)
Seq
uen
ces
(mil
lio
ns)
Updated 8-12-04:>40b base pairs
1982 1986 1990 1994 1998 2002 Fig. 2.1Page 17
0
10
20
30
40
50
60
70
1985
Growth of GenBank
Bas
e p
airs
of
DN
A (
bil
lio
ns)
Seq
uen
ces
(mil
lio
ns)
200019951990
December1982
June2006
Growth of the International NucleotideSequence Database Collaboration
Bas
e p
airs
of
DN
A (
bil
lio
ns)
http://www.ncbi.nlm.nih.gov/Genbank/
Base pairs contributed by GenBank EMBL DDBJ
DNA RNA protein
Central dogma of molecular biology
genome transcriptome proteome
Central dogma of bioinformatics and genomics
What is the Information?Molecular Biology as an Information Science
• Central Dogmaof Molecular Biology DNA -> RNA -> Protein -> Phenotype -> DNA
• Molecules– Sequence, Structure, Function
• Processes– Mechanism, Specificity, Regulation
• Central Paradigmfor Bioinformatics
Genomic Sequence Information -> mRNA (level) -> Protein Sequence -> Protein Structure -> Protein Function -> Phenotype
• Large Amounts of Information– Standardized– Statistical
•Genetic material•Information transfer (mRNA)•Protein synthesis (tRNA/mRNA)•Some catalytic activity
•Most cellular functions are performed or facilitated by proteins.
•Primary biocatalyst
•Cofactor transport/storage
•Mechanical motion/support
•Immune protection
•Control of growth/differentiation
(idea from D Brutlag, Stanford, graphics from S Strobel)
• Proteins fold into 3D structures with specific functions which are reflected in a pheonotype.
• These functions are selected in a Darwinian sense by the environment of the phenotype.
• Which drives the evolution of the DNA sequence.
• Many Bioinformatics techniques address this flow of molecular biology information inside the organism hoping to understand the organization and control of genes even predicting protein structure from sequence.
• There is a second flow of information that bioinformatics seeks to address is the large amount of data generated by new high through methods. Bioinformatics owes its lively hood to the availability of large data sets that are too complex to allow manual analysis.
DNA RNA
cDNAESTsUniGene
phenotype
genomicDNAdatabases
protein sequence databases
protein
Fig. 2.2Page 20
GenBankEMBL DDBJ
There are three major public DNA databases
The underlying raw DNA sequences are identical
Page 16
GenBankEMBL DDBJ
Housedat EBI
EuropeanBioinformatics
Institute
There are three major public DNA databases
Housed at NCBINational
Center forBiotechnology
Information
Housed in Japan
Page 16
>100,000 species are represented in GenBank
all species 128,941
viruses 6,137
bacteria 31,262
archaea 2,100
eukaryota 87,147
Table 2-1Page 17
The most sequenced organisms in GenBank
Homo sapiens (6.9 million entries)Mus musculus (5.0 million)Zea mays (896,000)Rattus norvegicus (819,000)Gallus gallus (567,000)Arabidopsis thaliana (519,000)Danio rerio (492,000)Drosophila melanogaster (350,000)Oryza sativa (221,000)
National Center for BiotechnologyInformation (NCBI)
www.ncbi.nlm.nih.gov
Taxonomy nodes at NCBI
http://www.ncbi.nlm.nih.gov/Taxonomy/txstat.cgi8/06
The most sequenced organisms in GenBank
Homo sapiens 10.7 billion basesMus musculus 6.5bRattus norvegicus 5.6bDanio rerio 1.7bZea mays 1.4bOryza sativa 0.8bDrosophila melanogaster 0.7bGallus gallus 0.5bArabidopsis thaliana 0.5b
Updated 8-12-04GenBank release 142.0
Table 2-2Page 18
The most sequenced organisms in GenBank
Homo sapiens 11.2 billion basesMus musculus 7.5bRattus norvegicus 5.7bDanio rerio 2.1bBos taurus 1.9bZea mays 1.4bOryza sativa (japonica) 1.2bXenopus tropicalis 0.9bCanis familiaris 0.8bDrosophila melanogaster 0.7b
Updated 8-29-05GenBank release 149.0
Table 2-2Page 18
The most sequenced organisms in GenBank
Homo sapiens 12.3 billion basesMus musculus 8.0bRattus norvegicus 5.7bBos taurus 3.5bDanio rerio 2.5bZea mays 1.8bOryza sativa (japonica) 1.5bStrongylocentrotus purpurata 1.2bSus scrofa 1.0bXenopus tropicalis 1.0b
Updated 7-19-06GenBank release 154.0
Table 2-2Page 18
Molecular Biology Information - DNA
• Raw DNA Sequence– Coding or Not?– Parse into genes?– 4 bases: AGCT– ~1 K in a gene, ~2 M in genome– ~3 Gb Human
atggcaattaaaattggtatcaatggttttggtcgtatcggccgtatcgtattccgtgcagcacaacaccgtgatgacattgaagttgtaggtattaacgacttaatcgacgttgaatacatggcttatatgttgaaatatgattcaactcacggtcgtttcgacggcactgttgaagtgaaagatggtaacttagtggttaatggtaaaactatccgtgtaactgcagaacgtgatccagcaaacttaaactggggtgcaatcggtgttgatatcgctgttgaagcgactggtttattcttaactgatgaaactgctcgtaaacatatcactgcaggcgcaaaaaaagttgtattaactggcccatctaaagatgcaacccctatgttcgttcgtggtgtaaacttcaacgcatacgcaggtcaagatatcgtttctaacgcatcttgtacaacaaactgtttagctcctttagcacgtgttgttcatgaaactttcggtatcaaagatggtttaatgaccactgttcacgcaacgactgcaactcaaaaaactgtggatggtccatcagctaaagactggcgcggcggccgcggtgcatcacaaaacatcattccatcttcaacaggtgcagcgaaagcagtaggtaaagtattacctgcattaaacggtaaattaactggtatggctttccgtgttccaacgccaaacgtatctgttgttgatttaacagttaatcttgaaaaaccagcttcttatgatgcaatcaaacaagcaatcaaagatgcagcggaaggtaaaacgttcaatggcgaattaaaaggcgtattaggttacactgaagatgctgttgtttctactgacttcaacggttgtgctttaacttctgtatttgatgcagacgctggtatcgcattaactgattctttcgttaaattggtatc . . .
. . . caaaaatagggttaatatgaatctcgatctccattttgttcatcgtattcaacaacaagccaaaactcgtacaaatatgaccgcacttcgctataaagaacacggcttgtggcgagatatctcttggaaaaactttcaagagcaactcaatcaactttctcgagcattgcttgctcacaatattgacgtacaagataaaatcgccatttttgcccataatatggaacgttgggttgttcatgaaactttcggtatcaaagatggtttaatgaccactgttcacgcaacgactacaatcgttgacattgcgaccttacaaattcgagcaatcacagtgcctatttacgcaaccaatacagcccagcaagcagaatttatcctaaatcacgccgatgtaaaaattctcttcgtcggcgatcaagagcaatacgatcaaacattggaaattgctcatcattgtccaaaattacaaaaaattgtagcaatgaaatccaccattcaattacaacaagatcctctttcttgcacttgg
Molecular Biology Information: Protein Sequence
• 20 letter alphabet– ACDEFGHIKLMNPQRSTVWY but not BJOUXZ
• Strings of ~300 aa in an average protein (in bacteria), ~200 aa in a domain
• >1M known protein sequences (uniprot)
d1dhfa_ LNCIVAVSQNMGIGKNGDLPWPPLRNEFRYFQRMTTTSSVEGKQ-NLVIMGKKTWFSI d8dfr__ LNSIVAVCQNMGIGKDGNLPWPPLRNEYKYFQRMTSTSHVEGKQ-NAVIMGKKTWFSI d4dfra_ ISLIAALAVDRVIGMENAMPWN-LPADLAWFKRNTL--------NKPVIMGRHTWESI d3dfr__ TAFLWAQDRDGLIGKDGHLPWH-LPDDLHYFRAQTV--------GKIMVVGRRTYESF d1dhfa_ LNCIVAVSQNMGIGKNGDLPWPPLRNEFRYFQRMTTTSSVEGKQ-NLVIMGKKTWFSId8dfr__ LNSIVAVCQNMGIGKDGNLPWPPLRNEYKYFQRMTSTSHVEGKQ-NAVIMGKKTWFSId4dfra_ ISLIAALAVDRVIGMENAMPW-NLPADLAWFKRNTLD--------KPVIMGRHTWESId3dfr__ TAFLWAQDRNGLIGKDGHLPW-HLPDDLHYFRAQTVG--------KIMVVGRRTYESF
d1dhfa_ VPEKNRPLKGRINLVLSRELKEPPQGAHFLSRSLDDALKLTEQPELANKVDMVWIVGGSSVYKEAMNHPd8dfr__ VPEKNRPLKDRINIVLSRELKEAPKGAHYLSKSLDDALALLDSPELKSKVDMVWIVGGTAVYKAAMEKPd4dfra_ ---G-RPLPGRKNIILS-SQPGTDDRV-TWVKSVDEAIAACGDVP------EIMVIGGGRVYEQFLPKAd3dfr__ ---PKRPLPERTNVVLTHQEDYQAQGA-VVVHDVAAVFAYAKQHLDQ----ELVIAGGAQIFTAFKDDV d1dhfa_ -PEKNRPLKGRINLVLSRELKEPPQGAHFLSRSLDDALKLTEQPELANKVDMVWIVGGSSVYKEAMNHPd8dfr__ -PEKNRPLKDRINIVLSRELKEAPKGAHYLSKSLDDALALLDSPELKSKVDMVWIVGGTAVYKAAMEKPd4dfra_ -G---RPLPGRKNIILSSSQPGTDDRV-TWVKSVDEAIAACGDVPE-----.IMVIGGGRVYEQFLPKAd3dfr__ -P--KRPLPERTNVVLTHQEDYQAQGA-VVVHDVAAVFAYAKQHLD----QELVIAGGAQIFTAFKDDV
Molecular Biology Information:Macromolecular Structure
• DNA/RNA/Protein– Almost all protein
(RNA Adapted From D Soll Web Page, Right Hand Top Protein from M Levitt web page)
Molecular Biology Information:
Whole Genomes
• The Revolution Driving EverythingFleischmann, R. D., Adams, M. D., White, O., Clayton, R. A., Kirkness, E. F., Kerlavage, A. R., Bult, C. J., Tomb, J. F., Dougherty, B. A., Merrick, J. M., McKenney, K., Sutton, G., Fitzhugh, W., Fields, C., Gocayne, J. D., Scott, J., Shirley, R., Liu, L. I., Glodek, A., Kelley, J. M., Weidman, J. F., Phillips, C. A., Spriggs, T., Hedblom, E., Cotton, M. D., Utterback, T. R., Hanna, M. C., Nguyen, D. T., Saudek, D. M., Brandon, R. C., Fine, L. D., Fritchman, J. L., Fuhrmann, J. L., Geoghagen, N. S. M., Gnehm, C. L., McDonald, L. A., Small, K. V., Fraser, C. M., Smith, H. O. & Venter, J. C. (1995). "Whole-genome random sequencing and assembly of Haemophilus influenzae rd." Science 269: 496-512.
(Picture adapted from TIGR website, http://www.tigr.org)• Integrative Data
1995, HI (bacteria): 1.6 Mb & 1600 genes done
1997, yeast: 13 Mb & ~6000 genes for yeast
1998, worm: ~100Mb with 19 K genes
1999: >30 completed genomes!
2003, human: 3 Gb & 100 K genes...
Genome sequence now accumulate so quickly that, in less than a week, a single laboratory can produce more bits of data than Shakespeare managed in a lifetime, although the latter make better reading.
-- G A Pekso, Nature 401: 115-116 (1999)
Genomes highlight the Finiteness
of the “Parts” in Biology
Bacteria, 1.6 Mb,
~1600 genes [Science 269: 496]
Eukaryote, 13 Mb,
~6K genes [Nature 387: 1]
1995
1997
1998
Animal, ~100 Mb,
~20K genes [Science 282:
1945]
Human, ~3 Gb, ~100K
genes [???]
2000?
real thing, Apr ‘00
‘98 spoof Core
Other Types of Data• Gene Expression
– Early experiments yeast
– Now tiling array technology• 50 M data points to tile the human genome at ~50 bp res.
– Can only sequence genome once but can do an infinite variety of array experiments
• Phenotype Experiments
• Protein Interactions– For yeast: 6000 x 6000 / 2 ~ 18M possible interactions
– maybe 30K real
Weber Cartoon
Bioinformatics is born!
(courtesy of Finn Drablos)
Major Application I:Designing Drugs
• Understanding How Structures Bind Other Molecules (Function)
• Designing Inhibitors• Docking, Structure Modeling
(From left to right, figures adapted from Olsen Group Docking Page at Scripps, Dyson NMR Group Web page at Scripps, and from Computational Chemistry Page at Cornell Theory Center).
Core
Major Application II: Finding Homologs
Core
Major Application II:Finding Homologues
• Find Similar Ones in Different Organisms• Human vs. Mouse vs. Yeast
– Easier to do Expts. on latter!
(Section from NCBI Disease Genes Database Reproduced Below.)
Best Sequence Similarity Matches to Date Between Positionally ClonedHuman Genes and S. cerevisiae Proteins
Human Disease MIM # Human GenBank BLASTX Yeast GenBank Yeast Gene Gene Acc# for P-value Gene Acc# for Description Human cDNA Yeast cDNA
Hereditary Non-polyposis Colon Cancer 120436 MSH2 U03911 9.2e-261 MSH2 M84170 DNA repair proteinHereditary Non-polyposis Colon Cancer 120436 MLH1 U07418 6.3e-196 MLH1 U07187 DNA repair proteinCystic Fibrosis 219700 CFTR M28668 1.3e-167 YCF1 L35237 Metal resistance proteinWilson Disease 277900 WND U11700 5.9e-161 CCC2 L36317 Probable copper transporterGlycerol Kinase Deficiency 307030 GK L13943 1.8e-129 GUT1 X69049 Glycerol kinaseBloom Syndrome 210900 BLM U39817 2.6e-119 SGS1 U22341 HelicaseAdrenoleukodystrophy, X-linked 300100 ALD Z21876 3.4e-107 PXA1 U17065 Peroxisomal ABC transporterAtaxia Telangiectasia 208900 ATM U26455 2.8e-90 TEL1 U31331 PI3 kinaseAmyotrophic Lateral Sclerosis 105400 SOD1 K00065 2.0e-58 SOD1 J03279 Superoxide dismutaseMyotonic Dystrophy 160900 DM L19268 5.4e-53 YPK1 M21307 Serine/threonine protein kinaseLowe Syndrome 309000 OCRL M88162 1.2e-47 YIL002C Z47047 Putative IPP-5-phosphataseNeurofibromatosis, Type 1 162200 NF1 M89914 2.0e-46 IRA2 M33779 Inhibitory regulator protein
Choroideremia 303100 CHM X78121 2.1e-42 GDI1 S69371 GDP dissociation inhibitorDiastrophic Dysplasia 222600 DTD U14528 7.2e-38 SUL1 X82013 Sulfate permeaseLissencephaly 247200 LIS1 L13385 1.7e-34 MET30 L26505 Methionine metabolismThomsen Disease 160800 CLC1 Z25884 7.9e-31 GEF1 Z23117 Voltage-gated chloride channelWilms Tumor 194070 WT1 X51630 1.1e-20 FZF1 X67787 Sulphite resistance proteinAchondroplasia 100800 FGFR3 M58051 2.0e-18 IPL1 U07163 Serine/threoinine protein kinaseMenkes Syndrome 309400 MNK X69208 2.1e-17 CCC2 L36317 Probable copper transporter
Major Application I|I:Overall Genome Characterization• Overall Occurrence of a
Certain Feature in the Genome
– e.g. how many kinases in Yeast
• Compare Organisms and Tissues
– Expression levels in Cancerous vs Normal Tissues
• Databases, Statistics
(Clock figures, yeast v. Synechocystis, adapted from GeneQuiz Web Page, Sander Group, EBI)
Core
What do you get from large-scale data mining? Global
statistics on the population of proteins
0.00
0.10
0.20
0.30
0.40
0.50
0.60
0.70
MP MG EC SC HP SS HI MT MJ AF AA OT
Mesophile Thermophile
LO
D v
alu
e
EK(3)
EK(4)
10 to 45
Physiological temperature in C
98 65 85 83 95
EX-2: Occurrence of 1-4 salt bridges in genomes
of thermophiles v mesophiles
EX-1: Occurrence of functions per fold & interactions per fold
over all genomes
http://www.nature.com/nature/journal/vaop/ncurrent/pdf/nature07331.pdf
End of First Lecture 2009
Remaining Slides Lecture 2
Organizing Molecular Biology
Information:Redundancy and
Multiplicity
• Different Sequences Have the Same Structure
• Organism has many similar genes• Single Gene May Have Multiple
Functions• Genes are grouped into Pathways• Genomic Sequence Redundancy due to
the Genetic Code• How do we find the similarities?....
Integrative Genomics - genes structures functions pathways expression levels regulatory systems ….
Core
'Omics: studying populations of molecules in a
database frameworkOme
molecular group
Genome
Proteome
Transcriptome
Phenome
Interactome
Metabolome
Physiome
Orfeome
Secretome
Morphome
Glycome
Regulome
Functome
Cellome
Transportome
Ribonome
Operome
'Omics: studying populations of molecules in a
database frameworkOme Google
molecular group Hits
Genome5820000
0
Proteome 1850000
Transcriptome 707000
Phenome 418000
Interactome 87500
Metabolome 80700
Physiome 56300
Orfeome 29800
Secretome 23900
Morphome 11400
Glycome 995
Regulome 618
Functome 390
Cellome 246
Transportome 155
Ribonome 131
Operome 57
'Omics: studying populations of molecules in a
database frameworkOme Google PubMed PubMed
molecular group Hits Hits First year
Genome5820000
0 537993 1953
Proteome 1850000 6005 1995
Transcriptome 707000 1665 1997
Phenome 418000 53 1989
Interactome 87500 87 1999
Metabolome 80700 182 1998
Physiome 56300 41 1997
Orfeome 29800 25 2002
Secretome 23900 48 2000
Morphome 11400 2 2000
Glycome 995 34 2000
Regulome 618 6 2004
Functome 390 1 2001
Cellome 246 17 2002
Transportome 155 1 2004
Ribonome 131 1 2002
Operome 57 0
Pub
Med
Hits Proteome
Core
A Parts List Approach to Bike Maintenance
Extra
A Parts List Approach to Bike Maintenance
What are the shared parts (bolt, nut, washer, spring, bearing), unique parts (cogs, levers)? What are the common parts -- types of parts (nuts & washers)?
How many roles can these play? How flexible and adaptable are they mechanically?
Where are the parts
located?
Extra
Core
Molecular Parts = Conserved Domains, Folds, &c
Vast Growth in (Structural) Data...
but number of Fundamentally New
(Fold) Parts Not Increasing that Fast
New Submissions
New Folds
Total in Databank
~1000 folds
~100000 genes
~1000 genes1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 …
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 …
(human)
(T. pallidum)
World of Structures is even more Finite,
providing a valuable simplification
Same logic for pathways, functions, sequence families, blocks, motifs....
Global Surveys of a Finite Set of Parts from Many Perspectives
Functions picture from www.fruitfly.org/~suzi (Ashburner); Pathways picture from, ecocyc.pangeasystems.com/ecocyc (Karp, Riley). Related resources: COGS, ProDom, Pfam, Blocks, Domo, WIT, CATH, Scop....
What is Bioinformatics?
• (Molecular) Bio - informatics• One idea for a definition?
Bioinformatics is conceptualizing biology in terms of molecules (in the sense of physical-chemistry) and then applying “informatics” techniques (derived from disciplines such as applied math, CS, and statistics) to understand and organize the information associated with these molecules, on a large-scale.
• Structural Bioinformatics is a practical discipline with many applications that deals with biological three dimensional structural data.
General Types of “Informatics” techniques
in Bioinformatics• Databases
Building, Querying Object DB
• Text String Comparison Text Search 1D Alignment Significance Statistics Google, grep
• Finding Patterns AI / Machine Learning Clustering Datamining
• Geometry Robotics Graphics (Surfaces, Volumes) Comparison and 3D Matching
(Vision, recognition)
• Physical Simulation Newtonian Mechanics Electrostatics Numerical Algorithms Simulation
Bioinformatics Topics -- Genome Sequence
• Finding Genes in Genomic DNA introns exons promotors
• Characterizing Repeats in Genomic DNA Statistics Patterns
• Duplications in the Genome Large scale genomic alignment
• Whole-Genome Comparisons
• Finding Structural RNAs
Bioinformatics Topics --
Protein Sequence
• Sequence Alignment non-exact string matching,
gaps How to align two strings
optimally via Dynamic Programming
Local vs Global Alignment Suboptimal Alignment Hashing to increase speed
(BLAST, FASTA) Amino acid substitution
scoring matrices
• Multiple Alignment and Consensus Patterns How to align more than one
sequence and then fuse the result in a consensus representation
Transitive Comparisons HMMs, Profiles Motifs
• Scoring schemes and Matching statistics How to tell if a given
alignment or match is statistically significant
A P-value (or an e-value)? Score Distributions
(extreme val. dist.) Low Complexity Sequences
• Evolutionary Issues Rates of mutation and
change
Bioinformatics Topics --
Sequence / Structure
• Secondary Structure “Prediction” via Propensities Neural Networks, Genetic
Alg. Simple Statistics TM-helix finding Assessing Secondary
Structure Prediction
• Structure Prediction: Protein v RNA
• Tertiary Structure Prediction Fold Recognition Threading Ab initio
• Function Prediction Active site identification
• Relation of Sequence Similarity to Structural Similarity
Problems in Protein Bioinformatics
• Prediction of structure from sequence Fold recognition Fragment construction
• Proteome annotation
• Protein-protein docking
Protein folding code
Proteinfoldingcode
Proteinstructure
Protein sequence
Prediction of correct fold
Query sequence
Foldrecognition
Eisenberg et al.Jones, Taylor, Thornton
Matchedfold
Match sequence against library of known folds
Computational Requirements
• 1 sequence search takes 12 mins (3Ghz)
• Benchmarking on 100 proteins with 100 runs for a simplex search of parameter space = 80 days
• 30 approaches explored = 7 years (on 1 cpu)