introduction to bioinformatics

Introduction to BioinformaticsChBi406/506

Ozlem Keskin

For today’s lectures Many slides from gersteinlab.org/courses/452

And

Bioinformatics and Functional Genomics

by Jonathan Pevsner (ISBN 0-471-21004-8).

Ozlem Keskin [email protected]

Engin Cukuroglu [email protected]

EMails

mailto:[email protected]

mailto:[email protected]

• People with very diverse backgrounds in

biology, chemical engineering (MS/BS)

• People with diverse backgrounds in computer

science -please visit Attila Hoca’s office!

• Most people have a favorite gene, protein, or disease

Who is taking this course?

• To provide an introduction to bioinformatics with a focus on the National Center for Biotechnology Information (NCBI) and EBI

• To focus on the analysis of DNA, RNA and proteins

• To introduce you to the analysis of genomes

• To combine theory and practice to help you solve research problems

What are the goals of the course?

Themes throughout the course

Textbooks

Web sites

Literature references

Gene/protein families

Computer labs

Textbook

The course textbook is J. Pevsner, Bioinformatics and Functional Genomics (Wiley, 2009).

Several other bioinformatics texts are available:Baxevanis and OuelletteMountDurbin et al.Lesk

In our library you will find (e-book)Bioinformatics [electronic resource] : sequence and genome analysis / David W. Mount.ImprintCold Spring Harbor, N.Y. : Cold Spring Harbor Laboratory Press, c2001.

Bioinformatics : a practical guide to the analysis of genes and proteins / editedby Andreas D. Baxevanis, B.F. Francis Ouellette.ImprintHoboken, N.J. : John Wiley, 2005. (SOON)

Themes throughout the course: Literature references

You are encouraged to read original sourcearticles. Although articles are not required,they will enhance your understanding of thematerial.

You can obtain articles through PubMedand Web of Science.

Web sites

The course website is reached via:http://pevsnerlab.kennedykrieger.org/bioinfo_course.htm(or Google “pevsnerlab” courses)This site contains the powerpoints for each lecture.

The textbook website is:http://www.bioinfbook.orgThis has 1000 URLs, organized by chapterThis site also contains the same powerpoints.

You will also find the lecture slides at F-folder.

Grading

Midterm 30%Final 35% HWs 20%Project 15%

(might change, the course will evolve)

Themes throughout the course:

gene/protein familiesWe will use beta globin and retinol-binding protein 4

(RBP4) as model genes/proteins throughout the course.

Globins including hemoglobin and myoglobin carry

oxygen. RBP4 is a member of the lipocalin family. It is a

small, abundant carrier protein. We will study globins and

lipocalins in a variety of contexts including

• --sequence alignment

• --gene expression

• --protein structure

• --phylogeny

• --homologs in various species

The HIV-1 pol gene encodes three proteins

Aspartylprotease

Reversetranscriptase

Integrase

PR RT IN

Outline for today (chapters 1 and 2)

Definition of bioinformatics

Overview of the NCBI website

Accessing information about DNA and proteins

--Definition of an accession number

--Four ways to find information on proteins and DNA

Access to biomedical literature

BioinformaticsBiological

DataComputer

Calculations+

• Interface of biology and computers

• Analysis of proteins, genes and genomes using computer algorithms and computer databases

• Genomics is the analysis of genomes. The tools of bioinformatics are used to make sense of the billions of base pairs of DNA that are sequenced by genomics projects.

Protein coordinates, DNA array data, annotated gene sequencesBiological information is being generated now days in parallel.We can easily run 10,000 simultaneous experiments on a single DNA microarray.

To cope with this much data we really need computers.So Bioinformatics is that field that combines biology and computers.

What is bioinformatics?

Where does Bioinformatics come from?

Data from the Human Genome Project has fueled the development of new bioinformatics methods

What is Bioinformatics?

• (Molecular) Bio - informatics• One idea for a definition?

Bioinformatics is conceptualizing biology in terms of molecules (in the sense of physical-chemistry) and then applying “informatics” techniques (derived from disciplines such as applied math, CS, and statistics) to understand and organize the information associated with these molecules, on a large-scale.

• Interface of biology and computers

Analysis of proteins, genes and genomes using computer algorithms and computer databases

Genomics is the analysis of genomes. The tools of bioinformatics are used to make sense of the billions of base pairs of DNA that are sequenced by genomics projects.•

Top ten challenges for bioinformatics

[1] Precise models of where and when transcription will occur in a genome (initiation and termination)

[2] Precise, predictive models of alternative RNA splicing

[3] Precise models of signal transduction pathways; ability to predict cellular responses to external stimuli

[4] Determining protein:DNA, protein:RNA, protein:protein recognition codes

[5] Accurate ab initio protein structure prediction

Top ten challenges for bioinformatics

[6] Rational design of small molecule inhibitors of proteins

[7] Mechanistic understanding of protein evolution

[8] Mechanistic understanding of speciation

[9] Development of effective gene ontologies: systematic ways to describe gene and protein function

[10] Education: development of bioinformatics curricula

Source: Ewan Birney, Chris Burge, Jim Fickett

Simulating the cell

On bioinformatics

“Science is about building causal relations between natural phenomena (for instance, between a mutation in a gene and a disease). The development of instruments to increase our capacity to observe natural phenomena has, therefore, played a crucial role in the development of science - the microscope being the paradigmatic example in biology. With the human genome, the natural world takes an unprecedented turn: it is better described as a sequence of symbols. Besides high-throughput machines such as sequencers and DNA chip readers, the computer and the associated software becomes the instrument to observe it, and the discipline of bioinformatics flourishes.

On bioinformatics

However, as the separation between us (the observers) and the phenomena observed increases (from organism to cell to genome, for instance), instruments may capture phenomena only indirectly, through the footprints they leave. Instruments therefore need to be calibrated: the distance between the reality and the observation (through the instrument) needs to be accounted for. This issue of Genome Biology is about calibrating instruments to observe gene sequences; more specifically, computer programs to identify human genes in the sequence of the human genome.”

Martin Reese and Roderic Guigó, Genome Biology 2006 7(Suppl I):S1,introducing EGASP, the Encyclopedia of DNA Elements (ENCODE) Genome Annotation Assessment Project

Tool-users

Tool-makers

bioinformatics

public healthinformatics

medicalinformatics

infrastructure

databases algorithms

Three perspectives on bioinformatics

The cell

The organism

The tree of life

Page 4

After Pace NR (1997) Science 276:734

Page 6

Time ofdevelopment

Body region, physiology, pharmacology, pathology

Page 5

DNA RNA phenotypeprotein

Page 5

DNA RNA phenotypeprotein

Growth of GenBank

Year

Bas

e p

airs

of

DN

A (

bil

lio

ns)

Seq

uen

ces

(mil

lio

ns)

Updated 8-12-04:>40b base pairs

1982 1986 1990 1994 1998 2002 Fig. 2.1Page 17

0

10

20

30

40

50

60

70

1985

Growth of GenBank

Bas

e p

airs

of

DN

A (

bil

lio

ns)

Seq

uen

ces

(mil

lio

ns)

200019951990

December1982

June2006

Growth of the International NucleotideSequence Database Collaboration

Bas

e p

airs

of

DN

A (

bil

lio

ns)

http://www.ncbi.nlm.nih.gov/Genbank/

Base pairs contributed by GenBank EMBL DDBJ

DNA RNA protein

Central dogma of molecular biology

genome transcriptome proteome

Central dogma of bioinformatics and genomics

What is the Information?Molecular Biology as an Information Science

• Central Dogmaof Molecular Biology DNA -> RNA -> Protein -> Phenotype -> DNA

• Molecules– Sequence, Structure, Function

• Processes– Mechanism, Specificity, Regulation

• Central Paradigmfor Bioinformatics

Genomic Sequence Information -> mRNA (level) -> Protein Sequence -> Protein Structure -> Protein Function -> Phenotype

• Large Amounts of Information– Standardized– Statistical

•Genetic material•Information transfer (mRNA)•Protein synthesis (tRNA/mRNA)•Some catalytic activity

•Most cellular functions are performed or facilitated by proteins.

•Primary biocatalyst

•Cofactor transport/storage

•Mechanical motion/support

•Immune protection

•Control of growth/differentiation

(idea from D Brutlag, Stanford, graphics from S Strobel)

ku user

Proteins fold into 3D structures with specific functions which are reflected in a pheonotype.These functions are selected in a Darwinian sense by the environment of the phenotype.Which drives the evolution of the DNA sequence.Many Bioinformatics techniques address this flow of molecular biology information inside the organism hoping to understand the organization and control of genes even predicting protein structure from sequence.There is a second flow of information that bioinformatics seeks to address is the large amount of data generated by new high through methods. Bioinformatics owes its lively hood to the availability of large data sets that are too complex to allow manual analysis.

• Proteins fold into 3D structures with specific functions which are reflected in a pheonotype.

• These functions are selected in a Darwinian sense by the environment of the phenotype.

• Which drives the evolution of the DNA sequence.

• Many Bioinformatics techniques address this flow of molecular biology information inside the organism hoping to understand the organization and control of genes even predicting protein structure from sequence.

• There is a second flow of information that bioinformatics seeks to address is the large amount of data generated by new high through methods. Bioinformatics owes its lively hood to the availability of large data sets that are too complex to allow manual analysis.

DNA RNA

cDNAESTsUniGene

phenotype

genomicDNAdatabases

protein sequence databases

protein

Fig. 2.2Page 20

GenBankEMBL DDBJ

There are three major public DNA databases

The underlying raw DNA sequences are identical

Page 16

GenBankEMBL DDBJ

Housedat EBI

EuropeanBioinformatics

Institute

There are three major public DNA databases

Housed at NCBINational

Center forBiotechnology

Information

Housed in Japan

Page 16

>100,000 species are represented in GenBank

all species 128,941

viruses 6,137

bacteria 31,262

archaea 2,100

eukaryota 87,147

Table 2-1Page 17

The most sequenced organisms in GenBank

Homo sapiens (6.9 million entries)Mus musculus (5.0 million)Zea mays (896,000)Rattus norvegicus (819,000)Gallus gallus (567,000)Arabidopsis thaliana (519,000)Danio rerio (492,000)Drosophila melanogaster (350,000)Oryza sativa (221,000)

National Center for BiotechnologyInformation (NCBI)

www.ncbi.nlm.nih.gov

Taxonomy nodes at NCBI

http://www.ncbi.nlm.nih.gov/Taxonomy/txstat.cgi8/06


Homo sapiens 10.7 billion basesMus musculus 6.5bRattus norvegicus 5.6bDanio rerio 1.7bZea mays 1.4bOryza sativa 0.8bDrosophila melanogaster 0.7bGallus gallus 0.5bArabidopsis thaliana 0.5b

Updated 8-12-04GenBank release 142.0

Table 2-2Page 18


Homo sapiens 11.2 billion basesMus musculus 7.5bRattus norvegicus 5.7bDanio rerio 2.1bBos taurus 1.9bZea mays 1.4bOryza sativa (japonica) 1.2bXenopus tropicalis 0.9bCanis familiaris 0.8bDrosophila melanogaster 0.7b


Table 2-2Page 18


Homo sapiens 12.3 billion basesMus musculus 8.0bRattus norvegicus 5.7bBos taurus 3.5bDanio rerio 2.5bZea mays 1.8bOryza sativa (japonica) 1.5bStrongylocentrotus purpurata 1.2bSus scrofa 1.0bXenopus tropicalis 1.0b


Table 2-2Page 18

Molecular Biology Information - DNA

• Raw DNA Sequence– Coding or Not?– Parse into genes?– 4 bases: AGCT– ~1 K in a gene, ~2 M in genome– ~3 Gb Human

atggcaattaaaattggtatcaatggttttggtcgtatcggccgtatcgtattccgtgcagcacaacaccgtgatgacattgaagttgtaggtattaacgacttaatcgacgttgaatacatggcttatatgttgaaatatgattcaactcacggtcgtttcgacggcactgttgaagtgaaagatggtaacttagtggttaatggtaaaactatccgtgtaactgcagaacgtgatccagcaaacttaaactggggtgcaatcggtgttgatatcgctgttgaagcgactggtttattcttaactgatgaaactgctcgtaaacatatcactgcaggcgcaaaaaaagttgtattaactggcccatctaaagatgcaacccctatgttcgttcgtggtgtaaacttcaacgcatacgcaggtcaagatatcgtttctaacgcatcttgtacaacaaactgtttagctcctttagcacgtgttgttcatgaaactttcggtatcaaagatggtttaatgaccactgttcacgcaacgactgcaactcaaaaaactgtggatggtccatcagctaaagactggcgcggcggccgcggtgcatcacaaaacatcattccatcttcaacaggtgcagcgaaagcagtaggtaaagtattacctgcattaaacggtaaattaactggtatggctttccgtgttccaacgccaaacgtatctgttgttgatttaacagttaatcttgaaaaaccagcttcttatgatgcaatcaaacaagcaatcaaagatgcagcggaaggtaaaacgttcaatggcgaattaaaaggcgtattaggttacactgaagatgctgttgtttctactgacttcaacggttgtgctttaacttctgtatttgatgcagacgctggtatcgcattaactgattctttcgttaaattggtatc . . .

. . . caaaaatagggttaatatgaatctcgatctccattttgttcatcgtattcaacaacaagccaaaactcgtacaaatatgaccgcacttcgctataaagaacacggcttgtggcgagatatctcttggaaaaactttcaagagcaactcaatcaactttctcgagcattgcttgctcacaatattgacgtacaagataaaatcgccatttttgcccataatatggaacgttgggttgttcatgaaactttcggtatcaaagatggtttaatgaccactgttcacgcaacgactacaatcgttgacattgcgaccttacaaattcgagcaatcacagtgcctatttacgcaaccaatacagcccagcaagcagaatttatcctaaatcacgccgatgtaaaaattctcttcgtcggcgatcaagagcaatacgatcaaacattggaaattgctcatcattgtccaaaattacaaaaaattgtagcaatgaaatccaccattcaattacaacaagatcctctttcttgcacttgg

Molecular Biology Information: Protein Sequence

• 20 letter alphabet– ACDEFGHIKLMNPQRSTVWY but not BJOUXZ

• Strings of ~300 aa in an average protein (in bacteria), ~200 aa in a domain

• >1M known protein sequences (uniprot)

d1dhfa_ LNCIVAVSQNMGIGKNGDLPWPPLRNEFRYFQRMTTTSSVEGKQ-NLVIMGKKTWFSI d8dfr__ LNSIVAVCQNMGIGKDGNLPWPPLRNEYKYFQRMTSTSHVEGKQ-NAVIMGKKTWFSI d4dfra_ ISLIAALAVDRVIGMENAMPWN-LPADLAWFKRNTL--------NKPVIMGRHTWESI d3dfr__ TAFLWAQDRDGLIGKDGHLPWH-LPDDLHYFRAQTV--------GKIMVVGRRTYESF d1dhfa_ LNCIVAVSQNMGIGKNGDLPWPPLRNEFRYFQRMTTTSSVEGKQ-NLVIMGKKTWFSId8dfr__ LNSIVAVCQNMGIGKDGNLPWPPLRNEYKYFQRMTSTSHVEGKQ-NAVIMGKKTWFSId4dfra_ ISLIAALAVDRVIGMENAMPW-NLPADLAWFKRNTLD--------KPVIMGRHTWESId3dfr__ TAFLWAQDRNGLIGKDGHLPW-HLPDDLHYFRAQTVG--------KIMVVGRRTYESF

d1dhfa_ VPEKNRPLKGRINLVLSRELKEPPQGAHFLSRSLDDALKLTEQPELANKVDMVWIVGGSSVYKEAMNHPd8dfr__ VPEKNRPLKDRINIVLSRELKEAPKGAHYLSKSLDDALALLDSPELKSKVDMVWIVGGTAVYKAAMEKPd4dfra_ ---G-RPLPGRKNIILS-SQPGTDDRV-TWVKSVDEAIAACGDVP------EIMVIGGGRVYEQFLPKAd3dfr__ ---PKRPLPERTNVVLTHQEDYQAQGA-VVVHDVAAVFAYAKQHLDQ----ELVIAGGAQIFTAFKDDV d1dhfa_ -PEKNRPLKGRINLVLSRELKEPPQGAHFLSRSLDDALKLTEQPELANKVDMVWIVGGSSVYKEAMNHPd8dfr__ -PEKNRPLKDRINIVLSRELKEAPKGAHYLSKSLDDALALLDSPELKSKVDMVWIVGGTAVYKAAMEKPd4dfra_ -G---RPLPGRKNIILSSSQPGTDDRV-TWVKSVDEAIAACGDVPE-----.IMVIGGGRVYEQFLPKAd3dfr__ -P--KRPLPERTNVVLTHQEDYQAQGA-VVVHDVAAVFAYAKQHLD----QELVIAGGAQIFTAFKDDV

Molecular Biology Information:Macromolecular Structure

• DNA/RNA/Protein– Almost all protein

(RNA Adapted From D Soll Web Page, Right Hand Top Protein from M Levitt web page)

Molecular Biology Information:

Whole Genomes

• The Revolution Driving EverythingFleischmann, R. D., Adams, M. D., White, O., Clayton, R. A., Kirkness, E. F., Kerlavage, A. R., Bult, C. J., Tomb, J. F., Dougherty, B. A., Merrick, J. M., McKenney, K., Sutton, G., Fitzhugh, W., Fields, C., Gocayne, J. D., Scott, J., Shirley, R., Liu, L. I., Glodek, A., Kelley, J. M., Weidman, J. F., Phillips, C. A., Spriggs, T., Hedblom, E., Cotton, M. D., Utterback, T. R., Hanna, M. C., Nguyen, D. T., Saudek, D. M., Brandon, R. C., Fine, L. D., Fritchman, J. L., Fuhrmann, J. L., Geoghagen, N. S. M., Gnehm, C. L., McDonald, L. A., Small, K. V., Fraser, C. M., Smith, H. O. & Venter, J. C. (1995). "Whole-genome random sequencing and assembly of Haemophilus influenzae rd." Science 269: 496-512.

(Picture adapted from TIGR website, http://www.tigr.org)• Integrative Data

1995, HI (bacteria): 1.6 Mb & 1600 genes done

1997, yeast: 13 Mb & ~6000 genes for yeast

1998, worm: ~100Mb with 19 K genes

1999: >30 completed genomes!

2003, human: 3 Gb & 100 K genes...

Genome sequence now accumulate so quickly that, in less than a week, a single laboratory can produce more bits of data than Shakespeare managed in a lifetime, although the latter make better reading.

-- G A Pekso, Nature 401: 115-116 (1999)

Genomes highlight the Finiteness

of the “Parts” in Biology

Bacteria, 1.6 Mb,

~1600 genes [Science 269: 496]

Eukaryote, 13 Mb,

~6K genes [Nature 387: 1]

1995

1997

1998

Animal, ~100 Mb,

~20K genes [Science 282:

1945]

Human, ~3 Gb, ~100K

genes [???]

2000?

real thing, Apr ‘00

‘98 spoof Core

Other Types of Data• Gene Expression

– Early experiments yeast

– Now tiling array technology• 50 M data points to tile the human genome at ~50 bp res.

– Can only sequence genome once but can do an infinite variety of array experiments

• Phenotype Experiments

• Protein Interactions– For yeast: 6000 x 6000 / 2 ~ 18M possible interactions

– maybe 30K real

Weber Cartoon

Bioinformatics is born!

(courtesy of Finn Drablos)

Major Application I:Designing Drugs

• Understanding How Structures Bind Other Molecules (Function)

• Designing Inhibitors• Docking, Structure Modeling

(From left to right, figures adapted from Olsen Group Docking Page at Scripps, Dyson NMR Group Web page at Scripps, and from Computational Chemistry Page at Cornell Theory Center).

Core

Major Application II: Finding Homologs

Core

Major Application II:Finding Homologues

• Find Similar Ones in Different Organisms• Human vs. Mouse vs. Yeast

– Easier to do Expts. on latter!

(Section from NCBI Disease Genes Database Reproduced Below.)

Best Sequence Similarity Matches to Date Between Positionally ClonedHuman Genes and S. cerevisiae Proteins

Human Disease MIM # Human GenBank BLASTX Yeast GenBank Yeast Gene Gene Acc# for P-value Gene Acc# for Description Human cDNA Yeast cDNA

Hereditary Non-polyposis Colon Cancer 120436 MSH2 U03911 9.2e-261 MSH2 M84170 DNA repair proteinHereditary Non-polyposis Colon Cancer 120436 MLH1 U07418 6.3e-196 MLH1 U07187 DNA repair proteinCystic Fibrosis 219700 CFTR M28668 1.3e-167 YCF1 L35237 Metal resistance proteinWilson Disease 277900 WND U11700 5.9e-161 CCC2 L36317 Probable copper transporterGlycerol Kinase Deficiency 307030 GK L13943 1.8e-129 GUT1 X69049 Glycerol kinaseBloom Syndrome 210900 BLM U39817 2.6e-119 SGS1 U22341 HelicaseAdrenoleukodystrophy, X-linked 300100 ALD Z21876 3.4e-107 PXA1 U17065 Peroxisomal ABC transporterAtaxia Telangiectasia 208900 ATM U26455 2.8e-90 TEL1 U31331 PI3 kinaseAmyotrophic Lateral Sclerosis 105400 SOD1 K00065 2.0e-58 SOD1 J03279 Superoxide dismutaseMyotonic Dystrophy 160900 DM L19268 5.4e-53 YPK1 M21307 Serine/threonine protein kinaseLowe Syndrome 309000 OCRL M88162 1.2e-47 YIL002C Z47047 Putative IPP-5-phosphataseNeurofibromatosis, Type 1 162200 NF1 M89914 2.0e-46 IRA2 M33779 Inhibitory regulator protein

Choroideremia 303100 CHM X78121 2.1e-42 GDI1 S69371 GDP dissociation inhibitorDiastrophic Dysplasia 222600 DTD U14528 7.2e-38 SUL1 X82013 Sulfate permeaseLissencephaly 247200 LIS1 L13385 1.7e-34 MET30 L26505 Methionine metabolismThomsen Disease 160800 CLC1 Z25884 7.9e-31 GEF1 Z23117 Voltage-gated chloride channelWilms Tumor 194070 WT1 X51630 1.1e-20 FZF1 X67787 Sulphite resistance proteinAchondroplasia 100800 FGFR3 M58051 2.0e-18 IPL1 U07163 Serine/threoinine protein kinaseMenkes Syndrome 309400 MNK X69208 2.1e-17 CCC2 L36317 Probable copper transporter

Major Application I|I:Overall Genome Characterization• Overall Occurrence of a

Certain Feature in the Genome

– e.g. how many kinases in Yeast

• Compare Organisms and Tissues

– Expression levels in Cancerous vs Normal Tissues

• Databases, Statistics

(Clock figures, yeast v. Synechocystis, adapted from GeneQuiz Web Page, Sander Group, EBI)

Core

What do you get from large-scale data mining? Global

statistics on the population of proteins

0.00

0.10

0.20

0.30

0.40

0.50

0.60

0.70

MP MG EC SC HP SS HI MT MJ AF AA OT

Mesophile Thermophile

LO

D v

alu

e

EK(3)

EK(4)

10 to 45

Physiological temperature in C

98 65 85 83 95

EX-2: Occurrence of 1-4 salt bridges in genomes

of thermophiles v mesophiles

EX-1: Occurrence of functions per fold & interactions per fold

over all genomes

http://www.nature.com/nature/journal/vaop/ncurrent/pdf/nature07331.pdf

End of First Lecture 2009

Remaining Slides Lecture 2

Organizing Molecular Biology

Information:Redundancy and

Multiplicity

• Different Sequences Have the Same Structure

• Organism has many similar genes• Single Gene May Have Multiple

Functions• Genes are grouped into Pathways• Genomic Sequence Redundancy due to

the Genetic Code• How do we find the similarities?....

Integrative Genomics - genes structures functions pathways expression levels regulatory systems ….

Core

'Omics: studying populations of molecules in a

database frameworkOme

molecular group

Genome

Proteome

Transcriptome

Phenome

Interactome

Metabolome

Physiome

Orfeome

Secretome

Morphome

Glycome

Regulome

Functome

Cellome

Transportome

Ribonome

Operome


database frameworkOme Google

molecular group Hits

Genome5820000

0

Proteome 1850000

Transcriptome 707000

Phenome 418000

Interactome 87500

Metabolome 80700

Physiome 56300

Orfeome 29800

Secretome 23900

Morphome 11400

Glycome 995

Regulome 618

Functome 390

Cellome 246

Transportome 155

Ribonome 131

Operome 57


database frameworkOme Google PubMed PubMed

molecular group Hits Hits First year

Genome5820000

0 537993 1953

Proteome 1850000 6005 1995

Transcriptome 707000 1665 1997

Phenome 418000 53 1989

Interactome 87500 87 1999

Metabolome 80700 182 1998

Physiome 56300 41 1997

Orfeome 29800 25 2002

Secretome 23900 48 2000

Morphome 11400 2 2000

Glycome 995 34 2000

Regulome 618 6 2004

Functome 390 1 2001

Cellome 246 17 2002

Transportome 155 1 2004

Ribonome 131 1 2002

Operome 57 0

Pub

Med

Hits Proteome

Core

A Parts List Approach to Bike Maintenance

Extra

A Parts List Approach to Bike Maintenance

What are the shared parts (bolt, nut, washer, spring, bearing), unique parts (cogs, levers)? What are the common parts -- types of parts (nuts & washers)?

How many roles can these play? How flexible and adaptable are they mechanically?

Where are the parts

located?

Extra

Core

Molecular Parts = Conserved Domains, Folds, &c

Vast Growth in (Structural) Data...

but number of Fundamentally New

(Fold) Parts Not Increasing that Fast

New Submissions

New Folds

Total in Databank

~1000 folds

~100000 genes

~1000 genes1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 …

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 …

(human)

(T. pallidum)

World of Structures is even more Finite,

providing a valuable simplification

Same logic for pathways, functions, sequence families, blocks, motifs....

Global Surveys of a Finite Set of Parts from Many Perspectives

Functions picture from www.fruitfly.org/~suzi (Ashburner); Pathways picture from, ecocyc.pangeasystems.com/ecocyc (Karp, Riley). Related resources: COGS, ProDom, Pfam, Blocks, Domo, WIT, CATH, Scop....

What is Bioinformatics?

• (Molecular) Bio - informatics• One idea for a definition?

Bioinformatics is conceptualizing biology in terms of molecules (in the sense of physical-chemistry) and then applying “informatics” techniques (derived from disciplines such as applied math, CS, and statistics) to understand and organize the information associated with these molecules, on a large-scale.

• Structural Bioinformatics is a practical discipline with many applications that deals with biological three dimensional structural data.

General Types of “Informatics” techniques

in Bioinformatics• Databases

Building, Querying Object DB

• Text String Comparison Text Search 1D Alignment Significance Statistics Google, grep

• Finding Patterns AI / Machine Learning Clustering Datamining

• Geometry Robotics Graphics (Surfaces, Volumes) Comparison and 3D Matching

(Vision, recognition)

• Physical Simulation Newtonian Mechanics Electrostatics Numerical Algorithms Simulation

Bioinformatics Topics -- Genome Sequence

• Finding Genes in Genomic DNA introns exons promotors

• Characterizing Repeats in Genomic DNA Statistics Patterns

• Duplications in the Genome Large scale genomic alignment

• Whole-Genome Comparisons

• Finding Structural RNAs

Bioinformatics Topics --

Protein Sequence

• Sequence Alignment non-exact string matching,

gaps How to align two strings

optimally via Dynamic Programming

Local vs Global Alignment Suboptimal Alignment Hashing to increase speed

(BLAST, FASTA) Amino acid substitution

scoring matrices

• Multiple Alignment and Consensus Patterns How to align more than one

sequence and then fuse the result in a consensus representation

Transitive Comparisons HMMs, Profiles Motifs

• Scoring schemes and Matching statistics How to tell if a given

alignment or match is statistically significant

A P-value (or an e-value)? Score Distributions

(extreme val. dist.) Low Complexity Sequences

• Evolutionary Issues Rates of mutation and

change

Bioinformatics Topics --

Sequence / Structure

• Secondary Structure “Prediction” via Propensities Neural Networks, Genetic

Alg. Simple Statistics TM-helix finding Assessing Secondary

Structure Prediction

• Structure Prediction: Protein v RNA

• Tertiary Structure Prediction Fold Recognition Threading Ab initio

• Function Prediction Active site identification

• Relation of Sequence Similarity to Structural Similarity

Problems in Protein Bioinformatics

• Prediction of structure from sequence Fold recognition Fragment construction

• Proteome annotation

• Protein-protein docking

Protein folding code

Proteinfoldingcode

Proteinstructure

Protein sequence

Prediction of correct fold

Query sequence

Foldrecognition

Eisenberg et al.Jones, Taylor, Thornton

Matchedfold

Match sequence against library of known folds

Computational Requirements

• 1 sequence search takes 12 mins (3Ghz)

• Benchmarking on 100 proteins with 100 runs for a simplex search of parameter space = 80 days

• 30 approaches explored = 7 years (on 1 cpu)

introduction to bioinformatics

Documents

analysis of genomes

analysis of genes

computers analysis of

analysis of dna

web sitesthe course

bioinformatics texts

tools of bioinformatics

genome analysis david