The BIG Goal

“The greatest challenge, however, is analytical. … Deeper biological insight is likely to emerge from examining datasets with scores of samples.”

Eric Lander, “Array of hope”, Nat. Genet. 21 (supplement), pp. 3-4, 1999.

Bio-informatics: Provide methodologies for elucidating biological knowledge from biological data.

Post on 20-Dec-2015



Central Paradigm of Bio-informatics

Genetic Information

Molecular Structure

Biochemical Function

Symptoms

http://www.sanger.ac.uk/PostGenomics/S_pombe/presentations/EMBOCopenhagenWebsite.pdf

Computer Science Tools are Crucial

• New bio-technologies create huge amounts of data.

• It is impossible to analyze the data by manual inspection.

• Novel mathematical, statistical, algorithmic and computational tools are necessary!

http://cbms.st-and.ac.uk/academics/ryan/Teaching/SB&Bioinf/lecture1.htm

Automated Sequencing

What is Bio-Informatics ?

• A field of science in which Biology, Computer Science and Information Technology merge into a single discipline.

• Computers (& software tools) are used to collect, analyze and interpret biological information at the molecular level.

• Goal: To enable the discovery of new biological insights and create a global perspective for biologists.

• Development of new algorithms and statistical methods to assess relationships among members of large data sets.

• Analysis and interpretation of various types of data.

• Development and implementation of tools to efficiently access and manage different types of information.

Disciplines

Why Use Bio-Informatics ?

An explosive growth in the amount of biological information necessitates the use of computers for cataloging and retrieval of data (> 3 billion base pairs, > 30,000 genes).

• The human genome project.

• Automated sequencing.

• GenBank has over 16 billion bases and is doubling every year!

New Types of Biological Data

• Micro arrays - gene expression.

• Multi-level maps: genetic, physical, sequence, annotation.

• Networks of protein-protein interactions.

• Cross-species relationships:

  • Homologous genes.

  • Chromosome organization.

http://www.the-scientist.com/yr2002/apr/research020415.html

• A more global view of experimental design. (from “one scientist = one gene/protein/disease” paradigm to whole organism consideration).

• Data mining - functional/structural information is important for studying the molecular basis of diseases, diagnostics, developing drugs (personal medicine), evolutionary patterns, etc.

Why Bio Informatics ? (cont.)

http://www.library.csi.cuny.edu/~davis/Bioinfo_326/lectures/lect14/lect_14.html

Why Bio Informatics ? (cont.)

http://www.usgenomics.com/technology/index.shtml

Principal milestones in data mining and genome analysis:

• Sanger method for sequencing, invented in 1977 (awarded the Nobel Prize in 1980).

• Polymerase chain reaction (PCR), invented in 1983 (awarded the Nobel Prize in 1993).

Future of Genomic Research

The next step:

Locate all the genes and understand their function.

This will probably take another 15-20 years !

Disease Genes Discovered

One can efficiently find information using databases and software on the web.

Question: How likely are you to use a free bio-informatics library of accessible software ?

http://www.cryst.bbk.ac.uk/classlib/BBSRC_poster/potential.html

The job of biologists is changing…

Molecular Biology Analysis Software Tools - Freely Available on the Web - Highlights

Broad Classification of Biological Databases

http://www.mrc-lmb.cam.ac.uk/genomes/madanm/pres/biodb.htm

ENTREZ - PubMed

NCBI

http://www3.ncbi.nlm.nih.gov/Entrez/index.html

Post-genomic terms (Oct. 2002)

Term             Google search    PubMed

Genome           2.1x10^6         76,566

Proteome         89,300           1,701

Transcriptome    9,960            229

Gene function    1.2x10^6         6.5x10^5

Metabolome       1,170            29

Glycome          138              6

[Figure: PubMed hits for “Proteome”]

From: Computational Proteomics, Mark B Gerstein, Yale U.

http://cbms.st-and.ac.uk/academics/ryan/Teaching/SB&Bioinf/lecture1.htm


Similarity / Analogy

Examples:

If it looks like an elephant, and smells like an elephant - it’s an elephant.

If it walks like a duck, and quacks like a duck - it’s a duck.

http://cbms.st-and.ac.uk/academics/ryan/Teaching/molbiol/Bioinf_files/v3_document.htm

Similarity Search in Databanks

Find sequences similar to a working draft.

As databanks grow, homology searches get harder, and the quality of the hits is reduced.

Alignment Tools: BLAST & FASTA (time saving heuristics-approximations).
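The time-saving idea behind heuristic tools of this kind can be illustrated with a minimal sketch: instead of aligning the query against every databank sequence in full, first find short exact word matches ("seeds") and examine only those regions further. The function names and the word length k = 11 are illustrative assumptions, not the actual BLAST or FASTA implementation; the two strings are the opening bases of the query and subject in the alignment on this slide.

```python
from collections import defaultdict

def build_kmer_index(sequence, k):
    """Map every k-mer in `sequence` to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        index[sequence[i:i + k]].append(i)
    return index

def find_seeds(query, subject, k=11):
    """Return (query_pos, subject_pos) pairs that share an exact k-mer.

    This is only the seeding step of a seed-and-extend heuristic;
    extension and scoring of the seeds are not shown.
    """
    index = build_kmer_index(subject, k)
    seeds = []
    for q in range(len(query) - k + 1):
        for s in index.get(query[q:q + k], []):
            seeds.append((q, s))
    return seeds

# Opening bases of the query and subject rows (illustrative input only).
query = "aggatccaacgtcgctccagctgctcttgacgactcc"
subject = "aggatccaacgtcgctgcggctacccttaaccact"

seeds = find_seeds(query, subject, k=11)
print(len(seeds), "seed(s); first:", seeds[0] if seeds else None)
```

Building the index costs time proportional to the subject length, and each query word is then looked up in constant expected time, which is why seeding scales to large databanks where full pairwise alignment would not.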

>gb|BE588357.1|BE588357 194087 BARC 5BOV Bos taurus cDNA 5'.

Length = 369

Score = 272 bits (137), Expect = 4e-71

Identities = 258/297 (86%), Gaps = 1/297 (0%)

Strand = Plus / Plus

Query: 17  aggatccaacgtcgctccagctgctcttgacgactccacagataccccgaagccatggca 76
           |||||||||||||||| | ||| | ||| || ||| | ||||  ||||| |||||||||
Sbjct: 1   aggatccaacgtcgctgcggctacccttaaccact-cgcagaccccccgcagccatggcc 59

Query: 77  agcaagggcttgcaggacctgaagcaacaggtggaggggaccgcccaggaagccgtgtca 136
           |||||||||||||||||||||||| | || ||||||||| | ||||||||||| ||| ||
Sbjct: 60  agcaagggcttgcaggacctgaagaagcaagtggagggggcggcccaggaagcggtgaca 119

Query: 137 gcggccggagcggcagctcagcaagtggtggaccaggccacagaggcggggcagaaagcc 196
            |||||||| | || | ||||||||||||||| ||||||||||| || ||||||||||||
Sbjct: 120 tcggccggaacagcggttcagcaagtggtggatcaggccacagaagcagggcagaaagcc 179

Query: 197 atggaccagctggccaagaccacccaggaaaccatcgacaagactgctaaccaggcctct 256
           ||||||||| | |||||||| |||||||||||||||||| ||||||||||||||||||||
Sbjct: 180 atggaccaggttgccaagactacccaggaaaccatcgaccagactgctaaccaggcctct 239

Query: 257 gacaccttctctgggattgggaaaaaattcggcctcctgaaatgacagcagggagac 313
           || || ||||| ||  ||||||||||| | |||||||||||||||||| ||||||||
Sbjct: 240 gagactttctcgggttttgggaaaaaacttggcctcctgaaatgacagaagggagac 296

Pairwise alignment:
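The "Identities" and "Gaps" figures in such a report are simple column counts over the two aligned rows. The following sketch, with a hypothetical helper name `alignment_stats`, counts them for the first 60-column block of the alignment on this slide (query bases 17-76 against subject bases 1-59); it does not compute the alignment itself or the bit score.

```python
def alignment_stats(query_row, subject_row):
    """Count identical columns and gap columns in two aligned rows.

    Both rows must already be aligned, i.e. have equal length,
    with '-' marking a gap.
    """
    if len(query_row) != len(subject_row):
        raise ValueError("aligned rows must have the same length")
    identities = sum(
        1 for q, s in zip(query_row, subject_row) if q == s and q != "-"
    )
    gaps = sum(
        1 for q, s in zip(query_row, subject_row) if q == "-" or s == "-"
    )
    return identities, gaps, len(query_row)

# First block of the alignment (query 17-76, subject 1-59).
query_row = "aggatccaacgtcgctccagctgctcttgacgactccacagataccccgaagccatggca"
subject_row = "aggatccaacgtcgctgcggctacccttaaccact-cgcagaccccccgcagccatggcc"

identities, gaps, columns = alignment_stats(query_row, subject_row)
print(identities, gaps, columns)  # 48 1 60
```

Summing the identity counts over all five blocks gives the 258 identical columns out of 297 stated in the report header.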

Multiple Sequence Alignment

Multiple alignment: find protein families and functional domains.
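One way a multiple alignment points to functional domains is through columns that stay conserved across the whole family. The sketch below assumes the sequences are already aligned (it does not perform the alignment) and reports, for each column, the most common residue and its frequency. The four short protein fragments are made-up toy data, and `column_conservation` is a hypothetical helper name.

```python
from collections import Counter

def column_conservation(aligned):
    """Return (most_common_residue, frequency) for each alignment column."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for column in zip(*aligned):
        residue, count = Counter(column).most_common(1)[0]
        profile.append((residue, count / len(aligned)))
    return profile

# Toy, hand-aligned fragments (hypothetical data, '-' marks a gap).
toy_alignment = [
    "MKV-LLA",
    "MKVALLA",
    "MRV-LIA",
    "MKVSLLA",
]

profile = column_conservation(toy_alignment)
fully_conserved = [
    i for i, (residue, freq) in enumerate(profile)
    if freq == 1.0 and residue != "-"
]
print(fully_conserved)  # [0, 2, 4, 6]
```

Runs of fully or nearly conserved columns are candidates for a shared domain, while variable or gapped columns are more likely to be loops or linkers.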

Structure - Function Relationships

structure

function

sequence

Protein Structure (domains)

Phylogeny

Evolution - a process in which small changes occur within species over time.

These changes can be monitored today using molecular techniques.

The Tree of Life: A classical, basic science problem, since Darwin’s 1859 “Origin of Species”.

Origin of the universe ?

Formation of the solar system

First self replicating systems

Prokaryotes/eukaryotes

Plant/animals

Invertebrates/vertebrates

Mammalian radiation

Tree of Life: Searching Protein Sequence Databases - How far back can we see ?

The Human Genome Project (HGP)

• Write down all of human DNA on a single CD (“completed” 2001).

• Identify all genes, their location and function (far from completion).

Example for Gene Localization Bio-Tool (FISH)

FISH - Fluorescence In-Situ Hybridization.

• Fluorescently labeled probes hybridize to specific chromosomal locations.

• Example application: low resolution localization of a gene.

Sequencing Genes & Gene Assembly

Automated sequencing

Gene Finding

• Only 2-3% of the human genome encodes functional genes.

• Genes are scattered among large non-coding DNA regions.

• Repeats, pseudo-genes, introns and vector contamination make gene finding difficult.

Gene Finding - cont.

Find special gene patterns:

• Translation start and stop sites (open reading frames - ORF).

• Transcription factors, promoters.

• Intron splice sites.

• Etc.
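The first pattern in the list above, an open reading frame, can be sketched as a scan for a start codon followed in the same frame by a stop codon. This is a deliberately simplified illustration: it looks only at the forward strand, ignores introns and splice sites, and uses a hypothetical function name and a made-up toy sequence. Real gene finders combine many more signals.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=3):
    """Find forward-strand ORFs as (start, end, frame) tuples.

    An ORF here runs from an ATG to the first in-frame stop codon;
    `end` is exclusive and includes the stop codon. `min_codons`
    counts the start and stop codons as well.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open a candidate reading frame
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # close it and keep scanning
    return orfs

toy_dna = "CCATGGCAAGCTGACC"  # hypothetical example sequence
for start, end, frame in find_orfs(toy_dna):
    print(frame, start, end, toy_dna[start:end])  # 2 2 14 ATGGCAAGCTGA
```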

Micro Arrays (“DNA Chips”)

New biotechnology breakthrough: measure RNA expression levels of thousands of genes (in one experiment).

The Idea Behind Micro Arrays

Clustering Analysis of Gene Expression Data
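As a rough illustration of what clustering expression data means, the sketch below groups genes whose expression profiles rise and fall together across experiments. It uses single-linkage agglomerative clustering with a Pearson-correlation distance, which is one common choice and not necessarily the method behind this slide. The gene names and expression values are invented toy data.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (non-constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def cluster_genes(profiles, n_clusters):
    """Merge the two closest clusters until `n_clusters` remain."""
    clusters = [[name] for name in profiles]

    def distance(c1, c2):
        # Single linkage: distance between the closest pair of members.
        return min(
            1 - pearson(profiles[a], profiles[b]) for a in c1 for b in c2
        )

    while len(clusters) > n_clusters:
        pairs = [
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        ]
        i, j = min(pairs, key=lambda p: distance(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical expression levels of four genes over four experiments.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [1.1, 2.1, 2.9, 4.2],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [3.9, 3.2, 1.8, 1.1],
}

clusters = cluster_genes(profiles, n_clusters=2)
print(sorted(sorted(c) for c in clusters))
```

Genes that end up in the same cluster are candidates for shared regulation or related function, which is the kind of hypothesis such an analysis is meant to generate.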

DNA chips and personalized medicine (leading edge, future technologies).

Pharmaco-genomics

Use DNA information to measure and predict the reaction to drugs.

Personalized medicine.

Faster clinical trials: selected populations.

Fewer drug side-effects.

Protein and Other Arrays

Sequencing the human genome => finite problem.

Studying the proteome => endless possible variations, dynamic.

Protein array

Future fields of study:

Proteins + Genomics = Proteomics

Lipids + Genomics = Lipomics

Sugars + Genomics = Glycomics

Understanding Mechanisms of Disease

EC number compound

Putting it all together: Bio-Informatics

SEQUENCES & LITERATURE

GENE EXPRESSION, GENES FUNCTION, DRUG & PERSONAL THERAPY

SEQUENCE ALIGNMENT

ORTHOLOG GENES (Taxonomy)

CONSERVED DOMAINS

CODING REGIONS

3-D STRUCTURE

GENE FAMILIES

MUTATIONS & POLYMORPHISM

GENOME MAPS

CELLULAR LOCATION

SIGNAL PEPTIDE