3/17/2003 Bioinformatics: Merging Biological and Computational Skills Eric Rouchka, D.Sc. University of Louisville

Overview of Genome Sequencing Progress

Eric C. Rouchka, D.Sc.

Bioinformatics Journal Club

October 1, 2003


DNA Sequences

• DNA: double stranded helix

• Composed of 4 bases: A,C,G,T

• Genome: linear chain of bases

– Humans: 22 autosome pairs, 2 sex chromosomes, 3.2 billion bases

(Image source: http://www.ebi.ac.uk/)


Double Helix

• Two complementary DNA strands form a stable DNA double helix

• A, T are complements; G, C are complements

Image source: www.ebi.ac.uk/microarray/biology_intro.htm
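
The pairing rule on this slide (A with T, G with C) can be sketched in a few lines of Python. The names `complement` and `reverse_complement` and the example sequence are illustrative assumptions, not part of the original talk.

```python
# Watson-Crick pairing as stated on the slide: A<->T, G<->C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand: str) -> str:
    """Return the base-by-base complement of a DNA strand."""
    return "".join(COMPLEMENT[base] for base in strand.upper())


def reverse_complement(strand: str) -> str:
    """Return the opposite strand, read in its own 5'->3' direction."""
    return complement(strand)[::-1]


if __name__ == "__main__":
    print(reverse_complement("ATGC"))  # GCAT
```

Reversing after complementing reflects the fact that the two strands of the helix run in opposite directions.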


Brief History of Sequencing

• Discovery of Complementary Bases

– Erwin Chargaff, 1950

• Discovery of DNA Double Helix

– 1953 (only 50 years ago)

– James Watson

– Francis Crick

– Rosalind Franklin

Image: www.simr.org.uk/pages/biotechnology/biotechnology_2.html


Central Dogma

DNA → RNA → PROTEIN


RNA

• Ribonucleic Acid

• Similar to DNA

• Thymine (T) is replaced by uracil (U)

• RNA can be:

– Single stranded

– Double stranded

– Hybridized with DNA


RNA

• RNA is generally single stranded

• Forms secondary or tertiary structures

• Important in a variety of ways, including protein synthesis


How Central Dogma Works

• DNA “transcribed” into single-stranded (SS) mRNA

• mRNA “translated” into protein using tRNA

– Triplet bases (codons) used to code amino acids

– 3 mRNA bases code one amino acid
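
The two steps on this slide can be sketched as follows. This is a simplified illustration, assuming the input is the DNA coding strand (so transcription reduces to replacing T with U) and using the standard genetic code; the function names and the example sequence are hypothetical.

```python
# Standard genetic code, built from the conventional UCAG ordering.
# '*' marks a stop codon; amino acids use one-letter codes.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def transcribe(coding_strand: str) -> str:
    """Simplified transcription: copy the coding strand, replacing T with U."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate from the first AUG, three bases at a time, until a stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for pos in range(start, len(mrna) - 2, 3):
        residue = CODON_TABLE[mrna[pos:pos + 3]]
        if residue == "*":  # stop codon ends translation
            break
        protein.append(residue)
    return "".join(protein)


if __name__ == "__main__":
    mrna = transcribe("ATGGCCTAA")
    print(mrna, translate(mrna))  # AUGGCCUAA MA
```

In the cell, tRNA molecules do the codon-to-amino-acid matching; the dictionary lookup here only stands in for that step.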


History Of Genetic Code

• Genetic Code Completely Uncovered (1965)

– Marshall Nirenberg


Genetic Code

• 4 possible bases (A, C, G, U)

• 4 * 4 * 4 = 64 possible codon sequences

• Start codon: AUG

• Stop codons: UAA, UAG, UGA

• 61 codons code for amino acids (including AUG)

• 20 amino acids – redundancy in genetic code
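
The counts on this slide can be checked by enumerating every codon. The sketch below assumes the standard genetic code written in the conventional UCAG order; the variable names are illustrative.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# Standard genetic code in UCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# All 4 * 4 * 4 triplets, in the same order as AMINO_ACIDS.
codons = ["".join(triplet) for triplet in product(BASES, repeat=3)]
codon_table = dict(zip(codons, AMINO_ACIDS))

stop_codons = sorted(c for c, aa in codon_table.items() if aa == "*")
sense_codons = [c for c, aa in codon_table.items() if aa != "*"]
# How many codons encode each amino acid (the redundancy in the code).
degeneracy = Counter(codon_table[c] for c in sense_codons)

if __name__ == "__main__":
    print(len(codons), stop_codons, len(sense_codons), len(degeneracy))
    # 64 ['UAA', 'UAG', 'UGA'] 61 20
```

With 61 codons for 20 amino acids, most amino acids have several codons: leucine has six, while methionine and tryptophan have one each.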


20 Amino Acids

• Glycine (G, GLY)

• Alanine (A, ALA)

• Valine (V, VAL)

• Leucine (L, LEU)

• Isoleucine (I, ILE)

• Phenylalanine (F, PHE)

• Proline (P, PRO)

• Serine (S, SER)

• Threonine (T, THR)

• Cysteine (C, CYS)

• Methionine (M, MET)

• Tryptophan (W, TRP)

• Tyrosine (Y, TYR)

• Asparagine (N, ASN)

• Glutamine (Q, GLN)

• Aspartic acid (D, ASP)

• Glutamic acid (E, GLU)

• Lysine (K, LYS)

• Arginine (R, ARG)

• Histidine (H, HIS)

• START: AUG

• STOP: UAA, UAG, UGA


tRNA structure

http://www.tulane.edu/~biochem/nolan/lectures/rna/frames/trnabtx2.htm


Protein Structure

Image source: www.ebi.ac.uk/microarray/biology_intro.htm


Brief History of Sequencing

• First Protein Sequence

– ~1955 Bovine insulin (Fred Sanger)

• First Nucleic Acid Sequence

– ~1965 Yeast alanine tRNA (77 bases)

• Development of DNA Sequencing

– Maxam-Gilbert and Sanger methods (1977)


Sanger Sequencing Method

• (Quicktime Movie)

• SOURCE: Molecular Cell Biology


Improving Sanger’s Method

• Dideoxynucleotides fluorescently labeled (1986)

– Reaction cut by ¼

• Sequencing Automated by machine (1986)

• Laser detects fluorescence


Image source: plantbio.berkeley.edu/~bruns/tour3.html


Genetic Mapping

• Sex-linked genes studied since early 1900s

• Gene mapping takes off in late 1970s

– David Botstein (RFLPs 1978)

• 1979: 579 Genes Mapped

• 2003: ~30,000 Genes Mapped

• Mapping of Huntington’s Disease (First Disease Gene)

– Triplet Repeat

– 1983

– Nancy Wexler


Mapping of Markers

• Sequence Tagged Sites (STS)

– Sequences occurring only once in the human genome

– Help to map locations

– 52,000 STS in Humans

– ~1 every 62,000 bases
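
The marker spacing quoted above follows from dividing the genome size by the number of markers. A minimal check, assuming the 3.2 billion base genome size given earlier in the talk and evenly spread markers (real STS density varies by region):

```python
GENOME_SIZE_BP = 3_200_000_000  # human genome size from the DNA Sequences slide
STS_COUNT = 52_000              # number of human STS markers from this slide


def average_spacing(genome_size_bp: int, marker_count: int) -> float:
    """Average distance between markers if they were spread evenly."""
    return genome_size_bp / marker_count


if __name__ == "__main__":
    spacing = average_spacing(GENOME_SIZE_BP, STS_COUNT)
    print(f"about one STS every {spacing:,.0f} bases")
```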


Cloning Techniques

• Plasmid Cloning Introduced (1973)

– Region of Interest duplicated by inclusion

• Yeast Artificial Chromosomes (YACs) described (1987)

• Bacterial Artificial Chromosomes (BACs) introduced (1992)

• 30,000 to 100,000 bases can be cloned

(Figure: map of plasmid pACYC177, 3941 bp. Labeled features: KN(R), AP r, P15A ORI, TN3 INV RPT, TN903 INV RPT, TN3 REPR FRAG. Restriction sites: Apa LI (3815), Bam HI (3321), Cla I (2046), Hin dIII (2473), Pst I (304), Sma I (2229), Xma I (2227), Ava I (1953, 2227).)


Hierarchical (Clone-based) Approach

• Know location of 30,000 – 100,000 bp region

• Break into 500-700 bp fragments

• Sequence fragments

• Assemble based on similarity

• ~8-10x coverage

• Current Price: $0.09 / base
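
The coverage figure translates into a read count through the relation reads ≈ coverage × clone length / read length. The sketch below works this through for the numbers on the slide. The function names are illustrative, and the gap estimate uses the standard Lander-Waterman (Poisson) approximation, which is an assumption added here and not something stated in the talk.

```python
import math


def reads_needed(clone_length_bp: int, read_length_bp: int, coverage: int) -> int:
    """Number of reads so that each base is sequenced `coverage` times on average."""
    return math.ceil(coverage * clone_length_bp / read_length_bp)


def expected_uncovered_fraction(coverage: float) -> float:
    """Poisson estimate of the fraction of bases hit by no read at all."""
    return math.exp(-coverage)


if __name__ == "__main__":
    # A 100,000 bp clone sequenced with 600 bp reads.
    for cov in (8, 10):
        print(cov, reads_needed(100_000, 600, cov), expected_uncovered_fraction(cov))
```

Under these assumptions a 100,000 bp clone needs roughly 1,300 to 1,700 reads, and at 8x coverage fewer than one base in a thousand is expected to be missed.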


Hierarchical (clone-based) approach

• generate overlapping set of clones

• select a minimum tiling path

• shotgun sequence each clone
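
Selecting a minimum tiling path can be viewed as an interval-cover problem: pick the fewest clones whose mapped positions cover the region. The following is a minimal greedy sketch under that assumption. The clone coordinates and the function name `minimum_tiling_path` are hypothetical, and real map construction must also cope with uncertain clone positions.

```python
def minimum_tiling_path(clones, region_start, region_end):
    """Pick the fewest clones (start, end) that cover [region_start, region_end].

    Greedy rule: among clones starting at or before the covered position,
    take the one that reaches farthest to the right.
    """
    clones = sorted(clones)
    path = []
    covered = region_start
    i = 0
    while covered < region_end:
        best = None
        # Scan every clone that overlaps or touches the covered stretch.
        while i < len(clones) and clones[i][0] <= covered:
            if best is None or clones[i][1] > best[1]:
                best = clones[i]
            i += 1
        if best is None or best[1] <= covered:
            raise ValueError(f"gap in clone coverage at position {covered}")
        path.append(best)
        covered = best[1]
    return path


if __name__ == "__main__":
    # Hypothetical clone positions on a 300 kb region (coordinates in kb).
    clones = [(0, 100), (50, 120), (80, 200), (150, 260), (190, 300)]
    print(minimum_tiling_path(clones, 0, 300))
    # [(0, 100), (80, 200), (190, 300)]
```

Each clone on the resulting path would then be shotgun sequenced separately.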


Hierarchical (clone-based) approach

• MINUS

– map generation requires resources, time and money

– Some regions not cloned

• PLUS

– easier to assemble smaller pieces

– less chance for assembly error


Shotgun Sequencing Approach

• Developed at TIGR (1991)

– Craig Venter, Hamilton Smith

• Break genome into millions of pieces

– Sequence each piece

– Reassemble into full genomes
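
The reassembly step can be illustrated with a toy greedy assembler that repeatedly merges the two reads with the longest suffix-prefix overlap. This is a sketch for intuition only, assuming error-free reads from one strand with no repeats. The function names and sequences are made up, and it is not how any of the production assemblers mentioned in this talk are implemented.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len: int = 3):
    """Merge the best-overlapping pair of reads until no overlaps remain."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            break  # remaining reads do not overlap; leave them as separate contigs
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads


if __name__ == "__main__":
    # Three overlapping pieces of the made-up sequence AGCTTAGGCATCGA.
    print(greedy_assemble(["CATCGA", "AGCTTAG", "TAGGCAT"]))
    # ['AGCTTAGGCATCGA']
```

The all-pairs comparison is quadratic in the number of reads, which hints at why assembling millions of pieces is computationally demanding.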


Whole Genome Shotgun Approach

• reads generated directly from a whole-genome library

• assemble the genome all at once


Whole Genome Shotgun Approach

• MINUS

– more prone to assembly error

– computationally intensive

– cannot effectively handle repeats

• PLUS

– Less overhead time up front


Base calling and Assembly Software

• PHRED and PHRAP Developed (1988)

– PHRED: Base calling software

– PHRAP: Assists in assembly of sequenced data
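
Base callers in the PHRED tradition attach a quality score Q to each base, related to the estimated error probability P by Q = -10 · log10(P). The slide does not spell this out, so the following is added context; the function names are illustrative and are not PHRED's own interface.

```python
import math


def phred_quality(error_probability: float) -> float:
    """Quality score for a given probability that the base call is wrong."""
    return -10.0 * math.log10(error_probability)


def error_probability(quality: float) -> float:
    """Probability that a base call with this quality score is wrong."""
    return 10.0 ** (-quality / 10.0)


if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        print(q, error_probability(q))
```

A score of 20 corresponds to a 1 in 100 chance of error, and 30 to 1 in 1,000.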


Available Assemblers

• SEQAID (Peltola et al., 1984)

• CAP (Huang, 1992)

• PHRAP (Green, 1994)

• TIGR Assembler (Sutton et al., 1995)

• AMASS (Kim et al., 1999)

• CAP3 (Huang and Madan, 1999)

• Celera Assembler (Myers et al., 2000)

• EULER (Pevzner et al., 2001)

• ARACHNE (Batzoglou et al., 2002)


History of Genome Projects

• First Genome Sequence

– ΦX174 phage: 5,386 bp; 9 proteins (1977)

• Haemophilus influenzae Sequenced

– First non-viral genome (1.8 Mb) (1995)


History of Genome Projects

• Saccharomyces cerevisiae sequenced

– First eukaryotic genome (12.1 Mb) (1996)

• Caenorhabditis elegans sequence released

– First animal genome (~97 Mb) (1998)


History of Genome Projects

• Arabidopsis thaliana sequence released

– First publicly available plant genome (1999)

• Rough Draft of Human Genome Reported (2001)

– “Finished” 2003


Human Genome Project

• Began in 1990 (US DOE, 15 years)

– Identify all genes in human DNA

– Determine sequence of human genome

– Develop faster sequencing technologies

– Develop tools for data analysis

– Address ethical, legal, and social issues (ELSI)


Microbial Genomes

• 122 Complete Genomes in CMR

– http://www.tigr.org/tigr-scripts/CMR2/CMR_Content.spl


Genomes

– Fruit Fly

– Mouse

– Rat

– Rice

– Zebra fish

– Puffer fish

– Chicken

– Dog

– Frog


Growth of GenBank

• 1982: 600,000 Bases

• 2002: 28.5 Billion Bases
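
These two data points imply a doubling time, which the short calculation below works out. It assumes steady exponential growth between 1982 and 2002, which two points alone cannot confirm; the function name is illustrative.

```python
import math


def doubling_time_years(start_size: float, end_size: float, years: float) -> float:
    """Doubling time implied by growth from start_size to end_size over `years`."""
    doublings = math.log2(end_size / start_size)
    return years / doublings


if __name__ == "__main__":
    # GenBank sizes from the slide: 600,000 bases (1982), 28.5 billion bases (2002).
    t = doubling_time_years(600_000, 28.5e9, 2002 - 1982)
    print(f"doubling roughly every {t * 12:.0f} months")
```

Under that assumption the database doubled a little more often than once every year and a half.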


Other Notables

• Dayhoff ATLAS Database of Proteins (1960s)

• Sequence Comparison Algorithms

– 1970, Needleman-Wunsch (global alignment)

• Protein Databank

– Brookhaven PDB (1973)


Other Notables

• NMR for protein structure identification (1980)

• IntelliGenetics Founded

– DNA and Protein sequence analysis (1980)


Other Notables

• Smith-Waterman algorithm

– Local sequence alignment (1981)

• GenBank Database created (1982)

• Genetics Computer Group Founded

– GCG suite (1982)

• PCR First Described (1985)


Other Notables

• FASTP Algorithm

– Protein database searching (1985)

• SWISS-PROT

– Protein Database (1986)


Other Notables

• PERL Programming Language

– Allows for sequence manipulation (1987)

• NCBI Established (1988)

• Human Genome Initiative (1988)


Other Notables

• FASTA Program released (1988)

– DNA and Protein sequence database searches

• BLAST Program released (1990)

– Allows for quick database searches

• Informax Founded (1990)

• Human Genome Project Begins (1990)


Other Notables

• Creation and Use of ESTs Described (1991)

• Incyte Pharmaceuticals Founded (1991)

• TIGR Established (1992)

– Shotgun sequencing methods


Other Notables

• Affymetrix founded (1993)

• PRINTS protein motif database (1994)


Other Notables

• First Commercial Microarray chips produced (1996)

• Dolly Cloned (1997)

• Capillary Sequencing machines introduced (1997)


Other Notables

• Celera Genomics Formed (1998)


More Detailed Histories

http://www.netsci.org/Science/Bioinform/feature06.html

http://www.dhgp.de/intro/history/history.html