Overview of Genome Sequencing Progress

3/17/2003 Bioinformatics: Merging Biological and Computational Skills Eric Rouchka, D.Sc. University of Louisville Overview of Genome Sequencing Progress Eric C. Rouchka, D.Sc. Bioinformatics Journal Club October 1, 2003

Page 1: Overview of Genome Sequencing Progress

3/17/2003 Bioinformatics: Merging Biological and Computational Skills Eric Rouchka, D.Sc. University of Louisville

Overview of Genome Sequencing Progress

Eric C. Rouchka, D.Sc.

Bioinformatics Journal Club

October 1, 2003

Page 2: Overview of Genome Sequencing Progress

DNA Sequences

• DNA: double stranded helix

• Composed of 4 bases: A,C,G,T

• Genome: linear chain of bases
– Humans: 22 autosome pairs, 2 sex chromosomes, 3.2 billion bases

(Image source: http://www.ebi.ac.uk/)

Page 3: Overview of Genome Sequencing Progress

Double Helix

• Two complementary DNA strands form a stable DNA double helix

• A, T are complements; G, C are complements
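The pairing rule above is enough to derive one strand from the other. A minimal Python sketch follows; the function names are illustrative and not from the slides, and the input is assumed to contain only A, C, G and T.

```python
# Watson-Crick pairing from the slide: A pairs with T, G pairs with C.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(seq: str) -> str:
    """Return the base-by-base complement of a DNA string."""
    return seq.upper().translate(COMPLEMENT)


def reverse_complement(seq: str) -> str:
    """Return the opposite strand, read in its own 5'->3' direction."""
    return complement(seq)[::-1]


print(complement("ATGC"))          # TACG
print(reverse_complement("ATGC"))  # GCAT
```

Taking the reverse complement twice gives back the original sequence, which is a handy sanity check.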

Image source: www.ebi.ac.uk/microarray/biology_intro.htm

Page 4: Overview of Genome Sequencing Progress

Brief History of Sequencing

• Discovery of Complementary Bases
– Erwin Chargaff, 1950

• Discovery of DNA Double Helix
– 1953 – only 50 years ago
– James Watson
– Francis Crick
– Rosalind Franklin

Image: www.simr.org.uk/pages/biotechnology/biotechnology_2.html

Page 5: Overview of Genome Sequencing Progress

Central Dogma

DNA → RNA → Protein

Page 6: Overview of Genome Sequencing Progress

RNA

• Ribonucleic Acid

• Similar to DNA

• Thymine (T) is replaced by uracil (U)

• RNA can be:
– Single stranded
– Double stranded
– Hybridized with DNA

Page 7: Overview of Genome Sequencing Progress

RNA

• RNA is generally single stranded

• Forms secondary or tertiary structures

• Important in a variety of ways, including protein synthesis

Page 8: Overview of Genome Sequencing Progress

How Central Dogma Works

• DNA “transcribed” into single-stranded (SS) mRNA

• mRNA “translated” into protein using tRNA
– Triplet bases (codons) used to code amino acids
– 3 mRNA bases code one amino acid
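The two steps above can be sketched in a few lines of Python. This is a simplified illustration, not the cell's actual machinery: the input is assumed to be the coding strand (so transcription is just T to U), translation starts at the first AUG, and the codon table is the standard genetic code. Function names are made up for the example.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon in UCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_dna: str) -> str:
    """Assumes the coding strand is given, so the mRNA is the same text with T -> U."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Read codons from the first AUG until a stop codon or the end of the sequence."""
    start = mrna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate(transcribe("ATGGCCTAA")))  # MA
```

Here ATG GCC TAA becomes AUG GCC UAA, which reads as Met, Ala, stop.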

Page 9: Overview of Genome Sequencing Progress

History Of Genetic Code

• Genetic Code Completely uncovered (1965)
– Marshall Nirenberg

Page 10: Overview of Genome Sequencing Progress

Genetic Code

• 4 possible bases (A, C, G, U)
• 4 * 4 * 4 = 64 possible codon sequences
• Start codon: AUG
• Stop codons: UAA, UAG, UGA
• 61 codons to code for amino acids (AUG as well)
• 20 amino acids – redundancy in genetic code
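The counts above can be checked by enumerating the codon table. The sketch below assumes the standard genetic code, written as the usual 64-letter string in UCAG order; the variable names are illustrative.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon in UCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

stop_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "*")
sense_codons = [c for c, aa in CODON_TABLE.items() if aa != "*"]
amino_acids = set(CODON_TABLE.values()) - {"*"}
# Redundancy: number of codons that map to each amino acid.
degeneracy = Counter(aa for aa in CODON_TABLE.values() if aa != "*")

print(len(CODON_TABLE))   # 64
print(stop_codons)        # ['UAA', 'UAG', 'UGA']
print(len(sense_codons))  # 61
print(len(amino_acids))   # 20
```

The `degeneracy` counter makes the redundancy concrete: leucine, serine and arginine each have six codons, while methionine and tryptophan have one.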

Page 11: Overview of Genome Sequencing Progress

20 Amino Acids

• Glycine (G, GLY)
• Alanine (A, ALA)
• Valine (V, VAL)
• Leucine (L, LEU)
• Isoleucine (I, ILE)
• Phenylalanine (F, PHE)
• Proline (P, PRO)
• Serine (S, SER)
• Threonine (T, THR)
• Cysteine (C, CYS)
• Methionine (M, MET)
• Tryptophan (W, TRP)
• Tyrosine (Y, TYR)
• Asparagine (N, ASN)
• Glutamine (Q, GLN)
• Aspartic acid (D, ASP)
• Glutamic acid (E, GLU)
• Lysine (K, LYS)
• Arginine (R, ARG)
• Histidine (H, HIS)
• START: AUG
• STOP: UAA, UAG, UGA

Page 12: Overview of Genome Sequencing Progress

tRNA structure

http://www.tulane.edu/~biochem/nolan/lectures/rna/frames/trnabtx2.htm

Page 13: Overview of Genome Sequencing Progress

Protein Structure

Image source: www.ebi.ac.uk/microarray/biology_intro.htm

Page 14: Overview of Genome Sequencing Progress

Brief History of Sequencing

• First Protein Sequence
– ~1955 Bovine Insulin (Fred Sanger)

• First Nucleic Acid Sequence (RNA)
– ~1965 yeast alanine tRNA (77 bases)

• Development of DNA sequencing
– Maxam-Gilbert and Sanger Methods (1977)

Page 15: Overview of Genome Sequencing Progress

Sanger Sequencing Method

• (Quicktime Movie)

• SOURCE: Molecular Cell Biology

Page 16: Overview of Genome Sequencing Progress

Improving Sanger’s Method

• Dideoxynucleotides fluorescently labeled (1986)
– Reaction cut by ¼

• Sequencing Automated by machine (1986)

• Laser detects fluorescence

Page 17: Overview of Genome Sequencing Progress

Image Source: plantbio.berkeley.edu/~bruns/tour3.html

Page 19: Overview of Genome Sequencing Progress

Genetic Mapping

• Sex-linked genes studied since early 1900s

• Gene mapping takes off in late 1970s
– David Botstein (RFLPs 1978)

• 1979: 579 Genes Mapped

• 2003: ~30,000 Genes Mapped
– Mapping of Huntington’s Disease (first disease gene mapped)
  • Triplet Repeat
  • 1983
  • Nancy Wexler

Page 20: Overview of Genome Sequencing Progress

Mapping of Markers

• Sequence Tagged Sites (STS)
– Sequences occurring only once in the human genome
– Help to map locations
– 52,000 STS in Humans
  • ~ 1 every 62,000 bases
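The spacing figure follows from dividing the genome size by the marker count. A quick check in Python, using the 3.2 billion base genome size quoted earlier in the slides and assuming markers are spread evenly (real STS markers are not):

```python
GENOME_SIZE_BP = 3_200_000_000  # human genome size quoted earlier in the slides
STS_COUNT = 52_000              # STS markers quoted on this slide


def average_spacing(genome_size_bp: int, marker_count: int) -> float:
    """Mean distance between markers if they were spread evenly."""
    return genome_size_bp / marker_count


spacing = average_spacing(GENOME_SIZE_BP, STS_COUNT)
print(round(spacing))  # 61538
```

That is about 62,000 bases per marker, matching the slide.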

Page 21: Overview of Genome Sequencing Progress

Cloning Techniques

• Plasmid Cloning Introduced (1973)

– Region of Interest duplicated by inclusion

• YACs (yeast artificial chromosomes) described (1987)

• BACs (bacterial artificial chromosomes) introduced (1992)

• 30,000 to 100,000 bases can be cloned

(Figure: plasmid map of pACYC177, 3941 bp. Labeled features: KN(R), AP r, P15A ORI, TN3 and TN903 inverted repeats, TN3 REPR FRAG. Restriction sites: Apa LI, Bam HI, Cla I, Hin dIII, Pst I, Sma I, Xma I, Ava I.)

Page 22: Overview of Genome Sequencing Progress

Hierarchical (Clone-based) Approach

• Know location of 30,000 – 100,000 bp region
• Break into 500-700 bp fragments
• Sequence Fragments
• Assemble based on similarity
• ~8-10x coverage

• Current Price: $0.09 / base
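To get a feel for what 8-10x coverage means for one clone, the sketch below estimates how many reads are needed and how much of the clone is expected to be covered. The 600 bp read length is just the midpoint of the 500-700 bp range above, and the coverage estimate uses the Lander-Waterman (Poisson) approximation, which is an assumption added here and not something the slides state.

```python
import math


def reads_needed(region_bp: int, read_bp: int, coverage: float) -> int:
    """Reads required so that total sequenced bases = coverage * region length."""
    return math.ceil(coverage * region_bp / read_bp)


def expected_fraction_covered(coverage: float) -> float:
    """Poisson approximation: P(a base is hit by at least one read) = 1 - e^(-c)."""
    return 1.0 - math.exp(-coverage)


n = reads_needed(region_bp=100_000, read_bp=600, coverage=8)
print(n)  # 1334
print(f"{expected_fraction_covered(8):.4%} of bases expected to be covered")
```

Under this model roughly 1,300 reads cover a 100,000 bp clone at 8x, leaving well under 0.1% of bases unsampled on average.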

Page 23: Overview of Genome Sequencing Progress

Hierarchical (clone-based) approach

• generate overlapping set of clones
• select a minimum tiling path
• shotgun sequence each clone
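Selecting a minimum tiling path can be pictured as an interval-covering problem. The sketch below is a simplified model, assuming each clone has already been placed on the map as a (name, start, end) interval; real tiling-path selection works from fingerprint or marker data and is messier. The clone names and coordinates are hypothetical.

```python
from typing import List, Tuple

Clone = Tuple[str, int, int]  # (name, start, end) in map coordinates


def minimum_tiling_path(clones: List[Clone], region_start: int, region_end: int) -> List[Clone]:
    """Greedy interval cover: among clones starting at or before the covered
    position, always take the one that extends furthest to the right."""
    remaining = sorted(clones, key=lambda clone: clone[1])
    path: List[Clone] = []
    covered = region_start
    i = 0
    while covered < region_end:
        best = None
        while i < len(remaining) and remaining[i][1] <= covered:
            if best is None or remaining[i][2] > best[2]:
                best = remaining[i]
            i += 1
        if best is None or best[2] <= covered:
            raise ValueError(f"no clone covers position {covered}")
        path.append(best)
        covered = best[2]
    return path


# Hypothetical clones on a 320 kb region (coordinates in kb).
clones = [("A", 0, 100), ("B", 50, 180), ("C", 90, 200), ("D", 150, 300), ("E", 190, 320)]
print([name for name, _, _ in minimum_tiling_path(clones, 0, 320)])  # ['A', 'C', 'E']
```

The greedy choice yields the fewest clones that still span the region, so only three of the five clones would be sent for shotgun sequencing.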

Page 24: Overview of Genome Sequencing Progress

Hierarchical (clone-based) approach

• MINUS
– map generation requires resources, time and money
– Some regions not cloned

• PLUS
– easier to assemble smaller pieces
– less chance for assembly error

Page 25: Overview of Genome Sequencing Progress

Shotgun Sequencing Approach

• Developed 1991 TIGR
– Craig Venter, Hamilton Smith

• Break genome into millions of pieces
– Sequence each piece
– Reassemble into full genomes
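The reassembly step can be illustrated with a toy greedy assembler that repeatedly merges the two reads with the longest suffix-to-prefix overlap. This is only a sketch of the idea, assuming error-free reads from a single strand; it is not how TIGR Assembler or any production assembler works, and it breaks down on repeats.

```python
from typing import List


def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a that is also a prefix of b (0 if below min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: List[str], min_len: int = 3) -> List[str]:
    """Merge the best-overlapping pair until no overlap of at least min_len remains."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicate reads, keep order
    while True:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)


reads = ["ATGCGTAC", "GTACGTTA", "GTTAGC"]
print(greedy_assemble(reads))  # ['ATGCGTACGTTAGC']
```

The all-against-all comparison inside the loop hints at why assembly at whole-genome scale is computationally demanding.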

Page 26: Overview of Genome Sequencing Progress

Whole Genome Shotgun Approach

• reads generated directly from a whole-genome library

• assemble the genome all at once

Page 27: Overview of Genome Sequencing Progress

Whole Genome Shotgun Approach

• MINUS
– more prone to assembly error
– computationally intensive
– cannot effectively handle repeats

• PLUS
– Less overhead time up front

Page 28: Overview of Genome Sequencing Progress

Base calling and Assembly Software

• PHRED and PHRAP Developed (1998)
– PHRED: Base calling software
– PHRAP: Assists in assembly of sequenced data
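PHRED attaches a quality score to each called base. The slide does not give the formula, but the standard definition is Q = -10 log10(P), where P is the estimated probability that the call is wrong. A small sketch of the conversion (function names are illustrative):

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Error probability implied by a Phred quality score: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred quality score for an error probability: Q = -10 * log10(P)."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    print(f"Q{q}: about 1 error in {round(1 / phred_to_error_prob(q))} bases")
```

So Q20 corresponds to about a 1% chance of a wrong call and Q30 to about 0.1%.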

Page 29: Overview of Genome Sequencing Progress

Available Assemblers

• SEQAID (Peltola et al., 1984)
• CAP (Huang, 1992)
• PHRAP (Green, 1994)
• TIGR Assembler (Sutton et al., 1995)
• AMASS (Kim et al., 1999)
• CAP3 (Huang and Madan, 1999)
• Celera Assembler (Myers et al., 2000)
• EULER (Pevzner et al., 2001)
• ARACHNE (Batzoglou et al., 2002)

Page 30: Overview of Genome Sequencing Progress

History of Genome Projects

• First Genome Sequence

– ΦX174 Phage 5,386 bp; 9 proteins (1977)

• Haemophilus influenzae Sequenced

– First non-viral genome (1.8 Mb) (1995)

Page 31: Overview of Genome Sequencing Progress

History of Genome Projects

• Saccharomyces cerevisiae sequenced

– First eukaryotic genome (12.1 Mb) (1996)

• Caenorhabditis elegans sequence released

– First animal genome, ~100 Mb (1998)

Page 32: Overview of Genome Sequencing Progress

History of Genome Projects

• Arabidopsis thaliana sequence released

– First publicly available plant genome (1999)

• Rough Draft of Human Genome Reported (2001)
– “Finished” 2003

Page 33: Overview of Genome Sequencing Progress

Human Genome Project

• Began in 1990 (US DOE – 15 years)
– Identify all genes in human DNA
– Determine sequence of human genome
– Develop faster sequencing technologies
– Develop tools for data analysis
– Address ethical, legal, and social issues (ELSI)

Page 34: Overview of Genome Sequencing Progress

Microbial Genomes

• 122 Complete Genomes in CMR
– http://www.tigr.org/tigr-scripts/CMR2/CMR_Content.spl

Page 35: Overview of Genome Sequencing Progress

Genomes

– Fruit Fly
– Mouse
– Rat
– Rice
– Zebra fish
– Puffer fish
– Chicken
– Dog
– Frog

Page 36: Overview of Genome Sequencing Progress

Growth of GenBank

• 1982: 600,000 Bases

• 2002: 28.5 Billion Bases
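The two data points above imply a doubling time, if one assumes steady exponential growth between 1982 and 2002 (a simplification; actual growth was uneven). A short calculation:

```python
import math


def doubling_time_years(start_bases: float, end_bases: float, years: float) -> float:
    """Years per doubling, assuming constant exponential growth over the interval."""
    doublings = math.log2(end_bases / start_bases)
    return years / doublings


t = doubling_time_years(600_000, 28_500_000_000, 2002 - 1982)
print(f"GenBank doubled roughly every {t:.1f} years")
```

A 47,500-fold increase over 20 years is about 15.5 doublings, or one doubling roughly every 15 to 16 months.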

Page 37: Overview of Genome Sequencing Progress

Other Notables

• Dayhoff ATLAS Database of Proteins (1960s)

• Sequence Comparison Algorithms
– 1970, Needleman-Wunsch (global alignment)

• Protein Databank
– Brookhaven PDB (1973)
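For readers new to the global alignment method listed above, here is a minimal Python sketch of Needleman-Wunsch scoring. The scoring values (match +1, mismatch -1, gap -1) are arbitrary choices for illustration, and only the optimal score is returned, not the alignment itself.

```python
def needleman_wunsch_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Score of the best end-to-end (global) alignment of a and b, linear gap penalty."""
    # prev[j] holds the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diagonal, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


print(needleman_wunsch_score("GATTACA", "GATTACA"))  # 7
print(needleman_wunsch_score("GATTACA", "GATACA"))   # 5
```

The second example loses one point for the single gap needed to line up the six matching bases.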

Page 38: Overview of Genome Sequencing Progress

Other Notables

• NMR for protein structure identification (1980)

• IntelliGenetics Founded

– DNA and Protein sequence analysis (1980)

Page 39: Overview of Genome Sequencing Progress

Other Notables

• Smith-Waterman algorithm
– Local sequence alignment (1981)

• GenBank Database created (1982)

• Genetics Computer Group Founded
– GCG suite (1982)

• PCR First Described (1985)
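The Smith-Waterman local alignment listed above differs from global alignment in one key way: scores are never allowed to fall below zero, so the best-scoring region can start and end anywhere in either sequence. A minimal scoring sketch, with arbitrary illustrative scores (match +2, mismatch -1, gap -1) and returning only the best local score:

```python
def smith_waterman_score(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -1) -> int:
    """Best local alignment score between any substring of a and any substring of b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Clamping at 0 lets an alignment restart anywhere (the "local" part).
            score = max(0, diagonal, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


print(smith_waterman_score("TTTTACGTTTTT", "GGACGGG"))  # 6
```

Only the shared ACG region contributes to the score; the unrelated flanking bases are ignored.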

Page 40: Overview of Genome Sequencing Progress

Other Notables

• FASTP Algorithm

– Protein database searching (1985)

• SWISS-PROT

– Protein Database (1986)

Page 41: Overview of Genome Sequencing Progress

Other Notables

• PERL Programming Language
– Allows for sequence manipulation (1987)

• NCBI Established (1988)

• Human Genome Initiative (1988)

Page 42: Overview of Genome Sequencing Progress

Other Notables

• FASTA Program released (1988)
– DNA and Protein sequence database searches

• BLAST Program released (1990)
– Allows for quick database searches

• Informax Founded (1990)

• Human Genome Project Begins (1990)

Page 43: Overview of Genome Sequencing Progress

Other Notables

• Creation and Use of ESTs Described (1991)

• Incyte Pharmaceuticals Founded (1991)

• TIGR Established (1992)

– Shotgun sequencing methods

Page 44: Overview of Genome Sequencing Progress

Other Notables

• Affymetrix founded (1993)

• PRINTS protein motif database (1994)

Page 45: Overview of Genome Sequencing Progress

Other Notables

• First Commercial Microarray chips produced (1996)

• Dolly Cloned (1997)

• Capillary Sequencing machines introduced (1997)

Page 46: Overview of Genome Sequencing Progress

Other Notables

• Celera Genomics Formed (1998)

Page 47: Overview of Genome Sequencing Progress

More Detailed Histories

http://www.netsci.org/Science/Bioinform/feature06.html

http://www.dhgp.de/intro/history/history.html