
From First Assembly Towards a New Cyberpharmaceutical Computing Paradigm

Sorin Istrail

Senior Director, Informatics Research

Stan Ulam’s Vision

“Don’t ask what Mathematics can do for Biology, ask what Biology can do for Mathematics.”

The Gene Counting Problem

“The more I see the less I know for sure.”

John Lennon

The O.J. Simpson Problem

No matter how good the data look, it is full of errors!

David Botstein, Albuquerque, 1994

Science and Technology

• Genomics
• Comparative Genomics
• Proteomics
• Pharmacogenomics
• Structural Genomics
• Drug and Vaccine Design
• DNA Expression Chips
• Animal Models

GENOMICS

There is Nothing More Important than the Assembly!

Gene Myers and Granger Sutton and the Assembly Team

Celera Human Assembly:

Largest Non-Defence Computation

20,000 CPU hours -- done 5 times now

160 Processors -- Compaq Architecture

Assembly Progression(Macro View)

Generally 3-10 pairs link each consecutive contig

Completed Microbial Genomes (timeline figure: 1996, 1997, 1998, 1999)

• Archaeoglobus fulgidus
• Methanobacterium thermoautotrophicum
• Saccharomyces cerevisiae
• Mycoplasma pneumoniae
• Methanococcus jannaschii
• Mycoplasma genitalium
• Haemophilus influenzae
• Treponema pallidum
• Borrelia burgdorferi
• Helicobacter pylori
• Bacillus subtilis
• Escherichia coli
• Aquifex aeolicus
• Mycobacterium tuberculosis H37Rv
• Pyrococcus horikoshii
• (Mycobacterium tuberculosis CSU#93)
• (Deinococcus radiodurans)
• (Thermotoga maritima)
• (Rickettsia prowazekii)
• Chlamydia trachomatis
• Synechocystis sp.

GENOMES SEQUENCED AT TIGR and CELERA

Pathogens

*Haemophilus influenzae Rd

*Mycoplasma genitalium

*Helicobacter pylori

*Borrelia burgdorferi

*Treponema pallidum

*Plasmodium falciparum

*Neisseria meningitidis

*Chlamydia trachomatis

*Chlamydia pneumoniae

*Vibrio cholerae

*Streptococcus pneumoniae

*Mycobacterium tuberculosis

*Porphyromonas gingivalis

*Trypanosoma brucei

*Staphylococcus aureus

*Enterococcus faecalis

*Chlamydia psittaci

Plants

*Arabidopsis thaliana

Environment

*Methanococcus jannaschii

*Archaeoglobus fulgidus

*Thermotoga maritima

*Deinococcus radiodurans

*Chlorobium tepidum

*Caulobacter crescentus

*Shewanella putrefaciens

*Desulfovibrio vulgaris

*Pseudomonas putida

Insects

**Drosophila melanogaster

Mammals

**Human

**Mouse

* The Institute for Genomic Research

** Celera Genomics

Genesis of Celera August 1998

New 3700 automated DNA Sequencer changed the sequencing possibility

Combined with TIGR Whole Genome Sequencing Strategy

And 64-bit computing

Celera’s Sequencing / SNP Discovery Center

Celera Supercomputing Facility

Celera’s system is one of the most powerful civilian super-computing facilities in the world

Currently over 1.5 teraflop of computing power in a virtual compute farm of Compaq processors with 100 terabytes storage

Next phase a 100 teraflop computer

• Sequencing reactions produce short reads (~550bp).

Human Genome: ~3 billion bases

Sequence read: ~550 bases

• The human genome is repeat-rich.

Many short reads look identical to each other.

GCATTA...GACCGT

CGGATAGACATAAC   CGGATAGACATAAC   CGGATAGACATAAC

CAGCAGCAGCAGCA   CAGCAGCAGCAGCA   CAGCAGCAGCAGCA

Obstacles to Genome Sequencing
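The repeat obstacle can be illustrated with a small sketch: a read drawn from a repeated region matches the genome in more than one place, so its origin cannot be decided from the read alone. The toy genome, the flanking bases, and the helper name `occurrences` below are illustrative assumptions, not material from the slides; only the repeat string `CGGATAGACATAAC` is taken from the slide.

```python
def occurrences(genome, read):
    """Return every start position at which `read` occurs in `genome`."""
    positions = []
    start = genome.find(read)
    while start != -1:
        positions.append(start)
        start = genome.find(read, start + 1)  # allow overlapping matches
    return positions


# Hypothetical toy genome: the slide's repeat unit placed twice with made-up flanks.
REPEAT = "CGGATAGACATAAC"
GENOME = "TT" + REPEAT + "GG" + REPEAT + "AA"

repeat_positions = occurrences(GENOME, REPEAT)   # read from the repeat: ambiguous
unique_positions = occurrences(GENOME, "TTCGG")  # read reaching into unique flank
```

A read that lies entirely inside the repeat has two candidate placements, while a read that reaches into unique flanking sequence has one.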

1. Mapping and Walking

2. Mapping and Clone by Clone Shotgun

3. Whole Genome Shotgun with Mate Pairs

Lab-Intense (SLOW)

Compute-Intense (FAST)

Comparison of Sequencing Strategies

• Mapping and Shotgun

1) Replicate mapped spans of DNA.

Chromosome

Mapped span (BAC) 35,000

2) Shear the replicates randomly and sequence the pieces.

[Figure: randomly sheared pieces of the replicates, each drawn as a short read labelled "cgattc"]

3) Assemble reads by overlap matching. Infer the original sequence by consensus.

Computed overlaps

[Figure: reads labelled "cgattc" stacked by their overlaps]

Computed sequence

cgattcggattctcgattctacgaa
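Step 3 (assemble by overlap matching) can be sketched with a toy greedy assembler. This is a minimal illustration, assuming error-free reads and no repeats; greedy merging is a simple stand-in chosen here and is not the assembler described in these slides. The function names, the `min_len` parameter, and the example reads are hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicate reads
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no remaining overlaps: several contigs are left
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Three overlapping reads sampled from a made-up 20-base sequence.
contigs = greedy_assemble(["GCTTACCGAT", "ACGTTGCAAG", "GCAAGGCTTA"])
```

With real data the overlaps are approximate and the consensus is computed column by column over many reads; the sketch only shows the overlap-then-merge idea.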

Clone by Clone Shotgun sequencing

DNA target sample

SHEAR & SIZE

e.g., 10Kbp ± 8% std.dev.

End Reads / Mate Pairs

CLONE & END SEQUENCE

590bp

10,000bp

Mate-Pair Shotgun DNA Sequencing
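A mate pair can be simulated in a few lines: pick an insert whose length follows the sized library (10Kbp ± 8% on the slide), then read about 590 bases from each end, the second read coming from the opposite strand. This is a sketch under stated assumptions: the normal length distribution, the random toy genome, and the function names are illustrative choices, not details given in the slides.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def sample_mate_pair(genome, rng, mean_insert=10_000, rel_sd=0.08, read_len=590):
    """Sample one insert and return (forward_read, reverse_read, insert_length)."""
    # Assumed model: insert length ~ Normal(mean, rel_sd * mean), clamped to sane bounds.
    insert_len = round(rng.gauss(mean_insert, rel_sd * mean_insert))
    insert_len = min(max(insert_len, 2 * read_len), len(genome))
    start = rng.randrange(len(genome) - insert_len + 1)
    insert = genome[start:start + insert_len]
    forward = insert[:read_len]                        # read from the 5' end
    reverse = reverse_complement(insert[-read_len:])   # read from the other end, other strand
    return forward, reverse, insert_len


rng = random.Random(0)
toy_genome = "".join(rng.choice("ACGT") for _ in range(50_000))
fwd, rev, insert_len = sample_mate_pair(toy_genome, rng)
```

Because the two reads are known to face each other at roughly the insert length, placing one read constrains where its mate must fall.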

Whole Genome Shotgun Sequencing:

– Collect 10-15x BAC inserts and end sequence: ~300K pairs for Human.

– ~27 million reads for Human.

– Early simulations showed that if repeats were considered black boxes, one could still cover 99.7% of the genome unambiguously.

[Figure labels: BAC 5’, BAC 3’; 2Kbp, 10Kbp]
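The read counts on this slide translate into fold coverage by simple arithmetic: reads × read length ÷ genome size. The sketch below uses the ~550-base read length and ~3 billion base genome quoted earlier in the deck; taking the read count as roughly 27 million is an assumption about the garbled figure on this slide. The uncovered-fraction estimate uses the standard Poisson (Lander-Waterman) approximation, which the slides do not mention and which ignores repeats and cloning bias.

```python
import math


def fold_coverage(n_reads, read_len, genome_len):
    """Average number of times each base is sequenced."""
    return n_reads * read_len / genome_len


def fraction_uncovered(coverage):
    """Poisson estimate of the fraction of bases hit by no read at all."""
    return math.exp(-coverage)


# Assumed inputs: ~27 million reads, ~550 bases each, ~3 billion base genome.
coverage = fold_coverage(27_000_000, 550, 3_000_000_000)
missing = fraction_uncovered(coverage)
```

Under these assumptions the coverage comes out just under 5x, leaving under one percent of bases unsampled in the idealized model.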

Whole Genome Sequencing Approaches

50Kbp

Building Scaffolds

Mated reads

Confirmed if at least 2 join the same unitigs and one of them is a U-unitig.

1. 2k-10k Scaffolds: Compute all “unitigs” in graph of U-unitigs connected by confirmed mate links.

2. BAC Scaffolds: Compute all “unitigs” in graph of 2/10K scaffolds connected by confirmed BAC links.

Scaffold

Sequence or Repeat Gaps (with estimated distances)

1 in 10^15 that a confirmed pair is in error.
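The confirmation rule above (at least 2 mate pairs joining the same two unitigs, one of them a U-unitig) can be sketched as a counting step followed by grouping. This is a simplified illustration: grouping linked unitigs into connected components stands in for the real scaffolding step, which also orders and orients the pieces and estimates gap sizes. All names and the example data are hypothetical.

```python
from collections import Counter


def confirmed_links(mate_links, u_unitigs, min_pairs=2):
    """Keep unitig pairs joined by >= min_pairs mates where one end is a U-unitig."""
    counts = Counter(tuple(sorted(pair)) for pair in mate_links)
    return {pair for pair, n in counts.items()
            if n >= min_pairs and (pair[0] in u_unitigs or pair[1] in u_unitigs)}


def group_into_scaffolds(unitigs, links):
    """Group unitigs connected by confirmed links (union-find, no ordering)."""
    parent = {u: u for u in unitigs}

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    for a, b in links:
        parent[find(a)] = find(b)
    groups = {}
    for u in unitigs:
        groups.setdefault(find(u), []).append(u)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical mate links between unitigs A..E; A, B and D are U-unitigs.
mates = [("A", "B"), ("B", "A"), ("B", "C"), ("C", "D"), ("C", "D"), ("D", "C")]
links = confirmed_links(mates, u_unitigs={"A", "B", "D"})
scaffold_groups = group_into_scaffolds(["A", "B", "C", "D", "E"], links)
```

The single B-C mate pair is not enough to confirm a link, so the example yields two multi-unitig groups and one singleton.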

The Gene Counting Problem

The number will probably never be known exactly.

Current estimates: 30,000-40,000

Other estimates: 120,000

Gene discovery:

• sequence analysis
• motif recognition
• matches to mRNA
• computational predictions
• mouse data matches
• experimental validation

Gene Counting

Random ESTs from tissues

CpG islands (55% of known genes are in CpG islands)

Complexity of EST data sets -- sampling biased on tissue and depth of collection

Underrepresented in databases: low-abundance genes, and genes expressed in inaccessible tissues or developmental stages

Overrepresented: EST data sets are composed of incomplete sequences of mRNA, and of non-overlapping pieces of the same mRNA

Functional Assignment using Gene Ontology

Drosophila: 13,601 Genes

Unknown                      48%
Enzyme                       18%
Hypothetical                 11%
Nucleic Acid Binding          8%
Signal Transduction           4%
Transporter                   4%
Structural Protein            2%
Ligand Binding or Carrier     2%
Cell Adhesion                 1%
Motor Protein                 1%
Chaperone                     1%

[Figure: number of genes (axis ticks 10K to 50K) by confidence level; labels: Known genes, Otto, 4, 3, 2, 1]

Gene Number in the Human Genome

Haemophilus vs. Drosophila

                        Hflu      Drosophila   X
Genome Size (Mbp)       1.8       120          67
Sequences               26,000    3,100,000    116
Months in sequencing    4         4            1
Sequencing Staff        24        50           2.1
Assembly Group Staff    1         10           10

Human Genome Sequence from 5 Humans (3 females-2 males) completed

• Human sequencing started 9/8/99

• Over 39X coverage of the genome in paired plasmid reads

• First Assembly announced June 26: 2.9 billion bp

• Published in Science, February 16, 2001

[Figure axis label: BDGP STS Order]

Validation Against STS-map

Scaffolds were aligned against the BDGP STS-content map.

All scaffolds spanning 2 or more STSs were checked for order discrepancies.

16 STS sites out of 2175 (.73%) were out of order, well within the estimated error rate of the STS map. 10 have been determined to be incorrect.
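One way to count order discrepancies is to list the markers in scaffold order, look up each marker's rank on the STS map, and ask how few markers must be set aside for the rest to be in increasing order. The sketch below does this with a longest-increasing-subsequence computation. This counting method and the function name are assumptions made for illustration; the slides do not say how the discrepancies were scored.

```python
from bisect import bisect_left


def count_out_of_order(map_ranks):
    """Minimum number of markers to drop so the remaining ranks strictly increase.

    `map_ranks` holds each marker's rank on the reference map, listed in the
    order the markers appear along the scaffold.
    """
    tails = []  # tails[k] = smallest last rank of an increasing run of length k + 1
    for rank in map_ranks:
        pos = bisect_left(tails, rank)
        if pos == len(tails):
            tails.append(rank)
        else:
            tails[pos] = rank
    return len(map_ranks) - len(tails)


# Hypothetical scaffold: the marker with map rank 5 sits too early.
discrepancies = count_out_of_order([1, 2, 5, 3, 4, 6])
```

A scaffold assembled in the reverse orientation would need its marker list reversed before this check.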

Celera Scaffold and STS Order

[Figure panels by chromosome arm: 2L, 3R, 3L, 2R, X, 4]

Components vs. GeneMap ‘99

Order & Orientation is Essential to Finding Genes

Exon 1   Exon 2   Exon 3   Exon 4

Users consistently report finding genes that they can’t find elsewhere.

But if contigs are not correctly put together:

1   4   3 (reversed)   2

Exons are shuffled and unoriented, significantly impacting the ability of gene finding programs to make a correct prediction.

Contactin-associated protein gene (CNTNAP2)

Comparison of genomic DNA sequences retrieved from the public working draft and the Celera database

Genomics 73, 108-112, 2001. http://www.idealibrary.com

Working draft

Celera

Scaffold Sizes

[Figure: scaffold length (Mbp) plotted against percent of genome coverage for Mouse WGA, Human WGA, and Human CSA]

Scaffold Sizes

[Figure: scaffold length (Mbp) plotted against % of genome for Mouse WGA, Human WGA, Human CSA, Drosophila WGA, and Celera-only WGA]

All          Span (Mbp)   Scaffolds   Gap (Mbp)   Gaps       30K %   100K %
WGA          2,847        119,000     261         221,000    90.4    88.6
CSA          2,905        53,000      252         170,000    94.6    92.9
Mouse WGA    2,446        19,778      212         265,000    96.8    95.5
C-only WGA   2,781        6,500       134         *182,000   99.0    98.7

30K          Span (Mbp)   Scaffolds   Gap (Mbp)   Gaps
WGA          2,574        2,507       240         99,000
CSA          2,748        2,845       224         112,000
Mouse WGA    2,367        1,779       193         242,000
C-only WGA   2,754        537         118         174,000

Human and Mouse Assemblies

THE Book of Life

The Blueprint of Humanity

The Language of God

The Parts List of Humanity

The Human Genome is NOT

Molecular Function of Predicted proteins

BLAST, FASTA, and SIM4

Sorin Istrail

Celera Genomics

BLAST (Basic Local Alignment Search Tool)

A suite of sequence comparison algorithms optimized for speed used to search sequence databases for optimal local alignments to a protein or nucleotide query

Altschul, Gish, Miller, Myers, Lipman, “Basic Local Alignment Search Tool”, J. Mol. Biol. 215(3):403-10 (1990)

Altschul, Madden, Schaffer, Zhang, Zhang, Miller, Lipman, “Gapped BLAST and PSI-BLAST: a new generation of protein database search programs”, NAR 25(17):3389-402 (1997) (and references therein)

Program   Query                          Database
blastp    protein                        protein
blastn    DNA                            DNA
blastx    DNA (translated in 6 frames)   protein
tblastn   protein                        DNA (translated in 6 frames)
tblastx   DNA (translated in 6 frames)   DNA (translated in 6 frames)
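The phrase "translated in 6 frames" can be made concrete with a short sketch: a DNA sequence is read in codons starting at offsets 0, 1, and 2 on the given strand, and again on the reverse-complement strand. The code below uses the standard genetic code; the function names are illustrative, and it assumes the input contains only A, C, G, and T.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop codon.
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from position 0; a trailing partial codon is dropped."""
    return "".join(CODON_TABLE[dna[p:p + 3]] for p in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Return the six conceptual translations keyed '+1'..'+3' and '-1'..'-3'."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


frames = six_frames("ATGGCCTAA")
```

Programs such as blastx and tblastn compare at the protein level, which is why each DNA sequence contributes six candidate protein sequences.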

The BLAST algorithm

• Detect all word hits (exact, or nearly identical matches) of a given length between the two sequences
  – k=10 for nucleotide sequences (exact word matches)
  – k=3 for protein sequences (nearly identical word matches)

• Extend the word hits in both directions to high-scoring gap-free segment pairs (HSPs)
  – retain only HSPs that score above a threshold
  – start from the center of the HSP (original BLAST, 1990), or from the center of a pair of HSPs located close to each other on the same diagonal (gapped BLAST, 1997)

• Extend the HSPs in both directions allowing for gaps
  – use dynamic programming, and stop when the alignment score falls more than a threshold X below the best score yet seen

• Report all statistically significant local alignments
  – E-value (starting with BLAST 2.0) is used to measure the statistical significance
  – E-value = the number of alignments with score equal to or higher than s one would expect to find by chance when searching the database
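The first two stages (word hits, then gap-free extension with a drop-off threshold) can be sketched for nucleotide sequences as follows. This is a toy illustration, not the BLAST implementation: the +1/-1 scoring, the `x_drop` value, the small word size in the example, and all function names are assumptions, and the neighborhood words used for proteins are not modeled.

```python
from collections import defaultdict


def word_hits(query, subject, k):
    """Return (query_pos, subject_pos) for every exact shared word of length k."""
    index = defaultdict(list)
    for j in range(len(subject) - k + 1):
        index[subject[j:j + k]].append(j)
    return [(i, j)
            for i in range(len(query) - k + 1)
            for j in index.get(query[i:i + k], [])]


def extend_ungapped(query, subject, i, j, k, match=1, mismatch=-1, x_drop=3):
    """Extend the word hit at (i, j) in both directions without gaps.

    Returns (score, query_start, subject_start, length) of the best segment pair.
    """
    def walk(qi, sj, step):
        best = score = 0
        best_len = length = 0
        while 0 <= qi < len(query) and 0 <= sj < len(subject):
            score += match if query[qi] == subject[sj] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score > x_drop:
                break  # score fell too far below the best seen: stop extending
            qi += step
            sj += step
        return best, best_len

    right_score, right_len = walk(i + k, j + k, 1)
    left_score, left_len = walk(i - 1, j - 1, -1)
    score = k * match + left_score + right_score
    return score, i - left_len, j - left_len, k + left_len + right_len


hits = word_hits("AAACGTGGG", "CCACGTGTT", k=4)
hsp = extend_ungapped("AAACGTGGG", "CCACGTGTT", 2, 2, k=4)
```

Indexing one sequence by its words makes hit detection linear in the sequence lengths plus the number of hits, which is the source of the speed the slide mentions.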

FASTA

A program for rapid alignment of pairs of protein and DNA sequences, building a local alignment from matching sequence patterns, or words

Algorithm for comparing a query to a database of sequences

• For each database sequence:
  – Identify the 10 diagonal regions having the largest number of perfect word matches of a given length (word size: k=1,2 for protein, and k=6-10 for nucleotide searches)
  – Re-score these regions using a given scoring matrix (e.g., PAM250), and trim them to form (gap-free) maximal scoring initial regions
  – Join (non-overlapping) initial regions from adjacent diagonals to generate longer regions, allowing for gaps
  – Re-score these based on the initial regions’ scores, assessing a penalty for each joining

• Align the query sequence to each of the sequences in the search set having the highest overall scores

Pearson and Lipman, “Improved tools for biological sequence comparison”, Proc. Natl. Acad. Sci. USA 85:2444-2448 (1988).
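The first step (finding the diagonals with the most perfect word matches) can be sketched by counting, for every shared word, the diagonal `subject_pos - query_pos` on which it falls. This is a simplified illustration: it scores whole diagonals rather than bounded regions within a diagonal, and the function name and example sequences are hypothetical.

```python
from collections import Counter, defaultdict


def top_diagonals(query, subject, k=2, n_best=10):
    """Return up to n_best (diagonal, word_match_count) pairs, best first."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    diagonal_counts = Counter()
    for j in range(len(subject) - k + 1):
        for i in index.get(subject[j:j + k], ()):
            diagonal_counts[j - i] += 1  # matches on one diagonal share j - i
    return diagonal_counts.most_common(n_best)


# The query appears in the subject shifted right by 2 positions.
ranked = top_diagonals("ACGTAC", "TTACGTAC", k=2)
```

A run of word matches on one diagonal corresponds to a gap-free similar region, which is why the diagonal with the highest count is the natural place to start re-scoring.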

Sim4

Aligns an expressed DNA (EST, cDNA, mRNA) sequence with a genomic sequence for that gene, allowing for introns and sequencing errors

[Figure: genomic sequence, 5’ to 3’, with Exon 1, Exon 2, and Exon 3 separated by introns that begin with GT and end with AG, aligned against the cDNA]

Florea, Hartzell, Zhang, Rubin, Miller, “A computer program for aligning expressed DNA and genomic sequences”, Genome Res 8(9):967-74 (1998)

Stages and algorithmic techniques

• Detect basic homology blocks
  – Determine gap-free matches (HSPs) using a ‘blast’-like homology search
      – Detect all exact word matches of length k (e.g., k=12)
      – Extend the word hits in both directions, by substitutions, to gap-free high-scoring segment pairs (HSPs)
      – Retain only HSPs scoring above a threshold
  – Connect the HSPs to form larger blocks (‘exon cores’) using sparse dynamic programming

• Extend or trim the exon cores to eliminate gaps or overlaps in the cDNA sequence
  – Extend the similarity blocks using fast greedy sequence comparison algorithms
  – Detect new exon cores with the ‘blast’-like homology search tuned for higher sensitivity

• Refine the introns
  – Predict the locations of splice junctions using a combined measure of the accuracy of alignment and the intensity of splice signals at the ends of each intron

• Generate the spliced alignment
  – Align the sequences within individual exons using greedy alignment algorithms
  – Connect the chain of exon alignments by gaps (introns)
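The step that connects HSPs into exon cores can be sketched as a chaining problem: choose a set of HSPs that appear in the same order in both the cDNA and the genomic sequence and that maximizes total score. The sketch below uses a plain quadratic dynamic program as a stand-in for the sparse dynamic programming named on the slide; the tuple layout, the function name, and the example HSPs are assumptions, and no gap or intron penalty is modeled.

```python
def chain_hsps(hsps):
    """Return the highest-scoring colinear chain of HSPs.

    Each HSP is (cdna_start, genomic_start, length, score). Two HSPs may be
    chained when the second begins at or after the end of the first in both
    sequences.
    """
    hsps = sorted(hsps)
    if not hsps:
        return []
    best = [h[3] for h in hsps]      # best chain score ending at each HSP
    prev = [None] * len(hsps)        # back-pointers for chain recovery
    for b, (cb, gb, _, sb) in enumerate(hsps):
        for a in range(b):
            ca, ga, la, _ = hsps[a]
            if ca + la <= cb and ga + la <= gb and best[a] + sb > best[b]:
                best[b] = best[a] + sb
                prev[b] = a
    end = max(range(len(hsps)), key=best.__getitem__)
    chain = []
    while end is not None:
        chain.append(hsps[end])
        end = prev[end]
    return chain[::-1]


# Hypothetical HSPs: three colinear blocks plus one spurious hit far downstream.
example = [(0, 100, 50, 50), (50, 400, 40, 40), (30, 900, 20, 15), (90, 700, 30, 30)]
exon_cores = chain_hsps(example)
```

The large genomic jumps between consecutive chained blocks (100 to 400 to 700 here) are where the introns fall; the spurious hit is left out because it breaks colinearity.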
