Post on 18-Dec-2015
From First Assembly Towards a New Cyberpharmaceutical Computing Paradigm
Sorin Istrail
Senior Director, Informatics Research
Stan Ulam’s Vision
“Don’t ask what Mathematics can do for Biology, ask what Biology can do for Mathematics.”
The Gene Counting Problem
“The more I see the less I know for sure.”
John Lennon
The O.J. Simpson Problem
No matter how good the data look, it is full of errors!
David Botstein, Albuquerque, 1994
Science and Technology
• Genomics
• Comparative Genomics
• Proteomics
• Pharmacogenomics
• Structural Genomics
• Drug and Vaccine Design
• DNA Expression Chips
• Animal Models
GENOMICS
There is Nothing More Important than the Assembly!
Gene Myers and Granger Sutton and the Assembly Team
Celera Human Assembly:
Largest Non-Defence Computation
20,000 CPU hours -- done 5 times now
160 Processors -- Compaq Architecture
Assembly Progression(Macro View)
Generally 3-10 pairs link each consecutive contig
Completed Microbial Genomes
[Figure: timeline of completed microbial genomes, 1996 to 1999; names in parentheses appear in parentheses in the figure.]
Archaeoglobus fulgidus
Methanobacterium thermoautotrophicum
Saccharomyces cerevisiae
Mycoplasma pneumoniae
Methanococcus jannaschii
Mycoplasma genitalium
Haemophilus influenzae
Treponema pallidum
Borrelia burgdorferi
Helicobacter pylori
Bacillus subtilis
Escherichia coli
Aquifex aeolicus
Mycobacterium tuberculosis H37Rv
Pyrococcus horikoshii
(Mycobacterium tuberculosis CSU#93)
(Deinococcus radiodurans)
(Thermotoga maritima)
(Rickettsia prowazekii)
Chlamydia trachomatis
Synechocystis sp.
GENOMES SEQUENCED AT TIGR and CELERA
Pathogens
*Haemophilus influenzae Rd
*Mycoplasma genitalium
*Helicobacter pylori
*Borrelia burgdorferi
*Treponema pallidum
*Plasmodium falciparum
*Neisseria meningitidis
*Chlamydia trachomatis
*Chlamydia pneumoniae
*Vibrio cholerae
*Streptococcus pneumoniae
*Mycobacterium tuberculosis
*Porphyromonas gingivalis
*Trypanosoma brucei
*Staphylococcus aureus
*Enterococcus faecalis
*Chlamydia psittaci
Plants
*Arabidopsis thaliana
Environment
*Methanococcus jannaschii
*Archaeoglobus fulgidus
*Thermotoga maritima
*Deinococcus radiodurans
*Chlorobium tepidum
*Caulobacter crescentus
*Shewanella putrefaciens
*Desulfovibrio vulgaris
*Pseudomonas putida
Insects
**Drosophila melanogaster
Mammals
**Human
**Mouse
* The Institute for Genomic Research
** Celera Genomics
Genesis of Celera August 1998
New 3700 automated DNA Sequencer changed the sequencing possibility
Combined with TIGR Whole Genome Sequencing Strategy
And 64bit computing
Celera’s Sequencing / SNP Discovery Center
Celera Supercomputing Facility
Celera’s system is one of the most powerful civilian super-computing facilities in the world
Currently over 1.5 teraflop of computing power in a virtual compute farm of Compaq processors with 100 terabytes storage
Next phase a 100 teraflop computer
• Sequencing reactions produce short reads (~550bp).
Human Genome: ~3 billion bases
Sequence read: ~550 bases
• The human genome is repeat-rich.
Many short reads look identical to each other.
GCATTA...GACCGT
CGGATAGACATAAC
CGGATAGACATAAC
CAGCAGCAGCAGCA
CAGCAGCAGCAGCA
Obstacles to Genome Sequencing
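The repeat problem can be made concrete with a small sketch. It counts what fraction of read-length windows in a sequence occur more than once, so a read drawn from such a window cannot be placed unambiguously. The function name and the toy inputs are illustrative assumptions, not part of the original slides.

```python
from collections import Counter


def ambiguous_read_fraction(genome: str, read_len: int) -> float:
    """Fraction of read-length windows whose sequence occurs more than once."""
    windows = [genome[i:i + read_len]
               for i in range(len(genome) - read_len + 1)]
    counts = Counter(windows)
    repeated = sum(1 for w in windows if counts[w] > 1)
    return repeated / len(windows)


# A tandem repeat like the CAG run on the slide: every 5-base read is ambiguous.
print(ambiguous_read_fraction("CAGCAGCAGCAGCA", 5))
# A short sequence without repeats: every 4-base read is unique.
print(ambiguous_read_fraction("ACGTTGCA", 4))
```

On real data the same idea is applied with reads of about 550 bases, where long, near-identical repeats rather than short tandem runs cause the ambiguity.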
1. Mapping and Walking
2. Mapping and Clone by Clone Shotgun
3. Whole Genome Shotgun with Mate Pairs
Lab-Intense (SLOW)
Compute-Intense (FAST)
Comparison of Sequencing Strategies
• Mapping and Shotgun
1) Replicate mapped spans of DNA.
Chromosome
Mapped span (BAC) 35,000
2) Shear the replicates randomly and sequence the pieces.
[Figure: randomly sheared, overlapping pieces of the replicates, each drawn as the read "cgattc".]
3) Assemble reads by overlap matching. Infer the original sequence by consensus.
[Figure: computed overlaps among the reads, and the computed sequence: cgattcggattctcgattctacgaa]
Clone by Clone Shotgun sequencing
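Step 3 (assemble reads by overlap matching) can be sketched with a greedy merge: repeatedly join the two reads with the longest suffix-prefix overlap. This is a toy illustration under simplifying assumptions (error-free reads, one strand, no consensus voting), not the assembler used at Celera; the function names and `min_len` are hypothetical.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if < min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Merge reads pairwise by best overlap until no overlap >= min_len remains."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    # Drop reads fully contained in another read.
    reads = [r for r in reads if not any(r != s and r in s for s in reads)]
    while len(reads) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # remaining reads do not overlap: several contigs
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for t, r in enumerate(reads) if t not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Three overlapping reads cut from the computed sequence shown on the slide.
target = "cgattcggattctcgattctacgaa"
reads = [target[0:12], target[6:19], target[13:25]]
print(greedy_assemble(reads))
```

Note that the second read also overlaps the first in the wrong order by six bases, because "cgattc" recurs in the target. With ties broken differently the greedy merge could join them incorrectly, which is the repeat problem in miniature.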
DNA target sample
SHEAR & SIZE
e.g., 10Kbp ± 8% std.dev.
CLONE & END SEQUENCE
End Reads / Mate Pairs
590bp
10,000bp
Mate-Pair Shotgun DNA Sequencing
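A minimal simulation of the mate-pair idea, assuming insert lengths are normally distributed around 10Kbp with an 8% standard deviation and that 590 bases are read from each end, as on the slide. The function names, the normal-distribution choice and the random toy genome are assumptions made for illustration.

```python
import random


def reverse_complement(dna: str) -> str:
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def sample_mate_pair(genome, mean_insert=10_000, rel_sd=0.08,
                     read_len=590, rng=random):
    """Return (forward_read, reverse_read, insert_len) for one random insert."""
    insert_len = round(rng.gauss(mean_insert, rel_sd * mean_insert))
    # Keep the insert long enough for two reads and no longer than the genome.
    insert_len = min(max(insert_len, 2 * read_len), len(genome))
    start = rng.randrange(len(genome) - insert_len + 1)
    insert = genome[start:start + insert_len]
    forward = insert[:read_len]                       # read from the 5' end
    reverse = reverse_complement(insert[-read_len:])  # read from the other end
    return forward, reverse, insert_len


rng = random.Random(0)
toy_genome = "".join(rng.choice("ACGT") for _ in range(50_000))
fwd, rev, size = sample_mate_pair(toy_genome, rng=rng)
print(len(fwd), len(rev), size)
```

The useful property for assembly is that the two reads come with a known relative orientation and an approximately known separation, even when the sequence between them is unread.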
Whole Genome Shotgun Sequencing:
– ~27 million reads for Human.
– Collect 10-15x BAC inserts and end sequence: ~300K pairs for Human.
– Early simulations showed that if repeats were considered black boxes, one could still cover 99.7% of the genome unambiguously.
[Figure labels: BAC 5’, BAC 3’; insert sizes 2Kbp, 10Kbp, 50Kbp]
Whole Genome Sequencing Approaches
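As a rough worked example of the numbers on these slides, fold coverage can be estimated as (number of reads × read length) / genome length. The sketch below assumes the ~550-base read length and ~3 billion base genome quoted earlier; the function names are hypothetical, and real coverage also depends on read trimming and quality.

```python
import math


def fold_coverage(n_reads: int, read_len_bp: int, genome_len_bp: int) -> float:
    """Average number of times each base is covered by a read."""
    return n_reads * read_len_bp / genome_len_bp


def reads_needed(coverage: float, read_len_bp: int, genome_len_bp: int) -> int:
    """Reads required to reach a target fold coverage."""
    return math.ceil(coverage * genome_len_bp / read_len_bp)


GENOME = 3_000_000_000   # ~3 billion bases (from the slides)
READ = 550               # ~550 bases per read (from the slides)

print(fold_coverage(27_000_000, READ, GENOME))  # coverage from ~27 million reads
print(reads_needed(10, READ, GENOME))           # reads for a hypothetical 10x target
```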
Building Scaffolds
Mated reads
Confirmed if at least 2 join the same unitigs and one of them is a U-unitig.
1. 2k-10k Scaffolds: Compute all “unitigs” in graph of U-unitigs connected by confirmed mate links.
2. BAC Scaffolds: Compute all “unitigs” in graph of 2/10K scaffolds connected by confirmed BAC links.
Scaffold
Sequence or Repeat Gaps (with estimated distances)
1 in 10^15 that a confirmed pair is in error.
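The confirmation rule for mate links can be sketched as a simple count: a link between two unitigs is kept when at least two mate pairs join them and at least one of the two is a U-unitig. This sketch ignores orientation and distance consistency, which a real scaffolder would also check; the function name and the identifiers in the example are hypothetical.

```python
from collections import Counter


def confirmed_links(mate_links, u_unitigs, min_pairs=2):
    """mate_links: one (unitig_a, unitig_b) tuple per mate pair whose two reads
    fall in different unitigs. Returns {(a, b): pair_count} for confirmed links."""
    counts = Counter(tuple(sorted(link)) for link in mate_links
                     if link[0] != link[1])
    return {pair: n for pair, n in counts.items()
            if n >= min_pairs and (pair[0] in u_unitigs or pair[1] in u_unitigs)}


links = [("u1", "u2"), ("u2", "u1"),   # two pairs join u1 and u2
         ("u1", "r1"),                 # only one pair: not confirmed
         ("r1", "r2"), ("r2", "r1")]   # two pairs, but neither is a U-unitig
print(confirmed_links(links, u_unitigs={"u1", "u2"}))
```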
The Gene Counting Problem
The number will probably never be known exactly.
Current estimates: 30,000-40,000
Other estimates: 120,000
Gene discovery:
• sequence analysis
• motif recognition
• matches to mRNA
• computational predictions
• mouse data matches
• experimental validation
Gene Counting
Random ESTs from tissues
CpG islands (55% of known genes are in CpG islands)
Complexity of EST data sets -- sampling biased on tissue and depth of collection
Underrepresented in databases: low-abundance genes, and genes expressed in inaccessible tissues or developmental stages
Overrepresented: EST data sets are composed of incomplete sequences of mRNA, and non-overlapping pieces of same mRNA
Functional Assignment using Gene Ontology
Drosophila: 13,601 Genes
Unknown: 48%
Enzyme: 18%
Hypothetical: 11%
Nucleic Acid Binding: 8%
Signal Transduction: 4%
Transporter: 4%
Structural Protein: 2%
Ligand Binding or Carrier: 2%
Cell Adhesion: 1%
Motor Protein: 1%
Chaperone: 1%
Gene Number in the Human Genome
[Figure: number of genes (axis from 10 K to 50 K) by confidence level; categories labelled Known genes, Otto, 4, 3, 2, 1.]
Haemophilus vs. Drosophila
                        Hflu      Drosophila    X
Genome Size (Mbp)       1.8       120           67
Sequences               26,000    3,100,000     116
Months in sequencing    4         4             1
Sequencing Staff        24        50            2.1
Assembly Group Staff    1         10            10
Human Genome Sequence from 5 Humans (3 females-2 males) completed
• Human sequencing started 9/8/99
• Over 39X coverage of the genome in paired plasmid reads
• First Assembly announced June 26: 2.9 billion bp
• Published in Science, February 16, 2001
Validation Against STS-map
• Scaffolds were aligned against the BDGP STS-content map
• All scaffolds spanning 2 or more STSs were checked for order discrepancies.
• 16 STS sites out of 2175 (0.73%) were out of order, well within the estimated error rate of the STS map. 10 have been determined to be incorrect.
[Figure: Celera Scaffold and STS Order. Celera scaffold order plotted against BDGP STS order for chromosome arms 2L, 2R, 3L, 3R, X and 4.]
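One way to count order discrepancies of this kind is sketched below: list the map ranks of the STSs in the order they occur along a scaffold, and report how many must be dropped for the rest to be monotone. This uses a longest-increasing-subsequence computation and tries both scaffold orientations; it is an illustrative method chosen here, not necessarily the check used for the slide's figures.

```python
from bisect import bisect_left


def _not_increasing(ranks):
    """Minimum number of entries to remove so the remainder strictly increases."""
    tails = []  # tails[k] = smallest tail of an increasing run of length k + 1
    for r in ranks:
        k = bisect_left(tails, r)
        if k == len(tails):
            tails.append(r)
        else:
            tails[k] = r
    return len(ranks) - len(tails)


def out_of_order_sts(map_ranks):
    """map_ranks: STS ranks in the reference map, listed in scaffold order.
    The scaffold may be reversed relative to the map, so test both directions."""
    return min(_not_increasing(map_ranks), _not_increasing(map_ranks[::-1]))


print(out_of_order_sts([1, 2, 5, 3, 4, 6]))  # one marker (5) is misplaced
print(out_of_order_sts([4, 3, 2, 1]))        # reversed scaffold, no discrepancy
```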
Components vs. GeneMap ‘99
Order & Orientation is Essential to Finding Genes
Exon 1   Exon 2   Exon 3   Exon 4
But if contigs are not correctly put together:
1   4   3 reversed   2
Exons are shuffled and unoriented, significantly impacting the ability of gene finding programs to make a correct prediction.
Users consistently report finding genes that they can’t find elsewhere.
Contactin-associated protein gene (CNTNAP2)
Comparison of genomic DNA sequences retrieved from the public working draft and the Celera database
Genomics 73, 108-112, 2001. http://www.idealibrary.com
Working draft
Celera
[Figure: Scaffold Sizes, two panels. Scaffold length (Mbp) versus percent of genome coverage. Series: Mouse WGA, Human WGA, Human CSA, Drosophila WGA, Celera-only WGA.]

All           Span (Mbp)   Scaffolds   Gap (Mbp)   Gaps        30K %   100K %
Mouse WGA     2,446        19,778      212         265,000     96.8    95.5
WGA           2,847        119,000     261         221,000     90.4    88.6
CSA           2,905        53,000      252         170,000     94.6    92.9
C-only WGA    2,781        6,500       134         *182,000    99.0    98.7

30K           Span (Mbp)   Scaffolds   Gap (Mbp)   Gaps
Mouse WGA     2,367        1,779       193         242,000
WGA           2,574        2,507       240         99,000
CSA           2,748        2,845       224         112,000
C-only WGA    2,754        537         118         174,000
Human and Mouse Assemblies
THE Book of Life
The Blueprint of Humanity
The Language of God
The Parts List of Humanity
The Human Genome is NOT
ADRB2
Molecular Function of Predicted proteins
BLAST, FASTA, and SIM4
Sorin Istrail
Celera Genomics
BLAST (Basic Local Alignment Search Tool)
A suite of sequence comparison algorithms optimized for speed used to search sequence databases for optimal local alignments to a protein or nucleotide query
Altschul, Gish, Miller, Myers, Lipman “Basic Local Alignment Search Tool”, J. Mol. Biol. 215(3):403-10 (1990)
Altschul, Madden, Schaffer, Zhang, Zhang, Miller, Lipman “Gapped BLAST and PSI-BLAST: a new generation of protein database search programs”, NAR 25(17):3389-402 (1997) (and references therein)
Program    Query                           Database
blastp     protein                         protein
blastn     DNA                             DNA
blastx     DNA (translated in 6 frames)    protein
tblastn    protein                         DNA (translated in 6 frames)
tblastx    DNA (translated in 6 frames)    DNA (translated in 6 frames)
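"Translated in 6 frames" means reading the DNA in three codon offsets on each strand. A minimal sketch follows, assuming the standard genetic code and an unambiguous ACGT alphabet; the function names and the frame labels (+1 to +3, -1 to -3) are illustrative choices rather than BLAST's own interface.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna: str) -> str:
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna: str) -> str:
    """Translate complete codons from the start of dna; unknown codons give X."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna: str) -> dict:
    """Return the six conceptual translations keyed by strand and frame."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


print(six_frame_translations("ATGGCC"))
```

A translated search such as blastx then compares each of the six protein strings against the protein database, which is why it costs several times more than a plain protein query.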
The BLAST algorithm
• Detect all word hits (exact, or nearly identical matches) of a given length between the two sequences
– k=10 for nucleotide sequences (exact word matches)
– k=3 for protein sequences (nearly identical word matches)
• Extend the word hits in both directions to high-scoring gap-free segment pairs (HSPs)
– retain only HSPs that score above a threshold
– start from the center of the HSP (original BLAST, 1990), or from the center of a pair of HSPs located close to each other on the same diagonal (gapped BLAST, 1997)
• Extend the HSPs in both directions allowing for gaps
– use dynamic programming, and stop when the alignment score falls more than a threshold X below the best score yet seen
• Report all statistically significant local alignments
– E-value (starting with BLAST 2.0) is used to measure the statistical significance
– E-value = the number of alignments with score equal to or higher than s one would expect to find by chance when searching the database
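The first two stages (word hits, then gap-free extension with an X-drop stopping rule) can be sketched for nucleotide sequences as follows. This is a simplified illustration, not BLAST's implementation: it uses exact words only, and the scoring values (match +1, mismatch -3, X = 5) and function names are assumptions. The gapped extension and E-value stages are omitted.

```python
from collections import defaultdict


def word_hits(query: str, subject: str, k: int):
    """All (i, j) with query[i:i+k] == subject[j:j+k] (exact word matches)."""
    index = defaultdict(list)
    for j in range(len(subject) - k + 1):
        index[subject[j:j + k]].append(j)
    return [(i, j)
            for i in range(len(query) - k + 1)
            for j in index.get(query[i:i + k], [])]


def _extend(query, subject, i, j, step, x_drop, match, mismatch):
    """Walk along the diagonal from (i, j); return (best_score, best_length)."""
    score = best = length = best_len = 0
    while 0 <= i < len(query) and 0 <= j < len(subject):
        score += match if query[i] == subject[j] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > x_drop:
            break  # score fell more than X below the best seen: stop
        i += step
        j += step
    return best, best_len


def ungapped_hsp(query, subject, i, j, k, x_drop=5, match=1, mismatch=-3):
    """Extend word hit (i, j) both ways; return (score, q_start, s_start, length)."""
    r_score, r_len = _extend(query, subject, i + k, j + k, 1, x_drop, match, mismatch)
    l_score, l_len = _extend(query, subject, i - 1, j - 1, -1, x_drop, match, mismatch)
    return (k * match + l_score + r_score, i - l_len, j - l_len, k + l_len + r_len)


q, s = "GGACGTACGTCC", "TTACGTACGTAA"
hit = word_hits(q, s, 4)[0]
print(hit, ungapped_hsp(q, s, *hit, k=4))
```

Indexing the subject's words once and looking up each query word is what makes the hit-detection stage fast compared with aligning every pair of positions.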
FASTA
A program for rapid alignment of pairs of protein and DNA sequences, building a local alignment from matching sequence patterns, or words
Algorithm for comparing a query to a database of sequences
• For each database sequence:
– Identify the 10 diagonal regions having the largest number of perfect word matches of a given length (word size: k=1,2 for protein, and k=6-10 for nucleotide searches)
– Re-score these regions using a given scoring matrix (e.g., PAM250), and trim them to form (gap-free) maximal scoring initial regions
– Join (non-overlapping) initial regions from adjacent diagonals to generate longer regions, allowing for gaps
– Re-score these based on the initial regions’ scores, assessing a penalty for each joining
• Align the query sequence to each of the sequences in the search set having the highest overall scores
Pearson and Lipman, “Improved tools for biological sequence comparison”, Proc. Natl. Acad. Sci. USA 85:2444-2448 (1988).
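The first FASTA step, finding the diagonals richest in perfect word matches, can be sketched by indexing one sequence's words and tallying matches by diagonal (query position minus subject position). This is a simplified illustration: it counts matches over whole diagonals rather than locating dense regions within them, and the function name and defaults are assumptions.

```python
from collections import Counter, defaultdict


def top_diagonals(query: str, subject: str, k: int, n: int = 10):
    """Return up to n (diagonal, match_count) pairs, where diagonal = i - j for
    each exact k-word match query[i:i+k] == subject[j:j+k]."""
    index = defaultdict(list)
    for j in range(len(subject) - k + 1):
        index[subject[j:j + k]].append(j)
    diagonals = Counter()
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], []):
            diagonals[i - j] += 1
    return diagonals.most_common(n)


# The query matches the subject best when shifted by two positions (diagonal -2).
print(top_diagonals("ACGTACGT", "TTACGTACGT", k=4, n=2))
```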
Sim4
Aligns an expressed DNA (EST, cDNA, mRNA) sequence with a genomic sequence for that gene, allowing for introns and sequencing errors
[Figure: a genomic sequence (5’ to 3’) containing Exon 1, Exon 2 and Exon 3 separated by two introns, each intron beginning with GT and ending with AG, aligned to the cDNA.]
Florea, Hartzell, Zhang, Rubin, Miller, “A computer program for aligning expressed DNA and genomic sequences”, Genome Res 8(9):967-74 (1998)
Stages and algorithmic techniques
• Detect basic homology blocks
– Determine gap-free matches (HSPs) using a ‘blast’-like homology search: detect all exact word matches of length k (e.g., k=12); extend the word hits in both directions, by substitutions, to gap-free high-scoring segment pairs (HSPs); retain only HSPs scoring above a threshold
– Connect the HSPs to form larger blocks (‘exon cores’) using sparse dynamic programming
• Extend or trim the exon cores to eliminate gaps or overlaps in the cDNA sequence
– Extend the similarity blocks using fast greedy sequence comparison algorithms
– Detect new exon cores with the ‘blast’-like homology search tuned for higher sensitivity
• Refine the introns
– Predict the locations of splice junctions using a combined measure of the accuracy of alignment and the intensity of splice signals at the ends of each intron
• Generate the spliced alignment
– Align the sequences within individual exons using greedy alignment algorithms
– Connect the chain of exon alignments by gaps (introns)
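The step that connects HSPs into exon cores can be illustrated with a chaining routine: pick the highest-scoring set of HSPs that appear in the same order in both the cDNA and the genomic sequence. The sketch below uses a plain quadratic dynamic program in place of the sparse dynamic programming named on the slide, and it does not penalise gaps between blocks; the tuple layout and function name are assumptions.

```python
def chain_hsps(hsps):
    """hsps: list of (cdna_start, cdna_end, gen_start, gen_end, score), with
    half-open intervals. Returns the best-scoring colinear chain, in order."""
    if not hsps:
        return []
    order = sorted(range(len(hsps)), key=lambda t: (hsps[t][0], hsps[t][2]))
    best, prev = {}, {}
    for pos, a in enumerate(order):
        c_start, _, g_start, _, score = hsps[a]
        best[a], prev[a] = score, None
        for b in order[:pos]:
            _, b_c_end, _, b_g_end, _ = hsps[b]
            # b may precede a only if it ends before a starts in both sequences.
            if b_c_end <= c_start and b_g_end <= g_start and best[b] + score > best[a]:
                best[a], prev[a] = best[b] + score, b
    end = max(best, key=best.get)
    chain = []
    while end is not None:
        chain.append(hsps[end])
        end = prev[end]
    return chain[::-1]


blocks = [(0, 50, 100, 150, 50),    # consistent with the next block
          (50, 90, 1000, 1040, 40),
          (40, 60, 500, 520, 20),   # overlaps the first block in the cDNA
          (60, 100, 300, 340, 30)]  # consistent with the first block only
print(chain_hsps(blocks))
```

The large genomic jump between chained blocks (150 to 1000 here) is where an intron would be placed in the spliced alignment.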