FIND MEANING IN COMPLEXITY
© Copyright 2013 by Pacific Biosciences of California, Inc. All rights reserved.
Fundamentals and Applications of Single Molecule, Real-Time (SMRT®) Sequencing
Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, and SMRTbell are trademarks of Pacific Biosciences in the United States and/or other countries. Celera is a trademark of Celera Corporation; and HiSeq and MiSeq are trademarks of Illumina, Inc.
© Copyright 2014 by Pacific Biosciences of California, Inc. All rights reserved.
CAT-AgroFood Plant Research International Workshop for PacBio Sequencing
March 26, 2014
Dr. Christoph König
-
Single-Molecule, Real-Time DNA Sequencing (SMRT) Is:
• DNA polymerase
• ZMW confinement
• Phospholinked nucleotides
-
PacBio® RS II Typical Performance
-
Read Definitions in RS System & SMRT® Analysis v2.0
SMRTbell™ Template
Polymerase Read
Definition:
• Formerly called “read”
• 1+ passes
• With adapters
• 1 molecule, 1 pol. read
Uses:
• QC of instrument run
Subreads
Definition:
• Adapters removed
• 1 pass
• 1 molecule, 1+ subreads
Uses:
• Applications such as assembly and base modification
Read (of Insert)
Definition:
• The highest quality single sequence for an insert
• 1+ passes including partial passes
• 1 molecule, 1 read
Uses:
• Insert size distribution
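The relationship between a polymerase read and its subreads can be illustrated with a toy sketch: cutting the polymerase read at each adapter yields one subread per (possibly partial) pass. The adapter string, the example insert, and the `split_subreads` helper are hypothetical and exist only for illustration; real software locates adapters by alignment because raw reads contain errors, and real passes alternate between the forward and reverse-complement strand, which is ignored here.

```python
def split_subreads(polymerase_read: str, adapter: str) -> list[str]:
    """Split a polymerase read into subreads at exact adapter matches.

    Toy model only: exact string matching stands in for adapter alignment.
    """
    # Each piece between two adapters is one (possibly partial) pass.
    pieces = polymerase_read.split(adapter)
    # Drop empty pieces, e.g. when the read starts or ends with an adapter.
    return [piece for piece in pieces if piece]


ADAPTER = "TTGACC"       # hypothetical adapter sequence
insert = "ACGTACGTAA"    # hypothetical insert

# Partial pass, adapter, full pass, adapter, partial pass.
polymerase_read = insert[4:] + ADAPTER + insert + ADAPTER + insert[:3]
subreads = split_subreads(polymerase_read, ADAPTER)
print(subreads)
```

In this toy case one molecule gives one polymerase read and three subreads, of which only the middle one covers the full insert.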
-
Blue Pippin™ System for Size Selection
[Figure: size-selected mouse lemur 20 kb library compared with 20 kb AMPure® mouse lemur library; legend: input gDNA, size-selected]
-
Most Uniform Coverage
• Ross et al. (2013) Characterizing and measuring bias in sequence data. Genome Biology, May 29;14(5):R51
“Pacific Biosciences coverage
levels are the least biased”
http://genomebiology.com/content/14/5/R51
-
Detection of DNA Base Modifications by SMRT Sequencing
Flusberg et al. (2010) Nature Methods 7: 461-465
-
Summary Sequence Performance
1. Long sequence reads
– Finish genomes, de novo assemblies
– Full-length cDNA sequencing
– Long-range haplotype phasing
2. High Consensus Accuracy
– >99.999% (QV50)
– Lack of systematic sequencing errors
3. Lack of sequence context bias
– GC content
– Low complexity sequence
4. Base modification detection
– Epigenome characterization
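The pairing of ">99.999%" with "QV50" above follows from the standard Phred scale, where a quality value QV corresponds to an error probability of 10^(-QV/10). A minimal sketch of the conversion (the function names are illustrative, not part of any PacBio tool):

```python
import math


def qv_to_accuracy(qv: float) -> float:
    """Convert a Phred-scaled quality value to per-base accuracy (0..1)."""
    return 1.0 - 10.0 ** (-qv / 10.0)


def accuracy_to_qv(accuracy: float) -> float:
    """Convert per-base accuracy (0..1, below 1) to a Phred-scaled quality value."""
    return -10.0 * math.log10(1.0 - accuracy)


# QV50 means one expected error per 100,000 bases.
print(qv_to_accuracy(50))
print(accuracy_to_qv(0.99999))
```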
-
De Novo Assembly
-
Advantages of SMRT® Sequencing:
Impact of Long Read Lengths on De Novo Assembly
Koren S. et al. (2013) Reducing assembly complexity of microbial genomes with single molecule sequencing. Genome Biology, 14:R101
What can be achieved with infinite coverage given the read length?
PacBio
http://genomebiology.com/2013/14/9/R101
-
Easy Bioinformatics Solution to Finish Genomes Using Only PacBio® Reads
Full push-button solution from beginning to end
• Longest reads for continuity
• All reads for high consensus accuracy
Hierarchical Genome Assembly Process (HGAP)
Chin CS. et al. (2013) Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nat Methods. Jun;10(6):563-9.
http://www.ncbi.nlm.nih.gov/pubmed/23644548
Watch SMRT® Analysis Tutorial: Bacterial Assembly and Epigenetic Analysis
http://www.pacificbiosciences.com/Tutorials/Bacterial_Assembly_Epigenetic_Analysis_HGAP/story_html5.html
-
SMRT® Sequencing:
Gold Standard for microbial De Novo Assembly
-
Progress of PacBio-Only De Novo Assembly
Timeline: 2013 to 2014
• Bacteria (1-10 Mb): finished genomes
• Yeast (12 Mb): resolve most chromosomes
• Spinach (1 Gb): contig N50 531 kb
• Drosophila (170 Mb): contig N50 4.5 Mb
• Arabidopsis (120 Mb): contig N50 7.1 Mb
• Human, haploid (3.2 Gb): contig N50 4.4 Mb, max contig 44 Mb
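Contig N50, the metric quoted for each genome above, is the contig length at which contigs of that length or longer contain at least half of all assembled bases. A minimal sketch of the calculation, using made-up contig lengths rather than data from any of these assemblies:

```python
def n50(lengths: list[int]) -> int:
    """Return the N50 of a collection of contig (or read) lengths."""
    if not lengths:
        raise ValueError("n50() needs at least one length")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down until half of all bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable: the running sum always reaches the total")


# Hypothetical contig lengths, for illustration only.
example_contigs = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(example_contigs))
```

The same function applies to read lengths, which gives the "half of sequenced bases in reads greater than" style of statistic.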
-
PacBio-Only Sequencing of Arabidopsis
Metric                    Short-read (Ler-1)*   PacBio reads (Ler-0)   Improvement
Est. Genome Size (Mb)     110.4                 124.6                  11.5%
Polished Contigs          4,662                 545                    8.5X
N50 Contig Length (Mb)    0.067                 6.36                   95X
Max Contig Length (Mb)    0.46                  13.21                  29X
*http://1001genomes.org/data/MPI/MPISchneeberger2011/releases/current/
• Original Col-0 strain assembly (Sanger + manual finishing)
• ~$70M, several years
• PacBio® data recently used to assemble Ler-0 strain
Read Blog Entry: http://blog.pacificbiosciences.com/2014/01/data-release-preliminary-de-novo.html
Download Arabidopsis: http://datasets.pacb.com.s3.amazonaws.com/2014/Arabidopsis/reads/list.html
-
SNP Discovery with PacBio® Assemblies
Discovery of single nucleotide polymorphisms by PacBio assemblies
Mapping of ILMN PE or PacBio assembly to TAIR 10
[Figure: Venn diagrams of called SNPs]
• Ler0: 509,836 shared (95%/68%); 27,106 Ler0 ILMN PE only; 238,637 PacBio Ler0 assembly only
• Cvi (called SNPs between Cvi and Col): 685,104 shared (92%/72%); 55,947 Cvi ILMN PE only; 271,335 PacBio Cvi assembly only
Mapping of ILMN PE to PacBio assembly
• Ler0 PE – Ler0 assembly: 885 homozygous SNPs
• Cvi PE – Cvi assembly: 838 homozygous SNPs
• SNP frequency 7.5 × 10⁻⁶
• These SNPs are highly enriched in the pericentromere and associated with aberrantly high coverage
Watch Arabidopsis Genome Recording: http://aa314.gondor.co/webinar/resolving-the-complexity-of-genomic-and-epigenomic-variations-in-arabidopsis/
Other PAG XXII Recordings: http://blog.pacificbiosciences.com/2014/01/at-plant-animal-genome-workshop-users.html
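The percentage pairs on this slide (95%/68% and 92%/72%) appear to be the shared SNP calls expressed as a fraction of each call set. A minimal sketch of that arithmetic, assuming the counts map to the Venn regions as shared, Illumina-only, and PacBio-only; this mapping is an interpretation of the figure that happens to reproduce the printed percentages, and it is not stated explicitly on the slide:

```python
def overlap_percentages(shared: int, only_a: int, only_b: int) -> tuple[int, int]:
    """Return the shared calls as a rounded percentage of call set A and of call set B."""
    pct_of_a = 100.0 * shared / (shared + only_a)
    pct_of_b = 100.0 * shared / (shared + only_b)
    return round(pct_of_a), round(pct_of_b)


# Set A = Illumina paired-end calls, set B = PacBio assembly calls (assumed mapping).
ler0 = overlap_percentages(shared=509_836, only_a=27_106, only_b=238_637)
cvi = overlap_percentages(shared=685_104, only_a=55_947, only_b=271_335)
print(ler0, cvi)
```

Under this reading, most Illumina calls are also found in the PacBio assembly, while roughly a third of the PacBio calls have no Illumina counterpart.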
-
SNP Discovery with PacBio® Assemblies
PacBio assembly identifies SNPs in Illumina low-coverage (unmappable) regions
[Figure: called SNPs between Cvi and Col; legend: both, Illumina only, PacBio only]
Analysis by Jason Chin
Watch Arabidopsis Genome Recording: http://aa314.gondor.co/webinar/resolving-the-complexity-of-genomic-and-epigenomic-variations-in-arabidopsis/
Other PAG XXII Recordings: http://blog.pacificbiosciences.com/2014/01/at-plant-animal-genome-workshop-users.html
-
Assembling Rice Genomes
• Watch Richard McCombie's 2014 AGBT presentation
http://aa314.gondor.co/webinar/a-near-perfect-de-novo-assembly-of-a-eukaryotic-genome-using-sequence-reads-of-greater-than-10-kilobases-generated-by-the-pacific-biosciences-rs-ii/
-
PacBio-Only Sequencing of a Spinach Genome (980 Mb)
Watch Spinach Genome Recording: http://aa314.gondor.co/webinar/a-de-novo-draft-assembly-of-spinach-using-pacific-biosciences-technology/
Other PAG XXII Recordings: http://blog.pacificbiosciences.com/2014/01/at-plant-animal-genome-workshop-users.html
-
Long-Read Shotgun Human Genome Data Release
• 54x coverage of CHM1 cell line
• Avg SMRT® Cell throughput: 608 Mb
• Avg DNA insert length: 7,680 bp
• Half of sequenced bases in reads greater than: 10,739 bp
• Longest DNA insert sequenced: 42,774 bp
Read Blog Post: http://blog.pacificbiosciences.com/2014/02/data-release-54x-long-read-coverage-for.html
Download Dataset: http://datasets.pacb.com/2014/Human54x/fast.html
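The coverage and throughput figures above can be combined into a rough estimate of sequencing effort. A minimal sketch, assuming a haploid human genome size of 3.2 Gb (taken from the earlier progress slide) and treating the average SMRT Cell throughput as constant; the slide does not state how many SMRT Cells were actually run, so the result is only an estimate:

```python
import math


def smrt_cells_needed(genome_size_bp: int, coverage: float, yield_per_cell_bp: int) -> int:
    """Estimate how many SMRT Cells are needed to reach a target coverage."""
    total_bases = genome_size_bp * coverage
    return math.ceil(total_bases / yield_per_cell_bp)


# 54x coverage of a 3.2 Gb genome at an average of 608 Mb per SMRT Cell.
estimate = smrt_cells_needed(3_200_000_000, 54, 608_000_000)
print(estimate)
```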
-
Human Genome De Novo Assemblies Comparison
Assembly        Year  Technology             Assembly method           # of library types  Total assembly size (Gb)  Contig N50 (kb)
HuRef (Venter)  2007  ABI 3730               Celera Assembler          4                   2.78                      107
BGI YH          2009  Illumina GA            SOAPdenovo                5                   2.46                      7.4
KB1             2010  454 GS FLX Titanium    Newbler                   2                   2.79                      5.5
NA12878         2010  Illumina GA            ALLPATHS-LG               5                   2.82                      24
RP11_0.7        2013  454 GS, HiSeq, MiSeq   Newbler                   3                   2.81                      127
CHM1            2013  HiSeq, BAC clones      Reference guided          NA                  2.83                      144
CHM1            2014  PacBio RS II           FALCON, Celera Assembler  1                   3.25                      4378
Data sources: HuRef (Venter) (http://www.plosbiology.org/article/info:doi/10.1371/journal.pbio.0050254); BGI YH (http://genome.cshlp.org/content/20/2/265.abstract Table II); KB1 (http://www.nature.com/nature/journal/v463/n7283/full/nature08795.html); NA12878 (http://www.pnas.org/content/early/2010/12/20/1017351108.abstract Table 3); CHM1 (http://www.ncbi.nlm.nih.gov/assembly/GCF_000306695.2/)
-
Comparison of Human CHM1 Assemblies
[Figure: the 2014 PacBio® de novo assembly compared with the 2013 reference-guided short-read assembly with BACs; annotations: gaps, MHC region, 44 Mb contig]
-
The Next Challenge: Assembling Diploid Genomes
Developing bioinformatics and visualization tools to resolve diploid genomes
[Figure: early assembly result for the Ler-0 + Col-0 “synthetic” diploid]
Watch Jason Chin’s 2014 AGBT presentation “String Graph Assembly for Diploid Genomes with Long Reads”
http://aa314.gondor.co/webinar/string-graph-assembly-for-diploid-genomes-with-long-reads/
-
Benefits of PacBio® Sequencing for Large Genomes
• PacBio data complements short reads to improve new and existing de novo assemblies
• Improve N50 contig length even with modest 5x coverage
• Scaffold with PacBio long reads to set framework for genome completion
• Resolve troublesome gaps in low-complexity and repetitive genomic regions
• Catalog transposable elements
• Conduct gene-specific surveys
PacBio® De Novo Assembly Homepage
http://www.pacb.com/applications/denovo/index.html
-
PacBio® Isoform Sequencing of Full-length Transcripts
-
Transcript Diversity
-
Current State of Transcript Assembly
“The way we do RNA-seq now is… you take the transcriptome, you blow it up into pieces and then you try to figure out how they all go back together again… If you think about it, it’s kind of a crazy way to do things”
Michael Snyder, Professor and Chair of Genetics, Stanford University
Tal Nawy, End to end RNA Sequencing, Nature Methods, v10, n12, Dec. 2013, p1144–1145
Ian Korf (2013) Genomics: the state of the art in RNA-seq analysis, Nature Methods, Nov 26;10(12):1165-6. doi: 10.1038/nmeth.2735.
http://www.ncbi.nlm.nih.gov/pubmed/24296473
-
PacBio’s Iso-Seq™ Method for High-quality, Full-length Transcripts
Experimental pipeline:
1. PolyA mRNA
2. cDNA synthesis with adapters
3. Size partitioning & PCR amplification
4. SMRTbell™ ligation
5. PacBio® RS II sequencing
Informatics pipeline:
1. PacBio raw sequence reads
2. Remove adapters, remove artifacts → clean sequence reads
3. Reads clustering → isoform clusters
4. Consensus calling → nonredundant transcript isoforms
5. Quality filtering → final isoforms
6. Map to reference genome → evidence-based gene models
[Figure: raw read structure: SMRT adapter, 5’ primer, coding sequence, polyA tail (AAA)n, 3’ primer, SMRT adapter; reads of insert]
SampleNet: Iso-Seq Method with Clontech cDNA Synthesis Kit
http://www.smrtcommunity.com/Share/Protocol?id=a1q70000000HqSvAAK&strRecordTypeName=Protocol
DevNet: Iso-Seq wiki page
https://github.com/PacificBiosciences/cDNA_primer/wiki
-
No Assembly required
Multiple isoforms observed at a single locus
[Figure panels: rat heart, rat lung]
Tseng, PAG 2014, “Isoform Sequencing: Unveiling the Complex Landscape of the Eukaryotic Transcriptome on the PacBio® RS II” (poster)
https://s3.amazonaws.com/files.pacb.com/pdf/Isoform+Sequencing+-+Unveiling+the+Complexity+of+the+Eukaryotic+Transcriptome.pdf
-
“Gene Identification, Even in Well-Characterized Human Cell Lines and Tissues, is Likely Far From Complete”
Au et al. (2013) Characterization of the human ESC transcriptome by hybrid sequencing. PNAS doi: 10.1073/pnas.1320101110.
8,048 RefSeq-annotated, full-length isoforms and 5,459 predicted isoforms
“Over one-third of these are novel isoforms, including 273 RNAs from gene loci that have not previously been identified”
http://www.ncbi.nlm.nih.gov/pubmed/24282307
-
ABRF NGS RNA-Seq Comparative Study:
Iso-Seq™ Application provides Most Uniform 5’ to 3’ Coverage
-
Splice Landscape of Neurexin 1α
Treutlein et al. (2014) Cartography of neurexin alternative splicing mapped by single-molecule long-read mRNA sequencing. PNAS. doi:10.1073/pnas.1403244111
[Figure: Nrxn1α domain structure; exons (green: present, white: absent); splice isoform abundance]
• 2,574 full-length Nrxn1α mRNA sequence reads
• 6 SMRT® Cells
• 247 unique alternatively spliced isoforms
http://www.pnas.org/content/early/2014/03/12/1403244111.abstract
-
PacBio® Sequences Used for Gene Model Validation in Lettuce
[Figure: gene model confidence without PacBio reads versus including PacBio reads]
Additional ~5000 gene models validated
PAG 2014, Marilena Christopoulou “Targeted transcriptome analysis using PacBio sequencing to dissect multi-gene families encoding NBS-LRR resistance proteins in lettuce”
https://pag.confex.com/pag/xxii/webprogram/Paper10681.html
-
PacBio® Iso-Seq Data Used to Confirm Predicted
Scaffolds in Norway Spruce Genome
PAG 2014: Yao-Cheng Lin “PacBio cDNA sequencing of Norway spruce”
14 SMRT® Cells of PacBio data using early chemistry & protocols
https://pag.confex.com/pag/xxii/webprogram/Paper9725.html
-
Selection of Additional Customer References/Publications
Case Study: A SMRT® Approach for Finishing Plant and Animal Genomes
http://files.pacb.com/pdf/CS_SMRTApproach_FinishingPlantAnimalGenomes.pdf