454 application presentation
The Next Generation of Genomic Research
Genome Sequencer GS-FLX
Patrick Ng, Ph.D.
Scientific Liaison Manager, Asia-Pacific
Roche Diagnostics
May 2007 June 2007
Roche Genomics Solutions
Sample Prep: Roche reagents
• RNA/DNA isolation
• RNA/DNA purification
• DNA amplification
• DNA labeling
Real-time qPCR: LightCycler 2.0; LC480
• HRM for SNP validation
• qPCR for quantitative validation
Quick genome scans: Whole genome tiling microarrays
• aCGH
• ChIP-chip
• Epigenetics
• Gene expression
• Targeted sequence capture (SeqCap)
High-throughput sequencing: Genome Sequencer GS-FLX
• Ultra-Broad sequencing
• Ultra-Deep sequencing
• Whole genome de novo sequencing
• Whole genome re-sequencing
[Figure: GS-FLX instrument, labelled with the computer subsystem, keyboard and mouse (in drawer), optics subsystem, and fluidics subsystem]
Genome Sequencing: 1st Generation
Sanger dideoxy sequencing on an ABI capillary sequencer
• Sample prep
– Bacterial cloning of DNA, colony picking,
culturing, plasmid DNA extraction
– Typical time needed: weeks/ months
• DNA Sequencing
– State of the art capillary sequencer
enables maximum ~1-2 Mb/ 24hr
– Typical sequencing time needed for a
whole genome project: months/ years
• Personnel requirements
– Production sequencing facility required;
manpower needed for sample prep and
sequencing, typically 5-10 full-time staff
Genome Sequencing: Next (2nd) Generation
Massively parallel pyrosequencing (454-sequencing™) on the GS-FLX
• Sample prep
– No bacterial cloning; all cloning is done in vitro
– time needed: ~15 hrs including emulsion PCR
amplification
• DNA Sequencing
– GS-FLX output/run ~100 Mb in 7.5hr
– Daily throughput ~300 Mb/ 24hrs theoretical
– Typical sequencing time needed for a
whole genome project: much shortened
• Personnel requirements
– Manpower needed for sample prep and
sequencing, typically 1-2 full-time staff
Sequencing of Corynebacterium kroppenstedtii strain
From sequencing to manuscript submission = 1 week
Tauch, A. et al. “Ultrafast pyrosequencing of Corynebacterium kroppenstedtii DSM44385 revealed insights into the physiology of a lipophilic corynebacterium that lacks mycolic acids.” Journal of Biotechnology, available online 20 March 2008
Number of GS-FLX runs 1
Number of reads 560,248
Mean read length 196 bp
Total no. of bases 110,018,974
Coverage depth 45.8 x
Number of assembled contigs 6
Assembled bases 2,434,342
Mean G+C content 57.5%
Coding sequences (pred.) 2,119
Average gene length 1,016 bp
Average intergenic region 163 bp
7.5 hours sequencing (1 run), automated annotation overnight, paper written in 3 days.
Sequencing of Corynebacterium kroppenstedtii strain
From sequencing to manuscript submission = 1 week
Sequencing time:
Organism (year) | Method | Sequencing time
C. glutamicum (2000) | Clone-by-clone, Sanger | 2 years
C. jeikeium (2003) | Whole genome shotgun | 3 months
C. urealyticum (2006) | GS-20 shotgun | 4 days
C. kroppenstedtii (2007) | GS-FLX shotgun | 7.5 hrs
(XLR_HD):
Expect >10 C. kroppenstedtii-like genomes to be sequenced in 1 run
(~500 Mb / (2.4 Mb × 20x oversampling) = 10.4)
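The estimate above can be reproduced with a few lines of Python. This is only a sketch of the slide's arithmetic; the function name `genomes_per_run` is illustrative and not part of any Roche software.

```python
def genomes_per_run(run_yield_mb: float, genome_size_mb: float, oversampling: float) -> float:
    """Number of genomes one run can cover at the requested oversampling."""
    return run_yield_mb / (genome_size_mb * oversampling)


# Figures quoted on the slide: ~500 Mb per run, 2.4 Mb genome, 20x oversampling.
estimate = genomes_per_run(500, 2.4, 20)
print(f"{estimate:.1f} genomes per run")
```

Changing the oversampling or the genome size shows how quickly the number of genomes per run falls for larger genomes or deeper coverage.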
Dr. Andreas Tauch, University of Bielefeld: 7.5 hours sequencing (1 run), automated annotation overnight, paper written in 3 days.
Cloning Bias in Conventional ABI Sequencing
(500 kb stretch in Listeria monocytogenes)
[Figure: GS-FLX coverage and ABI coverage plotted against reference position (bp)]
Courtesy of Drs. Nusbaum and Young of the Broad Institute
GS-FLX Sequencing Workflow: Overview
One Fragment = One Bead = One Read; 400,000+ reads per run
• Sample input (starting DNA): Genomic DNA, BACs, amplicons, cDNA
• Generation of small DNA fragments via nebulization
• Ligation of A/B-Adaptors flanking single-stranded DNA fragments
• Emulsification of beads and fragments in water-in-oil microreactors
• Clonal amplification of fragments bound to beads in microreactors
• Sequencing and base calling
GS-FLX Process Steps
1. Shotgun DNA library preparation
Timeline: DNA Library Preparation (4.5 h) and Titration (10.5 h) -> emPCR (8 h) -> Sequencing (7.5 h)
gDNA -> sstDNA library
a. Genomic DNA fragmented by nebulization
b. Adaptors A and Biot-B ligated to fragments
c. Immobilize repaired, adapted DNA to paramagnetic Streptavidin beads
d. Select for only A-fragment-B and B-fragment-A sstDNA molecules in supernatant (not Biot)
e. Functional validation of sstDNA by titration run (do emPCR and GS-FLX sequencing run to determine best number of sstDNA molecules per Capture Bead)
Process Steps
2. emPCR
sstDNA library -> Clonally-amplified sstDNA attached to capture beads
• Anneal sstDNA to an excess of DNA capture beads (choice of average no. of molecules/bead is based on titration results)*. Capture beads are non-paramagnetic.
• Emulsify beads and PCR reagents in water-in-oil microreactors (using a TissueLyser). Most of the microreactors that contain DNA will contain only 1 DNA molecule and 1 bead.
• Clonal amplification occurs inside microreactors. Typically, 40 PCR cycles are performed.
• Break microreactors, and enrich for DNA-positive beads (using magnetic streptavidin beads that bind to the biotinylated emPCR products). Convert to bead-bound sstDNA.
*Titration is required to avoid excessive “empty” or “multi-template” beads in the emPCR
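Why titration matters can be illustrated with a simple loading model. The sketch below assumes that the number of template molecules per bead follows a Poisson distribution; that model is an assumption made here for illustration and is not stated in the slides. The function name `bead_loading` is hypothetical.

```python
import math


def bead_loading(mean_templates_per_bead: float) -> dict:
    """Fractions of empty, single-template and multi-template beads (Poisson model)."""
    lam = mean_templates_per_bead
    empty = math.exp(-lam)            # P(0 templates)
    single = lam * math.exp(-lam)     # P(exactly 1 template)
    multi = 1.0 - empty - single      # P(2 or more templates)
    return {
        "empty": empty,
        "single": single,
        "multi": multi,
        # Share of DNA-positive beads that carry exactly one template.
        "clonal_among_loaded": single / (1.0 - empty),
    }


for lam in (0.1, 0.5, 1.0, 2.0):
    f = bead_loading(lam)
    print(f"mean={lam}: empty={f['empty']:.2f} single={f['single']:.2f} "
          f"multi={f['multi']:.2f} clonal among loaded={f['clonal_among_loaded']:.2f}")
```

Under this model, a low template-to-bead ratio keeps most DNA-positive beads clonal at the cost of many empty beads, which is the trade-off the titration run is meant to balance.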
Process Steps
3. Sequencing
Amplified sstDNA library beads -> Quality reads
• A single, clonally amplified sstDNA bead (after enrichment) is deposited per well.
• Load PicoTiterPlate (PTP) into sequencer, begin run
• PTP well diameter: average of 44 µm
• Capture bead diameter: 27-32 µm
• Enzyme bead diameter: 2.8 µm
• Packing bead diameter: 0.8 µm
• Wells per PTP: 1.6 million
Process Steps: Sequencing
Pyrosequencing details (Sequencing-by-synthesis)
Amplified sstDNA library beads -> Quality reads
DNA capture bead containing ~10-30 million copies of a single clonal fragment (sstDNA templates)
• 4 unlabeled nt’s (TACG) are added sequentially (flowed), 1 nt at a time.
• Cycled 100 times for large PTP run
• Chemiluminescent signal generation (based on Pyrosequencing™)
• Pyrophosphate released upon nt incorporation is converted to ATP, which drives the luciferase reaction and light output
• Light signal captured on CCD camera
• Signal processing to determine base sequence and quality score
[Figure label: adenosine 5´ phosphosulfate]
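The flow-based readout described above can be sketched as an idealised flowgram. This is a toy model under stated assumptions: the input string is taken to be the bases in the order they are incorporated, the signal is exactly proportional to homopolymer length, and there is no noise. The names `flowgram` and `basecall` are illustrative and do not correspond to the instrument software.

```python
FLOW_ORDER = "TACG"  # nucleotide flow order quoted on the slide


def flowgram(incorporated: str, cycles: int = 100) -> list:
    """Return idealised (nucleotide, signal) pairs, one per flow."""
    signals = []
    pos = 0
    for _ in range(cycles):
        for nt in FLOW_ORDER:
            run = 0
            # Every consecutive matching base is incorporated in the same flow,
            # so a homopolymer gives one proportionally larger signal.
            while pos < len(incorporated) and incorporated[pos] == nt:
                run += 1
                pos += 1
            signals.append((nt, run))
        if pos >= len(incorporated):
            break
    return signals


def basecall(signals: list) -> str:
    """Convert an idealised flowgram back into a base sequence."""
    return "".join(nt * n for nt, n in signals)


flows = flowgram("TTACGGGA")
print(flows)
print(basecall(flows))
```

Because a homopolymer is read as a single signal whose height must be converted to a length, errors in this chemistry tend to be homopolymer indels rather than substitutions.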
Software
Image acquisition -> Image processing -> Signal processing (FASTA basecalls + Quality scores) -> Applications software
[Figures: metric and image viewing software; signal output from a single well (flowgram)]
On current GS-FLX system, raw image -> FASTA basecalls = 8-9 hours
GS FLX Sequencing: Bioinformatics
Image capture -> Image processing -> Signal processing -> Applications:
• Reference Mapper (assembly using ref. seq.)
• De novo Assembler (assembly from scratch)
• Amplicon Variant Analyzer
Features:
• Small dataset size = greater convenience. 13.2 GB including raw images, allows future re-analysis if desired (e.g. post software upgrade)
• Useful software (currently 3 applications) available out-of-the-box
• Long (250bp) and accurate (99.5% single-read) reads = No filtering of reads against known ref needed
• Software recognizes molecular barcodes (MIDs) for greater multiplexing (and economy); also able to assemble using various read types: 454-shotgun, Sanger and paired-end reads (singly/ in combo) for best results
The Genome Sequencer FLX System: Technical Specifications
Current system
• ≥ 400,000 sequence reads per run
• 200 - 300 bases per read
• 1 run = 7.5 hours = ~100 Mb
• 2-3 runs per 24hr day
• Theoretical 1 Gb in 3-4 days
• Accuracy: ~99.5% over 200 bases
• Image & signal processing: 8-9 hrs post-sequencing
After “Titanium” kit upgrade, ~Q3, 2008
• ≥ 1,000,000 sequence reads per run
• 400 - 500 bases per read
• 1 run = 10 hours = ~500 Mb
• 2 runs per 24hr day
• Theoretical 1 Gb in 1 day
• Accuracy: ~99.0% over 400 bases
• Image & signal processing: 12-20 hrs post-sequencing (upgraded Unix cluster)
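The per-run yields and the "1 Gb" figures follow from reads × read length and runs per day. A minimal sketch of that arithmetic, using the mid-range and upper read lengths from the specifications; the helper names are illustrative:

```python
def run_yield_mb(reads_per_run: int, bases_per_read: int) -> float:
    """Raw yield of one run in megabases."""
    return reads_per_run * bases_per_read / 1e6


def days_to_target(target_mb: float, mb_per_run: float, runs_per_day: float) -> float:
    """Days of continuous sequencing needed to reach a target yield."""
    return target_mb / (mb_per_run * runs_per_day)


current = run_yield_mb(400_000, 250)      # ~100 Mb per run
titanium = run_yield_mb(1_000_000, 500)   # ~500 Mb per run (upper read length)

print(f"Current: {current:.0f} Mb/run, 1 Gb in {days_to_target(1000, current, 3):.1f} days at 3 runs/day")
print(f"Titanium: {titanium:.0f} Mb/run, 1 Gb in {days_to_target(1000, titanium, 2):.1f} days at 2 runs/day")
```

Note that the "3-4 days" figure for the current system corresponds to the upper end of 2-3 runs per day; at 2 runs per day the same calculation gives 5 days.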
[Figure: current image vs. improved image]
New, metallized PTP. No change in sequencer.
For an out-of-the-box solution we have
qualified a provider who will supply an
integrated system. This purchase is made
through Roche and is delivered as a one
box solution.
Support concept in place, deck available
Server specifications are available for a
do-it-yourself option.
Data Analysis Server for XLR HD
GS-FLX or not?
There are other Next-Generation Sequencers available that seem to be cheaper; they promise to do many things. Why don’t I buy those instead?
• Keep in mind the following: other NGS platforms give many short, lower-quality reads. These are suitable for only specific applications (mapping of tags to a reference, and counting them). Their short reads result in poor mapping specificity and short contigs (more gaps). Few/no publications.
You need a GS-FLX if:
•You require large scale, very high throughput DNA sequencing
•You intend to study any of the following: de novo sequencing & assembly (whole/partial genome);
whole genome resequencing; targeted resequencing/ amplicon resequencing; metagenomics;
transcriptomics.
You do not need a GS-FLX if:
•You only intend to sequence small numbers of samples each time (tens of thousands of bases)
•Your applications only involve simple clone sequence verifications.
Side-by-Side Comparison
Sanger dideoxy sequencing vs 454-sequencing™
Feature | Sanger dideoxy | GS-FLX
Read-length (bases) | 750-1,000 | ~200
Throughput (bases per 24 hrs) | 1-2 million | ~300 million (3 runs)
Raw cost-per-base (US cents) | ~0.2¢ | ~0.008¢
Accuracy (Single-read/ 20x consensus) | 99.3-99.6% / >99.99% | ~99.5% / >99.995%
Manpower (bact. genome project) | 5-10 FTE | 1-2 FTE
Sample prep time (bact. genome project) | Months | Hours
Sequencing time (bact. genome project) | Months to years | Hours to days
Sample characteristics | In vivo clones: biased | In vitro: unbiased
Sample characteristics | Direct PCR: averaged signals | Amplicon sequencing: each molecule individually sequenced
Sample characteristics | Errors: homopolymer slippage; GC-rich hardstops | Errors: homopolymer indels
AT- or GC-rich genomes not a problem for GS-FLX
Depending on the organism,
read lengths are in the range
of 200 – 300 high quality
bases.
Genomes that are more AT- or GC-rich typically yield a longer read length distribution as compared to an AT/GC neutral genome
Read length
Long reads matter
Short reads do not provide genome mapping uniqueness
Modified from Figure 2. Uniqueness as a function of read length; human genomic DNA.
25 to 35-bp reads: ~ 80 - 87% uniqueness
100-bp reads: > 90% uniqueness
(Whiteford, N. et al. Nucl. Acids Res. 2005)
Whole human genome
Human chr 1 only
High genome-mapping uniqueness is
important for genome annotation and
transcriptome profiling experiments
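The idea of "uniqueness as a function of read length" can be illustrated on a toy sequence. The sketch below counts how many read start positions give a read that occurs exactly once in the sequence. It is a simplification: only the forward strand is considered, reads must match exactly, and the toy sequence is invented for illustration; it is not the method of the cited study.

```python
from collections import Counter


def unique_fraction(genome: str, read_len: int) -> float:
    """Fraction of read start positions whose read occurs exactly once in the genome."""
    reads = [genome[i:i + read_len] for i in range(len(genome) - read_len + 1)]
    counts = Counter(reads)
    return sum(1 for r in reads if counts[r] == 1) / len(reads)


toy_genome = "ACGTACGA"  # hypothetical sequence containing a short repeat (ACG)
for k in (2, 3, 4):
    print(k, round(unique_fraction(toy_genome, k), 2))
```

Once the read is longer than the repeated element, every read in this toy sequence becomes unique, which is the same trend the figure shows at genome scale.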
Transcriptome profiling of Plants
Short (singleton) reads do not provide sufficient transcriptome mapping uniqueness
Read length (bases) | 25 | 36 | 72 | 250
Arabidopsis thaliana | 49% | 55% | 65% | 83%
Brassica rapus | 66% | 71% | 80% | 91%
Glycine max | 51% | 56% | 64% | 82%
Lotus japonicus | 75% | 79% | 86% | 95%
Medicago truncatula | 67% | 69% | 74% | 86%
Oryza sativa | 48% | 53% | 61% | 79%
Populus trichocarpa | 71% | 75% | 81% | 91%
Solanum tuberosum | 51% | 56% | 65% | 82%
Zea mays | 46% | 53% | 62% | 78%
Table shows uniqueness of transcriptomic singletons. Data are extracted from a 2008 study modeling all possible reads in both directions of specified lengths across 20 different plant species. Reference transcript assemblies were from http://plantta.tigr.org/
Transcriptomics
(Whiteford, N. et al. Nucl. Acids Res. 2005)
Long reads matter
Short reads result in short contigs and poor coverage
[Figure: percentage of chr 1 covered by contigs of 10,000 bp and larger, as a function of read length]
Modified from Figure 2. Genome coverage (at specific contig sizes), as a function of read length
25 to 35-bp reads: Only 7% can form contigs of 10,000bp and larger (gaps!!!)
100-bp reads: 90% can form contigs of 10,000bp and larger
Note: Average gene size in humans ~10-15 kb
Read length is important in genome assembly
Why long reads are needed for genome assembly
1. Genomes (especially complex ones) contain large numbers of repeats
- Repeats can be a few bases, to thousands of bases long
2. It is very difficult to completely assemble a genome, if the read length is shorter
than the repeats (see next slide)
- The reads need to bridge the repetitive DNA regions
3. Paired-end sequences can help assembly, but the end-tags still need to map specifically to the contigs
- So the end-tags themselves also need to be longer
4. In summary, short reads = short contigs, more gaps and poor assemblies
5. GS-FLX gives longer reads, fewer gaps, good genome assemblies
* NOTE: The same situation exists even for resequencing: the mapping uniqueness of short
singleton reads to a reference genome is greatly improved by increased read length. Also, the structures of splice variants are much clearer when long reads are used.
Long reads matter for de novo assembly
Short reads cannot bridge repetitive regions; a gap remains
Unique DNA Sequence
Unique DNA Sequence
CGTAGGCTAGATGCATGCAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGATATAGCGATCTCGACATGCT
Repetitive DNA Sequence
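[Figure: a GS-FLX long read spans the repetitive region into unique sequence on both sides; short reads that fall inside the repeat cannot be placed (?)]
If the read does not span the repeats, no amount of increased sequencing coverage (depth) will allow either de novo genome assembly or high-quality resequencing (there will be gaps).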
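The bridging argument can be made concrete with a toy sequence modelled loosely on the one in the figure: unique flanks around a dinucleotide repeat. The sketch asks whether any read of a given length covers the whole repeat plus a few unique bases on each side. The flank sequences, the repeat length and the 5-base anchor requirement are assumptions chosen for illustration.

```python
def spans_repeat(read_start: int, read_len: int,
                 repeat_start: int, repeat_end: int, anchor: int = 5) -> bool:
    """True if the read covers the whole repeat plus `anchor` unique bases on each side."""
    read_end = read_start + read_len  # exclusive end coordinate
    return read_start <= repeat_start - anchor and read_end >= repeat_end + anchor


# Hypothetical layout: unique left flank, (AG)x17 repeat, unique right flank.
left, repeat, right = "CGTAGGCTAGATGCATGC", "AG" * 17, "ATATAGCGATCTCGACATGCT"
genome = left + repeat + right
repeat_start, repeat_end = len(left), len(left) + len(repeat)


def any_read_bridges(read_len: int) -> bool:
    """Check every possible read start for a read that bridges the repeat."""
    return any(spans_repeat(s, read_len, repeat_start, repeat_end)
               for s in range(len(genome) - read_len + 1))


for read_len in (25, 36, 44, 60):
    print(read_len, any_read_bridges(read_len))
```

Sequencing deeper only adds more start positions that have already been tried; if no start position yields a bridging read, extra coverage cannot close the gap.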
GS-FLX sequencing accuracy
• GS-FLX Single-Read accuracy > 99.5% (includes all homopolymer errors)
– Sanger Single-Read Accuracy = 99.3% to 99.6%
• GS-FLX (22x) Consensus-Read Accuracy > 99.995%
[Figure: cumulative read error (0.0%-4.0%) vs. base position (0-250). Series: E. coli runs #1-#4, T. thermophilus and C. jejuni, grouped as: reported in Nature, 2005; GS20, Q2 2006; currently (GS-FLX)]
The very high GS-FLX single-read accuracy avoids the need for “quality-filtering” against a reference sequence (used by other sequencing platforms)
Cumulative Read Error by Length
Comparing short-read system to Roche GS-FLX
• At 36 bases, short-read system gives ~2 bases wrong (5% error)
• Up to 200 bases, GS-FLX gives < 1 base wrong (0.36%)
*Short-read sequencer data (Competitor I)
Based on data downloaded from Sanger Institute website
System Performance: Homopolymers
Individual Reads versus Consensus Reads in E. coli
1. Single-read accuracy here in parsing homopolymers > 90% up to n = 5
2. Consensus accuracy provided by 22x oversampling is ~100% even at n = 9
[Figure: accuracy (0-100%) vs. homopolymer length (1-9) for consensus and single reads; Genome Sequencer FLX data]
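Why oversampling rescues homopolymer calls can be seen with a simple binomial model. The sketch assumes that reads err independently and that the consensus is a plain majority vote; both are assumptions for illustration, since the slides do not describe how the actual consensus caller works.

```python
from math import comb


def majority_accuracy(single_read_accuracy: float, depth: int) -> float:
    """Probability that more than half of `depth` independent reads are correct."""
    p = single_read_accuracy
    need = depth // 2 + 1  # strict majority
    return sum(comb(depth, k) * p**k * (1 - p)**(depth - k)
               for k in range(need, depth + 1))


for depth in (1, 5, 11, 22):
    print(depth, f"{majority_accuracy(0.90, depth):.6f}")
```

Even with a single-read accuracy of 90%, the modelled consensus at 22x is correct in well over 99.99% of cases, consistent in spirit with the ~100% consensus accuracy shown on the slide.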
Distribution of homopolymers in the human genome
[Figure: homopolymer frequency, as percent of genome (log scale, 0.00001% to 100%), by nucleotide (A, G, C, T) for 4-mer, 6-mer, 8-mer, 10-mer, 12-mer, 14-mer and 16-mer runs]
The incidence of homopolymeric stretches long enough to cause problems for
consensus GS-FLX sequencing (> 8-mer) in the human genome is low, < 0.05%
What can the GS-FLX do for you?
• May 2008: > 150 publications related to 454-sequencing
• You can request updated bibliographies (or access from www.454.com)
UDS: Ultra Deep amplicon resequencing of specific regions
• de novo sequencing and assembly
(with or without paired-end data)
• Metagenomics
(whole genome)
• Transcriptomics
• Small ncRNA identification
• Novel pathogen discovery
(Medical metagenomics)
• Ancient DNA studies
• Epigenetics and Gene
Regulation
- DNA methylation
• Medical resequencing
(targeted)
- Oncogenomics
• Microbial diversity
(16S rRNA Metagenomics)
• Virus quasispecies studies
GS-FLX applications
WGS: Whole Genome Sequencing
UBS: Ultra Broad Sequencing
• Whole genome re-sequencing and
assembly to identify variations
(comparative genomics);
• Whole genome Paired-End Mapping to
identify structural variations
Amplicon sequencing (Ultradeep)
• Medical resequencing for variant detection
• Virus genotyping and drug-resistance
• DNA methylation characterization
How does Amplicon Sequencing detect rare variants?
Ultradeep sequencing
Mutation present at low frequency in heterogeneous sample
PCR to obtain target locus
GS-FLX sequencing: every species is sequenced individually, allowing every mutation to be quantified
Direct PCR sequencing (Sanger): result is averaged across all species present, and the mutation may be undetectable
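How deep "ultradeep" needs to be can be estimated with a basic sampling model. The sketch assumes each read is an independent draw from the mixed population and ignores sequencing error and PCR bias, so it gives a lower bound on the depth needed rather than a validated design rule. Function names are illustrative.

```python
import math


def p_at_least_one(variant_freq: float, depth: int) -> float:
    """Probability that at least one of `depth` reads carries the variant."""
    return 1.0 - (1.0 - variant_freq) ** depth


def depth_for_confidence(variant_freq: float, confidence: float) -> int:
    """Smallest depth at which the variant is seen at least once with the given confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - variant_freq))


for freq in (0.10, 0.03, 0.01):
    n = depth_for_confidence(freq, 0.99)
    print(f"variant at {freq:.0%}: {n} reads for 99% chance of seeing it at least once")
```

In practice several supporting reads are required before calling a variant, so real experiments use considerably more reads per amplicon than this minimum.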
Targeted resequencing (Medical resequencing)
Ultradeep amplicon resequencing to detect rare variations
Ultradeep sequencing
Disease Associated Region > 1000s bp
Long Range PCR Amplicons (A, B, C, D): 3-15 kb each
• Isolate genomic DNA from each sample
• Generate long-range PCR amplicons across desired target region
• Pool amplicons (in equimolar amounts)
• Shotgun sequence on GS-FLX
Analysis is by Ref Mapper (AVA is for direct amplicon analysis)
Resequencing example: Long range PCR on EGFR
No gaps, analysis of the whole gene possible
• Shotgun library created from 10 overlapping, long range PCR amplicons
• Mapped into 1 single contig of length 70,107 nt (amplified region: 70.1 kb)
• Sample H441: 80 high-confidence variations detected (62 SNPs known from db)
• Sample H1975: 95 high-confidence variations detected (73 SNPs known from db)
Short InDels in H441
Detectable; no quality filtering based on comparison with reference sequence needed
Variant type | # of Variants | Sequence Change
Single Base Substitution | 12 | Various
2 Base Deletion | 1 | AT/--
2 Base Insertion | 1 | --/CA
3 Base Deletion | 1 | GCT/---
4 Base Deletion | 2 | ACAC/----, TGTG/----
4 Base Insertion | 2 | ----/CACA, -----/TTCA
All new variants have been confirmed by Sanger sequencing;
detecting variations > 2 nt not possible using competing platform (Manufacturer I)
Whole Genome Re-Sequencing
Benefits of using GS-FLX
• 250bp reads effectively span many sequences that micro-reads interpret as repeats, and minimize numbers of gaps.
• Bias-free sequencing (coverage does not vary due to GC/AT content).
• Significantly higher genome coverage and much more efficient identification of more variations.
• InDels can be discovered efficiently/easily (no filtering against reference sequence needed).
Ultradeep sequencing
• Lung cancer patient had strong initial response to Erlotinib (TKI) before relapsing 12.5 mo later with pleural effusions.
• Histological specimen post-relapse had only 1-10% tumor content. Sanger sequencing showed wt EGFR only.
• 454-sequencing (amplicon sequencing of exons 18-22) showed 2 mutations: (i) 18-bp deletion in exon 19 (Del4) (3% of 11,367 reads), and (ii) C->T substitution resulting in T790M mutation (2% of 136,776 reads).
• Validation: cells overexpressing EGFR with Del4 were sensitive to Erlotinib, but combined Del4/T790M mutation rendered cells resistant.
Targeted resequencing: Ultradeep sequencing for retrospective analysis of relapse in NSCLC patient
R.K. Thomas et al., Nature Medicine (July 2006) 12: 852-855
• 3 mutations are frequently observed in multidrug-resistant HIV viruses: M46I/L, V82A/F/S/T, and L90M.
• The combination confers resistance to all protease inhibitors currently in use.
• If all three were found on a single major species before treatment, protease inhibitors would have minimal effect.
• However, if they were on distinct species, they could be suppressed by different inhibitors.
• A single GS-FLX read (250bp) covers all 3 mutations (135bp apart)
• Provides valuable information on drug resistance and productive treatment options
• Short reads (25-50bp) cannot distinguish between the various subtypes
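The 135 bp figure follows directly from the codon positions of the three protease mutations (codons 46, 82 and 90). A short sketch of the check; the helper names are illustrative:

```python
CODON_LEN = 3  # nucleotides per codon


def nucleotide_span(first_codon: int, last_codon: int) -> int:
    """Nucleotides needed to cover both codons and everything in between."""
    return (last_codon - first_codon + 1) * CODON_LEN


def read_can_link(read_len: int, first_codon: int, last_codon: int) -> bool:
    """True if a single read of this length can cover all codons in the range."""
    return read_len >= nucleotide_span(first_codon, last_codon)


print(nucleotide_span(46, 90))     # span from M46 to L90
print(read_can_link(250, 46, 90))  # GS-FLX read
print(read_can_link(50, 46, 90))   # short read
```

Only a read that covers the full span can show whether the three mutations sit on the same viral genome, which is what haplotyping requires.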
For virus genotyping, long read lengths are of paramount importance in HAPLOTYPING
Ultradeep sequencing/ haplotype analysis
See also: Wensing et al. (2005) JID 192:958; Shafer et al. (2006) JID 194 (Suppl 1):551; Bonaventura et al. “New developments in HIV drug resistance and options for treatment-experienced patients” etc.
Individual sequencing of each template allows identification and quantification of distinct virus subspecies within a mixed population, including “haplotypes”
Subspecies identification in HIV-1
Sequencing of 207 bp amplicon from virus protease gene
[Figure: mutation frequency and coverage along the amplicon]
GACATGAATTTG
|| |||| ||||
GAAATGAGTTTG
34% of reads
GAAATG-GCTTTGCC
|||||| | ||||||
GAAATGAG-TTTGCC
39% of reads
GAAATGCAGTT-GCCAGG
|||||| |||| ||||||
GAAATG-AGTTTGCCAGG
21% of reads
In collaboration with Dr. M. Kozal, Yale VA Hospital; see also Wang et al. (2007) Genome Res, for a good look at coverage needed in HIV-1 variant analysis
Sanger, direct PCR: GAAATGG NTTTGCC (unresolvable region)
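The contrast between per-read haplotypes and an averaged direct-PCR signal can be sketched as follows. The read counts are invented for illustration (they are not the data from the figure), reads are assumed to be already aligned and of equal length, and the 20% threshold for calling a position ambiguous is an arbitrary assumption.

```python
from collections import Counter


def haplotype_frequencies(reads: list) -> dict:
    """Frequency of each distinct read sequence (one read = one template molecule)."""
    counts = Counter(reads)
    total = sum(counts.values())
    return {hap: n / total for hap, n in counts.most_common()}


def direct_pcr_consensus(reads: list, min_minor: float = 0.2) -> str:
    """Averaged signal: positions with a substantial minor population become 'N'."""
    out = []
    for column in zip(*reads):
        base, top = Counter(column).most_common(1)[0]
        minor_fraction = 1 - top / len(column)
        out.append("N" if minor_fraction >= min_minor else base)
    return "".join(out)


# Hypothetical mixed population of three subspecies.
reads = ["GAAATGAGTTTG"] * 6 + ["GACATGAATTTG"] * 3 + ["GAAATGAATTTG"] * 1

print(haplotype_frequencies(reads))
print(direct_pcr_consensus(reads))
```

The per-read view keeps the linkage between the two variable positions, while the averaged consensus only shows that both positions are mixed.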
Transcriptomics
• Based on existing (some partial) reference genomes: microbes,
drosophila, human cell lines, maize, arabidopsis, medicago,
salmon, sheep.
• de novo transcriptome characterization: Paper wasp (ref
honeybee), Glanville Fritillary butterfly (ref Bombyx mori)
Transcriptomics
Transcriptome characterization: Overview
• cDNA synthesis from RNA, then fragmentation and 454-sequencing
- options: oligo-dT, random priming, RNA fragmentation, cDNA normalization
• Map the ESTs to reference genome or other suitable database
o Compare gene expression levels across samples by EST counting
o Perform genome annotation to improve existing gene models
Transcriptome characterization: Advantages of using GS-FLX
• Long, accurate reads allow excellent mapping of ESTs to genome (Table)
- Torres et al., Genome Research (2008) Jan;18(1):172-7 (GS-20, Drosophila)
• Essentially unbiased representation regardless of transcript length (Figure) or expression level
- Weber et al., Plant Physiology (2007) May; 144:32-42 (GS-20, Arabidopsis)
Here we see 154,379 GS-20 ESTs corresponding to 1,053 transcripts (flcDNAs) of 1,000-2,000nt in size (medium length).
(position relative to 5’ end of cDNA)
5’ 3’
• ESTs cover every part
of every cDNA.
Medium length cDNAs
are shown here, but distribution pattern
similar for Short and
Long cDNAs also
• Some favoring of 5’
and 3’ ends, possibly
indicating incomplete
nebulization
• Total number of (+)
direction ESTs (55-
60%) is slightly > (-) direction ESTs. Why?
GS-FLX sequencing of the paper wasp transcriptome
• Conventional cDNA library preparation from wasp was sequenced using 454-Sequencing.
• 391,157 brain cDNA reads generated
• 3,017 genes hit in honey bee genome
• No wasp genome available
• 32 behavioral gene orthologs further characterized to demonstrate the link between maternal behavior and the development of social behavior
• Study also demonstrated the ability to use a known, related genome (bee in this case) as a hub to successfully generate assemblies
de novo whole transcriptome characterization using 454-sequencing (GS-20)
•No reference genome available; this is 1st report of a de novo
transcriptome assembly using NGS data.
•2 cDNA libraries, normalized (Evrogen); 2 GS20 runs done;
SeqMan Pro assembly of ESTs.
• 518,079 high-quality ESTs (88% of raw) obtained, assembled
into 48,354 contigs + 59,943 singletons, thus 108,297 unigenes.
• Microarray made using assembled transcripts: high reproducibility.
• Issues: Cf B.mori database, inferred that ideally, 4x more sequencing
would be needed for complete flcDNA coverage. Also, assembly of
splice variants was “complex”, and needed human annotation. We can
expect that with GS-FLX and XLR-HD, results will be vastly improved.
• Interestingly, 618 reads were non-metazoan (mainly cryptosporidiae),
hence 454-seq can possibly be used for xenobiont detection
Transcriptomics
Molecular Ecology (2008)
Glanville fritillary butterfly (Melitaea cinxia).
Metagenomics (& Microbial Diversity)
Metagenomics on the GS-FLX
Benefits of long, accurate reads
• Metagenomics is the shotgun sequencing of mixed DNA isolated from environmental samples. For assessing bacterial diversity, the focus is usually on 16S rRNA
• Read length of >200 bases allows:
– accurate assessment of diversity (low rate of ambiguous mapping results)
– unambiguously identify an organism or gene in an unknown complex environmental sample.
See: R. Edwards et al. BMC Genomics, 7:57 (2006)
Environment (e.g. deep sea)
Tens of thousands of
different species
Isolation of environmental DNA and shotgun sequencing
FLX long reads: reads map uniquely
Microreads: reads map everywhere; you can't use them
(Paired end reads might help, but even then, higher specificity with longer end-tags)
Metagenomics on the GS-FLX
Why long reads are needed
Elucidation of symbiotic interdependence between insect host & 2 internal bacteria
Ultrabroad sequencing- Metagenomics
• Tripartite symbiotic relationship between insect host
(H. coagulata, Glassy-Winged Sharpshooter) that
feeds on xylem sap, and 2 internal bacteria:
Baumannia sp. (provides vitamins and cofactors)
and Sulcia sp. (provides essential aa.’s).
• 23 Newbler contigs were assembled into a
complete circular Sulcia genome
• Illumina-Solexa 1G was used to resolve 155
homopolymeric uncertainties (Sulcia genome is
245,000 bp; homopolymer errors = 0.06%).
• As with combined Sanger/454-sequencing, perhaps dual-platform experiments may be the most sensible approach for highest-quality assemblies, for genome centers that can afford them.
Novel pathogen discovery: Medical Metagenomics
Palacios, G, et al.; N Engl J Med 2008:358
The same group that did the Honey Bee Colony Collapse Disorder metagenomics work
Novel pathogen discovery/ Metagenomics
• 3 women in Australia received various transplanted organs (liver, kidney) from same male donor;
donor had died from cerebral hemorrhage.
• 4-6 wks post-op, recipients all died from “febrile illness with varying degrees of encephalopathy”.
• Tested negative: bacterial/ viral cultures; PCR for various viruses; microarray analysis using
panmicrobial and viral arrays
• Methods: GS-FLX sequencing performed on RNA extracted (various source tissues) from 2 deceased
patients; RNA was DNaseI-treated, then RT-PCR using random primers. No further nebulization done.
• Results: Sequences filtered bioinformatically to remove repetitive DNA; human (host) DNA subtracted;
non-human sequences were clustered with Cd-hit, then CAP3 assembled. BLASTX and BLASTN
against Genbank performed. Of 103,632 sequences (mean size: 162bp), 14 fragments had homology to arenavirus, closest relationship to LCMV. Validation of novel LCMV done by RT-PCR:
22 of 30 samples (from 3 patients) positive. Other validations done, on Vero E6 cells, and patient
samples.
• Conclusions: GS-FLX sequencing has been used to identify a novel pathogen present against a
massive background of known host gDNA; “Medical Metagenomics”?
Novel pathogen discovery: Medical Metagenomics?
Palacios, G, et al.; N Engl J Med 2008:358
14 out of ~100K reads hit LCMV
Maximum contiguous match to known LCMV: only 14 bp
Small noncoding RNA (sncRNA) characterization
Ultrabroad seq- MicroRNA
Small non-coding RNA (sncRNA) analysis
• Transcripts expressed from the genome can be
protein-coding, or non-coding
• Non-coding RNAs have important regulatory
functions; small ncRNAs include:
– Small interfering RNAs (siRNAs) ~21-25nt
• NAT-siRNAs (antisense; 21-24nt)
– microRNAs (miRNAs) ~22nt
Recently, some larger sncRNAs identified:
•Piwi-interacting RNAs (piRNAs) ~29-31nt
•Small-scan RNAs (scnRNAs) 27-31nt
•Repeat assd siRNAs (rasiRNAs) 24-29nt
•Long siRNAs (lsiRNAs) 30-40nt
–small nucleolar RNAs (snoRNAs) ~60-300nt
–“short RNAs” (sRNAs) <200nt (mean 35nt)
•Promoter-associated PASRs
•2’-Termini associated TASRs
–“long RNAs” (lRNAs) >200nt (mean 100nt)
• GS-FLX’s long read length enables the
complete sequencing of almost all small
RNA classes, in just one read
• Coupled with the high depth of coverage,
GS-FLX is well-placed to not just
characterize existing sncRNAs, but also
identify and quantify those that may be
very rare, or novel, with a high degree of
confidence.
Whole genome sequencing
• de novo sequencing and assembly
- bacteria, fungi (including mushroom), viruses, barley (4 BACs as
proof of principle), pinot noir grape (Sanger plus 454)
• Genome resequencing
- bacteria, viruses, Pea, monkey, human
• Paired-end reads and their utility in de novo assembly
& structural variation detection
de novo Sequencing and Assembly: Introduction
A. Random (shotgun) fragments
B. Clustering (overlapping) of fragments into contigs (CONTIG 1, CONTIG 2, CONTIG 3)
C. Consensus merging (contigs are not ordered)
D. Ordering and orienting contigs, with paired end data (1, 2, 3)
E. Scaffold (supercontig) formation: contigs that are oriented w.r.t. their immediate neighbours are gradually ordered to form a scaffold.
Scaffolds can still contain gaps.
Final step is “gap-filling” or “finishing”.
De novo sequencing and assembly
Usefulness of Paired-End sequences (1)
Comparative Genomics
1. Paired-end sequences are very useful for de novo assembly, and also for detecting variations.
2. Paired-end sequences can span repeats, to better orientate shotgun contigs.
3. The choice of what length paired-end span to use depends on the specific genome being studied.
[Figure: biotin-labelled (Biot) fragments are sheared and selected to give 454 paired-end reads, which link Contig 1 and Contig 2 into Scaffold 1 (scaffold generation)]
de novo genome assembly of bacteria
GS-20 shotgun reads, assisted by Paired-End reads (2kb span)
Metric | E. coli | B. licheniformis | S. cerevisiae
Genome size (Mb) | 4.6 | 4.2 | 12.2
Number of runs | 3 | 3 | 9
Fold oversampling | 22 | 27 | 23
Assembly contigs | 140 | 98 | 821
PE library runs | 1 | 1 | 2
Number of paired reads | 112000 | 255000 | 395000
Supercontigs | 24 | 9 | 153
Genome coverage | 98.60% | 99.20% | 93.20%
Paired-End; de novo assembly
Roche in-house data
Roche-GIS collaboration: Results of de novo assembly
B. pseudomallei #22
Expected genome size (Mb) | 7
Number of chromosomes | 2
Number of runs (GS-20) | 6
Fold oversampling | 22
Assembly contigs (1221 - 79098 bp) | 940
PE library runs | 1
Supercontigs using 2kb PE library | 50
Supercontigs using 5kb PE library | 11
Supercontigs using 10kb PE library | 4
Genome coverage (w.r.t. Sanger ref) | 93.04%
Choice of Paired-End library target size depends on the particular genome
Improved Paired-End Protocol (16-20kb span)
Protocol to be released with Titanium upgrade
[Figure: histogram of pair distance (bp, 10,000-30,000) vs. count (0-2,000), peaking at 16 kb]
GS-FLX read type | Genome coverage | Number of contigs/scaffolds
Shotgun | 15× | 98
+ 3 kb span | 18× | 7
+ 16-20 kb span (New!) | 20× | 1
Assembly of 4.6Mb E. coli genome into 1 scaffold (consensus accuracy ~99.999%)
Paired-End; de novo assembly
Usefulness of Paired-End sequencing (2)
Paired-End; Comparative Genomics
1. Paired-end sequences are very useful for de novo assembly, and for detecting variations.
2. Paired-end sequences can span repeats, to better orientate shotgun contigs.
3. The choice of paired-end span length depends on the specific genome being studied.
[Figure: Biot-labeled paired-end reads from the Sample mapped onto the Reference]
Insertion: Mapped Span (sample to ref) < [mean - 3SD]
Deletion: Mapped Span > [mean + 3SD]
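The mean ± 3SD rule above can be sketched in a few lines of Python. This is only an illustration of the thresholding idea, not the pipeline used in any of the studies cited here; the function name `classify_pairs` and all span values are hypothetical.

```python
from statistics import mean, stdev

def classify_pairs(mapped_spans, library_spans):
    """Label each mapped span using the mean +/- 3SD rule from the slide.

    library_spans: spans observed for normally mapping pairs (hypothetical data),
                   used to estimate the mean and SD of the library.
    mapped_spans:  spans (on the reference) of the pairs to classify.
    """
    mu = mean(library_spans)
    sd = stdev(library_spans)
    low, high = mu - 3 * sd, mu + 3 * sd
    calls = []
    for span in mapped_spans:
        if span < low:
            calls.append("insertion")   # pair maps closer together on the reference
        elif span > high:
            calls.append("deletion")    # pair maps further apart on the reference
        else:
            calls.append("concordant")
    return calls

# Hypothetical 3 kb library and three hypothetical pairs
library = [2900, 3000, 3100, 3050, 2950, 3000]
calls = classify_pairs([3010, 1500, 6200], library)
print(calls)
```

In practice several independent pairs supporting the same event would normally be required before calling a variant; that clustering step is left out of this sketch.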
Roche-Yale publishes study in Science
Human structural variations identified
J. Korbel et al. Science, 13 September 2007
• Same principle as used in GIS, but here the Roche-Yale group used only 3kb PE spans for higher resolution, and ~100bp tags (for best human genome mapping specificity)
• For SV identification, Yale sequenced between 10M – 21M of 3kb-PE reads (thus 10x- 21x coverage)
– GIS used 323,632 of 10kb-PE reads (thus ~ 450x coverage of Bp genome)
• Between NA15510 (putative European female) and NA18505 (Yoruban female),
– A total of 1,297 SVs were identified (1,175 indels, 122 inversions)
• PCR validation on 40 randomly-selected SVs: 97% validation success
• Note: no actual genome-genome alignment was done in this Roche-Yale study
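The coverage figures quoted above appear to be physical (span) coverage: number of paired-end reads × span length / genome size. A quick check under that assumption is sketched below. The 3.0 Gb human genome size is an assumption; the 7 Mb B. pseudomallei size is the expected genome size given earlier in this deck. The function name is hypothetical.

```python
def span_coverage(n_pairs, span_bp, genome_bp):
    """Physical coverage: total bases spanned by the pairs / genome size."""
    return n_pairs * span_bp / genome_bp

HUMAN_GENOME_BP = 3.0e9   # assumption: approximate haploid human genome size
BP_GENOME_BP = 7.0e6      # expected B. pseudomallei genome size (from the deck)

yale_low = span_coverage(10e6, 3_000, HUMAN_GENOME_BP)    # 10M reads, 3 kb span
yale_high = span_coverage(21e6, 3_000, HUMAN_GENOME_BP)   # 21M reads, 3 kb span
gis = span_coverage(323_632, 10_000, BP_GENOME_BP)        # 10 kb span

print(f"Yale: {yale_low:.0f}x to {yale_high:.0f}x, GIS: {gis:.0f}x")
```

Under these assumptions the Yale numbers come out at 10x to 21x, and the GIS number at roughly 460x, in line with the "~450x" quoted on the slide.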
Comparative Genomics
Resequencing
Whole Human Genome Resequencing
The first Next Gen Sequencer-based resequencing of the human genome
• Blood sample provided by Dr. James Watson in 2005
• 454 Life Sciences/Baylor Human Genome Sequencing Center (HGSC) collaboration
• Browser at http://jimwatsonsequence.cshl.edu/
• Completed May 2007
• On a GS-FLX: 2 months, US$ 1M, 78.5 Million reads, 19.7 Billion bases (6.5x oversampling).
• Cf. HGP: 10-15 years, US$ 4B.
• Only 3% of the total reads could not be mapped to the reference human genome (UCSC and Celera assembly). Of the 3%, ~1.1% were completely unknown; the remainder were unmapped because of DNA repeats.
• Identified 177,181 InDels ranging from 3 to >7,000bp.
• 1.8 Million known (in dbSNP) SNPs observed; 200,000 novel SNPs identified
David Wheeler et al., Nature 452:872 (17 April 2008)
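The run statistics above can be cross-checked with simple arithmetic. The sketch below assumes a haploid human genome of about 3.0 Gb (an assumption, not stated on the slide); read and base counts are the ones quoted above.

```python
TOTAL_READS = 78.5e6       # from the slide
TOTAL_BASES = 19.7e9       # from the slide
HUMAN_GENOME_BP = 3.0e9    # assumption: approximate haploid genome size

# Average read length implied by the totals
mean_read_length = TOTAL_BASES / TOTAL_READS

# Fold oversampling = total sequenced bases / genome size
fold_oversampling = TOTAL_BASES / HUMAN_GENOME_BP

print(f"mean read length ~{mean_read_length:.0f} bp")
print(f"oversampling ~{fold_oversampling:.1f}x")
```

With these inputs the implied mean read length is about 251 bp and the oversampling about 6.6x, close to the 6.5x quoted; the small difference depends on the genome size assumed.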
De Novo sequencing of potato BACs
Long accurate reads – few contigs
Pilot study for a member of the potato consortium
• 56 plant BACs sequenced using MIDs, and assembled
• 8 in milestone I, 48 on two LR70 runs in milestone II
• Average BAC insert size: 136 kb
• Average number of contigs >500 nt (N=56): 16.6
• Average N50 contigs (N=56): 39,808 nt
• New sequence information not sequenceable using Sanger capillary sequencer was detected
• Comparison with Sanger sequence often revealed several kb new information per BAC, because GS-FLX has no cloning bias
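For readers unfamiliar with the N50 statistic quoted above: N50 is the length of the contig at which, going from longest to shortest, the running total first reaches half of all assembled bases. A minimal sketch follows; the contig lengths are hypothetical and merely chosen to sum to 136 kb, the average BAC insert size.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (in nt)."""
    total = sum(contig_lengths)
    running = 0
    # Walk contigs from longest to shortest until half the bases are covered
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("contig_lengths must not be empty")

# Hypothetical assembly of one 136 kb BAC
contigs = [50_000, 40_000, 20_000, 15_000, 8_000, 3_000]
print(n50(contigs))
```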
Technological innovations
• Sequence Capture (SeqCap) microarrays
• Multiplex Identifiers (MIDs)
Sample preparation is the new bottleneck for resequencing applications
Conventional Sanger dideoxy sequencing
Sample prep by PCR and cloning
Next Generation Sequencing: ultra-high throughput
Sample prep by NimbleGen Microarray-based Sequence Capture (SeqCap)
Labor, Infrastructure, Throughput
Capture of specific DNA regions at kb to Mb scale will enable the full potential of Next Generation Sequencing to be exploited.
Targeted Resequencing
NimbleGen SeqCap arrays and GS-FLX
An easier way to do targeted resequencing
Workflow:
1. Fragment
2. Hybridize to SeqCap array
3. Wash and elute, perform PCR
4. Sequence on GS-FLX
5. Analyze sequences of captured exons
Targeted Resequencing
NimbleGen SeqCap arrays and GS-FLX
Exon Capture Results
• Probes >60mer
• 1 probe every 10 bases
• Highly reproducible
• Accurately captures targets
• Mean enrichment ~378-fold
(T.J. Albert et al. Nature Methods. October 2007)
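One common way to express fold enrichment from sequencing data is the fraction of reads landing on target divided by the fraction of the genome that is targeted. The sketch below illustrates that definition only; it is an assumption that it resembles how the ~378-fold figure was derived (the cited paper may use a different method, e.g. qPCR), and every number in the example is hypothetical.

```python
def fold_enrichment(reads_on_target, total_reads, target_bp, genome_bp):
    """Fold enrichment = on-target read fraction / target fraction of the genome."""
    on_target_fraction = reads_on_target / total_reads
    target_fraction = target_bp / genome_bp
    return on_target_fraction / target_fraction

# Hypothetical example: 60% of reads on a 5 Mb target in a 3 Gb genome
enrichment = fold_enrichment(
    reads_on_target=60_000,
    total_reads=100_000,
    target_bp=5e6,
    genome_bp=3e9,
)
print(f"~{enrichment:.0f}-fold")
```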
[Figure: Chr16 exon capture. Tracks: array probe positions; 454-sequencing reads (BLAST hits) for Replicate 1, Replicate 2 and Replicate 3]
• 385,000 probes targeted ~3 Mb of 11p12 locus
• 72% of the region covered by probes (window masking of repeats)
Repetitive regions
NimbleGen Sequence Capture Array Targeting 11p12 Diabetes Locus
Genomic region
Targeted region
base cov depth
Mapped reads
High Coverage and Specificity of Sequence Capture at 11p12
Median coverage DEPTH
Total reads
Number of reads in target regions
Percent of reads in target regions
HapMap SNPs classified correctly
Average coverage DEPTH
Percent target bases covered
Target bases covered
Initial target bases
Hapmap Samples
High Coverage DEPTH
NimbleGen Sequence Capture 385K Custom Service
Step 1: Array Design. NimbleGen will design probes against regions provided by the researcher. Repetitive regions will not be covered by the design, and researchers will approve the design before Step 2 starts.
End-April 2008 (estimated)
Step 2: Sequence Capture. The researcher ships genomic DNA samples to the Roche NimbleGen Service Lab. Roche NimbleGen will manufacture the array from Step 1 and perform sequence capture on the samples. The enriched DNA will be amplified, tested for enrichment level, and shipped back to the researcher.
Researchers will provide:
• High-quality genomic DNA (human or mouse only; > 21 µg/sample).
– WGA samples are currently not acceptable.
• Sequence information on regions to target in the genome
– from hg18 or mm9, currently up to 5Mb max.
Researchers will get:
• Captured DNA (10 µg amplified DNA/sample) with report on yield and level of enrichment (qPCR ref against ctrl loci).
• List of regions targeted by the design and visualization software.
• User’s Guides, including how to sequence the captured DNA with GS-FLX.
Molecular Barcoding concept
MIDs (Multiplex Identifiers)
• MIDs allow barcoding of up to 12 different samples, for mixing, emPCR and pooling into each region for sequencing; maximum possible = 16 regions x 12 samples/region = 192 samples per PTP (1000 reads per sample)
[Figure labels: Primer A, MID, Key, Library fragment, Primer B; Sequencing primer]
MID    Sequence
A      ACGAGTGCGT
B      ACGCTCGACA
C      AGACGCACTC
D      AGCACTGTAG
E      ATCAGACACG
F      CGTGTCTCTA
G      CTCGCGTGTC
H      TAGTATCAGC
I      TCTCTATGCG
J      TGATACGTCT
K      TACTGAGCTA
L      ATATCGCGAG
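To illustrate how MID barcodes let pooled reads be sorted back to their samples, here is a minimal demultiplexing sketch. It assumes each read has already had the key trimmed and begins directly with its 10-base MID, and it uses exact matching only (no tolerance for sequencing errors). The function name and the example reads are hypothetical; the MID sequences are those listed in the table above.

```python
# MID letter -> 10-base barcode, as listed on the slide
MIDS = {
    "A": "ACGAGTGCGT", "B": "ACGCTCGACA", "C": "AGACGCACTC",
    "D": "AGCACTGTAG", "E": "ATCAGACACG", "F": "CGTGTCTCTA",
    "G": "CTCGCGTGTC", "H": "TAGTATCAGC", "I": "TCTCTATGCG",
    "J": "TGATACGTCT", "K": "TACTGAGCTA", "L": "ATATCGCGAG",
}

def assign_sample(read, mids=MIDS):
    """Return (MID letter, read with barcode removed), or (None, read) if no match."""
    for name, barcode in mids.items():
        if read.startswith(barcode):
            return name, read[len(barcode):]
    return None, read

# Capacity quoted on the slide: 16 PTP regions x 12 MIDs per region
REGIONS_PER_PTP = 16
MAX_SAMPLES_PER_PTP = REGIONS_PER_PTP * len(MIDS)

# Hypothetical reads
print(assign_sample("ACGAGTGCGT" + "TTGACCGGA"))   # matches MID A
print(assign_sample("GGGGGGGGGG" + "TTGACCGGA"))   # no MID match
print(MAX_SAMPLES_PER_PTP)
```

Because all twelve barcodes have the same length and are distinct, a read can match at most one of them under exact matching.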