454 application presentation

The Next Generation of Genomic Research: Genome Sequencer GS-FLX

Patrick Ng, Ph.D.

Scientific Liaison Manager, Asia-Pacific

Roche Diagnostics


May 2007 June 2007

Sample Prep: Roche reagents

• RNA/DNA isolation

• RNA/DNA purification

• DNA amplification

• DNA labeling

Real-time qPCR: LightCycler 2.0; LC480

• HRM for SNP validation

• qPCR for quantitative validation

Quick genome scans: Whole genome tiling microarrays

• aCGH

• ChIP-chip

• Epigenetics

• Gene expression

• Targeted sequence capture (SeqCap)

High-throughput sequencing: Genome Sequencer GS-FLX

• Ultra-Broad sequencing

• Ultra-Deep sequencing

• Whole genome de novo sequencing

• Whole genome re-sequencing

Roche Genomics Solutions

[Figure: GS-FLX instrument, showing the computer subsystem, keyboard and mouse (in drawer), optics subsystem and fluidics subsystem]

Genome Sequencing: 1st Generation
Sanger dideoxy sequencing on an ABI capillary sequencer

• Sample prep

– Bacterial cloning of DNA, colony picking,

culturing, plasmid DNA extraction

– Typical time needed: weeks/ months

• DNA Sequencing

– State of the art capillary sequencer

enables maximum ~1-2 Mb/ 24hr

– Typical sequencing time needed for a

whole genome project: months/ years

• Personnel requirements

– Production sequencing facility required;

manpower needed for sample prep and

sequencing, typically 5-10 full-time staff

Genome Sequencing: Next (2nd) Generation
Massively parallel pyrosequencing (454-sequencing™) on the GS-FLX

• Sample prep

– No bacterial cloning; all cloning is done in vitro

– time needed: ~15 hrs including emulsion PCR

amplification

• DNA Sequencing

– GS-FLX output/run ~100 Mb in 7.5hr

– Daily throughput ~300 Mb/ 24hrs theoretical

– Typical sequencing time needed for a

whole genome project: much shortened

• Personnel requirements

– Manpower needed for sample prep and

sequencing, typically 1-2 full-time staff

Sequencing of Corynebacterium kroppenstedtii strain

From sequencing to manuscript submission = 1 week

Tauch, A. et al. “Ultrafast pyrosequencing of Corynebacterium kroppenstedtii DSM44385 revealed insights into the physiology of a lipophilic corynebacterium that lacks mycolic acids.” Journal of Biotechnology, available online 20 March 2008

Number of GS-FLX runs 1

Number of reads 560,248

Mean read length 196 bp

Total no. of bases 110,018,974

Coverage depth 45.8 x

Number of assembled contigs 6

Assembled bases 2,434,342

Mean G+C content 57.5%

Coding sequences (pred.) 2,119

Average gene length 1,016 bp

Average intergenic region 163 bp

7.5 hours sequencing (1 run), automated annotation overnight, paper written in 3 days.

Sequencing of Corynebacterium kroppenstedtii strain

From sequencing to manuscript submission = 1 week

Sequencing Time:

C. glutamicum (2000) Clone-by-clone, Sanger 2 years

C. jeikeium (2003) Whole genome shotgun 3 months

C. urealyticum (2006) GS-20 shotgun 4 days

C. kroppenstedtii (2007) GS-FLX shotgun 7.5 hrs

(XLR_HD):

Expect >10 C. kroppenstedtii-like genomes to be sequenced in 1 run

(~500Mb / (2.4Mb*20x oversampling) = 10.4)
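The estimate above can be reproduced with a short calculation. This is a minimal sketch: the function name `genomes_per_run` is made up for illustration, and the inputs (~500 Mb per run, a 2.4 Mb genome, 20x oversampling) are the figures quoted on the slide.

```python
def genomes_per_run(run_yield_mb: float, genome_size_mb: float, oversampling: float) -> float:
    """Number of genomes of a given size that one run could cover at the target depth."""
    if genome_size_mb <= 0 or oversampling <= 0:
        raise ValueError("genome size and oversampling must be positive")
    return run_yield_mb / (genome_size_mb * oversampling)


# Slide figures: ~500 Mb per run, 2.4 Mb genome, 20x oversampling
estimate = genomes_per_run(500, 2.4, 20)
print(f"{estimate:.1f} genomes per run")
```

The result ignores multiplexing overhead and uneven sample pooling, so it is best read as an upper bound.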

Dr. Andreas Tauch, University of Bielefeld: 7.5 hours sequencing (1 run), automated annotation overnight, paper written in 3 days.

Cloning Bias in Conventional ABI Sequencing

(500 kb stretch in Listeria monocytogenes)

[Figure: GS-FLX coverage and ABI coverage plotted against reference position (bp)]

Courtesy of Drs. Nusbaum and Young of the Broad Institute

GS-FLX Sequencing Workflow: Overview

One Fragment, One Bead, One Read: 400,000+ reads per run

Sample input: Genomic DNA, BACs, amplicons, cDNA

• Generation of small DNA fragments via nebulization (starting DNA to fragments)
• Ligation of A/B-Adaptors flanking single-stranded DNA fragments
• Emulsification of beads and fragments in water-in-oil microreactors
• Clonal amplification of fragments bound to beads in microreactors
• Sequencing and base calling

GS-FLX Process Steps: 1. Shotgun DNA library preparation

Timeline: DNA library preparation and titration (4.5 h and 10.5 h), emPCR (8 h), sequencing (7.5 h)

gDNA -> sstDNA library

a. Genomic DNA fragmented by nebulization
b. Adaptors A and Biot-B ligated to fragments
c. Immobilize repaired, adapted DNA to paramagnetic Streptavidin beads
d. Select for only A-fragment-B and B-fragment-A sstDNA molecules in supernatant (not Biot)
e. Functional validation of sstDNA by titration run (do emPCR and GS-FLX sequencing run to determine best number of sstDNA molecules per Capture Bead)

Process Steps: 2. emPCR

sstDNA library -> clonally-amplified sstDNA attached to capture beads

• Anneal sstDNA to an excess of DNA capture beads (choice of average number of molecules per bead is based on titration results)*. Capture beads are non-paramagnetic.
• Emulsify beads and PCR reagents in water-in-oil microreactors (using a TissueLyser). Most of the microreactors that contain DNA will contain only 1 DNA molecule and 1 bead.
• Clonal amplification occurs inside microreactors. Typically, 40 PCR cycles are performed.
• Break microreactors, and enrich for DNA-positive beads (using magnetic streptavidin beads that bind to the biotinylated emPCR products). Convert to bead-bound sstDNA.

*Titration is required to avoid excessive “empty” or “multi-template” beads in the emPCR

Process Steps: 3. Sequencing

Amplified sstDNA library beads -> quality reads

• A single, clonally amplified sstDNA bead (after enrichment) is deposited per well.
• Load PicoTiterPlate (PTP) into sequencer, begin run
• PTP well diameter: average of 44 µm
• Capture bead diameter: 27-32 µm
• Enzyme bead diameter: 2.8 µm
• Packing bead diameter: 0.8 µm
• Wells per PTP: 1.6 million

Process Steps: Sequencing

Pyrosequencing details (Sequencing-by-synthesis)

Amplified sstDNA library beads -> quality reads

DNA capture bead containing ~10-30 million copies of a single clonal fragment (sstDNA templates)

• 4 unlabeled nucleotides (T, A, C, G) are added sequentially (flowed), 1 nucleotide at a time.
• Cycled 100 times for large PTP run
• Chemiluminescent signal generation (based on Pyrosequencing™)
• Pyrophosphate released upon nucleotide incorporation is converted to ATP, which drives the luciferase reaction and light output
• Light signal captured on CCD camera
• Signal processing to determine base sequence and quality score
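The flow-based readout described above can be illustrated with an idealized, noise-free model. This is only a sketch under simplifying assumptions: signal is taken to be exactly proportional to the number of bases incorporated in a flow, and the function names `flowgram` and `basecall` are invented for illustration. It is not the instrument's signal-processing software.

```python
from itertools import cycle, islice


def flowgram(read: str, flow_order: str = "TACG", cycles: int = 100) -> list[int]:
    """Idealized flow signals for the strand being synthesized.

    Each nucleotide flow yields a signal equal to the number of consecutive
    bases incorporated, so a homopolymer of length n gives a signal of n.
    """
    signals = []
    pos = 0
    for nt in islice(cycle(flow_order), cycles * len(flow_order)):
        run = 0
        while pos < len(read) and read[pos] == nt:
            run += 1
            pos += 1
        signals.append(run)
        if pos == len(read):
            break  # read fully synthesized
    return signals


def basecall(signals: list[int], flow_order: str = "TACG") -> str:
    """Convert idealized integer flow signals back into a base sequence."""
    return "".join(nt * n for nt, n in zip(cycle(flow_order), signals))


flows = flowgram("TTACGG")
print(flows)            # one value per flow: T, A, C, G
print(basecall(flows))
```

In real data the signal for long homopolymers is not a clean integer, which is why homopolymer length is the main source of indel errors on this platform.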


adenosine 5´ phosphosulfate

Software

Image acquisition -> Image processing -> Signal processing (FASTA basecalls + quality scores) -> Applications software

[Figures: metric and image viewing software; signal output from a single well (flowgram)]

On current GS-FLX system, raw image -> FASTA basecalls = 8-9 hours

GS FLX Sequencing: Bioinformatics

Reference Mapper (assembly using ref. seq.)

De novo Assembler (assembly from scratch)

Amplicon Variant Analyzer

Image capture

Image processing

Signal processing

Features:

• Small dataset size = greater convenience. 13.2 GB including raw images, allows future re-analysis if desired

(e.g. post software upgrade)

• Useful software (currently 3 applications) available out-of-the-box

• Long (250bp) and accurate (99.5% single-read) reads = No filtering of reads against known ref needed

• Software recognizes molecular barcodes (MIDs) for greater multiplexing (and economy); also able to assemble using various read types: 454-shotgun, Sanger and paired-end reads (singly/ in combo) for best results

The Genome Sequencer FLX System: Technical Specifications

Current system

• ≥ 400,000 sequence reads per run
• 200 - 300 bases per read
• 1 run = 7.5 hours = ~100 Mb
• 2-3 runs per 24hr day
• Theoretical 1 Gb in 3-4 days
• Accuracy: ~99.5% over 200 bases
• Image & signal processing: 8-9 hrs post-sequencing

After “Titanium” kit upgrade, ~Q3, 2008

• ≥ 1,000,000 sequence reads per run
• 400 - 500 bases per read
• 1 run = 10 hours = ~500 Mb
• 2 runs per 24hr day
• Theoretical 1 Gb in 1 day
• Accuracy: ~99.0% over 400 bases
• Image & signal processing: 12-20 hrs post-sequencing (upgraded Unix cluster)
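The "theoretical 1 Gb" figures follow directly from yield per run and runs per day. The sketch below assumes 3 runs per day for the current system (the upper end of the quoted 2-3) and uses an illustrative helper name, `days_to_target`.

```python
def days_to_target(target_mb: float, mb_per_run: float, runs_per_day: float) -> float:
    """Days of back-to-back sequencing needed to reach a target yield."""
    return target_mb / (mb_per_run * runs_per_day)


current = days_to_target(1000, 100, 3)    # ~100 Mb per run, 3 runs per day
titanium = days_to_target(1000, 500, 2)   # ~500 Mb per run, 2 runs per day
print(f"current system: {current:.1f} days, Titanium: {titanium:.1f} days")
```

At 2 runs per day the current system would need 5 days, so the quoted 3-4 days corresponds to the higher run rate. Image and signal processing time is not included.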

[Figure: current image vs improved image with the new, metallized PTP; no change in sequencer]

Data Analysis Server for XLR HD

For an out-of-the-box solution we have qualified a provider who will supply an integrated system. This purchase is made through Roche and is delivered as a one-box solution.

Support concept in place, deck available

Server specifications are available for a do-it-yourself option.

GS-FLX or not?

There are other Next-Generation Sequencers available that seem to be cheaper; they promise to do many things. Why don’t I buy those instead?

• Keep in mind the following: other NGS platforms give many short, lower-quality reads. These are suitable only for specific applications (mapping tags to a reference, and counting them). Their short reads result in poor mapping specificity and short contigs (more gaps). Few/no publications.

You need a GS-FLX if:

•You require large scale, very high throughput DNA sequencing

•You intend to study any of the following: de novo sequencing & assembly (whole/partial genome);

whole genome resequencing; targeted resequencing/ amplicon resequencing; metagenomics;

transcriptomics.

You do not need a GS-FLX if:

•You only intend to sequence small numbers of samples each time (tens of thousands of bases)

•Your applications only involve simple clone sequence verifications.

Side-by-Side Comparison: Sanger dideoxy sequencing vs 454-sequencing™

Parameter | Sanger dideoxy | GS-FLX
Read-length (bases) | 750-1,000 | ~200
Throughput (bases per 24 hrs) | 1-2 million | ~300 million (3 runs)
Raw cost-per-base (US cents) | ~0.2¢ | ~0.008¢
Accuracy (single-read/ 20x consensus) | 99.3-99.6% / >99.99% | ~99.5% / >99.995%
Manpower (bact. genome project) | 5-10 FTE | 1-2 FTE
Sample prep time (bact. genome project) | Months | Hours
Sequencing time (bact. genome project) | Months – Years | Hours – Days
Sample characteristics | In vivo clones: biased | In vitro: unbiased
Sample characteristics | Direct PCR: averaged signals | Amplicon sequencing: each molecule individually sequenced
Errors | Homopolymer slippage; GC-rich hard stops | Homopolymer indels

AT- or GC-rich genomes not a problem for GS-FLX

Depending on the organism,

read lengths are in the range

of 200 – 300 high quality

bases.

Genomes that are more AT- or GC-rich typically yield a longer read length distribution as compared to an AT/GC neutral genome

Read length

Long reads matter: Short reads do not provide genome mapping uniqueness

Modified from Figure 2. Uniqueness as a function of read length; human genomic DNA.

25 to 35-bp reads: ~ 80 - 87% uniqueness

100-bp reads: > 90% uniqueness

(Whiteford, N. et al. Nucl. Acids Res. 2005)

Whole human genome

Human chr 1 only

High genome-mapping uniqueness is

important for genome annotation and

transcriptome profiling experiments

Transcriptome profiling of Plants: Short (singleton) reads do not provide sufficient transcriptome mapping uniqueness

Species | 25 bp | 36 bp | 72 bp | 250 bp (read length)
Arabidopsis thaliana | 49% | 55% | 65% | 83%
Brassica rapus | 66% | 71% | 80% | 91%
Glycine max | 51% | 56% | 64% | 82%
Lotus japonicus | 75% | 79% | 86% | 95%
Medicago truncatula | 67% | 69% | 74% | 86%
Oryza sativa | 48% | 53% | 61% | 79%
Populus trichocarpa | 71% | 75% | 81% | 91%
Solanum tuberosum | 51% | 56% | 65% | 82%
Zea mays | 46% | 53% | 62% | 78%

Table shows uniqueness of transcriptomic singletons. Data is extracted from a 2008 study modeling all possible reads in both directions of specified lengths across 20 different plant species. Reference transcript assemblies were from http://plantta.tigr.org/

Transcriptomics

Long reads matter: Short reads result in short contigs and poor coverage

Modified from Figure 2 (Whiteford, N. et al. Nucl. Acids Res. 2005). Genome coverage (at specific contig sizes) as a function of read length; y-axis: percentage of chr 1 covered.

25 to 35-bp reads: Only 7% can form contigs of 10,000bp and larger (gaps!!!)

100-bp reads: 90% can form contigs of 10,000bp and larger

Note: Average gene size in humans ~10-15 kb

Read length is important in genome assembly: Why long reads are needed for genome assembly

1. Genomes (especially complex ones) contain large numbers of repeats

- Repeats can be a few bases, to thousands of bases long

2. It is very difficult to completely assemble a genome, if the read length is shorter

than the repeats (see next slide)

- The reads need to bridge the repetitive DNA regions

3. Paired-end sequences can help assembly, but the end-tags still need to map specifically to the contigs

- So the end-tags themselves also need to be longer

4. In summary, short reads = short contigs, more gaps and poor assemblies

5. GS-FLX gives longer reads, fewer gaps, good genome assemblies

* NOTE: The same situation exists even for resequencing: the mapping uniqueness of short

singleton reads to a reference genome is greatly improved by increased read length. Also, the structures of splice variants are much clearer when long reads are used.

Long reads matter for de novo assembly: Short reads cannot bridge repetitive regions; a gap remains

Unique DNA Sequence

Unique DNA Sequence

CGTAGGCTAGATGCATGCAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGATATAGCGATCTCGACATGCT

Repetitive DNA Sequence

GS-FLX Long read

Short reads ?

If the read does not span the repeats, no amount of increased sequencing coverage (depth) will allow either de novo genome assembly, or high-quality

resequencing (there will be gaps)
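The bridging condition can be stated as a simple check: a read spans a repeat only if it is longer than the repeat plus some unique sequence on both sides. The sketch below is illustrative only. The test sequence is made up (unique flanks around 17 copies of "AG", modeled on the slide's example), and `min_anchor`, the number of unique flanking bases required on each side, is a hypothetical parameter.

```python
import re


def longest_tandem_run(seq: str, unit: str) -> int:
    """Length in bases of the longest uninterrupted tandem run of `unit`."""
    runs = re.finditer(f"(?:{re.escape(unit)})+", seq)
    return max((len(m.group()) for m in runs), default=0)


def can_bridge(read_length: int, repeat_length: int, min_anchor: int = 20) -> bool:
    """True if a read can cover the whole repeat plus unique anchors on both sides."""
    return read_length >= repeat_length + 2 * min_anchor


# Made-up example: unique flanks around a dinucleotide repeat
seq = "CGTAGGCTAGATGCATGC" + "AG" * 17 + "ATATAGCGATCTCGACATGCT"
repeat_len = longest_tandem_run(seq, "AG")

for read_len in (35, 250):
    print(read_len, "bp read bridges the repeat:", can_bridge(read_len, repeat_len))
```

Adding more reads of the shorter length does not change the outcome of the check, which is the point made on the slide.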


GS-FLX sequencing accuracy

• GS-FLX Single-Read accuracy > 99.5% (includes all homopolymer errors)

– Sanger Single-Read Accuracy = 99.3% to 99.6%

• GS-FLX (22x) Consensus-Read Accuracy > 99.995%

[Figure: cumulative read error (0.0% to 4.0%) vs base position (0 to 250) for E. coli runs #1 to #4, T. thermophilus and C. jejuni. Series: reported in Nature, 2005; GS20, Q2 2006; currently (GS-FLX)]

The very high GS-FLX single-read accuracy avoids the need for “quality-filtering” against a reference sequence (used by other sequencing platforms)

Cumulative Read Error by Length: Comparing short-read system to Roche GS-FLX

• At 36 bases, the short-read system gives ~2 bases wrong (5% error)

• Up to 200 bases, GS-FLX gives < 1 base wrong (0.36%)

*Short-read sequencer data (Competitor I)

Based on data downloaded from Sanger Institute website
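The "bases wrong per read" figures are read length multiplied by error rate. A minimal sketch, assuming the quoted percentages can be treated as an average per-base error rate over the read; `expected_errors` is an illustrative name.

```python
def expected_errors(read_length: int, error_rate: float) -> float:
    """Expected number of wrong bases in one read at a given average error rate."""
    return read_length * error_rate


short_read = expected_errors(36, 0.05)      # short-read system, 5% error
gs_flx = expected_errors(200, 0.0036)       # GS-FLX, 0.36% error
print(f"short read: {short_read:.2f} errors, GS-FLX: {gs_flx:.2f} errors")
```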

System Performance: Homopolymers
Individual Reads versus Consensus Reads in E. coli

1. Single-read accuracy here in parsing homopolymers > 90% up to n = 5

2. Consensus accuracy provided by 22x oversampling is ~100% even at n = 9

[Figure: accuracy (0% to 120%) vs homopolymer length (1 to 9) for consensus and single reads; Genome Sequencer FLX data]

Distribution of homopolymers in the human genome

[Figure: homopolymer frequency as percent of genome (log scale, 0.00001% to 100%) for each nucleotide (A, G, C, T), for 4-mer to 16-mer homopolymers]

The incidence of homopolymeric stretches long enough to cause problems for

consensus GS-FLX sequencing (> 8-mer) in the human genome is low, < 0.05%

What can the GS-FLX do for you?

• May 2008: > 150 publications related to 454-sequencing

• You can request updated bibliographies (or access from www.454.com)

UDS: Ultra Deep Amplicon resequencing of specific regions

• de novo sequencing and assembly

(with or without paired-end data)

• Metagenomics

(whole genome)

• Transcriptomics

• Small ncRNA identification

• Novel pathogen discovery

(Medical metagenomics)

• Ancient DNA studies

• Epigenetics and Gene

Regulation

- DNA methylation

• Medical resequencing

(targeted)

- Oncogenomics

• Microbial diversity

(16S rRNA Metagenomics)

• Virus quasispecies studies

GS-FLX applications

WGS: Whole Genome Sequencing

UBS: Ultra Broad Sequencing

• Whole genome re-sequencing and

assembly to identify variations

(comparative genomics);

• Whole genome Paired-End Mapping to

identify structural variations

Amplicon sequencing (Ultradeep)

• Medical resequencing for variant detection

• Virus genotyping and drug-resistance

• DNA methylation characterization

How does Amplicon Sequencing detect rare variants?

Ultradeep sequencing

Mutation present at low frequency in heterogeneous sample

PCR to obtain target locus

GS-FLX sequencing: every species is sequenced individually, allowing every mutation to be quantified

Direct PCR sequencing (Sanger): result is averaged across all species present, and the mutation may be undetectable

Targeted resequencing (Medical resequencing): Ultradeep amplicon resequencing to detect rare variations


Disease Associated Region > 1000s bp

Long Range PCR Amplicons: 3-15 kb each

Isolate genomic DNA from each sample

Generate long-range PCR amplicons across desired target region

Pool amplicons (in equimolar amounts)

Shotgun Sequence on GS-FLX

Genomic DNA

A B C D

Analysis is by Ref Mapper (AVA is for direct amplicon analysis)

Resequencing example: Long range PCR on EGFR
No gaps, analysis of the whole gene possible

• Shotgun library created from 10 overlapping, long range PCR amplicons
• Mapped into 1 single contig of length 70,107 nt (amplified region: 70.1 kb)
• Sample H441: 80 high-confidence variations detected (62 SNPs known from db)
• Sample H1975: 95 high-confidence variations detected (73 SNPs known from db)

Short InDels in H441
Detectable: no quality filtering based on comparison with reference sequence needed

Variant type | # of variants | Sequence change
Single base substitution | 12 | Various
2 base deletion | 1 | AT/--
2 base insertion | 1 | --/CA
3 base deletion | 1 | GCT/---
4 base deletion | 2 | ACAC/----, TGTG/----
4 base insertion | 2 | ----/CACA, ----/TTCA

All new variants have been confirmed by Sanger sequencing.
Detecting variations > 2 nt is not possible using the competing platform (Manufacturer I).

Whole Genome Re-Sequencing: Benefits of using GS-FLX

• 250bp reads effectively span many sequences that micro-reads interpret as repeats, minimizing the number of gaps.
• Bias-free sequencing (coverage does not vary due to GC/AT content).
• Significantly higher genome coverage and much more efficient identification of more variations.
• InDels can be discovered efficiently/easily (no filtering against reference sequence needed).


Targeted resequencing: Ultradeep sequencing for retrospective analysis of relapse in NSCLC patient
R.K. Thomas et al., Nature Medicine (July 2006) 12: 852-855

• Lung cancer patient had strong initial response to Erlotinib (TKI) before relapsing 12.5 mo later with pleural effusions.

• Histological specimen post-relapse had only 1-10% tumor content. Sanger sequencing showed wt EGFR only.

• 454-sequencing (amplicon sequencing of exons 18-22) showed 2 mutations: (i) 18-bp deletion in exon 19 (Del4) (3% of 11,367 reads), and (ii) C->T substitution resulting in T790M mutation (2% of 136,776 reads).

• Validation: cells overexpressing EGFR with Del4 were sensitive to Erlotinib, but combined Del4/T790M mutation rendered cells resistant.
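The read fractions quoted for this case can be turned into approximate supporting-read counts. This is a rough sketch: the percentages on the slide are rounded, so the counts are approximate, and `variant_reads` is an illustrative helper rather than part of any analysis software.

```python
def variant_reads(total_reads: int, fraction: float) -> int:
    """Approximate number of reads carrying a variant seen at the given fraction."""
    return round(total_reads * fraction)


del4 = variant_reads(11_367, 0.03)      # 18-bp deletion in exon 19
t790m = variant_reads(136_776, 0.02)    # C->T substitution (T790M)
print(f"Del4: ~{del4} reads, T790M: ~{t790m} reads")
```

Hundreds to thousands of independent reads support each mutation, even though each is present at only a few percent, a level at which an averaged Sanger trace showed wild-type sequence only.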

• 3 mutations are frequently observed in multidrug-resistant HIV viruses: M46I/L, V82A/F/S/T, and L90M.
• The combination confers resistance to all protease inhibitors currently in use.
• If all three were found on a single major species before treatment, protease inhibitors would have minimal effect.
• However, if they were on distinct species, they could be suppressed by different inhibitors.
• A single GS-FLX read (250bp) covers all 3 mutations (135bp apart)
• Provides valuable information on drug resistance and productive treatment options
• Short reads (25-50bp) cannot distinguish between the various subtypes
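The 135 bp figure matches the span from codon 46 to codon 90 of the protease gene (45 codons of 3 nt each). The sketch below checks whether a read is long enough to phase all three positions; the helper names are illustrative, and it assumes the read is positioned so that it starts at or before the first codon.

```python
def codon_span_nt(codons: list[int]) -> int:
    """Nucleotides from the first base of the lowest codon to the last base of the highest."""
    return (max(codons) - min(codons) + 1) * 3


def read_can_phase(read_length: int, codons: list[int]) -> bool:
    """True if one read is long enough to cover every listed codon."""
    return read_length >= codon_span_nt(codons)


resistance_codons = [46, 82, 90]   # M46I/L, V82A/F/S/T, L90M
print(codon_span_nt(resistance_codons), "nt span")
for read_len in (25, 50, 250):
    print(read_len, "bp read can phase all three:", read_can_phase(read_len, resistance_codons))
```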

For virus genotyping, long read lengths are of paramount importance in HAPLOTYPING

Ultradeep sequencing/ haplotype analysis

See also: Wensing et al. (2005) JID 192:958; Shafer et al. (2006) JID 194 (Suppl 1):551; Bonaventura et al. “New developments in HIV drug resistance and options for treatment-experienced patients”, etc.

Individual sequencing of each template allows identification and quantification of distinct virus subspecies within a mixed population, including “haplotypes”

Subspecies identification in HIV-1

Sequencing of 207 bp amplicon from virus protease gene

[Figure: mutation frequency and coverage across the amplicon]

GACATGAATTTG
|| |||| ||||
GAAATGAGTTTG
34% of reads

GAAATG-GCTTTGCC
|||||| | ||||||
GAAATGAG-TTTGCC
39% of reads

GAAATGCAGTT-GCCAGG
|||||| |||| ||||||
GAAATG-AGTTTGCCAGG
21% of reads

Sanger, direct PCR: GAAATGG NTTTGCC (unresolvable region)

In collaboration with Dr. M. Kozal, Yale VA Hospital. See also Wang et al. (2007) Genome Res, for a good look at coverage needed in HIV-1 variant analysis

Transcriptomics

• Based on existing (some partial) reference genomes: microbes,

drosophila, human cell lines, maize, arabidopsis, medicago,

salmon, sheep.

• de novo transcriptome characterization: Paper wasp (ref

honeybee), Glanville Fritillary butterfly (ref Bombyx mori)

Transcriptomics

Transcriptome characterization: Overview

• cDNA synthesis from RNA, then fragmentation and 454-sequencing
  - options: oligo-dT, random priming, RNA fragmentation, cDNA normalization
• Map the ESTs to reference genome or other suitable database
  - Compare gene expression levels across samples by EST counting
  - Perform genome annotation to improve existing gene models

Transcriptomics

Advantages of using GS-FLX

• Long, accurate reads allow excellent mapping of ESTs to genome (Table)
  - Torres et al., Genome Research (2008) Jan;18(1):172-7 (GS-20, Drosophila)

Transcriptomics

Advantages of using GS-FLX

• Essentially unbiased representation regardless of transcript length (Figure) or expression level
  - Weber et al., Plant Physiology (2007) May; 144:32-42 (GS-20, Arabidopsis)

Here we see 154,379 GS-20 ESTs corresponding to 1,053 transcripts (flcDNAs) of 1,000-2,000 nt in size (medium length).

[Figure: EST distribution by position relative to the 5’ end of the cDNA, 5’ to 3’]

• ESTs cover every part of every cDNA. Medium-length cDNAs are shown here, but the distribution pattern is similar for short and long cDNAs.
• Some favoring of 5’ and 3’ ends, possibly indicating incomplete nebulization
• Total number of (+) direction ESTs (55-60%) is slightly greater than (-) direction ESTs. Why?

GS-FLX sequencing of the paper wasp transcriptome

• Conventional cDNA library preparation from wasp was sequenced using 454-Sequencing.
• 391,157 brain cDNA reads generated
• 3,017 genes hit in honey bee genome
• No wasp genome available
• 32 behavioral gene orthologs further characterized to demonstrate the link between maternal behavior and the development of social behavior
• Study also demonstrated the ability to use a known, related genome (bee in this case) as a hub to successfully generate assemblies

de novo whole transcriptome characterization using 454-sequencing (GS-20)

•No reference genome available; this is 1st report of a de novo

transcriptome assembly using NGS data.

•2 cDNA libraries, normalized (Evrogen); 2 GS20 runs done;

SeqMan Pro assembly of ESTs.

• 518,079 high-quality ESTs (88% of raw) obtained, assembled

into 48,354 contigs + 59,943 singletons, thus 108,297 unigenes.

• Microarray made using assembled transcripts: high reproducibility.

• Issues: Cf B.mori database, inferred that ideally, 4x more sequencing

would be needed for complete flcDNA coverage. Also, assembly of

splice variants was “complex”, and needed human annotation. We can

expect that with GS-FLX and XLR-HD, results will be vastly improved.

• Interestingly, 618 reads were non-metazoan (mainly cryptosporidiae),

hence 454-seq can possibly be used for xenobiont detection

Transcriptomics

Molecular Ecology (2008)

Glanville fritillary butterfly (Melitaea cinxia).

Metagenomics (& Microbial Diversity)

Metagenomics on the GS-FLX: Benefits of long, accurate reads

• Metagenomics is the shotgun sequencing of mixed DNA isolated from environmental samples. For assessing bacterial diversity, the focus is usually on 16S rRNA

• Read length of >200 bases allows:

– accurate assessment of diversity (low rate of ambiguous mapping results)

– unambiguously identify an organism or gene in an unknown complex environmental sample.

See: R Edwards et al. BMC Genomics, 7:57 (2006)

Metagenomics on the GS-FLX: Why long reads are needed

Environment (e.g. deep sea): tens of thousands of different species

Isolation of environmental DNA and shotgun sequencing

FLX long reads: reads map uniquely

Microreads: reads map everywhere; you can't use them (paired-end reads might help, but even then, specificity is higher with longer end-tags)

Elucidation of symbiotic interdependence between insect host & 2 internal bacteria

Ultrabroad sequencing - Metagenomics

• Tripartite symbiotic relationship between insect host

(H. coagulata, Glassy-Winged Sharpshooter) that

feeds on xylem sap, and 2 internal bacteria:

Baumannia sp. (provides vitamins and cofactors)

and Sulcia sp. (provides essential aa.’s).

• 23 Newbler contigs were assembled into a

complete circular Sulcia genome

• Illumina-Solexa 1G was used to resolve 155

homopolymeric uncertainties (Sulcia genome is

245,000 bp; homopolymer errors = 0.06%).

• As with combined Sanger/454-sequencing, dual-platform experiments may be the most sensible approach for highest-quality assemblies, for genome centers that can afford them.
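The quoted homopolymer error rate follows from the two figures given for the Sulcia genome. A minimal check, with `error_percent` as an illustrative name:

```python
def error_percent(n_errors: int, genome_bp: int) -> float:
    """Errors as a percentage of genome length."""
    return 100 * n_errors / genome_bp


rate = error_percent(155, 245_000)   # 155 homopolymer uncertainties in 245,000 bp
print(f"{rate:.2f}%")
```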

Novel pathogen discovery: Medical Metagenomics

Palacios, G, et al.; N Engl J Med 2008:358

The same group that did the Honey Bee Colony Collapse Disorder metagenomics work

Novel pathogen discovery/ Metagenomics

• 3 women in Australia received various transplanted organs (liver, kidney) from same male donor;

donor had died from cerebral hemorrhage.

• 4-6 wks post-op, recipients all died from “febrile illness with varying degrees of encephalopathy”.

• Tested negative: bacterial/ viral cultures; PCR for various viruses; microarray analysis using

panmicrobial and viral arrays

• Methods: GS-FLX sequencing performed on RNA extracted (various source tissues) from 2 deceased

patients; RNA was DNaseI-treated, then RT-PCR using random primers. No further nebulization done.

• Results: Sequences filtered bioinformatically to remove repetitive DNA; human (host) DNA subtracted;

non-human sequences were clustered with Cd-hit, then CAP3 assembled. BLASTX and BLASTN

against Genbank performed. Of 103,632 sequences (mean size: 162bp), 14 fragments had homology to arenavirus, closest relationship to LCMV. Validation of novel LCMV done by RT-PCR:

22 of 30 samples (from 3 patients) positive. Other validations done, on Vero E6 cells, and patient

samples.

• Conclusions: GS-FLX sequencing has been used to identify a novel pathogen present against a

massive background of known host gDNA; “Medical Metagenomics”?

Novel pathogen discovery/ Metagenomics

Palacios, G, et al.; N Engl J Med 2008:358

Novel pathogen discovery: Medical Metagenomics?

14 out of ~100K reads hit LCMV

Maximum contiguous match to known LCMV: only 14 bp

Small noncoding RNA (sncRNA) characterization

Ultrabroad seq- MicroRNA

Small non-coding RNA (sncRNA) analysis

• Transcripts expressed from the genome can be

protein-coding, or non-coding

• Non-coding RNAs have important regulatory

functions; small ncRNAs include:

– Small interfering RNAs (siRNAs) ~21-25nt

• NAT-siRNAs (antisense; 21-24nt)

– microRNAs (miRNAs) ~22nt

Recently, some larger sncRNAs identified:

•Piwi-interacting RNAs (piRNAs) ~29-31nt

•Small-scan RNAs (scnRNAs) 27-31nt

•Repeat assd siRNAs (rasiRNAs) 24-29nt

•Long siRNAs (lsiRNAs) 30-40nt

–small nucleolar RNAs (snoRNAs) ~60-300nt

–“short RNAs” (sRNAs) <200nt (mean 35nt)

•Promoter-associated PASRs

•2’-Termini associated TASRs

–“long RNAs” (lRNAs) >200nt (mean 100nt)

• GS-FLX’s long read length enables the

complete sequencing of almost all small

RNA classes, in just one read

• Coupled with the high depth of coverage,

GS-FLX is well-placed to not just

characterize existing sncRNAs, but also

identify and quantify those that may be

very rare, or novel, with a high degree of

confidence.

Whole genome sequencing

• de novo sequencing and assembly

- bacteria, fungi (including mushroom), viruses, barley (4 BACs as

proof of principle), pinot noir grape (Sanger plus 454)

• Genome resequencing

- bacteria, viruses, Pea, monkey, human

• Paired-end reads and their utility in de novo assembly

& structural variation detection

de novo Sequencing and Assembly: Introduction

A. Random (shotgun) fragments
B. Clustering (overlapping) of fragments into contigs (CONTIG 1, CONTIG 2, CONTIG 3)
C. Consensus merging (contigs are not ordered)
D. Ordering and orienting contigs, with paired end data
E. Scaffold (supercontig) formation

Contigs that are oriented w.r.t. their immediate neighbours are gradually ordered to form a scaffold.

Scaffolds can still contain gaps. Final step is “gap-filling” or “finishing”.

De novo sequencing and assembly

Usefulness of Paired-End sequences (1)

Comparative Genomics

1. Paired-end sequences are very useful for de novo assembly, and also for detecting variations.

2. Paired-end sequences can span repeats, to better orientate shotgun contigs.

3. The choice of what length paired-end span to use, depends on the specific genome being studied.

[Figure: biotinylated (Biot) paired-end fragments are sheared and selected; 454 paired-end reads link Contig 1 and Contig 2 to generate Scaffold 1]

de novo genome assembly of bacteria
GS-20 shotgun reads, assisted by Paired-End reads (2kb span)

Parameter | E. coli | B. licheniformis | S. cerevisiae
Genome size (Mb) | 4.6 | 4.2 | 12.2
Number of runs | 3 | 3 | 9
Fold oversampling | 22 | 27 | 23
Assembly contigs | 140 | 98 | 821
PE library runs | 1 | 1 | 2
Number of paired reads | 112,000 | 255,000 | 395,000
Supercontigs | 24 | 9 | 153
Genome coverage | 98.60% | 99.20% | 93.20%

Paired-End; de novo assembly

Roche in-house data

B.pseudomallei #22Expected Genome size (Mb) 7Number of chromosomes 2Number of runs (GS-20) 6Fold oversampling 22Assembly Contigs (1221 - 79098 bp) 940PE library runs 1Supercontigs using 2kb PE library 50 using 5kb PE library 11 using 10kb PE library 4Genome Coverage (w.r.t. Sanger ref) 93.04%

Roche-GIS collaboration

Results of de novo assembly

Choice of Paired-End library target size depends on the particular genome

Improved Paired-End Protocol (16-20kb span)

Protocol to be released with Titanium upgrade

[Figure: histogram of pair distance (bp; x-axis 10000 to 30000) against count (y-axis 0 to 2000), with the peak labelled 16 kb.]

GS-FLX read type            Genome coverage    Number of contigs/scaffolds
Shotgun                     15×                98
+ 3 kb span                 18×                7
+ 16-20 kb span (New!)      20×                1

Assembly of 4.6Mb E. coli genome into 1 scaffold (consensus accuracy ~99.999%)

Paired-End; de novo assembly

Usefulness of Paired-End sequencing (2)

Paired-End; Comparative Genomics

1. Paired-end sequences are very useful for de novo assembly, and for detecting variations.

2. Paired-end sequences can span repeats, to better orientate shotgun contigs.

3. The choice of paired-end span length depends on the specific genome being studied.

[Figure: paired-end reads (labelled "Biot") from the sample mapped onto the reference.]

Insertion in sample: Mapped Span (sample to ref) < [mean - 3SD]

Deletion in sample: Mapped Span (sample to ref) > [mean + 3SD]
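The mean ± 3SD rule on this slide translates directly into a small classifier. The sketch below is illustrative only: the function name `classify_span`, the `k` parameter and the example library statistics (mean 3000 bp, SD 300 bp) are assumptions, not values from the studies discussed here.

```python
def classify_span(mapped_span, mean_span, sd_span, k=3.0):
    """Classify one read pair by its mapped span on the reference.

    A span shorter than mean - k*SD suggests an insertion in the sample;
    a span longer than mean + k*SD suggests a deletion in the sample.
    """
    if mapped_span < mean_span - k * sd_span:
        return "insertion"
    if mapped_span > mean_span + k * sd_span:
        return "deletion"
    return "concordant"


# Hypothetical 3 kb library: mean span 3000 bp, SD 300 bp.
for span in (1500, 3100, 5000):
    print(span, classify_span(span, mean_span=3000, sd_span=300))
```

In practice a call would normally need several independent pairs supporting the same event; a single outlying pair is weak evidence.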

Roche-Yale publishes study in Science

Human structural variations identified

J. Korbel et al. Science, 13 September 2007

• Same principle as used in GIS, but here the Roche-Yale group used only 3kb PE spans for higher resolution, and ~100bp tags (for best human genome mapping specificity)

• For SV identification, Yale sequenced between 10M and 21M 3kb-PE reads (thus 10x-21x coverage)

– GIS used 323,632 10kb-PE reads (thus ~450x coverage of the Bp genome)

• Between NA15510 (putative European female) and NA18505 (Yoruban female),

– A total of 1,297 SVs were identified (1,175 indels, 122 inversions)

• PCR validation on 40 randomly-selected SVs: 97% validation success

• Note: no actual genome-genome alignment was done in this Roche-Yale study
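The coverage figures quoted above can be reproduced if they are read as span coverage: number of pairs × span length ÷ genome size. This is a rough check under stated assumptions: the helper name `span_coverage` is made up for the example, the 7 Mb Bp genome size is the "expected genome size" from the Roche-GIS table, and the 3 Gb human genome size is a round-number assumption.

```python
def span_coverage(n_pairs, span_bp, genome_bp):
    """Fold coverage of the genome by the spans between paired ends."""
    return n_pairs * span_bp / genome_bp


# GIS: 323,632 pairs of 10 kb span on a ~7 Mb genome.
print(round(span_coverage(323_632, 10_000, 7_000_000)))

# Yale: 10M to 21M pairs of 3 kb span on an assumed 3 Gb human genome.
print(span_coverage(10_000_000, 3_000, 3_000_000_000))
print(span_coverage(21_000_000, 3_000, 3_000_000_000))
```

The GIS value comes out at about 460x, in line with the "~450x" quoted on the slide.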

Comparative Genomics

Resequencing

Whole Human Genome Resequencing

The first Next Gen Sequencer-based resequencing of the human genome

• Blood sample provided by Dr. James Watson in 2005

• 454 Life Sciences/Baylor Human Genome Sequencing Center (HGSC) collaboration

• Browser at http://jimwatsonsequence.cshl.edu/

• Completed May 2007

• On a GS-FLX: 2 months, US$ 1M, 78.5 Million reads, 19.7 Billion bases (6.5x oversampling).

• Cf. HGP: 10-15 years, US$ 4B.

• Only 3% of the total reads could not be mapped to the reference human genome (UCSC and Celera assemblies). Of that 3%, ~1.1% were completely unknown; the remainder were unmapped because of DNA repeats.

• Identified 177,181 InDels ranging from 3 to >7,000bp.

• 1.8 Million known (in dbSNP) SNPs observed; 200,000 novel SNPs identified

David Wheeler et al., Nature 452:872 (17 April 2008)
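As a sanity check, the run statistics on this slide imply a mean read length and a genome size. The snippet only rearranges the quoted numbers (78.5 million reads, 19.7 billion bases, 6.5x oversampling); the variable names are illustrative.

```python
total_reads = 78.5e6   # reads, as quoted on the slide
total_bases = 19.7e9   # bases, as quoted on the slide
oversampling = 6.5     # fold oversampling, as quoted on the slide

mean_read_length = total_bases / total_reads   # bases per read
implied_genome_size = total_bases / oversampling  # bases

print(f"mean read length ~{mean_read_length:.0f} bases")
print(f"implied genome size ~{implied_genome_size / 1e9:.2f} Gb")
```

Both results (roughly 250 bases per read and a genome of about 3 Gb) are consistent with GS-FLX read lengths and the size of the human genome.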

De Novo sequencing of potato BACs

Long accurate reads – few contigs

Pilot study for a member of the potato consortium

• 56 plant BACs sequenced using MIDs, and assembled

• 8 in milestone I, 48 on two LR70 runs in milestone II

• Average BAC insert size: 136 kb

• Average number of contigs >500 nt (N=56): 16.6

• Average N50 contigs (N=56): 39,808 nt

• New sequence information not sequenceable using Sanger capillary sequencer was detected

• Comparison with Sanger sequence often revealed several kb new information per BAC, because GS-FLX has no cloning bias
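For readers unfamiliar with the N50 statistic quoted above: it is the contig length at which, going from the longest contig downwards, the running total first reaches half of the assembled bases. A minimal sketch follows; the function name `n50` and the example contig lengths are made up, not data from the potato BAC study.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contigs given")


# Hypothetical contig lengths (kb) for one BAC.
print(n50([80, 70, 50, 40, 30, 20]))
```

Here the total is 290, so the half-way point (145) is reached inside the second-longest contig and the N50 is 70.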

Technological innovations

• Sequence Capture (SeqCap) microarrays

• Multiplex Identifiers (MIDs)

Sample preparation is the new bottleneck for resequencing applications

Conventional Sanger dideoxy sequencing

Sample prep by PCR and cloning

Next Generation Sequencing: ultra-high throughput

Sample prep by NimbleGen Microarray-based Sequence Capture (SeqCap)

Labor, Infrastructure, Throughput

Capture of specific DNA regions at kb to Mb scale will enable the full potential of Next Generation Sequencing to be exploited.

Targeted Resequencing

NimbleGen SeqCap arrays and GS-FLX

An easier way to do targeted resequencing

[Figure: targeted resequencing workflow]

1. Fragment

2. Hybridize to SeqCap array

3. Wash and elute, perform PCR

4. Sequence on GS-FLX

5. Analyze sequences of captured exons

Targeted Resequencing

NimbleGen SeqCap arrays and GS-FLX

Exon Capture Results

• Probes >60mer

• 1 probe every 10 bases

• Highly reproducible

• Accurately captures targets

• Mean enrichment ~378-fold

(T.J. Albert et al. Nature Methods. October 2007)
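One common way to express fold enrichment from sequencing data is the on-target read fraction divided by the fraction of the genome that was targeted. The sketch below illustrates that definition only; it is not necessarily how the cited study measured enrichment, and the function name `fold_enrichment` and all input numbers are hypothetical.

```python
def fold_enrichment(on_target_reads, total_reads, target_bp, genome_bp):
    """Observed on-target read fraction relative to the random expectation."""
    if total_reads <= 0 or target_bp <= 0 or genome_bp <= 0:
        raise ValueError("read and base counts must be positive")
    observed_fraction = on_target_reads / total_reads
    expected_fraction = target_bp / genome_bp  # if reads fell at random
    return observed_fraction / expected_fraction


# Hypothetical run: 60% of reads on a 5 Mb target in a 3 Gb genome.
print(round(fold_enrichment(600, 1000, 5_000_000, 3_000_000_000)))
```

With these made-up inputs, reads land on target 360 times more often than random sampling of the genome would predict.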

[Figure: Chr16 exon capture. Tracks: 454-sequencing reads (BLAST hits) for Replicate 1, Replicate 2 and Replicate 3; array probe positions.]

• 385,000 probes targeted ~3 Mb of 11p12 locus

• 72% of the region covered by probes (window masking of repeats)

Repetitive regions

NimbleGen Sequence Capture Array Targeting 11p12 Diabetes Locus

[Figure tracks: genomic region; targeted region; base coverage depth; mapped reads.]

High Coverage and Specificity of Sequence Capture at 11p12

[Table: capture metrics per HapMap sample; the values are not recoverable from the slide. Metrics reported: initial target bases; target bases covered; percent target bases covered; average coverage depth; median coverage depth; total reads; number of reads in target regions; percent of reads in target regions; HapMap SNPs classified correctly.]

High Coverage DEPTH

NimbleGen Sequence Capture 385K Custom Service

Step 1: Array Design. NimbleGen will design probes against regions provided by the researcher. Repetitive regions will not be covered by the design, and researchers will approve the design before Step 2 starts.

End-April 2008 (estimated)

Step 2: Sequence Capture. The researcher ships genomic DNA samples to the Roche NimbleGen Service Lab. Roche NimbleGen will manufacture the array from Step 1 and perform sequence capture on the samples. The enriched DNA will be amplified, tested for enrichment level, and shipped back to the researcher.

Researchers will provide:

• High-quality genomic DNA (human or mouse only; > 21 µg/sample). - WGA samples are currently not acceptable.

• Sequence information on regions to target in the genome - from hg18 or mm9, currently up to 5Mb max.

Researchers will get:

• Captured DNA (10 µg amplified DNA/sample), with a report on yield and level of enrichment (qPCR, referenced against control loci).

• List of regions targeted by the design and visualization software.

• User’s Guides, including how to sequence the captured DNA with GS-FLX.

Molecular Barcoding concept

MIDs (Multiplex Identifiers)

• MIDs allow barcoding of up to 12 different samples, which are mixed, amplified by emPCR and pooled into each region for sequencing; maximum possible = 16 regions × 12 samples/region = 192 samples per PTP (1000 reads per sample)

[Figure: read structure: Primer A | Key | MID | Library fragment | Primer B, with the sequencing primer annealing at Primer A.]

MID    Sequence
A      ACGAGTGCGT
B      ACGCTCGACA
C      AGACGCACTC
D      AGCACTGTAG
E      ATCAGACACG
F      CGTGTCTCTA
G      CTCGCGTGTC
H      TAGTATCAGC
I      TCTCTATGCG
J      TGATACGTCT
K      TACTGAGCTA
L      ATATCGCGAG
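Sorting pooled reads back into samples by MID can be sketched as a simple prefix lookup. This is an illustration, not the GS-FLX software: it assumes each read starts directly with its 10-base MID (any key sequence already trimmed) and accepts exact matches only; the function name `split_by_mid` and the example reads are hypothetical, while the 12 MID sequences are those listed on the slide.

```python
MIDS = {
    "A": "ACGAGTGCGT", "B": "ACGCTCGACA", "C": "AGACGCACTC",
    "D": "AGCACTGTAG", "E": "ATCAGACACG", "F": "CGTGTCTCTA",
    "G": "CTCGCGTGTC", "H": "TAGTATCAGC", "I": "TCTCTATGCG",
    "J": "TGATACGTCT", "K": "TACTGAGCTA", "L": "ATATCGCGAG",
}
MID_LENGTH = 10
LABEL_BY_SEQUENCE = {sequence: label for label, sequence in MIDS.items()}

# 16 PTP regions x 12 MIDs per region, as stated on the slide.
MAX_SAMPLES_PER_PTP = 16 * len(MIDS)


def split_by_mid(reads):
    """Group reads by their leading MID and strip the MID from each read."""
    bins = {}
    for read in reads:
        label = LABEL_BY_SEQUENCE.get(read[:MID_LENGTH], "unassigned")
        bins.setdefault(label, []).append(read[MID_LENGTH:])
    return bins


example_reads = ["ACGAGTGCGTGGGTTT", "ATATCGCGAGCCCC", "NNNNNNNNNNAAAA"]
print(split_by_mid(example_reads))
print(MAX_SAMPLES_PER_PTP)
```

Exact matching is the simplest choice; a production tool would typically tolerate sequencing errors in the barcode, which this sketch does not attempt.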

