incorporating new sequencing technologies into … talks/sean_sykes... · 280kb 300 320 340 360 380...
TRANSCRIPT
Sean M. Sykes
2007 Finishing In The FutureSanta Fe, NM06/19/2007
Incorporating New Sequencing Technologies Into Finishing Strategy
Overview
• Problems With Conventional Methods
• Roche/454 Brings Something New To The Table: Both Good And Bad
• What Can We Do About 454 Limitations?
Neurospora Finishing
39.2 Mb Finished Quality Sequence- > 20x Sequence Coverage- > 100x Physical Coverage
- 8 library types
41 Mb In Optical MapNeurospora Chromosome 2
Neurospora Gaps In “Finished” Assembly
~180 Gaps Remain- Unclonable- Low % GC
GC75%
50%
25%
280Kb 380300 320 340 360
GC Distribution (120 kb)
Does 454 Improve Neurospora Assembly?
GC75%
50%
25%
280Kb 380300 320 340 360
41.5 Mb 24x 454 AssemblyFinished Territory: 44% GC
• Spans 36 gaps• Extends 200 additional ends
454 Only Territory: 27% GCGC75%
50%
25%
280Kb 380300 320 340 360
Who Else Needs Help?
ABICoverage 8X% Genome 80%Contigs 315
Listeria monocytogenes2.9 Mb Genome
• Typical 8x draft assembly covers >= 95%• Significant cloning bias here
QuickTime™ and aTIFF (Uncompressed) decompressor
are needed to see this picture.
454 Gets Stuff ABI Misses
ABI 454Coverage 8X 23X% Genome 80% 95%Contigs 315 51
Yes indeed!
Listeria story like Neurospora?
Yes- unclonable regions sequenced by 454No- not correlated to GC content
QuickTime™ and aTIFF (Uncompressed) decompressor
are needed to see this picture.
0
5
10
15
20
25
30
35
40
45
50
g
0
2
4
6
8
10
12
14
16
18
SC
454 Captures Regions of Cloning Bias45
4 co
vera
ge
AB
I cov
erag
e
Coverage distribution (500 kb)
0
5
10
15
20
25
30
35
40
45
50
0 0 0 0 0 0 0 0 0 0 0 0 0 0
g
0
2
4
6
8
10
12
14
16
18
SC
Bias Does Not Correlate To GC
% G
C
-50
-40
-30
Coverage distribution (500 kb)
454 Coverage: A Closer Look
Human BAC 454 Coverage Across 100 bp GC Windows
0
10
20
30
40
50
60
70
80
90
100
1 4 7 10 13 16 19 22 25 28 31 34 37 40 43 46 49 52 55 58 61 64 67 70 73 76 79 82 85 88 91
% GC
0
100
200
300
400
500
600
Win
do
ws
An
aly
zed
Windows
Human BAC 454 Coverage Across 100 bp GC Windows
454 Coverage: A Closer Look
0
10
20
30
40
50
60
70
80
90
100
1 4 7 10 13 16 19 22 25 28 31 34 37 40 43 46 49 52 55 58 61 64 67 70 73 76 79 82 85 88 91
% GC
0
100
200
300
400
500
600
Win
do
ws
An
aly
zed
Avg. Coverage Windows
Human BAC 454 Coverage Across 100 bp GC Windows
Is Coverage Drop An Isolated Case?
0
25
50
75
100
125
150
175
200
225
1 5 9 13 17 21 25 29 33 37 41 45 49 53 57 61 65 69 73 77 81 85 89 93 97 101
% GC
0
500
1000
1500
2000
2500
3000
3500
Win
do
ws
An
aly
zed
454.12 454.13 454.14 454.15 454.16 454.17 All Windows
Assessing Sequence Accuracy of 454
Compare 454 assembly to finished sequence
• Base errors rare in consensus (~q40)• >95% of errors are insertion/deletion
• Majority in homopolymer runs • ~1% error rate in 5+ bp homopolymer runs
Hypothetical Broaderia Genome
• 3 Mb genome• 3000 genes • 1/10 kb Indel errors• 300 errors/genome
– 270 in coding region– frameshift in gene
9% gene models broken
Frameshifting Indels In Listeria
• Collaborators less than delighted• Validation showed only 8/41 correct
Total Percent 10403s 47 1.57J0161 130 4.33J2818 238 7.93
Gene Models Broken
Error Rates In Listeria 454 Assemblies
ABI q40
454 Assembly
Indel errors per 10kb in Listeria
0.00
0.20
0.40
0.60
0.80
1.00
1.20
1.40
1.60
1.80
10403S (24x) J0161 (29x) J2818 (24x)Genome (Cover
9% GeneModels Broken
Increased 454 Coverage Increase Accuracy?
Errors per 10kb in 10403S Assem
0
200
400
600
800
1000
1200
1400
48x 36x 24x 12xCoverage
0%
10%
20%
30%
40%
50%
60%
70%
80%
% 4
54 A
ssem
bly
Cov
ered
by
AB
I
Indel Error Rate Genome Covere
• Different Bias?• Yes
• Accuracy?• Indels break gene models
Revisiting 454 Questions
Solexa/Illumina Improve 454 Accuracy?
Solexa Accuracy
High sensitivity and specificity
Sequenced 1 lane of F11 by Solexa2.5M passing reads, 27b = ~14x coverage
Aligned all reads to H37Rv genome
Identified 98% of 800 known SNPsIdentified 95% of indelsNO false positivesMissing SNPs in regions >80%GC
Sequence Tb by SolexaCompare to finished sequence
Indel Errors per 10kb in Listeria Genom
0.00
0.20
0.40
0.60
0.80
1.00
1.20
1.40
1.60
1.80
10403S (24x) J0161 (29x) J2818 (24x)Genome (Coverage
Pre Solexa Correction After Solexa Correctio
ABI q40
Can Solexa Fix 454 Errors In Listeria?
454 Assembly
Solexa
Conclusions
• 454 captures regions of ABI cloning bias, but other biases still exist• 454 indels impact gene annotation• Solexa data correct indel errors in 454 data• Significant improvement in consensus quality for relatively small cost
Acknowledgements
Assembly Analysis Team• Sarah Young• Aaron Berlin
WGA Team• Will Brockman• Pablo Alvarez• David Jaffe
Chad NusbaumBruce Birren
Technology Development• Georgia Giannoukos• Will Lee• Sheila Fisher• Jim Meldrim• Nabil Hafez• Diana Shih• Tracey Honan• Todd Sparrow• Michael Weiand• Zac Zwirko• Nicole Stange-Thomann• Alex Melnikov• Andreas Gnirke• Pat Cahill• Robert Nicol
Computer Finishing• Mike FitzGerald• Lynne Aftuck
Genome Annotation• Chandri Yandava• Chinnapa Kodira
Scientific Project Mgt• Carsten Russ