ntEdit: scalable genome sequence polishing
TRANSCRIPT
[Figure, panels A-D: polishing results vs. coverage (10-50): # mismatches per 100 kbp (A), # indels per 100 kbp (B), time in min (C), peak memory in GB (D). Tools: Baseline, GATK, Racon, Pilon, ntEdit k=20, ntEdit k=25, ntEdit k=30, ntEdit iterative k=35,30,25.]
[Figure, panels E-H: polishing results vs. coverage (20-50): # mismatches per 100 kbp (E), # indels per 100 kbp (F), time in min (G), peak memory in GB (H). Tools: Baseline, GATK, Racon, Pilon, ntEdit k=25, ntEdit k=30, ntEdit k=35, ntEdit iterative k=40,35,30,25.]
[Figure, panels I-K: results vs. k (30-70): errors per 100 kbp by error type, indels and mismatches (I), time in min (J), peak memory in GB (K). Tools: Baseline, GATK, Racon, Pilon, ntEdit.]
                         #Edits (M)   BUSCO (%)
Base 10xG linked reads   N/A          5,670 (90.7)
+ntEdit k50i1d1          0.2          5,677 (90.8)
Base PacBio              N/A          1,248 (31.6)
+ntEdit k40i3d3          59.0         1,354 (34.3)
11 Simao, 2015 12 Koren, 2017
[Figure: BUSCOs (% of 1,440 searched) for the canu, +ntedit, +pilon and +pilon +ntedit assemblies. Bars, as complete / fragmented / missing: 96.0 / 1.3 / 2.7; 95.4 / 1.6 / 3.1; 93.1 / 2.5 / 4.4; 56.1 / 7.7 / 36.2.]
Polishing
[Figure: errors per 100 kbp (indels, mismatches) vs. Bloom filter false positive rate (0.0001 to 0.1), at 20X coverage and at 40X coverage.]
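The false positive rate of a Bloom filter depends on its bit size, the number of kmers stored and the number of hash functions. A minimal sketch of that relationship, using the standard approximation FPR ≈ (1 − e^(−hn/m))^h for n items, m bits and h hash functions, is given below. The function names and the example kmer count are assumptions made for illustration; this is not how ntHits or ntEdit size their filters internally.

```python
import math


def bloom_fpr(n_items: int, m_bits: int, n_hashes: int) -> float:
    """Approximate false positive rate of a standard Bloom filter."""
    return (1.0 - math.exp(-n_hashes * n_items / m_bits)) ** n_hashes


def bloom_bits_for_fpr(n_items: int, n_hashes: int, target_fpr: float) -> int:
    """Smallest bit size that meets target_fpr with a fixed number of hashes."""
    per_hash = target_fpr ** (1.0 / n_hashes)
    return math.ceil(-n_hashes * n_items / math.log(1.0 - per_hash))


if __name__ == "__main__":
    n = 100_000_000  # hypothetical number of distinct kmers to store
    m = bloom_bits_for_fpr(n, n_hashes=3, target_fpr=0.0005)
    print(f"{m / 8 / 1e9:.2f} GB, FPR ~ {bloom_fpr(n, m, 3):.5f}")
```

With three hash functions, a lower target FPR requires a larger filter for the same number of kmers, which is the trade-off the figure explores.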
Sequence reads (NGS reads) from a haploid or diploid DNA source are processed by ntHits into kmers, which are stored in a Bloom filter. ntEdit checks the draft sequence against this Bloom filter and outputs the edited sequence.

1. Check each word of size k (kmer) in filter.
2. Check k kmer subset (Sk) for absence. If Sk- ≥ k/x: continue to 3.
3. Permutate 3'-end base. Check k kmer* subset (Sk_alt) for presence. If Sk_alt+ ≥ k/y: apply change to sequence, resume 3.
4. Insert 3'-end positions. Check k kmer subset presence. If Sk_alt+ ≥ k/y: apply change, resume 1.
5. Delete 3'-end positions. Check k kmer subset presence. If Sk_alt+ ≥ k/y: apply change, resume 1.

*kmers with alternate 3'-end base (k_alt)

Definitions
kmer: word of length k
Sk: subset of overlapping k kmers
Sk-: subset of absent, overlapping k kmers
Sk_alt+: subset of present, overlapping, k alternate kmers
x: leniency factor 1, test for absence
y: leniency factor 2, test for presence
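Steps 1 to 3 can be sketched in Python as follows. This is a minimal, substitution-only illustration and not ntEdit's implementation: an exact Python set stands in for the Bloom filter, the insertion and deletion steps (4 and 5) are omitted, the best-supported alternate base is kept, and the function names and the x and y values are assumptions chosen for the example.

```python
from typing import Iterable, List, Set

BASES = "ACGT"


def kmers_of(seq: str, k: int) -> List[str]:
    """All overlapping words of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_kmer_set(reads: Iterable[str], k: int) -> Set[str]:
    """Stand-in for the read kmer Bloom filter (exact, no false positives)."""
    return {km for read in reads for km in kmers_of(read, k)}


def overlapping(seq: str, pos: int, k: int) -> List[str]:
    """The (at most k) kmers of seq that contain position pos."""
    first = max(0, pos - k + 1)
    last = min(pos, len(seq) - k)
    return [seq[s:s + k] for s in range(first, last + 1)]


def polish_substitutions(draft: str, kmer_set: Set[str], k: int,
                         x: float = 5.0, y: float = 9.0) -> str:
    """Substitution-only pass over draft; x and y are illustrative values."""
    seq = draft
    i = 0
    while i + k <= len(seq):
        # Step 1: a kmer found in the filter needs no editing.
        if seq[i:i + k] in kmer_set:
            i += 1
            continue
        # Step 2: count absent kmers overlapping the 3'-end base (Sk-).
        end = i + k - 1
        absent = sum(km not in kmer_set for km in overlapping(seq, end, k))
        if absent >= k / x:
            # Step 3: try each alternate 3'-end base, keep the best supported.
            best_support, best_seq = 0, None
            for base in BASES:
                if base == seq[end]:
                    continue
                cand = seq[:end] + base + seq[end + 1:]
                if cand[i:i + k] not in kmer_set:
                    continue
                # Sk_alt+: present kmers that contain the changed base.
                support = sum(km in kmer_set
                              for km in overlapping(cand, end, k))
                if support >= k / y and support > best_support:
                    best_support, best_seq = support, cand
            if best_seq is not None:
                seq = best_seq
        i += 1
    return seq


if __name__ == "__main__":
    reference = "ACGTTGCATGGACTAGCTAAGGCTTACGATCCGA"
    draft = reference[:15] + "T" + reference[16:]  # one substitution error
    kmer_set = build_kmer_set([reference], k=5)
    print(polish_substitutions(draft, kmer_set, k=5) == reference)
```

A larger x makes the absence test easier to pass (more positions are considered for editing), while a larger y makes the presence test easier to pass (more candidate edits are accepted).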
[Figure: sensitivity vs. FDR for varying leniency factors x and y; point labels give the value of y. Panels: E. coli (30x, k25), x: 4, 5, 6, 8; C. elegans (30x, k35), x: 4, 5, 12, 20; H. sapiens Chr21 (17x, k50), x: 4, 5, 20, 30.]
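For readers less familiar with the two axes, sensitivity and FDR follow their usual definitions: sensitivity = TP / (TP + FN) and FDR = FP / (TP + FP). The sketch below computes both from a set of edits made and a set of true errors. Encoding an edit as a (position, base) pair is a hypothetical simplification for illustration; the poster does not describe how edits were scored beyond the use of QUAST.

```python
from typing import Set, Tuple

Edit = Tuple[int, str]  # hypothetical encoding: (position, corrected base)


def sensitivity_fdr(made: Set[Edit], truth: Set[Edit]) -> Tuple[float, float]:
    """Return (sensitivity, FDR) for edits made against known true errors."""
    tp = len(made & truth)   # true errors that were corrected
    fp = len(made - truth)   # edits that did not correspond to a true error
    fn = len(truth - made)   # true errors left uncorrected
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    fdr = fp / (tp + fp) if tp + fp else 0.0
    return sensitivity, fdr


if __name__ == "__main__":
    truth = {(10, "A"), (25, "C"), (40, "G"), (77, "T")}
    made = {(10, "A"), (25, "C"), (40, "G"), (90, "A")}
    print(sensitivity_fdr(made, truth))
```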
4 McKenna, 2010   5 Vaser, 2017   6 Jain, 2018   7 Pendleton, 2015
*Single Molecule Sequencing draft genomes   **Time for pipeline   ***15GB RAM
[Figure: # mismatches per 100 kbp and # indels per 100 kbp vs. coverage (20-50), for c1, c2, c3, c4 and auto.]
Method
https://github.com/bcgsc/nthits https://github.com/bcgsc/ntedit
Human*
Controlled
Software
Funding
Cacao8, Beluga9, Axolotl10
www.bcgsc.ca
Tuning
Experimental
ntEdit
René Warren, Jessica Zhang, Lauren Coombe, Hamid Mohamadi, Inanç Birol
Effect of Bloom filter FPR
coverage threshold (-c)
FPR ~ 0.0005
Threshold error kmers
ntCard 2
Controlled C. elegans sequence data
Base Illumina
                    White spruce   Interior spruce
Subs. (M)           49.39          47.29
Indels (M)          1.11           1.02
Time, ntHits        3h29m          4h23m
Time, ntEdit        25m            23m
RAM (GB), ntHits    207.8          206.9
RAM (GB), ntEdit    90.2           86.1
Polish w/ 54X Illumina

Draft       Baseline BUSCO (%)   Tool        Time**   Edits (M)   BUSCO (%)
Nanopore6   5,647 (91.2)         GATK4       41h45    0.97        5,654 (91.3)
                                 ntEdit***   2h18     0.95        5,670 (91.6)
                                 Racon5      45h54    N/A         5,681 (91.7)
PacBio7     5,285 (85.4)         GATK        42h21    2.66        5,285 (85.4)
                                 ntEdit***   2h10     3.63        5,651 (91.3)
                                 Racon       40h55    N/A         5,670 (91.6)
8 Morrissey, 2019   9 Jones, 2017   10 Nowoshilow, 2018
1 Mikheenko, 2018
http://birol-lab.ca
2 Mohamadi, 2017
http://renewarren.ca
ntHits 2h, 40GB / ntEdit 5m, 12GB
Experimental
Results
(FPR)
haploid
*Values of y are indicated on the plot
ntHits ntEdit
Bloom filter <bit size
2.3 Gbp genome
20 Gbp genome
32 Gbp genome
ntHits 3h, 210GB / ntEdit 20m, 95GB
[laurasiatheria]
[tetrapoda]
< 8X Illumina
60X Illumina
diploid
RAM (GB)
Time
Net BUSCO gain over baseline: 7, 23, 34, 0, 366, 385
Baseline SMS* BUSCO%
Δ7
Δ106
Check absence
Check presence
NGS reads (e.g. Illumina)
SMS, 10xG, Illumina genome assembly gene sequence, etc.
New feature (v1.2.0)
-m option: editing mode 0-2 [default=0]
  0: best substitution, or first supported indel
  1: best substitution, or best indel
  2: best edit overall (exhaustive)
Testing
1. Copy reference (ref): Subs. 0.001, Indels 0.0001
2. Simulate: PE100, 300 bp frag., err 0.1%
3. Run: FPR ~0.0005, 3 hash fn
4. Assess: QUAST1
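The first step of the controlled experiment, introducing errors into a copy of the reference, can be sketched as follows. The rates default to the values quoted above (substitutions 0.001, indels 0.0001); the function name, the fixed random seed and the equal split between insertions and deletions are assumptions for illustration, not the procedure actually used for the poster.

```python
import random
from typing import Tuple

BASES = "ACGT"


def mutate_reference(ref: str, sub_rate: float = 0.001,
                     indel_rate: float = 0.0001,
                     seed: int = 1) -> Tuple[str, int, int]:
    """Return (mutated copy, number of substitutions, number of indels)."""
    rng = random.Random(seed)  # fixed seed keeps the example reproducible
    out = []
    n_sub = n_indel = 0
    for base in ref:
        r = rng.random()
        if r < sub_rate:
            # Substitution: replace with a different base.
            out.append(rng.choice([b for b in BASES if b != base]))
            n_sub += 1
        elif r < sub_rate + indel_rate:
            n_indel += 1
            if rng.random() < 0.5:
                continue  # deletion: drop this base
            out.append(base)
            out.append(rng.choice(BASES))  # insertion after this base
        else:
            out.append(base)
    return "".join(out), n_sub, n_indel


if __name__ == "__main__":
    reference = "ACGT" * 25000  # toy 100 kbp reference
    draft, subs, indels = mutate_reference(reference)
    print(len(draft), subs, indels)
```

Because the positions of the introduced errors are known, the edited sequence can afterwards be compared with the original reference to count remaining mismatches and indels.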
‘Haploidizing’
Spruce13
13 Warren, 2015
Canu12
+ntEdit
+pilon
+pilon +ntEdit
400 Mbp genome
Base Nanopore
k35 30 27 25 23 i5 d5 m1
ntHits 15m 4GB / ntEdit 5m <2GB
• Polish w/ 30X PE100 Illumina reads
• Assess completeness / accuracy w/ BUSCO11: single-copy gene orthologs
[embryophyta]
Acknowledgements
Controlled
Warren et al. 2019. Bioinformatics. DOI: 10.1093/bioinformatics/btz400
Reference
E. coli
3 Walker, 2014
Bloom filter
Bloom filter
Summary
Short
Linked
Long
Sealer
Assembly, Correction, Scaffolding, Gap-filling, Polishing
Illumina, SMS drafts (Nanopore/PacBio), etc.
Read Technology
https://github.com/bcgsc
Scalable solutions for genome assembly
n=6,253
n=3,950
[euarchontoglires] n=6,192
17X Illumina (pseudohap.)
FPR~0.0005 3 hash fn
C. elegans
H. sapiens chr21
*kmers with alternate 3’end base (k_alt)
Genome: cacao, beluga, human, spruce, axolotl
48 threads (250bp reads, k40)
[Figure: memory (GB) and time (hours) vs. bases (billion) and reads (billion). Labels: k50i3d3, k40i3d3.]
ntEdit k50i1d1
rate~0.0023
rate~0.0001
ntHits