AutoEditor: Automated base caller error correction tool. Slides courtesy of Pawel Gajer, Ph.D.



Page 1: AutoEditor Automated base caller error correction tool Slides courtesy of Pawel Gajer, Ph.D

AutoEditor

Automated base caller error correction tool

Slides courtesy of Pawel Gajer, Ph.D.

Page 2: AutoEditor Automated base caller error correction tool Slides courtesy of Pawel Gajer, Ph.D

AutoEditor

Base-calling in the context of a single chromatogram is hard…

but finding base-calling “mistakes” in a multiple alignment is easy.

Page 3: AutoEditor Automated base caller error correction tool Slides courtesy of Pawel Gajer, Ph.D

• Principal and secondary aims of AutoEditor
• AutoEditor as a higher level base caller
• Tiling discrepancy types
• Base caller error types
• Resolving discrepancies of the form B…B*
• Resolving discrepancies of the form *…*B
• AutoEditor statistics

Page 4: AutoEditor Automated base caller error correction tool Slides courtesy of Pawel Gajer, Ph.D

A principal goal of AutoEditor is to automatically correct a majority of tiling discrepancies, reducing human editing effort to the most problematic discrepancy types.

A tiling discrepancy is any deviation from the homogeneous coverage of a consensus base.
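
The definition above can be illustrated with a minimal sketch. It assumes the reads are already aligned into equal-length strings with `*` as the gap symbol and takes the consensus to be the most common symbol in each column; the function names are hypothetical and this is not AutoEditor's own code.

```python
from collections import Counter

GAP = "*"  # gap symbol, as used in the alignment columns on the slides


def consensus_base(column):
    """Return the most common symbol in a column (ties go to the first seen)."""
    return Counter(column).most_common(1)[0][0]


def find_discrepancies(reads):
    """Return (position, consensus, column) for every non-homogeneous column.

    `reads` is a list of equal-length, pre-aligned strings. This is a
    simplifying assumption: in a real tiling, reads cover different extents.
    """
    discrepancies = []
    for pos, column in enumerate(zip(*reads)):
        cons = consensus_base(column)
        if any(symbol != cons for symbol in column):
            discrepancies.append((pos, cons, "".join(column)))
    return discrepancies


tiling = [
    "GAATC",
    "GAATC",
    "GA*TC",
    "GAATG",
]
print(find_discrepancies(tiling))
```

Here the third column contains a gap against three As and the last column contains a G against three Cs, so both are reported; the other columns are homogeneous.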

Page 5: AutoEditor Automated base caller error correction tool Slides courtesy of Pawel Gajer, Ph.D

AutoEditor as a higher level base caller

[Flow diagram: single read trace data -> base caller -> nucleotide sequence -> tiling of reads -> tiling discrepancies and multiple read trace data -> AutoEditor -> list of corrected discrepancies]

Page 6: AutoEditor Automated base caller error correction tool Slides courtesy of Pawel Gajer, Ph.D

Other applications:

• Clear range editing (read expansion)

• SNP detection

Page 7: AutoEditor Automated base caller error correction tool Slides courtesy of Pawel Gajer, Ph.D

Clear range editing

[Flow diagram: single read quality values data -> trimming algorithm -> trimmed read; less stringently trimmed reads -> assembler -> tiling of reads -> AutoEditor]

Page 8: AutoEditor Automated base caller error correction tool Slides courtesy of Pawel Gajer, Ph.D

SNP detection

[Flow diagram: alignment data of genome 1 and alignment data of genome 2 -> combined genomes alignment data -> list of putative SNPs -> AutoEditor -> list of putative SNPs that pass AutoEditor error screening]

Page 9: AutoEditor Automated base caller error correction tool Slides courtesy of Pawel Gajer, Ph.D

Tiling discrepancy types

Single deletion:

Single insertion:

Page 10: AutoEditor Automated base caller error correction tool Slides courtesy of Pawel Gajer, Ph.D

Single insertion and single deletion are extreme cases of insertion/deletion discrepancies

A A A A
A A A *
A A * *
A * * *
* * * *

The above sequence of discrepancies can be represented schematically as an edge in a two-vertex graph:

A ----- *

Page 11: AutoEditor Automated base caller error correction tool Slides courtesy of Pawel Gajer, Ph.D

The configuration space of all tiling discrepancy types can be schematically represented as a 4-dimensional simplex with vertices A, T, C, G, and *.
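
One way to read the simplex picture, offered as an interpretation rather than something stated on the slide, is that an alignment column maps to the fractions of A, T, C, G, and * it contains. Those five fractions sum to 1, so they are a point in a 4-dimensional simplex: homogeneous columns sit at a vertex and two-symbol discrepancies such as A versus * sit on an edge. The function names below are hypothetical.

```python
from collections import Counter

SYMBOLS = ("A", "T", "C", "G", "*")  # the five vertices named on the slide


def simplex_coordinates(column):
    """Return the fraction of each symbol in `column`, ordered as SYMBOLS."""
    unknown = set(column) - set(SYMBOLS)
    if unknown:
        raise ValueError(f"unexpected symbols: {sorted(unknown)}")
    counts = Counter(column)
    return tuple(counts[s] / len(column) for s in SYMBOLS)


def symbols_present(column):
    """Number of non-zero coordinates: 1 = vertex, 2 = edge, and so on."""
    return sum(1 for x in simplex_coordinates(column) if x > 0)


for column in ("AAAA", "AAA*", "A***", "ACG*"):
    print(column, simplex_coordinates(column), symbols_present(column))
```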

Page 12: AutoEditor Automated base caller error correction tool Slides courtesy of Pawel Gajer, Ph.D

[Figure: signal curve annotated with amplitude (a), support (b), and the minimum difference between amplitude and local minimum (c).]

Open dots on the signal curve indicate local maxima and open circles indicate local minima.

Re-calling individual bases
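
The three peak parameters can be sketched on a toy trace. The slide does not define them precisely, so the definitions here are assumptions: amplitude (a) is the signal value at a local maximum, support (b) is the distance between the local minima flanking the peak, and (c) is the smaller of the two drops from the peak to those minima. The function names are hypothetical.

```python
def local_extrema(signal):
    """Return (maxima, minima): indices of strict interior local extrema."""
    maxima, minima = [], []
    for i in range(1, len(signal) - 1):
        if signal[i - 1] < signal[i] > signal[i + 1]:
            maxima.append(i)
        elif signal[i - 1] > signal[i] < signal[i + 1]:
            minima.append(i)
    return maxima, minima


def peak_parameters(signal, peak):
    """Return (a, b, c) for the local maximum at index `peak`.

    a: amplitude, the signal value at the peak.
    b: support, ASSUMED to be the distance between the flanking local minima
       (the trace ends stand in when there is no minimum on one side).
    c: the smaller drop from the peak to either flanking minimum.
    """
    _, minima = local_extrema(signal)
    left = max((m for m in minima if m < peak), default=0)
    right = min((m for m in minima if m > peak), default=len(signal) - 1)
    a = signal[peak]
    b = right - left
    c = min(a - signal[left], a - signal[right])
    return a, b, c


trace = [0, 2, 9, 3, 1, 4, 8, 7, 2, 0]  # made-up signal values
maxima, minima = local_extrema(trace)
print(maxima, minima)
print([peak_parameters(trace, p) for p in maxima])
```

A base caller working from such parameters would compare them against thresholds; the thresholds AutoEditor uses are not given in the slides.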

Page 13: AutoEditor Automated base caller error correction tool Slides courtesy of Pawel Gajer, Ph.D

Base caller error types

• Missed signal

• Signal shift

• Unresolved peaks

Page 14: AutoEditor Automated base caller error correction tool Slides courtesy of Pawel Gajer, Ph.D

Resolving a single deletion discrepancy

Compute the discrepancy’s read multiplicity, mult.

If mult = 0, check for a missed signal error.

If |mult| > 0, check for a signal shift error. If it is not a signal shift error, then it is an unresolved peaks error. To resolve it, find two other reads with well-resolved peaks over the bases with unresolved peaks.

A discrepancy’s read multiplicity is the number of bases to the right of the discrepancy position, or to the left (with a negative sign), that are equal to the consensus base covering the discrepancy.
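
The read multiplicity and the first branch of the procedure can be sketched as follows. This is a minimal illustration with hypothetical function names. It assumes the bases are counted as a consecutive run directly adjacent to the discrepancy position, and that a run on the right takes precedence over one on the left; the slides do not settle either point.

```python
def run_length(read, start, step, base):
    """Count consecutive `base` symbols from index `start`, moving by `step`."""
    count = 0
    i = start
    while 0 <= i < len(read) and read[i] == base:
        count += 1
        i += step
    return count


def read_multiplicity(read, pos, consensus):
    """Signed run of `consensus` bases adjacent to the discrepancy at `pos`.

    Positive: run to the right. Negative: run to the left.
    ASSUMPTION: the right-hand run wins when both sides have one.
    """
    right = run_length(read, pos + 1, +1, consensus)
    if right > 0:
        return right
    return -run_length(read, pos - 1, -1, consensus)


def deletion_checks(mult):
    """Order of error checks for a single deletion, following the slide."""
    if mult == 0:
        return ["missed signal"]
    return ["signal shift", "unresolved peaks"]


# A read with a gap where the consensus has an A
print(read_multiplicity("GAA*G", 3, "A"), deletion_checks(-2))
print(read_multiplicity("GT*CG", 2, "A"), deletion_checks(0))
```

The first call reproduces the case of two As to the left of the gap, which gives a multiplicity of -2.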

Page 15: AutoEditor Automated base caller error correction tool Slides courtesy of Pawel Gajer, Ph.D
Page 16: AutoEditor Automated base caller error correction tool Slides courtesy of Pawel Gajer, Ph.D

Resolving a single insertion discrepancy

Compute the discrepancy’s read multiplicity, mult.

If mult = 0, check whether the signal parameters are within allowable ranges.

If |mult| > 0, check whether the insertion base is part of |mult| + 1 well-resolved signal peaks. If not, find two other reads whose traces have exactly |mult| well-resolved signal peaks between the bases flanking the discrepancy position.
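
The decision steps for a single insertion can be written as a small function. This is a sketch under stated assumptions, not AutoEditor's implementation: the peak counts and the signal range check are passed in as precomputed inputs because the slides do not specify how they are obtained, and the function name and returned labels are hypothetical.

```python
def resolve_single_insertion(mult, signal_in_range, peaks_in_read,
                             peaks_in_other_reads):
    """Classify a single insertion discrepancy.

    mult: the discrepancy's read multiplicity.
    signal_in_range: outcome of the signal parameter range check (only used
        when mult == 0); the actual ranges are not given in the slides.
    peaks_in_read: well-resolved peaks in the run containing the inserted base.
    peaks_in_other_reads: for each other read, the number of well-resolved
        peaks between the bases flanking the discrepancy position.
    """
    m = abs(mult)
    if m == 0:
        return "insertion supported" if signal_in_range else "weak signal error"
    if peaks_in_read == m + 1:
        return "insertion supported"
    # Look for two other reads with exactly |mult| well-resolved peaks.
    witnesses = sum(1 for n in peaks_in_other_reads if n == m)
    if witnesses >= 2:
        return "unresolved peaks error"
    return "left for human editing"  # ASSUMPTION: fallback is not on the slide


print(resolve_single_insertion(-2, True, 2, [2, 2, 3]))
print(resolve_single_insertion(0, False, 0, []))
```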

Page 17: AutoEditor Automated base caller error correction tool Slides courtesy of Pawel Gajer, Ph.D

mult = 0, weak signal error

mult = -2, unresolved peaks error with two other reads with exactly 2 signal peaks between the Gs flanking AA*

Page 18: AutoEditor Automated base caller error correction tool Slides courtesy of Pawel Gajer, Ph.D

from Nov 12, 2002

Test set: the first 10 contigs of Mycoplasma arthritidis

asmbl_id  size(kb)  # corrections  # autoEdit errors  # errors in newer autoEdit
1              132            124                  3                           0
2               64             78                  4                           1
3               40             55                  3                           0
4               53             45                  2                           1
5               16             15                  0                           0
6               22             29                  1                           0
7               23             19                  0                           0
8               51             48                  1                           0
9               26             33                  1                           0
10              15             15                  0                           0
----------------------------------------------------------------------
Total:         442            461                 15                           2
                                              ~3.25%                      ~0.43%

Missed-signal (MS) and signal shift (SS) correction errors

AutoEditor version 1.1
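
The totals and the two percentages on this slide can be re-derived from the per-contig rows. The short check below copies the numbers from the slide; the variable and function names are hypothetical.

```python
# (asmbl_id, size_kb, corrections, autoedit_errors, newer_autoedit_errors),
# copied from the slide
ROWS = [
    (1, 132, 124, 3, 0),
    (2, 64, 78, 4, 1),
    (3, 40, 55, 3, 0),
    (4, 53, 45, 2, 1),
    (5, 16, 15, 0, 0),
    (6, 22, 29, 1, 0),
    (7, 23, 19, 0, 0),
    (8, 51, 48, 1, 0),
    (9, 26, 33, 1, 0),
    (10, 15, 15, 0, 0),
]


def column_total(index):
    """Sum one column of ROWS."""
    return sum(row[index] for row in ROWS)


def error_rate_percent(errors, corrections):
    """Errors as a percentage of all corrections made."""
    return 100.0 * errors / corrections


size_kb = column_total(1)
corrections = column_total(2)
errors = column_total(3)
errors_newer = column_total(4)

print(size_kb, corrections, errors, errors_newer)
print(f"{error_rate_percent(errors, corrections):.2f}%")
print(f"{error_rate_percent(errors_newer, corrections):.2f}%")
```

Both rates use the 461 corrections as the denominator, which is what makes 15 errors about 3.25% and 2 errors about 0.43%.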

Page 19: AutoEditor Automated base caller error correction tool Slides courtesy of Pawel Gajer, Ph.D

Test set: the first 10 contigs of Mycoplasma arthritidis

asmbl_id  size(in kb)  #disc  #corr  %corr
1                 132   3390   3266    96%
2                  64   2195   2142    98%
3                  40   1344   1325    99%
4                  53   1304   1242    95%
5                  16    508    487    96%
6                  22    777    757    97%
7                  23    624    613    98%
8                  51   1303   1232    95%
9                  26    783    760    97%
10                 15    437    423    97%
--------------------------------------------------------------------
Total:            442  12665  12065    95%

where
#disc is the total number of discrepancies in the given contig
#corr is the number of corrected discrepancies
%corr is the percentage of corrected discrepancies

AutoEditor version 1.2 correcting all single deletion errors

Page 20: AutoEditor Automated base caller error correction tool Slides courtesy of Pawel Gajer, Ph.D

AutoEditor accuracy

                                                               Contig
Organism                         Discrep’s  Corrected      %   Discrep’s  Corrected      %
Acidobacterium capsulatum           103539      93729  90.5%       99555      89977  90.4%
Neorickettsia sennetsu Miyayama      41408      37425  90.4%       38355      34579  90.2%
Bacillus anthracis Kruger B         317745     284503  89.5%      296222     264646  89.3%
Coxiella burnetii                   131183     117232  89.4%      118723     105562  88.9%
Dichelobacter nodosus                83804      73547  87.8%       76766      67900  88.5%
Clostridium perfringens              71928      62822  87.3%       66546      59929  90.1%
Mycoplasma capricolum                17805      15444  86.7%       16574      14584  88.0%
Brucella suis                       129870     112359  86.5%      120799     105250  87.1%
Plasmodium vivax                    783495     655642  83.7%      734298     618268  84.2%
Pseudomonas fluorescens             234264     194771  83.1%      224049     186276  83.1%
Campylobacter jejuni                 96231      79237  82.3%       88800      73940  83.3%
Fibrobacter succinogenes            243270     196150  80.6%      208790     175294  84.0%
Erwinia chrysanthemi                219370     176354  80.4%      205161     165070  80.5%
Mycobacterium smegmatis             433105     346503  80.0%      363017     309774  85.3%
Prevotella intermedia               118857      94162  79.2%      110750      87931  79.4%
Pseudomonas syringae                227887     177897  78.1%      200223     164561  82.2%
Silicibacter pomeroyi               156130     116907  74.9%      148006     112093  75.7%
Chlamydophila caviae                 50137      36972  73.7%       47875      35103  73.3%
Wolbachia sp.                        70782      51163  72.3%       57357      45401  79.2%
Burkholderia mallei                 139359      99711  71.6%      130158      94540  72.6%
Streptococcus agalactiae            152330     105878  69.5%      109821      92153  83.9%
Streptococcus pneumoniae             53566      36557  68.3%       43093      33432  77.6%
Myxococcus xanthus                   33525      21789  65.0%       33254      21699  65.3%
Dehalococcoides ethenogenes          71587      46416  64.8%       61878      42649  68.9%
Listeria monocytogenes              229172     145274  63.4%      148177     123268  83.2%
Streptococcus mitis                 157348      92377  58.7%      106172      74203  69.9%
Total                              4367697    3470821  79.5%     3854419    3198082  83.0%

Page 21: AutoEditor Automated base caller error correction tool Slides courtesy of Pawel Gajer, Ph.D

Organism                         Read length  Corrections  AE Errors
Listeria monocytogenes              37420828       145274          4
Wolbachia sp.                       11446011        51163          0
Burkholderia mallei                 47407080        99711         28
Brucella suis                       26629877       112359          2
Streptococcus agalactiae            23485615       105878          3
Coxiella burnetii                   29135115       117232         30
Campylobacter jejuni                15013845        79237         11
Chlamydophila caviae                10286694        36972          6
Dehalococcoides ethenogenes         10724521        46416         12
Neorickettsia sennetsu Miyayama      8805232        37425          0
Fibrobacter succinogenes            46463268       196150          4
Mycoplasma capricolum                9353819        15444          0
Prevotella intermedia               20084365        94162          3
Pseudomonas syringae                50369232       177897         46
Total                              346625502      1315320        149

Table 2. Comparison of AutoEditor corrections on 14 genomes to the finished sequence of those genomes.

AutoEditor accuracy
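
Table 2 gives error counts but no rates. As a reading aid, the sketch below derives the overall rate of AutoEditor errors per correction and finds the genome with the highest per-genome rate, using only the numbers in the table. The derived rates are arithmetic on those numbers, not figures reported in the slides, and the names are hypothetical.

```python
# (organism, corrections, ae_errors), copied from Table 2
TABLE_2 = [
    ("Listeria monocytogenes", 145274, 4),
    ("Wolbachia sp.", 51163, 0),
    ("Burkholderia mallei", 99711, 28),
    ("Brucella suis", 112359, 2),
    ("Streptococcus agalactiae", 105878, 3),
    ("Coxiella burnetii", 117232, 30),
    ("Campylobacter jejuni", 79237, 11),
    ("Chlamydophila caviae", 36972, 6),
    ("Dehalococcoides ethenogenes", 46416, 12),
    ("Neorickettsia sennetsu Miyayama", 37425, 0),
    ("Fibrobacter succinogenes", 196150, 4),
    ("Mycoplasma capricolum", 15444, 0),
    ("Prevotella intermedia", 94162, 3),
    ("Pseudomonas syringae", 177897, 46),
]

total_corrections = sum(corr for _, corr, _ in TABLE_2)
total_errors = sum(err for _, _, err in TABLE_2)

# Overall share of corrections that disagree with the finished sequence
overall_rate_percent = 100.0 * total_errors / total_corrections

# Genome with the highest error rate per correction
worst = max(TABLE_2, key=lambda row: row[2] / row[1])

print(total_corrections, total_errors)
print(f"overall error rate: {overall_rate_percent:.4f}%")
print(f"highest per-genome rate: {worst[0]} ({100.0 * worst[2] / worst[1]:.4f}%)")
```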