
Upload: avis-warner

Post on 02-Jan-2016

Page 1: AMOS tools for assembly validation Automatically scan an assembly to locate misassembly signatures for further analysis and correction  Load Assembly

AMOS tools for assembly validation

Automatically scan an assembly to locate misassembly signatures for further analysis and correction

Load Assembly Data into Bank
Evaluate Mate Pairs & Libraries
Evaluate Read Alignments
Evaluate Read Breakpoints
Analyze Depth of Coverage
Identify “Surrogates”
Load Misassembly Signatures into Bank

AMOS Bank

http://amos.sourceforge.net

Page 2:

Assembly QC: mate happiness

Evaluate mate “happiness” across assembly
Happy = Correct orientation and distance

Finds regions with multiple:
Compressed Mates (too close together)
Expanded Mates (too far apart)
Invalid same orientation
Invalid “outie” orientation
Missing Mates
  Linking mates (mate in a different scaffold)
  Singleton mates (mate is not in any contig)

Regions with high C/E statistic

Page 3:

Mate happiness

Excision: Skip reads between flanking repeats

Truth

Misassembly: Compressed Mates, Missing Mates

Page 4:

Mate happiness

Insertion: Additional reads between flanking repeats

Truth

Misassembly: Expanded Mates, Missing Mates

Page 5:

Mate happiness

Rearrangement: Reordering of reads

Truth: A B

Misassembly: B A (Misoriented Mates)

Note: if A and B are too far apart, mates may all be “happy”

Page 6:

Compression/Expansion (C/E) Statistic

The presence of individual compressed or expanded mates is rare but expected

Do the inserts spanning a given position differ from the rest of the library?

Flag large differences as potential misassemblies, even if each individual mate is “happy”

Compute the statistic at all positions: (Local Mean – Global Mean) / Scaling Factor

Introduced by Jim Yorke’s group at UMD
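A minimal Python sketch of the statistic follows. It assumes the scaling factor is the library standard deviation divided by the square root of the number of spanning inserts, which is what the worked examples in this deck (400 / √8) suggest. The function names and the (start, end) insert representation are hypothetical and are not the AMOS implementation.

```python
import math

def ce_statistic(local_sizes, global_mean, global_sd):
    """(local mean - global mean) / (global sd / sqrt(n)) for n spanning inserts."""
    n = len(local_sizes)
    if n == 0:
        return 0.0  # no spanning inserts, so there is nothing to compare
    local_mean = sum(local_sizes) / n
    # Assumed scaling factor: standard error of the mean for n inserts.
    return (local_mean - global_mean) / (global_sd / math.sqrt(n))

def ce_profile(inserts, positions, global_mean, global_sd):
    """C/E statistic at each position, using only the inserts that span it.

    inserts: list of (start, end) coordinates, end exclusive (hypothetical layout).
    """
    profile = []
    for pos in positions:
        sizes = [end - start for start, end in inserts if start <= pos < end]
        profile.append(ce_statistic(sizes, global_mean, global_sd))
    return profile
```

For eight spanning inserts with a local mean of 3488 in a library with mean 4000 and standard deviation 400, `ce_statistic([3488] * 8, 4000, 400)` evaluates to about -3.62.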

Page 7:

Library size variation

8 inserts: 3kb-6kb

Local Mean: 4048

C/E Stat: (4048 - 4000) / (400 / √8) = +0.33

Near 0 indicates overall happiness

[Figure: insert sizes on a 0kb, 2kb, 4kb, 6kb axis]

Page 8:

C/E statistic: Compression

8 inserts: 3.2kb-4.8kb

Local Mean: 3488

C/E Stat: (3488 - 4000) / (400 / √8) = -3.62

C/E Stat ≤ -3.0 indicates Compression

[Figure: insert sizes on a 0kb, 2kb, 4kb, 6kb axis]
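The cutoff can be applied mechanically once the statistic is known. The sketch below labels a C/E value using the -3.0 compression threshold from this slide. The symmetric +3.0 expansion cutoff and the function name `classify_ce` are assumptions for illustration only.

```python
import math

def classify_ce(stat, threshold=3.0):
    """Label a C/E value; threshold=3.0 follows the slide's compression cutoff."""
    if stat <= -threshold:
        return "compression"
    if stat >= threshold:  # assumed symmetric cutoff, not stated in the slides
        return "expansion"
    return "happy"

# Worked values from the two library examples (global mean 4000, sd 400, 8 inserts).
scale = 400 / math.sqrt(8)
happy_stat = (4048 - 4000) / scale       # close to 0
compressed_stat = (3488 - 4000) / scale  # below -3.0
print(classify_ce(happy_stat), classify_ce(compressed_stat))
```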

Page 9:

Read Alignment

Multiple reads with same conflicting base are unlikely

1x QV 30: 1/1,000 base calling error
2x QV 30: 1/1,000,000 base calling error
3x QV 30: 1/1,000,000,000 base calling error

Correlated SNPs are likely to be assembly errors, usually collapsed repeats
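The error rates above follow from the Phred definition, where a quality value QV corresponds to an error probability of 10^(-QV/10). The sketch below assumes the base calls in different reads are independent, so the probabilities multiply and the quality values add. The function names are hypothetical.

```python
def error_probability(qv):
    """Phred scale: probability that a base call with quality qv is wrong."""
    return 10 ** (-qv / 10)

def all_wrong_probability(qvs):
    """Probability that every read is miscalled at one position (independence assumed)."""
    p = 1.0
    for qv in qvs:
        p *= error_probability(qv)
    return p

def cumulative_qv(qvs):
    """Multiplying independent error probabilities is the same as summing QVs."""
    return sum(qvs)
```

Under these assumptions, three QV 30 reads give `all_wrong_probability([30, 30, 30])` of about 1e-9, and two QV 20 reads reach the cumulative QV of 40 used as a threshold.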

AMOS Tools: analyzeSNPs & clusterSNPs
Locate regions with high rate of correlated SNPs

Parameterized thresholds:
Multiple positions within 100bp sliding window
2+ conflicting reads
Cumulative QV >= 40 (1/10,000 base calling error)

[Figure: aligned reads showing A G C in six reads and C T A in five]
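The thresholds can be read as a two-stage filter: keep positions with enough conflicting reads and enough cumulative quality, then group nearby survivors. The following sketch is a simplified gap-based grouping under those assumptions. It is not the clusterSNPs algorithm, and the input tuple layout and function name are hypothetical.

```python
def cluster_snps(snps, window=100, min_reads=2, min_qv=40, min_positions=2):
    """Group strong SNP positions that lie within `window` bp of each other.

    snps: list of (position, conflicting_reads, cumulative_qv) tuples (assumed layout).
    Returns (first_position, last_position, count) for each cluster.
    """
    strong = sorted(pos for pos, reads, qv in snps
                    if reads >= min_reads and qv >= min_qv)
    clusters, current = [], []
    for pos in strong:
        # A gap of `window` or more closes the current group.
        if current and pos - current[-1] >= window:
            if len(current) >= min_positions:
                clusters.append((current[0], current[-1], len(current)))
            current = []
        current.append(pos)
    if len(current) >= min_positions:
        clusters.append((current[0], current[-1], len(current)))
    return clusters

example = [(100, 3, 90), (150, 2, 60), (180, 1, 30), (190, 2, 30),
           (500, 2, 40), (900, 2, 50), (960, 4, 120)]
print(cluster_snps(example))
```

Chaining by gap can produce clusters longer than one window, which is a deliberate simplification of a true sliding window.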

Page 10:

Read breakpoints: compression error

[Figure: “chimeric” reads and mates at the ribosomal RNA repeats, B. anthracis]

QC METHOD:

Align singleton reads to consensus assembly

Find any breakpoints shared by multiple reads
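One way to sketch the second step: treat any alignment that stops before the end of its read as a breakpoint at that contig coordinate, then count how many reads agree. The alignment tuple layout, the forward-strand assumption, the minimum overhang and the function name are all hypothetical. This is not the AMOS implementation.

```python
from collections import Counter

def shared_breakpoints(alignments, min_reads=2, min_overhang=10):
    """Contig positions where at least `min_reads` partial alignments end.

    alignments: (read_id, contig_start, contig_end, read_start, read_end, read_len),
    forward strand only, end coordinates exclusive (assumed layout).
    """
    counts = Counter()
    for read_id, c_start, c_end, r_start, r_end, r_len in alignments:
        if r_start >= min_overhang:          # left part of the read is unaligned
            counts[c_start] += 1
        if r_len - r_end >= min_overhang:    # right part of the read is unaligned
            counts[c_end] += 1
    return sorted(pos for pos, n in counts.items() if n >= min_reads)

alignments = [
    ("r1", 1000, 1400, 0, 400, 800),    # stops early at contig position 1400
    ("r2", 1100, 1400, 0, 300, 700),    # stops early at 1400
    ("r3", 1200, 1900, 0, 700, 700),    # aligns end to end
    ("r4", 1400, 1800, 350, 750, 750),  # starts late at 1400
]
print(shared_breakpoints(alignments))
```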

Page 11:

“Uncompress” by creating new repeat copy

Tandem duplication

Reference: B. anthracis Ames ‘ancestor’ strain

B. anthracis Ames Porton Down strain

Page 12:

Read Coverage

Find regions of contigs where the depth of coverage is unusually high

AMOS Tool: analyzeReadDepth

2.5x mean coverage

[Figure: A, R1 + R2, B compared with A, R1, B, R2]
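The coverage check can be sketched as a scan over a per-base depth array that reports runs above 2.5x the mean. How analyzeReadDepth actually computes the mean and reports regions is not described here, so the function below is an illustrative assumption with a hypothetical name.

```python
def high_coverage_regions(depth, factor=2.5):
    """Return (start, end) runs, end exclusive, where depth exceeds factor * mean."""
    if not depth:
        return []
    threshold = factor * (sum(depth) / len(depth))
    regions, start = [], None
    for i, d in enumerate(depth):
        if d > threshold:
            if start is None:
                start = i          # a new high-coverage run begins here
        elif start is not None:
            regions.append((start, i))
            start = None
    if start is not None:          # run extends to the end of the contig
        regions.append((start, len(depth)))
    return regions

# Toy contig: 10x coverage with a short 40x spike, as a collapsed repeat might show.
depth = [10] * 50 + [40] * 5 + [10] * 45
print(high_coverage_regions(depth))
```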

Page 13:

Hawkeye: assembly viewer and debugger

Page 14:

Launch Pad

Page 15:

Histograms & Statistics

Insert Size

GC Content

Read Length

Overall Statistics

Bird’s eye view of data and assembly quality

Page 16:

Scaffold View

a. Statistical Plots

b. Scaffold

c. Features

d. Clone inserts

e. Overview

f. Control Panel

g. Details

Page 17:

Standard Feature Types

[B] Breakpoint: Alignment ends at this position

[C] Coverage: Location of unusual mate coverage (asmQC)

[S] SNPs: Location of Correlated SNPs

[U] Unitig: Used to report location of surrogate unitigs in CA assemblies

[X] Other: All other Features

Page 18:

Insert (mate) Happiness

Both mates present:

Happy: Oriented Correctly && |Insert Size – Library.mean| <= Happy-Distance * Library.sd

Stretched: Oriented Correctly && Insert Size > Library.mean + Happy-Distance * Library.sd

Compressed: Oriented Correctly && Insert Size < Library.mean - Happy-Distance * Library.sd

Misoriented: Same or Outies

Only 1 read present:

Linking: Read’s mate is in some other scaffold

Singleton: Read’s mate is a singleton

Unmated: No mate was provided for read
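The rules for inserts with both mates present translate directly into a small classifier. In the sketch below the default Happy-Distance of 2.0 is an assumption (the slides leave it as a parameter), and the function name and arguments are hypothetical.

```python
def classify_insert(oriented_correctly, insert_size, lib_mean, lib_sd,
                    happy_distance=2.0):
    """Classify an insert whose two mates are both placed in the scaffold.

    happy_distance is the number of library standard deviations tolerated;
    2.0 is an assumed default, not a value taken from the tool.
    """
    if not oriented_correctly:
        return "Misoriented"   # same orientation or outies
    limit = happy_distance * lib_sd
    if insert_size > lib_mean + limit:
        return "Stretched"
    if insert_size < lib_mean - limit:
        return "Compressed"
    return "Happy"

# Library with mean 4000 and sd 400: the happy range is 3200 to 4800.
for size in (4500, 4900, 3100):
    print(size, classify_insert(True, size, 4000, 400))
```

Linking, Singleton and Unmated reads need placement information rather than a size test, so they are left out of this sketch.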

Page 19:

Contig View: detailed alignment of reads to contigs

Consensus & Position

Scrollable Read Tiling

Read Orientation

Discrepancy Highlight

Discrepancy Summary

Discrepancy Navigation

Contig Quick Select

Regular Expression Consensus Search

Page 20:

SNP View

SNP Sorted Reads

Polymorphism View

Zoom Out

Page 21:

SNP Barcode

SNP Sorted Reads

Colored rectangles indicate the positions and composition of the SNPs

Page 22:

Scaffold View

CE Statistic

Coverage

SNP Feature

Happy

Stretched

Compressed

Misoriented

Linking

Page 23:

Collapsed Repeat

68 Correlated SNPs

-5.5 CE Dip

Compressed Mates Cluster

Read Coverage Spike

Page 24:

Example 1: Compression in Prevotella intermedia 17 assembly, found by the CE statistic

Green inserts are <=2 standard deviations from the mean, and the orange inserts are compressed by > 2 standard deviations.

Vertical yellow line shows the most likely place of a compression misassembly.

Only one insert in this case is compressed by > 3 standard deviations

Page 25:

Example 2: Compression in Prevotella intermedia 17 assembly, found by the CE statistic

Page 26:

Fixing collapsed repeats with AMOS

Before / After

Resolved “Stitched” Contig

Original Contig

Compression Point

Patch Contig

Page 27:

Assemblies can be preserved at NCBI’s Assembly Archive

http://www.ncbi.nlm.nih.gov/Traces/assembly/assmbrowser.cgi