Assembly by alignment
Assembly by alignment
Instead of overlap-layout-consensus, we use alignment-consensus.
Alignment algorithm
• AMOScmp uses MUMmer
• MUMmer will be covered in detail by Adam Phillippy in a later lecture
• MUMmer provides very fast alignment of closely-related sequences
Assembly of a close relative
AMOScmp algorithm
• Read alignment: Each shotgun read is aligned to the reference genome using MUMmer.
• Repetitive sequences and polymorphisms between the target and the reference cause some reads to align in a non-contiguous fashion.
• We used a modified version of the Longest Increasing Subsequence (LIS) algorithm in order to generate chains of mutually consistent matches between each read and the reference.
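The chaining idea can be illustrated with a minimal sketch. This is not AMOScmp's modified LIS; it is the textbook O(n log n) longest increasing subsequence, applied to hypothetical `(read_pos, ref_pos)` match anchors. Match lengths, overlaps between matches, and reverse-strand matches are all ignored here, and the function name and input format are assumptions made for illustration.

```python
from bisect import bisect_left

def chain_matches(matches):
    """Return the longest chain of matches whose positions increase strictly
    in both the read and the reference.

    `matches` is a list of (read_pos, ref_pos) tuples (hypothetical format).
    """
    # Sort by read position; ties are ordered by descending reference position
    # so that two matches sharing a read position can never both be chained.
    ordered = sorted(matches, key=lambda m: (m[0], -m[1]))
    tails = []      # tails[k]: smallest ref_pos that ends a chain of length k+1
    tails_idx = []  # index in `ordered` of the match stored in tails[k]
    prev = [-1] * len(ordered)  # back-pointers used to recover the chain
    for i, (_, ref_pos) in enumerate(ordered):
        k = bisect_left(tails, ref_pos)
        if k == len(tails):
            tails.append(ref_pos)
            tails_idx.append(i)
        else:
            tails[k] = ref_pos
            tails_idx[k] = i
        prev[i] = tails_idx[k - 1] if k > 0 else -1
    # Walk the back-pointers from the end of the longest chain.
    chain = []
    i = tails_idx[-1] if tails_idx else -1
    while i != -1:
        chain.append(ordered[i])
        i = prev[i]
    return chain[::-1]

# The match at reference position 5000 is inconsistent with its neighbours
# (for example, a hit in another copy of a repeat) and is left out.
anchors = [(0, 100), (50, 150), (100, 5000), (150, 250), (200, 300)]
consistent_chain = chain_matches(anchors)
```

A production chainer would also weight matches by length and allow small overlaps, which is presumably part of what the "modified" version handles.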
Repeat resolution
1. Check to see if the paired-end sequence (the “mate”) is uniquely anchored in the genome. If it is, we place the read in the location that satisfies the constraints imposed by the mate-pair information.
2. If a read and its mate are both ambiguously placed, we attempt to find whether the mate-pair information allows us to place them both in the assembly. In some cases, there exists only one placement of both a read and its mate that satisfies the mate-pair constraints on distance and orientation.
3. When the first two steps leave us with more than one placement for a pair of reads, we choose at random one of the possible placements that satisfy the mate-pair constraints.
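The three rules above can be sketched as one small function. Everything concrete here is an assumption for illustration: candidate placements are `(position, strand)` tuples, mates are expected on opposite strands with the forward copy upstream, and `min_dist`/`max_dist` stand in for the library's insert-size range. The slides do not specify how AMOScmp represents or checks these constraints.

```python
import random

def consistent(read, mate, min_dist, max_dist):
    """Hypothetical mate-pair check: opposite strands, forward copy upstream,
    and separation within the allowed insert-size range."""
    (r_pos, r_strand), (m_pos, m_strand) = read, mate
    if r_strand == m_strand:
        return False
    fwd, rev = (r_pos, m_pos) if r_strand == "+" else (m_pos, r_pos)
    return min_dist <= rev - fwd <= max_dist

def place_pair(read_hits, mate_hits, min_dist, max_dist, rng=random):
    """Return (read_placement, mate_placement, rule), or None when no
    combination of placements satisfies the mate-pair constraints."""
    pairs = [(r, m) for r in read_hits for m in mate_hits
             if consistent(r, m, min_dist, max_dist)]
    if not pairs:
        return None
    if len(pairs) == 1:
        # Rule 1 if one end is uniquely anchored, rule 2 if both are ambiguous.
        anchored = len(read_hits) == 1 or len(mate_hits) == 1
        return pairs[0] + ("mate-anchored" if anchored else "unique-pair",)
    # Rule 3: several placements remain, so pick one at random.
    return rng.choice(pairs) + ("random",)

# A read with two candidate copies, rescued by a uniquely anchored mate.
result = place_pair([(1000, "+"), (50000, "+")], [(3000, "-")], 1500, 2500)
```

Passing a seeded `random.Random` instance as `rng` makes the random choice in rule 3 reproducible.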
Repeat resolution: example
• Aligned all shotgun reads from Streptococcus agalactiae 2603 to the final, finished genome
• 26,099 reads total
• 25,310 uniquely anchored in genome
• 314 placed with the help of a uniquely anchored mate
• 22 were placed as unique pairs, with neither read being unique on its own
• 442 had to be placed in a randomly chosen copy of a repeat
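To put these counts in proportion, the short calculation below turns them into percentages of the 26,099 reads. The numbers are taken from the slide; the category labels are paraphrased. Roughly 97% of reads anchor uniquely and under 2% need a random choice. The four categories sum to 26,088, so 11 reads are not accounted for in the breakdown as transcribed.

```python
total = 26_099
placed = {
    "uniquely anchored": 25_310,
    "placed via uniquely anchored mate": 314,
    "placed as unique pair": 22,
    "placed in random repeat copy": 442,
}

# Share of all reads in each category, in percent.
shares = {label: 100 * count / total for label, count in placed.items()}
# Reads that appear in the total but in none of the four categories.
unlisted = total - sum(placed.values())

for label, pct in shares.items():
    print(f"{label}: {pct:.2f}%")
print("reads not covered by the four categories:", unlisted)
```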
Read alignment: anomalies
• Reads don’t always align properly
• Certain alignment patterns are used by AMOScmp to detect differences in the new “target” genome
• Many of these can be resolved
Mapping reads to the reference genome when the target genome contains an insertion. The bottom indicates the true layout of the reads (A,B,C) along the target. The top indicates the alignment of the reads to the reference. Slanted lines depict portions of the read that do not match; in the case of read B, the entire read does not align to the reference.
The insertion in the target genome is shorter than a single read. The "bubbles" identify the portions of the two reads that do not align to the reference.
Insertion into the reference. The alignment of reads to the reference (top) indicates the presence of the insertion. Dashed lines indicate the “stretch” of the reads needed to align to the reference.
Signature of a genome rearrangement
Regions II and III from the target appear in a different order in the reference. Reads A, B, and C match the reference in disjoint locations; the dashed lines connect sections of a read that are adjacent in the target genome.
Signature of a divergent region
The gray areas are divergent: they are not recognizably similar. Portions of the reads not matching the reference are shown at an angle.
Effect of short flanking repeats on the alignment of a read to the reference in the case of an insertion in the reference. The repeat is shown in gray.
The dashed lines connect sections of read A that occur twice in the reference but once in A and in the target genome.
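The alignment signatures described in the captions above can be sketched as a simple classifier over the junctions between consecutive aligned segments of one read. This is only an illustration, not AMOScmp's detection logic. The segment format `(read_start, read_end, ref_start, ref_end)` with half-open, forward-strand coordinates, the function names, and the tolerance `tol` are all assumptions.

```python
def classify_junction(a, b, tol=10):
    """Classify the junction between two consecutive aligned segments.

    Each segment is (read_start, read_end, ref_start, ref_end); `a` precedes
    `b` in the read. `tol` absorbs small overlaps or gaps, such as those
    caused by short flanking repeats.
    """
    read_gap = b[0] - a[1]   # unaligned read bases between the segments
    ref_gap = b[2] - a[3]    # reference bases skipped between the segments
    if ref_gap < -tol:
        return "rearrangement"            # segments out of order in the reference
    if read_gap > tol and ref_gap > tol:
        return "divergent region"         # neither side aligns across the junction
    if read_gap > tol:
        return "insertion in target"      # read carries sequence absent from the reference
    if ref_gap > tol:
        return "insertion in reference"   # read must "stretch" over extra reference sequence
    return "collinear"

def classify_read(segments, tol=10):
    """Classify every junction of a read, with segments ordered by read position."""
    segs = sorted(segments)
    return [classify_junction(a, b, tol) for a, b in zip(segs, segs[1:])]

# A read whose two halves align 1,000 bases apart in the reference.
stretched = classify_read([(0, 400, 1000, 1400), (400, 800, 2400, 2800)])
```

A read that is completely unaligned, like read B in the first insertion figure, produces no segments at all and would need the evidence from neighbouring reads or its mate.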
Assembly of 1 Mb of S. agalactiae 2603
The rows correspond (top to bottom) to: CA (scratch assembly contigs created by Celera Assembler); 2603 (AMOScmp contigs created using strain 2603 as a reference); NEM (AMOScmp contigs created using strain NEM 316 as a reference); nucmer (the alignment of strain NEM 316 to strain 2603). Stacked arrows in the bottom row correspond to repeats.
Assemblies of strain 2603 produced by AMOScmp
Completeness of assembly (mapped back to finished strain 2603)
The total gap size indicates the total number of bases missing from the assembled contigs after mapping them to the finished genome. The column marked LW represents the theoretical estimate of coverage based on Lander-Waterman [19] statistics.
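For readers unfamiliar with the Lander-Waterman estimate mentioned in the caption, here is a minimal sketch of the standard formulas: with fold coverage c = N·L/G, the expected fraction of the genome covered is 1 − e^(−c), and the expected number of contigs is N·e^(−c(1−θ)), where θ is the minimum detectable overlap as a fraction of read length. The function name and the example numbers are hypothetical round values, not figures from the table on the slide.

```python
import math

def lander_waterman(num_reads, read_len, genome_len, min_overlap=0):
    """Return (coverage, covered_fraction, expected_contigs, expected_gap_bases)
    under the Lander-Waterman model of uniformly random reads."""
    c = num_reads * read_len / genome_len      # fold coverage
    theta = min_overlap / read_len             # overlap needed to join two reads
    covered_fraction = 1 - math.exp(-c)
    expected_contigs = num_reads * math.exp(-c * (1 - theta))
    expected_gap_bases = genome_len * math.exp(-c)
    return c, covered_fraction, expected_contigs, expected_gap_bases

# Hypothetical example: 8,000 reads of 500 bases from a 2 Mb genome (2x coverage).
c, covered, contigs, gap_bases = lander_waterman(8_000, 500, 2_000_000)
```

Comparing `gap_bases` with the total gap size measured against the finished genome shows how close an assembly comes to the theoretical limit for its coverage.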
Limits on comparative assembly
Fishing in the Trace Archive
• 2,772,509 reads (traces) for Drosophila ananassae
• 2,214,248 traces for D. simulans
• 2,445,065 traces for D. mojavensis
Discovery of fruit-fly bacterial endosymbionts in published data
Wolbachia pipientis is an intra-cellular bacterial endosymbiont of fruit flies (genus Drosophila) and other insects, primarily found in the reproductive organs of females. The endosymbiont is often inadvertently sequenced as part of a fruit fly genome project.
Assembly strategy
Use the completed sequence of the Wolbachia endosymbiont of Drosophila melanogaster (wMel) to extract Wolbachia reads from Drosophila shotgun data deposited in the NCBI Trace Archive.
Strategy 1
• Identify reads matching wMel with nucmer
• Assemble extracted reads with Celera Assembler
Strategy 2
• Extract and assemble reads with comparative assembler AMOScmp
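The read-extraction step in Strategy 1 amounts to keeping the reads whose alignment to wMel is good enough. The sketch below shows that filter on already-parsed alignment hits. The `Hit` record, the `select_reads` function, and both thresholds are hypothetical; the slides do not say what criteria were used, and this does not parse real nucmer output.

```python
from typing import Iterable, NamedTuple, Set

class Hit(NamedTuple):
    """One read-to-reference alignment (hypothetical record layout)."""
    read_id: str
    read_len: int
    aligned_len: int
    identity: float  # percent identity over the aligned region

def select_reads(hits: Iterable[Hit],
                 min_identity: float = 90.0,
                 min_fraction: float = 0.5) -> Set[str]:
    """Return the IDs of reads with at least one hit that is similar enough
    and covers a large enough fraction of the read."""
    keep = set()
    for h in hits:
        if h.identity >= min_identity and h.aligned_len / h.read_len >= min_fraction:
            keep.add(h.read_id)
    return keep

hits = [
    Hit("r1", 800, 780, 98.5),   # long, high-identity match: kept
    Hit("r2", 800, 120, 99.0),   # only a short fragment matches: dropped
    Hit("r3", 750, 700, 72.0),   # long but too divergent: dropped
]
wolbachia_like = select_reads(hits)
```

Loosening the thresholds recovers more divergent endosymbiont reads at the cost of pulling in more host reads.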
[Diagram: Drosophila + Wolbachia reads and the wMel reference go into AMOScmp, which produces the Wolbachia assembly]
| | wAna | wSim | wWil |
|---|---|---|---|
| Molecule length | 1,440,650 | 896,761 | 922,146 |
| # matching reads | 32,720 | 3,727 | 2,291 |
| # contigs | 464 | 388 | 485 |
| # scaffolds | 329 | 84 | |
| # genes | 1,837 | 790 | |
wAna – Wolbachia endosymbiont of D. ananassae
wSim – Wolbachia endosymbiont of D. simulans
wWil – Wolbachia endosymbiont of D. willistoni
(NOTE: D. mojavensis turned out to be an erroneous submission; D. willistoni was discovered later)