![Page 1: CSE182-L10 LW statistics/Assembly. Whole Genome Shotgun Break up the entire genome into pieces Sequence ends, and assemble using a computer LW statistics](https://reader030.vdocuments.us/reader030/viewer/2022032522/56649d6b5503460f94a49ff8/html5/thumbnails/1.jpg)
CSE182-L10
LW statistics/Assembly
![Page 2: CSE182-L10 LW statistics/Assembly. Whole Genome Shotgun Break up the entire genome into pieces Sequence ends, and assemble using a computer LW statistics](https://reader030.vdocuments.us/reader030/viewer/2022032522/56649d6b5503460f94a49ff8/html5/thumbnails/2.jpg)
Whole Genome Shotgun
• Break up the entire genome into pieces
• Sequence ends, and assemble using a computer
• LW statistics & Repeats argue against the success of such an approach
Alternative: build a roadmap of the genome, with physical clones mapped for each region. Sequence each of the clones, and put them together
![Page 3: CSE182-L10 LW statistics/Assembly. Whole Genome Shotgun Break up the entire genome into pieces Sequence ends, and assemble using a computer LW statistics](https://reader030.vdocuments.us/reader030/viewer/2022032522/56649d6b5503460f94a49ff8/html5/thumbnails/3.jpg)
Questions
• Algorithmic: How do you put the genome back together from the pieces?
• Statistical: How many pieces do you need to sequence, etc.?
– The answer to the statistical questions had already been given in the context of mapping, by Lander and Waterman.
![Page 4: CSE182-L10 LW statistics/Assembly. Whole Genome Shotgun Break up the entire genome into pieces Sequence ends, and assemble using a computer LW statistics](https://reader030.vdocuments.us/reader030/viewer/2022032522/56649d6b5503460f94a49ff8/html5/thumbnails/4.jpg)
Lander Waterman Statistics
(Figure: fragments of length L falling on a genome of length G.)
• The fragments are falling randomly on the genome.
• Overlapping fragments form islands of contiguous sequence.
• Ideally, we want one island for each chromosome. How many fragments should we sequence?
![Page 5: CSE182-L10 LW statistics/Assembly. Whole Genome Shotgun Break up the entire genome into pieces Sequence ends, and assemble using a computer LW statistics](https://reader030.vdocuments.us/reader030/viewer/2022032522/56649d6b5503460f94a49ff8/html5/thumbnails/5.jpg)
Lander Waterman Statistics
G = Genome Length
L = Fragment Length
N = Number of Fragments
T = Required Overlap
c = Coverage = LN/G
α = N/G
θ = T/L
σ = 1-θ
![Page 6: CSE182-L10 LW statistics/Assembly. Whole Genome Shotgun Break up the entire genome into pieces Sequence ends, and assemble using a computer LW statistics](https://reader030.vdocuments.us/reader030/viewer/2022032522/56649d6b5503460f94a49ff8/html5/thumbnails/6.jpg)
LW statistics: questions
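• As the coverage c increases, more and more areas of the genome are likely to be covered. Ideally, you want to see 1 island.
• Q1: What is the expected number of islands?
• Ans: $N e^{-c\sigma}$, which reduces to $N e^{-c}$ when no overlap is required ($T = 0$, so $\sigma = 1$).
• As fragments are added, the number of islands increases at first, and then gradually decreases.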
![Page 7: CSE182-L10 LW statistics/Assembly. Whole Genome Shotgun Break up the entire genome into pieces Sequence ends, and assemble using a computer LW statistics](https://reader030.vdocuments.us/reader030/viewer/2022032522/56649d6b5503460f94a49ff8/html5/thumbnails/7.jpg)
Analysis: Expected Number Islands
• Computing Expected # islands.
• Let $X_i = 1$ if an island ends at position i, $X_i = 0$ otherwise.
• Number of islands = $\sum_i X_i$
• Expected # islands = $E(\sum_i X_i) = \sum_i E(X_i)$
![Page 8: CSE182-L10 LW statistics/Assembly. Whole Genome Shotgun Break up the entire genome into pieces Sequence ends, and assemble using a computer LW statistics](https://reader030.vdocuments.us/reader030/viewer/2022032522/56649d6b5503460f94a49ff8/html5/thumbnails/8.jpg)
Prob. of an island ending at i
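• $E(X_i)$ = Prob(island ends at position i)
• = Prob(a clone began at position i−L+1 AND no clone began in the next L−T positions)

$$E(X_i) = \alpha (1-\alpha)^{L-T} \approx \alpha e^{-\alpha (L-T)} = \alpha e^{-c\sigma}$$

The last step uses $\alpha (L-T) = \frac{N}{G} L \sigma = c\sigma$.

$$\text{Expected \# islands} = \sum_i E(X_i) = G \alpha e^{-c\sigma} = N e^{-c\sigma}$$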
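As a rough numerical check of the formula $N e^{-c\sigma}$, the sketch below evaluates it for a growing number of fragments. The genome size, fragment length, and overlap values are hypothetical toy numbers, not taken from the lecture, and the function names are illustrative.

```python
import math

def lw_parameters(G, L, N, T):
    """Return coverage c = L*N/G and sigma = 1 - T/L."""
    c = L * N / G
    sigma = 1.0 - T / L
    return c, sigma

def expected_islands(G, L, N, T):
    """Expected number of islands under the Lander-Waterman model: N * exp(-c * sigma)."""
    c, sigma = lw_parameters(G, L, N, T)
    return N * math.exp(-c * sigma)

# Hypothetical toy genome: 1 Mb, 500 bp fragments, 50 bp required overlap.
G, L, T = 1_000_000, 500, 50
for N in (500, 2_000, 4_000, 10_000, 20_000):
    c, _ = lw_parameters(G, L, N, T)
    print(f"N={N:6d}  c={c:5.2f}  expected islands={expected_islands(G, L, N, T):8.1f}")
```

With these toy values the island count rises while coverage is low and falls once most fragments overlap an existing island, which is the behavior described on the earlier slide.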
![Page 9: CSE182-L10 LW statistics/Assembly. Whole Genome Shotgun Break up the entire genome into pieces Sequence ends, and assemble using a computer LW statistics](https://reader030.vdocuments.us/reader030/viewer/2022032522/56649d6b5503460f94a49ff8/html5/thumbnails/9.jpg)
LW statistics
• Pr[Island contains exactly j clones]?
• Consider an island that has already begun. With probability $e^{-c\sigma}$, the current clone is not followed by an overlapping clone, and the island ends. Therefore
• Pr[Island contains exactly j clones] = $(1 - e^{-c\sigma})^{j-1} e^{-c\sigma}$
• Expected # j-clone islands $= N e^{-c\sigma} (1 - e^{-c\sigma})^{j-1} e^{-c\sigma}$
![Page 10: CSE182-L10 LW statistics/Assembly. Whole Genome Shotgun Break up the entire genome into pieces Sequence ends, and assemble using a computer LW statistics](https://reader030.vdocuments.us/reader030/viewer/2022032522/56649d6b5503460f94a49ff8/html5/thumbnails/10.jpg)
Expected # of clones in an island
• Expected # of clones in an island = $e^{c\sigma}$
Q: How? Why do we care?
Often, at the beginning of a genome project, we do not know the length of the genome. This equation helps us determine the length.
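One way to see where $e^{c\sigma}$ comes from: the number of clones in an island is geometric with stopping probability $e^{-c\sigma}$, so its mean is the reciprocal. The sketch below checks this numerically and then inverts the relation to estimate the genome length from an observed average island size. The function names are illustrative, and the inversion is only one possible reading of the remark about determining the genome length.

```python
import math

def prob_j_clones(j, c, sigma):
    """Pr[island contains exactly j clones] = (1 - q)^(j-1) * q, with q = exp(-c*sigma)."""
    q = math.exp(-c * sigma)
    return (1.0 - q) ** (j - 1) * q

def expected_clones_per_island(c, sigma, max_j=10_000):
    """Mean of the distribution above, summed numerically (truncated at max_j)."""
    return sum(j * prob_j_clones(j, c, sigma) for j in range(1, max_j + 1))

def estimate_genome_length(mean_clones_per_island, L, N, T):
    """Assumed inversion: mean = exp(c*sigma)  =>  c = ln(mean)/sigma  =>  G = L*N/c."""
    sigma = 1.0 - T / L
    c = math.log(mean_clones_per_island) / sigma
    return L * N / c

c, sigma = 2.0, 0.9
print(expected_clones_per_island(c, sigma), math.exp(c * sigma))
print(estimate_genome_length(math.exp(c * sigma), L=500, N=4_000, T=50))
```

The second print uses the exact mean for a toy 1 Mb genome, so it simply recovers that length; with real data the observed mean would be noisy.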
![Page 11: CSE182-L10 LW statistics/Assembly. Whole Genome Shotgun Break up the entire genome into pieces Sequence ends, and assemble using a computer LW statistics](https://reader030.vdocuments.us/reader030/viewer/2022032522/56649d6b5503460f94a49ff8/html5/thumbnails/11.jpg)
Expected length of an island
$$L\left[\left(\frac{e^{c\sigma}-1}{c}\right) + (1-\sigma)\right]$$
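The expected island length can be evaluated directly from this expression. The sketch below is a minimal helper with toy inputs chosen for illustration (they are not the lecture's numbers); `math.expm1` is used only to keep the small-coverage case numerically stable.

```python
import math

def expected_island_length(L, c, sigma):
    """Expected island length: L * [ (exp(c*sigma) - 1) / c + (1 - sigma) ]."""
    return L * (math.expm1(c * sigma) / c + (1.0 - sigma))

# Toy values: 500 bp fragments, required overlap of 10% (sigma = 0.9).
for c in (0.5, 1, 2, 4, 8):
    print(f"c={c:4}  expected island length = {expected_island_length(500, c, 0.9):12.1f} bp")
```

As a sanity check, when coverage is close to zero and no overlap is required, an island is a single fragment and the expression tends to L.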
![Page 12: CSE182-L10 LW statistics/Assembly. Whole Genome Shotgun Break up the entire genome into pieces Sequence ends, and assemble using a computer LW statistics](https://reader030.vdocuments.us/reader030/viewer/2022032522/56649d6b5503460f94a49ff8/html5/thumbnails/12.jpg)
Whole Genome Sequencing & Assembly
![Page 13: CSE182-L10 LW statistics/Assembly. Whole Genome Shotgun Break up the entire genome into pieces Sequence ends, and assemble using a computer LW statistics](https://reader030.vdocuments.us/reader030/viewer/2022032522/56649d6b5503460f94a49ff8/html5/thumbnails/13.jpg)
Whole Genome Shotgun
• Break up the entire genome into pieces
• Sequence ends, and assemble using a computer
• LW statistics & Repeats argue against the success of such an approach
![Page 14: CSE182-L10 LW statistics/Assembly. Whole Genome Shotgun Break up the entire genome into pieces Sequence ends, and assemble using a computer LW statistics](https://reader030.vdocuments.us/reader030/viewer/2022032522/56649d6b5503460f94a49ff8/html5/thumbnails/14.jpg)
Assembly Basics
• Three main components:
– Overlap
– Layout
– Consensus
![Page 15: CSE182-L10 LW statistics/Assembly. Whole Genome Shotgun Break up the entire genome into pieces Sequence ends, and assemble using a computer LW statistics](https://reader030.vdocuments.us/reader030/viewer/2022032522/56649d6b5503460f94a49ff8/html5/thumbnails/15.jpg)
Overlap
• Given a pair of fragments s1 and s2, do they belong together?
• Yes, if a prefix of s2 matches a suffix of s1
• How would you compute such a match?
![Page 16: CSE182-L10 LW statistics/Assembly. Whole Genome Shotgun Break up the entire genome into pieces Sequence ends, and assemble using a computer LW statistics](https://reader030.vdocuments.us/reader030/viewer/2022032522/56649d6b5503460f94a49ff8/html5/thumbnails/16.jpg)
Overlap
• S[i,j] = optimum score of an alignment of s1[1..i] against a suffix of s2[1..j]
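• The best prefix-suffix alignment is given by:
• $\max_i S[i,n]$, where n is the length of s2. (As defined here, this scores a prefix of s1 against a suffix of s2; swap the two fragments to test the other orientation.)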
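The table S can be filled with a standard dynamic program in which skipping the start of s2 is free. The sketch below follows the definition on this slide (prefix of s1 against a suffix of s2); the scoring values and the function name are assumptions chosen for illustration.

```python
def best_prefix_suffix_overlap(s1, s2, match=1, mismatch=-1, gap=-2):
    """Return (score, i): best alignment of a prefix s1[:i] against a suffix of s2."""
    m, n = len(s1), len(s2)
    # S[i][j] = best score of aligning s1[:i] against some suffix of s2[:j].
    # Row 0 stays 0: an empty prefix of s1 matches the empty suffix at no cost.
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        S[i][0] = i * gap  # s1[:i] against an empty string: all gaps
        for j in range(1, n + 1):
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + sub,  # align s1[i] with s2[j]
                          S[i - 1][j] + gap,      # s1[i] against a gap
                          S[i][j - 1] + gap)      # s2[j] against a gap
    # The alignment must end at the last character of s2, so scan the last column.
    best_i = max(range(m + 1), key=lambda i: S[i][n])
    return S[best_i][n], best_i

score, i = best_prefix_suffix_overlap("GGTTACGA", "AAACCGGTT")
print(score, i)  # the 4-base overlap GGTT
```

This costs time proportional to the product of the two fragment lengths for every pair.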
![Page 17: CSE182-L10 LW statistics/Assembly. Whole Genome Shotgun Break up the entire genome into pieces Sequence ends, and assemble using a computer LW statistics](https://reader030.vdocuments.us/reader030/viewer/2022032522/56649d6b5503460f94a49ff8/html5/thumbnails/17.jpg)
Overlap Detection
• Compute the best prefix-suffix alignments between each pair of fragments.
• Keep the “high-scoring” ones as evidence of true overlap.
• What is the problem?
![Page 18: CSE182-L10 LW statistics/Assembly. Whole Genome Shotgun Break up the entire genome into pieces Sequence ends, and assemble using a computer LW statistics](https://reader030.vdocuments.us/reader030/viewer/2022032522/56649d6b5503460f94a49ff8/html5/thumbnails/18.jpg)
Overlap detection problem
• Consider the number of fragments. The LW statistics say that we need good coverage (c = 8 to 10) to get most of the base-pairs.
– G = 3000 Mb, L = 500
– Coverage LN/G = 10
– $N = 10 \times 3 \times 10^9 / 500 = 6 \times 10^7$
– Number of comparisons needed $= N^2 = 3.6 \times 10^{15}$
• Not good! (Only a small fraction are true overlaps)
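The arithmetic on this slide can be reproduced in a few lines. The count of unordered pairs is added here for comparison and is not on the slide.

```python
G = 3_000_000_000   # genome length, 3000 Mb
L = 500             # fragment length
c = 10              # target coverage

N = c * G // L                      # number of fragments needed for coverage c
all_pairs = N * N                   # comparisons as counted on the slide
unordered_pairs = N * (N - 1) // 2  # each unordered pair counted once

print(f"N = {N:.1e}")
print(f"N^2 = {all_pairs:.1e}")
print(f"N(N-1)/2 = {unordered_pairs:.1e}")
```

Even counting each pair once, the number of pairwise alignments is on the order of $10^{15}$.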
![Page 19: CSE182-L10 LW statistics/Assembly. Whole Genome Shotgun Break up the entire genome into pieces Sequence ends, and assemble using a computer LW statistics](https://reader030.vdocuments.us/reader030/viewer/2022032522/56649d6b5503460f94a49ff8/html5/thumbnails/19.jpg)
k-mer based overlap (Pigeonhole principle again)
• Consider a 25bp sequence.
– Expected number of occurrences in the genome
– $3 \times 10^9 \times 4^{-25} \approx 2.7 \times 10^{-6}$
• A 25-bp sequence is expected to be unique in the genome!
• Two overlapping sequences should share a 25-mer
• Two non-overlapping sequences should not!
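The expected-occurrence estimate assumes a uniformly random genome in which each of the $4^k$ possible k-mers is equally likely at every position. Under that assumption (which real genomes with repeats violate), a short sketch with illustrative function names:

```python
def expected_kmer_occurrences(genome_length, k):
    """Expected occurrences of one fixed k-mer in a uniformly random genome."""
    return genome_length * 4.0 ** (-k)

def smallest_unique_k(genome_length):
    """Smallest k whose expected occurrence count drops below 1."""
    k = 1
    while expected_kmer_occurrences(genome_length, k) >= 1.0:
        k += 1
    return k

G = 3_000_000_000
print(expected_kmer_occurrences(G, 25))
print(smallest_unique_k(G))
```

Choosing k = 25, well above the smallest such k, leaves a wide margin under this idealized model.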
![Page 20: CSE182-L10 LW statistics/Assembly. Whole Genome Shotgun Break up the entire genome into pieces Sequence ends, and assemble using a computer LW statistics](https://reader030.vdocuments.us/reader030/viewer/2022032522/56649d6b5503460f94a49ff8/html5/thumbnails/20.jpg)
Sorting k-mers
• Build a list of k-mers that appear in the sequences and their reverse complements
• Create a record with 4 entries:
– K-mer
– Sequence number
– Position in the sequence
– Reverse complementation flag
• Sort a vector of these according to k-mer
• How many records per k-mer are expected?
• If number of records exceeds threshold, discard (why?)
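The steps above can be sketched as follows: build the four-field records, sort them by k-mer, and report read pairs that share a k-mer. The threshold `max_records` and the function names are hypothetical, and discarding over-represented k-mers is shown with the commonly given rationale that they tend to come from repeats.

```python
from itertools import combinations, groupby

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def kmer_records(reads, k):
    """Yield (kmer, seq_id, pos, is_rc) for each k-mer of every read and its reverse complement."""
    for seq_id, read in enumerate(reads):
        for is_rc, seq in ((False, read), (True, reverse_complement(read))):
            for pos in range(len(seq) - k + 1):
                yield (seq[pos:pos + k], seq_id, pos, is_rc)

def candidate_pairs(reads, k, max_records=10):
    """Return the set of read-id pairs that share at least one retained k-mer."""
    records = sorted(kmer_records(reads, k))  # sorting brings equal k-mers together
    pairs = set()
    for _, group in groupby(records, key=lambda r: r[0]):
        group = list(group)
        if len(group) > max_records:
            continue  # over-represented k-mer (likely repeat): discard
        ids = sorted({r[1] for r in group})
        pairs.update(combinations(ids, 2))
    return pairs

reads = ["ACGTTGCA", "TTGCAGGA", "CCCCCCCC"]
print(candidate_pairs(reads, k=4))
```

Only the candidate pairs then need a full alignment, instead of all pairs of reads.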
![Page 21: CSE182-L10 LW statistics/Assembly. Whole Genome Shotgun Break up the entire genome into pieces Sequence ends, and assemble using a computer LW statistics](https://reader030.vdocuments.us/reader030/viewer/2022032522/56649d6b5503460f94a49ff8/html5/thumbnails/21.jpg)
Alignment module
• Coalesce k-mer hits into longer, gap-free partial alignments.
• These extended k-mer hits are saved.
• For each pair of sequences, form a directed graph.
• For each maximal path in the graph, construct an alignment.
• Refine alignment via banded DP
![Page 22: CSE182-L10 LW statistics/Assembly. Whole Genome Shotgun Break up the entire genome into pieces Sequence ends, and assemble using a computer LW statistics](https://reader030.vdocuments.us/reader030/viewer/2022032522/56649d6b5503460f94a49ff8/html5/thumbnails/22.jpg)
Problem2: Size
• Islands might simply be too small in length
• $\sigma = (1 - T/L) = (1 - 50/500) = 0.9$, c = 8.
• #Islands $= N e^{-c\sigma}$ = 45K
• Size of an island = 54K
• Not enough to make it an acceptable assembly!
• PLUS, there is the problem of Repeats, Chimerism etc.
![Page 23: CSE182-L10 LW statistics/Assembly. Whole Genome Shotgun Break up the entire genome into pieces Sequence ends, and assemble using a computer LW statistics](https://reader030.vdocuments.us/reader030/viewer/2022032522/56649d6b5503460f94a49ff8/html5/thumbnails/23.jpg)
Solution 2: Clones can have mate-pairs
• Recall that we sequence about 1000bp of the end of a clone
• If we sequenced both ends, we get extra information, particularly if we know the length of the original clone.
![Page 24: CSE182-L10 LW statistics/Assembly. Whole Genome Shotgun Break up the entire genome into pieces Sequence ends, and assemble using a computer LW statistics](https://reader030.vdocuments.us/reader030/viewer/2022032522/56649d6b5503460f94a49ff8/html5/thumbnails/24.jpg)
Mate Pairs
• Mate-pairs allow you to merge islands (contigs) into super-contigs
![Page 25: CSE182-L10 LW statistics/Assembly. Whole Genome Shotgun Break up the entire genome into pieces Sequence ends, and assemble using a computer LW statistics](https://reader030.vdocuments.us/reader030/viewer/2022032522/56649d6b5503460f94a49ff8/html5/thumbnails/25.jpg)
Super-contigs are quite large
• Make clones of truly predictable length. EX: 3 sets can be used: 2Kb, 10Kb and 50Kb. The variance in these lengths should be small.
• Use the mate-pairs to order and orient the contigs, and make super-contigs.
![Page 26: CSE182-L10 LW statistics/Assembly. Whole Genome Shotgun Break up the entire genome into pieces Sequence ends, and assemble using a computer LW statistics](https://reader030.vdocuments.us/reader030/viewer/2022032522/56649d6b5503460f94a49ff8/html5/thumbnails/26.jpg)
Whole genome shotgun
• Input:
– Shotgun sequence fragments (reads)
– Mate pairs
• Output:
– A single sequence created by consensus of overlapping reads
• First generation of assemblers did not include mate-pairs (Phrap, CAP..)
• Second generation: CA, Arachne, Euler
• We will discuss Arachne, a freely available sequence assembler (2nd generation)
![Page 27: CSE182-L10 LW statistics/Assembly. Whole Genome Shotgun Break up the entire genome into pieces Sequence ends, and assemble using a computer LW statistics](https://reader030.vdocuments.us/reader030/viewer/2022032522/56649d6b5503460f94a49ff8/html5/thumbnails/27.jpg)
Problem 3: Repeats
![Page 28: CSE182-L10 LW statistics/Assembly. Whole Genome Shotgun Break up the entire genome into pieces Sequence ends, and assemble using a computer LW statistics](https://reader030.vdocuments.us/reader030/viewer/2022032522/56649d6b5503460f94a49ff8/html5/thumbnails/28.jpg)
Repeats & Chimerisms
• 40-50% of the human genome is made up of repetitive elements.
• Repeats can cause great problems in the assembly!
• Chimerism causes a clone to be from two different parts of the genome. Can again give a completely wrong assembly
![Page 29: CSE182-L10 LW statistics/Assembly. Whole Genome Shotgun Break up the entire genome into pieces Sequence ends, and assemble using a computer LW statistics](https://reader030.vdocuments.us/reader030/viewer/2022032522/56649d6b5503460f94a49ff8/html5/thumbnails/29.jpg)
Repeat detection
• Lander Waterman strikes again!
• The expected number of clones in a repeat-containing island is MUCH larger than in a non-repeat-containing island (contig).
• Thus, every contig can be marked as Unique, or non-unique. In the first step, throw away the non-unique islands.
![Page 30: CSE182-L10 LW statistics/Assembly. Whole Genome Shotgun Break up the entire genome into pieces Sequence ends, and assemble using a computer LW statistics](https://reader030.vdocuments.us/reader030/viewer/2022032522/56649d6b5503460f94a49ff8/html5/thumbnails/30.jpg)
Detecting Repeat Contigs 1: Read Density
• Compute the log-odds ratio of two hypotheses:
• H1: The contig is from a unique region of the genome.
• H2: The contig is from a region that is repeated at least twice.
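The slide does not give the probability model behind the log-odds ratio. As one possible sketch, assume the number of reads starting in a contig is Poisson, with mean $\lambda$ under H1 and $2\lambda$ under a two-copy simplification of H2. The ratio then simplifies to $\lambda - k \ln 2$ for k observed reads. The function name and all inputs are hypothetical.

```python
import math

def unique_vs_repeat_log_odds(reads_in_contig, contig_length, total_reads, genome_length):
    """ln[ P(k | unique) / P(k | two-copy repeat) ] under an assumed Poisson read model."""
    lam = total_reads * contig_length / genome_length  # expected reads if the contig is unique
    # ln[ Poisson(k; lam) / Poisson(k; 2*lam) ] = lam - k * ln(2)
    return lam - reads_in_contig * math.log(2)

# Toy example: 10 kb contig, 1000 reads over a 1 Mb genome, so lam = 10.
print(unique_vs_repeat_log_odds(10, 10_000, 1_000, 1_000_000))  # positive: looks unique
print(unique_vs_repeat_log_odds(20, 10_000, 1_000, 1_000_000))  # negative: looks repeated
```

A contig carrying roughly twice the expected number of reads scores negative and would be flagged as a likely repeat under this model.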
![Page 31: CSE182-L10 LW statistics/Assembly. Whole Genome Shotgun Break up the entire genome into pieces Sequence ends, and assemble using a computer LW statistics](https://reader030.vdocuments.us/reader030/viewer/2022032522/56649d6b5503460f94a49ff8/html5/thumbnails/31.jpg)
Detecting Chimeric reads
• Chimeric reads: Reads that contain sequence from two genomic locations.
• Good overlaps: G(a,b) if a,b overlap with a high score
• Transitive overlap: T(a,c) if G(a,b), and G(b,c)
• Find a point x across which only transitive overlaps occur. Then x is a point of chimerism.
![Page 32: CSE182-L10 LW statistics/Assembly. Whole Genome Shotgun Break up the entire genome into pieces Sequence ends, and assemble using a computer LW statistics](https://reader030.vdocuments.us/reader030/viewer/2022032522/56649d6b5503460f94a49ff8/html5/thumbnails/32.jpg)
Contig assembly
• Reads are merged into contigs up to repeat boundaries.
• If (a,b) and (a,c) overlap, then (b,c) should overlap as well. Also,
– shift(a,c)=shift(a,b)+shift(b,c)
• Most of the contigs are unique pieces of the genome, and end at some Repeat boundary.
• Some contigs might be entirely within repeats. These must be detected
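The consistency condition on shifts can be checked mechanically. In the sketch below, `shifts[(a, b)]` is assumed to hold the offset of the start of read b relative to the start of read a; the dictionary layout, the tolerance parameter, and the function name are illustrative choices.

```python
def consistent_triple(shifts, a, b, c, tolerance=0):
    """True if all three overlaps exist and shift(a,c) == shift(a,b) + shift(b,c) within tolerance."""
    needed = [(a, b), (b, c), (a, c)]
    if not all(pair in shifts for pair in needed):
        return False  # a missing overlap already breaks the expected pattern
    return abs(shifts[(a, c)] - (shifts[(a, b)] + shifts[(b, c)])) <= tolerance

# Toy overlaps: b starts 100 bp after a, c starts 150 bp after b.
good = {("a", "b"): 100, ("b", "c"): 150, ("a", "c"): 250}
bad = {("a", "b"): 100, ("b", "c"): 150, ("a", "c"): 400}
print(consistent_triple(good, "a", "b", "c"))
print(consistent_triple(bad, "a", "b", "c"))
```

A small nonzero tolerance would allow for insertions and deletions in the reads.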
![Page 33: CSE182-L10 LW statistics/Assembly. Whole Genome Shotgun Break up the entire genome into pieces Sequence ends, and assemble using a computer LW statistics](https://reader030.vdocuments.us/reader030/viewer/2022032522/56649d6b5503460f94a49ff8/html5/thumbnails/33.jpg)
Creating Super Contigs
![Page 34: CSE182-L10 LW statistics/Assembly. Whole Genome Shotgun Break up the entire genome into pieces Sequence ends, and assemble using a computer LW statistics](https://reader030.vdocuments.us/reader030/viewer/2022032522/56649d6b5503460f94a49ff8/html5/thumbnails/34.jpg)
Supercontig assembly
• Supercontigs are built incrementally
• Initially, each contig is a supercontig.
• In each round, a pair of super-contigs is merged, until no more merges can be performed.
• Create a Priority Queue with a score for every pair of ‘mergeable supercontigs’.
– Score has two terms:
  • A reward for multiple mate-pair links
  • A penalty for distance between the links.
![Page 35: CSE182-L10 LW statistics/Assembly. Whole Genome Shotgun Break up the entire genome into pieces Sequence ends, and assemble using a computer LW statistics](https://reader030.vdocuments.us/reader030/viewer/2022032522/56649d6b5503460f94a49ff8/html5/thumbnails/35.jpg)
Supercontig merging
• Remove the top scoring pair (S1,S2) from the priority queue.
• Merge (S1,S2) to form supercontig T.
• Remove all pairs in Q containing S1 or S2
• Find all supercontigs W that share mate-pair links with T and insert (T,W) into the priority queue.
• Detect Repeated Supercontigs and remove
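The merging loop can be sketched with a binary heap as the priority queue. Everything specific here is an assumption made for illustration: the scoring formula, the way links are combined after a merge (link counts summed, smallest distance kept), the ids `T0`, `T1`, ... for merged supercontigs, and the omission of ordering, orientation, and the final repeat-detection step. Stale queue entries are skipped rather than removed, which has the same effect as deleting all pairs containing S1 or S2.

```python
import heapq

def score(num_links, distance, distance_penalty=0.001):
    """Hypothetical score: reward mate-pair links, penalize distance between them."""
    return num_links - distance_penalty * distance

def merge_supercontigs(contigs, links, min_score=0.0):
    """contigs: string ids. links: {(a, b): (num_links, distance)}. Returns member tuples."""
    members = {c: (c,) for c in contigs}  # initially each contig is a supercontig
    link = {frozenset(p): v for p, v in links.items()}
    heap = [(-score(*v), tuple(sorted(p))) for p, v in link.items()]
    heapq.heapify(heap)
    next_id = 0
    while heap:
        neg_score, (s1, s2) = heapq.heappop(heap)
        if s1 not in members or s2 not in members:
            continue  # stale entry: one side was merged earlier
        if -neg_score < min_score:
            break     # best remaining pair is not good enough
        t = f"T{next_id}"
        next_id += 1
        members[t] = members.pop(s1) + members.pop(s2)
        merged = {}   # links from the new supercontig T to each neighbour W
        for pair in list(link):
            if s1 in pair or s2 in pair:
                n, d = link.pop(pair)
                others = pair - {s1, s2}
                if others:
                    (w,) = others
                    n0, d0 = merged.get(w, (0, d))
                    merged[w] = (n0 + n, min(d0, d))
        for w, (n, d) in merged.items():
            link[frozenset((t, w))] = (n, d)
            heapq.heappush(heap, (-score(n, d), tuple(sorted((t, w)))))
    return sorted(members.values())

links = {("A", "B"): (5, 1000), ("B", "C"): (3, 2000), ("C", "D"): (1, 50000)}
print(merge_supercontigs(["A", "B", "C", "D"], links))
```

In this toy run A, B, and C end up in one supercontig, while D stays separate because its single distant link scores below the threshold.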
![Page 36: CSE182-L10 LW statistics/Assembly. Whole Genome Shotgun Break up the entire genome into pieces Sequence ends, and assemble using a computer LW statistics](https://reader030.vdocuments.us/reader030/viewer/2022032522/56649d6b5503460f94a49ff8/html5/thumbnails/36.jpg)
Repeat Supercontigs
• If the distance between two super-contigs is not correct, they are marked as Repeated
• If transitivity is not maintained, then there is a Repeat
![Page 37: CSE182-L10 LW statistics/Assembly. Whole Genome Shotgun Break up the entire genome into pieces Sequence ends, and assemble using a computer LW statistics](https://reader030.vdocuments.us/reader030/viewer/2022032522/56649d6b5503460f94a49ff8/html5/thumbnails/37.jpg)
Filling gaps in Supercontigs
![Page 38: CSE182-L10 LW statistics/Assembly. Whole Genome Shotgun Break up the entire genome into pieces Sequence ends, and assemble using a computer LW statistics](https://reader030.vdocuments.us/reader030/viewer/2022032522/56649d6b5503460f94a49ff8/html5/thumbnails/38.jpg)
Consensus Derivation
• Consensus sequence is created by converting pairwise read alignments into multiple-read alignments
![Page 39: CSE182-L10 LW statistics/Assembly. Whole Genome Shotgun Break up the entire genome into pieces Sequence ends, and assemble using a computer LW statistics](https://reader030.vdocuments.us/reader030/viewer/2022032522/56649d6b5503460f94a49ff8/html5/thumbnails/39.jpg)
Summary
• Whole genome shotgun is now routine:
– Human, Mouse, Rat, Dog, Chimpanzee..
– Many Prokaryotes (One can be sequenced in a day)
– Plant genomes: Arabidopsis, Rice
– Model organisms: Worm, Fly, Yeast
• A lot is not known about genome structure, organization and function.
– Comparative genomics offers low hanging fruit