16 assembly scs v2 - department of computer science
TRANSCRIPT
![Page 1: 16 assembly scs v2 - Department of Computer Science](https://reader031.vdocuments.us/reader031/viewer/2022012014/615959689e6f070b601118ce/html5/thumbnails/1.jpg)
Assembly & Shortest Common SuperstringBen Langmead
Department of Computer Science
Please sign guestbook (www.langmead-lab.org/teaching-materials) to tell me briefly how you are using the slides. For original Keynote files, email me ([email protected]).
![Page 2: 16 assembly scs v2 - Department of Computer Science](https://reader031.vdocuments.us/reader031/viewer/2022012014/615959689e6f070b601118ce/html5/thumbnails/2.jpg)
Input DNA
Reads Reference genome
+
Assembly
XHow do we assemble puzzle without the benefit of knowing what the finished product should look like?
(That's what the Human Genome Project had to do!)
![Page 3: 16 assembly scs v2 - Department of Computer Science](https://reader031.vdocuments.us/reader031/viewer/2022012014/615959689e6f070b601118ce/html5/thumbnails/3.jpg)
De novo shotgun assembly
![Page 4: 16 assembly scs v2 - Department of Computer Science](https://reader031.vdocuments.us/reader031/viewer/2022012014/615959689e6f070b601118ce/html5/thumbnails/4.jpg)
Assembly
Whole-genome “shotgun” sequencing first copies the input DNA:
Input: GGCGTCTATATCTCGGCTCTAGGCCCTCATTTTTT
Copy: GGCGTCTATATCTCGGCTCTAGGCCCTCATTTTTT GGCGTCTATATCTCGGCTCTAGGCCCTCATTTTTT GGCGTCTATATCTCGGCTCTAGGCCCTCATTTTTT GGCGTCTATATCTCGGCTCTAGGCCCTCATTTTTT
Fragment: GGCGTCTA TATCTCGG CTCTAGGCCCTC ATTTTTT GGC GTCTATAT CTCGGCTCTAGGCCCTCA TTTTTT GGCGTC TATATCT CGGCTCTAGGCCCT CATTTTTT GGCGTCTAT ATCTCGGCTCTAG GCCCTCA TTTTTT
Then fragments it:
“Shotgun” refers to the random fragmentation of the whole genome; like it was fired from a shotgun
![Page 5: 16 assembly scs v2 - Department of Computer Science](https://reader031.vdocuments.us/reader031/viewer/2022012014/615959689e6f070b601118ce/html5/thumbnails/5.jpg)
Assembly: Human Genome Project debate
Weber, James L., and Eugene W. Myers. "Human whole-genome shotgun sequencing." Genome Research 7.5 (1997): 401-409.
![Page 6: 16 assembly scs v2 - Department of Computer Science](https://reader031.vdocuments.us/reader031/viewer/2022012014/615959689e6f070b601118ce/html5/thumbnails/6.jpg)
Assembly: Human Genome Project debate
Weber, James L., and Eugene W. Myers. "Human whole-genome shotgun sequencing." Genome Research 7.5 (1997): 401-409.
![Page 7: 16 assembly scs v2 - Department of Computer Science](https://reader031.vdocuments.us/reader031/viewer/2022012014/615959689e6f070b601118ce/html5/thumbnails/7.jpg)
Assembly: Human Genome Project debate
Green, Philip. "Against a whole-genome shotgun." Genome Research 7.5 (1997): 410-417.
![Page 8: 16 assembly scs v2 - Department of Computer Science](https://reader031.vdocuments.us/reader031/viewer/2022012014/615959689e6f070b601118ce/html5/thumbnails/8.jpg)
Assembly: Human Genome Project debate
Green, Philip. "Against a whole-genome shotgun." Genome Research 7.5 (1997): 410-417.
![Page 9: 16 assembly scs v2 - Department of Computer Science](https://reader031.vdocuments.us/reader031/viewer/2022012014/615959689e6f070b601118ce/html5/thumbnails/9.jpg)
Assembly
Reconstruct thisFrom these
GGCGTCTATATCTCGGCTCTAGGCCCTCATTTTTT
CTAGGCCCTCAATTTTT CTCTAGGCCCTCAATTTTT GGCTCTAGGCCCTCATTTTTT CTCGGCTCTAGCCCCTCATTTT TATCTCGACTCTAGGCCCTCA TATCTCGACTCTAGGCC TCTATATCTCGGCTCTAGG GGCGTCTATATCTCG GGCGTCGATATCT GGCGTCTATATCT
![Page 10: 16 assembly scs v2 - Department of Computer Science](https://reader031.vdocuments.us/reader031/viewer/2022012014/615959689e6f070b601118ce/html5/thumbnails/10.jpg)
Assembly
Reconstruct thisFrom these
???????????????????????????????????
CTAGGCCCTCAATTTTT GGCGTCTATATCT CTCTAGGCCCTCAATTTTT TCTATATCTCGGCTCTAGG GGCTCTAGGCCCTCATTTTTT CTCGGCTCTAGCCCCTCATTTT TATCTCGACTCTAGGCCCTCA GGCGTCGATATCT TATCTCGACTCTAGGCC GGCGTCTATATCTCG
![Page 11: 16 assembly scs v2 - Department of Computer Science](https://reader031.vdocuments.us/reader031/viewer/2022012014/615959689e6f070b601118ce/html5/thumbnails/11.jpg)
GGCGTCTATATCTCGGCTCTAGGCCCTCATTTTTT
CTAGGCCCTCAATTTTT CTCTAGGCCCTCAATTTTT GGCTCTAGGCCCTCATTTTTT CTCGGCTCTAGCCCCTCATTTT TATCTCGACTCTAGGCCCTCA TATCTCGACTCTAGGCC TCTATATCTCGGCTCTAGG GGCGTCTATATCTCG GGCGTCGATATCT GGCGTCTATATCT
Coverage = 5
Coverage
![Page 12: 16 assembly scs v2 - Department of Computer Science](https://reader031.vdocuments.us/reader031/viewer/2022012014/615959689e6f070b601118ce/html5/thumbnails/12.jpg)
GGCGTCTATATCTCGGCTCTAGGCCCTCATTTTTT
CTAGGCCCTCAATTTTT CTCTAGGCCCTCAATTTTT GGCTCTAGGCCCTCATTTTTT CTCGGCTCTAGCCCCTCATTTT TATCTCGACTCTAGGCCCTCA TATCTCGACTCTAGGCC TCTATATCTCGGCTCTAGG GGCGTCTATATCTCG GGCGTCGATATCT GGCGTCTATATCT
Coverage = 5
Coverage
![Page 13: 16 assembly scs v2 - Department of Computer Science](https://reader031.vdocuments.us/reader031/viewer/2022012014/615959689e6f070b601118ce/html5/thumbnails/13.jpg)
GGCGTCTATATCTCGGCTCTAGGCCCTCATTTTTT
CTAGGCCCTCAATTTTT CTCTAGGCCCTCAATTTTT GGCTCTAGGCCCTCATTTTTT CTCGGCTCTAGCCCCTCATTTT TATCTCGACTCTAGGCCCTCA TATCTCGACTCTAGGCC TCTATATCTCGGCTCTAGG GGCGTCTATATCTCG GGCGTCGATATCT GGCGTCTATATCT 35 bases
177 bases
Average coverage = 177 / 35 ≈ 5-fold
![Page 14: 16 assembly scs v2 - Department of Computer Science](https://reader031.vdocuments.us/reader031/viewer/2022012014/615959689e6f070b601118ce/html5/thumbnails/14.jpg)
TCTATATCTCGGCTCTAGG
TATCTCGACTCTAGGCC
![Page 15: 16 assembly scs v2 - Department of Computer Science](https://reader031.vdocuments.us/reader031/viewer/2022012014/615959689e6f070b601118ce/html5/thumbnails/15.jpg)
TCTATATCTCGGCTCTAGG ||||||| ||||||| TATCTCGACTCTAGGCC
![Page 16: 16 assembly scs v2 - Department of Computer Science](https://reader031.vdocuments.us/reader031/viewer/2022012014/615959689e6f070b601118ce/html5/thumbnails/16.jpg)
First law of assembly
If a suffix of read A is similar to a prefix of read B...
...then A and B might overlap in the genome
TCTATATCTCGGCTCTAGG ||||||| ||||||| TATCTCGACTCTAGGCC
TCTATATCTCGGCTCTAGG GGCGTCTATATCTCGGCTCTAGGCCCTCATTTTTT TATCTCGACTCTAGGCC
![Page 17: 16 assembly scs v2 - Department of Computer Science](https://reader031.vdocuments.us/reader031/viewer/2022012014/615959689e6f070b601118ce/html5/thumbnails/17.jpg)
TCTATATCTCGGCTCTAGG ||||||| ||||||| TATCTCGACTCTAGGCC
Why the differences?
1. Sequencing errors 2. Ploidy: e.g. humans have 2 copies of each
chromosome, and copies can differ
![Page 18: 16 assembly scs v2 - Department of Computer Science](https://reader031.vdocuments.us/reader031/viewer/2022012014/615959689e6f070b601118ce/html5/thumbnails/18.jpg)
Second law of assembly
More coverage leads to more and longer overlaps
GGCGTCTATATCTCGGCTCTAGGCCCTCATTTTTT
CTAGGCCCTCAATTTTT CTCGGCTCTAGCCCCTCATTTT TCTATATCTCGGCTCTAGG GGCGTCGATATCT
CTAGGCCCTCAATTTTT GGCTCTAGGCCCTCATTTTTT CTCGGCTCTAGCCCCTCATTTT TATCTCGACTCTAGGCCCTCA TCTATATCTCGGCTCTAGG GGCGTCTATATCTCG GGCGTCTATATCT
less coverage
more coverage
![Page 19: 16 assembly scs v2 - Department of Computer Science](https://reader031.vdocuments.us/reader031/viewer/2022012014/615959689e6f070b601118ce/html5/thumbnails/19.jpg)
TCTATATCTCGGCTCTAGG ||||||| ||||||| TATCTCGACTCTAGGCC
![Page 20: 16 assembly scs v2 - Department of Computer Science](https://reader031.vdocuments.us/reader031/viewer/2022012014/615959689e6f070b601118ce/html5/thumbnails/20.jpg)
TCTATATCTCGGCTCTAGG ||||||| ||||||| TATCTCGACTCTAGGCC
TATCTCGACTCTAGGCC |||| |||||| || CTCGGCTCTAGCCCCTCAT
![Page 21: 16 assembly scs v2 - Department of Computer Science](https://reader031.vdocuments.us/reader031/viewer/2022012014/615959689e6f070b601118ce/html5/thumbnails/21.jpg)
Directed graph
PoloniusHamlet
NodeEdge
![Page 22: 16 assembly scs v2 - Department of Computer Science](https://reader031.vdocuments.us/reader031/viewer/2022012014/615959689e6f070b601118ce/html5/thumbnails/22.jpg)
Directed graph
Polonius
Ophelia
Laertes
Hamlet
GertrudeKing
Hamlet
Claudius
![Page 23: 16 assembly scs v2 - Department of Computer Science](https://reader031.vdocuments.us/reader031/viewer/2022012014/615959689e6f070b601118ce/html5/thumbnails/23.jpg)
Overlap graph
Each node is a read
Draw edge A -> B when suffix of A overlaps prefix of B
CTCGGCTCTAGCCCCTCATTTT
CTCGGCTCTAGCCCCTCATTTT
GGCTCTAGGCCCTCATTTTTT
![Page 24: 16 assembly scs v2 - Department of Computer Science](https://reader031.vdocuments.us/reader031/viewer/2022012014/615959689e6f070b601118ce/html5/thumbnails/24.jpg)
Overlap graph
Nodes: all 6-mers from GTACGTACGAT
Edges: overlaps of length ≥4
TACGAT
ACGTAC
GTACGT
4
CGTACG5
GTACGA
4
TACGTA54 4
5
4
4
5
5
5
![Page 25: 16 assembly scs v2 - Department of Computer Science](https://reader031.vdocuments.us/reader031/viewer/2022012014/615959689e6f070b601118ce/html5/thumbnails/25.jpg)
Overlap graph
Nodes: all 6-mers from GTACGTACGAT
Edges: overlaps of length ≥4
TACGAT
ACGTAC
GTACGT
4
CGTACG5
GTACGA
4
TACGTA54 4
5
4
4
5
5
5
![Page 26: 16 assembly scs v2 - Department of Computer Science](https://reader031.vdocuments.us/reader031/viewer/2022012014/615959689e6f070b601118ce/html5/thumbnails/26.jpg)
Shortest common superstring
Given set of strings S, find SCS(S): shortest string containing the strings in S as substrings
BAA AAB BBA ABA ABB BBB AAA BABS:
BAAAABBBAABAABBBBBAAABABConcat(S):24
SCS(S): AAABBBABAA10
![Page 27: 16 assembly scs v2 - Department of Computer Science](https://reader031.vdocuments.us/reader031/viewer/2022012014/615959689e6f070b601118ce/html5/thumbnails/27.jpg)
TACGAT
ACGTAC
GTACGT
4
CGTACG5
GTACGA
4
TACGTA54 4
5
4
4
5
5
5
>>> scs(['GTACGT', 'TACGTA', 'ACGTAC', 'CGTACG', 'GTACGA', 'TACGAT']) 'GTACGTACGAT'
Reads: all 6-mers from GTACGTACGAT
![Page 28: 16 assembly scs v2 - Department of Computer Science](https://reader031.vdocuments.us/reader031/viewer/2022012014/615959689e6f070b601118ce/html5/thumbnails/28.jpg)
Shortest common superstring
NP-complete: no efficient algorithms for large inputs
![Page 29: 16 assembly scs v2 - Department of Computer Science](https://reader031.vdocuments.us/reader031/viewer/2022012014/615959689e6f070b601118ce/html5/thumbnails/29.jpg)
Idea: pick order for strings in S and construct superstring
AAA AAB ABA ABB BAA BAB BBA BBBorder 1:
AAA
![Page 30: 16 assembly scs v2 - Department of Computer Science](https://reader031.vdocuments.us/reader031/viewer/2022012014/615959689e6f070b601118ce/html5/thumbnails/30.jpg)
AAA AAB ABA ABB BAA BAB BBA BBBorder 1:
AAAB
Idea: pick order for strings in S and construct superstring
![Page 31: 16 assembly scs v2 - Department of Computer Science](https://reader031.vdocuments.us/reader031/viewer/2022012014/615959689e6f070b601118ce/html5/thumbnails/31.jpg)
AAA AAB ABA ABB BAA BAB BBA BBBorder 1:
AAABA
Idea: pick order for strings in S and construct superstring
![Page 32: 16 assembly scs v2 - Department of Computer Science](https://reader031.vdocuments.us/reader031/viewer/2022012014/615959689e6f070b601118ce/html5/thumbnails/32.jpg)
AAA AAB ABA ABB BAA BAB BBA BBBorder 1:
AAABABB
Idea: pick order for strings in S and construct superstring
![Page 33: 16 assembly scs v2 - Department of Computer Science](https://reader031.vdocuments.us/reader031/viewer/2022012014/615959689e6f070b601118ce/html5/thumbnails/33.jpg)
AAA AAB ABA ABB BAA BAB BBA BBBorder 1:
AAABABBAABABBABBB
Idea: pick order for strings in S and construct superstring
superstring 1
![Page 34: 16 assembly scs v2 - Department of Computer Science](https://reader031.vdocuments.us/reader031/viewer/2022012014/615959689e6f070b601118ce/html5/thumbnails/34.jpg)
AAA AAB ABA ABB BAA BAB BBA BBBorder 1:
AAABABBAABABBABBB
Idea: pick order for strings in S and construct superstring
AAA AAB ABA BAB ABB BBB BAA BBAorder 2:
AAABABBBAABBA
superstring 1
superstring 2
Try all possible orderings and pick shortest superstring
If S contains n strings, n ! (n factorial) orderings possible
![Page 35: 16 assembly scs v2 - Department of Computer Science](https://reader031.vdocuments.us/reader031/viewer/2022012014/615959689e6f070b601118ce/html5/thumbnails/35.jpg)
AAA AAB ABA ABB BAA BAB BBA BBBorder 1:
AAABABBAABABBABBB
AAA AAB ABA BAB ABB BBB BAA BBAorder 2:
AAABABBBAABBA
superstring 1
superstring 2
If S contains n strings, n ! (n factorial) orderings possible
![Page 36: 16 assembly scs v2 - Department of Computer Science](https://reader031.vdocuments.us/reader031/viewer/2022012014/615959689e6f070b601118ce/html5/thumbnails/36.jpg)
Greedy shortest common superstring
AAB
ABB
BBABBB
AAA
2
1112
1
2
2 1
12
![Page 37: 16 assembly scs v2 - Department of Computer Science](https://reader031.vdocuments.us/reader031/viewer/2022012014/615959689e6f070b601118ce/html5/thumbnails/37.jpg)
Greedy shortest common superstring
AAB
ABB
BBABBB
AAA
2
1112
1
2
2 1
12
![Page 38: 16 assembly scs v2 - Department of Computer Science](https://reader031.vdocuments.us/reader031/viewer/2022012014/615959689e6f070b601118ce/html5/thumbnails/38.jpg)
Greedy shortest common superstring
ABB
BBABBB
AAAB 2
11 1
2
2 1
2
![Page 39: 16 assembly scs v2 - Department of Computer Science](https://reader031.vdocuments.us/reader031/viewer/2022012014/615959689e6f070b601118ce/html5/thumbnails/39.jpg)
Greedy shortest common superstring
ABB
BBABBB
AAAB 2
11 1
2
2 1
2
![Page 40: 16 assembly scs v2 - Department of Computer Science](https://reader031.vdocuments.us/reader031/viewer/2022012014/615959689e6f070b601118ce/html5/thumbnails/40.jpg)
Greedy shortest common superstring
BBAABBB
AAAB
21
1
2
1
![Page 41: 16 assembly scs v2 - Department of Computer Science](https://reader031.vdocuments.us/reader031/viewer/2022012014/615959689e6f070b601118ce/html5/thumbnails/41.jpg)
Greedy shortest common superstring
BBAABBB
AAAB
21
1
2
1
![Page 42: 16 assembly scs v2 - Department of Computer Science](https://reader031.vdocuments.us/reader031/viewer/2022012014/615959689e6f070b601118ce/html5/thumbnails/42.jpg)
Greedy shortest common superstring
ABBBA
AAAB
21
![Page 43: 16 assembly scs v2 - Department of Computer Science](https://reader031.vdocuments.us/reader031/viewer/2022012014/615959689e6f070b601118ce/html5/thumbnails/43.jpg)
Greedy shortest common superstring
ABBBA
AAAB
21
![Page 44: 16 assembly scs v2 - Department of Computer Science](https://reader031.vdocuments.us/reader031/viewer/2022012014/615959689e6f070b601118ce/html5/thumbnails/44.jpg)
Greedy shortest common superstring
AAABBBA superstring, length=7
![Page 45: 16 assembly scs v2 - Department of Computer Science](https://reader031.vdocuments.us/reader031/viewer/2022012014/615959689e6f070b601118ce/html5/thumbnails/45.jpg)
Shortest common superstring: greedy
Greedy-SCS: in each round, merge pair of strings with maximal overlap. Stop when there’s 1 string left. l = minimum overlap.
AAA AAB ABB BBB BBA Input strings
Algorithm in action (l = 1):
AAB
ABB
BBABBB
AAA
2
1112
1
2
2 1
12
![Page 46: 16 assembly scs v2 - Department of Computer Science](https://reader031.vdocuments.us/reader031/viewer/2022012014/615959689e6f070b601118ce/html5/thumbnails/46.jpg)
Shortest common superstring: greedy
Greedy-SCS: in each round, merge pair of strings with maximal overlap. Stop when there’s 1 string left. l = minimum overlap.
Input strings
Algorithm in action (l = 1):
AAB
ABB
BBABBB
AAA
2
1112
1
2
2 1
12
AAA AAB ABB BBB BBA AAA AAB ABB BBB BBA
![Page 47: 16 assembly scs v2 - Department of Computer Science](https://reader031.vdocuments.us/reader031/viewer/2022012014/615959689e6f070b601118ce/html5/thumbnails/47.jpg)
Shortest common superstring: greedy
Greedy-SCS: in each round, merge pair of strings with maximal overlap. Stop when there’s 1 string left. l = minimum overlap.
Input strings
Algorithm in action (l = 1):
ABB
BBABBB
AAAB
111
2
2 1
22
AAA AAB ABB BBB BBA AAA AAB ABB BBB BBA AAAB ABB BBB BBA
![Page 48: 16 assembly scs v2 - Department of Computer Science](https://reader031.vdocuments.us/reader031/viewer/2022012014/615959689e6f070b601118ce/html5/thumbnails/48.jpg)
Shortest common superstring: greedy
Greedy-SCS: in each round, merge pair of strings with maximal overlap. Stop when there’s 1 string left. l = minimum overlap.
Input strings
Algorithm in action (l = 1):
ABB
BBABBB
AAAB
111
2
2 1
22
AAA AAB ABB BBB BBA AAA AAB ABB BBB BBA AAAB ABB BBB BBA
![Page 49: 16 assembly scs v2 - Department of Computer Science](https://reader031.vdocuments.us/reader031/viewer/2022012014/615959689e6f070b601118ce/html5/thumbnails/49.jpg)
Shortest common superstring: greedy
Greedy-SCS: in each round, merge pair of strings with maximal overlap. Stop when there’s 1 string left. l = minimum overlap.
Input strings
Algorithm in action (l = 1):
ABB
BBBA
AAAB
1
2
2
AAA AAB ABB BBB BBA AAA AAB ABB BBB BBA AAAB ABB BBB BBA AAAB BBBA ABB
11
![Page 50: 16 assembly scs v2 - Department of Computer Science](https://reader031.vdocuments.us/reader031/viewer/2022012014/615959689e6f070b601118ce/html5/thumbnails/50.jpg)
Shortest common superstring: greedy
Greedy-SCS: in each round, merge pair of strings with maximal overlap. Stop when there’s 1 string left. l = minimum overlap.
Input strings
Algorithm in action (l = 1):
ABB
BBBA
AAAB
1
2
2
AAA AAB ABB BBB BBA AAA AAB ABB BBB BBA AAAB ABB BBB BBA AAAB BBBA ABB
11
![Page 51: 16 assembly scs v2 - Department of Computer Science](https://reader031.vdocuments.us/reader031/viewer/2022012014/615959689e6f070b601118ce/html5/thumbnails/51.jpg)
Shortest common superstring: greedy
Greedy-SCS: in each round, merge pair of strings with maximal overlap. Stop when there’s 1 string left. l = minimum overlap.
Input strings
Algorithm in action (l = 1):
BBBA
AAABB
2
AAA AAB ABB BBB BBA AAA AAB ABB BBB BBA AAAB ABB BBB BBA AAAB BBBA ABB AAABB BBBA
1
![Page 52: 16 assembly scs v2 - Department of Computer Science](https://reader031.vdocuments.us/reader031/viewer/2022012014/615959689e6f070b601118ce/html5/thumbnails/52.jpg)
Shortest common superstring: greedy
Greedy-SCS: in each round, merge pair of strings with maximal overlap. Stop when there’s 1 string left. l = minimum overlap.
Input strings
Algorithm in action (l = 1):
AAA AAB ABB BBB BBA AAA AAB ABB BBB BBA AAAB ABB BBB BBA AAAB BBBA ABB AAABB BBBA AAABBBA
That’s the SCS
AAABBBA
![Page 53: 16 assembly scs v2 - Department of Computer Science](https://reader031.vdocuments.us/reader031/viewer/2022012014/615959689e6f070b601118ce/html5/thumbnails/53.jpg)
Greedy shortest common superstring
AAA AAB ABB BBA BBB
AAAB ABB BBA BBB
![Page 54: 16 assembly scs v2 - Department of Computer Science](https://reader031.vdocuments.us/reader031/viewer/2022012014/615959689e6f070b601118ce/html5/thumbnails/54.jpg)
Greedy shortest common superstring
AAA AAB ABB BBA BBB
AAAB ABB BBA BBB
AAAB ABBA BBB
![Page 55: 16 assembly scs v2 - Department of Computer Science](https://reader031.vdocuments.us/reader031/viewer/2022012014/615959689e6f070b601118ce/html5/thumbnails/55.jpg)
Greedy shortest common superstring
AAA AAB ABB BBA BBB
AAAB ABB BBA BBB
AAAB ABBA BBB
AAABBA BBB
![Page 56: 16 assembly scs v2 - Department of Computer Science](https://reader031.vdocuments.us/reader031/viewer/2022012014/615959689e6f070b601118ce/html5/thumbnails/56.jpg)
Greedy shortest common superstring
AAA AAB ABB BBA BBB
AAAB ABB BBA BBB
AAAB ABBA BBB
AAABBA BBB
AAABBABBB superstring, length=9
![Page 57: 16 assembly scs v2 - Department of Computer Science](https://reader031.vdocuments.us/reader031/viewer/2022012014/615959689e6f070b601118ce/html5/thumbnails/57.jpg)
Greedy shortest common superstring
AAA AAB ABB BBA BBB
AAAB ABB BBA BBB
AAAB ABBA BBB
AAABBA BBB
AAABBABBB superstring, length=9
AAABBBA superstring, length=7
Greedy answer isn't necessarily optimal
![Page 58: 16 assembly scs v2 - Department of Computer Science](https://reader031.vdocuments.us/reader031/viewer/2022012014/615959689e6f070b601118ce/html5/thumbnails/58.jpg)
Shortest common superstring: greedy
Greedy-SCS assembling all substrings of length 6 from: a_long_long_long_time. l = 3.
ng_lon _long_ a_long long_l ong_ti ong_lo long_t g_long g_time ng_tim ng_time ng_lon _long_ a_long long_l ong_ti ong_lo long_t g_long ng_time g_long_ ng_lon a_long long_l ong_ti ong_lo long_t ng_time long_ti g_long_ ng_lon a_long long_l ong_lo ng_time ong_lon long_ti g_long_ a_long long_l ong_lon long_time g_long_ a_long long_l long_lon long_time g_long_ a_long long_lon g_long_time a_long long_long_time a_long a_long_long_time
Foiled by repeat!
![Page 59: 16 assembly scs v2 - Department of Computer Science](https://reader031.vdocuments.us/reader031/viewer/2022012014/615959689e6f070b601118ce/html5/thumbnails/59.jpg)
Shortest common superstring: greedy
Same example, but increased the substring length from 6 to 8
long_lon ng_long_ _long_lo g_long_t ong_long g_long_l ong_time a_long_l _long_ti long_tim long_time long_lon ng_long_ _long_lo g_long_t ong_long g_long_l a_long_l _long_ti _long_time long_lon ng_long_ _long_lo g_long_t ong_long g_long_l a_long_l _long_time a_long_lo long_lon ng_long_ g_long_t ong_long g_long_l _long_time ong_long_ a_long_lo long_lon g_long_t g_long_l g_long_time ong_long_ a_long_lo long_lon g_long_l g_long_time ong_long_ a_long_lon g_long_l g_long_time ong_long_l a_long_lon g_long_time a_long_long_l a_long_long_long_time a_long_long_long_time
Got the whole thing: a_long_long_long_time
![Page 60: 16 assembly scs v2 - Department of Computer Science](https://reader031.vdocuments.us/reader031/viewer/2022012014/615959689e6f070b601118ce/html5/thumbnails/60.jpg)
Shortest common superstring: greedy
Why are substrings of length 8 long enough for Greedy-SCS to figure out there are 3 copies of long?
a_long_long_long_time
One length-8 substring spans all three longs
g_long_l
![Page 61: 16 assembly scs v2 - Department of Computer Science](https://reader031.vdocuments.us/reader031/viewer/2022012014/615959689e6f070b601118ce/html5/thumbnails/61.jpg)
Third law of assembly
Repeats make assembly difficult; whether we can assemble without mistakes depends on length of reads and repetitive patterns in genome
a_long_long_long_time
a_long_long_time
Collapsing a tandem repeat:
Spurious rearrangement:
![Page 62: 16 assembly scs v2 - Department of Computer Science](https://reader031.vdocuments.us/reader031/viewer/2022012014/615959689e6f070b601118ce/html5/thumbnails/62.jpg)
Repeats foil assembly
Portion of overlap graph involving repeat family A
As are longer than read length
A
Lots of overlaps among A reads
Even if we avoid collapsing copies of A, we can’t know which paths in correspond to which paths out
L1
L2
L3
L4
R1
R2
R3
R4
L1
L2
L3
L4
R1
R2
R3
R4
Stre
tche
s of
ge
nom
eRe
ads
RepetitiveUnique Unique