![Page 1: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir](https://reader034.vdocuments.us/reader034/viewer/2022043007/5f93da948dd46072c628ee1e/html5/thumbnails/1.jpg)
CS 466Introduction to Bioinformatics
Lecture 5
Mohammed El-KebirSeptember 9, 2020
![Page 2: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir](https://reader034.vdocuments.us/reader034/viewer/2022043007/5f93da948dd46072c628ee1e/html5/thumbnails/2.jpg)
Course AnnouncementsInstructor:• Mohammed El-Kebir (melkebir)• Office hours: Wednesdays, 3:15-4:15pm
TA:• Aswhin Ramesh (aramesh7)• Office hours: Fridays, 11:00-11:59am in SC 3405
2
Homework 1: Due on Sept. 18 (11:59pm)
Midterm: 10/4, 11-1pm @Transportation Building 103(conflict: 10/7, 7-9pm @Siebel 1302 -- to sign up email me)
![Page 3: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir](https://reader034.vdocuments.us/reader034/viewer/2022043007/5f93da948dd46072c628ee1e/html5/thumbnails/3.jpg)
Global, Fitting and Local Alignment
3
Local Alignment problem: Given strings 𝐯 ∈ Σ!and 𝐰 ∈ Σ" and scoring function 𝛿, find a substring
of 𝐯 and a substring of 𝐰 whose alignment has maximum global alignment score 𝑠∗ among allglobal alignments of all substrings of 𝐯 and 𝐰
[Smith-Waterman algorithm]
Fitting Alignment problem: Given strings 𝐯 ∈ Σ!and 𝐰 ∈ Σ" and scoring function 𝛿, find an
alignment of 𝐯 and a substring of 𝐰 with maximum global alignment score 𝑠∗ among all global
alignments of 𝐯 and all substrings of 𝐰
Global Alignment problem: Given strings 𝐯 ∈ Σ!and 𝐰 ∈ Σ" and scoring function 𝛿, find alignment
of 𝐯 and 𝐰 with maximum score.[Needleman-Wunsch algorithm]
A
G
G
0T A C G G C0𝐯\𝐰
Question: How to assess resulting algorithms?
![Page 4: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir](https://reader034.vdocuments.us/reader034/viewer/2022043007/5f93da948dd46072c628ee1e/html5/thumbnails/4.jpg)
Time Complexity
4
0 1 2 3 n = 4
0 O O O O O
1 O O O O O
2 O O O O O
3 O O O O O
m = 4 O O O O O
WA T C G
A
T
G
T
V
Alignment is a path from source (0, 0)to target (𝑚, 𝑛) in edit graph
Edit graph is a weighed, directed grid graph 𝐺 = (𝑉, 𝐸) with source vertex (0, 0) and target vertex (𝑚, 𝑛). Each
edge ((𝑖, 𝑗), (𝑘, 𝑙)) has weight depending on direction.
Running time is 𝑂(𝑚𝑛)[quadratic time]
![Page 5: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir](https://reader034.vdocuments.us/reader034/viewer/2022043007/5f93da948dd46072c628ee1e/html5/thumbnails/5.jpg)
Time Complexity
5
0 1 2 3 n = 4
0 O O O O O
1 O O O O O
2 O O O O O
3 O O O O O
m = 4 O O O O O
WA T C G
A
T
G
T
V
Alignment is a path from source (0, 0)to target (𝑚, 𝑛) in edit graph
Edit graph is a weighed, directed grid graph 𝐺 = (𝑉, 𝐸) with source vertex (0, 0) and target vertex (𝑚, 𝑛). Each
edge ((𝑖, 𝑗), (𝑘, 𝑙)) has weight depending on direction.
Running time is 𝑂(𝑚𝑛)[quadratic time]
Question: Compute alignment faster than 𝑂(𝑚𝑛) time? [subquadratic time]
![Page 6: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir](https://reader034.vdocuments.us/reader034/viewer/2022043007/5f93da948dd46072c628ee1e/html5/thumbnails/6.jpg)
Space Complexity
6
0 1 2 3 n = 4
0
1
2
3
m = 4
WA T C G
A
T
G
T
VThus, space complexity is 𝑂(𝑚𝑛)
[quadratic space]
Size of DP table is 𝑚 + 1 × (𝑛 + 1)
Example: To align a short read (𝑚 = 100) to
human genome (𝑛 = 3 4 10!), we need300 GB memory.
![Page 7: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir](https://reader034.vdocuments.us/reader034/viewer/2022043007/5f93da948dd46072c628ee1e/html5/thumbnails/7.jpg)
Space Complexity
7
0 1 2 3 n = 4
0
1
2
3
m = 4
WA T C G
A
T
G
T
VThus, space complexity is 𝑂(𝑚𝑛)
[quadratic space]
Size of DP table is 𝑚 + 1 × (𝑛 + 1)
Example: To align a short read (𝑚 = 100) to
human genome (𝑛 = 3 4 10!), we need300 GB memory.
Question: How long is an alignment?
![Page 8: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir](https://reader034.vdocuments.us/reader034/viewer/2022043007/5f93da948dd46072c628ee1e/html5/thumbnails/8.jpg)
Space Complexity
8
0 1 2 3 n = 4
0
1
2
3
m = 4
WA T C G
A
T
G
T
VThus, space complexity is 𝑂(𝑚𝑛)
[quadratic space]
Size of DP table is 𝑚 + 1 × (𝑛 + 1)
Example: To align a short read (𝑚 = 100) to
human genome (𝑛 = 3 4 10!), we need300 GB memory.
Question: How long is an alignment?
Question: Compute alignment in 𝑂(𝑚) space? [linear space]
![Page 9: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir](https://reader034.vdocuments.us/reader034/viewer/2022043007/5f93da948dd46072c628ee1e/html5/thumbnails/9.jpg)
Outline1. Recap of global, fitting, local and gapped alignment2. Space-efficient alignment3. Subquadratic time alignment
Reading:• Jones and Pevzner. Chapters 7.1-7.4• Lecture notes
9
![Page 10: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir](https://reader034.vdocuments.us/reader034/viewer/2022043007/5f93da948dd46072c628ee1e/html5/thumbnails/10.jpg)
2/2/12 1:56 PMIntroduction to Bioinformatics Algorithms
Page 1 of 1http://site.ebrary.com/lib/brown/docPrint.action?encrypted=c12cdb…e1417e8453f71b12fffb812a0765e61cabb13181e78b7cf61aa3092b81458876e
Jones, Neil C.; Pevzner, Pavel. Introduction to Bioinformatics Algorithms.Cambridge, MA, USA: MIT Press, 2004. p 232.http://site.ebrary.com/lib/brown/Doc?id=10225303&ppg=251Copyright © 2004. MIT Press. All rights reserved. May not be reproduced in any form without permission from the publisher, except fair uses permitted under U.S. or applicable copyright law.
𝑚
00 𝑛
Space Efficient Alignment
10
Computing 𝑠[𝑖, 𝑗] requires access to: 𝑠 𝑖 − 1, 𝑗 ,𝑠[𝑖, 𝑗 − 1] and 𝑠[𝑖 − 1, 𝑗 − 1]
𝑖
𝑗
s[i, j] = max
8>>><
>>>:
0, if i = 0 and j = 0,
s[i� 1, j] + �(vi,�), if i > 0,
s[i, j � 1] + �(�, wj), if j > 0,
s[i� 1, j � 1] + �(vi, wj), if i > 0 and j > 0.
![Page 11: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir](https://reader034.vdocuments.us/reader034/viewer/2022043007/5f93da948dd46072c628ee1e/html5/thumbnails/11.jpg)
2/2/12 1:56 PMIntroduction to Bioinformatics Algorithms
Page 1 of 1http://site.ebrary.com/lib/brown/docPrint.action?encrypted=c12cdb…e1417e8453f71b12fffb812a0765e61cabb13181e78b7cf61aa3092b81458876e
Jones, Neil C.; Pevzner, Pavel. Introduction to Bioinformatics Algorithms.Cambridge, MA, USA: MIT Press, 2004. p 232.http://site.ebrary.com/lib/brown/Doc?id=10225303&ppg=251Copyright © 2004. MIT Press. All rights reserved. May not be reproduced in any form without permission from the publisher, except fair uses permitted under U.S. or applicable copyright law.
𝑚
00 𝑛
Space Efficient Alignment
11
Computing 𝑠[𝑖, 𝑗] requires access to: 𝑠 𝑖 − 1, 𝑗 ,𝑠[𝑖, 𝑗 − 1] and 𝑠[𝑖 − 1, 𝑗 − 1]
𝑖
𝑗
Thus it suffices to store only two columns to compute optimal alignment score 𝑠 𝑚, 𝑛 ,
i.e., 2 𝑚 + 1 = 𝑂(𝑚) space.
s[i, j] = max
8>>><
>>>:
0, if i = 0 and j = 0,
s[i� 1, j] + �(vi,�), if i > 0,
s[i, j � 1] + �(�, wj), if j > 0,
s[i� 1, j � 1] + �(vi, wj), if i > 0 and j > 0.
![Page 12: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir](https://reader034.vdocuments.us/reader034/viewer/2022043007/5f93da948dd46072c628ee1e/html5/thumbnails/12.jpg)
2/2/12 1:56 PMIntroduction to Bioinformatics Algorithms
Page 1 of 1http://site.ebrary.com/lib/brown/docPrint.action?encrypted=c12cdb…e1417e8453f71b12fffb812a0765e61cabb13181e78b7cf61aa3092b81458876e
Jones, Neil C.; Pevzner, Pavel. Introduction to Bioinformatics Algorithms.Cambridge, MA, USA: MIT Press, 2004. p 232.http://site.ebrary.com/lib/brown/Doc?id=10225303&ppg=251Copyright © 2004. MIT Press. All rights reserved. May not be reproduced in any form without permission from the publisher, except fair uses permitted under U.S. or applicable copyright law.
𝑚
00 𝑛
Space Efficient Alignment
12
Question: What if we want alignment itself?
Computing 𝑠[𝑖, 𝑗] requires access to: 𝑠 𝑖 − 1, 𝑗 ,𝑠[𝑖, 𝑗 − 1] and 𝑠[𝑖 − 1, 𝑗 − 1]
𝑖
𝑗
Thus it suffices to store only two columns to compute optimal alignment score 𝑠 𝑚, 𝑛 ,
i.e., 2 𝑚 + 1 = 𝑂(𝑚) space.
s[i, j] = max
8>>><
>>>:
0, if i = 0 and j = 0,
s[i� 1, j] + �(vi,�), if i > 0,
s[i, j � 1] + �(�, wj), if j > 0,
s[i� 1, j � 1] + �(vi, wj), if i > 0 and j > 0.
![Page 13: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir](https://reader034.vdocuments.us/reader034/viewer/2022043007/5f93da948dd46072c628ee1e/html5/thumbnails/13.jpg)
Space Efficient Alignment – First Attempt
• What if also want optimal alignment?• Easy: keep best pointers as fill in table.• No! Do not know which path to keep until computing
recurrence at each step.
w
vw
v
![Page 14: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir](https://reader034.vdocuments.us/reader034/viewer/2022043007/5f93da948dd46072c628ee1e/html5/thumbnails/14.jpg)
Space Efficient Alignment – First Attempt
• What if also want optimal alignment?• Easy: keep best pointers as fill in table.• No! Do not know which path to keep until computing
recurrence at each step.
w
vw
v
![Page 15: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir](https://reader034.vdocuments.us/reader034/viewer/2022043007/5f93da948dd46072c628ee1e/html5/thumbnails/15.jpg)
Space Efficient Alignment – First Attempt
• What if also want optimal alignment?• Easy: keep best pointers as fill in table.• No! Do not know which path to keep until computing
recurrence at each step.
w
vw
v
Best score for column might not be part of best alignment!
![Page 16: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir](https://reader034.vdocuments.us/reader034/viewer/2022043007/5f93da948dd46072c628ee1e/html5/thumbnails/16.jpg)
Space Efficient Alignment – Second Attempt
16
𝑛/2
𝑚
𝑖∗
𝑖
Maximum weight path from (0,0) to (𝑚, 𝑛) passes
through (𝑖∗, 𝑛/2)
Question: What is 𝑖∗?
Alignment is a path from source (0, 0) to target (𝑚, 𝑛)
in edit graph
![Page 17: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir](https://reader034.vdocuments.us/reader034/viewer/2022043007/5f93da948dd46072c628ee1e/html5/thumbnails/17.jpg)
Space Efficient Alignment – Second Attempt
17
𝑛/2
𝑚
𝑖∗
𝑖
![Page 18: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir](https://reader034.vdocuments.us/reader034/viewer/2022043007/5f93da948dd46072c628ee1e/html5/thumbnails/18.jpg)
Space Efficient Alignment – Second Attempt
18
𝑛/2
𝑚
𝑖∗
𝑖
![Page 19: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir](https://reader034.vdocuments.us/reader034/viewer/2022043007/5f93da948dd46072c628ee1e/html5/thumbnails/19.jpg)
19
2/2/12 1:53 PMIntroduction to Bioinformatics Algorithms
Page 1 of 1http://site.ebrary.com/lib/brown/docPrint.action?encrypted=516d6f…41115138e47718afb5f0e14342e57210a6977481c8e3ce3915fa0a11e19f36678
Jones, Neil C.; Pevzner, Pavel. Introduction to Bioinformatics Algorithms.Cambridge, MA, USA: MIT Press, 2004. p 235.http://site.ebrary.com/lib/brown/Doc?id=10225303&ppg=254Copyright © 2004. MIT Press. All rights reserved. May not be reproduced in any form without permission from the publisher, except fair uses permitted under U.S. or applicable copyright law.
Hirschberg(𝑖, 𝑗, 𝑖", 𝑗′)1. if 𝑗" − 𝑗 > 12. 𝑖∗ ß arg max
$%$!!%$!wt(𝑖′′)
3. Report (𝑖∗, 𝑗 + &!'&( )
4. Hirschberg(𝑖, 𝑗, 𝑖∗, 𝑗 + &!'&( )
5. Hirschberg(𝑖∗, 𝑗 + &!'&(, 𝑖", 𝑗′)
![Page 20: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir](https://reader034.vdocuments.us/reader034/viewer/2022043007/5f93da948dd46072c628ee1e/html5/thumbnails/20.jpg)
20
2/2/12 1:53 PMIntroduction to Bioinformatics Algorithms
Page 1 of 1http://site.ebrary.com/lib/brown/docPrint.action?encrypted=516d6f…41115138e47718afb5f0e14342e57210a6977481c8e3ce3915fa0a11e19f36678
Jones, Neil C.; Pevzner, Pavel. Introduction to Bioinformatics Algorithms.Cambridge, MA, USA: MIT Press, 2004. p 235.http://site.ebrary.com/lib/brown/Doc?id=10225303&ppg=254Copyright © 2004. MIT Press. All rights reserved. May not be reproduced in any form without permission from the publisher, except fair uses permitted under U.S. or applicable copyright law.
Time: area + area/2 + area/4 + … = area (1 + ½ + ¼ + ⅛ + …) ≤ 2 × area = O(mn)
Space: O(m)
Hirschberg(𝑖, 𝑗, 𝑖", 𝑗′)1. if 𝑗" − 𝑗 > 12. 𝑖∗ ß arg max
$%$!!%$!wt(𝑖′′)
3. Report (𝑖∗, 𝑗 + &!'&( )
4. Hirschberg(𝑖, 𝑗, 𝑖∗, 𝑗 + &!'&( )
5. Hirschberg(𝑖∗, 𝑗 + &!'&(, 𝑖", 𝑗′)
![Page 21: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir](https://reader034.vdocuments.us/reader034/viewer/2022043007/5f93da948dd46072c628ee1e/html5/thumbnails/21.jpg)
21
2/2/12 1:53 PMIntroduction to Bioinformatics Algorithms
Page 1 of 1http://site.ebrary.com/lib/brown/docPrint.action?encrypted=516d6f…41115138e47718afb5f0e14342e57210a6977481c8e3ce3915fa0a11e19f36678
Jones, Neil C.; Pevzner, Pavel. Introduction to Bioinformatics Algorithms.Cambridge, MA, USA: MIT Press, 2004. p 235.http://site.ebrary.com/lib/brown/Doc?id=10225303&ppg=254Copyright © 2004. MIT Press. All rights reserved. May not be reproduced in any form without permission from the publisher, except fair uses permitted under U.S. or applicable copyright law.
Hirschberg(𝑖, 𝑗, 𝑖", 𝑗′)1. if 𝑗" − 𝑗 > 12. 𝑖∗ ß arg max
$%$!!%$!wt(𝑖′′)
3. Report (𝑖∗, 𝑗 + &!'&( )
4. Hirschberg(𝑖, 𝑗, 𝑖∗, 𝑗 + &!'&( )
5. Hirschberg(𝑖∗, 𝑗 + &!'&(, 𝑖", 𝑗′)
Time: area + area/2 + area/4 + … = area (1 + ½ + ¼ + ⅛ + …) ≤ 2 × area = O(mn)
Space: O(m)
Question: How to reconstruct alignment from reported vertices?
![Page 22: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir](https://reader034.vdocuments.us/reader034/viewer/2022043007/5f93da948dd46072c628ee1e/html5/thumbnails/22.jpg)
Hirschberg Algorithm: Reversing Edges Necessary?
22
𝑖∗ = arg max{$%&%!
pre2ix 𝑖 + suf2ix 𝑖 }
Max weight path from (0,0) to (𝑚, 𝑛) through (𝑖∗, 𝑛/2)
Compute pre2ix 𝑖 0 ≤ 𝑖 ≤ 𝑚} in O(𝑚𝑗) time and O(𝑚)space, by starting from (0,0) to 𝑚, 𝑗 keeping only two
columns in memory. [single-source multiple destinations]
𝑗
𝑚
𝑖∗
𝑖
![Page 23: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir](https://reader034.vdocuments.us/reader034/viewer/2022043007/5f93da948dd46072c628ee1e/html5/thumbnails/23.jpg)
Hirschberg Algorithm: Reversing Edges Necessary?
23
𝑖∗ = arg max{$%&%!
pre2ix 𝑖 + suf2ix 𝑖 }
Max weight path from (0,0) to (𝑚, 𝑛) through (𝑖∗, 𝑛/2)
Want: Compute suf2ix 𝑖 0 ≤ 𝑖 ≤ 𝑚} in O(𝑚𝑗) time and O(𝑚) space
Doing a longest path from each 𝑖, 𝑗 to 𝑚, 𝑛 (for all 0 ≤ 𝑖 ≤ 𝑚) will not achieve desired running time!
Reversing edges enables single-source multiple destination computation in desired time and space bound!
𝑗
𝑚
𝑖∗
𝑖
Compute pre2ix 𝑖 0 ≤ 𝑖 ≤ 𝑚} in O(𝑚𝑗) time and O(𝑚)space, by starting from (0,0) to 𝑚, 𝑗 keeping only two
columns in memory. [single-source multiple destinations]
![Page 24: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir](https://reader034.vdocuments.us/reader034/viewer/2022043007/5f93da948dd46072c628ee1e/html5/thumbnails/24.jpg)
Hirschberg Algorithm: Reconstructing Alignment
24
Problem: Given reported vertices and scores { 𝑖$, 0, 𝑠$ , … , 𝑖", 𝑛, 𝑠" },
find intermediary vertices.
Hirschberg(𝑖, 𝑗, 𝑖', 𝑗′)1. if 𝑗' − 𝑗 > 12. 𝑖∗ ß arg max
$%&%!wt(𝑖)
3. Report (𝑖∗, 𝑗 + (!)(*)
4. Hirschberg(𝑖, 𝑗, 𝑖∗, 𝑗 + (!)(*
)
5. Hirschberg(𝑖∗, 𝑗 + (!)(*, 𝑖', 𝑗′)
0 1 2 3 4 5
0 0 -1 -2 -3 -4 -5
1 -1 1 0 -1 -2 -3
2 -2 0 2 1 0 -1
3 -3 -1 1 1 2 1
4 -4 -2 0 0 1 1
5 -5 -3 -1 1 0 2
WA T C G
A
T
G
T
V
A T - G T C
A T C G - C
C
CTransposing matrix does not help, because gaps could occur in both
input sequences (and there might be multiple opt. alignments)
![Page 25: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir](https://reader034.vdocuments.us/reader034/viewer/2022043007/5f93da948dd46072c628ee1e/html5/thumbnails/25.jpg)
Linear Space Alignment – The Hirschberg Algorithm
25
![Page 26: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir](https://reader034.vdocuments.us/reader034/viewer/2022043007/5f93da948dd46072c628ee1e/html5/thumbnails/26.jpg)
Outline1. Recap of global, fitting, local and gapped alignment2. Space-efficient alignment3. Subquadratic time alignment
Reading:• Jones and Pevzner. Chapters 7.1-7.4• Lecture notes
26
![Page 27: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir](https://reader034.vdocuments.us/reader034/viewer/2022043007/5f93da948dd46072c628ee1e/html5/thumbnails/27.jpg)
Banded Alignment
27
3. A�ne Gap Costs (More biologically realistic): We can extend this problem further byintroducing an a�ne gap penalty function
� + (n� 1)✏
where � is the gap open penalty, n is the gap length and ✏ is the gap extension penalty.A gap extension penalty which is smaller than a gap open penalty more directly modelsthe underlying biology, as a few longer gaps (corresponding to, say, double strandedDNA breaks and repair) are more likely than several small gaps. Gotoh’s algorithmallows local alignment with a�ne gap costs in quadratic time.
4. Linear Space: For large pairs of sequences, the O(MN) memory taken up by the tableof scores can be substantial. Hirschberg’s algorithm is a way of computing only thenecessary entries of F which allows O(N) memory usage (hence linear space) at theexpense of roughly doubly the running time.
5. Banded Alignment : Banded alignment is best illustrated with a figure. It reduces thetime complexity of alignment to roughly linear time by being semi-greedy, as it disallowsthe possibility of a high scoring alignment that wanders significantly o↵ the diagonal(corresponding to a big gap in either sequence).
x1
x2
x3
.
.
.
.
.
.
xM
y1 y2 y3 ... ... ... ... yN
Constrain traceback to band of DP matrix (penalize big gaps)
Figure 7 | By constraining the traceback to a band of the matrix, we can make the running timeroughly linear (if N = M , then the running time would be O(Nk(N)), where k(N) is the char-acteristic band width. Biologically, this corresponds to disallowing long gaps. Computationally,cells which are outside of the banded region are not considered when taking the max; this meanssetting F (i, j) = �1 for |i� j| > k(N).
10
Figure source: http://jinome.stanford.edu/stat366/pdfs/stat366_win0607_lecture04.pdf
Constraint path to band of width 𝑘around diagonal
Running time: O(𝑛𝑘)
Gives a good approximation of highly identical sequences
![Page 28: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir](https://reader034.vdocuments.us/reader034/viewer/2022043007/5f93da948dd46072c628ee1e/html5/thumbnails/28.jpg)
Banded Alignment
28
3. A�ne Gap Costs (More biologically realistic): We can extend this problem further byintroducing an a�ne gap penalty function
� + (n� 1)✏
where � is the gap open penalty, n is the gap length and ✏ is the gap extension penalty.A gap extension penalty which is smaller than a gap open penalty more directly modelsthe underlying biology, as a few longer gaps (corresponding to, say, double strandedDNA breaks and repair) are more likely than several small gaps. Gotoh’s algorithmallows local alignment with a�ne gap costs in quadratic time.
4. Linear Space: For large pairs of sequences, the O(MN) memory taken up by the tableof scores can be substantial. Hirschberg’s algorithm is a way of computing only thenecessary entries of F which allows O(N) memory usage (hence linear space) at theexpense of roughly doubly the running time.
5. Banded Alignment : Banded alignment is best illustrated with a figure. It reduces thetime complexity of alignment to roughly linear time by being semi-greedy, as it disallowsthe possibility of a high scoring alignment that wanders significantly o↵ the diagonal(corresponding to a big gap in either sequence).
x1
x2
x3
.
.
.
.
.
.
xM
y1 y2 y3 ... ... ... ... yN
Constrain traceback to band of DP matrix (penalize big gaps)
Figure 7 | By constraining the traceback to a band of the matrix, we can make the running timeroughly linear (if N = M , then the running time would be O(Nk(N)), where k(N) is the char-acteristic band width. Biologically, this corresponds to disallowing long gaps. Computationally,cells which are outside of the banded region are not considered when taking the max; this meanssetting F (i, j) = �1 for |i� j| > k(N).
10
Figure source: http://jinome.stanford.edu/stat366/pdfs/stat366_win0607_lecture04.pdf
Constraint path to band of width 𝑘around diagonal
Running time: O(𝑛𝑘)
Gives a good approximation of highly identical sequences
Question: How to change recurrence to accomplish this?
![Page 29: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir](https://reader034.vdocuments.us/reader034/viewer/2022043007/5f93da948dd46072c628ee1e/html5/thumbnails/29.jpg)
Block Alignment
29
Divide input sequences into blocks of length 𝑡
v1, …, vt vt+1, …, v2t … vm-t+1, …, vm
w1, …, wt wt+1, …, w2t … wn-t+1, …, wn
![Page 30: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir](https://reader034.vdocuments.us/reader034/viewer/2022043007/5f93da948dd46072c628ee1e/html5/thumbnails/30.jpg)
2/7/12 1:59 PMIntroduction to Bioinformatics Algorithms
Page 1 of 1http://site.ebrary.com/lib/brown/docPrint.action?encrypted=91cb81…6f194fedd9f1e790119b175e5ad3d121b64d11105dbafa31a487926919095c69
Jones, Neil C.; Pevzner, Pavel. Introduction to Bioinformatics Algorithms.Cambridge, MA, USA: MIT Press, 2004. p 236.http://site.ebrary.com/lib/brown/Doc?id=10225303&ppg=255Copyright © 2004. MIT Press. All rights reserved. May not be reproduced in any form without permission from the publisher, except fair uses permitted under U.S. or applicable copyright law.
Block Alignment
30
Divide input sequences into blocks of length 𝑡
v1, …, vt vt+1, …, v2t … vm-t+1, …, vm
w1, …, wt wt+1, …, w2t … wn-t+1, …, wn
Require that paths in edit graph pass through corners of blocks
![Page 31: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir](https://reader034.vdocuments.us/reader034/viewer/2022043007/5f93da948dd46072c628ee1e/html5/thumbnails/31.jpg)
2/7/12 1:59 PMIntroduction to Bioinformatics Algorithms
Page 1 of 1http://site.ebrary.com/lib/brown/docPrint.action?encrypted=91cb81…6f194fedd9f1e790119b175e5ad3d121b64d11105dbafa31a487926919095c69
Jones, Neil C.; Pevzner, Pavel. Introduction to Bioinformatics Algorithms.Cambridge, MA, USA: MIT Press, 2004. p 236.http://site.ebrary.com/lib/brown/Doc?id=10225303&ppg=255Copyright © 2004. MIT Press. All rights reserved. May not be reproduced in any form without permission from the publisher, except fair uses permitted under U.S. or applicable copyright law.
Block Alignment
31
Divide input sequences into blocks of length 𝑡
v1, …, vt vt+1, …, v2t … vm-t+1, …, vm
w1, …, wt wt+1, …, w2t … wn-t+1, …, wn
Require that paths in edit graph pass through corners of blocks
s[i, j] = max
8>>><
>>>:
0, if i = 0 and j = 0,
s[i� 1, j]� �, if i > 0,
s[i, j � 1]� �, if j > 0,
s[i� 1, j � 1] + �(i, j), if i > 0 and j > 0.
0 ≤ 𝑖, 𝑗 ≤ 𝑡 and 𝛽(𝑖, 𝑗) is max score alignment between block 𝑖 of 𝐯 and block 𝑗 of 𝐰
![Page 32: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir](https://reader034.vdocuments.us/reader034/viewer/2022043007/5f93da948dd46072c628ee1e/html5/thumbnails/32.jpg)
Block Alignment – First Attempt: Pre-compute 𝛽 𝑖, 𝑗
32
s[i, j] = max
8>>><
>>>:
0, if i = 0 and j = 0,
s[i� 1, j]� �, if i > 0,
s[i, j � 1]� �, if j > 0,
s[i� 1, j � 1] + �(i, j), if i > 0 and j > 0.
0 ≤ 𝑖, 𝑗 ≤ 𝑛/𝑡 and 𝛽(𝑖, 𝑗) is max score alignment between block 𝑖 of 𝐯 and block 𝑗 of 𝐰
2/7/12 1:59 PMIntroduction to Bioinformatics Algorithms
Page 1 of 1http://site.ebrary.com/lib/brown/docPrint.action?encrypted=91cb81…6f194fedd9f1e790119b175e5ad3d121b64d11105dbafa31a487926919095c69
Jones, Neil C.; Pevzner, Pavel. Introduction to Bioinformatics Algorithms.Cambridge, MA, USA: MIT Press, 2004. p 236.http://site.ebrary.com/lib/brown/Doc?id=10225303&ppg=255Copyright © 2004. MIT Press. All rights reserved. May not be reproduced in any form without permission from the publisher, except fair uses permitted under U.S. or applicable copyright law.
t
2t
nt
t 2t nt
![Page 33: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir](https://reader034.vdocuments.us/reader034/viewer/2022043007/5f93da948dd46072c628ee1e/html5/thumbnails/33.jpg)
Block Alignment – First Attempt: Pre-compute 𝛽 𝑖, 𝑗
33
s[i, j] = max
8>>><
>>>:
0, if i = 0 and j = 0,
s[i� 1, j]� �, if i > 0,
s[i, j � 1]� �, if j > 0,
s[i� 1, j � 1] + �(i, j), if i > 0 and j > 0.
0 ≤ 𝑖, 𝑗 ≤ 𝑛/𝑡 and 𝛽(𝑖, 𝑗) is max score alignment between block 𝑖 of 𝐯 and block 𝑗 of 𝐰
Question: How much time to compute all 𝛽(𝑖, 𝑗)?
2/7/12 1:59 PMIntroduction to Bioinformatics Algorithms
Page 1 of 1http://site.ebrary.com/lib/brown/docPrint.action?encrypted=91cb81…6f194fedd9f1e790119b175e5ad3d121b64d11105dbafa31a487926919095c69
Jones, Neil C.; Pevzner, Pavel. Introduction to Bioinformatics Algorithms.Cambridge, MA, USA: MIT Press, 2004. p 236.http://site.ebrary.com/lib/brown/Doc?id=10225303&ppg=255Copyright © 2004. MIT Press. All rights reserved. May not be reproduced in any form without permission from the publisher, except fair uses permitted under U.S. or applicable copyright law.
t
2t
nt
t 2t nt
![Page 34: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir](https://reader034.vdocuments.us/reader034/viewer/2022043007/5f93da948dd46072c628ee1e/html5/thumbnails/34.jpg)
Block Alignment – First Attempt: Pre-compute 𝛽 𝑖, 𝑗
34
s[i, j] = max
8>>><
>>>:
0, if i = 0 and j = 0,
s[i� 1, j]� �, if i > 0,
s[i, j � 1]� �, if j > 0,
s[i� 1, j � 1] + �(i, j), if i > 0 and j > 0.
0 ≤ 𝑖, 𝑗 ≤ 𝑛/𝑡 and 𝛽(𝑖, 𝑗) is max score alignment between block 𝑖 of 𝐯 and block 𝑗 of 𝐰
Question: How much time to compute all 𝛽(𝑖, 𝑗)?
2/7/12 1:59 PMIntroduction to Bioinformatics Algorithms
Page 1 of 1http://site.ebrary.com/lib/brown/docPrint.action?encrypted=91cb81…6f194fedd9f1e790119b175e5ad3d121b64d11105dbafa31a487926919095c69
Jones, Neil C.; Pevzner, Pavel. Introduction to Bioinformatics Algorithms.Cambridge, MA, USA: MIT Press, 2004. p 236.http://site.ebrary.com/lib/brown/Doc?id=10225303&ppg=255Copyright © 2004. MIT Press. All rights reserved. May not be reproduced in any form without permission from the publisher, except fair uses permitted under U.S. or applicable copyright law.
t
2t
nt
t 2t nt
Computing 𝛽 𝑖, 𝑗 takes 𝑂 𝑡( time
There are 𝑛/𝑡 × 𝑛/𝑡 values 𝛽 𝑖, 𝑗
Total: 𝑂 )*× )*× 𝑡( = 𝑂 𝑛( time
![Page 35: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir](https://reader034.vdocuments.us/reader034/viewer/2022043007/5f93da948dd46072c628ee1e/html5/thumbnails/35.jpg)
Block Alignment – Four Russians Technique
35
2/7/12 1:59 PMIntroduction to Bioinformatics Algorithms
Page 1 of 1http://site.ebrary.com/lib/brown/docPrint.action?encrypted=91cb81…6f194fedd9f1e790119b175e5ad3d121b64d11105dbafa31a487926919095c69
Jones, Neil C.; Pevzner, Pavel. Introduction to Bioinformatics Algorithms.Cambridge, MA, USA: MIT Press, 2004. p 236.http://site.ebrary.com/lib/brown/Doc?id=10225303&ppg=255Copyright © 2004. MIT Press. All rights reserved. May not be reproduced in any form without permission from the publisher, except fair uses permitted under U.S. or applicable copyright law.
t
2t
nt
t 2t nt
Pre-compute and store all βij
Pre-compute and store all max weight alignments 𝑆[𝐯′, 𝐰′] of allpairs (𝐯", 𝐰") of length t strings
Algorithm:1. Precompute 𝑆[𝐯′, 𝐰′] where
𝐯", 𝐰" ∈ Σ*2. Compute block alignment
between 𝐯 and 𝐰 using 𝑆
s[i, j] = max
8>>><
>>>:
0, if i = 0 and j = 0,
s[i� 1, j]� �, if i > 0,
s[i, j � 1]� �, if j > 0,
s[i� 1, j � 1] + S[v(i), w(j)], if i > 0 and j > 0.
![Page 36: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir](https://reader034.vdocuments.us/reader034/viewer/2022043007/5f93da948dd46072c628ee1e/html5/thumbnails/36.jpg)
Block Alignment – Four Russians Technique
36
2/7/12 1:59 PMIntroduction to Bioinformatics Algorithms
Page 1 of 1http://site.ebrary.com/lib/brown/docPrint.action?encrypted=91cb81…6f194fedd9f1e790119b175e5ad3d121b64d11105dbafa31a487926919095c69
Jones, Neil C.; Pevzner, Pavel. Introduction to Bioinformatics Algorithms.Cambridge, MA, USA: MIT Press, 2004. p 236.http://site.ebrary.com/lib/brown/Doc?id=10225303&ppg=255Copyright © 2004. MIT Press. All rights reserved. May not be reproduced in any form without permission from the publisher, except fair uses permitted under U.S. or applicable copyright law.
t
2t
nt
t 2t nt
Pre-compute and store all βij
Pre-compute and store all max weight alignments 𝑆[𝐯′, 𝐰′] of allpairs (𝐯", 𝐰") of length t strings
Algorithm:1. Precompute 𝑆[𝐯′, 𝐰′] where
𝐯", 𝐰" ∈ Σ*2. Compute block alignment
between 𝐯 and 𝐰 using 𝑆
s[i, j] = max
8>>><
>>>:
0, if i = 0 and j = 0,
s[i� 1, j]� �, if i > 0,
s[i, j � 1]� �, if j > 0,
s[i� 1, j � 1] + S[v(i), w(j)], if i > 0 and j > 0.
Question: How to choose 𝑡 for DNA?
![Page 37: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir](https://reader034.vdocuments.us/reader034/viewer/2022043007/5f93da948dd46072c628ee1e/html5/thumbnails/37.jpg)
Block Alignment – Four Russians Technique
37
Question: How to choose 𝑡 for DNA?
![Page 38: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir](https://reader034.vdocuments.us/reader034/viewer/2022043007/5f93da948dd46072c628ee1e/html5/thumbnails/38.jpg)
Fastest Subquadratic Alignment* Algorithm
38*for edit distance
Edit distance in O(𝑛M/ log 𝑛) time
Barely subquadratic!
Want: O(𝑛MNO) time where 𝜀 > 0
![Page 39: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir](https://reader034.vdocuments.us/reader034/viewer/2022043007/5f93da948dd46072c628ee1e/html5/thumbnails/39.jpg)
Fastest Subquadratic Alignment* Algorithm
39*for edit distance
Edit distance in O(𝑛M/ log 𝑛) time
Barely subquadratic!
Want: O(𝑛MNO) time where 𝜀 > 0
Question: Is 𝑛MNO in O(𝑛M/ log 𝑛) for any 𝜀 > 0?
![Page 40: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir](https://reader034.vdocuments.us/reader034/viewer/2022043007/5f93da948dd46072c628ee1e/html5/thumbnails/40.jpg)
Hardness Result for Edit Distance [STOC 2015]
40
![Page 41: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir](https://reader034.vdocuments.us/reader034/viewer/2022043007/5f93da948dd46072c628ee1e/html5/thumbnails/41.jpg)
41
![Page 42: CS 466 Introduction to Bioinformatics - El-Kebir · Introduction to Bioinformatics Lecture 5 Mohammed El-Kebir September 11, 2019. Course Announcements Instructor: •Mohammed El-Kebir](https://reader034.vdocuments.us/reader034/viewer/2022043007/5f93da948dd46072c628ee1e/html5/thumbnails/42.jpg)
Take Home Messages1. Global alignment in O(𝑚𝑛) time and O(𝑚) space• Hirschberg algorithm
2. Block alignment can be done in subquadratic time• Four Russians Technique: O(𝑛M/ log 𝑛) time
3. Global alignment cannot be done in O(𝑛#$%) time under SETH
Reading:• Jones and Pevzner. Chapters 7.1-7.4• Lecture notes
42