Multiple Sequence Alignment
Vasileios Hatzivassiloglou
University of Texas at Dallas
Center star algorithm for multiple sequence global alignment
• T is the set of k strings that we want to align
• Pick S ∈ T that minimizes
$$\sum_{i=1}^{k} d(S, S_i)$$
• The initial alignment starts with S (≡ S1)
• Suppose we have already aligned S1, S2, ..., Si as S′1, S′2, ..., S′i. Then we add the remaining strings one at a time by aligning Si+1 with S′1, obtaining S′i+1 and S′′1. We replace S′1 with S′′1 and add spaces to S′2, ..., S′i wherever spaces were added to S′1.
Finding S
• S is the best representative of the set T in terms of the distance metric d
• If T is considered as a cluster of strings, then S is the centroid of the cluster
• To find S, align each string with every other ($\binom{k}{2}$ pairs) and calculate the sum for each candidate. Pick the choice that minimizes this sum:
$$S = \operatorname*{argmin}_{S_j \in T} \sum_{i=1}^{k} d(S_j, S_i), \quad \text{assuming } d(x, x) = 0$$
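The selection of S can be sketched in a few lines of Python. This is only an illustration: the slides do not fix the distance d, so the sketch assumes unit-cost edit distance, and the names `edit_distance` and `find_center` are hypothetical helpers rather than anything from the original deck.

```python
from itertools import combinations

def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance (assumed choice of d): each insertion,
    deletion, or substitution costs 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # ca opposite a space
                            curr[j - 1] + 1,            # cb opposite a space
                            prev[j - 1] + (ca != cb)))  # match or substitution
        prev = curr
    return prev[-1]

def find_center(strings):
    """Return (index of the center string, list of distance sums per string)."""
    totals = [0] * len(strings)
    # One pairwise alignment per unordered pair: k choose 2 in total.
    for i, j in combinations(range(len(strings)), 2):
        d = edit_distance(strings[i], strings[j])
        totals[i] += d
        totals[j] += d
    return totals.index(min(totals)), totals

strings = ["GTA", "CGT", "CAG"]
center_index, totals = find_center(strings)
print(strings[center_index], totals)
```

Ties between candidates are broken here by taking the first minimum; any other tie-breaking rule would serve equally well.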
Example
• Three strings: GTA, CGT, CAG
• Step 1: Calculate all three pairwise distances and pick the string that minimizes total distance; let’s say it’s CGT
• Step 2-1: Align CGT with GTA
    CGT-
    -GTA
• Step 2-2: Extend uninvolved, processed strings with spaces (not needed now)
Example (continued)
• Step 3-1: Align CGT- with CAG
    C-GT-
    CAG--
• Step 3-2: Extend uninvolved, processed strings with spaces (-GTA becomes --GTA)
    C-GT-
    --GTA
    CAG--
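The whole procedure (pick the center, align each remaining string with the current gapped center, then spread any new spaces to the rows already processed) can be sketched as follows. This is a minimal sketch under stated assumptions, not the author's implementation: it assumes unit costs (mismatch or space = 1, match or space opposite space = 0), and `align`, `spread`, and `center_star` are hypothetical names. Because optimal alignments are not unique, it may return a different but equally scored alignment than the one shown in the example.

```python
def align(a, b):
    """Globally align a (may already contain '-') with an ungapped string b.

    Returns (a2, b2, new_gap); new_gap[c] is True where column c is a space
    newly added to a. A '-' of a left opposite a space in b costs nothing.
    """
    n, m = len(a), len(b)
    skip = [0 if ch == "-" else 1 for ch in a]  # cost of a[i] opposite a space
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + skip[i - 1]
    for j in range(1, m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j - 1] + (a[i - 1] != b[j - 1]),
                          D[i - 1][j] + skip[i - 1],
                          D[i][j - 1] + 1)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    a2, b2, new_gap = [], [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            a2.append(a[i - 1])
            b2.append(b[j - 1])
            new_gap.append(False)
            i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i - 1][j] + skip[i - 1]:
            a2.append(a[i - 1])
            b2.append("-")
            new_gap.append(False)
            i -= 1
        else:
            a2.append("-")
            b2.append(b[j - 1])
            new_gap.append(True)
            j -= 1
    return "".join(reversed(a2)), "".join(reversed(b2)), new_gap[::-1]

def spread(row, new_gap):
    """Copy an already aligned row, adding '-' wherever the center gained a space."""
    chars = iter(row)
    return "".join("-" if is_new else next(chars) for is_new in new_gap)

def center_star(strings):
    """Return the aligned rows, in the same order as the input strings."""
    def cost(s, t):
        s2, t2, _ = align(s, t)
        return sum(x != y for x, y in zip(s2, t2))
    totals = [sum(cost(s, t) for t in strings) for s in strings]
    c = totals.index(min(totals))            # index of the center string
    aligned = {c: strings[c]}
    for idx, s in enumerate(strings):
        if idx == c:
            continue
        _, new_row, new_gap = align(aligned[c], s)
        # Extend the center and every processed row with the new spaces.
        aligned = {i: spread(row, new_gap) for i, row in aligned.items()}
        aligned[idx] = new_row
    return [aligned[i] for i in range(len(strings))]

result = center_star(["GTA", "CGT", "CAG"])
print("\n".join(result))
```

Applying `spread` to the center row itself reproduces the updated center, which is why the loop treats the center like any other processed row.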
Algorithm complexity – Finding S
• To find S, we consider k candidates
• For each candidate, we calculate the sum of k-1 terms – O(k²) such terms total
• If the maximum string length is n, then each term can be calculated in O(n²) time
• Total for finding S is O(k²n²)
Algorithm complexity – Subsequent alignments
• Each subsequent alignment at step i+1 aligns a string S′1 of length at most i·n with a string Si+1 of length at most n
• Each alignment can be found in time O(in∙n)
• Total time for these alignments is
$$\sum_{i=1}^{k-1} O(in \cdot n) = O\Bigl(n^2 \sum_{i=1}^{k-1} i\Bigr) = O(k^2 n^2)$$
Algorithm complexity – Extensions with spaces
• At step i+1 there is an extension of i-1 strings, each of length at most i·n
• For each such string, we need to consider a total of n new space positions
• Time required is
$$\sum_{i=2}^{k-1} (i-1)\,O(n) = O\Bigl(n \sum_{i=2}^{k-1} i\Bigr) = O(k^2 n)$$
• Overall total time for the algorithm is O(k²n²)
Error bounds
• It is useful to know how far the solution found by an approximate algorithm is from the true optimal solution
• Sometimes (but not always) it is possible to provide error bounds, that is, give upper and lower bounds for the quantity
$$\frac{\text{score(approximate solution)}}{\text{score(optimal solution)}}$$
• Bounds may depend on n and k
Error analysis assumptions
• Sometimes we need additional assumptions in order to derive useful bounds
• For the approximate algorithm for multiple string alignment, we assume the triangle inequality for measure d:
$$\forall i, j, k: \quad d(S_i, S_j) \le d(S_i, S_k) + d(S_k, S_j)$$
Background on distances
• A distance or metric d is formally defined as a function A×A→ℜ on a set A (called a metric space) with the following properties:
– d(x,y)≥0 (non-negativity)
– d(x,y)=0 iff x=y (identity of indiscernibles)
– d(x,y)=d(y,x) (symmetry)
– d(x,y)≤d(x,z)+d(z,y) (triangle inequality)
• Metric spaces include ℜ (with d(x,y)=|x-y|), all Euclidean spaces, the Lp spaces, and inner product spaces.
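As a small sanity check, the four properties can be tested by brute force for one candidate string distance. The sketch below assumes unit-cost edit distance (the slides do not say which d is used) and checks it only on a tiny, arbitrarily chosen set of strings over a two-letter alphabet, so it illustrates the definitions rather than proving anything.

```python
from itertools import product

def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance (assumed choice of d)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,
                            curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

# All strings of length 0, 1, or 2 over the alphabet {A, C}: 7 strings in total.
strings = ["".join(p) for n in range(3) for p in product("AC", repeat=n)]
d = edit_distance

checks = {
    "non-negativity": all(d(x, y) >= 0 for x, y in product(strings, repeat=2)),
    "identity of indiscernibles": all((d(x, y) == 0) == (x == y)
                                      for x, y in product(strings, repeat=2)),
    "symmetry": all(d(x, y) == d(y, x) for x, y in product(strings, repeat=2)),
    "triangle inequality": all(d(x, y) <= d(x, z) + d(z, y)
                               for x, y, z in product(strings, repeat=3)),
}
for name, holds in checks.items():
    print(f"{name}: {holds}")
```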
Background on distances
• Relaxing these properties gives weaker notions of distance:
– d(x,y)≥0: follows from the other three properties (2, 3, and 4)
– d(x,y)=0 iff x=y: dropping it gives a pseudometric
– d(x,y)=d(y,x): dropping it gives a quasimetric
– d(x,y)≤d(x,z)+d(z,y): dropping it gives a semimetric
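The remark that non-negativity follows from properties 2, 3, and 4 can be verified in one line. For any x and y, apply the triangle inequality to the path from x to y and back, then use symmetry and d(x,x)=0:

```latex
0 = d(x, x) \le d(x, y) + d(y, x) = 2\, d(x, y)
\quad\Longrightarrow\quad d(x, y) \ge 0
```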
Deriving an error bound
• Let v0 be the score for the optimal alignment and v* the score for the alignment produced by the center star algorithm
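• Let d0(i,j) and d*(i,j) be the corresponding induced distances on strings Si and Sj, so that
$$v^0 = \frac{1}{2} \sum_{i=1}^{k} \sum_{\substack{j=1 \\ j \ne i}}^{k} d^0(i, j)$$
$$v^* = \frac{1}{2} \sum_{i=1}^{k} \sum_{\substack{j=1 \\ j \ne i}}^{k} d^*(i, j)$$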
Lower bound for v0
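$$\begin{aligned}
v^0 &= \frac{1}{2} \sum_{i=1}^{k} \sum_{\substack{j=1 \\ j \ne i}}^{k} d^0(i, j) \\
&\ge \frac{1}{2} \sum_{i=1}^{k} \sum_{\substack{j=1 \\ j \ne i}}^{k} d(S_i, S_j) \\
&\ge \frac{1}{2} \sum_{i=1}^{k} \sum_{j=2}^{k} d(S_1, S_j) \\
&= \frac{k}{2} \sum_{j=2}^{k} d(S_1, S_j)
\end{aligned}$$
• First inequality: the induced distance can be no less than the distance between the strings themselves
• Second inequality: choice of S1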
Upper bound for v*
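$$\begin{aligned}
v^* &= \frac{1}{2} \sum_{i=1}^{k} \sum_{\substack{j=1 \\ j \ne i}}^{k} d^*(i, j) \\
&\le \frac{1}{2} \sum_{i=1}^{k} \sum_{\substack{j=1 \\ j \ne i}}^{k} \bigl[ d^*(i, 1) + d^*(1, j) \bigr] \\
&= \frac{1}{2} \sum_{i=1}^{k} \Bigl[ (k-1)\, d^*(1, i) + \sum_{j=1}^{k} d^*(1, j) - d^*(1, i) \Bigr] \\
&= \frac{1}{2} \Bigl[ (k-2) \sum_{i=1}^{k} d^*(1, i) + k \sum_{i=1}^{k} d^*(1, i) \Bigr] \\
&= (k-1) \sum_{i=1}^{k} d^*(1, i) \\
&= (k-1) \sum_{i=2}^{k} d(S_1, S_i)
\end{aligned}$$
• Second line: triangle inequality
• Third line: symmetry
• Last line: d*(1,1) = 0, and each string is aligned with S1 optimally (there may be additional spaces in matching positions, which do not change the distance)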
Combining the bounds
$$\frac{v^*}{v^0} \le \frac{(k-1) \sum_{i=2}^{k} d(S_1, S_i)}{\frac{k}{2} \sum_{i=2}^{k} d(S_1, S_i)} = \frac{2(k-1)}{k} < 2$$
• Better bound for low k
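The two bounds can be checked numerically on the three-string example (rows C-GT-, --GTA, CAG--). The sketch below is illustrative only: it assumes unit-cost edit distance, under which d(CGT, GTA) = d(CGT, CAG) = 2, and the helper name `sum_of_pairs` is hypothetical. It computes v* for that alignment and compares it with the upper bound on v* and the lower bound on v0; the optimal score v0 itself is not computed.

```python
from itertools import combinations

def sum_of_pairs(rows):
    """Sum, over all pairs of rows, of the number of columns where the two
    rows differ (a space opposite a space costs nothing)."""
    return sum(sum(x != y for x, y in zip(r, s))
               for r, s in combinations(rows, 2))

alignment = ["C-GT-", "--GTA", "CAG--"]  # center row first (center is CGT)
k = len(alignment)
center_sum = 2 + 2  # d(CGT, GTA) + d(CGT, CAG), assuming unit-cost edit distance

v_star = sum_of_pairs(alignment)
upper_bound_v_star = (k - 1) * center_sum  # (k-1) * sum of d(S1, Si)
lower_bound_v0 = k / 2 * center_sum        # (k/2) * sum of d(S1, Si)
ratio_bound = 2 * (k - 1) / k              # 2(k-1)/k, which is 4/3 for k = 3

print(v_star, upper_bound_v_star, lower_bound_v0, ratio_bound)
```

In this example v* meets its upper bound exactly, so the ratio of v* to the lower bound on v0 equals 2(k-1)/k.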
Motif data notation
• A motif is denoted by three parameters
– Its length l
– The number of allowed spaces g
– The number of allowed changes d
– (l, d, g) notation
• Changes and gaps allowed because of mutations across organisms
• In a “good” motif, g and d are small compared to l
• Most work assumes g = 0
Finding the motif consensus
• Assume known motif instance positions and length (e.g., via multiple alignment)
• Also known as the known site problem
• Input: A set of motif instances
• Output: What is the motif consensus?
• Further, is the consensus a valid motif, or is it statistically indistinguishable from what we would expect from other randomly chosen regions?
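For the known site problem with g = 0, one simple way to form a consensus is a per-column majority vote over the instances. The sketch below assumes that definition (the slides do not commit to one), and the four instances are made up purely for illustration. It also reports the largest number of changes between any instance and the consensus, which plays the role of d in the (l, d, g) notation. It does not address whether the consensus is statistically significant.

```python
from collections import Counter

def consensus(instances):
    """Most frequent character in each column of equal-length instances."""
    if len({len(s) for s in instances}) != 1:
        raise ValueError("instances must all have the same length (g = 0 assumed)")
    return "".join(Counter(column).most_common(1)[0][0]
                   for column in zip(*instances))

def changes(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

# Hypothetical motif instances, invented for this example.
instances = ["TATAAT", "TATGAT", "TACAAT", "GATAAT"]
motif = consensus(instances)
l = len(motif)
d = max(changes(motif, s) for s in instances)
print(motif, l, d)
```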
Statistical estimation
• An important approach to many data mining and machine learning tasks
• Requirement: The problem must be expressed as a probability function that depends on a number of modeled parameters whose value is unknown
• The estimation task: Find the optimal values for these parameters
Estimation example
• Can be performed without an explicit probabilistic model
• Example: Futures markets are exchanges where contracts are traded for future execution
• Contract price reflects probabilities of events
Obama contract at intrade.com