Multiple Sequence Alignment Vasileios Hatzivassiloglou University of Texas at Dallas

Post on 02-Jan-2016

Page 1: Multiple Sequence Alignment

Multiple Sequence Alignment

Vasileios Hatzivassiloglou

University of Texas at Dallas

Page 2: Multiple Sequence Alignment

Center star algorithm for multiple sequence global alignment

• T is the set of strings that we want to align

• Pick S ∈ T that minimizes $\sum_{i=1}^{k} d(S, S_i)$, where k is the number of strings in T

• The initial alignment starts with S (≡ S1)

• Suppose we have already aligned S1, S2, ..., Si as S′1, S′2, ..., S′i. We add the remaining strings one at a time: align Si+1 with S′1, obtaining S′i+1 and S′′1. Then replace S′1 with S′′1 and add spaces to S′2, ..., S′i wherever spaces were added to S′1.

Page 3: Multiple Sequence Alignment

Finding S

• S is the best representative of the set T in terms of the distance metric d

• If T is considered as a cluster of strings, then S is the centroid of the cluster

• To find S, align each string with every other ($\binom{k}{2}$ pairs) and calculate the sum for each candidate. Pick the choice that minimizes this sum:

$$S = \operatorname*{argmin}_{S_j \in T} \sum_{i=1}^{k} d(S_j, S_i), \quad \text{assuming } d(x, x) = 0$$
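The selection of S can be sketched in a few lines. This is a minimal illustration, assuming that d is the unit-cost edit distance (the slides leave d unspecified); the function names `edit_distance` and `find_center` are hypothetical.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance (an assumed choice of d), O(len(a) * len(b)) time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # ca against a space
                            curr[j - 1] + 1,            # cb against a space
                            prev[j - 1] + (ca != cb)))  # match or mismatch
        prev = curr
    return prev[-1]


def find_center(strings):
    """Return (index, total) of the string with the smallest summed distance."""
    totals = [0] * len(strings)
    # Each of the k-choose-2 pairs is aligned once and credited to both strings.
    for i, j in combinations(range(len(strings)), 2):
        d = edit_distance(strings[i], strings[j])
        totals[i] += d
        totals[j] += d
    best = min(range(len(strings)), key=totals.__getitem__)
    return best, totals[best]


print(find_center(["GTA", "CGT", "CAG"]))
```

Under this assumed distance, the three strings GTA, CGT, CAG have totals 5, 4, and 5, so CGT (index 1) is selected.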

Page 4: Multiple Sequence Alignment

Example

• Three strings: GTA, CGT, CAG

• Step 1: Calculate all three pairwise distances and pick the string that minimizes total distance; let’s say it’s CGT

• Step 2-1: Align CGT with GTA

CGT-
-GTA

• Step 2-2: Extend uninvolved, processed strings with spaces (not needed now)

Page 5: Multiple Sequence Alignment

Example (continued)

• Step 3-1: Align CGT- with CAG

C-GT-
CAG--

• Step 3-2: Extend uninvolved, processed strings with spaces (-GTA)

C-GT-
--GTA
CAG--
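The align-then-extend steps can be sketched as follows. This is an illustrative sketch under stated assumptions, not the lecture's own code: unit costs, input strings that do not contain "-", and a rule that a space already present in the center row pairs with a new space at no cost. The names `align_to_center` and `center_star` are hypothetical.

```python
GAP = "-"


def align_to_center(center, s):
    """Align the gapped center row with an ungapped string s.

    Returns (new_center, new_s, inserted); inserted[c] is True when column c
    is a space newly added to the center row.
    """
    n, m = len(center), len(s)
    up = [0 if ch == GAP else 1 for ch in center]  # old spaces are free to keep
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + up[i - 1]
    for j in range(1, m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j - 1] + (center[i - 1] != s[j - 1]),
                          D[i - 1][j] + up[i - 1],
                          D[i][j - 1] + 1)
    cols = []  # alignment columns, collected right to left
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + (center[i - 1] != s[j - 1]):
            cols.append((center[i - 1], s[j - 1], False))
            i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i - 1][j] + up[i - 1]:
            cols.append((center[i - 1], GAP, False))
            i -= 1
        else:
            cols.append((GAP, s[j - 1], True))  # new space in the center row
            j -= 1
    cols.reverse()
    new_center = "".join(c for c, _, _ in cols)
    new_s = "".join(t for _, t, _ in cols)
    return new_center, new_s, [ins for _, _, ins in cols]


def center_star(strings, center_index):
    """Align every string to strings[center_index]; rows come back in input order."""
    rows, order = [strings[center_index]], [center_index]
    for k, s in enumerate(strings):
        if k == center_index:
            continue
        _, new_s, inserted = align_to_center(rows[0], s)
        extended = []
        for row in rows:  # add a space wherever the center row gained one
            it = iter(row)
            extended.append("".join(GAP if ins else next(it) for ins in inserted))
        rows = extended + [new_s]
        order.append(k)
    result = [None] * len(strings)
    for k, row in zip(order, rows):
        result[k] = row
    return result


for row in center_star(["GTA", "CGT", "CAG"], center_index=1):
    print(row)
```

Several optimal pairwise alignments usually exist, so the rows printed by this sketch need not match the slide's alignment column for column; ties are broken in favor of matching or substituting characters.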

Page 6: Multiple Sequence Alignment

Algorithm complexity – Finding S

• To find S, we consider k candidates

• For each candidate, we calculate the sum of k-1 terms – $O(k^2)$ such terms total

• If the maximum string length is n, then each term can be calculated in $O(n^2)$ time

• Total for finding S is $O(k^2 n^2)$

Page 7: Multiple Sequence Alignment

Algorithm complexity – Subsequent alignments

• Each subsequent alignment at step i+1 aligns a string S′1 of length at most i·n with a string Si+1 of length at most n

• Each alignment can be found in time $O(in \cdot n)$

• Total time for these alignments is

$$\sum_{i=1}^{k-1} O(in \cdot n) = \sum_{i=1}^{k-1} O(i n^2) = O(k^2 n^2)$$

Page 8: Multiple Sequence Alignment

Algorithm complexity – Extensions with spaces

• At step i+1 there is an extension of i-1 strings, each of length at most i·n

• For each such string, we need to consider a total of n new space positions

• Time required is

$$\sum_{i=2}^{k-1} \big( (i-1) \cdot O(n) \big) = \sum_{i=2}^{k-1} O(in) = O(k^2 n)$$

• Overall total time for the algorithm is $O(k^2 n^2)$

Page 9: Multiple Sequence Alignment

Error bounds

• It is useful to know how far the solution found by an approximate algorithm is from the true optimal solution

• Sometimes (but not always) it is possible to provide error bounds, that is, give upper and lower bounds for the quantity

$$\frac{\text{score(approximate solution)}}{\text{score(optimal solution)}}$$

• Bounds may depend on n and k

Page 10: Multiple Sequence Alignment

Error analysis assumptions

• Sometimes we need additional assumptions in order to derive useful bounds

• For the approximate algorithm for multiple string alignment, we assume the triangle inequality for measure d:

$$\forall i, j, k: \quad d(S_i, S_j) \le d(S_i, S_k) + d(S_k, S_j)$$

Page 11: Multiple Sequence Alignment

Background on distances

• A distance or metric d is formally defined as a function A×A→ℜ on a set A (called a metric space) with the following properties:

– d(x,y)≥0 (non-negativity)

– d(x,y)=0 iff x=y (identity of indiscernibles)

– d(x,y)=d(y,x) (symmetry)

– d(x,y)≤d(x,z)+d(z,y) (triangle inequality)

• Metric spaces include ℜ (with d(x,y)=|x-y|), all Euclidean spaces, the Lp spaces, and inner product spaces.

Page 12: Multiple Sequence Alignment

Background on distances

• A distance or metric d is formally defined as a function A×A→ℜ on a set A (called a metric space) with the following properties:

– d(x,y)≥0 (follows from properties 2, 3, and 4)

– d(x,y)=0 iff x=y (relaxing this property gives a pseudometric)

– d(x,y)=d(y,x) (dropping this property gives a quasimetric)

– d(x,y)≤d(x,z)+d(z,y) (dropping this property gives a semimetric)

• Metric spaces include ℜ (with d(x,y)=|x-y|), all Euclidean spaces, the Lp spaces, and inner product spaces.

Page 13: Multiple Sequence Alignment

Deriving an error bound

• Let v0 be the score for the optimal alignment and v* the score for the alignment produced by the center star algorithm

• Let d0(i,j) (d*(i,j)) be the corresponding induced distances on strings Si and Sj

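$$v_0 = \frac{1}{2} \sum_{i=1}^{k} \sum_{\substack{j=1 \\ j \ne i}}^{k} d_0(i, j)$$

$$v^* = \frac{1}{2} \sum_{i=1}^{k} \sum_{\substack{j=1 \\ j \ne i}}^{k} d^*(i, j)$$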
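To make the score concrete, the following sketch computes the sum of induced pairwise distances for a given multiple alignment. It assumes unit costs (a mismatch or a character against a space costs 1, two spaces in the same column cost 0); the function names are hypothetical.

```python
from itertools import combinations


def induced_distance(row_a: str, row_b: str) -> int:
    """Distance induced on two rows of a multiple alignment (unit costs assumed)."""
    # Equal symbols, including space against space, contribute nothing.
    return sum(1 for x, y in zip(row_a, row_b) if x != y)


def sum_of_pairs(rows) -> int:
    """Half the double sum over ordered pairs, i.e. one term per unordered pair."""
    return sum(induced_distance(a, b) for a, b in combinations(rows, 2))


# The alignment from the worked example on the earlier slides.
print(sum_of_pairs(["C-GT-", "--GTA", "CAG--"]))  # prints 8
```

For the example alignment the induced distances are 2, 2, and 4, giving a score of 8.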

Page 14: Multiple Sequence Alignment

Lower bound for v0

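$$\begin{aligned}
v_0 &= \frac{1}{2} \sum_{i=1}^{k} \sum_{\substack{j=1 \\ j \ne i}}^{k} d_0(i, j) \\
&\ge \frac{1}{2} \sum_{i=1}^{k} \sum_{\substack{j=1 \\ j \ne i}}^{k} d(S_i, S_j) \\
&\ge \frac{1}{2} \sum_{i=1}^{k} \sum_{j=2}^{k} d(S_1, S_j) \\
&= \frac{k}{2} \sum_{j=2}^{k} d(S_1, S_j)
\end{aligned}$$

• First inequality: Because the induced distance can be no less than the distance between the strings themselves

• Second inequality: Choice of S1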

Page 15: Multiple Sequence Alignment

Upper bound for v*

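$$\begin{aligned}
v^* &= \frac{1}{2} \sum_{i=1}^{k} \sum_{\substack{j=1 \\ j \ne i}}^{k} d^*(i, j) \\
&\le \frac{1}{2} \sum_{i=1}^{k} \sum_{\substack{j=1 \\ j \ne i}}^{k} \left[ d^*(i, 1) + d^*(1, j) \right] \\
&= \frac{1}{2} \sum_{i=1}^{k} \left[ (k-1)\, d^*(1, i) + \sum_{j=1}^{k} d^*(1, j) - d^*(1, i) \right] \\
&= \frac{k-2}{2} \sum_{i=1}^{k} d^*(1, i) + \frac{k}{2} \sum_{i=1}^{k} d^*(1, i) \\
&= (k-1) \sum_{i=1}^{k} d^*(1, i) = (k-1) \sum_{i=2}^{k} d^*(1, i) \\
&= (k-1) \sum_{i=2}^{k} d(S_1, S_i)
\end{aligned}$$

• Second line: Triangle inequality

• Third line: Symmetry

• Fifth line: $d^*(1, 1) = 0$

• Last line: Each string is aligned with S1 optimally (there may be additional spaces in matching positions, which do not change the distance)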

Page 16: Multiple Sequence Alignment

Combining the bounds

$$\frac{v^*}{v_0} \le \frac{(k-1) \sum_{i=2}^{k} d(S_1, S_i)}{\frac{k}{2} \sum_{i=2}^{k} d(S_1, S_i)} = \frac{2(k-1)}{k} < 2$$

• Better bound for low k
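A quick numerical look at the ratio 2(k-1)/k shows how the guarantee tightens for small k. This is only an arithmetic illustration; the function name `center_star_ratio_bound` is hypothetical.

```python
from fractions import Fraction


def center_star_ratio_bound(k: int) -> Fraction:
    """Upper bound 2(k-1)/k on v*/v0 for k >= 2 strings."""
    if k < 2:
        raise ValueError("the bound is stated for at least two strings")
    return Fraction(2 * (k - 1), k)


for k in (2, 3, 4, 10, 100):
    bound = center_star_ratio_bound(k)
    print(k, bound, float(bound))
```

For k = 3 the bound is 4/3, for k = 4 it is 3/2, and it approaches (but never reaches) 2 as k grows.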

Page 17: Multiple Sequence Alignment

Motif data notation

• A motif is denoted by three parameters

– Its length l

– The number of allowed spaces g

– The number of allowed changes d

– (l, d, g) notation

• Changes and gaps allowed because of mutations across organisms

• In a “good” motif, g and d are small compared to l

• Most work assumes g = 0

Page 18: Multiple Sequence Alignment

Finding the motif consensus

• Assume known motif instance positions and length (e.g., via multiple alignment)

• Also known as the known site problem

• Input: A set of motif instances

• Output: What is the motif consensus?

• Further, is the consensus a valid motif, or is it statistically indistinguishable from what we would expect from other randomly chosen regions?
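One simple way to read off a consensus, assuming gap-free instances of equal length (g = 0), is a per-column majority vote. The sketch below is an illustration rather than the method used in the lecture; the function name and the sample instances are hypothetical, and it does not address the statistical validity question.

```python
from collections import Counter


def consensus(instances):
    """Most frequent symbol in each column of equal-length motif instances."""
    if not instances:
        raise ValueError("need at least one motif instance")
    length = len(instances[0])
    if any(len(s) != length for s in instances):
        raise ValueError("all instances must have the same length")
    # zip(*instances) walks the instances column by column.
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*instances))


# Hypothetical instances, made up for illustration only.
sites = ["TATAAT", "TATGAT", "TACAAT", "GATAAT"]
print(consensus(sites))  # prints TATAAT
```

When two symbols tie in a column, `Counter.most_common` returns the one encountered first, so a real analysis would need an explicit tie-breaking rule.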

Page 19: Multiple Sequence Alignment

Statistical estimation

• An important approach to many data mining and machine learning tasks

• Requirement: The problem must be expressed as a probability function that depends on a number of modeled parameters whose value is unknown

• The estimation task: Find the optimal values for these parameters
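As a small illustration of the estimation task, consider symbols drawn independently from an unknown categorical distribution. The relative frequencies are the maximum-likelihood parameter values for that model. The model choice, the function names, and the sample data are assumptions made for this sketch.

```python
import math
from collections import Counter


def mle_categorical(observations):
    """Maximum-likelihood estimate of symbol probabilities: relative frequencies."""
    counts = Counter(observations)
    total = len(observations)
    return {symbol: count / total for symbol, count in counts.items()}


def log_likelihood(observations, probs):
    """Log-probability of the observations under independent draws from probs."""
    return sum(math.log(probs[x]) for x in observations)


column = "AAAT"  # hypothetical observations, e.g. one column of motif instances
estimate = mle_categorical(column)
print(estimate)
print(log_likelihood(column, estimate))
print(log_likelihood(column, {"A": 0.5, "T": 0.5}))
```

The estimated parameters give a higher log-likelihood on the observed data than the uniform alternative, which is what "optimal values" means under this criterion.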

Page 20: Multiple Sequence Alignment

Estimation example

• Can be performed without an explicit probabilistic model

• Example: Futures markets are exchanges where contracts are traded for future execution

• Contract price reflects probabilities of events

Page 21: Multiple Sequence Alignment

Obama contract at intrade.com