Post on 19-Dec-2015
Multiple Sequence Alignments
The Global Alignment problem
AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC
(Figure: three sequences labeled x, y, z)
Definition
• Given N sequences x1, x2,…, xN: Insert gaps (-) in each sequence xi, such that
• All sequences have the same length L
• Score of the global map is maximum
• A faint similarity between two sequences becomes significant if present in many
• Multiple alignments can help improve the pairwise alignments
Scoring Function: Sum Of Pairs
Definition: Induced pairwise alignment
A pairwise alignment induced by the multiple alignment
Example:
x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG
Induces:
x: ACGCGG-C      x: AC-GCGG-C      y: AC-GCGAG
y: ACGC-GAG      z: GCCGC-GAG      z: GCCGCGAG
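The projection in the example above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function names and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for the example.

```python
from itertools import combinations

def induced_pair(row_a, row_b):
    """Project two rows of a multiple alignment, dropping gap-gap columns."""
    cols = [(a, b) for a, b in zip(row_a, row_b) if not (a == "-" and b == "-")]
    return "".join(a for a, _ in cols), "".join(b for _, b in cols)

def pair_score(row_a, row_b, match=1, mismatch=-1, gap=-1):
    """Score a pairwise alignment with a linear gap model (assumed values)."""
    total = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            total += gap
        elif a == b:
            total += match
        else:
            total += mismatch
    return total

def sum_of_pairs(alignment):
    """Sum the scores of all induced pairwise alignments."""
    return sum(
        pair_score(*induced_pair(alignment[k], alignment[l]))
        for k, l in combinations(range(len(alignment)), 2)
    )

msa = ["AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"]
for k, l in combinations(range(3), 2):
    print(induced_pair(msa[k], msa[l]))
print(sum_of_pairs(msa))
```

Only the gap-gap column is dropped when projecting; every other column of the multiple alignment survives in the induced pair.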
Sum Of Pairs (cont’d)
• Heuristic way to incorporate evolution tree:
(Figure: evolutionary tree with leaves Human, Mouse, Chicken, Duck)
• Weighted SOP:
$S(m) = \sum_{k<l} w_{kl} \, s(m^k, m^l)$
$w_{kl}$: weight decreasing with distance
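A minimal sketch of the weighted sum-of-pairs score follows. The slides only say that the weight decreases with distance; the form w = 1 / (1 + d), the distance values, and the scoring parameters below are all assumptions made for illustration.

```python
from itertools import combinations

def pair_score(row_a, row_b, match=1, mismatch=-1, gap=-1):
    """Score two rows of a multiple alignment; gap-gap columns are skipped."""
    total = 0
    for a, b in zip(row_a, row_b):
        if a == "-" and b == "-":
            continue  # this column vanishes in the induced pairwise alignment
        if a == "-" or b == "-":
            total += gap
        elif a == b:
            total += match
        else:
            total += mismatch
    return total

def weighted_sop(rows, dist):
    """rows: name -> aligned row; dist: frozenset({k, l}) -> tree distance."""
    total = 0.0
    for k, l in combinations(sorted(rows), 2):
        w = 1.0 / (1.0 + dist[frozenset((k, l))])  # assumed weight function
        total += w * pair_score(rows[k], rows[l])
    return total

rows = {"x": "AC-GCGG-C", "y": "AC-GC-GAG", "z": "GCCGC-GAG"}
# Toy distances (hypothetical): x and y are close, z is farther from both.
dist = {frozenset("xy"): 1, frozenset("xz"): 3, frozenset("yz"): 3}
print(weighted_sop(rows, dist))
```

With all distances set to 0 every weight is 1 and the function reduces to the plain sum-of-pairs score.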
A Profile Representation
• Given a multiple alignment M = m1…mn
Replace each column mi with profile entry pi
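• Frequency of each letter in the alphabet
• # gaps
• Optional: # gap openings, extensions, closings
Alignment:
- A G G C T A T C A C C T G
T A G - C T A C C A - - - G
C A G - C T A C C A - - - G
C A G - C T A T C A C - G G
C A G - C T A T C G C - G G
Profile (one column per alignment column):
column           1   2   3   4   5   6   7   8   9  10  11  12  13  14
A                0   1   0   0   0   0   1   0   0  .8   0   0   0   0
C               .6   0   0   0   1   0   0  .4   1   0  .6  .2   0   0
G                0   0   1  .2   0   0   0   0   0  .2   0   0  .4   1
T               .2   0   0   0   0   1   0  .6   0   0   0   0  .2   0
O (gap open)    .2   0   0  .8   0   0   0   0   0   0  .4  .4   0   0
E (gap extend)   0   0   0   0   0   0   0   0   0   0   0  .4   0   0
C (gap close)   .2   0   0  .8   0   0   0   0   0   0   0  .4  .2   0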
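As a sketch of how the letter and gap frequencies of a profile can be computed from an alignment (the function name and the dictionary layout are assumptions, and the optional gap-open, gap-extend and gap-close counts are not computed here):

```python
from collections import Counter

def profile(alignment, alphabet="ACGT-"):
    """Return one dict per column mapping each symbol to its frequency."""
    n = len(alignment)
    result = []
    for column in zip(*alignment):
        counts = Counter(column)
        result.append({ch: counts[ch] / n for ch in alphabet})
    return result

msa = [
    "-AGGCTATCACCTG",
    "TAG-CTACCA---G",
    "CAG-CTACCA---G",
    "CAG-CTATCAC-GG",
    "CAG-CTATCGC-GG",
]
p = profile(msa)
print(p[0])   # first column: mostly C, one T, one gap
print(p[3])   # fourth column: mostly gaps
```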
Multiple Sequence Alignments
Algorithms
1. Multidimensional Dynamic Programming
Generalization of Needleman-Wunsch:
$S(m) = \sum_i S(m_i)$
(sum of column scores)
$F(i_1, i_2, \ldots, i_N)$: Optimal alignment up to $(i_1, \ldots, i_N)$
$F(i_1, i_2, \ldots, i_N) = \max_{\text{all neighbors of cube}} \big( F(\text{nbr}) + S(\text{nbr}) \big)$
• Example: in 3D (three sequences):
• 7 neighbors/cell
$$F(i,j,k) = \max \begin{cases}
F(i-1,j-1,k-1) + S(x_i, x_j, x_k) \\
F(i-1,j-1,k) + S(x_i, x_j, -) \\
F(i-1,j,k-1) + S(x_i, -, x_k) \\
F(i-1,j,k) + S(x_i, -, -) \\
F(i,j-1,k-1) + S(-, x_j, x_k) \\
F(i,j-1,k) + S(-, x_j, -) \\
F(i,j,k-1) + S(-, -, x_k)
\end{cases}$$
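The three-sequence recurrence can be sketched directly. This is an illustrative implementation under assumptions not stated on the slides: the column score S is taken to be a sum-of-pairs score with match +1, mismatch -1, gap -1, and gap-gap pairs scoring 0. It returns only the optimal score, not the alignment itself.

```python
from itertools import product

def sp_column(a, b, c, match=1, mismatch=-1, gap=-1):
    """Sum-of-pairs score of one column (gap-gap pairs score 0, an assumption)."""
    total = 0
    for p, q in ((a, b), (a, c), (b, c)):
        if p == "-" and q == "-":
            continue
        total += gap if "-" in (p, q) else (match if p == q else mismatch)
    return total

def align3(x, y, z):
    """Optimal sum-of-pairs score of three sequences by 3D dynamic programming."""
    F = {(0, 0, 0): 0}
    # The 7 neighbours: every non-empty choice of sequences that advance.
    moves = [m for m in product((0, 1), repeat=3) if any(m)]
    for i, j, k in product(range(len(x) + 1), range(len(y) + 1), range(len(z) + 1)):
        if (i, j, k) == (0, 0, 0):
            continue
        best = float("-inf")
        for di, dj, dk in moves:
            pi, pj, pk = i - di, j - dj, k - dk
            if min(pi, pj, pk) < 0:
                continue  # neighbour lies outside the cube
            column = (x[pi] if di else "-", y[pj] if dj else "-", z[pk] if dk else "-")
            best = max(best, F[pi, pj, pk] + sp_column(*column))
        F[i, j, k] = best
    return F[len(x), len(y), len(z)]

print(align3("ACG", "ACG", "ACG"))
```

Cells are visited in lexicographic order, so every neighbour of a cell has already been filled in when the cell is computed.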
1. Multidimensional Dynamic Programming
Running Time:
1. Size of matrix: $L^N$,
where L = length of each sequence, N = number of sequences
2. Neighbors/cell: $2^N - 1$
Therefore: $O(2^N L^N)$
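To get a feel for these numbers, the cell and neighbor counts can be computed for small inputs. The helper below is a hypothetical illustration; it enumerates the neighbors explicitly rather than using the closed form.

```python
from itertools import product

def dp_cost(L, N):
    """Return (cells, neighbours per cell, total cell-neighbour evaluations)."""
    cells = L ** N
    # A neighbour is any non-empty subset of the N sequences that advances.
    neighbours = sum(1 for move in product((0, 1), repeat=N) if any(move))
    return cells, neighbours, cells * neighbours

for n in (2, 3, 4, 5):
    print(n, dp_cost(100, n))
```

Already for five sequences of length 100 the table has 10^10 cells, which is why the exact method is limited to a handful of short sequences.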
1. Multidimensional Dynamic Programming
• How do affine gaps generalize?
• VERY badly! Require $2^N$ states, one per combination of gapped/ungapped sequences
Running time: $O(2^N \cdot 2^N \cdot L^N) = O(4^N L^N)$
(Figure: the states for three sequences, labeled X, Y, Z, XY, XZ, YZ, XYZ)
2. Progressive Alignment
• When evolutionary tree is known:
Align closest first, in the order of the tree
In each step, align two sequences x, y, or profiles px, py, to generate a new alignment with associated profile presult
Weighted version:
Tree edges have weights, proportional to the divergence in that edge
New profile is a weighted average of two old profiles
(Figure: tree with leaves x, w, y, z)
Example
Profile: (A, C, G, T, -)
px = (0.8, 0.2, 0, 0, 0)
py = (0.6, 0, 0, 0, 0.4)
s(px, py) = 0.8*0.6*s(A, A) + 0.2*0.6*s(C, A) + 0.8*0.4*s(A, -) + 0.2*0.4*s(C, -)
Result: pxy = (0.7, 0.1, 0, 0, 0.2)
s(px, -) = 0.8*1.0*s(A, -) + 0.2*1.0*s(C, -)
Result: px- = (0.4, 0.1, 0, 0, 0.5)
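The arithmetic of this example can be checked with a short sketch. The letter-level scores s(a, b) are not given on the slide, so the values below (match +1, mismatch -1, gap -1, gap-gap 0) are assumptions; the merged profiles do not depend on them.

```python
ALPHABET = "ACGT-"

def s(a, b, match=1, mismatch=-1, gap=-1):
    """Letter-level score (assumed values)."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch

def profile_score(p, q):
    """Expected score of pairing a letter drawn from p with one drawn from q."""
    return sum(pa * qb * s(a, b)
               for a, pa in zip(ALPHABET, p)
               for b, qb in zip(ALPHABET, q))

def merge(p, q, wp=0.5, wq=0.5):
    """Weighted average of two profile columns (equal weights by default)."""
    return tuple(wp * a + wq * b for a, b in zip(p, q))

px = (0.8, 0.2, 0, 0, 0)
py = (0.6, 0, 0, 0, 0.4)
gap_column = (0, 0, 0, 0, 1.0)

print(profile_score(px, py), merge(px, py))
print(profile_score(px, gap_column), merge(px, gap_column))
```

In the weighted version of progressive alignment, `wp` and `wq` would be derived from the tree edge weights instead of being fixed at 0.5.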
2. Progressive Alignment
• When evolutionary tree is unknown:
1. Perform all pairwise alignments
2. Define distance matrix D, where D(x, y) is a measure of evolutionary distance, based on pairwise alignment
3. Construct a tree (we will describe more in detail later in the course)
4. Align on the tree
(Figure: unknown tree over the sequences x, w, y, z)
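The first three steps can be sketched as follows. The slides leave both the distance measure and the tree-building method open, so this example makes two assumptions: D(x, y) is the plain edit distance, and the tree is built by average-linkage clustering (a UPGMA-style procedure). The sequences are toy data.

```python
from itertools import combinations

def edit_distance(x, y):
    """Unit-cost edit distance, used here as a stand-in for D(x, y)."""
    prev = list(range(len(y) + 1))
    for i, a in enumerate(x, 1):
        cur = [i]
        for j, b in enumerate(y, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = cur
    return prev[-1]

def guide_tree(seqs):
    """Average-linkage clustering; returns the tree as nested tuples of names."""
    D = {frozenset(p): edit_distance(seqs[p[0]], seqs[p[1]])
         for p in combinations(seqs, 2)}
    clusters = {name: [name] for name in seqs}  # subtree -> member names

    def dist(a, b):
        pairs = [(m, n) for m in clusters[a] for n in clusters[b]]
        return sum(D[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda p: dist(*p))
        clusters[(a, b)] = clusters.pop(a) + clusters.pop(b)
    return next(iter(clusters))

toy = {"a": "AAAA", "b": "AAAT", "c": "GGGG", "d": "GCCC"}
print(guide_tree(toy))
```

The nested tuples give the order for step 4: the innermost pairs are aligned first, then the resulting profiles are aligned to each other.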
Aligning two alignments
• Given two alignments, m1, m2, can we find the optimal alignment under SOP scoring, with affine gaps?
NP-hard!
GTAGTCAGTCG   x   m1
---GTCACGTG   y
GTCGTCAGTCG   z   m2
--CGCCAGGGG   w
--CGCCAGGGA   v

m1   x   GGGCACTGCAT
     y   GGTTACGTC--
m2   z   GGGAACTGCAG
     w   GGACGTACC--
     v   GGACCT-----
Optimistic: assume no gap, so don't pay the gap-open penalty
Pessimistic: assume gap, so pay the gap-open penalty
Heuristics to improve multiple alignments
• Iterative refinement schemes
• A*-based search
• Consistency
• Simulated Annealing
• …
Iterative Refinement
One problem of progressive alignment:
• Initial alignments are "frozen" even when new evidence comes
Example:
x: GAAGTT
y: GAC-TT   (frozen!)
z: GAACTG
w: GTACTG
Once z and w are added, it is clear that the correct alignment is y = GA-CTT
Iterative Refinement
Algorithm (Barton-Sternberg):
1. Align most similar xi, xj
2. Align xk most similar to (xixj)
3. Repeat 2 until (x1…xN) are aligned
4. For j = 1 to N, remove xj, and realign to x1…xj-1xj+1…xN
5. Repeat 4 until convergence
Note: Guaranteed to converge
Iterative Refinement
For each sequence y:
1. Remove y
2. Realign y (while rest fixed)
(Figure: axes x, y, z; x,z fixed projection; allow y to vary)
Iterative Refinement
Example: align (x,y), (z,w), (xy, zw):
x: GAAGTTA
y: GAC-TTA
z: GAACTGA
w: GTACTGA
After realigning y:
x: GAAGTTA
y: G-ACTTA   (+ 3 matches)
z: GAACTGA
w: GTACTGA
Iterative Refinement
Example not handled well:
x: GAAGTTA
y1: GAC-TTA
y2: GAC-TTA
y3: GAC-TTA
z: GAACTGA
w: GTACTGA
Realigning any single yi changes nothing
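The remove-and-realign step of iterative refinement can be sketched as a sequence-to-alignment dynamic program. This is an illustrative version, not the published Barton-Sternberg implementation: it uses an assumed linear gap model (match +1, mismatch -1, gap -1) and scores the removed sequence against every row of the fixed alignment.

```python
def column_score(ch, column, match=1, mismatch=-1, gap=-1):
    """Score of placing `ch` against one column of the fixed alignment."""
    total = 0
    for other in column:
        if ch == "-" and other == "-":
            continue
        total += gap if "-" in (ch, other) else (match if ch == other else mismatch)
    return total

def realign(seq, fixed):
    """Align ungapped `seq` to the fixed rows; returns (new row, fixed rows)."""
    cols = [c for c in zip(*fixed) if set(c) != {"-"}]  # drop all-gap columns
    n, m, k = len(seq), len(cols), len(fixed)
    empty = ("-",) * k
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = F[i - 1][0] + column_score(seq[i - 1], empty)
    for j in range(1, m + 1):
        F[0][j] = F[0][j - 1] + column_score("-", cols[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + column_score(seq[i - 1], cols[j - 1]),
                          F[i - 1][j] + column_score(seq[i - 1], empty),
                          F[i][j - 1] + column_score("-", cols[j - 1]))
    row, out, i, j = [], [], n, m
    while i > 0 or j > 0:  # traceback from the bottom-right corner
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + column_score(seq[i - 1], cols[j - 1]):
            row.append(seq[i - 1]); out.append(cols[j - 1]); i -= 1; j -= 1
        elif j > 0 and F[i][j] == F[i][j - 1] + column_score("-", cols[j - 1]):
            row.append("-"); out.append(cols[j - 1]); j -= 1
        else:
            row.append(seq[i - 1]); out.append(empty); i -= 1
    row.reverse(); out.reverse()
    return "".join(row), ["".join(c[r] for c in out) for r in range(k)]

def refine(alignment, rounds=10):
    """Repeat the remove-and-realign step until nothing changes."""
    rows = list(alignment)
    for _ in range(rounds):
        changed = False
        for idx in range(len(rows)):
            rest = rows[:idx] + rows[idx + 1:]
            new_row, new_rest = realign(rows[idx].replace("-", ""), rest)
            candidate = new_rest[:idx] + [new_row] + new_rest[idx:]
            if candidate != rows:
                rows, changed = candidate, True
        if not changed:
            break
    return rows

print(realign("GACTTA", ["GAAGTTA", "GAACTGA", "GTACTGA"])[0])
```

On the four-sequence example, realigning y against x, z, w moves the gap and gains three matches. In the y1, y2, y3 example, each copy is realigned against the other two frozen copies, so the gap stays where it is.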
A* for Multiple Alignments
Review of the A* algorithm
(Figure: graph with nodes START, v, GOAL)
• Say that we have a gigantic graph G
• START: start node
• GOAL: we want to reach this node with the minimum path
Dijkstra: O(VlogV + E) – too slow if the number of edges is huge
A*: a way of finding the optimal solution faster in practice
A* for Multiple Alignments
Review of the A* algorithm
(Figure: nodes START, v, GOAL, with the path labeled g(v) and h(v))
• g(v) is the cost so far
• h(v) is an estimate of the minimum cost from v to GOAL
• f(v) = g(v) + h(v) is the estimated minimum cost of a path passing through v; when h(v) never overestimates, the true cost is ≥ g(v) + h(v)
1. Expand v with the smallest f(v)
2. Never expand v, if f(v) ≥ shortest path to the goal found so far
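A minimal sketch of A* on an explicit graph, using Python's `heapq` as the priority queue. The graph, the heuristic values, and the function signature are made up for illustration; the heuristic is assumed never to overestimate the remaining cost.

```python
import heapq

def a_star(graph, start, goal, h):
    """graph: node -> list of (neighbour, edge cost); h: estimate of cost to goal.

    Returns the cost of a cheapest path, or None if the goal is unreachable.
    """
    g = {start: 0}                      # best known cost so far
    frontier = [(h(start), start)]      # entries are (f(v), v)
    while frontier:
        f, v = heapq.heappop(frontier)
        if v == goal:
            return g[v]
        if f > g[v] + h(v):
            continue                    # stale entry: a cheaper route to v was found
        for u, cost in graph.get(v, []):
            new_g = g[v] + cost
            if new_g < g.get(u, float("inf")):
                g[u] = new_g
                heapq.heappush(frontier, (new_g + h(u), u))
    return None

graph = {
    "START": [("a", 1), ("b", 4)],
    "a": [("b", 1), ("GOAL", 5)],
    "b": [("GOAL", 1)],
}
estimate = {"START": 2, "a": 2, "b": 1, "GOAL": 0}
print(a_star(graph, "START", "GOAL", estimate.get))
```

With h(v) = 0 for every node the same code behaves like Dijkstra's algorithm; a tighter estimate lets it skip more of the graph.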
Lemma
Given sequences x, y, z, …
The sum-of-pairs score of a multiple alignment M is no better than the sum of the optimal pairwise alignment scores
Proof
M induces projected pairwise alignments axy, ayz, axz, …, and Score(M) = d(axy) + d(axz) + d(ayz) + …
Each induced alignment is itself a pairwise alignment of its two sequences, so each d(.) is no better than the optimal pairwise score (the optimal edit distance) for that pair
A* for Multiple Alignments
• Nodes: Cells in the DP matrix
• g(v): alignment cost so far
• h(v): sum-of-pairs of individual pairwise alignments
• Initial minimum alignment cost estimate: sum-of-pairs of global pairwise alignments
To compute h(v)
For each pair of sequences x, y, compute FR(x, y), the DP matrix of scores of aligning a suffix of x to a suffix of y
Then, at position (i1, i2, …, iN), h(v) becomes the sum of (N choose 2) FR scores
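The heuristic described above can be sketched with one backward DP table per pair of sequences. The example works with costs (lower is better) and assumes unit mismatch and gap costs; the names `suffix_costs` and `make_heuristic` are hypothetical.

```python
from itertools import combinations

def suffix_costs(x, y, mismatch=1, gap=1):
    """FR[i][j] = optimal cost of aligning the suffix x[i:] with the suffix y[j:]."""
    n, m = len(x), len(y)
    FR = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        FR[i][m] = FR[i + 1][m] + gap
    for j in range(m - 1, -1, -1):
        FR[n][j] = FR[n][j + 1] + gap
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            FR[i][j] = min(FR[i + 1][j + 1] + (mismatch if x[i] != y[j] else 0),
                           FR[i + 1][j] + gap,
                           FR[i][j + 1] + gap)
    return FR

def make_heuristic(seqs):
    """Build h(v) for a cell v = (i1, ..., iN) of the multidimensional DP matrix."""
    tables = {(k, l): suffix_costs(seqs[k], seqs[l])
              for k, l in combinations(range(len(seqs)), 2)}

    def h(pos):
        # Sum of the (N choose 2) pairwise suffix costs at this position.
        return sum(t[pos[k]][pos[l]] for (k, l), t in tables.items())

    return h

seqs = ["GAAGTT", "GACTT", "GAACTG"]
h = make_heuristic(seqs)
print(h((0, 0, 0)), h((6, 5, 6)))
```

By the lemma, no multiple alignment of the remaining suffixes can cost less than this sum, so h never overestimates and A* remains exact.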
Consistency
(Figure: sequences x, y, z with positions xi, yj, yj', zk)
Consistency
Basic method for applying consistency
• Compute all pairs of alignments xy, xz, yz, …
• When aligning x, y during progressive alignment,
For each (xi, yj), let s(xi, yj) = function_of(xi, yj, axz, ayz)
Align x and y with DP using the modified s(.,.) function
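One possible choice for the modified scoring function is sketched below. The slides leave function_of unspecified, so everything here is an assumption: the base score is increased by a fixed bonus whenever xi and yj are both aligned to the same position zk in the pairwise alignments axz and ayz.

```python
def aligned_positions(row_a, row_b):
    """Map residue index in a -> residue index in b for letter-letter columns."""
    mapping, i, j = {}, 0, 0
    for a, b in zip(row_a, row_b):
        if a != "-" and b != "-":
            mapping[i] = j
        i += a != "-"
        j += b != "-"
    return mapping

def consistency_score(x, y, a_xz, a_yz, base, bonus=1):
    """Return s(i, j): base score plus a bonus when z supports pairing xi with yj."""
    x_to_z = aligned_positions(*a_xz)
    y_to_z = aligned_positions(*a_yz)

    def s(i, j):
        score = base(x[i], y[j])
        if i in x_to_z and x_to_z[i] == y_to_z.get(j):
            score += bonus  # xi and yj are aligned to the same zk
        return score

    return s

def base(a, b):
    return 1 if a == b else -1

x, y = "GAAGTT", "GACTT"
a_xz = ("GAAGTT", "GAACTG")   # pairwise alignment of x and z (toy data)
a_yz = ("GA-CTT", "GAACTG")   # pairwise alignment of y and z (toy data)
s = consistency_score(x, y, a_xz, a_yz, base)
print(s(3, 2), s(2, 2))
```

Here the pairing of x's fourth letter with y's C gets the bonus because both align to the same letter of z, which nudges the x, y alignment toward y = GA-CTT.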
Some Resources
Genome Resources
Annotation and alignment genome browser at UCSC: http://genome.ucsc.edu/cgi-bin/hgGateway
Specialized VISTA alignment browser at LBNL: http://pipeline.lbl.gov/cgi-bin/gateway2
ABC (nice Stanford tool for browsing alignments): http://encode.stanford.edu/~asimenos/ABC/
Protein Multiple Aligners
http://www.ebi.ac.uk/clustalw/ CLUSTALW – most widely used
http://phylogenomics.berkeley.edu/cgi-bin/muscle/input_muscle.py MUSCLE – most scalable
http://probcons.stanford.edu/ PROBCONS – most accurate
Whole-genome alignment Rat—Mouse—Human
Next 2 years: 20+ mammals, & many other animals, will be sequenced & aligned