
Multiple Sequence Alignments

The Global Alignment problem

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

(Figure: the sequences x, y, z to be aligned.)

Definition

• Given N sequences x1, x2,…, xN: Insert gaps (-) in each sequence xi, such that

• All sequences have the same length L

• Score of the global map is maximum

• A faint similarity between two sequences becomes significant if it is present in many sequences

• Multiple alignments can help improve the pairwise alignments

Scoring Function: Sum Of Pairs

Definition: Induced pairwise alignment

A pairwise alignment induced by the multiple alignment: keep the rows of the two sequences and drop every column in which both have a gap.

Example:

x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG

Induces:

x: ACGCGG-C     x: AC-GCGG-C     y: AC-GCGAG
y: ACGC-GAG     z: GCCGC-GAG     z: GCCGCGAG
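The projection in the example can be sketched in a few lines of Python. The function name `induced_pairwise` and the dictionary layout are illustrative choices, not part of the slides.

```python
def induced_pairwise(rows, a, b):
    """Project a multiple alignment onto rows a and b, dropping gap-gap columns."""
    kept = [(p, q) for p, q in zip(rows[a], rows[b])
            if not (p == "-" and q == "-")]
    return "".join(p for p, _ in kept), "".join(q for _, q in kept)

# The multiple alignment from the example above.
msa = {"x": "AC-GCGG-C", "y": "AC-GC-GAG", "z": "GCCGC-GAG"}

xy = induced_pairwise(msa, "x", "y")
xz = induced_pairwise(msa, "x", "z")
yz = induced_pairwise(msa, "y", "z")
```

Only the (x, y) and (y, z) projections lose a column here, because x and z never have a gap in the same position.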

Sum Of Pairs (cont’d)

• Heuristic way to incorporate the evolutionary tree:

(Figure: tree with leaves Human, Mouse, Duck, Chicken.)

• Weighted SOP:

$S(m) = \sum_{k<l} w_{kl}\, s(m^k, m^l)$

$w_{kl}$: weight decreasing with distance
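A minimal sketch of the weighted sum-of-pairs score follows. The scoring values (match +1, mismatch -1, gap -2, gap-gap 0) and the example weights are assumptions made for illustration; the slides do not fix them.

```python
from itertools import combinations

def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score two aligned rows column by column (assumed linear gap model)."""
    total = 0
    for p, q in zip(a, b):
        if p == "-" and q == "-":
            continue                      # gap-gap columns contribute nothing
        if p == "-" or q == "-":
            total += gap
        else:
            total += match if p == q else mismatch
    return total

def weighted_sop(rows, weights=None):
    """Sum over pairs k<l of w_kl * s(m^k, m^l); all weights default to 1."""
    total = 0.0
    for k, l in combinations(sorted(rows), 2):
        w = 1.0 if weights is None else weights[(k, l)]
        total += w * pair_score(rows[k], rows[l])
    return total

msa = {"x": "AC-GCGG-C", "y": "AC-GC-GAG", "z": "GCCGC-GAG"}
plain = weighted_sop(msa)
# Hypothetical weights: z is assumed to be farther from x and y on the tree.
weighted = weighted_sop(msa, {("x", "y"): 1.0, ("x", "z"): 0.5, ("y", "z"): 0.5})
```

With all weights equal to 1 this reduces to the plain sum-of-pairs score.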

A Profile Representation

• Given a multiple alignment M = m1…mn

• Replace each column mi with profile entry pi
  • Frequency of each letter of the alphabet
  • # gaps
  • Optional: # gap openings, extensions, closings

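Alignment (5 sequences, 14 columns):

-  A  G  G  C  T  A  T  C  A  C  C  T  G
T  A  G  -  C  T  A  C  C  A  -  -  -  G
C  A  G  -  C  T  A  C  C  A  -  -  -  G
C  A  G  -  C  T  A  T  C  A  C  -  G  G
C  A  G  -  C  T  A  T  C  G  C  -  G  G

Profile (letter frequencies per column):

Position   1   2   3   4   5   6   7   8   9  10  11  12  13  14
A          0   1   0   0   0   0   1   0   0  .8   0   0   0   0
C         .6   0   0   0   1   0   0  .4   1   0  .6  .2   0   0
G          0   0   1  .2   0   0   0   0   0  .2   0   0  .4   1
T         .2   0   0   0   0   1   0  .6   0   0   0   0  .2   0

Gap statistics (nonzero entries, in column order):

O (openings):   .2 .8 .4 .4
E (extensions): .4
C (closings):   .2 .8 .4 .2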
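The letter frequencies of a profile can be computed directly from the alignment columns. This is a small sketch covering only letter and gap frequencies; the optional gap-opening, extension, and closing counts are left out, and the function name `profile` is illustrative.

```python
from collections import Counter

def profile(rows, alphabet="ACGT-"):
    """Return one dict per column, mapping each symbol to its frequency."""
    n = len(rows)
    columns = []
    for column in zip(*rows):
        counts = Counter(column)
        columns.append({s: counts.get(s, 0) / n for s in alphabet})
    return columns

# The five-sequence alignment from the slide.
rows = [
    "-AGGCTATCACCTG",
    "TAG-CTACCA---G",
    "CAG-CTACCA---G",
    "CAG-CTATCAC-GG",
    "CAG-CTATCGC-GG",
]
p = profile(rows)
```

For example, `p[0]` holds the frequencies of the first column, where C appears in three of the five rows.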

Multiple Sequence Alignments

Algorithms

1. Multidimensional Dynamic Programming

Generalization of Needleman-Wunsch:

$S(m) = \sum_i S(m_i)$ (sum of column scores)

$F(i_1, i_2, \dots, i_N)$: optimal alignment up to $(i_1, \dots, i_N)$

$F(i_1, i_2, \dots, i_N) = \max_{\text{all neighbors of cube}} \big( F(\text{nbr}) + S(\text{nbr}) \big)$

• Example: in 3D (three sequences):

• 7 neighbors/cell

$$
F(i,j,k) = \max \begin{cases}
F(i-1,j-1,k-1) + S(x_i, x_j, x_k) \\
F(i-1,j-1,k) + S(x_i, x_j, -) \\
F(i-1,j,k-1) + S(x_i, -, x_k) \\
F(i-1,j,k) + S(x_i, -, -) \\
F(i,j-1,k-1) + S(-, x_j, x_k) \\
F(i,j-1,k) + S(-, x_j, -) \\
F(i,j,k-1) + S(-, -, x_k)
\end{cases}
$$
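The seven-neighbor recurrence can be written compactly by enumerating the nonzero 0/1 move vectors. The sketch below returns only the optimal score, and it assumes a sum-of-pairs column score with match +1, mismatch -1, gap -2; those values and the function names are illustrative.

```python
from itertools import combinations, product

def column_score(col, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of a single column (assumed scoring values)."""
    total = 0
    for a, b in combinations(col, 2):
        if a == "-" and b == "-":
            continue
        if a == "-" or b == "-":
            total += gap
        else:
            total += match if a == b else mismatch
    return total

def align3(x, y, z):
    """Optimal score of aligning three sequences with 3D dynamic programming."""
    seqs = (x, y, z)
    # The 7 neighbors: every way for at least one sequence to advance.
    moves = [m for m in product((0, 1), repeat=3) if any(m)]
    F = {(0, 0, 0): 0}
    for i in range(len(x) + 1):
        for j in range(len(y) + 1):
            for k in range(len(z) + 1):
                if (i, j, k) == (0, 0, 0):
                    continue
                best = None
                for m in moves:
                    prev = (i - m[0], j - m[1], k - m[2])
                    if min(prev) < 0:
                        continue          # neighbor falls outside the cube
                    # Advancing sequences contribute a letter, the others a gap.
                    col = tuple(s[p - 1] if d else "-"
                                for s, p, d in zip(seqs, (i, j, k), m))
                    cand = F[prev] + column_score(col)
                    if best is None or cand > best:
                        best = cand
                F[(i, j, k)] = best
    return F[(len(x), len(y), len(z))]
```

A call such as `align3("GAAGTT", "GACTT", "GAACTG")` fills a cube of (L+1)^3 cells, which is already noticeably slower than the pairwise case.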


Running Time:

1. Size of matrix: $L^N$, where L = length of each sequence and N = number of sequences

2. Neighbors/cell: $2^N - 1$

Therefore: $O(2^N L^N)$
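To make the counts concrete, the neighbor set and the total work can be enumerated for small N. The helper names below are illustrative, and the cell count uses L + 1 per dimension because the matrix includes the empty prefix.

```python
from itertools import product

def neighbor_offsets(n):
    """All nonzero 0/1 vectors: each sequence either advances or takes a gap."""
    return [m for m in product((0, 1), repeat=n) if any(m)]

def dp_work(length, n):
    """Number of cells times neighbors per cell, roughly 2^N * L^N."""
    return (length + 1) ** n * len(neighbor_offsets(n))

counts = {n: len(neighbor_offsets(n)) for n in (2, 3, 5)}
work_3 = dp_work(100, 3)   # three sequences of length 100
work_5 = dp_work(100, 5)   # five sequences of length 100
```

Going from three to five sequences of length 100 multiplies the work by several tens of thousands, which is why exact multidimensional DP is limited to a handful of sequences.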


• How do affine gaps generalize?

• VERY badly! Require $2^N$ states, one per combination of gapped/ungapped sequences
  Running time: $O(2^N \cdot 2^N \cdot L^N) = O(4^N L^N)$

(Figure: the gap states for three sequences: X, Y, Z, XY, XZ, YZ, XYZ.)

2. Progressive Alignment

• When evolutionary tree is known:

  • Align closest first, in the order of the tree
  • In each step, align two sequences x, y, or profiles px, py, to generate a new alignment with associated profile presult

Weighted version:
  • Tree edges have weights, proportional to the divergence in that edge
  • New profile is a weighted average of two old profiles

(Figure: tree with leaves x, w, y, z.)

Example

Profile: (A, C, G, T, -)
px = (0.8, 0.2, 0, 0, 0)
py = (0.6, 0, 0, 0, 0.4)

s(px, py) = 0.8*0.6*s(A, A) + 0.2*0.6*s(C, A) + 0.8*0.4*s(A, -) + 0.2*0.4*s(C, -)

Result: pxy = (0.7, 0.1, 0, 0, 0.2)

s(px, -) = 0.8*1.0*s(A, -) + 0.2*1.0*s(C, -)

Result: px- = (0.4, 0.1, 0, 0, 0.5)
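The arithmetic in this example can be checked with a short sketch. The letter scores used here (match +1, mismatch -1, gap -2) are assumed values, since the slide leaves s(., .) symbolic, and the equal-weight merge corresponds to the unweighted case.

```python
ALPHABET = "ACGT-"

def s(a, b, match=1.0, mismatch=-1.0, gap=-2.0):
    """Assumed letter-to-letter score; a gap against a gap scores 0."""
    if a == "-" and b == "-":
        return 0.0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch

def profile_score(px, py):
    """Expected score: sum over letters a, b of px[a] * py[b] * s(a, b)."""
    return sum(pa * pb * s(a, b)
               for a, pa in zip(ALPHABET, px)
               for b, pb in zip(ALPHABET, py))

def merge(px, py, wx=0.5, wy=0.5):
    """New profile entry as a weighted average of the two old entries."""
    return tuple(wx * a + wy * b for a, b in zip(px, py))

px = (0.8, 0.2, 0.0, 0.0, 0.0)
py = (0.6, 0.0, 0.0, 0.0, 0.4)
gap_column = (0.0, 0.0, 0.0, 0.0, 1.0)

score_xy = profile_score(px, py)
pxy = merge(px, py)
score_x_gap = profile_score(px, gap_column)
px_gap = merge(px, gap_column)
```

With tree-edge weights, `wx` and `wy` would be set from the divergence on each edge instead of 0.5 each.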

2. Progressive Alignment

• When evolutionary tree is unknown:

  1. Perform all pairwise alignments
  2. Define distance matrix D, where D(x, y) is a measure of evolutionary distance, based on pairwise alignment
  3. Construct a tree (we will describe more in detail later in the course)
  4. Align on the tree

(Figure: sequences x, w, y, z joined by a tree that is not yet known.)
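The first three steps can be sketched as follows. Two assumptions are made here that the slides do not specify: the distance is the unit-cost edit distance from a pairwise alignment, and the tree is built by greedy average-linkage clustering. The function names are illustrative.

```python
from itertools import combinations

def edit_distance(a, b):
    """Unit-cost global alignment distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # gap in b
                           cur[j - 1] + 1,             # gap in a
                           prev[j - 1] + (ca != cb)))  # match or mismatch
        prev = cur
    return prev[-1]

def guide_tree(seqs):
    """Greedy average-linkage clustering; returns a nested tuple of names."""
    clusters = {name: [name] for name in seqs}   # tree label -> member names
    dist = {frozenset(p): edit_distance(seqs[p[0]], seqs[p[1]])
            for p in combinations(seqs, 2)}

    def d(c1, c2):
        pairs = [(a, b) for a in clusters[c1] for b in clusters[c2]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda p: d(*p))
        clusters[(c1, c2)] = clusters.pop(c1) + clusters.pop(c2)
    return next(iter(clusters))

# Hypothetical toy input with two obvious groups.
toy = {"a": "AAAA", "b": "AAAT", "c": "GGGG", "d": "GGGC"}
tree = guide_tree(toy)
```

Progressive alignment would then walk the nested tuple from the leaves up, aligning the two children of each node.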


Aligning two alignments

• Given two alignments, m1, m2, can we find the optimal alignment under SOP scoring, with affine gaps?

NP-hard!

m1   x  GTAGTCAGTCG
     y  ---GTCACGTG
m2   z  GTCGTCAGTCG
     w  --CGCCAGGGG
     v  --CGCCAGGGA

m1   x  GGGCACTGCAT
     y  GGTTACGTC--
m2   z  GGGAACTGCAG
     w  GGACGTACC--
     v  GGACCT-----

Optimistic: assume no gap, so don’t pay the gap-open penalty
Pessimistic: assume a gap, so pay the gap-open penalty

Heuristics to improve multiple alignments

• Iterative refinement schemes

• A*-based search

• Consistency

• Simulated Annealing

• …

Iterative Refinement

One problem of progressive alignment:
• Initial alignments are “frozen” even when new evidence comes

Example:

x: GAAGTT
y: GAC-TT   (frozen!)

z: GAACTG
w: GTACTG

Now it is clear that the correct alignment is y = GA-CTT


Algorithm (Barton-Sternberg):

1. Align most similar xi, xj

2. Align xk most similar to (xixj)

3. Repeat 2 until (x1…xN) are aligned

4. For j = 1 to N, remove xj and realign it to x1…xj-1xj+1…xN

5. Repeat 4 until convergence

Note: Guaranteed to converge


For each sequence y:
1. Remove y
2. Realign y (while the rest stay fixed)

(Figure: sequences x, y, z; the projection onto x, z is fixed while y is allowed to vary.)


Example: align (x,y), (z,w), (xy, zw):

x: GAAGTTA
y: GAC-TTA
z: GAACTGA
w: GTACTGA

After realigning y:

x: GAAGTTA
y: G-ACTTA   (+ 3 matches)
z: GAACTGA
w: GTACTGA
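One refinement step, removing y and realigning it against the fixed rows, can be sketched as a sequence-to-alignment DP. The scoring values (match +1, mismatch -1, gap -2) and the function name `realign` are assumptions for illustration; the slides do not give the scoring used.

```python
def s(a, b, match=1, mismatch=-1, gap=-2):
    """Assumed letter score; a gap against a gap scores 0."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch

def realign(fixed_rows, y):
    """Align ungapped y to the columns of a fixed alignment (sum of pairs vs. each row)."""
    cols = list(zip(*fixed_rows))
    n, m, k = len(cols), len(y), len(fixed_rows)

    def vs(col, c):
        return sum(s(r, c) for r in col)

    new_col = k * s("-", "A")            # a letter of y against a new all-gap column
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = F[i - 1][0] + vs(cols[i - 1], "-")
    for j in range(1, m + 1):
        F[0][j] = F[0][j - 1] + new_col
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + vs(cols[i - 1], y[j - 1]),
                          F[i - 1][j] + vs(cols[i - 1], "-"),
                          F[i][j - 1] + new_col)
    # Trace back from the corner to recover the aligned rows.
    out_y, out_cols, i, j = [], [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + vs(cols[i - 1], y[j - 1]):
            out_y.append(y[j - 1]); out_cols.append(cols[i - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + vs(cols[i - 1], "-"):
            out_y.append("-"); out_cols.append(cols[i - 1]); i -= 1
        else:
            out_y.append(y[j - 1]); out_cols.append(("-",) * k); j -= 1
    rows = ["".join(r) for r in zip(*reversed(out_cols))]
    return rows, "".join(reversed(out_y)), F[n][m]

fixed, new_y, score = realign(["GAAGTTA", "GAACTGA", "GTACTGA"], "GACTTA")
```

Under these assumed scores the step moves the gap in y by one column, in line with the example above. A full Barton-Sternberg loop would repeat this for every sequence until no row changes.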


Example not handled well:

x:  GAAGTTA
y1: GAC-TTA
y2: GAC-TTA
y3: GAC-TTA

z:  GAACTGA
w:  GTACTGA

Realigning any single yi changes nothing

A* for Multiple Alignments

Review of the A* algorithm

(Figure: a graph with nodes START, v, GOAL.)

• Say that we have a gigantic graph G
• START: start node
• GOAL: we want to reach this node with the minimum path

Dijkstra: O(VlogV + E) – too slow if the number of edges is huge

A*: a way of finding the optimal solution faster in practice


Review of the A* algorithm

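(Figure: a path from START to v of cost g(v), and from v to GOAL of estimated cost h(v).)

• g(v) is the cost so far
• h(v) is an estimate of the minimum cost from v to GOAL that never overestimates it
• f(v) ≥ g(v) + h(v), where f(v) is the minimum cost of a path passing through v

1. Expand the node v with the smallest g(v) + h(v)
2. Never expand v if g(v) + h(v) ≥ the cost of the shortest path to the goal found so far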
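A generic A* search over an implicit graph can be sketched with a priority queue. The callback interface (`neighbors`, `h`) and the toy grid are illustrative assumptions; the search is only guaranteed optimal when h never overestimates the remaining cost.

```python
import heapq
from itertools import count

def a_star(start, goal, neighbors, h):
    """Return the minimum path cost from start to goal, or None if unreachable.

    neighbors(v) yields (u, edge_cost) pairs; h(v) must be a lower bound on
    the remaining cost, with h(goal) == 0.
    """
    g = {start: 0}                       # best known cost so far
    tie = count()                        # tie-breaker so the heap never compares nodes
    heap = [(h(start), next(tie), start)]
    while heap:
        f, _, v = heapq.heappop(heap)
        if f > g[v] + h(v):              # stale entry: a cheaper path to v was found later
            continue
        if v == goal:
            return g[v]
        for u, cost in neighbors(v):
            new_g = g[v] + cost
            if new_g < g.get(u, float("inf")):
                g[u] = new_g
                heapq.heappush(heap, (new_g + h(u), next(tie), u))
    return None

# Toy usage: unit-cost moves on a 4x4 grid with a Manhattan-distance estimate.
def grid_neighbors(v, size=4):
    i, j = v
    for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        if 0 <= i + di < size and 0 <= j + dj < size:
            yield (i + di, j + dj), 1

def manhattan(v):
    return abs(3 - v[0]) + abs(3 - v[1])

best = a_star((0, 0), (3, 3), grid_neighbors, manhattan)
```

Passing `lambda v: 0` as the estimate turns the same routine into Dijkstra's algorithm, which returns the same cost but typically expands more nodes.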

Lemma
Given sequences x, y, z, …
The sum-of-pairs score of a multiple alignment M is no better than the sum of the scores of the optimal pairwise alignments

Proof
M induces projected pairwise alignments axy, ayz, axz, …, and Score(M) = d(axy) + d(axz) + d(ayz) + …

Each d(.) is no better than the value of the optimal pairwise alignment of the same two sequences; as an edit distance, it is at least the optimal edit distance


• Nodes: Cells in the DP matrix
• g(v): alignment cost so far
• h(v): sum-of-pairs of individual pairwise alignments
• Initial minimum alignment cost estimate: sum-of-pairs of global pairwise alignments

To compute h(v)

For each pair of sequences x, y, compute FR(x, y), the DP matrix of scores of aligning a suffix of x to a suffix of y

Then, at position (i1, i2, …, iN), h(v) becomes the sum of (N choose 2) FR scores
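The suffix tables and the resulting estimate can be sketched as below. The sketch works with costs (unit mismatch and gap costs are assumed) so that the estimate is a lower bound suitable for a minimizing search; the names `suffix_costs` and `make_heuristic` are illustrative.

```python
from itertools import combinations

def suffix_costs(x, y, mismatch=1, gap=1):
    """FR[i][j] = minimum cost of aligning suffix x[i:] with suffix y[j:]."""
    n, m = len(x), len(y)
    FR = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        FR[i][m] = FR[i + 1][m] + gap
    for j in range(m - 1, -1, -1):
        FR[n][j] = FR[n][j + 1] + gap
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            FR[i][j] = min(FR[i + 1][j + 1] + (mismatch if x[i] != y[j] else 0),
                           FR[i + 1][j] + gap,
                           FR[i][j + 1] + gap)
    return FR

def make_heuristic(seqs):
    """Build h(pos): the sum of the (N choose 2) pairwise suffix costs at pos."""
    tables = {(k, l): suffix_costs(seqs[k], seqs[l])
              for k, l in combinations(range(len(seqs)), 2)}

    def h(pos):
        return sum(t[pos[k]][pos[l]] for (k, l), t in tables.items())

    return h

seqs = ["GAAGTT", "GACTT", "GAACTG"]
h = make_heuristic(seqs)
start_estimate = h((0, 0, 0))   # equals the sum of the three global pairwise costs
```

Each table is computed once in O(L^2) time, after which every evaluation of h is a handful of lookups.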

Consistency

(Figure: sequences x, y, z with positions xi, yj, yj’, zk.)

Consistency

Basic method for applying consistency

• Compute all pairs of alignments xy, xz, yz, …

• When aligning x, y during progressive alignment,
  • For each (xi, yj), let s(xi, yj) = function_of(xi, yj, axz, ayz)
  • Align x and y with DP using the modified s(.,.) function
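One possible choice of function_of is sketched below: add a bonus to s(xi, yj) whenever some zk is aligned to both xi and yj. This particular rule, the bonus value, and the helper names are assumptions for illustration; the slides leave the function unspecified.

```python
def aligned_pairs(a, b):
    """Set of (i, j) index pairs that share a column in a pairwise alignment."""
    pairs, i, j = set(), 0, 0
    for p, q in zip(a, b):
        if p != "-" and q != "-":
            pairs.add((i, j))
        i += (p != "-")
        j += (q != "-")
    return pairs

def consistency_score(x, y, a_xz, a_yz, base, bonus=1.0):
    """Return s'(i, j) = base(x[i], y[j]) + bonus if z supports pairing x_i with y_j."""
    x_to_z = dict(a_xz)                      # i -> k
    z_to_y = {k: j for j, k in a_yz}         # k -> j

    def score(i, j):
        k = x_to_z.get(i)
        supported = k is not None and z_to_y.get(k) == j
        return base(x[i], y[j]) + (bonus if supported else 0.0)

    return score

def base(a, b):
    """Assumed base score: +1 for a match, -1 for a mismatch."""
    return 1.0 if a == b else -1.0

# Pairwise alignments of x and y against z (taken from the induced alignments
# of the sum-of-pairs example).
a_xz = aligned_pairs("AC-GCGG-C", "GCCGC-GAG")
a_yz = aligned_pairs("AC-GCGAG", "GCCGCGAG")
s_mod = consistency_score("ACGCGGC", "ACGCGAG", a_xz, a_yz, base)
```

The modified `s_mod(i, j)` would then replace the plain substitution score inside the pairwise DP for x and y.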


Some Resources

Genome Resources

Annotation and alignment genome browser at UCSC
http://genome.ucsc.edu/cgi-bin/hgGateway

Specialized VISTA alignment browser at LBNL
http://pipeline.lbl.gov/cgi-bin/gateway2

ABC: nice Stanford tool for browsing alignments
http://encode.stanford.edu/~asimenos/ABC/

Protein Multiple Aligners

http://www.ebi.ac.uk/clustalw/ CLUSTALW – most widely used

http://phylogenomics.berkeley.edu/cgi-bin/muscle/input_muscle.py MUSCLE – most scalable

http://probcons.stanford.edu/ PROBCONS – most accurate

Whole-genome alignment Rat—Mouse—Human

Next 2 years: 20+ mammals, & many other animals, will be sequenced & aligned
