multiple alignment by aligning alignments

61
Multiple alignment by aligning alignments Travis Wheeler John Kececioglu Department of Computer Science University of Arizona Extended from talk given at ISMB ‘07

Upload: yardley-kim

Post on 30-Dec-2015

29 views

Category:

Documents


1 download

DESCRIPTION

Multiple alignment by aligning alignments. Travis Wheeler John Kececioglu Department of Computer Science University of Arizona. Extended from talk given at ISMB ‘07. Multiple sequence alignment. Sequence alignment central to computational biology: Functional conservation - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Multiple alignment by aligning alignments

Multiple alignment by aligning alignments

Travis WheelerJohn Kececioglu

Department of Computer ScienceUniversity of Arizona

Extended from talk given at ISMB ‘07

Page 2: Multiple alignment by aligning alignments

Multiple sequence alignment

Sequence alignment central to computational

biology:

• Functional conservation

• Phylogenetic analysis

• Signals of selection

• Prediction of structure

• Comparative genomics

• and many others …

Page 3: Multiple alignment by aligning alignments

Aligning two sequences

A two-sequence alignment, with affine gap-costs,

is scored,

SCYAGNSSTEPYAVAHHQLLAHAKVVDLYRK-----------AHNSST

SYCA-------EAVAHHQLLAH---VDQYRKHAKVVDLYRKVAHNSST... ...

()

+()+ ( )substituti

onscore

total gap length

number of gaps

columns

Page 4: Multiple alignment by aligning alignments

Sum-of-pairs:

SCYAGNSSTEPYAVA--QLLAHAKV--------

--YAGNSSTEPYAVA---LLAHAKVVDSCYAGN

SCYAGNSSTEPYAVA--QLLA-AKVVDSCY---

-----NSSTEPYAVA--QLLAHAKVVDSCY---

SCYAGNSSTE----AHHQLLAHAKVVDSCY---

SCYAGNSSTEPYAVAHHQLLA--KV--------

-------STEPYAVAHHQLLAHAKVVDSCYAGN

wi,j

...

score( i, j )i,j

Scoring a multiple alignment

Carrillo, Lipman 1988

iji

j’

i

j”

score( i, j’ )score( i, j” )

Optimal alignment of multiple sequences is NP-complete

Page 5: Multiple alignment by aligning alignments

Form-and-polish strategy

1. Choosing parameters

2. Constructing the merge treea. Grouping sequences

b. Measuring distances

3. Weighting sequence pairs

4. Merging alignments

5. Polishing the alignment

Page 6: Multiple alignment by aligning alignments

Form-and-polish strategy

1. Choosing parameters

2. Constructing the merge treea. Grouping sequences

b. Measuring distances

3. Weighting sequence pairs

4. Merging alignments

5. Polishing the alignment

Page 7: Multiple alignment by aligning alignments

Constructing merge tree

aagccgtt

aattt

aatccgtt

catccgtt

aacctacg

gacagg

gatac

Feng, Doolittle 1987

Page 8: Multiple alignment by aligning alignments

Merging alignments

Feng, Doolittle 1987

aagccgtt

aattt

aatccgtt

catccgtt

aacctacg

gacagg

gatac

Page 9: Multiple alignment by aligning alignments

Merging alignments

aagccgtt

aattt

aatccgtt

catccgtt

aacctacg

gacagg

gatac gatac-gacagg

aagccgttaa---ttt

aatccgttcatccgtt

Feng, Doolittle 1987

Page 10: Multiple alignment by aligning alignments

Merging alignments

-

-

-

-

-

-

-

-

-

-

-

-

Page 11: Multiple alignment by aligning alignments

Merging alignments

aagccgttaa---ttt

aatccgttcatccgtt

aatccgttcatccgtt

aagccgttaa---ttt

aagccgttaa---ttt

aatccgttcatccgtt

aagccgttaa---ttt

Page 12: Multiple alignment by aligning alignments

Polishing the alignment

A

B

• Split alignment into groups• Realign groups• Repeat

Berger, Munson 1991

CA’

B’

Page 13: Multiple alignment by aligning alignments

Summary of main stages

• Construct tree • Polish• Merge alignments

Page 14: Multiple alignment by aligning alignments

Alignment Quality

Correct alignment

?

Computed alignment

SPS :# substitutions recovered

# substitutions in correct alignment

TC : # columns correctly recovered

# columns in correct alignment

Page 15: Multiple alignment by aligning alignments

Benchmark datasets• Benchmark suites

• BAliBase [Thompson et al. 1999; Bahr et al. 2001]• PALI [Balaji et al 2001]• SABmark [Van Walle et al 2004]

• All based on structural alignment of proteins

• Characteristics• 900 alignments • 10 sequences per alignment, on average• 400 columns per alignment, on average

• Core blocks• Regions with high support

Page 16: Multiple alignment by aligning alignments

Form-and-polish review

1. Choosing parameters

2. Constructing the merge treea. Grouping sequences

b. Measuring distances

3. Weighting sequence pairs

4. Merging alignments

5. Polishing the alignment

Page 17: Multiple alignment by aligning alignments

Grouping methods

• Neighbor joining (NJ) [Saitou, Nei 1987]

• Unweighted-pair group method with arithmetic mean (UPGMA) [Sneath, Sokal 1973]

• Minimum spanning tree (MST)

• Dynamic alignment distance (DAD)

Page 18: Multiple alignment by aligning alignments

Grouping sequences

aagccgtt

aattt

aatccgtt

catccgtt

aacctacg

gacagg

gatac

• Methods differ in measuring distances for new groups

Page 19: Multiple alignment by aligning alignments

Comparing grouping methods

Grouping method

BAliBaseSABmar

kPALI Average

MST 79.4 44.1 -0.7 67.8

UPGMA -1.4 -1.4 80.5 -0.7

NJ -2.0 -2.0 -3.3 -2.2

DAD -1.2 -0.6 -7.5 -2.9

• Best grouping method ≠ best phylogeny method

Page 20: Multiple alignment by aligning alignments

Form-and-polish review

1. Choosing parameters

2. Constructing the merge treea. Grouping sequences

b. Measuring distances

3. Weighting sequence pairs

4. Merging alignments

5. Polishing the alignment

Page 21: Multiple alignment by aligning alignments

ANEH--TRAHDHHSSQ

Measuring distances

Percent identity

Normalized alignment cost

ANEH--TRAHDHHSSQ

ANEH--TRAHDHHSSQ

ANEH--TRAHDHHSSQ

Compressed identity

Page 22: Multiple alignment by aligning alignments

Comparing distance methods

Tree method BAliBase SABmark PALI Average

Normalized cost 81.6 48.2 83.0 70.9

Compressed identity

-2.2 -4.1 -3.2 -3.1

Percent identity -3.1 -4.7 -3.1 -3.6

• Normalized cost is very simple, and gives greatest gains

Page 23: Multiple alignment by aligning alignments

Form-and-polish review

1. Choosing parameters

2. Constructing the merge treea. Grouping sequences

b. Measuring distances

3. Weighting sequence pairs

4. Merging alignments

5. Polishing the alignment

Page 24: Multiple alignment by aligning alignments

gapcoun

t

Aligning alignments

gapcoun

t

()

+()+ ( )substituti

onscore

gap length

columns

wi,j i,j

( )

-

-

-

-

-

-

-

-

-

-

-

-

-

-

-

-

-

-

-

-

-

-

-

-

-

-

-

-

-

-

-

-

-

-

-

-

-

-

-

-

-

-

-

-

Page 25: Multiple alignment by aligning alignments

Merging methods

Exact gap counts [Gotoh 1993 ; Kececioglu, Starrett 2004]

• k sequences, n columns• O(5k n2) worst case• O(k2 n2) time in practice

Pessimistic gap counts [Altschul 1989; Kececioglu, Zhang 1998]

• Overestimates gap startups• O(kn + n2) worst case• 100-fold speedup with 20 sequences per alignment

Page 26: Multiple alignment by aligning alignments

Comparing merging methods

Merging method BAliBase SABmark PALI Average

Exact 82.4 48.4 84.0 71.6

Pessimistic -0.8 -0.2 -1.0 -0.7

• Pessimistic heuristic may be sufficient for large inputs

Page 27: Multiple alignment by aligning alignments

Form-and-polish review

1. Choosing parameters

2. Constructing the merge treea. Grouping sequences

b. Measuring distances

3. Weighting sequence pairs

4. Merging alignments

5. Polishing the alignment

Page 28: Multiple alignment by aligning alignments

Polishing methods

Two-cut method• Random partition [Probcons Do et al. 2005]

• Tree-based partition

Page 29: Multiple alignment by aligning alignments

Polishing methods

Two-cut method• Random partition [Probcons Do et al. 2005]

• Tree-based partition Randomly cut edges [MAFFT Katoh et al. 2005]

Exhaustively cut edges [Muscle Edgar 2004]

• Can’t fix two small misaligned groups

Page 30: Multiple alignment by aligning alignments

Polishing methods

Two-cut method• Random partition [Probcons Do et al. 2005]

• Tree-based partition Randomly cut edges [MAFFT Katoh et al. 2005]

Exhaustively cut edges [Muscle Edgar 2004]

• Can’t fix two small misaligned groups

Page 31: Multiple alignment by aligning alignments

Polishing methods

Two-cut method• Random partition [Probcons Do et al. 2005]

• Tree-based partition Randomly cut edges [MAFFT Katoh et al. 2005]

Exhaustively cut edges [Muscle Edgar 2004]

• Can’t fix two small misaligned groups

Page 32: Multiple alignment by aligning alignments

Polishing methods

Two-cut method• Random partition [Probcons Do et al. 2005]

• Tree-based partition Randomly cut edges [MAFFT Katoh et al. 2005]

Exhaustively cut edges [Muscle Edgar 2004]

• Can’t fix two small misaligned groups

Three-cut method• Tree-based, random

Page 33: Multiple alignment by aligning alignments

Polishing methods

Two-cut method• Random partition [Probcons Do et al. 2005]

• Tree-based partition Randomly cut edges [MAFFT Katoh et al. 2005]

Exhaustively cut edges [Muscle Edgar 2004]

• Can’t fix two small misaligned groups

Three-cut method• Tree-based, random

On-the-fly method [Subbiah, Harrison 1989]

Page 34: Multiple alignment by aligning alignments

Comparing polishing methods

Polishing method BAliBaseSABmar

kPALI Average

3-cut + on-the-fly -0.1 50.2 -0.2 73.1

3-cut -0.2 -0.5 84.8 -0.2

2-cut 84.4 -0.4 -0.1 -0.2

2-cut + on-the-fly -0.8 -0.2 -0.3 -0.4

On-the-fly -1.1 -0.6 -0.4 -0.7

None -2.0 -1.8 -0.8 -1.5

• 3-cut achieves 2-cut quality in less time

• On-the-fly speeds up 2-cut convergence

Page 35: Multiple alignment by aligning alignments

Form-and-polish review

1. Choosing parameters

2. Constructing the merge treea. Grouping sequences

b. Measuring distances

3. Weighting sequence pairs

4. Merging alignments

5. Polishing the alignment

Page 36: Multiple alignment by aligning alignments

Weighting sequence pairs

A B C20 seqs 2 seqs 50 seqs

Page 37: Multiple alignment by aligning alignments

Weighting sequence pairs

A

B

C

40 A-B pair alignments

(20)

(2)

(50)

100 B-C pair alignments

1000 A-C pair alignments

Page 38: Multiple alignment by aligning alignments

Weighting methods

Covariance weights [Altschul, et al. 1989]

• Based on correlation between paths• Approximated in practice [Gotoh 1995]

• Used in MAFFT

Division weights [Thompson, et al. 1994]

• Edge lengths divided among leaves• Used in ClustalW, Muscle

Page 39: Multiple alignment by aligning alignments

Weighting anomalies

Covariance weights

v

xy

i

zl vx

l vy

l vz

if l vx = 2 l vy

then wyz 2wxz

wyz ≈ wxz

even for very long l vz

(expect: )

Page 40: Multiple alignment by aligning alignments

Weighting anomalies

Division weights

v

x y

i

z

l vx l vy

l vz

if l vx = l vy

wxz = wyz

<< l vz

(expect: )

then wxz = wyz ≈ 2wxy

≈wxy

2 l vz

Page 41: Multiple alignment by aligning alignments

Weighting methods

Covariance weights [Altschul, et al. 1989]

• Based on correlation between paths• Approximated in practice [Gotoh 1995]

• Used in MAFFT

Division weights [Thompson, et al. 1994]

• Edge lengths divided among leaves• Used in ClustalW, Muscle

Influence weights• Based on the influence of leaf j on i, i,j

Page 42: Multiple alignment by aligning alignments

Influence weights

i j

B CA

Page 43: Multiple alignment by aligning alignments

Influence weights

j

i

A’ A”

BC

Page 44: Multiple alignment by aligning alignments

x

y z

i

Influence weights

Page 45: Multiple alignment by aligning alignments

x

y z

iSx(y) : l (x,y) + ( l (e) )

e T(y)

T(y) : tree under y

L(y) : set of leaves under y

Hx(y) : avg path length from x to L(y)

Nx(y) =Sx(y)

Hx(y)(effective # sequences)

(size)

(height)

Influence weights

(subtree)

(leaf set)

Hx(y)

Page 46: Multiple alignment by aligning alignments

x

z

i

Nx(y) ≈ 1Nx(y) ≈ k

k seqs

y

y

Sx(y) : l (x,y) + ( l (e) ) e T(y)

T(y) : tree under y

L(y) : set of leaves under y

Hx(y) : avg path length from x to L(y)

Nx(y) =Sx(y)

Hx(y)

Influence weights

(effective # sequences)

(size)

(height)

(subtree)

(leaf set)

Page 47: Multiple alignment by aligning alignments

Influence weights

x

y z

i

Split wx = wy + wz according to the ratio:

Nx(y)

Nx(z)

wy

wz

=Hi(z)

Hi(y)

wx

Sx(y) : l (x,y) + ( l (e) ) e T(y)

T(y) : tree under y

L(y) : set of leaves under y

Hx(y) : avg path length from x to L(y)

Nx(y) =Sx(y)

Hx(y)(effective # sequences)

(size)

(height)

(subtree)

(leaf set)

Page 48: Multiple alignment by aligning alignments

Influence weights

x

i

wx

y

z

Nx(z) ≈ 1Nx(y) ≈ 7

7

8wx

1wx8

Split wx = wy + wz according to the ratio:

Nx(y)

Nx(z)

wy

wz

=Hi(z)

Hi(y)

Sx(y) : l (x,y) + ( l (e) ) e T(y)

T(y) : tree under y

L(y) : set of leaves under y

Hx(y) : avg path length from x to L(y)

Nx(y) =Sx(y)

Hx(y)(effective # sequences)

(size)

(height)

(subtree)

(leaf set)

Hi(y) ≈ Hi(z)

Page 49: Multiple alignment by aligning alignments

Influence weights

x

y z

i

wx

Hi(y) = 5

2

3wx

Hi(z) = 10

1wx3

Split wx = wy + wz according to the ratio:

Nx(y)

Nx(z)

wy

wz

=Hi(z)

Hi(y)

Sx(y) : l (x,y) + ( l (e) ) e T(y)

T(y) : tree under y

L(y) : set of leaves under y

Hx(y) : avg path length from x to L(y)

Nx(y) =Sx(y)

Hx(y)(effective # sequences)

(size)

(height)

(subtree)

(leaf set)

Nx(y) ≈ Nx(z)

Page 50: Multiple alignment by aligning alignments

Influence weights

• Influence i,j) is the weight wj

wi,j score( i, j )i,j

SP score =

• Define wiji,j) j,i)

Page 51: Multiple alignment by aligning alignments

Comparing weighting methods

Weighting methodAverag

eBAliBase

references 2 & 3

Influence 71.6 83.3

None 71.6 -0.5

Division 71.6 -0.8

Covariance -0.3 -1.8

• Weights have little impact on the complete suite

Weighting methodAverag

e

Influence 71.6

None 71.6

Division 71.6

Covariance -0.3

• … but influence weights show promise

Page 52: Multiple alignment by aligning alignments

Form-and-polish review

1. Choosing parameters

2. Constructing the merge treea. Grouping sequences

b. Measuring distances

3. Weighting sequence pairs

4. Merging alignments

5. Polishing the alignment

Page 53: Multiple alignment by aligning alignments

Choosing parameters

Default parameter selection:• Seed value by inverse alignment

• InverseAlign [Kececioglu, Kim 2006] on BAliBase

• Substitution matrix fixed at BLOSUM62

• Evaluated 800 parameter choices near seed

Default can be poor on some sequences:• SABmark superfamily group 287:

Default parameters: 20%

Best parameters: 75%

Page 54: Multiple alignment by aligning alignments

Choosing parameters

Parameter choice BAliBase SABmark PALI Average

Default 84.3 50.2 84.6 73.1

Parameter choice BAliBase SABmark PALI Average

Default 84.3 50.2 84.6 73.1

Oracle (4 options) +1.9 +2.7 +1.6 +2.0

Parameter choice BAliBase SABmark PALI Average

Default 84.3 50.2 84.6 73.1

Oracle (4 options) +1.9 +2.7 +1.6 +2.0

Advisor (4 options) +0.4 +0.3 +0.3 +0.3

SCYAGNSSTEPYAVG--QLLA-AKVVDSCY---

--YAGNSSTEPYAVA---LLAHAKVVDSCYAGN

-----NSSTEPYAVA--QLLAHAKVVDSCY---

SCYAGNSSTEPYAVA--QLLAHAKV--------

-------STEPYAVAHHQLLAHAKVVDSCYAGN

SCYAGNSSTEPYAVAHHQLLA--KV--------

SCYAGNSSTE----PHHQLLAHAKVVDSCY---

Core column: >90% identity (compressed alphabet)

• Effect of the advisor is small, but shows significant potential

Page 55: Multiple alignment by aligning alignments

Best-of-breed methods

Stage Average Method

(Baseline) 67.1

Tree +0.7minimum spanning tree

Distance +3.1 normalized cost

Merge +0.7 exact counts

Polish +1.5 3-cut

Parameters +0.3 advisor

(Combined) 73.4

Page 56: Multiple alignment by aligning alignments

Best-of-breed methods

Stage Average Method

(Baseline) 67.1

Tree +0.7minimum spanning tree

Distance +3.1 normalized cost

Merge +0.7 exact counts

Polish +1.5 3-cut

Parameters +0.3 advisor

(Combined) 73.4

Page 57: Multiple alignment by aligning alignments

Best-of-breed methods

Stage Average Method

(Baseline) 67.1

Tree +0.7minimum spanning tree

Distance +3.1 normalized cost

Merge +0.7 exact counts

Polish +1.5 3-cut

Parameters +0.3 advisor

(Combined) 73.4

Opal

Page 58: Multiple alignment by aligning alignments

Comparing to other tools

Tool Average

Opal with advisor 73.4

Opal with defaults 73.1

Probcons 73.1

MAFFT 72.9

T-Coffee 69.4

Muscle 69.0

ClustalW 63.9Hydrophobicity

Consistency

4% gain

5% gain

Page 59: Multiple alignment by aligning alignments

Conclusion

• Best-of-breed methods identified

• Opal achieves state-of-the-art accuracy • Does not use consistency or hydrophobicity

• Greatest gains from:• normalized alignment cost for distances

• 3-cut for polishing

• Promising new ideas:• influence weighting

• parameter advising

Page 60: Multiple alignment by aligning alignments

Future work

• Incorporate hydrophobicity in aligning alignments

• Design unbiased recovery measures for alignments with overrepresented groups

• Investigate parameter advisor methods

Page 61: Multiple alignment by aligning alignments

Acknowledgements

• Eagu Kim• David Maddison• Marcy McClure• Dean Starrett

• Research supported in part by • NSF IGERT in Genomics Fellowship• NSF Grant DBI-0317498

• Travel fellowship from• US National Science Foundation