overview - vital-it · 2013. 6. 28. · 8 globins =>150 000 years how to align many sequences?...


CN+LF-2005.02

An introduction to multiple alignments

© Cédric Notredame

Swiss Institute of Bioinformatics


Overview

Multiple alignments: how-to, goal, problems, use

Patterns: PROSITE database, syntax, use

PSI-BLAST: BLAST, matrices, use

[ Profiles/HMMs ] …


Overview

What are multiple alignments?
How can I use my alignments?
How does the computer align the sequences?

The progressive alignment algorithm
What are the difficulties?
Pre-requisite?

How can we compare sequences?
How can we align sequences?


Sometimes two sequences are not enough

The man with TWO watches NEVER knows the exact time


What is a multiple sequence alignment?

What can it do for me?
How can I produce one of these?
How can I use it?

chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP

***. ::: .: .. . : . . * . *: *

chite AATAKQNYIRALQEYERNGG-
wheat ANKLKGEYNKAIAAYNKGESA
trybr AEKDKERYKREM---------
mouse AKDDRIRYDNEMKSWEEQMAE

* : .* . :


What is a multiple sequence alignment?

Structural/biochemical criteria: residues playing a similar role end up in the same column.

Evolution criteria: residues having the same ancestor end up in the same column.

chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP

***. ::: .: .. . : . . * . *: *

chite AATAKQNYIRALQEYERNGG-
wheat ANKLKGEYNKAIAAYNKGESA
trybr AEKDKERYKREM---------
mouse AKDDRIRYDNEMKSWEEQMAE

* : .* . :


How can I use a multiple alignment?

chite   ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat   --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr   KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
unknown -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP

***. ::: .: .. . : . . * . *: *

chite   AATAKQNYIRALQEYERNGG-
wheat   ANKLKGEYNKAIAAYNKGESA
trybr   AEKDKERYKREM---------
unknown AKDDRIRYDNEMKSWEEQMAE

* : .* . :

Extrapolation

SwissProt

Unknown Sequence

Homology?

Less than 30% identity, BUT conserved where it MATTERS


How can I use a multiple alignment?

chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP

***. ::: .: .. . : . . * . *: *

chite AATAKQNYIRALQEYERNGG-
wheat ANKLKGEYNKAIAAYNKGESA
trybr AEKDKERYKREM---------
mouse AKDDRIRYDNEMKSWEEQMAE

* : .* . :

Extrapolation

Prosite Patterns

P-K-R-[PA]-x(1)-[ST]…
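A PROSITE pattern can be read as a regular expression. The sketch below is a minimal, assumed translation of the basic syntax (`-` separators, `x` wildcards, `[..]` allowed sets, `{..}` forbidden sets, `(n)` or `(n,m)` repeats); the function name `prosite_to_regex` is hypothetical, and only the visible start of the pattern on the slide is used, since the rest is elided.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a basic PROSITE-style pattern into a Python regex string."""
    parts = []
    for element in pattern.split("-"):
        # Split an element such as "x(2,4)" into its core and optional repeat.
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.group(1), m.group(2), m.group(3)
        if core == "x":
            core = "."                      # any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"  # forbidden residues
        # "[PA]" and single letters are already valid regex fragments.
        if low and high:
            core += "{%s,%s}" % (low, high)
        elif low:
            core += "{%s}" % low
        parts.append(core)
    return "".join(parts)


# Visible prefix of the slide's pattern; the remainder is not guessed.
PREFIX = "P-K-R-[PA]-x(1)-[ST]"
regex = re.compile(prosite_to_regex(PREFIX))

# Ungapped N-terminal stretches taken from the alignment rows.
starts = {
    "chite": "ADKPKRPLSAYMLWLN",
    "wheat": "DPNKPKRAPSAFFVFM",
    "trybr": "KKDSNAPKRAMTSFMF",
    "mouse": "KPKRPRSAYNIYVSES",
}
for name, seq in starts.items():
    hit = regex.search(seq)
    print(name, hit.group(0) if hit else None)
```

Each of the four sequences contains a match for this prefix, which is what makes the pattern a usable signature for the family.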


How can I use a multiple alignment?

Extrapolation

Prosite Patterns

chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-IQGKLKLVNEAWKNLSP

***. ::: .: .. . : . . * . *: *

chite AATAKQNYIRALQEYERNGG-
wheat ANKLKGEYNKAIAAYNKGESA
trybr AEKDKERYKREM---------
mouse AKDDRIRYDNEMKSWEEQMAE

* : .* . :

L?K>R

Prosite Profiles: more sensitive, more specific

AFDEFGHQIVLW


PROSITE profile (see also HMMs)

A Substitution Cost For Every Amino Acid, At Every Position
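The idea of one score per amino acid per column can be sketched as follows. This is not the PROSITE profile construction itself: it is a simplified log-odds table built from column counts, assuming a flat pseudocount and a uniform background frequency of 1/20. The names `build_profile` and `score_segment` are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(rows, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background
    profile = []
    for column in zip(*rows):
        counts = Counter(res for res in column if res != "-")
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def score_segment(profile, segment):
    """Sum the position-specific scores of an ungapped segment."""
    return sum(column[aa] for column, aa in zip(profile, segment))


# Columns 6 to 12 of the four-sequence alignment (chite, wheat, trybr, mouse).
block = ["KPKRPLS", "KPKRAPS", "APKRAMT", "KPKRPRS"]
profile = build_profile(block)
print(score_segment(profile, "KPKRPLS"), score_segment(profile, "GGGGGGG"))
```

A segment that resembles the family scores higher than an unrelated one because every position carries its own scores, which is the property the slide highlights.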


How can I use a multiple alignment?

Phylogeny

chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP

***. ::: .: .. . : . . * . *: *

chite AATAKQNYIRALQEYERNGG-
wheat ANKLKGEYNKAIAAYNKGESA
trybr AEKDKERYKREM---------
mouse AKDDRIRYDNEMKSWEEQMAE

* : .* . :

[Figure: tree relating chite, wheat, trybr and mouse]

-Evolution
-Paralogy/Orthology


How can I use a multiple alignment?

Phylogeny

chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP

***. ::: .: .. . : . . * . *: *

chite AATAKQNYIRALQEYERNGG-
wheat ANKLKGEYNKAIAAYNKGESA
trybr AEKDKERYKREM---------
mouse AKDDRIRYDNEMKSWEEQMAE

* : .* . :

Struc. Prediction

Column Constraint

Evolution Constraint

Structure Constraint


How can I use a multiple alignment?

Phylogeny

chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP

***. ::: .: .. . : . . * . *: *

chite AATAKQNYIRALQEYERNGG-
wheat ANKLKGEYNKAIAAYNKGESA
trybr AEKDKERYKREM---------
mouse AKDDRIRYDNEMKSWEEQMAE

* : .* . :

Struc. Prediction

PsiPred or PHD for secondary structure prediction: 75% accurate.
Threading is improving but is not yet as good.


How can I use a multiple alignment?

Phylogeny

Struc. Prediction

chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP

***. ::: .: .. . : . . * . *: *

chite AATAKQNYIRALQEYERNGG-
wheat ANKLKGEYNKAIAAYNKGESA
trybr AEKDKERYKREM---------
mouse AKDDRIRYDNEMKSWEEQMAE

* : .* . :

Caution!

Automatic Multiple Sequence Alignment methods are not always perfect…


The problem

why is it difficult to compute a multiple sequence alignment?

chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP

***. ::: .: .. . : . . * . *: *

Computation

What is the good alignment?

Biology

What is a good alignment?


The problem

why is it difficult to compute a multiple sequence alignment?

CIRCULAR PROBLEM....

Good Sequences

Good Alignment


The problem

Same as pairwise alignment problem
We do NOT know how sequences evolve.
We do NOT understand the relation between structures and sequences.

We would NOT recognize the “correct” alignment if we had it IN FRONT of our eyes…


The Charlie Chaplin paradox


What do I need to know to make a good multiple alignment?

How do sequences evolve?
How does the computer align the sequences?
How can I choose my sequences?
What is the best program?
How can I use my alignment?


An alignment is a story

ADKPKRPLSAYMLWLN

ADKPKRPLSAYMLWLN ADKPKRPLSAYMLWLN

ADKPRRPLS-YMLWLN
ADKPKRPKPRLSAYMLWLN

Mutations + Selection

ADKPRRP---LS-YMLWLN
ADKPKRPKPRLSAYMLWLN

Insertion
Deletion
Mutation


Homology

Same sequences -> same origin? -> same function? -> same 3D fold?

[Figure: % sequence identity plotted against length, with regions labelled "Same 3D Fold" and "Twilight Zone"; marks at 30% identity and length 100]


Convergent evolution

AFGP with (ThrAlaAla)n, similar to trypsinogen

AFGP with (ThrAlaAla)n, NOT similar to trypsinogen

[Figure labels: N, S]

Chen et al, 97, PNAS, 94, 3811-16


Residues and mutations

All residues are equal, but some more than others…

[Figure: Venn diagram grouping the amino acids by property: Aliphatic, Aromatic, Hydrophobic, Polar, Small]

Accurate matrices are data driven rather than knowledge driven


Substitution matrices

Different Flavors:

• PAM: 250, 350
• BLOSUM: 45, 62
• …


What is the best substitution matrix?

Mutation rates depend on families

Choosing the right matrix may be tricky
Gonnet250 > BLOSUM62 > PAM250
Depends on the family, the program used and its tuning

Family            S     N
Histone3          6.4   0
Insulin           4.0   0.1
Interleukin I     4.6   1.4
α-Globin          5.1   0.6
Apolipoprot. AI   4.5   1.6
Interferon G      8.6   2.8

Rates in Substitutions/site/Billion Years as measured on Mouse Vs Human (0.08 Billion years)


Insertions and deletions?

[Figure: three plots of indel cost against indel length L]

Affine Gap Penalty: Cost = GOP + GEP*L
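The affine formula can be written down directly, with GOP as the gap opening penalty and GEP as the gap extension penalty. The default values below are illustrative assumptions, not values from the slides, and the function name is hypothetical.

```python
def affine_gap_cost(length: int, gop: float = 10.0, gep: float = 0.5) -> float:
    """Cost of one indel of the given length: GOP + GEP * L (0 if no gap)."""
    if length <= 0:
        return 0.0
    return gop + gep * length


# One indel of length 4 versus four separate indels of length 1.
one_long_gap = affine_gap_cost(4)
four_short_gaps = 4 * affine_gap_cost(1)
print(one_long_gap, four_short_gaps)
```

With these numbers a single long indel is much cheaper than several short ones, which is the behaviour the affine scheme is designed to produce.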


How to align many sequences?

Exact algorithms (Needleman & Wunsch, Smith & Waterman) are computationally expensive:

2 Globins => 1 sec
3 Globins => 2 min
4 Globins => 5 hours (-> heuristic wished)
5 Globins => 3 weeks (-> heuristic really wished!)
6 Globins => 9 years (-> heuristic required!)
7 Globins => 1000 years (-> heuristic definitely required!)
8 Globins => 150 000 years (-> heuristic please!…)
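The explosion comes from the size of the dynamic programming problem: aligning N sequences of length L exactly means filling a lattice of about L^N cells, each with 2^N - 1 predecessors. The sketch below only illustrates that growth; the length of 150 residues is an assumption, and the result is a count of elementary steps, not a reproduction of the timings quoted on the slides.

```python
def exact_msa_work(n_sequences: int, length: int = 150) -> int:
    """Rough count of elementary steps for exact N-dimensional alignment."""
    cells = length ** n_sequences          # one cell per combination of positions
    predecessors = 2 ** n_sequences - 1    # moves leading into each cell
    return cells * predecessors


for n in range(2, 9):
    growth = exact_msa_work(n) / exact_msa_work(2)
    print(f"{n} sequences: {growth:.3g} times the pairwise work")
```

Each extra sequence multiplies the work by a factor of a few hundred, so no hardware improvement rescues the exact approach beyond a handful of sequences.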


Existing methods

1-Carrillo and Lipman:
-MSA, DCA.
-Few small, closely related sequences.

2-Segment based:
-DIALIGN, MACAW.
-May align too few residues.
-Do well when they can run.

3-Iterative:
-HMMs, HMMER, SAM.
-Slow, sometimes inaccurate.
-Good profile generators.

4-Progressive:
-ClustalW, Pileup, Multalign…
-Fast and sensitive.


Progressive alignment

Feng and Doolittle, 1980; Taylor 1981

Dynamic Programming Using A Substitution Matrix
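Each step of a progressive alignment relies on pairwise dynamic programming. The sketch below is a minimal global alignment in the style of Needleman and Wunsch, under two stated simplifications: a toy match/mismatch score stands in for a real substitution matrix, and the gap penalty is linear rather than affine. The function name and score values are illustrative.

```python
def needleman_wunsch(a, b, score=lambda x, y: 2 if x == y else -1, gap=-2):
    """Return (best score, aligned a, aligned b) for a global alignment."""
    n, m = len(a), len(b)
    # dp[i][j] = best score for aligning a[:i] with b[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + score(a[i - 1], b[j - 1]),  # align two residues
                dp[i - 1][j] + gap,                            # gap in b
                dp[i][j - 1] + gap,                            # gap in a
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + score(a[i - 1], b[j - 1]):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return dp[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


# The ancestral peptide from the "alignment is a story" slide, and a copy
# that has lost one residue.
best, row_a, row_b = needleman_wunsch("ADKPKRPLSAYMLWLN", "ADKPKRPLSYMLWLN")
print(best)
print(row_a)
print(row_b)
```

In a progressive aligner the same recurrence is applied to sequences and then to groups of already aligned sequences, following the order given by the guide tree.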

Progressive alignment

Feng and Doolittle, 1980; Taylor 1981

-Depends on the ORDER of the sequences (Tree).

-Depends on the CHOICE of the sequences.

-Depends on the PARAMETERS:

•Substitution Matrix.

•Penalties (Gop, Gep).

•Sequence Weight.

•Tree making Algorithm.


Progressive alignment

Works well when phylogeny is dense
No outlier sequence
Example: river crossing


Selecting sequences from a BLAST output


A common mistake

Sequences too closely related

Identical sequences bring no information
Multiple sequence alignments thrive on diversity

PRVA_MACFU SMTDLLNAEDIKKAVGAFSAIDSFDHKKFFQMVGLKKKSADDVKKVFHILDKDKSGFIEE
PRVA_HUMAN SMTDLLNAEDIKKAVGAFSATDSFDHKKFFQMVGLKKKSADDVKKVFHMLDKDKSGFIEE
PRVA_GERSP SMTDLLSAEDIKKAIGAFAAADSFDHKKFFQMVGLKKKTPDDVKKVFHILDKDKSGFIEE
PRVA_MOUSE SMTDVLSAEDIKKAIGAFAAADSFDHKKFFQMVGLKKKNPDEVKKVFHILDKDKSGFIEE
PRVA_RAT   SMTDLLSAEDIKKAIGAFTAADSFDHKKFFQMVGLKKKSADDVKKVFHILDKDKSGFIEE
PRVA_RABIT AMTELLNAEDIKKAIGAFAAAESFDHKKFFQMVGLKKKSTEDVKKVFHILDKDKSGFIEE
           :**::*.*******:***:* :****************..::******:***********

PRVA_MACFU DELGFILKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES
PRVA_HUMAN DELGFILKGFSPDARDLSAKETKMLMAAGDKDGDGKIGVDEFSTLVAES
PRVA_GERSP DELGFILKGFSSDARDLSAKETKTLLAAGDKDGDGKIGVEEFSTLVSES
PRVA_MOUSE DELGSILKGFSSDARDLSAKETKTLLAAGDKDGDGKIGVEEFSTLVAES
PRVA_RAT   DELGSILKGFSSDARDLSAKETKTLMAAGDKDGDGKIGVEEFSTLVAES
PRVA_RABIT EELGFILKGFSPDARDLSVKETKTLMAAGDKDGDGKIGADEFSTLVSES
           :*** ******.******.**** *:************.:******:**
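How close is too close can be checked with pairwise percent identity over the aligned columns. The sketch below uses the first block of two of the parvalbumin rows above; the function name is hypothetical, and any cut-off applied to the result (for example discarding one member of a pair above 90% identity) would be an assumption, not a figure from the slides.

```python
def percent_identity(row_a: str, row_b: str) -> float:
    """Percent of identical residues over columns where neither row has a gap."""
    pairs = [(x, y) for x, y in zip(row_a, row_b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


# First alignment block of two of the six parvalbumin sequences.
macfu = "SMTDLLNAEDIKKAVGAFSAIDSFDHKKFFQMVGLKKKSADDVKKVFHILDKDKSGFIEE"
human = "SMTDLLNAEDIKKAVGAFSATDSFDHKKFFQMVGLKKKSADDVKKVFHMLDKDKSGFIEE"
print(round(percent_identity(macfu, human), 1))
```

Two sequences this similar behave almost as one sequence in the alignment, so keeping both adds very little information.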


Respect information!

PRVA_MACFU ------------------------------------------SMTDLLN----AEDIKKA
PRVA_HUMAN ------------------------------------------SMTDLLN----AEDIKKA
PRVA_GERSP ------------------------------------------SMTDLLS----AEDIKKA
PRVA_MOUSE ------------------------------------------SMTDVLS----AEDIKKA
PRVA_RAT   ------------------------------------------SMTDLLS----AEDIKKA
PRVA_RABIT ------------------------------------------AMTELLN----AEDIKKA
TPCC_MOUSE MDDIYKAAVEQLTEEQKNEFKAAFDIFVLGAEDGCISTKELGKVMRMLGQNPTPEELQEM

: :*. .*::::

PRVA_MACFU VGAFSAIDS--FDHKKFFQMVG------LKKKSADDVKKVFHILDKDKSGFIEEDELGFI
PRVA_HUMAN VGAFSATDS--FDHKKFFQMVG------LKKKSADDVKKVFHMLDKDKSGFIEEDELGFI
PRVA_GERSP IGAFAAADS--FDHKKFFQMVG------LKKKTPDDVKKVFHILDKDKSGFIEEDELGFI
PRVA_MOUSE IGAFAAADS--FDHKKFFQMVG------LKKKNPDEVKKVFHILDKDKSGFIEEDELGSI
PRVA_RAT   IGAFTAADS--FDHKKFFQMVG------LKKKSADDVKKVFHILDKDKSGFIEEDELGSI
PRVA_RABIT IGAFAAAES--FDHKKFFQMVG------LKKKSTEDVKKVFHILDKDKSGFIEEEELGFI
TPCC_MOUSE IDEVDEDGSGTVDFDEFLVMMVRCMKDDSKGKSEEELSDLFRMFDKNADGYIDLDELKMM

:. . * .*..:*: *: * *. :::..:*:::**: .*:*: :** :

PRVA_MACFU LKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES-
PRVA_HUMAN LKGFSPDARDLSAKETKMLMAAGDKDGDGKIGVDEFSTLVAES-
PRVA_GERSP LKGFSSDARDLSAKETKTLLAAGDKDGDGKIGVEEFSTLVSES-
PRVA_MOUSE LKGFSSDARDLSAKETKTLLAAGDKDGDGKIGVEEFSTLVAES-
PRVA_RAT   LKGFSSDARDLSAKETKTLMAAGDKDGDGKIGVEEFSTLVAES-
PRVA_RABIT LKGFSPDARDLSVKETKTLMAAGDKDGDGKIGADEFSTLVSES-
TPCC_MOUSE LQ---ATGETITEDDIEELMKDGDKNNDGRIDYDEFLEFMKGVE

*: . .. :: .: : *: ***:.**:*. :** ::

-This alignment is not informative about the relation between TPCC_MOUSE and the rest of the sequences.

-A better spread of the sequences is needed


Selecting diverse sequences

PRVB_CYPCA -AFAGVLNDADIAAALEACKAADSFNHKAFFAKVGLTSKSADDVKKAFAIIDQDKSGFIE
PRVB_BOACO -AFAGILSDADIAAGLQSCQAADSFSCKTFFAKSGLHSKSKDQLTKVFGVIDRDKSGYIE
PRV1_SALSA MACAHLCKEADIKTALEACKAADTFSFKTFFHTIGFASKSADDVKKAFKVIDQDASGFIE
PRVB_LATCH -AVAKLLAAADVTAALEGCKADDSFNHKVFFQKTGLAKKSNEELEAIFKILDQDKSGFIE
PRVB_RANES -SITDIVSEKDIDAALESVKAAGSFNYKIFFQKVGLAGKSAADAKKVFEILDRDKSGFIE
PRVA_MACFU -SMTDLLNAEDIKKAVGAFSAIDSFDHKKFFQMVGLKKKSADDVKKVFHILDKDKSGFIE
PRVA_ESOLU --AKDLLKADDIKKALDAVKAEGSFNHKKFFALVGLKAMSANDVKKVFKAIDADASGFIE

: *: .: . .* .:*. * ** *: * : * :* * **:**

PRVB_CYPCA EDELKLFLQNFKADARALTDGETKTFLKAGDSDGDGKIGVDEFTALVKA-
PRVB_BOACO EDELKKFLQNFDGKARDLTDKETAEFLKEGDTDGDGKIGVEEFVVLVTKG
PRV1_SALSA VEELKLFLQNFCPKARELTDAETKAFLKAGDADGDGMIGIDEFAVLVKQ-
PRVB_LATCH DEELELFLQNFSAGARTLTKTETETFLKAGDSDGDGKIGVDEFQKLVKA-
PRVB_RANES QDELGLFLQNFRASARVLSDAETSAFLKAGDSDGDGKIGVEEFQALVKA-
PRVA_MACFU EDELGFILKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES
PRVA_ESOLU EEELKFVLKSFAADGRDLTDAETKAFLKAADKDGDGKIGIDEFETLVHEA

:** .*:.* .* *: ** :: .* **** **::** **

-A REASONABLE model now exists.

-Going further: remote homologues.


Aligning remote homologues

PRVA_MACFU ------------------------------------------SMTDLLNA----EDIKKA
PRVA_ESOLU -------------------------------------------AKDLLKA----DDIKKA
PRVB_CYPCA ------------------------------------------AFAGVLND----ADIAAA
PRVB_BOACO ------------------------------------------AFAGILSD----ADIAAG
PRV1_SALSA -----------------------------------------MACAHLCKE----ADIKTA
PRVB_LATCH ------------------------------------------AVAKLLAA----ADVTAA
PRVB_RANES ------------------------------------------SITDIVSE----KDIDAA
TPCS_RABIT -TDQQAEARSYLSEEMIAEFKAAFDMFDADGG-GDISVKELGTVMRMLGQTPTKEELDAI
TPCS_PIG   -TDQQAEARSYLSEEMIAEFKAAFDMFDADGG-GDISVKELGTVMRMLGQTPTKEELDAI
TPCC_MOUSE MDDIYKAAVEQLTEEQKNEFKAAFDIFVLGAEDGCISTKELGKVMRMLGQNPTPEELQEM

: ::

PRVA_MACFU VGAFSAIDS--FDHKKFFQMVG------LKKKSADDVKKVFHILDKDKSGFIEEDELGFI
PRVA_ESOLU LDAVKAEGS--FNHKKFFALVG------LKAMSANDVKKVFKAIDADASGFIEEEELKFV
PRVB_CYPCA LEACKAADS--FNHKAFFAKVG------LTSKSADDVKKAFAIIDQDKSGFIEEDELKLF
PRVB_BOACO LQSCQAADS--FSCKTFFAKSG------LHSKSKDQLTKVFGVIDRDKSGYIEEDELKKF
PRV1_SALSA LEACKAADT--FSFKTFFHTIG------FASKSADDVKKAFKVIDQDASGFIEVEELKLF
PRVB_LATCH LEGCKADDS--FNHKVFFQKTG------LAKKSNEELEAIFKILDQDKSGFIEDEELELF
PRVB_RANES LESVKAAGS--FNYKIFFQKVG------LAGKSAADAKKVFEILDRDKSGFIEQDELGLF
TPCS_RABIT IEEVDEDGSGTIDFEEFLVMMVRQMKEDAKGKSEEELAECFRIFDRNADGYIDAEELAEI
TPCS_PIG   IEEVDEDGSGTIDFEEFLVMMVRQMKEDAKGKSEEELAECFRIFDRNMDGYIDAEELAEI
TPCC_MOUSE IDEVDEDGSGTVDFDEFLVMMVRCMKDDSKGKSEEELSDLFRMFDKNADGYIDLDELKMM

: . .: .. . *: * : * :* : .*:*: :** .

PRVA_MACFU LKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES-
PRVA_ESOLU LKSFAADGRDLTDAETKAFLKAADKDGDGKIGIDEFETLVHEA-
PRVB_CYPCA LQNFKADARALTDGETKTFLKAGDSDGDGKIGVDEFTALVKA--
PRVB_BOACO LQNFDGKARDLTDKETAEFLKEGDTDGDGKIGVEEFVVLVTKG-
PRV1_SALSA LQNFCPKARELTDAETKAFLKAGDADGDGMIGIDEFAVLVKQ--
PRVB_LATCH LQNFSAGARTLTKTETETFLKAGDSDGDGKIGVDEFQKLVKA--
PRVB_RANES LQNFRASARVLSDAETSAFLKAGDSDGDGKIGVEEFQALVKA--
TPCS_RABIT FR---ASGEHVTDEEIESLMKDGDKNNDGRIDFDEFLKMMEGVQ
TPCS_PIG   FR---ASGEHVTDEEIESIMKDGDKNNDGRIDFDEFLKMMEGVQ
TPCC_MOUSE LQ---ATGETITEDDIEELMKDGDKNNDGRIDYDEFLEFMKGVE

:: .. :: : :: .* :.** *. :** ::


Going further…

PRVA_MACFU VGAFSAIDS--FDHKKFFQMVG------LKKKSADDVKKVFHILDKDKSGFIEEDELGFI
PRVB_BOACO LQSCQAADS--FSCKTFFAKSG------LHSKSKDQLTKVFGVIDRDKSGYIEEDELKKF
PRV1_SALSA LEACKAADT--FSFKTFFHTIG------FASKSADDVKKAFKVIDQDASGFIEVEELKLF
TPCS_RABIT IEEVDEDGSGTIDFEEFLVMMVRQMKEDAKGKSEEELAECFRIFDRNADGYIDAEELAEI
TPCS_PIG   IEEVDEDGSGTIDFEEFLVMMVRQMKEDAKGKSEEELAECFRIFDRNMDGYIDAEELAEI
TPCC_MOUSE IDEVDEDGSGTVDFDEFLVMMVRCMKDDSKGKSEEELSDLFRMFDKNADGYIDLDELKMM
TPC_PATYE  SDEMDEEATGRLNCDAWIQLFER---KLKEDLDERELKEAFRVLDKEKKGVIKVDVLRWI

. : .. . :: . : * :* : .* *. : * .

PRVA_MACFU LKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES--
PRVB_BOACO LQNFDGKARDLTDKETAEFLKEGDTDGDGKIGVEEFVVLVTKG--
PRV1_SALSA LQNFCPKARELTDAETKAFLKAGDADGDGMIGIDEFAVLVKQ---
TPCS_RABIT FR---ASGEHVTDEEIESLMKDGDKNNDGRIDFDEFLKMMEGVQ-
TPCS_PIG   FR---ASGEHVTDEEIESIMKDGDKNNDGRIDFDEFLKMMEGVQ-
TPCC_MOUSE LQ---ATGETITEDDIEELMKDGDKNNDGRIDYDEFLEFMKGVE-
TPC_PATYE  LS---SLGDELTEEEIENMIAETDTDGSGTVDYEEFKCLMMSSDA

: . :: : :: * :..* :. :** ::


What makes a good alignment…

The more divergent the sequences, the better
The fewer indels, the better
Nice ungapped blocks separated with indels
Different classes of residues within a block:

Completely conserved (*)
Size and hydropathy conserved (:)
Size or hydropathy conserved (.)

The ultimate evaluation is a matter of personal judgment and knowledge
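The three conservation symbols can be derived column by column. The sketch below assumes the "strong" and "weak" residue groups as commonly documented for ClustalW/ClustalX; the slides do not list them, so the group tables and the function names should be treated as assumptions.

```python
# Assumed Clustal-style residue groups (not given on the slides).
STRONG_GROUPS = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK", "MILV", "MILF", "HY", "FYW"]
WEAK_GROUPS = ["CSA", "ATV", "SAG", "STNK", "STPA", "SGND", "SNDEQK",
               "NDEQHK", "NEQHRK", "FVLIM", "HFY"]


def column_symbol(column) -> str:
    """Return '*', ':', '.' or ' ' for one alignment column."""
    residues = set(column)
    if "-" in residues:
        return " "                      # columns with gaps get no symbol
    if len(residues) == 1:
        return "*"                      # completely conserved
    if any(residues <= set(group) for group in STRONG_GROUPS):
        return ":"                      # all residues within one strong group
    if any(residues <= set(group) for group in WEAK_GROUPS):
        return "."                      # all residues within one weak group
    return " "


def conservation_line(rows) -> str:
    """Build the symbol line printed under an alignment block."""
    return "".join(column_symbol(column) for column in zip(*rows))


# Columns 6 to 14 of the chite/wheat/trybr/mouse alignment.
block = ["KPKRPLSAY", "KPKRAPSAF", "APKRAMTSF", "KPKRPRSAY"]
print(repr(conservation_line(block)))
```

For this block the result agrees with the start of the symbol line shown under the alignment in the slides, which suggests the assumed groups are close to what the original program used.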


Avoiding pitfalls


Naming your sequences the right way

Never use white spaces in your sequence names
Never use special symbols. Stick to plain letters, numbers and the underscore sign (_) to replace spaces. Avoid ALL other signs, especially the most tempting ones like @, #, |, *, >, <…
Never use names longer than 15 characters
Never give the same name to 2 different sequences in your set. Some programs accept it, others like ClustalW don’t.
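These naming rules are easy to check before running an aligner. A minimal sketch, assuming names are plain strings; the function name `check_names` and the wording of the messages are hypothetical.

```python
import re

# Letters, digits and underscore only, at most 15 characters.
NAME_RE = re.compile(r"[A-Za-z0-9_]{1,15}")


def check_names(names):
    """Return a list of human-readable problems found in the sequence names."""
    problems = []
    seen = set()
    for name in names:
        if not NAME_RE.fullmatch(name):
            problems.append(f"{name!r}: use only letters, digits and '_', 15 characters at most")
        if name in seen:
            problems.append(f"{name!r}: duplicate name")
        seen.add(name)
    return problems


for problem in check_names(["PRVA_MACFU", "sp|P02626|PRVA", "my seq", "PRVA_RAT", "PRVA_RAT"]):
    print(problem)
```

Renaming the sequences once, before the first alignment, avoids failures that only show up later in other programs.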


Do not use too many sequences!


Beware of Repeats

There is a problem when two sequences do not contain the same number of repeats.

It is then better to manually extract the repeats and to align them separately. Individual repeats can be recognized using Dotlet or Dotter.


Keep a biological perspective

chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP

***. ::: .: .. . : . . * . *: *

chite AATAKQNYIRALQEYERNGG-
wheat ANKLKGEYNKAIAAYNKGESA
trybr AEKDKERYKREM---------
mouse AKDDRIRYDNEMKSWEEQMAE

* : .* . :

chite AD--K----PKR-PLYMLWLNS-ARESIKRENPDFK-VT-EVAKKGGELWRGL-
wheat -DPNK----PKRAP-FFVFMGE-FREEFKQKNPKNKSVA-AVGKAAGERWKSLS
trybr -K--KDSNAPKR-AMT-MFFSSDFR-S-KH-S-DLS-IV-EMSKAAGAAWKELG
mouse ----K----PKR-PRYNIYVSESFQEA-K--D-D-S-AQGKL-KLVNEAWKNLS

* *** .:: ::... : * . . . : * . *: *

chite KSEWEAKAATAKQNY-I--RALQE-YERNG-G-
wheat KAPYVAKANKLKGEY-N--KAIAA-YNK-GESA
trybr RKVYEEMAEKDKERY----K--RE-M-------
mouse KQAYIQLAKDDRIRYDNEMKSWEEQMAE-----

: : * : .* :

DIFFERENT PARAMETERS


Do not overtune!!!

DO NOT PLAY WITH PARAMETERS!

IF YOU KNOW THE ALIGNMENT YOU WANT: MAKE IT YOURSELF!

chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP

***. ::: .: .. . : . . * . *: *

chite AATAKQNYIRALQEYERNGG-
wheat ANKLKGEYNKAIAAYNKGESA
trybr AEKDKERYKREM---------
mouse AKDDRIRYDNEMKSWEEQMAE

* : .* . :

chite ---ADKPKRPL-SAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat --DPNKPKRAP-SAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr KKDSNAPKRAMTSFMFFSSDFRS-----KHSDLS-IVEMSKAAGAAWKELGP
mouse -----KPKRPR-SAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP

***. * .: .. . : . . * . *: *

chite AATAKQNYIRALQEYERNGG-
wheat ANKLKGEYNKAIAAYNKGESA
trybr AEKDKERYKREM---------
mouse AKDDRIRYDNEMKSWEEQMAE

* : .* . :


BaliBase classification and benchmark

PROBLEM (description):
-Even phylogenetic spread
-One outlier sequence
-Two distantly related groups
-Long internal indel
-Long terminal indel


Choosing the right method

Source: BaliBase

Thompson et al, NAR, 1999

PROBLEM Strategy Strategy

ClustalW, T-Coffee, MSA, DCA

PrrP, T-Coffee

Dialign II

T-Coffee

T-Coffee

Dialign II

T-Coffee


Some interesting links


Conclusion

The best alignment method:
-Your brain
-The right data

The best evaluation method:
-Your eyes
-Experimental information (SwissProt)

Choosing the sequences well is important
Beware of repeated elements

What can I conclude?
-Homology -> information extrapolation

How can I go further?
-Patterns
-Profiles
-HMMs
-…