Multiple sequence alignment (MSA) Usean sekvenssin rinnastus Petri Törönen Help contributed by: Liisa Holm & Ari Löytynoja

Upload: collin-owens

Post on 11-Jan-2016


Page 1

Multiple sequence alignment (MSA)

Usean sekvenssin rinnastus (Finnish for "alignment of multiple sequences")

Petri Törönen
Help contributed by: Liisa Holm & Ari Löytynoja

Page 2

What is MSA?

• MSA is an alignment generated from three or more sequences.

• MSA is usually a more global alignment, i.e., the aim is to align homologous residues (nucleotides or amino acids) in columns across the length of the whole sequences.

GA--GTACA

CAC-GTATA

CACGGTAT-

G-CGGTCTA
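The idea of "residues in columns" can be made concrete with a small Python sketch that reads the four-sequence example above column by column. The function and variable names are illustrative only, and treating a column as "fully conserved" when it holds one residue and no gap is an assumption made for this example.

```python
# The four aligned sequences from the example; '-' marks a gap.
alignment = [
    "GA--GTACA",
    "CAC-GTATA",
    "CACGGTAT-",
    "G-CGGTCTA",
]

def columns(rows):
    """Return the alignment column by column (all rows must have equal length)."""
    if len({len(r) for r in rows}) != 1:
        raise ValueError("aligned sequences must have the same length")
    return ["".join(col) for col in zip(*rows)]

cols = columns(alignment)

# A column counts as fully conserved if it holds a single residue and no gap.
conserved = [
    i for i, col in enumerate(cols, start=1)
    if len(set(col)) == 1 and "-" not in col
]

for i, col in enumerate(cols, start=1):
    print(i, col)
print("fully conserved columns:", conserved)
```

In this example only columns 5 and 6 (G and T) are identical in all four sequences.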

Page 3

What is MSA?

Picture shows a protein multiple sequence alignment. Source: http://en.wikipedia.org/wiki/Multiple_sequence_alignment

Page 4

Why MSA

• “MSA emphasises signal observed in the pairwise alignment” (Liisa Holm)

• Improved alignments!!

• Alignment of more distant sequences with help from intermediate sequences

• Highlights the conserved regions in sequences

http://ekhidna.biocenter.helsinki.fi/users/petri/public/opetus_jutut/Bioinf_Per_Lects/urease_output.txt

Page 5

Why MSA

MSA is input to many analysis tasks:

• Detection of active sites

• Generation of sequence profiles

• Detection of protein domains and motifs

• Phylogenetics
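As one illustration of these downstream tasks, a sequence profile can be sketched as the relative frequency of each symbol in each alignment column. The sketch below is a minimal version under simple assumptions: gaps are counted as an ordinary symbol, no pseudocounts or sequence weights are used, and the function names are hypothetical. The input is the small nucleotide alignment shown earlier in the lecture.

```python
from collections import Counter

def profile(rows, alphabet="ACGT-"):
    """Relative frequency of each symbol in each alignment column."""
    n = len(rows)
    result = []
    for col in zip(*rows):
        counts = Counter(col)
        result.append({sym: counts.get(sym, 0) / n for sym in alphabet})
    return result

def consensus(prof):
    """Most frequent symbol per column; ties go to the symbol listed first in the alphabet."""
    return "".join(max(col, key=col.get) for col in prof)

msa = ["GA--GTACA", "CAC-GTATA", "CACGGTAT-", "G-CGGTCTA"]
prof = profile(msa)
print(prof[1])           # frequencies in the second column
print(consensus(prof))
```

Real profile methods usually add pseudocounts and weight the sequences; those refinements are left out here.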

Page 6

Remember

• First step of MSA:

• Good selection of sequences for the analysis

• Sequences need to be functionally/evolutionarily related

• Sometimes it is good to have some variation in the sequences (depends on the analysis task)

• The alternative: rubbish in → rubbish out

Page 7

MSA methods

• Finding the optimal multiple sequence alignment is a computationally hard task

• The “correct” answer could always be found by extending the dynamic programming algorithm to multiple sequences

• In practice, dynamic programming cannot be applied to MSA problems of realistic size: its time and memory requirements grow exponentially with the number of sequences

• We need approximate solutions (heuristics)

http://en.wikipedia.org/wiki/Multiple_sequence_alignment#Dynamic_programming_and_computational_complexity
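A rough feel for why exact dynamic programming breaks down: for N sequences the table has one dimension per sequence, and each cell can be reached from every combination of "advance or gap" over the sequences. The sketch below just counts these quantities. It assumes all sequences have the same length, and the function names are illustrative.

```python
def dp_cells(n_sequences, length):
    """Cells in the N-dimensional dynamic programming table for N sequences of equal length."""
    return (length + 1) ** n_sequences

def predecessors_per_cell(n_sequences):
    """Each cell has up to 2^N - 1 predecessors (every non-empty subset of sequences advances)."""
    return 2 ** n_sequences - 1

for n in (2, 3, 5, 10):
    print(n, dp_cells(n, 300), predecessors_per_cell(n))
```

Two sequences of length 300 need about 9 * 10^4 cells, while ten such sequences already need about 6 * 10^24.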

Page 8

MSA methods: heuristics

• Progressive Alignment (not much used)

• Iterative Alignment (most popular)

• Hidden Markov Models

• Pattern Based methods

Page 9

Progressive alignment

• Divide an unsolvable task into subtasks that can be solved

• First align the most similar pairs of sets of sequences
– Sequence sets can have one or many sequences
– At first, each set includes only a single sequence

• Move progressively to bigger sets and to more difficult pairs of sets

• Always align only one pair of sets at a time

Page 10

Progressive alignment

• Produce pairwise alignments between all the sequences you want to align with MSA.
– Dynamic programming, k-tuple (ktup) methods, ...

• Produce a “guide tree” on the basis of the pairwise distances calculated from the pairwise alignments
– UPGMA, neighbor joining

• Produce an MSA using the “guide tree”.
– Sequences are aligned in the same order as the guide tree instructs.
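The first step, pairwise alignment by dynamic programming, can be sketched with a basic Needleman-Wunsch global alignment. This is a minimal illustration and not the implementation of any particular MSA program: the scoring values (match = 1, mismatch = -1, linear gap = -1) and the function name are assumptions chosen for the example.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global pairwise alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Substitution score for a[i-1] against b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))

print(needleman_wunsch("ACGT", "AGT"))
```

Running this on every pair of input sequences gives the all-against-all alignments from which the pairwise distances for the guide tree are computed.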

Page 11

[Figure: progressive alignment workflow] Set of sequences → all-against-all pairwise alignment (here demonstrated for the 1st sequence) → get pairwise similarities from the alignments → create a cluster tree from the similarities → join sequences in the order obtained from the cluster tree

Page 12

Guide tree construction: UPGMA

• Unweighted Pair Group Method with Arithmetic mean

• One of the fastest tree construction methods

Page 13

An example: Pairwise alignments

Page 14

Pairwise distances, based on pairwise alignments

Number of nucleotide differences

Absolute distances, used in Pileup/Clustal

JC-distance

Page 15

UPGMA based on JC-distances*

0,107 / 2

* JC-distances = Jukes-Cantor distances. The observed distances, D, are corrected for multiple substitutions via the correction function -(3/4) * ln(1 - (4/3) * D)
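The correction function can be tried out directly. The sketch below computes the observed distance D as the proportion of differing sites and then applies the Jukes-Cantor correction. The sequences are the short human, chimp and gorilla example sequences listed on the "Constructing MSA" slide of this lecture. The function names are illustrative, and gaps are not handled because these sequences contain none.

```python
import math

def p_distance(a, b):
    """Proportion of differing sites between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def jukes_cantor(d):
    """Jukes-Cantor correction -(3/4) * ln(1 - (4/3) * D), defined for D < 0.75."""
    if d >= 0.75:
        raise ValueError("correction is undefined for D >= 0.75")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * d)

human = "ACGTACGTCC"
chimp = "ACCTACGTCC"
gorilla = "ACCACCGTCC"

pairs = [
    ("human-chimp", human, chimp),
    ("human-gorilla", human, gorilla),
    ("chimp-gorilla", chimp, gorilla),
]
for name, a, b in pairs:
    d = p_distance(a, b)
    print(f"{name}: D = {d:.1f}, JC = {jukes_cantor(d):.3f}")
```

The three corrected distances come out as roughly 0.107, 0.383 and 0.232, matching the values used in the UPGMA example (written there with decimal commas).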

Page 16

UPGMA, distance updates

d(human,chimp),gorilla = [d(human, gorilla) + d(chimp, gorilla)] / 2 = [0,383 + 0,232] / 2 = 0,3075
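The plain average above is the special case where both merged clusters contain a single sequence. In general, UPGMA weights each merged cluster by its size. Writing the cluster sizes as n_i and n_j (notation introduced here, not taken from the slides), the update for the distance from the merged cluster (ij) to any other cluster k is:

```latex
d_{(ij),k} = \frac{n_i \, d_{ik} + n_j \, d_{jk}}{n_i + n_j},
\qquad
n_i = n_j = 1 \;\Rightarrow\; d_{(ij),k} = \frac{d_{ik} + d_{jk}}{2}
```

With i = human, j = chimp and k = gorilla this reduces to the calculation shown on this slide.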

Page 17

UPGMA

Page 18

UPGMA

Page 19

UPGMA

Node U joins the (human & chimp) cluster with the (gorilla & orangutan) cluster:

d(human & chimp),U = 0,3923/2 = 0,1962

d(gorilla & orangutan),U = 0,3923/2 = 0,1962

Branch lengths from U to the two cluster nodes:

0,1962 - 0,0537 = 0,1426

0,1962 - 0,116 = 0,080

Page 20

UPGMA

0,7083 / 2

0,3541 - 0,1426 - 0,0537

or

0,3541 - 0,080 - 0,116
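The whole UPGMA example can be retraced with a compact sketch. It assumes the five short example sequences from the "Constructing MSA" slide of this lecture as input, Jukes-Cantor distances, and size-weighted averaging of cluster distances. The data structures and function names are illustrative choices, not taken from any MSA program. Ties are broken by taking the first closest pair in list order, which here happens to reproduce the merge order on the slides.

```python
import math
from itertools import combinations

def jc_distance(a, b):
    """Jukes-Cantor distance between two aligned, gap-free sequences of equal length."""
    p = sum(x != y for x, y in zip(a, b)) / len(a)
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def upgma(seqs):
    """Return the UPGMA merges as a list of (cluster_a, cluster_b, node_height)."""
    # Each cluster is a tuple of sequence names; start with one cluster per sequence.
    clusters = [(name,) for name in seqs]
    dist = {
        frozenset((a, b)): jc_distance(seqs[a[0]], seqs[b[0]])
        for a, b in combinations(clusters, 2)
    }
    merges = []
    while len(clusters) > 1:
        # Closest pair of clusters; on ties the pair listed first is taken.
        a, b = min(combinations(clusters, 2), key=lambda pair: dist[frozenset(pair)])
        merged = a + b
        # The new node sits at half the distance between the two clusters.
        merges.append((a, b, dist[frozenset((a, b))] / 2))
        clusters = [c for c in clusters if c not in (a, b)]
        for c in clusters:
            # Size-weighted average of the distances to the two merged clusters.
            dist[frozenset((merged, c))] = (
                len(a) * dist[frozenset((a, c))] + len(b) * dist[frozenset((b, c))]
            ) / len(merged)
        clusters.append(merged)
    return merges

seqs = {
    "human": "ACGTACGTCC",
    "chimp": "ACCTACGTCC",
    "gorilla": "ACCACCGTCC",
    "orangutan": "ACCCCCCTCC",
    "macaque": "CCCCCCCCCC",
}
merges = upgma(seqs)
for a, b, height in merges:
    print(a, "+", b, f"at height {height:.4f}")
```

The node heights come out close to the slide values (about 0.054, 0.116, 0.196 and 0.354). The small differences in the last digits arise because the slides work with rounded intermediate distances.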

Page 21

Constructing MSA

human      ACGTACGTCC
chimp      ACCTACGTCC
gorilla    ACCACCGTCC
orangutan  ACCCCCCTCC
macaque    CCCCCCCCCC

The closest pairs are joined first:

human      ACGTACGTCC
chimp      ACCTACGTCC

gorilla    ACCACCGTCC
orangutan  ACCCCCCTCC

Then the two pairs are joined:

human      ACGTACGTCC
chimp      ACCTACGTCC
gorilla    ACCACCGTCC
orangutan  ACCCCCCTCC

Page 22

Alignment score

• 1234
• ACGT   match=1
• ACGA   mismatch=0
• AGGA

• 1: A-A + A-A + A-A = 1+1+1 = 3

• 2: C-C + C-G + C-G =1+0+0 = 1

• 3: G-G + G-G + G-G = 1+1+1 = 3

• 4: T-A + T-A + A-A = 0+0+1 =1

• S(alignment) = S(1) + S(2) + S(3) + S(4) = 3+1+3+1 = 8

• The higher the score, the better the alignment
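The column-by-column calculation above is a sum-of-pairs score, which can be written as a short sketch. The function name and default parameters are illustrative. Gaps are not treated specially here (a gap paired with a gap would count as a match), so a real scoring scheme would need an explicit gap rule.

```python
from itertools import combinations

def sum_of_pairs(rows, match=1, mismatch=0):
    """Sum-of-pairs score: every pair of symbols in every column scores match or mismatch."""
    total = 0
    for col in zip(*rows):
        for x, y in combinations(col, 2):
            total += match if x == y else mismatch
    return total

print(sum_of_pairs(["ACGT", "ACGA", "AGGA"]))  # 3 + 1 + 3 + 1 = 8
```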

Page 23

Progressive alignment - pros and cons

• Pros:
– Fast

• Cons:
– Once gaps are opened they can never be closed
– Errors in the alignment of the first few sequences can have catastrophic effects on the whole alignment
– Not much used (to my knowledge)

Page 24

Iterative alignment

• Create a progressive alignment

• After obtaining the alignment, calculate a quality score

• REPEAT THE FOLLOWING STEPS:
– Redo the cluster tree
– Realign the sequences using the new cluster tree
– Calculate a quality score

• The loop can be stopped when a maximum number of iterations is reached or when the quality score no longer improves
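The loop described above can be sketched as a generic refinement routine. The three callables `build_tree`, `align` and `score` are hypothetical placeholders supplied by the caller; they stand in for the guide-tree construction, progressive alignment and quality scoring steps and do not correspond to the API of any real MSA program.

```python
def iterative_alignment(sequences, build_tree, align, score, max_iterations=10):
    """Refine an alignment until the score stops improving or max_iterations is reached.

    build_tree, align and score are caller-supplied (hypothetical) functions:
      build_tree(sequences, alignment_or_None) -> guide tree
      align(sequences, tree) -> alignment
      score(alignment) -> number, higher is better
    """
    tree = build_tree(sequences, None)
    best = align(sequences, tree)            # initial progressive alignment
    best_score = score(best)
    for _ in range(max_iterations):
        tree = build_tree(sequences, best)   # redo the cluster tree
        candidate = align(sequences, tree)   # realign using the new tree
        candidate_score = score(candidate)   # calculate a quality score
        if candidate_score <= best_score:    # no improvement: stop
            break
        best, best_score = candidate, candidate_score
    return best, best_score
```

Keeping the best alignment seen so far means a worse realignment in the final round is discarded rather than returned.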

Page 25

Iterative alignment

• Allows correction of errors, which was not possible in progressive alignment

• Very popular among the MSA methods

• Increases the running time of the method

Page 26

Iterative alignment

[Figure: diagram of a typical iterative MSA program workflow, with the iteration loop marked. Figure from Do & Katoh 2008, http://ai.stanford.edu/~chuongdo/papers/alignment_review.pdf]

Page 27

What MSA program(s) to use?

• Depends on the application
– Phylogenetic studies
– Structure based studies

• Depends on the size of the data
– Some programs cannot handle large datasets

• Remember to evaluate the alignment by eye

Page 28

What MSA program(s) to use?

• Collection of MSA programs at EBI

• http://www.ebi.ac.uk/Tools/msa/

Page 29

Summary of MSA

• MSA is relevant for many analysis tasks
– Improved signal from the alignment

• Solving MSA requires heuristics

• Selection of MSA methods depends on the application

• Results should be evaluated by eye
– And the errors should be corrected with MSA editors

Page 30

Manual editing of MSAs?

• Let’s say that you performed an MSA with a computer. However, biologically, it has some faults and needs manual editing.

• Editors: Jalview and Seaview http://www.csc.fi/english/research/sciences/bioscience/programs/index_html

• Input data can be in any of the most common MSA formats (Mase, Phylip, Clustal, MSF, Fasta, NEXUS, PIR and BCL)
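Of the listed formats, aligned FASTA is the simplest to inspect by hand. The sketch below is a minimal reader for it, written only as an illustration. The record names `seq1` and `seq2` are made up, and the function is not part of Jalview or Seaview. It checks the one property every alignment file must have: all rows are the same length.

```python
def read_aligned_fasta(text):
    """Parse FASTA text into {name: sequence} and check that all rows have equal length."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()   # the whole header line is used as the name
            records[name] = ""
        elif name is None:
            raise ValueError("sequence data before the first '>' header")
        else:
            records[name] += line     # sequences may be wrapped over several lines
    if len({len(seq) for seq in records.values()}) > 1:
        raise ValueError("rows of an alignment must all have the same length")
    return records

example = """>seq1
GA--GTACA
>seq2
CAC-GTATA
"""
print(read_aligned_fasta(example))
```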