1 global pairwise alignment global alignment of: 2 nucleotide sequences or 2 amino-acid sequences


Upload: magdalen-mckenzie

Post on 22-Dec-2015


TRANSCRIPT

Page 1: 1 GLOBAL PAIRWISE ALIGNMENT GLOBAL ALIGNMENT OF: 2 NUCLEOTIDE SEQUENCES OR 2 AMINO-ACID SEQUENCES

GLOBAL PAIRWISE ALIGNMENT

GLOBAL ALIGNMENT OF: 2 NUCLEOTIDE SEQUENCES OR 2 AMINO-ACID SEQUENCES

Page 2:

Assumptions:

Life is monophyletic

Biological entities (sequences, taxa) share common ancestry

Page 3:

Any two organisms share a common ancestor in their past

ancestor

descendant 1 descendant 2

Page 4:

ancestor (~5 MYA)

Page 5:

ancestor (~120 MYA)

Page 6:

ancestor (~1,500 MYA)

Page 7:

(1) Speciation events
(2) Gene duplication
(3) Duplicative transposition

Homologous sequences

Page 8:

Homology: A term coined by Richard Owen in 1843.

Definition: Similarity resulting from common ancestry.

Page 9:

Homology

There are three main types of molecular homology: orthology, paralogy (including ohnology) and xenology.

Page 10:

Homology: General Definition

• Homology designates a qualitative relationship of common descent between entities

• Two genes are either homologous or they are not!
– it doesn’t make sense to say “two genes are 43% homologous.”
– it doesn’t make sense to say “Linda is 43% pregnant.”

Page 11:

Orthology & Paralogy

• Two genes are orthologs if they originated from a single ancestral gene in the most recent common ancestor of their respective genomes

• Two genes are paralogs if they are related by gene duplication. Two genes are ohnologs if they are related by gene duplication due to genome duplication

Page 12:

Page 13:

[Figure legend: symbol = gene death]

Page 14:

Xenology is due to horizontal (lateral) gene transfer (HGT or LGT)

XA and XB are xenologs. Distinguishing orthologs from xenologs is impossible in pairwise genomic comparisons, but possible when multiple genomes are compared

Page 15:

Orthology, Paralogy, Xenology (Fitch, Trends in Genetics, 2000. 16(5):227-231)

Page 16:

Homology

By comparing homologous characters, we can reconstruct the evolutionary events that have led to the formation of the extant sequences from the common ancestor.

Page 17:

Homology

When comparing sequences, we are interested in POSITIONAL HOMOLOGY. We identify POSITIONAL HOMOLOGY through SEQUENCE ALIGNMENT.

Page 18:

Alignment: A hypothesis concerning positional homology among residues from two or more sequences.

Positional homology = In pairwise alignment, a pair of nucleotides from two homologous sequences that have descended from one nucleotide in the ancestor of the two sequences.

Page 19:

Sequence alignment involves the identification of the correct location of deletions and insertions that have occurred in either of the two lineages since their divergence from a common ancestor.

Page 20:

Page 21:

Unknown sequence

Unknown events & unknown sequence of events

Unknown events & unknown sequence of events

The true alignment is unknown.

Page 22:

There are two modes of alignment.

Global alignment: each residue of sequence A is compared with each residue in sequence B. Global alignment algorithms are used in comparative and evolutionary studies.

Local alignment: Determining if sub-segments of one sequence are present in another. Local alignment methods have their greatest utility in database searching and retrieval (e.g., BLAST).

Page 23:

For reasons of computational complexity, sequence alignment is divided into two categories:

Pairwise alignment (i.e., the alignment of two sequences).

Multiple-sequence alignment (i.e., the alignment of three or more sequences).

Pairwise alignment problems have exact solutions.

Multiple-sequence alignment problems only have approximate (heuristic) solutions.

Page 24:

A pairwise alignment consists of a series of paired bases, one base from each sequence. There are three types of pairs:

(1) matches = the same nucleotide appears in both sequences.
(2) mismatches = different nucleotides are found in the two sequences.
(3) gaps = a base in one sequence and a null base in the other.

GCGGCCCATCAGGTAGTTGGTG-G
GCGTTCCATC--CTGGTTGGTGTG

Page 25:

- Two DNA sequences: A and B.
- Lengths are m and n, respectively.
- The number of matched pairs is x.
- The number of mismatched pairs is y.
- Total number of bases in gaps is z.

Page 26:

There are internal and terminal gaps.

GCGG-CCATCAGGTAGTTGGTG--
GCGTTCCATC--CTGGTTGGTGTG

Page 27:

A terminal gap may indicate missing data.

GCGG-CCATCAGGTAGTTGGTG--
GCGTTCCATC--CTGGTTGGTGTG

Page 28:

An internal gap indicates that a deletion or an insertion has occurred in one of the two lineages.

GCGG-CCATCAGGTAGTTGGTG--
GCGTTCCATC--CTGGTTGGTGTG

Page 29:

When sequences are compared through alignment, it is impossible to tell whether a deletion has occurred in one sequence or an insertion has occurred in the other. Thus, deletions and insertions are collectively referred to as indels (short for insertion or deletion).

GCGG-CCATCAGGTAGTTGGTG--
GCGTTCCATC--CTGGTTGGTGTG

Page 30:

The alignment is the first step in many functional and evolutionary studies.

Errors in alignment tend to amplify in later stages of the study.

Page 31:

Motivation for sequence alignment

Function – Similarity may be indicative of similar function.

Evolution – Similarity may be indicative of common ancestry.

Page 32:

Some definitions

Page 33:

Methods of alignment:

1. Manual
2. Dot matrix
3. Distance Matrix
4. Combined (Distance + Manual)

Page 34:

Manual alignment. When there are few gaps and the two sequences are not too different from each other, a reasonable alignment can be obtained by visual inspection.

GCG-TCCATCAGGTAGTTGGTGTG
GCGATCCATCAGGTGGTTGGTGTG

Page 35:

Advantages of manual alignment:

(1) use of a powerful and trainable tool (the brain, well… some brains).

(2) ability to integrate additional data, e.g., domain structure, biological function.

Page 36:

Page 37:

Protein Alignment may be guided by Secondary and Tertiary Structures

Homo sapiens DjlA protein

Escherichia coli DjlA protein

Page 38:

Disadvantages of manual alignment:

subjectivity (the algorithm is unspecified)

irreproducibility (the results cannot be independently reproduced)

unscalability (inapplicable to long sequences)

incommensurability (the results cannot be compared to those obtained by other methods)

Page 39:

The dot-matrix method (Gibbs and McIntyre, 1970): The two sequences are written out as column and row headings of a two-dimensional matrix. A dot is put in the dot-matrix plot at a position where the nucleotides in the two sequences are identical.
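A dot matrix as described above can be sketched in a few lines. This is an illustration only: the helper names `dot_matrix` and `render` are hypothetical, and the two example sequences are the pair used in the worked alignment later in this deck.

```python
def dot_matrix(top: str, left: str):
    """Rows follow `left`, columns follow `top`; True marks identical residues."""
    return [[a == b for b in top] for a in left]


def render(top: str, left: str) -> str:
    """Draw the matrix as text, with '*' for a dot and '.' for an empty element."""
    lines = ["  " + " ".join(top)]
    for residue, row in zip(left, dot_matrix(top, left)):
        lines.append(residue + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


print(render("GGATCGA", "GAATTCAGT"))
```

Runs of dots along a diagonal correspond to stretches of consecutive matches between the two sequences.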

Page 40:

The alignment is defined by a path from the upper-left element to the lower-right element.

Page 41:

There are 4 possible steps in the path:

(1) a diagonal step through a dot = match.

(2) a diagonal step through an empty element of the matrix = mismatch.

(3) a horizontal step = a gap in the sequence on the left of the matrix.

(4) a vertical step = a gap in the sequence on the top of the matrix.

Page 42:

A dot matrix may become cluttered. With DNA sequences, ~25% of the elements will be occupied by dots by chance alone.

Page 43:

The number of spurious matches is determined by: window size (how many residues are compared), stringency (the minimum number of matches for a hit), & alphabet size (number of character states). Window size must be an odd number.

window size = 1, stringency = 1, alphabet size = 4
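Window size and stringency can be folded into the dot-matrix construction. The sketch below is one plausible reading, not the slides' own procedure: it assumes the window is centred on each element and runs along the diagonal (which is consistent with the requirement that the window size be odd), and that positions falling outside either sequence count as non-matching. The function name `windowed_dot_matrix` is hypothetical.

```python
def windowed_dot_matrix(top: str, left: str, window: int = 3, stringency: int = 2):
    """Mark element (i, j) when at least `stringency` of the `window`
    diagonal positions centred on it hold identical residues."""
    if window % 2 == 0:
        raise ValueError("window size must be odd")
    half = window // 2
    hits = []
    for i in range(len(left)):
        row = []
        for j in range(len(top)):
            matches = 0
            for k in range(-half, half + 1):
                ii, jj = i + k, j + k
                # Assumption: positions beyond the sequence ends never match.
                if 0 <= ii < len(left) and 0 <= jj < len(top) and left[ii] == top[jj]:
                    matches += 1
            row.append(matches >= stringency)
        hits.append(row)
    return hits


filtered = windowed_dot_matrix("GGATCGA", "GAATTCAGT", window=3, stringency=2)
print(sum(cell for row in filtered for cell in row), "dots survive the filter")
```

With `window=1` and `stringency=1` the function reduces to the plain identity dot matrix.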

Page 44:

window size = 1, stringency = 1, alphabet size = 4

window size = 3, stringency = 2, alphabet size = 4

Page 45:

window size = 1, stringency = 1, alphabet size = 20
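The effect of the three parameters on clutter can be estimated with a binomial tail. This is a back-of-the-envelope sketch under a stated assumption that is not in the slides: residues are taken to be equally frequent and independent, so a single position matches by chance with probability 1/(alphabet size). The function name `spurious_hit_probability` is hypothetical.

```python
from math import comb


def spurious_hit_probability(window: int, stringency: int, alphabet_size: int) -> float:
    """Chance that a window of random residues reaches the stringency threshold."""
    p = 1.0 / alphabet_size  # per-position chance match under the uniform assumption
    return sum(
        comb(window, k) * p**k * (1 - p) ** (window - k)
        for k in range(stringency, window + 1)
    )


for window, stringency, alphabet in [(1, 1, 4), (3, 2, 4), (1, 1, 20)]:
    prob = spurious_hit_probability(window, stringency, alphabet)
    print(window, stringency, alphabet, round(prob, 4))
```

Under this assumption the first setting gives 0.25, in line with the ~25% figure quoted for DNA, while a larger window with a stringency threshold, or a larger alphabet, lowers the expected fraction of chance dots.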

Page 46:

Dot-matrix methods:

Advantages: By being a visual representation, and humans being visual animals, the method may unravel information on the evolution of sequences that cannot easily be gleaned from a line alignment.

Disadvantages: May not identify the best possible alignment.

Page 47:

Advantages: Highlighting Information

The vertical gap indicates that a coding region corresponding to ~75 amino acids has either been deleted from the human gene or inserted into the bacterial gene.

Window size = 60 amino acids; Stringency = 24 matches

Page 48:

Advantages: Highlighting Information

The two pairs of diagonally oriented parallel lines most probably indicate that two small internal duplications occurred in the bacterial gene.

Window size = 60 amino acids; Stringency = 24 matches

Page 49:

Disadvantages:

Not possible to identify the best alignment.

Page 50:

Scoring Matrices & Gap Penalties

Page 51:

The true alignment between two sequences is the one that reflects accurately the evolutionary relationships between the sequences.

Since the true alignment is unknown, in practice we look for the optimal alignment, which is the one in which the numbers of mismatches and gaps are minimized according to certain criteria.

Page 52:

Unfortunately, reducing the number of mismatches results in an increase in the number of gaps, and vice versa.

Page 53:

[Figure legend: symbols for matches, mismatches, nucleotides in gaps, and gaps]

Page 54:

The scoring scheme comprises a gap penalty and a scoring matrix, M(a,b), that specifies the score for each type of match (a = b) or mismatch (a ≠ b).

The units in a scoring matrix may be the nucleotides in the DNA or RNA sequences, the codons in protein-coding regions, or the amino acids in protein sequences.

Page 55:

DNA scoring matrices are usually simple. In the simplest scheme all mismatches are given the same penalty.

M(a,b) is positive if a = b and negative otherwise.

In more complicated matrices a distinction may be made between transition and transversion mismatches or each type of mismatch may be penalized differently.

$$M(a,b) \begin{cases} > 0 & \text{if } a = b \\ < 0 & \text{if } a \neq b \end{cases}$$
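Both kinds of DNA scoring matrix can be written as small functions. This is a sketch with illustrative numbers: the +5 and –3 values are borrowed from the worked example later in this deck, whereas the milder transition penalty of –1 and all function names are assumptions made here.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def simple_score(a: str, b: str, match: int = 5, mismatch: int = -3) -> int:
    """Simplest scheme: every mismatch receives the same penalty."""
    return match if a == b else mismatch


def ts_tv_score(a: str, b: str, match: int = 5,
                transition: int = -1, transversion: int = -3) -> int:
    """Penalize transitions (purine<->purine, pyrimidine<->pyrimidine)
    less than transversions. The -1 value is purely illustrative."""
    if a == b:
        return match
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return transition if same_class else transversion


def build_matrix(score, alphabet: str = "ACGT"):
    """Tabulate M(a, b) for every ordered pair of nucleotides."""
    return {(a, b): score(a, b) for a in alphabet for b in alphabet}


matrix = build_matrix(ts_tv_score)
print(matrix[("A", "G")], matrix[("A", "C")], matrix[("T", "T")])
```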

Page 56:

Further complications: Distinguishing among different matches and mismatches.

For example, a mismatched pair consisting of Leu & Ile, which are very similar biochemically to each other, may be given a lesser penalty than a mismatched pair consisting of Arg & Glu, which are very dissimilar from each other.

Page 57:

Lesser penalty than

Page 58:

BLOSUM62 (BLOcks of amino acid SUbstitution Matrix)

Page 59:

BLOSUM62 (BLOcks of amino acid SUbstitution Matrix)

B = asx (asp or asn)
Z = glx (glu or gln)
X = unknown
* = termination codon

Page 60:

BLOSUM62 (BLOcks of amino acid SUbstitution Matrix)

The matrix is symmetrical

Page 61:

BLOSUM62 (BLOcks of amino acid SUbstitution Matrix)

Positive numbers on the diagonal

Page 62:

BLOSUM62 (BLOcks of amino acid SUbstitution Matrix)

Mismatches are usually penalized

Page 63:

BLOSUM62 (BLOcks of amino acid SUbstitution Matrix)

Some mismatches are not penalized

Page 64:

BLOSUM62 (BLOcks of amino acid SUbstitution Matrix)

A few mismatches are even rewarded

Page 65:

Gap penalty (or cost) is a factor (or a set of factors) by which the gap values (numbers and lengths of gaps) are mathematically manipulated to make the gaps equivalent in value to the mismatches.

The gap penalties are based on our assessment of how frequently different types of insertions and deletions occur in evolution in comparison with the frequency of occurrence of point substitutions.

Page 66:

Mismatches

Gaps

Page 67:

The gap penalty has two components: a gap-opening penalty and a gap-extension penalty.

Page 68:

Three main gap-penalty systems:

(1) Fixed gap-penalty system = 0 gap-extension costs.

Page 69:

Three main gap-penalty systems:

(2) Linear gap-penalty system = the gap-extension cost is calculated by multiplying the gap length minus 1 by a constant representing the gap-extension penalty for increasing the gap by 1.

Page 70:

Three main gap-penalty systems:

(3) Logarithmic gap-penalty system = the gap-extension penalty increases with the logarithm of the gap length, i.e., more slowly than in the linear system.
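The three systems can be compared side by side as cost functions of gap length. This is a sketch under assumptions: costs are expressed as positive numbers to be subtracted from the alignment score, the opening cost of 4 mirrors the –4 used in the worked example later in the deck, the extension constant of 1 is arbitrary, and the natural logarithm is chosen only for illustration since the slides do not specify a base. The function names are hypothetical.

```python
import math


def _check(length: int) -> None:
    if length < 1:
        raise ValueError("a gap must span at least one position")


def fixed_gap_cost(length: int, opening: float = 4.0) -> float:
    """Fixed system: extension is free, so every gap costs the opening penalty."""
    _check(length)
    return opening


def linear_gap_cost(length: int, opening: float = 4.0, extension: float = 1.0) -> float:
    """Linear system: opening plus (length - 1) times the extension constant."""
    _check(length)
    return opening + (length - 1) * extension


def logarithmic_gap_cost(length: int, opening: float = 4.0, extension: float = 1.0) -> float:
    """Logarithmic system: the extension part grows with log(length)."""
    _check(length)
    return opening + extension * math.log(length)


for gap_length in (1, 2, 5, 20):
    print(gap_length,
          fixed_gap_cost(gap_length),
          linear_gap_cost(gap_length),
          round(logarithmic_gap_cost(gap_length), 2))
```

All three agree for a gap of length 1 and diverge as the gap lengthens, with the logarithmic cost growing more slowly than the linear one.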

Page 71:

Alignment algorithms

Page 72:

Aim: Given a predetermined set of criteria, find the alignment associated with the best score from among all possible alignments.

The OPTIMAL ALIGNMENT

Page 73:

The number of possible alignments may be astronomical.

$$\binom{n+m}{\min(n,m)} = \frac{(n+m)!}{n!\,m!} \approx \sqrt{\frac{n+m}{2\pi nm}} \cdot \frac{(n+m)^{n+m}}{n^{n}\,m^{m}}$$

where n and m are the lengths of the two sequences to be aligned.

Page 74:

The number of possible alignments may be astronomical.

For example, when two DNA sequences 200 residues long each are compared, there are more than 10^153 possible alignments.

In comparison, the number of protons in the universe is only ~10^80.

Page 75:

FORTUNATELY:

There are computer algorithms for finding the optimal alignment between two sequences that do not require an exhaustive search of all the possibilities.

Page 76:

The Needleman-Wunsch (1970) algorithm uses Dynamic Programming

Page 77:

Dynamic programming = a computational technique. It is applicable when large searches can be divided into a succession of small stages, such that (1) the solution of the initial search stage is trivial, (2) each partial solution in a later stage can be calculated by reference to only a small number of solutions in an earlier stage, and (3) the last stage contains the overall solution.

Page 78:

Dynamic programming can be applied to problems of alignment because ALIGNMENT SCORES obey the following rules:

$$S_{1 \to x,\; 1 \to y} = S_{x,\,y} + S_{1 \to x-1,\; 1 \to y-1}$$

Page 79:

Path Graph for aligning two sequences

Page 80:

allowed

Page 81:

not allowed

Page 82:

Page 83:

Scoring scheme

match = +5
mismatch = –3
gap-opening penalty = –4
gap-extension penalty = 0

Page 84:

Matrix initialization

Page 85:

Matrix initialization

0 + match = 5

Page 86:

Matrix initialization

0 + gap = –4

Page 87:

Matrix initialization

0 + gap = –4

Page 88:

Matrix fill

0 + match = 5

Page 89:

Matrix fill

5 + gap = 1

Page 90:

Matrix fill

0 + gap = –4

Page 91:

… and so on and so forth

Page 92:

Complete matrix fill

Page 93:

Trace back

Page 94:

The alignment is produced by starting either at the bottom rightmost cell or, if terminal gaps are allowed, at the highest score in the rightmost column or the bottom row, and then proceeding from right to left by following the best pointers.

This stage is called the traceback. The graph of pointers in the traceback is also referred to as the path graph because it defines the paths through the matrix that correspond to the optimal alignment or alignments.
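The initialization, fill and traceback stages can be put together in a compact sketch. This is not the slides' own code, and several choices are assumptions: the function name `needleman_wunsch` is hypothetical; the sketch charges –4 for every gapped column, which is how the worked arithmetic on the slides reads (for example "5 + gap = 1"); it always starts the traceback at the bottom rightmost cell (terminal gaps not allowed); and when several moves tie it keeps only one pointer, preferring the diagonal, so it reports a single optimal alignment rather than the full path graph.

```python
def needleman_wunsch(seq_a: str, seq_b: str, match: int = 5,
                     mismatch: int = -3, gap: int = -4):
    """Global alignment by dynamic programming; returns (score, row_a, row_b)."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]
    pointer = [[None] * cols for _ in range(rows)]

    # Initialization: the first row and column can only be reached through gaps.
    for i in range(1, rows):
        score[i][0], pointer[i][0] = i * gap, "up"
    for j in range(1, cols):
        score[0][j], pointer[0][j] = j * gap, "left"

    # Fill: each cell depends only on its three already-computed neighbours.
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            options = [
                (score[i - 1][j - 1] + pair, "diag"),
                (score[i - 1][j] + gap, "up"),
                (score[i][j - 1] + gap, "left"),
            ]
            # max() keeps the first best option, so ties favour the diagonal.
            score[i][j], pointer[i][j] = max(options, key=lambda option: option[0])

    # Traceback: follow the pointers from the bottom-right cell to the origin.
    i, j = len(seq_a), len(seq_b)
    out_a, out_b = [], []
    while i > 0 or j > 0:
        move = pointer[i][j]
        if move == "diag":
            out_a.append(seq_a[i - 1]); out_b.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif move == "up":
            out_a.append(seq_a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(seq_b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


total, row_a, row_b = needleman_wunsch("GAATTCAGT", "GGATCGA")
print(total)
print(row_a)
print(row_b)
```

The two input strings are the ungapped sequences of the worked example in this deck, whose traceback starts from a score of 11 in the bottom rightmost cell.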

Page 95:

Trace back (if we DO allow terminal gaps)

Page 96:

Trace back (if we DO NOT allow terminal gaps)

Page 97:

Trace back (if we DO NOT allow terminal gaps)

10 + gap ≠ 11
14 + mismatch = 11
10 + gap ≠ 11

Page 98:

Trace back (if we DO NOT allow terminal gaps)

10 + gap ≠ 14
9 + match = 14
5 + gap ≠ 14

Page 99:

Trace back (if we DO NOT allow terminal gaps)

4 + mismatch ≠ 9
13 + gap = 9
0 + gap ≠ 9

Page 100:

Trace back (if we DO NOT allow terminal gaps)

8 + match = 13
4 + gap ≠ 13
9 + gap ≠ 13

Page 101:

Trace back (if we DO NOT allow terminal gaps)

–1 + gap ≠ 8
12 + gap = 8
3 + match = 8

Page 102:

Trace back (if we DO NOT allow terminal gaps)

7 + gap = 3
–6 + gap ≠ 3
–2 + mismatch ≠ 3

7 + gap ≠ 12
7 + match = 12
3 + gap ≠ 12

Page 103:

Trace back (if we DO NOT allow terminal gaps)

Page 104:

Trace back (complete)

high road/low road/middle road

Page 105:

Two possible alignments:

GAATTCAGT
GGA-TC-GA
* * ** *

GAATTCAGT
GGAT-C-GA
* ** * *
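That the two alignments are equally good can be checked by scoring them column by column. This is a small sketch, assuming, as the worked arithmetic on the slides suggests, that each gapped column costs –4; the function name `alignment_score` is hypothetical.

```python
def alignment_score(row_a: str, row_b: str, match: int = 5,
                    mismatch: int = -3, gap: int = -4) -> int:
    """Sum the column scores of an already-aligned pair of rows."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            total += gap
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


first = alignment_score("GAATTCAGT", "GGA-TC-GA")
second = alignment_score("GAATTCAGT", "GGAT-C-GA")
print(first, second)  # 11 11
```

Each alignment has five matches, two mismatches and two single-residue gaps, so both reach 5(5) + 2(–3) + 2(–4) = 11.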

Page 106:

Scoring Matrices

Mismatch and gap penalties should be inversely proportional to the frequencies with which changes occur.

Page 107:

               To A                     To T                     To C                     To G                     Row totals
From A         -                        3.4 ± 0.7 (3.6 ± 0.7)    4.5 ± 0.8 (4.8 ± 0.9)    12.5 ± 1.1 (13.3 ± 1.1)  20.3 (21.6)
From T         3.3 ± 0.6 (3.5 ± 0.6)    -                        13.8 ± 1.9 (14.7 ± 2.0)  3.3 ± 0.6 (3.5 ± 0.6)    20.4 (21.7)
From C         4.2 ± 0.5 (4.2 ± 0.5)    20.7 ± 1.3 (16.4 ± 1.3)  -                        4.6 ± 0.6 (4.4 ± 0.6)    29.5 (25.1)
From G         20.4 ± 1.4 (21.9 ± 1.5)  4.4 ± 0.6 (4.6 ± 0.6)    4.9 ± 0.7 (5.2 ± 0.8)    -                        29.7 (31.6)
Column totals  27.9 (29.5)              28.5 (24.6)              23.2 (23.2)              20.5 (21.3)

Transitions (68%) occur more frequently than transversions (32%).

Mismatch penalties for transitions should be smaller than those for transversions.

Page 108:

Empirical substitution matrices

PAM (Percent/Point Accepted Mutation)

BLOSUM (BLOcks SUbstitution Matrix)

Page 109:

PAM

• Developed by Margaret Dayhoff in 1978.

• Based on comparisons of very similar protein sequences.

Page 110:

Log-odds ratios

• A scoring matrix is a table of values that describe the probability of a residue (amino acid or base) pair occurring in an alignment.

• The values in a scoring matrix are log ratios of two probabilities. One is the probability of the pair occurring at random. The other is the empirically observed probability of the pair occurring.

• Because the scores are logarithms of probability ratios, they can be added to give a meaningful score for the entire alignment. The more positive the score, the better the alignment!

Page 111:

The PAM matrices (Percent accepted mutations)

• Align sequences that are at least 85% identical.

– Minimizes ambiguity in alignments and the number of coincident mutations.

• Reconstruct phylogenetic trees and infer ancestral sequences.

• Tally replacements "accepted" by natural selection, in all pairwise comparisons.

– Meaning, the number of times j was replaced by i in all comparisons.

• Compute amino acid mutability (i.e., the propensity of a given amino acid, j, to be replaced).

Page 112:

The PAM matrices

• Combine data to produce a Mutation Probability Matrix for one PAM of evolutionary distance, which is used to calculate the Log Odds Matrix for similarity scoring.

• Thus, depending on the protein family used, various PAM matrices result, some of which are “good” at locating evolutionarily distant conserved mutations and some of which are good at locating evolutionarily close conserved mutations.


More on log-odds ratios

In PAM matrices, log-odds scores are multiplied by 10 to avoid decimals. Therefore, a PAM score of 2 actually corresponds to a log-odds ratio of 0.2.

0.2 = score(i → j) = log10 [ (observed i → j mutation rate) / (expected rate) ]

The value 0.2 is log10 of the relative expectation value of the mutation. Therefore, the expectation value is 10^0.2 ≈ 1.6.

So, a PAM score of 2 indicates that (in related sequences) the mutation would be expected to occur 1.6 times more frequently than random.
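The same arithmetic can be written as a small helper. The function name and the `scale` parameter are hypothetical; the factor of 10 is the scaling described above.

```python
def pam_score_to_odds(score, scale=10):
    """Convert a scaled PAM log-odds score back to an odds ratio.

    The score is divided by the scale factor to recover the log10 value,
    which is then exponentiated.
    """
    return 10 ** (score / scale)


odds_for_score_2 = pam_score_to_odds(2)    # about 1.58
odds_for_score_0 = pam_score_to_odds(0)    # exactly 1: as frequent as chance
odds_for_score_m2 = pam_score_to_odds(-2)  # below 1: rarer than chance
```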


PAM250

– Calculated for families of related proteins (>85% identity)

– 1 PAM is the amount of evolutionary change that yields, on average, one substitution in 100 amino acid residues

– A positive score signifies a common replacement whereas a negative score signifies an unlikely replacement

– PAM250 matrix assumes/is optimized for sequences separated by 250 PAM, i.e. 250 substitutions in 100 amino acids (longer evolutionary time)


PAM250

A sequence alignment matrix that allows for 250 accepted point mutations per 100 amino acids. PAM250 is suitable for comparing distantly related sequences, while a lower PAM is suitable for comparing more closely related sequences.


Selecting a PAM Matrix

• Low PAM numbers: short sequences, strong local similarities.

• High PAM numbers: long sequences, weak similarities.

– PAM60 for close relations (60% identity)

– PAM120 recommended for general use (40% identity)

– PAM250 for distant relations (20% identity)

• If uncertain, try several different matrices

– PAM40, PAM120, PAM250 recommended.


BLOSUM

• Blocks Substitution Matrix

– Steven and Jorja G. Henikoff (1992).

• Based on BLOCKS database (www.blocks.fhcrc.org)

– Families of proteins with identical function.

– Highly conserved protein domains.

• Ungapped local alignment to identify motifs

– Each motif is a block of local alignment.

– Counts amino acids observed in same column.

– Symmetrical model of substitution.
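The column-counting step can be sketched as follows. This is a simplified illustration, not the full published procedure: it skips the clustering and weighting of similar sequences, uses a single invented block, and the function name is hypothetical. Scores are scaled to half-bits (2 × log2), the scaling used for the published BLOSUM matrices.

```python
import math
from collections import Counter
from itertools import combinations


def blosum_like_scores(block):
    """Log-odds scores from one ungapped block (list of equal-length rows)."""
    # Count every unordered residue pair within each column.
    pair_counts = Counter()
    for column in zip(*block):
        for x, y in combinations(column, 2):
            pair_counts[tuple(sorted((x, y)))] += 1

    total = sum(pair_counts.values())
    q = {pair: n / total for pair, n in pair_counts.items()}

    # Background frequency of each residue implied by the pair frequencies.
    p = Counter()
    for (x, y), freq in q.items():
        if x == y:
            p[x] += freq
        else:
            p[x] += freq / 2
            p[y] += freq / 2

    # Log-odds of observed versus expected pair frequency, in half-bits.
    scores = {}
    for (x, y), freq in q.items():
        expected = p[x] * p[y] if x == y else 2 * p[x] * p[y]
        scores[(x, y)] = round(2 * math.log2(freq / expected))
    return scores


# Toy block with three columns and four sequences (invented data).
toy_scores = blosum_like_scores(["AAC", "AAC", "ABC", "ABB"])
```

Pairs never seen in the block (here A with C) get no entry, because their log-odds score would be minus infinity. Counting unordered pairs is what makes the substitution model symmetrical.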


BLOSUM62

• BLOSUM matrices are based on local alignments (“blocks” or conserved amino acid patterns).

• BLOSUM 62 is a matrix calculated from comparisons of sequences that are no more than 62% identical (more similar sequences are clustered and counted as one).

• All BLOSUM matrices are based on observed alignments; they are not extrapolated from comparisons of closely related proteins.

• BLOSUM 62 is the default matrix in BLAST 2.0.


BLOSUM Matrices

• Different BLOSUMn matrices are calculated independently from BLOCKS

• BLOSUMn is based on sequences that are at most n percent identical.


BLOSUM62

The procedure for calculating a BLOSUM matrix is based on a likelihood method estimating the occurrence of each possible pairwise substitution. Only aligned blocks are used to calculate the BLOSUMs.

The higher the score, the more closely related the sequences.


Why is BLOSUM62 called BLOSUM62?

Because, within each block, all members that shared at least 62% identity with ANY other member of that block were clustered, averaged, and represented as 1 sequence.
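The clustering rule ("at least 62% identity with ANY other member") behaves like single-linkage clustering, which can be sketched as below. The function names and the toy block are hypothetical, and the sketch assumes that the rows of a block are distinct strings of equal length.

```python
def percent_identity(a, b):
    """Percent of identical positions between two equal-length block rows."""
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def cluster_block(block, threshold=62.0):
    """Group rows so that each row is at least `threshold` percent
    identical to at least one other row of its cluster."""
    parent = list(range(len(block)))

    def find(i):
        # Follow parent links to the cluster representative.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(block)):
        for j in range(i + 1, len(block)):
            if percent_identity(block[i], block[j]) >= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i, seq in enumerate(block):
        clusters.setdefault(find(i), []).append(seq)
    return list(clusters.values())


def sequence_weights(block, threshold=62.0):
    """Each cluster counts as one sequence: a member of a cluster of
    size k receives weight 1/k."""
    weights = {}
    for cluster in cluster_block(block, threshold):
        for seq in cluster:
            weights[seq] = 1.0 / len(cluster)
    return weights


# Invented block: rows 1 and 3 are only 60% identical, but both are
# linked through row 2 (90% and 70% identity respectively).
toy_block = ["ACDEFGHIKL", "ACDEFGHIKV", "ACDEFGWWWV", "WWWWWWWWWW"]
toy_weights = sequence_weights(toy_block)
```

With these weights, a group of near-identical sequences contributes to the pair counts as much as one sequence would, so over-represented families do not dominate the matrix.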


Selecting a BLOSUM Matrix

• For BLOSUMn, higher n suitable for sequences which are more similar

– BLOSUM62 recommended for general use

– BLOSUM80 for close relations

– BLOSUM45 for distant relations


Equivalent PAM and Blosum matrices

The following matrices are roughly equivalent (listed from less divergent to more divergent):

• PAM100 ==> Blosum90

• PAM120 ==> Blosum80

• PAM160 ==> Blosum60

• PAM200 ==> Blosum52

• PAM250 ==> Blosum45

Generally speaking...

• The Blosum matrices are best for detecting local alignments.

• The Blosum62 matrix is the best for detecting the majority of weak protein similarities.

• The Blosum45 matrix is the best for detecting long and weak alignments.


Comparison of PAM250 and BLOSUM62

The relationship between BLOSUM and PAM substitution matrices:

BLOSUM matrices with higher numbers and PAM matrices with low numbers are both designed for comparisons of closely related sequences.

BLOSUM matrices with low numbers and PAM matrices with high numbers are designed for comparisons of distantly related proteins.

If distant relatives of the query sequence are specifically being sought, the matrix can be tailored to that type of search.


Scoring matrices commonly used

• PAM250

– Shown to be appropriate for searching for sequences of 17-27% identity.

• BLOSUM62

– Though it is tailored for comparisons of moderately distant proteins, it performs well in detecting closer relationships.

• BLOSUM50

– Shown to be better for FASTA searches.


Effect of gap penalties on amino-acid alignment

Human pancreatic hormone precursor versus chicken pancreatic hormone

(a) Penalty for gaps is 0

(b) Penalty for a gap of size k residues is w_k = 1 + 0.1k

(c) The same alignment as in (b), only the similarity between the two sequences is further enhanced by showing pairs of biochemically similar amino acids
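The gap penalty in (b) can be written as a small function and applied to one row of an alignment. The function names, the `-` gap character, and the example string are assumptions made for illustration.

```python
def gap_penalty(k, opening=1.0, extension=0.1):
    """Penalty w_k = opening + extension * k for one gap of length k."""
    if k <= 0:
        return 0.0
    return opening + extension * k


def total_gap_penalty(aligned_seq, gap_char="-"):
    """Sum the penalties of all gap runs in one row of an alignment."""
    total = 0.0
    run = 0  # length of the gap run currently being read
    for ch in aligned_seq:
        if ch == gap_char:
            run += 1
        else:
            total += gap_penalty(run)
            run = 0
    return total + gap_penalty(run)  # close a gap run at the end, if any


# One gap of length 2 and one gap of length 1 (invented example row).
example_penalty = total_gap_penalty("AC--GT-A")
```

Under this scheme one gap of five residues costs 1.5, whereas five separate gaps of one residue cost 5.5 in total, so the alignment in (b) favors a few long gaps over many short ones.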


Alignments: things to keep in mind

“Optimal alignment” means “having the highest possible score, given a substitution matrix and a set of gap penalties”

This is NOT necessarily the most meaningful alignment

The assumptions of the algorithm are often wrong:

- substitutions are not equally frequent at all positions,

- it is very difficult to realistically model insertions and deletions.

Pairwise alignment programs ALWAYS produce an alignment (even when it does not make sense to align sequences)
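Two of these points can be seen in a minimal global alignment scorer: the optimal score depends entirely on the chosen parameters, and a score is returned even for unrelated sequences. The sketch uses the Needleman-Wunsch recurrence with a simple match/mismatch scheme and a linear gap penalty; the parameter values and the function name are assumptions, not recommended settings.

```python
def global_alignment_score(seq1, seq2, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score (Needleman-Wunsch, linear gaps)."""
    rows, cols = len(seq1) + 1, len(seq2) + 1
    score = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq1[i - 1] == seq2[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align the two residues
                score[i - 1][j] + gap,       # gap in seq2
                score[i][j - 1] + gap,       # gap in seq1
            )
    return score[-1][-1]


identical = global_alignment_score("GATTACA", "GATTACA")
unrelated = global_alignment_score("AAAA", "TTTT")
unrelated_free_gaps = global_alignment_score("AAAA", "TTTT", gap=0)
```

The two unrelated sequences still receive an "optimal" score, and that score changes when only the gap penalty is changed, which is why a high-scoring alignment is not by itself evidence of homology.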
