
In-Class Assignment #1: Research CD2

• Follow instructions on distributed assignment sheet

Biology 4900

Biocomputing

Chapter 3

Pairwise Sequence Alignment

Pairwise Alignment

• Potential relationships between proteins or nucleic acids can be explored by comparing 2 or more sequences of amino acids or nucleotides.

• Difficult to do visually.

• Computer algorithms help us by:

– Accelerating the comparison process

– Allowing for “gaps” or indels in sequences (i.e., insertions, deletions)

– Identifying substituted amino acids that are structurally or functionally similar (e.g., D and E).

Pevsner, Bioinformatics and Functional Genomics, 2009

One way to do this is with BLAST (Basic Local Alignment Search Tool)

• Allows rapid sequence comparison of a query sequence against a database.

• The BLAST algorithm is fast, accurate, and web-accessible.

• BLAST lets the user select from a variety of scoring matrices to evaluate sequence relatedness.

Sequence Analyses: RNA

• Codons (3 RNA bases in sequence) determine each amino acid that will build the protein expressed

• Many amino acids are encoded by more than 1 codon (change in 3rd base). Change of single base may not be significant.

Comparing protein sequences

• Comparing protein sequences is usually more informative than comparing nucleotide sequences.

– Changing the base at the 3rd position in a codon does not always change the AA (Ex: Both UUU and UUC encode phenylalanine)

– Different AAs may share similar chemical properties (Ex: hydrophobic residues A, V, L, I)

– Relationships between related but mismatched AAs in sequence analysis can be accounted for using scoring systems (matrices).

– Protein sequence comparisons can ID sequence homologies from proteins sharing a common ancestor as far back as 1 × 10^9 years ago (vs. 600 × 10^6 for DNA).

Amino acids by similar biophysical properties

http://kimwootae.com.ne.kr/apbiology/chap2.htm



These have useful fluorescent properties

Sequence Identity and Similarity

• Identity: How closely two sequences match one another.

– Unlike homology, identity can be measured quantitatively

• Similarity: Pairs of residues that are structurally or functionally related (conservative substitutions).


>lcl|28245 3CLN:A|PDBID|CHAIN|SEQUENCE
Length=148

Score = 268 bits (684), Expect = 3e-97, Method: Compositional matrix adjust.
Identities = 130/148 (88%), Positives = 143/148 (97%), Gaps = 0/148 (0%)

Query    1  AEQLTEEQIAEFKEAFALFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGN   60
            A+QLTEEQIAEFKEAF+LFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGN
Sbjct    1  ADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGN   60

Query   61  GTIDFPEFLSLMARKMKEQDSEEELIEAFKVFDRDGNGLISAAELRHVMTNLGEKLTDDE  120
            GTIDFPEFL++MARKMK+ DSEEE+ EAF+VFD+DGNG ISAAELRHVMTNLGEKLTD+E
Sbjct   61  GTIDFPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEE  120

Query  121  VDEMIREADIDGDGHINYEEFVRMMVSK  148
            VDEMIREA+IDGDG +NYEEFV+MM +K
Sbjct  121  VDEMIREANIDGDGQVNYEEFVQMMTAK  148

88% of the aligned positions contain the same amino acid in both sequences (Identities). This increases to 97% (Positives) when you also count positions where the amino acids differ but have similar properties.
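To make the "Identities" figure concrete, here is a minimal Python sketch that counts identical positions in two pre-aligned sequences. The function name `percent_identity` is hypothetical, and only the last block of the BLAST output (residues 121-148) is used, so the percentage differs from the 88% reported for the full alignment. Counting "Positives" would additionally need a scoring matrix and is not attempted here.

```python
def percent_identity(query: str, subject: str) -> tuple[int, int, float]:
    """Count identical positions in two pre-aligned, equal-length sequences."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    # A position counts as an identity only if both residues match
    # and the position is not a gap.
    identical = sum(1 for q, s in zip(query, subject) if q == s and q != "-")
    return identical, len(query), 100.0 * identical / len(query)


# Last block of the alignment (residues 121-148).
query = "VDEMIREADIDGDGHINYEEFVRMMVSK"
sbjct = "VDEMIREANIDGDGQVNYEEFVQMMTAK"

identical, length, pct = percent_identity(query, sbjct)
print(f"Identities = {identical}/{length} ({pct:.0f}%)")
```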

Sequence Homology

• Homology: Two sequences are homologous if they share a common ancestor.

• No “degrees of homology”: only homologous or not

• Almost always share similar 3D structure

– Ex. myoglobin and beta globin

– Sequences can change significantly over time, but 3D structure changes more slowly


Beta-globin sub-unit of adult hemoglobin (2H35.pdb, in blue), superimposed over myoglobin (3RGK.pdb, in red). These sequences probably separated 600 million years ago.

Percent Identity and Homology

• For an alignment of 70 amino acids, 40% sequence identity is a reasonable threshold for homology.

• Above 20% (more than 70 amino acids) may indicate homology.

• Below 20% probably indicates chance alignment.


Orthologs and Paralogs

• Orthologs: Homologous sequences in different species that arose from a common ancestral gene during speciation.

– Ex. Humans and rats diverged around 80 million years ago, at which point divergence of their myoglobin genes occurred.

– Orthologs frequently have similar biological functions.

• Human and rat myoglobin (oxygen transport)

• Human and rat CaM

• Paralogs: Homologous sequences that arose by a mechanism such as gene duplication.

– Within same organism/species

– Ex. Myoglobin and beta globin are paralogs

– Have distinct but related functions.


Conservative Substitutions in Matrices

Scoring may also vary based on conserved substitutions of amino acids: i.e., amino acids with similar properties will not lose as many points as AAs with very different properties.

Basic AAs: K, R, H
Acidic AAs: D, E
Hydroxylated AAs: S, T
Hydrophobic AAs: G, A, V, L, I, M, F, P, W, Y


These relationships would be considered when calculating “Positives” in BLAST alignment.

Dayhoff Model: Building a Scoring Matrix

• In 1978, Margaret Dayhoff provided one of the first models of a scoring matrix

• Model was based on rules by which evolutionary changes occur in proteins

• Catalogued 1000’s of proteins, considered which specific amino acid substitutions occurred when 2 homologous proteins aligned

• Assumes substitution patterns in closely-related proteins can be extrapolated to more distantly-related proteins

• An accepted point mutation (PAM) is an AA replacement accepted by natural selection

• Based on observed mutations, not necessarily on related AA properties

• Probable mutations are rewarded, while unlikely mutations are penalized

• Scores for comparison of 2 residues (i, j) are based on the following equation:

S_{i,j} = 10 \log_{10} ( q_{i,j} / p_i )

Here, q_{i,j} is the probability of an observed substitution (from the mutation probability matrix), while p_i is the likelihood of observing the replacement AA (i) as a result of chance (from the normalized frequency of AA table).
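As a worked example of this scoring idea, the sketch below assumes the standard Dayhoff log-odds form, 10 × log10(q / p), rounded to the nearest integer. The function name `log_odds_score` is hypothetical. The inputs are taken from the PAM250 mutation probability matrix and the normalized frequency table in these slides: R replaced by K (q = 0.18, p(Lys) = 0.085) and W conserved as W (q = 0.55, p(Trp) = 0.012).

```python
import math


def log_odds_score(q_ij: float, p_i: float) -> int:
    """Log-odds score: 10 * log10(observed probability / chance frequency)."""
    return round(10 * math.log10(q_ij / p_i))


# R replaced by K: 18% in the mutation probability matrix, Lys frequency 0.085.
print(log_odds_score(0.18, 0.085))

# W conserved as W: 55% in the mutation probability matrix, Trp frequency 0.012.
print(log_odds_score(0.55, 0.012))
```

Both results agree with the corresponding entries (3 and 17) in the PAM250 log-odds matrix.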


PAM250 Mutation Probability Matrix

Columns: original AA. Rows: replacement AA.

       Ala Arg Asn Asp Cys Gln Glu Gly His Ile Leu Lys Met Phe Pro Ser Thr Trp Tyr Val
         A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
Ala A   13   6   9   9   5   8   9  12   6   8   6   7   7   4  11  11  11   2   4   9
Arg R    3  17   4   3   2   5   3   2   6   3   2   9   4   1   4   4   3   7   2   2
Asn N    4   4   6   7   2   5   6   4   6   3   2   5   3   2   4   5   4   2   3   3
Asp D    5   4   8  11   1   7  10   5   6   3   2   5   3   1   4   5   5   1   2   3
Cys C    2   1   1   1  52   1   1   2   2   2   1   1   1   1   2   3   2   1   4   2
Gln Q    3   5   5   6   1  10   7   3   7   2   3   5   3   1   4   3   3   1   2   3
Glu E    5   4   7  11   1   9  12   5   6   3   2   5   3   1   4   5   5   1   2   3
Gly G   12   5  10  10   4   7   9  27   5   5   4   6   5   3   8  11   9   2   3   7
His H    2   5   5   4   2   7   4   2  15   2   2   3   2   2   3   3   2   2   3   2
Ile I    3   2   2   2   2   2   2   2   2  10   6   2   6   5   2   3   4   1   3   9
Leu L    6   4   4   3   2   6   4   3   5  15  34   4  20  13   5   4   6   6   7  13
Lys K    6  18  10   8   2  10   8   5   8   5   4  24   9   2   6   8   8   4   3   5
Met M    1   1   1   1   0   1   1   1   1   2   3   2   6   2   1   1   1   1   1   2
Phe F    2   1   2   1   1   1   1   1   3   5   6   1   4  32   1   2   2   4  20   3
Pro P    7   5   5   4   3   5   4   5   5   3   3   4   3   2  20   6   5   1   2   4
Ser S    9   6   8   7   7   6   7   9   6   5   4   7   5   3   9  10   9   4   4   6
Thr T    8   5   6   6   4   5   5   6   4   6   4   6   5   3   6   8  11   2   3   6
Trp W    0   2   0   0   0   0   0   0   1   0   1   0   0   1   0   1   0  55   1   0
Tyr Y    1   1   2   1   3   1   1   1   3   2   2   1   2  15   1   2   2   3  31   2
Val V    7   4   4   4   4   4   4   5   4  15  10   4  10   5   5   5   7   2   4  17

Think of these values as percentages (columns sum to approximately 100).

For example, there is an 18% (0.18) probability of R being replaced by K.

This probability matrix needs to be converted into a scoring matrix.

http://www.icp.ucl.ac.be/~opperd/private/pam250.html

Normalized Frequencies of Amino Acids

Ala 0.096    Asn 0.042
Gly 0.090    Pro 0.041
Lys 0.085    Ile 0.035
Leu 0.085    His 0.034
Val 0.078    Arg 0.034
Thr 0.062    Gln 0.032
Ser 0.057    Tyr 0.030
Asp 0.053    Cys 0.025
Glu 0.053    Met 0.012
Phe 0.045    Trp 0.012

How often a given amino acid appears in a protein (determined by empirical analyses)

Purpose of PAM Matrices

• Derive a scoring system to determine relatedness of 2 sequences.

• PAM mutation probability matrix must be converted to a scoring matrix (log odds matrix).

PAM250 Log-Odds Matrix

Cys C   12
Ser S    0   2
Thr T   -2   1   3
Pro P   -3   1   0   6
Ala A   -2   1   1   1   2
Gly G   -3   1   0  -1   1   5
Asn N   -4   1   0  -1   0   0   2
Asp D   -5   0   0  -1   0   1   2   4
Glu E   -5   0   0  -1   0   0   1   3   4
Gln Q   -5  -1  -1   0   0  -1   1   2   2   4
His H   -3  -1  -1   0  -1  -2   2   1   1   3   6
Arg R   -4   0  -1   0  -2  -3   0  -1  -1   1   2   8
Lys K   -5   0   0  -1  -1  -2   1   0   0   1   0   3   5
Met M   -5  -2  -1  -2  -1  -3  -2  -3  -2  -1  -2   0   0   6
Ile I   -2  -1   0  -2  -1  -3  -2  -2  -2  -2  -2  -2  -2   2   5
Leu L   -8  -3  -2  -3  -2  -4  -3  -4  -3  -2  -2  -3  -3   4   2   8
Val V   -2  -1   0  -1   0  -1  -2  -2  -2  -2  -2  -2  -2   2   4   2   4
Phe F   -4  -3  -3  -5  -4  -5  -4  -6  -5  -5  -2  -4  -5   0   1   2  -1   9
Tyr Y    0  -3  -3  -5  -3  -5  -2  -4  -4  -4   0  -4  -4  -2  -1  -1  -2   7  10
Trp W   -8  -2  -5  -6  -6  -7  -4  -7  -7  -5  -3   2  -3  -4  -5  -2  -6   0   0  17
         C   S   T   P   A   G   N   D   E   Q   H   R   K   M   I   L   V   F   Y   W
       Cys Ser Thr Pro Ala Gly Asn Asp Glu Gln His Arg Lys Met Ile Leu Val Phe Tyr Trp

This is the PAM250 scoring matrix, calculated as follows:

S_{i,j} = 10 \log_{10} ( q_{i,j} / p_i )


Pairwise Alignment and Homology

PAM Value    Distance (%)
 80          50
100          60
200          75
250          85   <- Twilight zone
300          92

Think of the PAM value as the total number of mutations. This includes multiple mutations over time at a single position.

Currently, we accept that once the percent distance reaches ~85%, homology is indeterminate.

PAM250 works best for more distantly related protein sequences.

http://www.icp.ucl.ac.be/~opperd/private/pam.html

Seq1 AGDFWYGGDGEYLLV
Seq2 AGQFWYGGEGEKLLV
Seq3 AGEFWYGGEGEKLLV

Seq1 and Seq2 are separated by 3 PAM units, while Seq1 and Seq3 are separated by 4 PAM units.
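The following Python sketch counts only the observed position-by-position differences between the three example sequences (the helper name `observed_differences` is hypothetical). Because the PAM value counts every mutation, including multiple mutations at a single position, the observed count is a lower bound on the PAM distance. One possible reading of the 4 units between Seq1 and Seq3, offered here as an assumption, is that position 3 changed twice (D to Q, then Q to E).

```python
from itertools import combinations

seqs = {
    "Seq1": "AGDFWYGGDGEYLLV",
    "Seq2": "AGQFWYGGEGEKLLV",
    "Seq3": "AGEFWYGGEGEKLLV",
}


def observed_differences(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


for (name_a, seq_a), (name_b, seq_b) in combinations(seqs.items(), 2):
    print(name_a, name_b, observed_differences(seq_a, seq_b))
```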

Practical Lessons from the Dayhoff Model

Less mutable amino acids likely play more important structural and functional roles

Mutable amino acids fulfill functions that can be filled by other amino acids with similar properties

Common substitutions tend to require only a single nucleotide change in codon

Amino acids that can be created from more than 1 codon are more likely to be created as a substitute (See p. 63, textbook)

Changes to sequence that do not alter structure and function of protein likely to be more tolerated in nature


BLOSUM62 Scoring Matrix

• BLOck SUbstitution Matrix

• By Henikoff and Henikoff (1992)

• Default scoring matrix for pairwise alignment of sequences using BLAST (local alignments)

• Based on empirical observations of distantly-related proteins organized into blocks


A    4
R   -1   5
N   -2   0   6
D   -2  -2   1   6
C    0  -3  -3  -3   9
Q   -1   1   0   0  -3   5
E   -1   0   0   2  -4   2   5
G    0  -2   0  -1  -3  -2  -2   6
H   -2   0   1  -1  -3   0   0  -2   8
I   -1  -3  -3  -3  -1  -3  -3  -4  -3   4
L   -1  -2  -3  -4  -1  -2  -3  -4  -3   2   4
K   -1   2   0  -1  -1   1   1  -2  -1  -3  -2   5
M   -1  -2  -2  -3  -1   0  -2  -3  -2   1   2  -1   5
F   -2  -3  -3  -3  -2  -3  -3  -3  -1   0   0  -3   0   6
P   -1  -2  -2  -1  -3  -1  -1  -2  -2  -3  -3  -1  -2  -4   7
S    1  -1   1   0  -1   0   0   0  -1  -2  -2   0  -1  -2  -1   4
T    0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   1   5
W   -3  -3  -4  -4  -2  -2  -3  -2  -2  -3  -2  -3  -1   1  -4  -3  -2  11
Y   -2  -2  -2  -3  -2  -1  -2  -3   2  -1  -1  -2  -1   3  -3  -2  -2   2   7
V    0  -3  -3  -3  -1  -2  -2  -3  -3   3   1  -2   1  -1  -2  -2   0  -3  -1   4
     A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V

In BLOSUM62, proteins are arranged in blocks sharing at least 62% identity

General Trends in Scoring Matrices

Less divergent (ex. human vs. chimp): BLOSUM90, PAM30
Intermediate: BLOSUM62, PAM120
More divergent (ex. human vs. bacteria): BLOSUM45, PAM250

Choose a matrix that is consistent with the level of sequence identity you are investigating; i.e., if you are looking at/for more closely related sequences, use BLOSUM90. If you are not sure, use BLOSUM62.

Sequence Alignments: General Concepts

• Global Alignment: Tries to match the entire length of the sequence.

• Local Alignment: Tries to find the longest section that matches.

Both are examples of dynamic programming: precise but slow

Global Alignment

Input: two sequences over the same alphabet (either nucleotide or amino acid sequences)

Output: The alignment of the sequences

Example:

• GADEGYFGPVILAADGEVA and GGAEGDYFGPAIAEGEVA

• A possible alignment might look like this:

-GADEG-YFGPVILAADGEVA
GGA-EGDYFGPAI--AEGEVA

(Figure labels mark 2 insertions, 3 deletions, and 2 mutations in this alignment.)

Each position is scored independently:

• Match: +1

• Mismatch: -1

• Insertions or deletions (gaps): -2

The alignment score is the sum of the position scores
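The scoring rule above can be sketched in a few lines of Python. The function name `score_alignment` and its parameter names are hypothetical; the scores (+1, -1, -2) are the ones listed on the slide, and the input must already be aligned (equal length, with "-" for gaps).

```python
def score_alignment(top: str, bottom: str,
                    match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Sum independent per-position scores for a pre-aligned sequence pair."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += gap          # insertion or deletion
        elif a == b:
            total += match        # identical residues
        else:
            total += mismatch     # substituted residues
    return total


score = score_alignment("-GADEG-YFGPVILAADGEVA",
                        "GGA-EGDYFGPAI--AEGEVA")
print(score)  # 14 matches, 5 gaps, 2 mismatches
```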

Global Alignment – A Simple Scoring Scheme

-GADEG-YFGPVILAADGEVA
GGA-EGDYFGPAI--AEGEVA
Global Alignment Score: (14 × (+1)) + (5 × (-2)) + (2 × (-1)) = 2

-----GADEG-YFGPVILAADGEVA---
DLGNVGA-EGDYFGPAI--AEGEVARPL
Global Alignment Score: (14 × (+1)) + (12 × (-2)) + (2 × (-1)) = -12

-----GADEG-YFGPVILAADGEVA---
dlgnvGA-EGDYFGPAI--AEGEVArpl
Local Alignment Score: (14 × (+1)) + (4 × (-2)) + (2 × (-1)) = 4

Matrices and Gap Costs

Query Length    Substitution Matrix    Gap Costs
<35             PAM-30                 (9,1)
35-50           PAM-70                 (10,1)
50-85           BLOSUM-80              (10,1)
>85             BLOSUM-62              (10,1)

The raw score of an alignment is the sum of the scores for aligning pairs of residues and the scores for gaps. Gapped BLAST and PSI-BLAST use "affine gap costs" which charge the score -a for the existence of a gap, and the score -b for each residue in the gap. Thus a gap of k residues receives a total score of -(a+bk); specifically, a gap of length 1 receives the score -(a+b).

Your total raw score for the alignment is reduced when you introduce gaps into the query sequence.

Calculate the score in BLOSUM-62 for a gap with 7 residues…
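A minimal sketch of the affine gap cost -(a + bk), assuming the (10,1) gap costs listed for BLOSUM-62 in the table above. The function name `affine_gap_score` is hypothetical.

```python
def affine_gap_score(k: int, a: int = 10, b: int = 1) -> int:
    """Score of one gap of k residues: -a for opening it, -b per residue."""
    if k < 1:
        raise ValueError("a gap must contain at least one residue")
    return -(a + b * k)


print(affine_gap_score(1))  # gap of length 1: -(a + b)
print(affine_gap_score(7))  # gap of 7 residues
```

With these costs, a single gap of 7 residues costs far less than seven separate gaps of length 1, which is the point of charging existence and extension separately.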

http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml#Matrix/

Global Sequence Alignments

• Global Alignment: Entire sequence of each protein or DNA.

• Needleman and Wunsch (1970)

• Reduces problem to series of smaller alignments on a residue-by-residue basis.

• How this approach works

1. Setting up a matrix
2. Score the matrix
3. ID the optimal alignment

Global Sequence Alignments: Setting up a Matrix

• Create 2D Matrix of 2 sequences to align (Seq 1 along the rows, Seq 2 along the columns)

[Figure: four example matrices]

– Perfect Alignment: Seq 2 = D P L E, Seq 1 = D P L E

– Mismatch Alignment (lower score): Seq 2 = D P M E, Seq 1 = D P L E

– Deletion, Seq 2: Seq 2 = D P E, Seq 1 = D P L E

– Insertion, Seq 2: Seq 2 = D P L E, Seq 1 = D L E

Global Sequence Alignments: Setting up a Matrix

• In simple identity matrix, matches scored as (+1), everything else is (0)

• Here you can see how the BLOSUM62 Scoring Matrix is applied in place of the simple matrix

Simple Identity Matrix (rows: Seq 1, columns: Seq 2; "." marks a score of 0)

     F   M   D   T   P   L   N   E
F    1   .   .   .   .   .   .   .
K    .   .   .   .   .   .   .   .
H    .   .   .   .   .   .   .   .
M    .   1   .   .   .   .   .   .
E    .   .   .   .   .   .   .   1
D    .   .   1   .   .   .   .   .
P    .   .   .   .   1   .   .   .
L    .   .   .   .   .   1   .   .
E    .   .   .   .   .   .   .   1

BLOSUM62 Scoring Matrix (rows: Seq 1, columns: Seq 2)

     F   M   D   T   P   L   N   E
F    6   0  -3  -2  -4   0  -3  -3
K   -3  -1  -1  -1  -1  -2   0   1
H   -1  -2  -1  -2  -2  -3   1   0
M    0   5  -3  -1  -2   2  -2  -2
E   -3  -2   2  -1  -1  -3   0   5
D   -3  -3   6  -1  -1  -4   1   2
P   -4  -2  -1  -1   7  -3  -2  -1
L    0   2  -4  -1  -3   4  -3  -3
E   -3  -2   2  -1  -1  -3   0   5

Global Seq. Alignments: Identity to Scoring Matrix

• We need to find a way to convert the identity matrix into a meaningful scoring system (match, mismatch, gap in 1 or 2)

Simple Identity Matrix (rows: Seq 1, columns: Seq 2; "." marks a score of 0)

     F   M   D   T   P   L   N   E
F    1   .   .   .   .   .   .   .
K    .   .   .   .   .   .   .   .
H    .   .   .   .   .   .   .   .
M    .   1   .   .   .   .   .   .
E    .   .   .   .   .   .   .   1
D    .   .   1   .   .   .   .   .
P    .   .   .   .   1   .   .   .
L    .   .   .   .   .   1   .   .
E    .   .   .   .   .   .   .   1

Needleman-Wunsch-Sellers Scoring Matrix (rows: Seq 1 (i), columns: Seq 2 (j)), with the first row and first column initialized:

          F    M    D    T    P    L    N    E
     0   -2   -4   -6   -8  -10  -12  -14  -16
F   -2
K   -4
H   -6
M   -8
E  -10
D  -12
P  -14
L  -16
E  -18

Global Seq. Alignments: Identity to Scoring Matrix

• Gap penalty values, matches, coordinate system

• In the figure, the first row and the first column hold gap penalties; the interior cells hold matches.

Coordinate system (rows: Seq 1 (i), columns: Seq 2 (j)):

                  F             M             D
    (i-1, j-1)    (i-1, j)      (i-1, j+1)    (i-1, j+2)
F   (i, j-1)      (i, j)        (i, j+1)      (i, j+2)
K   (i+1, j-1)    (i+1, j)      (i+1, j+1)    (i+1, j+2)
H   (i+2, j-1)    (i+2, j)      (i+2, j+1)    (i+2, j+2)

Match = +1
Else = -2

Global Seq. Alignments: Scoring Matrix Calculations

• Calculate M_{i,j} = MAXIMUM[ M_{i-1,j-1} + S_{i,j} (match/mismatch in the diagonal), M_{i,j-1} + w (gap in sequence #1), M_{i-1,j} + w (gap in sequence #2) ]

– Note that in the example, M_{i-1,j-1} will be red, M_{i,j-1} will be blue and M_{i-1,j} will be green.

– Using this information, the score at position 1,1 (i, j) in the matrix can be calculated. Since the first residue in both sequences is an F, S_{1,1} = +1, and by the assumptions stated earlier, w = -2. Thus, M_{i,j} = MAX[M_{i-1,j-1} + 1, M_{i,j-1} - 2, M_{i-1,j} - 2] = MAX[0 + 1, -2 - 2, -2 - 2] = MAX[+1, -4, -4] = +1.

– MAX function means we retain the highest (MAX) score of all possible scores.

Matrix after this step (rows: Seq 1 (i), columns: Seq 2 (j)):

          F    M    D
     0   -2   -4   -6
F   -2   +1
K   -4
H   -6

• Calculate M_{i,j+1} = MAXIMUM[ M_{i-1,j} + S_{i,j+1} (match/mismatch in the diagonal), M_{i,j} + w (gap in sequence #1), M_{i-1,j+1} + w (gap in sequence #2) ]

• The score at position 1,2 (i, j+1) in the matrix can be calculated. Since the residues (F and M) are mismatched, S_{i,j+1} = -2, and by the assumptions stated earlier, w = -2. Thus, M_{i,j+1} = MAX[M_{i-1,j} - 2, M_{i,j} - 2, M_{i-1,j+1} - 2] = MAX[-2 - 2, +1 - 2, -4 - 2] = MAX[-4, -1, -6] = -1.

Matrix after this step (rows: Seq 1 (i), columns: Seq 2 (j)):

          F    M    D
     0   -2   -4   -6
F   -2   +1   -1
K   -4
H   -6


• Calculate M_{i+1,j} = MAXIMUM[ M_{i,j-1} + S_{i+1,j} (match/mismatch in the diagonal), M_{i+1,j-1} + w (gap in sequence #1), M_{i,j} + w (gap in sequence #2) ]

• The score at position 2,1 (i+1, j) in the matrix can be calculated. Since the residues (K and F) are mismatched, S_{i+1,j} = -2, and by the assumptions stated earlier, w = -2. Thus, M_{i+1,j} = MAX[M_{i,j-1} - 2, M_{i+1,j-1} - 2, M_{i,j} - 2] = MAX[-2 - 2, -4 - 2, +1 - 2] = MAX[-4, -6, -1] = -1.

Matrix after this step (rows: Seq 1 (i), columns: Seq 2 (j)):

          F    M    D
     0   -2   -4   -6
F   -2   +1
K   -4   -1
H   -6
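The three cell calculations above all apply the same rule, which can be sketched as a small Python helper. The name `cell_score` and its argument order are hypothetical; the scores (match = +1, everything else = -2) are the ones assumed on these slides.

```python
def cell_score(diag: int, left: int, up: int, res1: str, res2: str,
               match: int = 1, mismatch: int = -2, gap: int = -2) -> int:
    """Score one matrix cell from its diagonal, left, and upper neighbours."""
    s = match if res1 == res2 else mismatch
    return max(diag + s,    # align res1 with res2
               left + gap,  # gap in sequence #1
               up + gap)    # gap in sequence #2


# Position (1,1): F vs F, neighbours 0 (diagonal), -2 (left), -2 (up).
print(cell_score(0, -2, -2, "F", "F"))
# Position (1,2): F vs M, neighbours -2 (diagonal), +1 (left), -4 (up).
print(cell_score(-2, 1, -4, "F", "M"))
# Position (2,1): K vs F, neighbours -2 (diagonal), -4 (left), +1 (up).
print(cell_score(-2, -4, 1, "K", "F"))
```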

Scored Matrix

Rows: Seq 1 (i). Columns: Seq 2 (j).

          F    M    D    T    P    L    N    E
     0   -2   -4   -6   -8  -10  -12  -14  -16
F   -2   +1   -1   -3   -5   -7   -9  -11  -13
K   -4   -1   -1   -3   -5   -7   -9  -11  -13
H   -6   -3   -3   -3   -5   -7   -9  -11  -13
M   -8   -5   -2   -4   -5   -7   -9  -11  -13
E  -10   -7   -4   -4   -6   -7   -9  -11  -10
D  -12   -9   -6   -3   -5   -7   -9  -11  -12
P  -14  -11   -8   -5   -5   -4   -6   -8  -10
L  -16  -13  -10   -7   -7   -6   -3   -5   -7
E  -18  -15  -12   -9   -9   -8   -5   -5   -4

Red arrows (in the original figure) indicate pathways to calculated Max values.

The bottom-right cell (-4) is the overall score of the optimal alignment.

Optimal Alignment: Trace-back Procedure

F M D T P L N E

0 -2 -4 -6 -8 -10 -12 -14 -16

F -2 +1 -1 -3 -5 -7 -9 -11 -13

K -4 -1 -1 -3 -5 -7 -9 -11 -13

H -6 -3 -3 -3 -5 -7 -9 -11 -13

M -8 -5 -2 -4 -5 -7 -9 -11 -13

E -10 -7 -4 -4 -6 -7 -9 -11 -10

D -12 -9 -6 -3 -5 -7 -9 -11 -12

P -14 -11 -8 -5 -5 -4 -6 -8 -10

L -16 -13 -10 -7 -7 -6 -3 -5 -7

E -18 -15 -12 -9 -9 -8 -5 -5 -4

Trace-back arrows can only follow pathways identified when calculating Max values.

Start at the bottom-right cell (-4). Rows: Seq 1 (i). Columns: Seq 2 (j).

Completed Global Pairwise Alignment

F M D T P L N E

0 -2 -4 -6 -8 -10 -12 -14 -16

F -2 +1 -1 -3 -5 -7 -9 -11 -13

K -4 -1 -1 -3 -5 -7 -9 -11 -13

H -6 -3 -3 -3 -5 -7 -9 -11 -13

M -8 -5 -2 -4 -5 -7 -9 -11 -13

E -10 -7 -4 -4 -6 -7 -9 -11 -10

D -12 -9 -6 -3 -5 -7 -9 -11 -12

P -14 -11 -8 -5 -5 -4 -6 -8 -10

L -16 -13 -10 -7 -7 -6 -3 -5 -7

E -18 -15 -12 -9 -9 -8 -5 -5 -4

Seq 1 (i): F K H M E D - P L - E
Seq 2 (j): F - - M - D T P L N E

Note that the final pairwise alignment score (-4) is equal to the value calculated based on total numbers of matches, mismatches, insertions and deletions.

Global Alignment Score: (6 × (+1)) + (5 × (-2)) = -4
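The three steps (set up the matrix, score it, trace back) can be sketched end to end in Python. This is a minimal illustration under the scoring assumed on these slides (match = +1, mismatch = -2, linear gap = -2), not a production aligner; the function name `needleman_wunsch` and the tie-breaking order in the trace-back (diagonal, then up, then left) are choices made for this sketch, so a different but equally scoring alignment than the one on the slide may be returned.

```python
def needleman_wunsch(seq1: str, seq2: str,
                     match: int = 1, mismatch: int = -2, gap: int = -2):
    """Global alignment: returns (score matrix, aligned seq1, aligned seq2)."""
    rows, cols = len(seq1) + 1, len(seq2) + 1
    M = [[0] * cols for _ in range(rows)]
    # Step 1: first column and first row hold cumulative gap penalties.
    for i in range(1, rows):
        M[i][0] = i * gap
    for j in range(1, cols):
        M[0][j] = j * gap
    # Step 2: fill each cell with the maximum of its three possible sources.
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1] + s, M[i][j - 1] + gap, M[i - 1][j] + gap)
    # Step 3: trace back from the bottom-right cell along valid pathways only.
    top, bottom = [], []
    i, j = len(seq1), len(seq2)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            if M[i][j] == M[i - 1][j - 1] + s:
                top.append(seq1[i - 1]); bottom.append(seq2[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and M[i][j] == M[i - 1][j] + gap:
            top.append(seq1[i - 1]); bottom.append("-")   # gap in sequence #2
            i -= 1
        else:
            top.append("-"); bottom.append(seq2[j - 1])   # gap in sequence #1
            j -= 1
    return M, "".join(reversed(top)), "".join(reversed(bottom))


M, aligned1, aligned2 = needleman_wunsch("FKHMEDPLE", "FMDTPLNE")
print(M[-1][-1])  # overall score of the optimal alignment
print(aligned1)
print(aligned2)
```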

Local Sequence Alignment

• Local Alignment: Longest matching regions (subsets) between 2 sequences.

• Smith and Waterman Algorithm (1981)

• Scoring is similar to global alignment

1. Set up a matrix

2. Score the matrix

– No negative values allowed: If negative values are the only choices, then answer defaults to zero (0).

– Mismatches and gaps at ends score 0.

3. ID the optimal alignment

• More sensitive but much slower than heuristic methods (FASTA, BLAST)

Smith and Waterman Local Sequence Alignment

Rows: Seq 2. Columns: Seq 1.

        G   A   A   G   A
    0   0   0   0   0   0
G   0   1   0   0   1   0
T   0   0   0   0   0   0
T   0   0   0   0   0   0
T   0   0   0   0   0   0
A   0   0   1   0   0   0
A   0   0   1   2   0   0
G   0   1   0   0   3   0

• Can use any scoring matrix you want (ex. Substitute BLOSUM62)

• No negative values allowed: Default is 0

• Alignment can start anywhere in sequence: not restricted to ends and no penalties at ends

• Trace-back starts with the highest number, works backwards the same as with global alignment

Seq 1: G A A G A
Seq 2: G T T T A A G
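A minimal Python sketch of the local alignment procedure for the two sequences above. The slide does not state the mismatch and gap scores it used, so the values here (match = +1, mismatch = -1, gap = -2) are assumptions, and the function name `smith_waterman` is hypothetical. With these assumed scores the filled matrix may contain a few non-zero cells that the slide leaves at 0, but the highest score and the trace-back path are the same.

```python
def smith_waterman(seq1: str, seq2: str,
                   match: int = 1, mismatch: int = -1, gap: int = -2):
    """Local alignment: returns (best score, aligned part of seq1, of seq2)."""
    rows, cols = len(seq2) + 1, len(seq1) + 1   # rows: seq2, columns: seq1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if seq2[i - 1] == seq1[j - 1] else mismatch
            # No negative values allowed: the score defaults to 0.
            H[i][j] = max(0, H[i - 1][j - 1] + s, H[i][j - 1] + gap, H[i - 1][j] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the highest score until a 0 is reached.
    part1, part2 = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if seq2[i - 1] == seq1[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            part1.append(seq1[j - 1]); part2.append(seq2[i - 1])
            i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            part1.append("-"); part2.append(seq2[i - 1])
            i -= 1
        else:
            part1.append(seq1[j - 1]); part2.append("-")
            j -= 1
    return best, "".join(reversed(part1)), "".join(reversed(part2))


best, local1, local2 = smith_waterman("GAAGA", "GTTTAAG")
print(best, local1, local2)
```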

Heuristic (word or k-tuple based) algorithms

• Uses initial query to make reasonable guesses about sequence alignments, then evaluates those considered “most likely”

• Alignment then extended until:

– One of the sequences ends

– Score falls below some threshold

• In BLAST, search depends on word size

KENFDKARFSGTWYAMAKKDPEG 50 RBP (query)
MKGLDIQKVAGTWYSLAMAASD. 44 lactoglobulin (hit)

Hit! The match is then extended in both directions.

FASTA (Pearson and Lippman 1988)

• Combines Smith and Waterman algorithm with a word (k-tup) search: a faster, heuristic approach

• Query sequence divided into small words (usually k=2 for proteins)

– Words used to initially compare and match sequences

– If words located on same diagonal, surrounding region is then selected for analysis

Seq 1 FYGKLHMEGD
Seq 2 FWGKLHMEGSNE

Seq 1 Search words (k-tup = 2): FY YG GK KL LH HM ME EG GD
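The word-splitting and diagonal idea can be sketched in Python as follows. This only illustrates the first stage (finding shared k-words and the diagonals they fall on); it is not the FASTA program itself, and the function names `k_words` and `diagonal_hits` are hypothetical. A diagonal is identified here by the offset between the word's position in Seq 2 and its position in Seq 1.

```python
from collections import defaultdict


def k_words(seq: str, k: int = 2) -> list[str]:
    """All overlapping words of length k, in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def diagonal_hits(seq1: str, seq2: str, k: int = 2) -> dict[int, int]:
    """Count shared k-words per diagonal (offset = position in seq2 - position in seq1)."""
    positions = defaultdict(list)
    for j, word in enumerate(k_words(seq2, k)):
        positions[word].append(j)
    hits = defaultdict(int)
    for i, word in enumerate(k_words(seq1, k)):
        for j in positions.get(word, []):
            hits[j - i] += 1
    return dict(hits)


seq1 = "FYGKLHMEGD"
seq2 = "FWGKLHMEGSNE"
print(" ".join(k_words(seq1)))
print(diagonal_hits(seq1, seq2))
```

For these two sequences, all shared words (GK, KL, LH, HM, ME, EG) fall on the same diagonal, which is the kind of dense region selected for further analysis.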

http://www.incogen.com/bioinfo_tutorials/Bioinfo-Lecture_2-pairwise-align.html

FASTA (Pearson and Lippman 1988)

(a) Identify common k-words between sequences A and B

(b) Score diagonals with k-word matches, identify 10 best diagonals (dense regions of k-word overlap). Rescore initial regions with a substitution score matrix

(c) Join initial regions using gaps, penalize for gaps

(d) Perform dynamic programming to find final alignments

http://www.incogen.com/bioinfo_tutorials/Bioinfo-Lecture_2-pairwise-align.html

Statistical Significance of Pairwise Alignments

• Is an alignment statistically significant, or are the similarities due to chance?

• How do we define significant? Statistics.

• Start with Null Hypothesis (H0) that 2 sequences are not related.

• Suggest an alternative hypothesis (H1) that 2 sequences are related.

• Select an arbitrary value defining statistical significance (α = 0.05): This is the threshold for rejecting the Null hypothesis (i.e., H0 is rejected when there is less than 5% probability that a match this good occurs as a result of chance).

Statistical Mean and Standard Deviation

Mean (x-bar) is the sum of a set of numbers (x1 + x2 + … + xn), divided by the total instances in the set (n):

\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i

Standard Deviation (s) is the square root of the sum of the squared differences between each value (xi) and the sample mean (x-bar), divided by (n − 1), where n is the total instances in the set:

s = \sqrt{ \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 }

        1        2        3     mean       sd
A    2.00     3.00     4.00     3.00     1.00
B    4.00     4.00     8.00     5.33     2.31
C    6.00     2.00    12.00     6.67     5.03
D    8.00     8.00     8.00     8.00     0.00
E   10.00     6.00    13.00     9.67     3.51

[Bar chart: samples A-E on the x-axis; y-axis from 0.00 to 14.00]
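The mean and sd columns of the table can be reproduced with Python's standard `statistics` module. `statistics.stdev` computes the sample standard deviation (dividing by n − 1), which is what the tabulated values correspond to; the dictionary name `samples` is simply a label chosen for this sketch.

```python
from statistics import mean, stdev

# The three measurements for each sample in the table.
samples = {
    "A": [2.0, 3.0, 4.0],
    "B": [4.0, 4.0, 8.0],
    "C": [6.0, 2.0, 12.0],
    "D": [8.0, 8.0, 8.0],
    "E": [10.0, 6.0, 13.0],
}

for name, values in samples.items():
    # stdev() is the sample standard deviation (n - 1 in the denominator).
    print(f"{name}  mean={mean(values):.2f}  sd={stdev(values):.2f}")
```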

Statistical Measures of Algorithms

• Objective of alignment algorithms is to maximize sensitivity and specificity of alignments.

• Sensitivity: Measure of how well algorithm correctly predicts sequences that are related.

• Specificity: Measure of how well algorithm correctly predicts sequences that are unrelated.

Statistical Comparison of 2 Sequences

• Compare a large number of “random” sequences

– Many different proteins

– Randomly generated sequences

– Scrambled variations of 1 of your 2 sequences

• Calculate a Z score from the difference between the score of your aligned sequences (x) and the mean of the random sequences (μ), divided by the standard deviation of the random sequences (σ):

Z = (x − μ) / σ
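A minimal sketch of the Z score calculation in Python. The random-sequence scores below are made-up illustrative numbers, not results from any real alignment, and the function name `z_score` is hypothetical. The sketch assumes the random scores are treated as the whole population (hence `pstdev`); using the sample standard deviation instead would be an equally defensible choice.

```python
from statistics import mean, pstdev


def z_score(x: float, random_scores: list[float]) -> float:
    """Distance of score x from the mean of random scores, in standard deviations."""
    mu = mean(random_scores)
    sigma = pstdev(random_scores)   # population SD of the random scores
    if sigma == 0:
        raise ValueError("random scores have zero spread")
    return (x - mu) / sigma


# Hypothetical alignment scores obtained from scrambled sequences.
random_scores = [2, 4, 4, 4, 5, 5, 7, 9]
print(z_score(10, random_scores))
```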

Convert Z Score to Probability of Chance Alignment

• Z score represents the distance, in standard deviations, between the sequence alignment score and the population mean estimated from random sequences

• The Z score can be converted to probability.

• Example: For Z = 2.0, and assuming normally distributed scores, about 97.7% of all values fall below a score 2.0 standard deviations above the mean, therefore your sequence score could occur by chance only about 2.3% of the time (below the α = 0.05 threshold).
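The conversion from Z score to probability can be sketched with the complementary error function from Python's `math` module. This assumes the random scores follow a normal distribution and uses a one-tailed test (the chance of a score at least this high); the function name `chance_probability` is hypothetical.

```python
import math


def chance_probability(z: float) -> float:
    """One-tailed probability of a normal value at least z SDs above the mean."""
    return 0.5 * math.erfc(z / math.sqrt(2))


p = chance_probability(2.0)
print(f"P(chance) = {p:.4f}")          # upper-tail probability for Z = 2.0
print("significant" if p < 0.05 else "not significant")
```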
