Scoring Matrices
COMPUTATIONAL BIOLOGY
B.Tech – BioTech (VIth Semester)
Module 2
Scoring Matrices
INTRODUCTION
• It is assumed that the sequences being sought share a common evolutionary ancestor with the query sequence.
• The best guess at the actual path of evolution is the path that requires the fewest evolutionary events.
• Not all substitutions are equally likely, so they should be weighted accordingly.
• Insertions and deletions are less likely than substitutions and should also be weighted to account for this.
INTRODUCTION
• A substitution is more likely to occur between amino acids with similar biochemical properties.
• For example, the hydrophobic amino acids isoleucine (I) and valine (V) get a positive score in these matrices, reflecting the likelihood that one will substitute for the other.
• By contrast, isoleucine gets a negative score with the polar amino acid cysteine (C), because this substitution is far less likely to occur in a protein.
• Thus, matrices are used to estimate how well two residues of given types would match if they were aligned in a sequence alignment.
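The isoleucine/valine/cysteine example can be made concrete with a small lookup table. The sketch below assumes the BLOSUM62 values for these three residues (the slides do not say which matrix they have in mind); the names `BLOSUM62_EXCERPT`, `pair_score` and `ungapped_score` are hypothetical helpers introduced only for illustration.

```python
# Excerpt of BLOSUM62 restricted to I, V and C (assumed here for illustration).
# Keys are stored in sorted order because the matrix is symmetric.
BLOSUM62_EXCERPT = {
    ("I", "I"): 4, ("V", "V"): 4, ("C", "C"): 9,
    ("I", "V"): 3, ("C", "I"): -1, ("C", "V"): -1,
}


def pair_score(a, b, matrix=BLOSUM62_EXCERPT):
    """Return the substitution score for aligning residue a with residue b."""
    return matrix[tuple(sorted((a, b)))]


def ungapped_score(seq1, seq2, matrix=BLOSUM62_EXCERPT):
    """Sum the pair scores of two equal-length sequences aligned without gaps."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(pair_score(a, b, matrix) for a, b in zip(seq1, seq2))


print(pair_score("I", "V"))          # similar residues: positive score
print(pair_score("I", "C"))          # dissimilar residues: negative score
print(ungapped_score("IVC", "VIC"))  # total score of a short ungapped alignment
```

The alignment score is simply the sum of the per-column scores, which is why the choice of matrix directly shapes which alignments look good.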
IMPORTANCE OF SCORING MATRICES
• Scoring matrices appear in all analyses involving sequence comparison.
• The choice of matrix can strongly influence the outcome of the analysis.
• Scoring matrices implicitly represent a particular theory of evolution.
• Understanding the theory underlying a given scoring matrix helps in choosing the right one.
TYPES OF SCORING MATRICES
• An amino-acid scoring matrix is a 20x20 table indexed by amino acids, so that position X,Y in the table gives the score of aligning amino acid X with amino acid Y.
• Identity matrix – exact matches receive one score and mismatches a different score (1 on the diagonal, 0 everywhere else).
• Mutation data matrix – a scoring matrix compiled from observed protein mutation rates: some mutations are observed more often than others (PAM, BLOSUM).
• Physical properties matrix – amino acids with similar biophysical properties receive a high score.
• Genetic code matrix – amino acids are scored based on similarities in their coding triplets (codons).
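As a minimal sketch of the simplest type above, the identity matrix can be built as a 20x20 lookup keyed by residue pairs. The function name `identity_matrix` and its `match`/`mismatch` parameters are illustrative assumptions, not part of any library.

```python
# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def identity_matrix(alphabet=AMINO_ACIDS, match=1, mismatch=0):
    """Build a scoring table with `match` on the diagonal and `mismatch` elsewhere."""
    return {
        (x, y): (match if x == y else mismatch)
        for x in alphabet
        for y in alphabet
    }


matrix = identity_matrix()
print(len(matrix))         # number of (X, Y) entries in the 20x20 table
print(matrix[("I", "I")])  # score for an exact match
print(matrix[("I", "V")])  # score for a mismatch
```

The other matrix types keep the same 20x20 shape and only change how the entries are filled in.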
Matrices used
PSSM = Position-Specific Scoring Matrix
PAM matrices
BLOSUM (BLOcks SUbstitution Matrix)
• Publication – Henikoff and Henikoff, 1992
• Motivation – PAM matrices do not capture the difference between short-term and long-term mutations
• Method
– For several degrees of sequence divergence, derive mutations from a set of related proteins
– BLOSUM-k is based on related proteins with k% identity or less
BLOSUM METHOD
• Use Blocks – collections of multiple alignments of similar segments without gaps
• Cluster sequences together whenever they share more than k% identical residues
• Count the number of substitutions across different clusters (in the same family)
• Estimate frequencies using the counts
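The counting and frequency-estimation steps can be sketched as follows. This is a simplified illustration under stated assumptions: the clustering step is skipped (every sequence is treated as its own cluster), the toy block is hypothetical, and the final log-odds score in half-bit units, 2·log2(observed/expected), follows the formulation published by Henikoff and Henikoff rather than anything stated on the slide. All function names are made up for this example.

```python
from collections import Counter
from itertools import combinations
from math import log2


def pair_counts(block):
    """Count unordered residue pairs in every column of an ungapped block."""
    counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            counts[tuple(sorted((a, b)))] += 1
    return counts


def pair_frequencies(counts):
    """Turn pair counts into observed pair frequencies q_ij."""
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


def residue_frequencies(q):
    """Derive single-residue frequencies p_i from the pair frequencies."""
    p = Counter()
    for (a, b), f in q.items():
        if a == b:
            p[a] += f
        else:
            p[a] += f / 2
            p[b] += f / 2
    return dict(p)


def log_odds(q, p):
    """Score each pair as 2 * log2(observed / expected), i.e. in half-bits."""
    scores = {}
    for (a, b), f in q.items():
        expected = p[a] * p[b] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = 2 * log2(f / expected)
    return scores


# Hypothetical one-column block: nine sequences with A, one with S.
block = ["A"] * 9 + ["S"]
counts = pair_counts(block)
q = pair_frequencies(counts)
p = residue_frequencies(q)
scores = log_odds(q, p)
print(dict(counts))
print(scores)
```

In a full implementation the scores would be rounded to integers, and pairs would only be counted between sequences from different clusters.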
BLOCKS
Each BLOCK represents a conserved region in a group of proteins
position     1   5      n
sequence 1   ABPEDG… …FGW
sequence 2   ABSEDQ… …QGW
sequence 3   SBPEDQ… …FGD
             :   :      :
             :   :      :
sequence m   ABAEDS… …QGD
BLOSUM = BLOCKS SUBSTITUTION MATRIX
The relationship between BLOSUM and PAM substitution matrices
• BLOSUM matrices with higher numbers and PAM matrices with low numbers are both designed for comparisons of closely related sequences.
• BLOSUM matrices with low numbers and PAM matrices with high numbers are designed for comparisons of distantly related proteins.
Position-Specific Scoring Matrix
• A weight matrix or position-specific scoring matrix (PSSM) is a table of numbers containing scores for each residue at each position of a fixed-length (gap-free) motif.
• There are two types of numerical representations:
– Frequency matrix: reflects position-dependent frequencies of residues
– Scoring matrix: contains additive weights for computing a match score
• Weight matrices or PSSMs are quantitative, fixed-length motif descriptors. Unlike regular expressions, they can distinguish between mild and severe mismatches.
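One common way to get from a frequency matrix to additive weights is a log-odds transform. The sketch below is an assumption-laden illustration, not the method prescribed by the slides: it assumes a uniform background frequency of 0.25 for DNA, adds a small pseudocount so that zero frequencies do not produce log(0), and uses a hypothetical two-position motif. The names `frequency_to_scores` and `match_score` are invented for this example.

```python
from math import log2


def frequency_to_scores(freq_matrix, background=0.25, pseudocount=0.01):
    """Convert per-position frequencies into additive log-odds weights.

    freq_matrix is a list with one {residue: frequency} dict per motif position.
    The background frequency and pseudocount are illustrative assumptions.
    """
    scores = []
    for column in freq_matrix:
        total = sum(column.values()) + pseudocount * len(column)
        scores.append({
            residue: log2(((f + pseudocount) / total) / background)
            for residue, f in column.items()
        })
    return scores


def match_score(sequence, scores):
    """Add up the weight of each residue at its position in the motif."""
    return sum(scores[i][residue] for i, residue in enumerate(sequence))


# Hypothetical motif: position 1 prefers A, position 2 is always G.
freqs = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.0},
]
weights = frequency_to_scores(freqs)

print(match_score("AG", weights))  # consensus match
print(match_score("CG", weights))  # mild mismatch at the variable position
print(match_score("AA", weights))  # severe mismatch at the invariant position
```

A regular expression would reject both "CG" and "AA" alike, whereas the additive weights rank the mismatch at the invariant position as much worse.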
Position-Specific Scoring Matrix
• A PSSM is a motif descriptor
• The descriptor includes a weight (score, probability) for each symbol occurring at each position along the motif
• Examples of motifs:
– Protein active sites
– Structural elements
– Zinc fingers
– Intron/exon boundaries
– Transcription-factor binding sites, etc.
Position-Specific Scoring Matrix
Construction of a PSSM is a multi-stage process:
1. Define the architecture of the matrix
2. Create the multiple alignment from which the matrix is derived
3. Calculate frequencies for each position
4. Apply BLAST to the PSSM
Position-Specific Scoring Matrix
• 10 vertebrate donor site sequences aligned at the exon/intron boundary
seq 1 GAGGTAAAC
seq 2 TCCGTAAGT
seq 3 CAGGTTGGA
seq 4 ACAGTCAGT
seq 5 TAGGTCATT
seq 6 TAGGTACTG
seq 7 ATGGTAACT
seq 8 CAGGTATAC
seq 9 TGTGTGAGT
seq 10 AAGGTAAGT
Position-Specific Scoring Matrix
• Calculate the absolute frequency of each nucleotide at each position
Position   1   2   3   4   5   6   7   8   9
A          3   6   1   0   0   6   7   2   1
C          2   2   1   0   0   2   1   1   2
G          1   1   7  10   0   1   1   5   1
T          4   1   1   0  10   1   1   2   6
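The count table can be reproduced from the ten donor-site sequences with a few lines of code. This is a minimal sketch; `count_matrix` is a hypothetical helper name, and the sequences are copied from the alignment given earlier in the transcript.

```python
# The ten aligned vertebrate donor-site sequences from the slides.
SEQUENCES = [
    "GAGGTAAAC", "TCCGTAAGT", "CAGGTTGGA", "ACAGTCAGT", "TAGGTCATT",
    "TAGGTACTG", "ATGGTAACT", "CAGGTATAC", "TGTGTGAGT", "AAGGTAAGT",
]


def count_matrix(sequences, alphabet="ACGT"):
    """Count how often each base occurs at each position of the alignment."""
    length = len(sequences[0])
    counts = {base: [0] * length for base in alphabet}
    for seq in sequences:
        for pos, base in enumerate(seq):
            counts[base][pos] += 1
    return counts


counts = count_matrix(SEQUENCES)
for base in "ACGT":
    print(base, *counts[base])
```

Each column of the result sums to 10, the number of aligned sequences.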
Position-Specific Scoring Matrix
• Calculate the relative frequency of each nucleotide at each position
Position   1    2    3    4    5    6    7    8    9
A          0.3  0.6  0.1  0    0    0.6  0.7  0.2  0.1
C          0.2  0.2  0.1  0    0    0.2  0.1  0.1  0.2
G          0.1  0.1  0.7  1    0    0.1  0.1  0.5  0.1
T          0.4  0.1  0.1  0    1    0.1  0.1  0.2  0.6
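Relative frequencies are the absolute counts divided by the number of sequences in each column. The sketch below starts from the count table of the ten donor-site sequences; `relative_frequencies` is a hypothetical helper name.

```python
# Absolute counts per position for the ten donor-site sequences.
COUNTS = {
    "A": [3, 6, 1, 0, 0, 6, 7, 2, 1],
    "C": [2, 2, 1, 0, 0, 2, 1, 1, 2],
    "G": [1, 1, 7, 10, 0, 1, 1, 5, 1],
    "T": [4, 1, 1, 0, 10, 1, 1, 2, 6],
}


def relative_frequencies(counts):
    """Divide each count by the column total to obtain per-position frequencies."""
    length = len(next(iter(counts.values())))
    totals = [sum(counts[base][i] for base in counts) for i in range(length)]
    return {
        base: [counts[base][i] / totals[i] for i in range(length)]
        for base in counts
    }


freqs = relative_frequencies(COUNTS)
for base in "ACGT":
    print(base, *freqs[base])
```

Dividing by the column total rather than a fixed 10 keeps the function usable for alignments of any depth.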
Position-Specific Scoring Matrix
• What is the probability of finding CAGGTTGGA?
– The product of the frequency of each nucleotide at each position:
– C is 0.2 at position 1, A is 0.6 at position 2, etc.
• 0.2 * 0.6 * 0.7 * 1 * 1 * 0.1 * 0.1 * 0.5 * 0.1 = 0.000042
Position   1    2    3    4    5    6    7    8    9
A          0.3  0.6  0.1  0    0    0.6  0.7  0.2  0.1
C          0.2  0.2  0.1  0    0    0.2  0.1  0.1  0.2
G          0.1  0.1  0.7  1    0    0.1  0.1  0.5  0.1
T          0.4  0.1  0.1  0    1    0.1  0.1  0.2  0.6
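The product can be checked with a short script. Like the slide, this sketch treats positions as independent and multiplies the per-position frequencies; `sequence_probability` is a hypothetical helper name, and `math.prod` requires Python 3.8 or later.

```python
from math import prod

# Relative frequency of each base at each of the nine motif positions.
FREQS = {
    "A": [0.3, 0.6, 0.1, 0.0, 0.0, 0.6, 0.7, 0.2, 0.1],
    "C": [0.2, 0.2, 0.1, 0.0, 0.0, 0.2, 0.1, 0.1, 0.2],
    "G": [0.1, 0.1, 0.7, 1.0, 0.0, 0.1, 0.1, 0.5, 0.1],
    "T": [0.4, 0.1, 0.1, 0.0, 1.0, 0.1, 0.1, 0.2, 0.6],
}


def sequence_probability(seq, freqs):
    """Multiply the frequency of each base at its position in the motif."""
    return prod(freqs[base][pos] for pos, base in enumerate(seq))


p = sequence_probability("CAGGTTGGA", FREQS)
print(p)  # approximately 4.2e-05

# Any sequence with a base never observed at some position gets probability 0.
print(sequence_probability("CAGATTGGA", FREQS))
```

The second call shows a practical limitation of raw frequencies: a single unseen base at an invariant position zeroes the whole product, which is one reason pseudocounts are often added before scoring.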