cap5510 – bioinformatics substitution patterns
DESCRIPTION
CAP5510 – Bioinformatics Substitution Patterns. Tamer Kahveci CISE Department University of Florida. Goals. Understand how mutations occur Learn models for predicting the number of mutations Understand why scoring matrices are used and how they are derived Learn major scoring matrices. - PowerPoint PPT PresentationTRANSCRIPT
![Page 1: CAP5510 – Bioinformatics Substitution Patterns](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681596c550346895dc6ad4f/html5/thumbnails/1.jpg)
1
CAP5510 – BioinformaticsSubstitution Patterns
Tamer Kahveci
CISE Department
University of Florida
![Page 2: CAP5510 – Bioinformatics Substitution Patterns](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681596c550346895dc6ad4f/html5/thumbnails/2.jpg)
2
Goals
• Understand how mutations occur
• Learn models for predicting the number of mutations
• Understand why scoring matrices are used and how they are derived
• Learn major scoring matrices
![Page 3: CAP5510 – Bioinformatics Substitution Patterns](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681596c550346895dc6ad4f/html5/thumbnails/3.jpg)
3
Why Substitute Patterns ?
• Mutations happen because of mistakes in DNA replication and repair.
• Our genetic code changes due to mutations– Insert, delete, replace
• Three types of mutations– Advantageous– Disadvantageous– Neutral
• We only observe substitutions that passed selection process
![Page 4: CAP5510 – Bioinformatics Substitution Patterns](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681596c550346895dc6ad4f/html5/thumbnails/4.jpg)
4
Mutation Rates
Organism A Organism B
Parent Organism
K: number of substitutions
T time R = K/(2T)
![Page 5: CAP5510 – Bioinformatics Substitution Patterns](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681596c550346895dc6ad4f/html5/thumbnails/5.jpg)
5
Functional Constraints
• Functional sites are less likely to mutate– Noncoding = 3.33 (subs/109 yr)– Coding = 1.58 (subs/109 yr)
• Indels about 10 times less likely than substitutions
![Page 6: CAP5510 – Bioinformatics Substitution Patterns](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681596c550346895dc6ad4f/html5/thumbnails/6.jpg)
6
Nucleotide Substitutions and Amino Acids
• Synonymous substitutions do not change amino acids• Nonsynonymous do change• Degeneracy
– Fourfold degenerate: gly = {GGG, GGA, GGU, GGC}– Twofold degenerate: asp = {GAU, GAC}, glu = {GAA, GAG}– Non-degenerate: phe = UUU, leu = CUU, ile = AUU, val = GUU
• Example substitution rates in human and mouse– Fourfold degenerate: 2.35– Twofold degenerate: 1.67– Non-degenerate: 0.56
![Page 7: CAP5510 – Bioinformatics Substitution Patterns](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681596c550346895dc6ad4f/html5/thumbnails/7.jpg)
7
Predicting Substitutions
How can we count the true number of substitutions ?
![Page 8: CAP5510 – Bioinformatics Substitution Patterns](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681596c550346895dc6ad4f/html5/thumbnails/8.jpg)
8
Jukes-Cantor Model
• Each nucleotide can change into another one with the same probability
A C
G T
x
xx
P(A->A’, 1) = x, for each A’P(A->A, 1) = 1 – 3xCompute P(A->A’, 2) & P(A->A, 2) P(A->A, t+1) = 3 P(A->A’, t) P(A’->A, 1) + P(A->A, t) P(A->A, 1)
P(A->A, t) ~ ¼ + (3/4)e-4ft
K = num. subst. = -¾ ln(1 – f4/3), f = fraction of observed substitutions
Oversimplification
![Page 9: CAP5510 – Bioinformatics Substitution Patterns](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681596c550346895dc6ad4f/html5/thumbnails/9.jpg)
9
Two Parameter Model
• Transition: – purine->purine (A, G),
pyrimidine->pyrimidine (C, T)
• Transversion: – purine <-> pyrimidine
• Transitions are more likely than transversions.
• Use different probabilities for transitions and transversions.
Purine
Pyrimidine
![Page 10: CAP5510 – Bioinformatics Substitution Patterns](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681596c550346895dc6ad4f/html5/thumbnails/10.jpg)
10
Two Parameter Model
A C
G T
y
yx
•P(AA,1) = 1-x-2y•Compute P(AA,2)
P(AA,2) = (1-x-2y) P(AA,1) + x P(AG,1) + y P(AC,1) + y P(AT,1)
P(AA,t) = ¼ + ¼ e-4yt + ½ e-2(x+y)t
K = ½ ln(1/(1-2P-Q)) + ¼ ln(1/(1-2Q))
P,Q: fraction of transitions and transversions observed.
![Page 11: CAP5510 – Bioinformatics Substitution Patterns](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681596c550346895dc6ad4f/html5/thumbnails/11.jpg)
11
More Parameters ?
• Assign a different probability for each pair of nucleotides
• Not harder to compute than simpler models
• Not necessarily better than simpler models
![Page 12: CAP5510 – Bioinformatics Substitution Patterns](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681596c550346895dc6ad4f/html5/thumbnails/12.jpg)
12
Amino Acid substitutions (1)
• Harder to model than nucleotides– An amino acid can be substituted for another in more
than one ways– The number of nucleotide substitutions needed to
transform one amino acid to another may differ• Pro = CCC, leu = CUC, ile = AUC
– The likelihood of nucleotide substitutions may differ• Asp = GAU, asn = AAU, his = CAU
– Amino acid substitutions may have different effects on the protein function
![Page 13: CAP5510 – Bioinformatics Substitution Patterns](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681596c550346895dc6ad4f/html5/thumbnails/13.jpg)
13
Amino Acid substitutions (2)
• Mutation rates may vary greatly among genes– Nonsynonymous substitution may affect
functionality with smaller probability in some genes
• Molecular clock (Zuckerlandl, Paulding)– Mutation rates may be different for different
organisms, but it remains almost constant over the time.
![Page 14: CAP5510 – Bioinformatics Substitution Patterns](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681596c550346895dc6ad4f/html5/thumbnails/14.jpg)
14
Scoring Matrices
![Page 15: CAP5510 – Bioinformatics Substitution Patterns](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681596c550346895dc6ad4f/html5/thumbnails/15.jpg)
15
What is it & why ?
• Let alphabet contain N letters – N = 4 and 20 for nucleotides and amino acids
• N x N matrix• (i,j) shows the relationship between ith and jth
letters.– Positive number if letter i is likely to mutate into letter j– Negative otherwise– Magnitude shows the degree of proximity
• Symmetric
![Page 16: CAP5510 – Bioinformatics Substitution Patterns](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681596c550346895dc6ad4f/html5/thumbnails/16.jpg)
16
A R N D C Q E G H I L K M F P S T W Y VA 5 -2 -1 -2 -1 -1 -1 0 -2 -1 -1 -1 -1 -2 -1 1 0 -2 -2 0 R -2 7 0 -1 -3 1 0 -2 0 -3 -2 3 -1 -2 -2 -1 -1 -2 -1 -2 N -1 0 6 2 -2 0 0 0 1 -2 -3 0 -2 -2 -2 1 0 -4 -2 -3 D -2 -1 2 7 -3 0 2 -1 0 -4 -3 0 -3 -4 -1 0 -1 -4 -2 -3 C -1 -3 -2 -3 12 -3 -3 -3 -3 -3 -2 -3 -2 -2 -4 -1 -1 -5 -3 -1 Q -1 1 0 0 -3 6 2 -2 1 -2 -2 1 0 -4 -1 0 -1 -2 -1 -3 E -1 0 0 2 -3 2 6 -2 0 -3 -2 1 -2 -3 0 0 -1 -3 -2 -3 G 0 -2 0 -1 -3 -2 -2 7 -2 -4 -3 -2 -2 -3 -2 0 -2 -2 -3 -3H -2 0 1 0 -3 1 0 -2 10 -3 -2 -1 0 -2 -2 -1 -2 -3 2 -3 I -1 -3 -2 -4 -3 -2 -3 -4 -3 5 2 -3 2 0 -2 -2 -1 -2 0 3 L -1 -2 -3 -3 -2 -2 -2 -3 -2 2 5 -3 2 1 -3 -3 -1 -2 0 1 K -1 3 0 0 -3 1 1 -2 -1 -3 -3 5 -1 -3 -1 -1 -1 -2 -1 -2 M -1 -1 -2 -3 -2 0 -2 -2 0 2 2 -1 6 0 -2 -2 -1 -2 0 1 F -2 -2 -2 -4 -2 -4 -3 -3 -2 0 1 -3 0 8 -3 -2 -1 1 3 0 P -1 -2 -2 -1 -4 -1 0 -2 -2 -2 -3 -1 -2 -3 9 -1 -1 -3 -3 -3 S 1 -1 1 0 -1 0 0 0 -1 -2 -3 -1 -2 -2 -1 4 2 -4 -2 -1 T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -1 -1 2 5 -3 -1 0 W -2 -2 -4 -4 -5 -2 -3 -2 -3 -2 -2 -2 -2 1 -3 -4 -3 15 3 -3 Y -2 -1 -2 -2 -3 -1 -2 -3 2 0 0 -1 0 3 -3 -2 -1 3 8 -1V 0 -2 -3 -3 -1 -3 -3 -3 -3 3 1 -2 1 0 -3 -1 0 -3 -1 5
The BLOSUM45 Matrix
![Page 17: CAP5510 – Bioinformatics Substitution Patterns](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681596c550346895dc6ad4f/html5/thumbnails/17.jpg)
17
Scoring Matrices for DNA
A C G T
A 1 0 0 0
C 0 1 0 0
G 0 0 1 0
T 0 0 0 1
A C G T
A 1 -3 -3 -3
C -3 1 -3 -3
G -3 -3 1 -3
T -3 -3 -3 1
A C G T
A 1 -5 -1 -5
C -5 1 -5 -1
G -1 -5 1 -5
T -5 -1 -5 1
Transitions & transversions
identity BLAST
![Page 18: CAP5510 – Bioinformatics Substitution Patterns](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681596c550346895dc6ad4f/html5/thumbnails/18.jpg)
18
Scoring Matrices for Amino Acids
• Chemical similarities– Non-polar, Hydrophobic (G, A, V, L, I, M, F, W, P)– Polar, Hydrophilic (S, T, C, Y, N, Q)– Electrically charged (D, E, K, R, H)– Requires expert knowledge
• Genetic code: Nucleotide substitutions– E: GAA, GAG– D: GAU, GAC– F: UUU, UUC
• Actual substitutions– PAM– BLOSUM
![Page 19: CAP5510 – Bioinformatics Substitution Patterns](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681596c550346895dc6ad4f/html5/thumbnails/19.jpg)
19
Scoring Matrices: Actual Substitutions
• Manually align proteins
• Look for amino acid substitutions
• Entry ~ log(freq(observed)/freq(expected))
• Log-odds matrices
![Page 20: CAP5510 – Bioinformatics Substitution Patterns](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681596c550346895dc6ad4f/html5/thumbnails/20.jpg)
20
PAM Matrices
(Dayhoff 1972)
![Page 21: CAP5510 – Bioinformatics Substitution Patterns](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681596c550346895dc6ad4f/html5/thumbnails/21.jpg)
21
PAM
• PAM = “Point Accepted Mutation” interested only in mutations that have been “accepted” by natural selection
• An accepted mutation is a mutation that occurred and was positively selected by the environment; that is, it did not cause the demise of the particular organism where it occurred.
![Page 22: CAP5510 – Bioinformatics Substitution Patterns](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681596c550346895dc6ad4f/html5/thumbnails/22.jpg)
22
Interpretation of PAM matrices
• PAM-1 : one substitution per 100 residues (a PAM unit of time)
• “Suppose I start with a given polypeptide sequence M at time t, and observe the evolutionary changes in the sequence until 1% of all amino acid residues have undergone substitutions at time t+n. Let the new sequence at time t+n be called M’. What is the probability that a residue of type j in M will be replaced by i in M’?”
• PAM-K : K PAM time units
![Page 23: CAP5510 – Bioinformatics Substitution Patterns](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681596c550346895dc6ad4f/html5/thumbnails/23.jpg)
23
• Starts with a multiple sequence alignment of very similar (>85% identity) proteins. Assumed to be homologous
• Compute the relative mutability, mi, of each amino acid– e.g. mA = how many times was alanine
substituted with anything else on the average?
PAM Matrices (1)
![Page 24: CAP5510 – Bioinformatics Substitution Patterns](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681596c550346895dc6ad4f/html5/thumbnails/24.jpg)
24
• ACGCTAFKIGCGCTAFKIACGCTAFKLGCGCTGFKIGCGCTLFKIASGCTAFKLACACTAFKL
• Across all pairs of sequences, there are 28A X substitutions
• There are 10 ALA residues, so mA = 2.8
Relative Mutability
![Page 25: CAP5510 – Bioinformatics Substitution Patterns](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681596c550346895dc6ad4f/html5/thumbnails/25.jpg)
25
ACGCTAFKI
GCGCTAFKI ACGCTAFKL
GCGCTGFKI GCGCTLFKI ASGCTAFKL ACACTAFKL
AG IL
AG AL CS GA
FG,A = 3
Pam Matrices (2)
• Construct a phylogenetic tree for the sequences in the alignment
• Calculate substitution frequencies FX,X
• Substitutions may have occurred either way, so A G also counts as G A.
![Page 26: CAP5510 – Bioinformatics Substitution Patterns](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681596c550346895dc6ad4f/html5/thumbnails/26.jpg)
26
Mutation Probabilities
• Mi,j represents the probability of J I substitution.
iij
ijjij F
FmM
ACGCTAFKI
GCGCTAFKI ACGCTAFKL
GCGCTGFKI GCGCTLFKI ASGCTAFKL ACACTAFKL
AG IL
AG AL CS GA
4
38.2,
AGM = 2.1
![Page 27: CAP5510 – Bioinformatics Substitution Patterns](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681596c550346895dc6ad4f/html5/thumbnails/27.jpg)
27
The PAM Matrix
• The entries of the scoring matrix are the Mi,j values divided by the frequency of occurrence, fi, of residue i.
• fG = 10 GLY / 63 residues = 0.1587
• RG,A = log(2.1/0.1587) = log(12.760) = 1.106
• Log-odds matrix
• Diagonal entries are Mjj = 1– mj
![Page 28: CAP5510 – Bioinformatics Substitution Patterns](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681596c550346895dc6ad4f/html5/thumbnails/28.jpg)
28
Computation of PAM-K
• Assume that changes at time T+1 are independent of the changes at time T.
• Markov chain
• P(A-->B) = X P(A->X) P(X->B)
• PAM-K = (PAM-1)K
• PAM-250 is most commonly used
![Page 29: CAP5510 – Bioinformatics Substitution Patterns](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681596c550346895dc6ad4f/html5/thumbnails/29.jpg)
29
PAM - Discussion
• Smaller K, PAM-K is better for closely related sequences, large K is better for distantly related sequences
• Biased towards closely related sequences since it starts from highly similar sequences (BLOSUM solves this)
• If Mi,j is very small, we may not have a large enough sample to estimate the real probability. When we multiply the PAM matrices many times, the error is magnified.
• Mutation rate may change from one gene to another
![Page 30: CAP5510 – Bioinformatics Substitution Patterns](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681596c550346895dc6ad4f/html5/thumbnails/30.jpg)
30
BLOSUM Matrices
Henikoff & Henikoff 1992
![Page 31: CAP5510 – Bioinformatics Substitution Patterns](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681596c550346895dc6ad4f/html5/thumbnails/31.jpg)
31
BLOSUM Matrix• Begin with a set of protein sequences and obtain blocks.
– ~2000 blocks from 500 families of related proteins– More data than PAM
• A block is the ungapped alignment of a highly conserved region of a family of proteins.
• MOTIF program is used to find blocks• Substitutions in these blocks are used to compute BLOSUM matrix
WWYIR CASILRKIYIYGPV GVSRLRTAYGGRKNRGWFYVR … CASILRHLYHRSPA … GVGSITKIYGGRKRNGWYYVR AAAVARHIYLRKTV GVGRLRKVHGSTKNRGWYFIR AASICRHLYIRSPA GIGSFEKIYGGRRRRG
block 1 block 2 block 3
![Page 32: CAP5510 – Bioinformatics Substitution Patterns](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681596c550346895dc6ad4f/html5/thumbnails/32.jpg)
32
• Count the frequency of occurrence of each amino acid. This gives the background distribution pa
• Count the number of times amino acid a is aligned with amino acid b: fab
– A block of width w and depth s contributes ws(s-1)/2 = np pairs• Compute the occurrence probability of each pair
– qab = fab/ np
• Compute the probability of occurrence of amino acid a– pa = qaa + Σ qab /2
• Compute the expected probability of occurrence of each pair– eab = 2papb, if a ≠ b papb otherwise
• Compute the log likelihood ratios, normalize, and round.– 2* log2 qab / eab
a≠b
i
Constructing the Matrix
![Page 33: CAP5510 – Bioinformatics Substitution Patterns](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681596c550346895dc6ad4f/html5/thumbnails/33.jpg)
33
Constructing the Matrix: Example
• fAA = 36, fAS = 9• Observed frequencies of pairs
– qAA = fAA/(fAA+fAS) = 36/45 = 0.8– qAS = 9/45 = 0.2
• Expected frequencies of letters– pA = qAA + qAS/2 = 0.9– pS = qAS/2 = 0.1
• Expected frequencies of pairs– eAA = pA x pA = 0.81– eAS = 2 x pA x pS = 0.18
• Matrix entries– MAA = 2x log2(qAA/eAA) = -0.04 ~ 0– MAS = 2 x log2(qAS/eAS) = 0.3 ~ 0
A A A A S… A … A A A A
9A, 1S
![Page 34: CAP5510 – Bioinformatics Substitution Patterns](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681596c550346895dc6ad4f/html5/thumbnails/34.jpg)
34
a b
Computation of BLOSUM-K
• Different levels of the BLOSUM matrix can be created by differentially weighting the degree of similarity between sequences. For example, a BLOSUM62 matrix is calculated from protein blocks such that if two sequences are more than 62% identical, then the contribution of these sequences is weighted to sum to one. In this way the contributions of multiple entries of closely related sequences is reduced.
• Larger numbers used to measure recent divergence, default is BLOSUM62
![Page 35: CAP5510 – Bioinformatics Substitution Patterns](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681596c550346895dc6ad4f/html5/thumbnails/35.jpg)
35
BLOSUM 62 Matrix
M I L V-small hydrophobic
N D E Q-acid, hydrophilic
H R K-basic
F Y W-aromatic
S T P A G-small hydrophilic
C-sulphydryl
Check scores for
![Page 36: CAP5510 – Bioinformatics Substitution Patterns](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681596c550346895dc6ad4f/html5/thumbnails/36.jpg)
36
Equivalent PAM and BLOSSUM matrices:
PAM100 = Blosum90 PAM120 = Blosum80 PAM160 = Blosum60 PAM200 = Blosum52 PAM250 = Blosum45
BLOSUM62 is the default matrix to use.
PAM vs. BLOSUM
![Page 37: CAP5510 – Bioinformatics Substitution Patterns](https://reader036.vdocuments.us/reader036/viewer/2022062410/5681596c550346895dc6ad4f/html5/thumbnails/37.jpg)
37
PAM BLOSUM
Built from global alignments Built from local alignments
Built from small amout of Data Built from vast amout of Data
Counting is based on minimumreplacement or maximum parsimony
Counting based on groups ofrelated sequences counted as one
Perform better for finding globalalignments and remote homologs
Better for finding localalignments
Higher PAM series means moredivergence
Lower BLOSUM series meansmore divergence
PAM vs. BLOSUM