Course organization

– Introduction (Week 1)
– Part I: Algorithms for Sequence Analysis (Week 1 to 9)
  – Chapter 1, Models and theories
    » Probability theory and Statistics (Week 2)
    » Algorithm complexity analysis (Week 3)
    » Classic algorithms (Week 3)
    » Lab: Linux and Perl
  – Chapter 2, Sequence alignment (Week 4-5)
  – Chapter 3, Hidden Markov Models (Week 5-6)
  – Chapter 4, Multiple sequence alignment (Week 7)
  – Chapter 5, Motif finding (Week 8)
  – Chapter 6, Sequence binning (Week 9)
– Part II: Algorithms for Network Biology (Week 10 to 16)
Chapter 1: Probability Theory
for Biological Sequence Analysis
Chaochun Wei
Fall 2011
A scientist who has learned how to use probability theory directly as extended logic has a great advantage in power and versatility over one who has learned only a collection of unrelated ad hoc devices.
– E. T. Jaynes, 1996
Contents

• Reading materials
• Applications
• Introduction
– Definition
– Conditional, joint, marginal probabilities
– Statistical inference
• Bayesian statistical inference
• Frequentist inference
– Information theory
– Parameter estimation
Reading

Durbin book:
Durbin, R., Eddy, S., Krogh, A., and Mitchison, G. (1998). Biological Sequence Analysis. Cambridge University Press.
(Errata page: http://selab.janelia.org/cupbook_errata.html)

DeGroot, M., Schervish, M., Probability and Statistics (4th Edition)

Other recommended background:
Jaynes, E.T., Probability Theory: The Logic of Science, Cambridge University Press, 2003
Probability theory for biological sequence analysis
Applications
BLAST significance tests
The derivation of BLOSUM and PAM scoring matrices
Position Weight Matrix (PWM or PSSM)
Hidden Markov Models (HMM)
Maximum likelihood methods for phylogenetic trees
Probability theory

Definition

A discrete probability distribution assigns a probability P_i to each outcome i, with
P_i ≥ 0; Σ_i P_i = 1.
For a continuous random variable with density f(x):
f(x) ≥ 0; ∫ f(x) dx = 1.

Examples:
A fair die: P_i = 1/6, i = 1, 2, ..., 6.
A random nucleotide sequence: P_A = P_C = P_G = P_T = 1/4.
"i.i.d.": independent, identically distributed.
Classical Terminology

• Experiment: e.g. toss a coin 10 times, or sequence a genome.
• Outcome: A possible result of an experiment, e.g. HHTHTTHHHT or ACGCTTATC.
• Sample space: The set of all possible outcomes of some experiment, e.g. {H, T}^10 or {A, C, G, T}*.
• Event: Any subset of the sample space, e.g. "4 heads", or "DNA seqs with no run of > 50 As".
Definitions, axioms, theorems (1)

If S is a sample space and A is an event, then Pr(A) is a number representing its probability.

Axiom 1. For any event A, Pr(A) ≥ 0.
Axiom 2. If S is a sample space, Pr(S) = 1.

Events A, B are disjoint iff A ∩ B = ∅; the set {A1, A2, ...} is disjoint iff every pair is disjoint. Disjoint events are mutually exclusive.

Axiom 3. For any finite or countably infinite collection of disjoint events A1, A2, ...,
Pr(∪_i A_i) = Σ_i Pr(A_i).
Definitions, axioms, theorems (2)

Theorem 1. Pr(∅) = 0.
Theorem 2. For any event A, where A^c is the complement of A, Pr(A^c) = 1 − Pr(A).
Theorem 3. For any event A, 0 ≤ Pr(A) ≤ 1.
Theorem 4. If A ⊆ B, then Pr(A) ≤ Pr(B).
Theorem 5. Pr(A ∪ B) = Pr(A) + Pr(B) − Pr(A ∩ B).
Joint, conditional, and marginal probabilities

Joint probability, P(A, B): "probability of A and B"
Conditional probability, P(A|B): "probability of A given B"
  P(A|B) = P(A, B)/P(B)
Marginal probability:
  P(A) = Σ_B P(A|B) P(B) = Σ_B P(A, B)

Examples:
The occasionally dishonest casino. Two types of dice: 99% are fair, 1% are loaded such that P(6) = 0.5.
Conditional P(6|loaded); joint P(6, loaded); marginal P(6).
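These three quantities can be computed directly; a minimal Python sketch using the slide's toy numbers (99% fair dice, 1% loaded, P(6|loaded) = 0.5):

```python
# Occasionally dishonest casino: 99% of dice are fair, 1% are loaded
# with P(6 | loaded) = 0.5 (the slide's toy numbers).
p_loaded, p_fair = 0.01, 0.99
p6_given_loaded, p6_given_fair = 0.5, 1 / 6

# Joint: P(6, loaded) = P(6 | loaded) * P(loaded)
p6_and_loaded = p6_given_loaded * p_loaded

# Marginal, summing out the die type:
# P(6) = P(6 | loaded) P(loaded) + P(6 | fair) P(fair)
p6 = p6_given_loaded * p_loaded + p6_given_fair * p_fair

print(round(p6_and_loaded, 4))  # 0.005
print(round(p6, 4))             # 0.17
```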
Independence
If Pr(A|B) = Pr(A), we say A is independent of B.
If A is independent of B, then B is independent of A.
A and B are independent iff Pr(A, B) = Pr(A) Pr(B).
Four rules for manipulating probability expressions
1. Chain rule
Example:
Pr(x1, x2, x3) = [Pr(x1, x2, x3)/Pr(x2, x3)] · [Pr(x2, x3)/Pr(x3)] · Pr(x3)
              = Pr(x1 | x2, x3) · Pr(x2 | x3) · Pr(x3)
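The chain rule can be checked numerically; a small sketch with an invented joint distribution over three binary variables:

```python
from itertools import product

# Toy joint distribution Pr(x1, x2, x3) over three binary variables
# (invented numbers; the probabilities sum to 1).
joint = {bits: 1 / 8 for bits in product((0, 1), repeat=3)}
joint[(0, 0, 0)] = 1 / 4
joint[(1, 1, 1)] = 0.0

def marg(x2=None, x3=None):
    """Marginal probability of whichever of x2, x3 are fixed."""
    return sum(p for (_, b, c), p in joint.items()
               if (x2 is None or b == x2) and (x3 is None or c == x3))

x1, x2, x3 = 0, 1, 0
pr_x1_given_x2_x3 = joint[(x1, x2, x3)] / marg(x2=x2, x3=x3)
pr_x2_given_x3 = marg(x2=x2, x3=x3) / marg(x3=x3)
pr_x3 = marg(x3=x3)

lhs = joint[(x1, x2, x3)]                              # Pr(x1, x2, x3)
rhs = pr_x1_given_x2_x3 * pr_x2_given_x3 * pr_x3       # chain rule
print(lhs, abs(lhs - rhs) < 1e-12)  # 0.125 True
```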
Four rules for manipulating probability expressions
2. Bayes rule
Example:
Pr(x1, x2) = Pr(x1 | x2) Pr(x2) = Pr(x2 | x1) Pr(x1)

so

Pr(x1 | x2) = Pr(x2 | x1) Pr(x1) / Pr(x2)
Four rules for manipulating probability expressions
3. Summing out (Marginalizing)
P(A) = Σ_B P(A|B) P(B) = Σ_B P(A, B)
Four rules for manipulating probability expressions
4. Exhaustive Conditionalization
Pr(x) = Σ_y Pr(x, y) = Σ_y Pr(x | y) Pr(y)
Statistical inference
Bayesian statistical inference
Maximum likelihood inference
Frequentist inference
Bayesian statistical inference
The probability of a hypothesis, H, given some data, D.
Bayes’ rule: P(H|D) = P(H)*P(D|H)/P(D)
H: hypothesis, D: data
P(H): prior probability
P(D|H) : likelihood
P(H|D): posterior probability
P(D): marginal probability, P(D) = Σ_H P(D|H) P(H)
Bayesian statistical inference

Examples:
1. The occasionally dishonest casino. We choose a die, roll it three times, and every roll comes up a 6. Did we pick a loaded die?
2. Given a triplet, is this an initiator or not?

Ans (to 1): Let H stand for "picked a loaded die"; then
P(H | 6, 6, 6) = P(6, 6, 6 | H) P(H) / P(6, 6, 6) ≈ 0.21
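The ≈0.21 answer follows from plugging the casino's numbers into Bayes' rule; a minimal Python check (priors and likelihoods as in the example):

```python
# Posterior for the dishonest-casino example: P(loaded) = 0.01,
# P(6 | loaded) = 0.5, P(6 | fair) = 1/6, three sixes observed.
p_loaded, p_fair = 0.01, 0.99
lik_loaded = 0.5 ** 3        # P(6,6,6 | loaded)
lik_fair = (1 / 6) ** 3      # P(6,6,6 | fair)

evidence = lik_loaded * p_loaded + lik_fair * p_fair   # P(6,6,6)
posterior = lik_loaded * p_loaded / evidence           # Bayes' rule
print(round(posterior, 2))  # 0.21
```

Note the posterior is still below 0.5: even after three straight sixes, the 1% prior makes a fair die the better bet.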
Maximum likelihood inference
For a model M, find the best parameter θ = {θ_i} from a set of data D, i.e.,

θ_ML = argmax_θ P(D | θ, M)

Assume dataset D is created by model M with parameter θ_0, with K observable outcomes ω_i (i = 1, ..., K) occurring with frequencies n_i. Then the best estimate of P(ω_i | θ_0, M) is n_i / Σ_k n_k.
Maximum likelihood inference
P(x|y): the probability of x, or the likelihood of y
Likelihood ratios; log likelihood ratios (LLR):
P(D|θ1, M)/P(D|θ2, M); log[P(D|θ1, M)/P(D|θ2, M)]
Substitution matrices are LLRs:
Derivation of BLOSUM matrices (Henikoff 1992 paper)
Interpretation of arbitrary score matrices as probabilistic models (Altschul 1991 paper)
Maximum likelihood inference
Derivation of BLOSUM matrices (Henikoff 1992 paper)

aa pair frequency table f: {f_ij}

Observed pair probabilities: q_ij = f_ij / Σ_{i,j} f_ij (summed over the 20×20 table)

Marginal probability of each aa: p_i = q_ii + Σ_{j≠i} q_ij / 2

Expected probability of each i, j pair:
e_ij = p_i^2 for i = j
e_ij = 2 p_i p_j for i ≠ j

Substitution matrix (an LLR matrix): s_ij = log2(q_ij / e_ij)
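A toy version of the recipe, using a 2-letter alphabet in place of the 20 amino acids (the pair counts are invented, not from Henikoff's blocks data):

```python
import math

# Toy pair-frequency table f over a 2-letter alphabet {A, B} --
# a stand-in for the 20x20 amino-acid table (counts invented).
f = {("A", "A"): 6, ("A", "B"): 2, ("B", "B"): 2}

total = sum(f.values())
q = {pair: n / total for pair, n in f.items()}   # observed pair probs q_ij

# Marginals: p_i = q_ii + sum over j != i of q_ij / 2
def marginal(i):
    p = q.get((i, i), 0.0)
    for (a, b), qab in q.items():
        if a != b and i in (a, b):
            p += qab / 2
    return p

p = {i: marginal(i) for i in "AB"}

# Expected pair probs: e_ii = p_i^2, e_ij = 2 p_i p_j for i != j
def expected(i, j):
    return p[i] ** 2 if i == j else 2 * p[i] * p[j]

# Substitution scores s_ij = log2(q_ij / e_ij)
s = {pair: math.log2(q[pair] / expected(*pair)) for pair in f}
print({pair: round(v, 2) for pair, v in s.items()})
```

As expected for an LLR matrix, pairs observed more often than chance score positive and pairs observed less often than chance score negative.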
Frequentist inference
Statistical hypothesis testing and confidence intervals
Examples:
Blast p-values and E-values:
p-value: P(S ≥ x), for an alignment score S
Expectation value: E = N · P(S ≥ x)
Information theory
How to measure the degree of conservation?
Shannon entropy
Relative entropy
Mutual information
Shannon entropy: A measure of uncertainty
Given probabilities P(x_i) for a discrete set of K events x_1, ..., x_K, the Shannon entropy H(X) is defined as

H(X) = − Σ_i P(x_i) log P(x_i)

Unit of entropy: 'bit' (when the logarithm is base 2)

H(X) is maximized when P(x_i) = 1/K for all i.
H(X) is minimized when P(x_k) = 1 for some k, and P(x_i) = 0 for all i ≠ k.
Information: a measure of reduction of uncertainty
The difference between the entropy before and after a 'message' is received:

I(X) = H_before − H_after
Shannon entropy: A measure of uncertainty
Example: in a DNA sequence, a ∈ {A, C, G, T} with P_a = 1/4; then
H(X) = − Σ_a P_a log2 P_a = 2 bits

Information: A measure of reduction in uncertainty
Example: measure the degree of conservation of a position in a DNA sequence.
At a position across many DNA sequences, if P_C = 0.5 and P_G = 0.5, then
H_after = − 0.5 log2 0.5 − 0.5 log2 0.5 = 1 bit.
The information content of this position is 2 − 1 = 1 bit.
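Both computations in this example are easy to reproduce; a short sketch:

```python
import math

def entropy(probs):
    """Shannon entropy in bits; terms with P = 0 contribute 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Background: uniform random DNA, P_a = 1/4 for a in {A, C, G, T}
h_before = entropy([0.25] * 4)

# Conserved alignment column: only C and G, each with probability 0.5
h_after = entropy([0.5, 0.5])

info = h_before - h_after   # information content of the position
print(h_before, h_after, info)  # 2.0 1.0 1.0
```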
Patterns in Splice Sites (Donor Sites / Acceptor Sites)

[Figure: sequence logos showing information content at donor and acceptor splice sites]

Josep F. Abril et al. Genome Res. 2005; 15: 111-119
Sequence data from RefSeq of human, mouse, rat and chicken.
Relative entropy: a measure of uncertainty

A different type of entropy:

H(P||Q) = Σ_i P(x_i) log [P(x_i) / Q(x_i)]

Properties of relative entropy:
H(P||Q) ≠ H(Q||P)
H(P||Q) ≥ 0

Can be viewed as the expected LLR.
Proof that relative entropy is always nonnegative

Since log z ≤ z − 1 for all z > 0:

−H(P||Q) = Σ_i P(x_i) log [Q(x_i) / P(x_i)]
         ≤ Σ_i P(x_i) [Q(x_i) / P(x_i) − 1]
         = Σ_i [Q(x_i) − P(x_i)] = 1 − 1 = 0

Hence H(P||Q) ≥ 0.
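A numeric sanity check of both properties, on two invented three-outcome distributions:

```python
import math

def relative_entropy(p, q):
    """H(P||Q) = sum_i P(x_i) log2[P(x_i) / Q(x_i)] (0 log 0 taken as 0)."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]
q = [0.25, 0.25, 0.5]

h_pq = relative_entropy(p, q)
h_qp = relative_entropy(q, p)
print(h_pq >= 0, h_qp >= 0)   # True True
print(h_pq == h_qp)           # False: not symmetric
```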
Mutual information M(X; Y)

M(X; Y) = Σ_{x,y} P(x, y) log [P(x, y) / (P(x) P(y))]
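A small sketch computing M(X;Y) for an invented joint distribution of two binary variables, plus the independent case for contrast:

```python
import math

# Toy joint distribution P(x, y) over two binary variables; X and Y
# are positively correlated (invented numbers).
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

px = {x: sum(p for (a, _), p in joint.items() if a == x) for x in (0, 1)}
py = {y: sum(p for (_, b), p in joint.items() if b == y) for y in (0, 1)}

# M(X;Y) = sum over x, y of P(x, y) log2[ P(x, y) / (P(x) P(y)) ]
mi = sum(p * math.log2(p / (px[x] * py[y])) for (x, y), p in joint.items())
print(round(mi, 4))   # positive, since X and Y are dependent

# For comparison: an independent pair has zero mutual information.
indep = {(x, y): px[x] * py[y] for x in (0, 1) for y in (0, 1)}
mi0 = sum(p * math.log2(p / (px[x] * py[y])) for (x, y), p in indep.items())
print(round(mi0, 4))  # 0.0
```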
Parameter estimation
Maximum likelihood estimation (ML)
Maximum a posteriori estimation (MAP)
Expectation maximization (EM)
Parameter estimation
Maximum likelihood estimation: use the observed frequencies as probability parameters, i.e.,

P(x) = count(x) / Σ_y count(y)

Maximum a posteriori estimation (MAP):
"Plus-one" prior
Pseudocounts
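A sketch contrasting the two estimators on a toy nucleotide sample (the sequence is invented; note how the plus-one prior rescues the unseen G):

```python
from collections import Counter

def ml_estimate(seq, alphabet="ACGT"):
    """Maximum likelihood: observed frequencies as probabilities."""
    counts = Counter(seq)
    return {a: counts[a] / len(seq) for a in alphabet}

def map_plus_one(seq, alphabet="ACGT"):
    """'Plus-one' prior: add a pseudocount of 1 to every symbol's count,
    so symbols unseen in a small sample keep nonzero probability."""
    counts = Counter(seq)
    total = len(seq) + len(alphabet)
    return {a: (counts[a] + 1) / total for a in alphabet}

seq = "ACCATA"                 # small sample with no G at all
ml = ml_estimate(seq)
mp = map_plus_one(seq)
print(ml["G"], mp["G"])        # 0.0 0.1
```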
Parameter estimation
EM: A general algorithm for ML estimation with "missing data". It iterates two steps:
E-step: using current parameters, estimate expected counts.
M-step: using current expected counts, re-estimate new parameters.
Example: the Baum-Welch algorithm for HMM parameter estimation.
Convergence to a local maximum of the likelihood is guaranteed.
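A minimal EM sketch in the spirit of the casino example (all numbers invented; the per-die six probabilities are assumed known, so EM estimates only the mixing weight, while Baum-Welch applies the same E/M pattern to HMM counts):

```python
from math import comb

# Mixture model: each trial picks a die -- loaded with unknown
# probability pi -- rolls it n = 10 times, and records only k,
# the number of sixes.  P(6) is treated as known per die type:
# 0.5 (loaded) vs 1/6 (fair).
def binom(k, n, p):
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

data = [5, 1, 2, 6, 0, 1, 2, 1, 5, 1]   # sixes per 10-roll trial (invented)
n, pi = 10, 0.5                          # initial guess for P(loaded)

for _ in range(100):
    # E-step: posterior P(loaded | k) for each trial ("expected counts")
    post = [pi * binom(k, n, 0.5)
            / (pi * binom(k, n, 0.5) + (1 - pi) * binom(k, n, 1 / 6))
            for k in data]
    # M-step: re-estimate pi from the expected counts
    pi = sum(post) / len(post)

print(round(pi, 2))
```

With three high-six trials out of ten, the estimate settles near 0.3, matching the intuition that about three of the ten draws were loaded dice.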