jun liu - harvard universityjunliu/sequence_analysis.pdf · biological sequence analysis 3 central...
TRANSCRIPT
![Page 1: Jun Liu - Harvard Universityjunliu/sequence_analysis.pdf · Biological Sequence Analysis 3 Central Paradigm Courtesy of Doug Brutlag. ... gap opening penalty; e: gap extension penalty](https://reader036.vdocuments.us/reader036/viewer/2022081600/603954e4cada891514174ed2/html5/thumbnails/1.jpg)
Biological Sequence Analysis 1
Biological Sequence Analysis and Motif Discovery
Introductory Overview LectureJoint Statistical Meetings 2001, Atlanta
Jun LiuDepartment of Statistics
Harvard Universityhttp://www.fas.harvard.edu/~junliu
![Page 2: Jun Liu - Harvard Universityjunliu/sequence_analysis.pdf · Biological Sequence Analysis 3 Central Paradigm Courtesy of Doug Brutlag. ... gap opening penalty; e: gap extension penalty](https://reader036.vdocuments.us/reader036/viewer/2022081600/603954e4cada891514174ed2/html5/thumbnails/2.jpg)
Biological Sequence Analysis 2
Topics to be covered• Basic Biology: DNA, RNA, Protein; genetic code.• Biological Sequence Analysis
– Pairwise alignment --- dynamic programming• Needleman-Wunsch• Smith-Waterman• Blast …
– Multiple sequence alignment• Heuristic approaches • The hidden Markov model
– Motif finding in DNA and protein sequence
![Page 3: Jun Liu - Harvard Universityjunliu/sequence_analysis.pdf · Biological Sequence Analysis 3 Central Paradigm Courtesy of Doug Brutlag. ... gap opening penalty; e: gap extension penalty](https://reader036.vdocuments.us/reader036/viewer/2022081600/603954e4cada891514174ed2/html5/thumbnails/3.jpg)
Biological Sequence Analysis 3
Central Paradigm
Courtesy of Doug Brutlag
![Page 4: Jun Liu - Harvard Universityjunliu/sequence_analysis.pdf · Biological Sequence Analysis 3 Central Paradigm Courtesy of Doug Brutlag. ... gap opening penalty; e: gap extension penalty](https://reader036.vdocuments.us/reader036/viewer/2022081600/603954e4cada891514174ed2/html5/thumbnails/4.jpg)
Biological Sequence Analysis 4
Buliding Blocks of Biological Systems: nucleotides and amino acids
• DNA (nucleotides, 4 types): information carrier/encoder.• RNA: bridge from DNA to protein.
• Protein (amino acids, 20 types): action molecules.
• Genetic code: deciphering genetic information.atgaatcgta ggggtttgaa cgctggcaat acgatgactt ctcaagcgaa
cattgacgac ggcagctgga aggcggtctc cgagggcgga ……
MNRRGLNAGNTMTSQANIDDGSWKAVSEGG ……
![Page 5: Jun Liu - Harvard Universityjunliu/sequence_analysis.pdf · Biological Sequence Analysis 3 Central Paradigm Courtesy of Doug Brutlag. ... gap opening penalty; e: gap extension penalty](https://reader036.vdocuments.us/reader036/viewer/2022081600/603954e4cada891514174ed2/html5/thumbnails/5.jpg)
Biological Sequence Analysis 5
genome
gene
intron exon
splicing
protein
DNA sequences
![Page 6: Jun Liu - Harvard Universityjunliu/sequence_analysis.pdf · Biological Sequence Analysis 3 Central Paradigm Courtesy of Doug Brutlag. ... gap opening penalty; e: gap extension penalty](https://reader036.vdocuments.us/reader036/viewer/2022081600/603954e4cada891514174ed2/html5/thumbnails/6.jpg)
Biological Sequence Analysis 6
From DNA to Protein
![Page 7: Jun Liu - Harvard Universityjunliu/sequence_analysis.pdf · Biological Sequence Analysis 3 Central Paradigm Courtesy of Doug Brutlag. ... gap opening penalty; e: gap extension penalty](https://reader036.vdocuments.us/reader036/viewer/2022081600/603954e4cada891514174ed2/html5/thumbnails/7.jpg)
Biological Sequence Analysis 7
Genetic Code
![Page 8: Jun Liu - Harvard Universityjunliu/sequence_analysis.pdf · Biological Sequence Analysis 3 Central Paradigm Courtesy of Doug Brutlag. ... gap opening penalty; e: gap extension penalty](https://reader036.vdocuments.us/reader036/viewer/2022081600/603954e4cada891514174ed2/html5/thumbnails/8.jpg)
Biological Sequence Analysis 8
Main Resources for Data• NCBI --- national center for biotechnology
information.– GenBank maintainance– BLAST searching and servers– Entrez database– Taxonomy database– Structure database– Bankit and SEQUIN submission software– Web access http://www.ncbi.nlm.nih.gov
• EMBL --- European equivalent of NCBI.– Web access http://www.embl-heidelberg.de– EBI --- outstation of EMBL. http://www.ebi.ac.uk/
![Page 9: Jun Liu - Harvard Universityjunliu/sequence_analysis.pdf · Biological Sequence Analysis 3 Central Paradigm Courtesy of Doug Brutlag. ... gap opening penalty; e: gap extension penalty](https://reader036.vdocuments.us/reader036/viewer/2022081600/603954e4cada891514174ed2/html5/thumbnails/9.jpg)
Biological Sequence Analysis 9
Finding “Patterns” inBiological Sequences
![Page 10: Jun Liu - Harvard Universityjunliu/sequence_analysis.pdf · Biological Sequence Analysis 3 Central Paradigm Courtesy of Doug Brutlag. ... gap opening penalty; e: gap extension penalty](https://reader036.vdocuments.us/reader036/viewer/2022081600/603954e4cada891514174ed2/html5/thumbnails/10.jpg)
Biological Sequence Analysis 10
Everything is related to everything else in a logical way ….
![Page 11: Jun Liu - Harvard Universityjunliu/sequence_analysis.pdf · Biological Sequence Analysis 3 Central Paradigm Courtesy of Doug Brutlag. ... gap opening penalty; e: gap extension penalty](https://reader036.vdocuments.us/reader036/viewer/2022081600/603954e4cada891514174ed2/html5/thumbnails/11.jpg)
Biological Sequence Analysis 11
A Motif
![Page 12: Jun Liu - Harvard Universityjunliu/sequence_analysis.pdf · Biological Sequence Analysis 3 Central Paradigm Courtesy of Doug Brutlag. ... gap opening penalty; e: gap extension penalty](https://reader036.vdocuments.us/reader036/viewer/2022081600/603954e4cada891514174ed2/html5/thumbnails/12.jpg)
Biological Sequence Analysis 12
Brute-forceStringSearch
![Page 13: Jun Liu - Harvard Universityjunliu/sequence_analysis.pdf · Biological Sequence Analysis 3 Central Paradigm Courtesy of Doug Brutlag. ... gap opening penalty; e: gap extension penalty](https://reader036.vdocuments.us/reader036/viewer/2022081600/603954e4cada891514174ed2/html5/thumbnails/13.jpg)
Biological Sequence Analysis 13
Boyer-Moore String Search
![Page 14: Jun Liu - Harvard Universityjunliu/sequence_analysis.pdf · Biological Sequence Analysis 3 Central Paradigm Courtesy of Doug Brutlag. ... gap opening penalty; e: gap extension penalty](https://reader036.vdocuments.us/reader036/viewer/2022081600/603954e4cada891514174ed2/html5/thumbnails/14.jpg)
Biological Sequence Analysis 14
Consensus in Sequences• A database for protein families and patterns:
– Prosite http://www.expasy.ch/prosite/
<A-x-[ST](2)-x(0,1)-V-{LI}
This pattern, which must be in the N-terminal of the sequence (`<'), means: Ala-any-[Ser or Thr]-[Ser or Thr]-(any or none)-Val-(any but Leu & Ile)
Pattern: [LIVM]-[ST]-A-[STAG]-H-C
Interpretation:
![Page 15: Jun Liu - Harvard Universityjunliu/sequence_analysis.pdf · Biological Sequence Analysis 3 Central Paradigm Courtesy of Doug Brutlag. ... gap opening penalty; e: gap extension penalty](https://reader036.vdocuments.us/reader036/viewer/2022081600/603954e4cada891514174ed2/html5/thumbnails/15.jpg)
Biological Sequence Analysis 15
Pattern Finding Programs• Prosite web program (http://www.expasy.ch/prosite/)
– ScanProsite - Scan a sequence against PROSITE or a pattern against SWISS-PROT
– ProfileScan - Scan a sequence against the profile entries in PROSITE
• Motif web programs ( http://motif.stanford.edu/identify/)– IDENTIFY– SCAN
• GCG SEQWEB programs (commercial)– Stringsearch– Findpatterns– Motifs
![Page 16: Jun Liu - Harvard Universityjunliu/sequence_analysis.pdf · Biological Sequence Analysis 3 Central Paradigm Courtesy of Doug Brutlag. ... gap opening penalty; e: gap extension penalty](https://reader036.vdocuments.us/reader036/viewer/2022081600/603954e4cada891514174ed2/html5/thumbnails/16.jpg)
Biological Sequence Analysis 16
Pairwise Sequence Alignment Methods
![Page 17: Jun Liu - Harvard Universityjunliu/sequence_analysis.pdf · Biological Sequence Analysis 3 Central Paradigm Courtesy of Doug Brutlag. ... gap opening penalty; e: gap extension penalty](https://reader036.vdocuments.us/reader036/viewer/2022081600/603954e4cada891514174ed2/html5/thumbnails/17.jpg)
Biological Sequence Analysis 17
The Sequence Alignment problem
C A T T G
T C A T GGiven a pair of sequences, how do we view them?
C A T T G
T C A T GA better representation?
C A T T G
T C A T GOr
What determines the choice? Gap & mismatch penalties
- C A T T G
T C A - T G
C A T T G
T C A T G
![Page 18: Jun Liu - Harvard Universityjunliu/sequence_analysis.pdf · Biological Sequence Analysis 3 Central Paradigm Courtesy of Doug Brutlag. ... gap opening penalty; e: gap extension penalty](https://reader036.vdocuments.us/reader036/viewer/2022081600/603954e4cada891514174ed2/html5/thumbnails/18.jpg)
Biological Sequence Analysis 18
![Page 19: Jun Liu - Harvard Universityjunliu/sequence_analysis.pdf · Biological Sequence Analysis 3 Central Paradigm Courtesy of Doug Brutlag. ... gap opening penalty; e: gap extension penalty](https://reader036.vdocuments.us/reader036/viewer/2022081600/603954e4cada891514174ed2/html5/thumbnails/19.jpg)
Biological Sequence Analysis 19
Global: Needleman-Wunsch Algorithm• Finding the optimal alignment via dynamic programming• Example alignment
C A T T G …TCATG
o
o
C A T T - - G
T C - - A T G
![Page 20: Jun Liu - Harvard Universityjunliu/sequence_analysis.pdf · Biological Sequence Analysis 3 Central Paradigm Courtesy of Doug Brutlag. ... gap opening penalty; e: gap extension penalty](https://reader036.vdocuments.us/reader036/viewer/2022081600/603954e4cada891514174ed2/html5/thumbnails/20.jpg)
Biological Sequence Analysis 20
Alignment Recursion
F(i,j)F(i-1,j)
F(i,j-1)F(i-1,j-1)
−−−−
+−−=
)1,( ),1(
),()1,1(max),(
γγ
jiFjiF
yxsjiFjiF
ji C A T T G
TCATG
0 -1 -2 -3 -4 -5-1
1-2-3-4-5
E.g., 1 ,2),( }{ == = γyxIyxs 0-1-2
0 -10332
554
0-12 1
46
![Page 21: Jun Liu - Harvard Universityjunliu/sequence_analysis.pdf · Biological Sequence Analysis 3 Central Paradigm Courtesy of Doug Brutlag. ... gap opening penalty; e: gap extension penalty](https://reader036.vdocuments.us/reader036/viewer/2022081600/603954e4cada891514174ed2/html5/thumbnails/21.jpg)
Biological Sequence Analysis 21
Substitution/Scoring Matrices• Pam matrices (Dayhoff et al. 1978) --- phylogeny-based.
PAM250 matrix, log-odds representation
PAM1: expected number of mutation = 1%
![Page 22: Jun Liu - Harvard Universityjunliu/sequence_analysis.pdf · Biological Sequence Analysis 3 Central Paradigm Courtesy of Doug Brutlag. ... gap opening penalty; e: gap extension penalty](https://reader036.vdocuments.us/reader036/viewer/2022081600/603954e4cada891514174ed2/html5/thumbnails/22.jpg)
Biological Sequence Analysis 22
BLO(ck)SU(bstitution)M(atrix) (Henikoff & Henikoff 1992)
• Derived from a set (2000) of aligned and ungapped regions from protein families; emphasizing more on chemical similarities (versus how easy it is to mutate from one residue to another). BLOSUMx is derived from the set of segments of x% identity.
BLOSUM62 Matrix, log-odds representation
![Page 23: Jun Liu - Harvard Universityjunliu/sequence_analysis.pdf · Biological Sequence Analysis 3 Central Paradigm Courtesy of Doug Brutlag. ... gap opening penalty; e: gap extension penalty](https://reader036.vdocuments.us/reader036/viewer/2022081600/603954e4cada891514174ed2/html5/thumbnails/23.jpg)
Biological Sequence Analysis 23
Gap Penalties• Linear score: �(g)=-gd
– Typically: d=8, in unit of half-bits (=4log2)
• Affine score: �(g)=-d-(g-1)e– d: gap opening penalty; e: gap extension penalty– Typical d=12, e=2 (in unit of half-bits)
• Gap penalty corresponds to log-probability of opening a gaps. For example, under the standard linear score, P(g=k)=exp(-dk)=2-4k
![Page 24: Jun Liu - Harvard Universityjunliu/sequence_analysis.pdf · Biological Sequence Analysis 3 Central Paradigm Courtesy of Doug Brutlag. ... gap opening penalty; e: gap extension penalty](https://reader036.vdocuments.us/reader036/viewer/2022081600/603954e4cada891514174ed2/html5/thumbnails/24.jpg)
Biological Sequence Analysis 24
Local Alignment: Smith-Waterman Algorithm
−−−−
+−−=
)1,( ),1(
),()1,1( 0
max),(
djiFdjiF
yxsjiFjiF ji
![Page 25: Jun Liu - Harvard Universityjunliu/sequence_analysis.pdf · Biological Sequence Analysis 3 Central Paradigm Courtesy of Doug Brutlag. ... gap opening penalty; e: gap extension penalty](https://reader036.vdocuments.us/reader036/viewer/2022081600/603954e4cada891514174ed2/html5/thumbnails/25.jpg)
Biological Sequence Analysis 25
Sequence comparison and data base search
![Page 26: Jun Liu - Harvard Universityjunliu/sequence_analysis.pdf · Biological Sequence Analysis 3 Central Paradigm Courtesy of Doug Brutlag. ... gap opening penalty; e: gap extension penalty](https://reader036.vdocuments.us/reader036/viewer/2022081600/603954e4cada891514174ed2/html5/thumbnails/26.jpg)
Biological Sequence Analysis 26
BLAST(Altschul et al. 1990)
• Create a word list from the querry; – word length =3 for protein and 12 for DNA.
• For each listed word, find “neighboring words” (~ 50),• For each sequence in the database, search exact matches to each
word in the set. • Extend the hits in both directions until score drops below X• No gap allowed; use Karlin-Altschul statistics for significance• New versions (>1.4) of BLAST gives gapped alignments.• Compute Smith-Waterman for “significant” alignments• BLASTP (protein), BLASTN (DNA), BLASTX (pr �DNA).
TWWS >)',(
QTVGLMIVYDDA
![Page 27: Jun Liu - Harvard Universityjunliu/sequence_analysis.pdf · Biological Sequence Analysis 3 Central Paradigm Courtesy of Doug Brutlag. ... gap opening penalty; e: gap extension penalty](https://reader036.vdocuments.us/reader036/viewer/2022081600/603954e4cada891514174ed2/html5/thumbnails/27.jpg)
Biological Sequence Analysis 27
BLAST 2.0• Two word hits must be found within a window of A
residues in order to trigger extension• Gapped extension from the middle of ungapped HSPs
• Position-specific iterative (PSI-) BLAST.– Profile constructed on the fly and iteratively refined.– Begin with a single query, profile constructed from those
significant hits; use the profile to do another search, and iterate the procedure till “convergence”
![Page 28: Jun Liu - Harvard Universityjunliu/sequence_analysis.pdf · Biological Sequence Analysis 3 Central Paradigm Courtesy of Doug Brutlag. ... gap opening penalty; e: gap extension penalty](https://reader036.vdocuments.us/reader036/viewer/2022081600/603954e4cada891514174ed2/html5/thumbnails/28.jpg)
Biological Sequence Analysis 28
A Bayesian Model for Pairwise Alignment
Missing data --- Alignment matrix
A i,j = 1 if residue i of sequence 1 aligns with residue j of sequence 2, 0 otherwise.
Observed data pair of sequence R(1), R(2)
ΨR R1 2, PAM or Blosum A Ai j
ii j
j, ,≤ ≤∑ ∑1 1
)2()1()2()1( ,,)2()1( ),|,(
jiji RRjiRRji AARRP Ψ+Θ+Θ=Θ
![Page 29: Jun Liu - Harvard Universityjunliu/sequence_analysis.pdf · Biological Sequence Analysis 3 Central Paradigm Courtesy of Doug Brutlag. ... gap opening penalty; e: gap extension penalty](https://reader036.vdocuments.us/reader036/viewer/2022081600/603954e4cada891514174ed2/html5/thumbnails/29.jpg)
Biological Sequence Analysis 29
Multiple Alignment
![Page 30: Jun Liu - Harvard Universityjunliu/sequence_analysis.pdf · Biological Sequence Analysis 3 Central Paradigm Courtesy of Doug Brutlag. ... gap opening penalty; e: gap extension penalty](https://reader036.vdocuments.us/reader036/viewer/2022081600/603954e4cada891514174ed2/html5/thumbnails/30.jpg)
Biological Sequence Analysis 30
ClustalW Step 2: bBuild the Tree
![Page 31: Jun Liu - Harvard Universityjunliu/sequence_analysis.pdf · Biological Sequence Analysis 3 Central Paradigm Courtesy of Doug Brutlag. ... gap opening penalty; e: gap extension penalty](https://reader036.vdocuments.us/reader036/viewer/2022081600/603954e4cada891514174ed2/html5/thumbnails/31.jpg)
Biological Sequence Analysis 31
![Page 32: Jun Liu - Harvard Universityjunliu/sequence_analysis.pdf · Biological Sequence Analysis 3 Central Paradigm Courtesy of Doug Brutlag. ... gap opening penalty; e: gap extension penalty](https://reader036.vdocuments.us/reader036/viewer/2022081600/603954e4cada891514174ed2/html5/thumbnails/32.jpg)
Biological Sequence Analysis 32
However ….
• No explicit model to guide for the alignment --- heuristic driven.
• Tree construction has problems.• Overall, not sensitive enough for remotely
related sequences.
![Page 33: Jun Liu - Harvard Universityjunliu/sequence_analysis.pdf · Biological Sequence Analysis 3 Central Paradigm Courtesy of Doug Brutlag. ... gap opening penalty; e: gap extension penalty](https://reader036.vdocuments.us/reader036/viewer/2022081600/603954e4cada891514174ed2/html5/thumbnails/33.jpg)
Biological Sequence Analysis 33
The Hidden Markov Model
• For given zs, ys ~ f(ys| zs, θ), and the zs follow a Markov process with transition ps(zs | zs-1, φ).
z1 z2 z3 ztzs…... …...
y1 y2 y3 ys yt
π φ θt t t tz p z z y y( ) ( , , | , , ; , )==== 1 1l l
“The State Space Model”
![Page 34: Jun Liu - Harvard Universityjunliu/sequence_analysis.pdf · Biological Sequence Analysis 3 Central Paradigm Courtesy of Doug Brutlag. ... gap opening penalty; e: gap extension penalty](https://reader036.vdocuments.us/reader036/viewer/2022081600/603954e4cada891514174ed2/html5/thumbnails/34.jpg)
Biological Sequence Analysis 34
What Are Hidden in Sequence Alignment?
•• HMMHMM Architecture: Architecture: transition diagram for the underlying Markov chain.
m1 mEm0 m2 m3 m4
I0 I1 I2 I3 I4
D1 D2 D4D3
![Page 35: Jun Liu - Harvard Universityjunliu/sequence_analysis.pdf · Biological Sequence Analysis 3 Central Paradigm Courtesy of Doug Brutlag. ... gap opening penalty; e: gap extension penalty](https://reader036.vdocuments.us/reader036/viewer/2022081600/603954e4cada891514174ed2/html5/thumbnails/35.jpg)
Biological Sequence Analysis 35
Gi ====
01
if generated by insertion if generated by a match
δ i = o f d e le t io n s#
Dashed path:
(δ1,G1)
r1 r6r2
(δ2,G2)
r3
(δ3,G3) (δ6,G6)
(0,1) (2,0) (2,0) (2,0)
r4
(δ4,G4)
r5
(δ5,G5)
(2,1) (2,0)Solid path:(0,1) (0,1) (0,1) (0,1) (0,0) (0,0)
The PathPath
m0
m1
m2
m3
m4
r0 r1 r2 r3 r4 r5 r6
![Page 36: Jun Liu - Harvard Universityjunliu/sequence_analysis.pdf · Biological Sequence Analysis 3 Central Paradigm Courtesy of Doug Brutlag. ... gap opening penalty; e: gap extension penalty](https://reader036.vdocuments.us/reader036/viewer/2022081600/603954e4cada891514174ed2/html5/thumbnails/36.jpg)
Biological Sequence Analysis 36
Pfam alignment view (http://pfam.wustl.edu/browse.shtml)
![Page 37: Jun Liu - Harvard Universityjunliu/sequence_analysis.pdf · Biological Sequence Analysis 3 Central Paradigm Courtesy of Doug Brutlag. ... gap opening penalty; e: gap extension penalty](https://reader036.vdocuments.us/reader036/viewer/2022081600/603954e4cada891514174ed2/html5/thumbnails/37.jpg)
Biological Sequence Analysis 37
More Restricted ModelsMore Restricted Models(Liu et al. 99, (Liu et al. 99, Neuwald Neuwald et al. 97)et al. 97)
•• Block Motif Propagation Model:Block Motif Propagation Model: limit the total number of gaps but no deletions allowed.
Motif 1 Motif 2 Motif 3 Motif 4
![Page 38: Jun Liu - Harvard Universityjunliu/sequence_analysis.pdf · Biological Sequence Analysis 3 Central Paradigm Courtesy of Doug Brutlag. ... gap opening penalty; e: gap extension penalty](https://reader036.vdocuments.us/reader036/viewer/2022081600/603954e4cada891514174ed2/html5/thumbnails/38.jpg)
Biological Sequence Analysis 38
Representative Alignment
![Page 39: Jun Liu - Harvard Universityjunliu/sequence_analysis.pdf · Biological Sequence Analysis 3 Central Paradigm Courtesy of Doug Brutlag. ... gap opening penalty; e: gap extension penalty](https://reader036.vdocuments.us/reader036/viewer/2022081600/603954e4cada891514174ed2/html5/thumbnails/39.jpg)
Biological Sequence Analysis 39
Finding Repetitive Patterns
![Page 40: Jun Liu - Harvard Universityjunliu/sequence_analysis.pdf · Biological Sequence Analysis 3 Central Paradigm Courtesy of Doug Brutlag. ... gap opening penalty; e: gap extension penalty](https://reader036.vdocuments.us/reader036/viewer/2022081600/603954e4cada891514174ed2/html5/thumbnails/40.jpg)
Biological Sequence Analysis 40
Motif Alignment ModelMotif
width = w
length nk
a1
a2
ak
The missing data: Alignment variable: A={a1, a2, …, ak}
• Every non-site positions follows a common multinomialwith p0=(p0,1 ,…, p0,20)
• Every position i in the motif element follows probabilitydistribution pi=(pi,1 ,…, pi,20)
![Page 41: Jun Liu - Harvard Universityjunliu/sequence_analysis.pdf · Biological Sequence Analysis 3 Central Paradigm Courtesy of Doug Brutlag. ... gap opening penalty; e: gap extension penalty](https://reader036.vdocuments.us/reader036/viewer/2022081600/603954e4cada891514174ed2/html5/thumbnails/41.jpg)
Biological Sequence Analysis 41
The Algorithm• Initialized by choosing random starting positions
• Iterate the following steps many times:– Randomly or systematically choose a sequence, say,
sequence k, to exclude.
– Carry out the predictive-updating step to update ak
• Stop when no more observable changes in likelihood• Available Programs:
– Gibbs motif sampler (http://www.wadsworth.org/res&res/bioinfo)
– Bioprospector (http://bioprospector.stanford.edu)
– MACAW (http://www.ncbi.nlm.nih.gov)
)0()0(2
)0(1 ,......,, Kaaa
![Page 42: Jun Liu - Harvard Universityjunliu/sequence_analysis.pdf · Biological Sequence Analysis 3 Central Paradigm Courtesy of Doug Brutlag. ... gap opening penalty; e: gap extension penalty](https://reader036.vdocuments.us/reader036/viewer/2022081600/603954e4cada891514174ed2/html5/thumbnails/42.jpg)
Biological Sequence Analysis 42
a2
a1The PU-Step
ak ?
a3
1. Compute predictive frequencies of each position i in motifcij= count of amino acid type j at position i.c0j = count of amino acid type j in all non-site positions.qij= (cij+bj)/(K-1+B), B=b1+ • • •+ bK “pseudo-counts”
2. Sample from the predictive distriubtion of ak .
P a lqqk
i R l i
R l ii
wk
k
( ) , ( )
, ( )==== ++++ ∝∝∝∝ ++++
++++====∏∏∏∏1
01
![Page 43: Jun Liu - Harvard Universityjunliu/sequence_analysis.pdf · Biological Sequence Analysis 3 Central Paradigm Courtesy of Doug Brutlag. ... gap opening penalty; e: gap extension penalty](https://reader036.vdocuments.us/reader036/viewer/2022081600/603954e4cada891514174ed2/html5/thumbnails/43.jpg)
Biological Sequence Analysis 43
Using MACAW
![Page 44: Jun Liu - Harvard Universityjunliu/sequence_analysis.pdf · Biological Sequence Analysis 3 Central Paradigm Courtesy of Doug Brutlag. ... gap opening penalty; e: gap extension penalty](https://reader036.vdocuments.us/reader036/viewer/2022081600/603954e4cada891514174ed2/html5/thumbnails/44.jpg)
Biological Sequence Analysis 44
Idea 2: Mixture modeling
• View the dataset as a long sequence with k motif types:
• Idea: partition the input sequence into segments that
correspond to different (unknown) motif models.• It is a mixture model (unsupervised learning).
• Implement a predictive updating scheme.
![Page 45: Jun Liu - Harvard Universityjunliu/sequence_analysis.pdf · Biological Sequence Analysis 3 Central Paradigm Courtesy of Doug Brutlag. ... gap opening penalty; e: gap extension penalty](https://reader036.vdocuments.us/reader036/viewer/2022081600/603954e4cada891514174ed2/html5/thumbnails/45.jpg)
Biological Sequence Analysis 45
Special Case: Bernoulli Sampler
• Sequence data: R = r1 r2 r3…… rN
• Indicator variable: ∆∆∆∆
= δ1δ2δ3 .. .. .. δΝ
• Likelihood: π(R, ∆ | Θ, ε), ε is the prior prob for δi=1• Predictive Update:
δ i ====
10
,,
if it is the start of an elementif not.
π δπ δ
εε
( | , )( | , )
>
>
>
>
[ ]
[ ]
,
,
k k
k k
i r
ri
wRR
pp
k i
k i
========
====−−−−
−−−−
−−−− ====
++++ −−−−
++++ −−−−
∏∏∏∏10 1
1
101
∆∆
parameter for the motif model
![Page 46: Jun Liu - Harvard Universityjunliu/sequence_analysis.pdf · Biological Sequence Analysis 3 Central Paradigm Courtesy of Doug Brutlag. ... gap opening penalty; e: gap extension penalty](https://reader036.vdocuments.us/reader036/viewer/2022081600/603954e4cada891514174ed2/html5/thumbnails/46.jpg)
Biological Sequence Analysis 46
References (self)• Durbin R. et al. (1997). Biological Sequence Analysis, Cambridge
University Press, London.• Liu, J.S. (2001). Monte Carlo Strategies in Scientific Computing, Springer-
Verlag, New York.• Liu, J.S. and Lawrence, C.E. (1999). Bayesian analysis on biopolymer
sequences. Bioinformatics, 138-143.• Liu, J.S., Neuwald, A.F., and Lawrence, C.E. (1999) . Markovian
structures in biological sequence alignments. J. Amer. Statist. Assoc., 1-15.• Neuwald, A.F., Liu, J.S., Lipman, D.J., and Lawrence, C.E. (1997) .
Extracting protein alignment models from the sequence database. Nucleic Acids. Res. 25, 1665-1677.
• Liu, J.S., Neuwald, A.F., and Lawrence, C.E. (1995) . Bayesian models for multiple local sequence alignment and Gibbs sampling strategies. J. Amer. Statist. Assoc. 90, 1156-1170.
• Lawrence, et al. (1993). Science 262, 208-214.