Download - Biological Sequence Analysis
![Page 1: Biological Sequence Analysis](https://reader036.vdocuments.us/reader036/viewer/2022062803/56814789550346895db4b9f6/html5/thumbnails/1.jpg)
1
Biological Sequence Analysis
Lecture 26, Statistics 246
April 27, 2004
![Page 2: Biological Sequence Analysis](https://reader036.vdocuments.us/reader036/viewer/2022062803/56814789550346895db4b9f6/html5/thumbnails/2.jpg)
2
Synopsis
Some biological backgroundA progression of modelsAcknowledgementsReferences
![Page 3: Biological Sequence Analysis](https://reader036.vdocuments.us/reader036/viewer/2022062803/56814789550346895db4b9f6/html5/thumbnails/3.jpg)
3
The objects of our study
DNA, RNA and proteins: macromolecules which are unbranched polymers built up from smaller units.
DNA: units are the nucleotide residues A, C, G and T RNA: units are the nucleotide residues A, C, G and U Proteins: units are the amino acid residues A, C, D, E,
F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W and Y.
To a considerable extent, the chemical properties of DNA, RNA and protein molecules are encoded in the linear sequence of these basic units: their primary structure.
![Page 4: Biological Sequence Analysis](https://reader036.vdocuments.us/reader036/viewer/2022062803/56814789550346895db4b9f6/html5/thumbnails/4.jpg)
4
The central dogma
Protein
mRNA
DNA
transcription
translation
CCTGAGCCAACTATTGATGAA
PEPTIDE
CCUGAGCCAACUAUUGAUGAA
![Page 5: Biological Sequence Analysis](https://reader036.vdocuments.us/reader036/viewer/2022062803/56814789550346895db4b9f6/html5/thumbnails/5.jpg)
5
A protein-coding gene
![Page 6: Biological Sequence Analysis](https://reader036.vdocuments.us/reader036/viewer/2022062803/56814789550346895db4b9f6/html5/thumbnails/6.jpg)
6
Motifs - Sites - Signals - Domains
For this lecture, I’ll use these terms interchangeably to describe recurring elements of interest to us.
In PROTEINS we have: transmembrane domains, coiled-coil domains, EGF-like domains, signal peptides, phosphorylation sites, antigenic determinants, ...
In DNA / RNA we have: enhancers, promoters, terminators, splicing signals, translation initiation sites, centromeres, ...
![Page 7: Biological Sequence Analysis](https://reader036.vdocuments.us/reader036/viewer/2022062803/56814789550346895db4b9f6/html5/thumbnails/7.jpg)
7
Motifs and models
Motifs typically represent regions of structural significance with specific biological function.
Are generalisations from known examples.
The models can be highly specific.
Multiple models can be used to give higher sensitivity & specificity in their detection.
Can sometimes be generated automatically from examples or multiple alignments.
![Page 8: Biological Sequence Analysis](https://reader036.vdocuments.us/reader036/viewer/2022062803/56814789550346895db4b9f6/html5/thumbnails/8.jpg)
8
The use of stochastic models for motifs
Can be descriptive, predictive or everything else in between…..almost business as usual.
However, stochastic mechanisms should never be taken literally, but nevertheless they can be amazingly useful.
Care is always needed: a model or method can break down at any time without notice.
Biological confirmation of predictions is almost always necessary.
![Page 9: Biological Sequence Analysis](https://reader036.vdocuments.us/reader036/viewer/2022062803/56814789550346895db4b9f6/html5/thumbnails/9.jpg)
9
Transcription initiation in E. coli RNA polymerase-promotor interactions
In E. coli transcription is initiated at the promotor , whose sequence is recognised by the Sigma factor of RNA polymerase.
![Page 10: Biological Sequence Analysis](https://reader036.vdocuments.us/reader036/viewer/2022062803/56814789550346895db4b9f6/html5/thumbnails/10.jpg)
10
Transcription initiation in E. coli, cont.
..YKFSTYATWWIRQAITR..
![Page 11: Biological Sequence Analysis](https://reader036.vdocuments.us/reader036/viewer/2022062803/56814789550346895db4b9f6/html5/thumbnails/11.jpg)
11
Determinism 1: consensus sequences
Factor Promotor consensus sequence -35 -10 70 TTGACA TATAAT 28 CTAAA CCGATAT Similarly for 32 , 38 and 54. Consensus sequences have the obvious limitation:
there is usually some deviation from them.
![Page 12: Biological Sequence Analysis](https://reader036.vdocuments.us/reader036/viewer/2022062803/56814789550346895db4b9f6/html5/thumbnails/12.jpg)
12
has 3 Cys-Cys-His-His zinc finger DNA binding domains
The human transcription factor Sp1
![Page 13: Biological Sequence Analysis](https://reader036.vdocuments.us/reader036/viewer/2022062803/56814789550346895db4b9f6/html5/thumbnails/13.jpg)
13
Determinism 2: regular expressions
The characteristic motif of a Cys-Cys-His-His zinc finger DNA binding domain has regular expression
C-X(2,4)-C-X(3)-[LIVMFYWC]-X(8)-H-X(3,5)-H
Here, as in algebra, X is unknown. The 29 a.a. sequence of our example domain 1SP1 is as follows, clearly fitting the model.
1SP1: KKFACPECPKRFMRSDHLSKHIKTHQNKK
![Page 14: Biological Sequence Analysis](https://reader036.vdocuments.us/reader036/viewer/2022062803/56814789550346895db4b9f6/html5/thumbnails/14.jpg)
14
Prosite patterns
An early effort at collecting descriptors for functionally important protein motifs. They do not attempt to describe a complete domain or protein, but simply try to identify the most important residue combinations, such as the catalytic site of an enzyme. They use regular expression syntax, and focus on the most highly conserved residues in a protein family.
http://au.expasy.org
![Page 15: Biological Sequence Analysis](https://reader036.vdocuments.us/reader036/viewer/2022062803/56814789550346895db4b9f6/html5/thumbnails/15.jpg)
15
More on Prosite patterns
This pattern, which must be in the N-terminal of the sequence (‘<'), means: <A - x - [ST] (2) - x(0,1) - V - {LI} Ala-any- [Ser or Thr]-[Ser or Thr] - (any or none)-Val-(any but Leu, Ile)
![Page 16: Biological Sequence Analysis](https://reader036.vdocuments.us/reader036/viewer/2022062803/56814789550346895db4b9f6/html5/thumbnails/16.jpg)
16
http://www.isrec.isb-sib.ch/software/PATFND_form.htmlc.{2,4}c...[livmfywc]........h.{3,5}h
PatternFind output[ISREC-Server] Date: Wed Aug 22 13:00:41 MET 2001 ...gp|AF234161|7188808|01AEB01ABAC4F945 nuclear
protein NP94b [Homo sapiens] Occurences: 2 Position : 514 CYICKASCSSQQEFQDHMSEPQH Position : 606 CTVCNRYFKTPRKFVEHVKSQGH........
Searching with regular expressions
![Page 17: Biological Sequence Analysis](https://reader036.vdocuments.us/reader036/viewer/2022062803/56814789550346895db4b9f6/html5/thumbnails/17.jpg)
17
Regular expressions can be limiting
The regular expression syntax is still too rigid to represent many highly divergent protein motifs.
Also, short patterns are sometimes insufficient with today’s large databases. Even requiring perfect matches you might find many false positives. On the other site some real sites might not be perfect matches.
We need to go beyond apparently equally likely alternatives, and ranges for gaps. We deal with the former first, having a distribution at each position.
![Page 18: Biological Sequence Analysis](https://reader036.vdocuments.us/reader036/viewer/2022062803/56814789550346895db4b9f6/html5/thumbnails/18.jpg)
18
Cys-Cys-His-His profile: sequence logo form
A sequence logo is a scaled position-specific a.a.distribution. Scaling is by a measure of a position’s information content.
![Page 19: Biological Sequence Analysis](https://reader036.vdocuments.us/reader036/viewer/2022062803/56814789550346895db4b9f6/html5/thumbnails/19.jpg)
19
Weight matrix model (WMM) = Stochastic consensus sequence
0.04 0.88 0.26 0.59 0.49 0.03
0.09 0.03 0.11 0.13 0.21 0.05
0.07 0.01 0.12 0.16 0.12 0.02
0.80 0.08 0.51 0.13 0.18 0.89
9 214 63 142 118 8
22 7 26 31 52 13
18 2 29 38 29 5
193 19 124 31 43 216
A
C
G
T
A
C
G
T
-38 19 1 12 10 -48
-15 -38 -8 -10 -3 -32
-13 -48 -6 -7 -10 -40
17 -32 8 -9 -6 19
A
C
G
T
2
1
01 2 3 4 5 6
Counts from 242 known 70 sites Relative frequencies: fbl
10 log2fbl/pb Informativeness:2+bpbllog2pbl
![Page 20: Biological Sequence Analysis](https://reader036.vdocuments.us/reader036/viewer/2022062803/56814789550346895db4b9f6/html5/thumbnails/20.jpg)
20
Interpretation of weight matrix entries
candidate sequence CTATAATC....
aligned position 123456
Hypotheses:S=site (and independence)
R=random (equiprobable, independence)
log2 = log2
= (2+log2.09)+...+(2+log2.01)
=
Generally, score sbl = log fbl/pb
pr(CTATAA | S)
pr(CTATAA | R)
.09x.08x.26x.13x.51x.01
.25x.25x.25x.25x.25x.25
1
10-15 - 32 +1 - 9 +10 - 48
l=position, b=base
pb=background frequency
![Page 21: Biological Sequence Analysis](https://reader036.vdocuments.us/reader036/viewer/2022062803/56814789550346895db4b9f6/html5/thumbnails/21.jpg)
21
Use of the matrix to find sites C T A T A A T C -38 19 1 12 10 -48
-15 -38 -8 -10 -3 -32
-13 -48 -6 -7 -10 -40
17 -32 8 -9 -6 19
A
C
G
T
-38 19 1 12 10 -48
-15 -38 -8 -10 -3 -32
-13 -48 -6 -7 -10 -40
17 -32 8 -9 -6 19
A
C
G
T
-38 19 1 12 10 -48
-15 -38 -8 -10 -3 -32
-13 -48 -6 -7 -10 -40
17 -32 8 -9 -6 19
A
C
G
T
sum
-93
+85
-95
Move the matrixalong the sequenceand score each window.
Peaks should occur at the true sites.
Of course in general any threshold will have some false positive and false negative rate.
![Page 22: Biological Sequence Analysis](https://reader036.vdocuments.us/reader036/viewer/2022062803/56814789550346895db4b9f6/html5/thumbnails/22.jpg)
22
Profiles
Are a variation of the position specific scoring matrix approach just described. Profiles are calculated slightly differently to reflect amino acid substitutions, and the possibility of gaps, but are used in the same way.
In general a profile entry Mla for location l and amino acid a is calculated by
Mla = ∑bwlbSab
where b ranges over amino acids, wlb is a weight (e.g. the observed frequency of a.a. b in position l) and Sab is the (a,b)-entry of a substitution matrix (e.g. PAM or BLOSUM) calculated as a likelihood ratio.
Position specific gap penalties can also be included.
![Page 23: Biological Sequence Analysis](https://reader036.vdocuments.us/reader036/viewer/2022062803/56814789550346895db4b9f6/html5/thumbnails/23.jpg)
23
Derivation of a profile for Ig domains
PileUp of: @/home/ucsb00/George/.WAG/pileup-8391.8424
Symbol comparison table: GenRunData:pileuppep.cmp CompCheck:1254
GapWeight: 3.000 GapLengthWeight: 0.100
pileup.msf MSF: 49 Type: P May 27, 1999 13:18 Check: 9167 ..
Name: g1 Len: 49 Check: 2179 Weight: 1.00 Name: m3 Len: 49 Check: 0 Weight: 1.00 Name: k2 Len: 49 Check: 6739 Weight: 1.00 Name: l3 Len: 49 Check: 249 Weight: 1.00
1 49g1 ELVKAGSSVK MSCKATGYTF SSYE....LY WVRQAPGQGL EDLGYISSSm3 GLVEPGGSLR LSCSASGFTF SAND....MN WVRQAPGKGL EWLSFIGGSk2 LPVTPGEPAS ISCRSSQSLL DSGDGNTYLN WYLQKAGQSP QLLIYTLSYl3 VSVALGQTVR ITCQ.GDSLR GYDAA..... WYQQKPGQAP LLVIYGRNN
![Page 24: Biological Sequence Analysis](https://reader036.vdocuments.us/reader036/viewer/2022062803/56814789550346895db4b9f6/html5/thumbnails/24.jpg)
24
Cons A B C D E F G H I K L M N P Q R S T V W Y Z Gap Len ..V 15 8 -14 14 21 1 24 -4 19 -4 21 19 3 3 8 -14 4 10 32 -33 -14 14 100 100L 9 -10 -14 -12 -5 24 -3 -6 21 -5 38 34 -9 17 0 -7 14 5 25 9 -7 -4 100 100V 20 -20 20 -20 -20 20 20 -30 110 -20 80 60 -30 10 -20 -30 -10 20 150 -80 -10 -20 100 100T 31 21 -10 25 32 -31 21 4 -3 28 -11 0 18 14 17 6 15 32 0 -33 -24 25 100 100P 35 -1 -4 0 3 -13 12 2 5 -1 10 12 -3 50 11 0 13 14 17 -30 -25 6 100 100G 70 60 20 70 50 -60 150 -20 -30 -10 -50 -30 40 30 20 -30 60 40 20 -100 -70 30 100 100E 22 29 -4 36 40 -33 39 10 -12 11 -18 -11 22 15 32 3 31 11 -4 -32 -31 35 100 100S 25 14 26 11 11 -23 29 -5 -3 11 -18 -12 12 38 0 6 57 34 1 -10 -28 4 100 100V 26 -11 -1 -9 -6 16 9 -14 46 -11 45 37 -12 6 -5 -19 -3 11 61 -30 -3 -6 100 100R -4 13 -8 7 7 -30 -3 14 -14 49 -22 5 13 16 17 60 27 4 -14 50 -33 12 100 100! 11I -1 -17 -13 -19 -13 46 -21 -16 67 -8 64 58 -19 -13 -11 -12 -13 5 54 -13 6 -11 100 100S 28 20 43 14 14 -21 40 -13 -3 14 -24 -17 20 27 -7 4 90 38 -3 9 -27 1 100 100C 30 -40 150 -50 -60 -10 20 -10 20 -60 -80 -60 -30 10 -60 -30 70 20 20 -120 100 -60 100 100K 4 18 -11 17 17 -32 6 15 -12 40 -17 1 17 15 31 39 24 4 -11 18 -31 24 100 100A 53 11 19 12 12 -20 31 -6 -1 3 -9 -4 11 21 5 -8 33 17 5 -21 -15 6 30 30S 28 21 28 19 16 -22 45 -11 -5 8 -21 -14 18 21 -2 -2 60 36 2 -13 -27 6 100 100G 29 41 -9 53 39 -44 60 9 -16 7 -24 -15 28 15 37 -4 20 14 1 -54 -37 37 100 100W 2 -4 35 -14 -10 31 1 -4 8 -12 8 -4 1 -8 -23 -12 38 1 -2 43 28 -18 100 100L 10 -10 -19 -10 -3 29 -3 -10 32 -3 44 41 -6 0 -6 -16 -3 44 32 -3 0 -3 100 100W - 21 -28 -18 -39 -26 57 -30 1 29 -15 53 37 -20 -22 -21 -1 -14 -12 13 68 40 -22 100 100! 21S 27 33 18 37 27 -32 50 -4 -10 9 -27 -19 25 18 9 -1 59 18 -3 -20 -29 17 100 100S 29 8 40 4 4 3 19 -4 -2 -2 -10 -11 11 9 -9 -9 48 11 -2 14 4 -6 100 100B 12 35 6 33 21 -10 26 14 -10 0 -15 -15 35 -6 10 -11 10 7 -6 -18 3 14 100 100D 34 47 -20 66 57 -48 39 17 -9 14 -21 -15 32 11 35 -4 15 15 -6 -61 -27 47 100 100A 31 11 7 14 11 -15 31 -4 -4 -1 -8 -4 8 11 6 -8 14 11 6 -25 -14 7 21 21N 3 15 -4 10 7 -7 6 7 -4 6 -6 -4 21 0 6 1 4 3 -4 -4 -1 6 21 21T 6 3 3 3 3 -4 6 -1 3 3 -1 0 3 4 -1 -1 4 21 3 -8 -4 1 21 21Y -4 -4 14 -7 -7 19 -10 4 1 -8 4 -1 -1 -11 -8 -8 -6 -4 -1 15 21 -8 21 21L -3 -20 -34 -21 -12 45 -20 -11 34 -7 66 62 -17 -12 -3 -10 -17 -3 34 12 8 -8 21 21N 2 31 4 15 9 4 3 20 -8 4 -9 -11 46 -11 4 -5 4 2 -11 6 18 4 21 21! 31W - 80 -70 -120 -110 -110 130 -100 -10 -50 10 50 -30 -30 -80 -50 140 30 -60 -80 150 110 -80 100 100F -3 -16 38 -22 -22 51 -16 0 38 -25 35 16 -13 -22 -25 -29 -16 -3 44 10 44 -25 100 100R -8 3 -29 3 6 -10 -14 23 -3 27 7 24 3 10 32 48 -4 -6 -1 44 -23 19 100 100Q 20 50 -60 70 70 -80 20 70 -30 40 -10 0 40 30 150 40 -10 -10 -20 -50 -60 110 100 100A 48 19 -10 19 19 -38 19 0 -6 48 -13 6 19 19 19 16 19 19 0 -22 -29 19 100 100P 49 8 10 10 10 -47 27 10 -11 6 -18 -11 3 92 20 13 28 23 8 -57 -50 14 100 100G 70 60 20 70 50 -60 150 -20 -30 -10 -50 -30 40 30 20 -30 60 40 20 -100 -70 30 100 100Q 11 34 -42 44 44 -55 10 41 -20 44 -10 3 28 18 91 34 -3 -3 -14 -27 -42 68 100 100G 49 26 20 29 23 -30 66 -11 -11 0 -23 -14 20 22 8 -12 45 22 8 -39 -32 12 100 100L 13 -13 -22 -13 -6 16 -6 0 19 -6 38 35 -13 38 6 -3 0 6 29 -10 -16 0 100 100! 41E 11 22 -38 35 53 -17 12 20 1 11 10 12 16 3 42 0 -1 4 2 -35 -20 47 100 100L - 10 -10 -49 -10 -11 42 -20 -2 16 -4 48 32 -7 -19 0 7 -6 -9 12 21 18 -5 100 100L -3 -31 -43 -31 -20 71 -26 -16 61 -20 96 82 -27 -16 -8 -27 -24 -3 66 17 16 -14 100 100I 15 6 19 6 3 10 20 -15 42 -5 13 11 0 3 -8 -12 26 16 36 -26 -12 -2 100 100Y - 24 -27 56 -42 -38 101 -48 16 15 -44 34 1 -13 -55 -45 -41 -27 -21 -3 81 105 -44 100 100I 15 5 12 6 3 10 17 -14 46 -5 17 15 -1 2 -8 -15 9 33 40 -38 -11 -1 100 100S 10 7 -3 6 6 -3 18 -1 1 8 3 12 6 10 6 12 25 7 8 17 -19 4 100 100S 25 33 21 26 20 -25 45 -2 -11 11 -25 -18 36 17 5 0 60 18 -5 -9 -24 10 100 100S 11 21 32 9 6 3 15 5 -6 4 -14 -15 29 2 -6 -4 46 8 -9 21 7 -3 100 100* 12 0 4 6 6 4 22 0 6 6 21 2 6 9 12 7 25 8 10 5 11 0
![Page 25: Biological Sequence Analysis](https://reader036.vdocuments.us/reader036/viewer/2022062803/56814789550346895db4b9f6/html5/thumbnails/25.jpg)
25
Cons A B C D E ... Gap Len 13C 30 -40 150 -50 -60 100 10014K 4 18 -11 17 17 100 10015A 53 11 19 12 12 30 3016S 28 21 28 19 16 100 100
g1 ELVKAGSSVK MSCKATGYTF SSYE....LY WVm3 GLVEPGGSLR LSCSASGFTF SAND....MN WVk2 LPVTPGEPAS ISCRSSQSLL DSGDGNTYLN WYl3 VSVALGQTVR ITCQ.GDSLR GYDAA..... WY
We search with a profile in the same way as we did before.
These days profiles of this kind have been largely replaced by Hidden Markov Models for sequence searching and alignment.
![Page 26: Biological Sequence Analysis](https://reader036.vdocuments.us/reader036/viewer/2022062803/56814789550346895db4b9f6/html5/thumbnails/26.jpg)
26
Modelling motifs: the next steps
Missing from the weight matrix models of motifs and profiles are good ways of dealing with:
• Length distributions for insertions/deletions• Local and non-local association of amino acids
Hidden Markov Models (HMM) help with the first. Dealing with the second remains a hard unsolved problem.
![Page 27: Biological Sequence Analysis](https://reader036.vdocuments.us/reader036/viewer/2022062803/56814789550346895db4b9f6/html5/thumbnails/27.jpg)
27
Hidden Markov Models
Processes {(St, Ot), t=1,…}, where St is the hidden
state and Ot the observation at time t, such that
pr(St | St-1, Ot-1,St-2 , Ot-2 …) = pr(St | St-1)
pr(Ot | St ,St-1, Ot-1,St-2 , Ot-2 …) = pr(Ot | St, St-1)
The basics of HMM were laid bare in a series of beautiful papers by L E Baum and colleagues around 1970, and their formulation has been used almost unchanged to this day.
![Page 28: Biological Sequence Analysis](https://reader036.vdocuments.us/reader036/viewer/2022062803/56814789550346895db4b9f6/html5/thumbnails/28.jpg)
28
Hidden Markov Models:extensions
Many variants are now used. For example, the distribution of O may depend only on previous S, or also on previous O values,
pr(Ot | St , St-1 , Ot-1 ,.. ) = pr(Ot | St ), or
pr(Ot | St , St-1 , Ot-1 ,.. ) = pr(Ot | St , St-1 ,Ot-1) .
Most importantly for us, the times of S and O may be
decoupled, permitting the Observation corresponding to State time t to be a string whose length and composition depends on St (and possibly St-1 and part or all of the previous Observations). This is called a hidden semi-Markov or generalized hidden Markov model.
![Page 29: Biological Sequence Analysis](https://reader036.vdocuments.us/reader036/viewer/2022062803/56814789550346895db4b9f6/html5/thumbnails/29.jpg)
Some current applications of HMM to biology mapping chromosomes
aligning biological sequences
predicting sequence structure
inferring evolutionary relationships
finding genes in DNA sequence
Some early applications of HMM finance, but we never saw them
speech recognition
modelling ion channels
In the mid-late 1980s HMMs entered genetics and molecular biology, and they are now firmly entrenched.
![Page 30: Biological Sequence Analysis](https://reader036.vdocuments.us/reader036/viewer/2022062803/56814789550346895db4b9f6/html5/thumbnails/30.jpg)
30
The algorithms
As the name suggests, with an HMM the series O = (O1,O2 ,O3 ,……., OT) is observed, while the states S = (S1 ,S2 ,S3 ,……., ST) are not.
There are elegant algorithms for calculating pr(O|), arg max pr(O|) in certain special cases, and arg maxS pr(S|O,).
Here are the parameters of the model, e.g. transition and observation probabilities.
![Page 31: Biological Sequence Analysis](https://reader036.vdocuments.us/reader036/viewer/2022062803/56814789550346895db4b9f6/html5/thumbnails/31.jpg)
31
Profile HMM = stochastic regular expressions
M = Match state, I = Insert state, D = Delete state. To operate, go from left to right. I and M states output amino acids; B, D and E states are “silent”.
![Page 32: Biological Sequence Analysis](https://reader036.vdocuments.us/reader036/viewer/2022062803/56814789550346895db4b9f6/html5/thumbnails/32.jpg)
32
How profile HMM are used
Instances of the motif are identified by calculating log{pr(sequence | M)/pr(sequence | B)}, where M and B are the motif and background HMM.Alignments of instances of the motif to the HMM are
found by calculating
arg maxstates pr(states | instance, M).Estimation of HMM parameters is by calculating
arg maxparameterspr(sequences | M, parameters).
In all cases, we use the efficient HMM algorithms.
![Page 33: Biological Sequence Analysis](https://reader036.vdocuments.us/reader036/viewer/2022062803/56814789550346895db4b9f6/html5/thumbnails/33.jpg)
33
Pfam domain-HMM
Pfam is a library of models of recurrent protein domains. They are constructed semi-automatically using profile hidden Markov models.
Pfam families have permanent accession numbers and contain functional annotation and cross-references to other databases, while Pfam-B families are re-generated at each release and are unannotated.
See http://pfam.wustl.edu/
![Page 34: Biological Sequence Analysis](https://reader036.vdocuments.us/reader036/viewer/2022062803/56814789550346895db4b9f6/html5/thumbnails/34.jpg)
34
On the Modelling of Finding Short DNA Motifs
![Page 35: Biological Sequence Analysis](https://reader036.vdocuments.us/reader036/viewer/2022062803/56814789550346895db4b9f6/html5/thumbnails/35.jpg)
35
Outline
• Some Biological Background
• Models and Modelling
• Results and Discussions
• Future Work
![Page 36: Biological Sequence Analysis](https://reader036.vdocuments.us/reader036/viewer/2022062803/56814789550346895db4b9f6/html5/thumbnails/36.jpg)
36
Cartoon of gene structure
(5’ Splice site)
(3’ Splice site)
![Page 37: Biological Sequence Analysis](https://reader036.vdocuments.us/reader036/viewer/2022062803/56814789550346895db4b9f6/html5/thumbnails/37.jpg)
37
12 examples of 5’splice (donor) sitesTCGGTGAGTTGGGTGTGTCCGGTCCGTATG GTAAGATCT GTAAGTCAGGTAGGACAGGTAGGGAAGGTAAGGAGGGTATGGTGGGTAAGGGAGGTTAGT CATGTGAGT
exon intron
![Page 38: Biological Sequence Analysis](https://reader036.vdocuments.us/reader036/viewer/2022062803/56814789550346895db4b9f6/html5/thumbnails/38.jpg)
38
Probability models for short DNA motifs
Short: ~6-20 base pairs
DNA motifs: splice sites, transcription factor binding sites, translation initiation sites, enhancers, promoters,...
Why probability models?• to characterize the motifs
• to help identify them • for incorporation into larger models
Aim: given a number of instances of a DNA sequence motif, we want a model for the probability of that motif.
![Page 39: Biological Sequence Analysis](https://reader036.vdocuments.us/reader036/viewer/2022062803/56814789550346895db4b9f6/html5/thumbnails/39.jpg)
39
Weight matrix models, Staden (1984)
Base -3 -2 -1 0 +1 +2 +3 +4 +5
A 33 61 10 0 0 53 71 7 16
C 37 13 3 0 0 3 8 6 16
G 18 12 80 100 0 42 12 81 22
T 12 14 7 0 100 2 9 6 46
A weight matrix for donor sites. Essentially a mutual independence model.
An improvement over the consensus CAGGTAAGT.But we have to go beyond independence…
![Page 40: Biological Sequence Analysis](https://reader036.vdocuments.us/reader036/viewer/2022062803/56814789550346895db4b9f6/html5/thumbnails/40.jpg)
40
Beyond independence
Weight array matrices, Zhang & Marr (1993) consider dependencies between adjacent positions, i.e. (non-stationary) first-order Markov models. The number of parameters increases exponentially if we restrict to full higher-order Markovian models.
Variable length Markov models, Rissanen (1986), Buhlmann & Wyner (1999), help us get over this problem. In the last few years, many variants have appeared: all make use of trees.
[The interpolated Markov models of Salzberg et al (1998) address the same problem.]
![Page 41: Biological Sequence Analysis](https://reader036.vdocuments.us/reader036/viewer/2022062803/56814789550346895db4b9f6/html5/thumbnails/41.jpg)
41
Some notation
L: length of the DNA sequence motif.
Xi: discrete random variable at position i, taking values from the set = {A, C, G, T}.
x = (x1, …, xL): a DNA sequence motif of length L.
x1j (j>1): the sequence (x1 , x2, …., xj ).
P(x): probability of x.
![Page 42: Biological Sequence Analysis](https://reader036.vdocuments.us/reader036/viewer/2022062803/56814789550346895db4b9f6/html5/thumbnails/42.jpg)
42
Variable Length Markov Models(Rissanen (1986), Buhlmann & Wyner (1999))
Factorize P(x) in the usual telescopic way:
Simplify this using context functions l = 2,..L, to
P(X1 x1) P(X l x l | X1l 1
l2
L
x1l 1)
P(X1 x1) P(X l x l | c l (X1l 1
l2
L
) c l (x1l 1))
11:0
l
ll i
ic
![Page 43: Biological Sequence Analysis](https://reader036.vdocuments.us/reader036/viewer/2022062803/56814789550346895db4b9f6/html5/thumbnails/43.jpg)
43
VLMM cont.
A VLMM for a DNA sequence motif of length L is specified by
• a distribution for X1, and, for l = 2,…L,• a constrained conditional distribution for Xl given Xl-1,…,X1. That is, we need L-1 context functions.
![Page 44: Biological Sequence Analysis](https://reader036.vdocuments.us/reader036/viewer/2022062803/56814789550346895db4b9f6/html5/thumbnails/44.jpg)
44
VLMM: an illustrative example
Pruned context: P(X3|X2=A, X1) = P(X3|X2=A), etc.
X2
X2 X1
X3
X2
X2 X1
X3
X2
X2
Full
PrunedX3
![Page 45: Biological Sequence Analysis](https://reader036.vdocuments.us/reader036/viewer/2022062803/56814789550346895db4b9f6/html5/thumbnails/45.jpg)
45
Sequence dependencies(interactions) are not always local
The methods outlined so far all fail to incorporate long-range ( ≥4 bp) interactions. New model types are needed!
3-dimensional folding; DNA, RNA & protein interactions
![Page 46: Biological Sequence Analysis](https://reader036.vdocuments.us/reader036/viewer/2022062803/56814789550346895db4b9f6/html5/thumbnails/46.jpg)
46
Modelling “long-range” dependency
The principal work in this area is Burge & Karlin’s (1997)
maximal dependence decomposition (MDD) model. Cai et al (2000) and Barash et al (2003) used Bayesian networks (BN). Yeo & Burge (2003) applied maximum entropy models.
Ellrott et al (2002) ordered the sequence and applied Markov models to the motif. We have adapted this last idea, to give permuted variable length Markov models (PVLMM).
![Page 47: Biological Sequence Analysis](https://reader036.vdocuments.us/reader036/viewer/2022062803/56814789550346895db4b9f6/html5/thumbnails/47.jpg)
47
Permuted variable length MMs (PVLMM)
Let be a permutation vector. By using VLMM, simplify the context terms in the equation:
The simplified equation has the form:
1 2( , ,..., )L
1 1
1 1 1 12( ) ( ) ( | ( ) ( ))l l
l l
L
l ll
P x P X x P X x c X c x
![Page 48: Biological Sequence Analysis](https://reader036.vdocuments.us/reader036/viewer/2022062803/56814789550346895db4b9f6/html5/thumbnails/48.jpg)
48
Part of a context (decision) tree for position -2 of a splice donor PVLMM
Sequence order: +2 +5 -1 +4 -2 +3 -3
Node #s: countsEdge #s: split variables
![Page 49: Biological Sequence Analysis](https://reader036.vdocuments.us/reader036/viewer/2022062803/56814789550346895db4b9f6/html5/thumbnails/49.jpg)
49
Maximal Dependence Decomposition
MDD starts with a mutual independence model as with WMMs. The data are then iteratively subdivided, at each stage splitting on the most dependent position, suitably defined. At the tips of the tree so defined, a mutual independence model across all remaining positions is used.
The details can vary according to the splitting criterion (Burge & Karlin used 2), the actual splits (binary, etc), and the stopping rule.
However, the result is always a single tree.
![Page 50: Biological Sequence Analysis](https://reader036.vdocuments.us/reader036/viewer/2022062803/56814789550346895db4b9f6/html5/thumbnails/50.jpg)
50
Parts of MDD trees for splice donors
In each case, splits are into the most frequent nt vs the others.
![Page 51: Biological Sequence Analysis](https://reader036.vdocuments.us/reader036/viewer/2022062803/56814789550346895db4b9f6/html5/thumbnails/51.jpg)
51
Issues in statistical modelling of short DNA motifs
In any study of this kind, essential items are:
• the model class (e.g. WMM, PVLMM, or MDD)• the way we search through the model class (e.g. by
forward selection & simulated annealing)• the way we compare models when searching (e.g.
by AIC, BIC, or MDL), and finally,• the way we assess the final model in relation to our
aims (e.g. by cross-validation).
We always need interesting, high-quality datasets.
![Page 52: Biological Sequence Analysis](https://reader036.vdocuments.us/reader036/viewer/2022062803/56814789550346895db4b9f6/html5/thumbnails/52.jpg)
52
Model assessment:Stand-alone recognition
M: motif model B: background model
Given a sequence x = (x1, …, xL), we predict x to be a motif if
log {P(x | M) / P(x | B)} > c,
for a suitably chosen threshold value c.
![Page 53: Biological Sequence Analysis](https://reader036.vdocuments.us/reader036/viewer/2022062803/56814789550346895db4b9f6/html5/thumbnails/53.jpg)
53
Model assessment: terms
TP: true positives, FP: false positives
TN: true negatives, FN: false negatives
Sensitivity (sn) and specificity (sp) are:
sn = TP / [TP + FN] sp = TP / [TP + FP].
![Page 54: Biological Sequence Analysis](https://reader036.vdocuments.us/reader036/viewer/2022062803/56814789550346895db4b9f6/html5/thumbnails/54.jpg)
54
Transcription factor binding sites
These are of great interest, their signals are very weak, and we typically have only a few instances.
TRANSFAC (Wingender et al 2000).
We have studied 61 TFs with effective length ≤ 9 and ≥ 20 instances, 2,238 sites in all.
In 25/61 cases we are able to improve upon WMM. In the remaining, PVLMM performs similarly to WMM.
![Page 55: Biological Sequence Analysis](https://reader036.vdocuments.us/reader036/viewer/2022062803/56814789550346895db4b9f6/html5/thumbnails/55.jpg)
55
Assessing our models for TFBS recognition
We randomly inserted each of the 2,238 sites into a background sequence of length 1,000 simulated from a stationary 3rd order MM, with parameters estimated from a large collection of human sequence upstream of genes.
10-fold cross-validation (CV) was used for each TF. At each round of CV, 90% of the true sites were used to fit the models (PVLMM, MDD, WMM). We then scanned the sequences containing the remaining 10% of the sites, and selected the N top-scoring sequences as putative binding sites, where N was the size of the remaining 10%. Thus we have sn = sp.
![Page 56: Biological Sequence Analysis](https://reader036.vdocuments.us/reader036/viewer/2022062803/56814789550346895db4b9f6/html5/thumbnails/56.jpg)
56
TFBS: results for 25/61
![Page 57: Biological Sequence Analysis](https://reader036.vdocuments.us/reader036/viewer/2022062803/56814789550346895db4b9f6/html5/thumbnails/57.jpg)
57
Three transcription factors
wmm: 0.54pvlmm: 0.72mdd: 0.64
wmm: 0.38pvlmm: 0.55mdd: 0.45
Sensitivitywmm: 0.37pvlmm: 0.54mdd: 0.37
![Page 58: Biological Sequence Analysis](https://reader036.vdocuments.us/reader036/viewer/2022062803/56814789550346895db4b9f6/html5/thumbnails/58.jpg)
58
Splice donor sites
Mammalian donor sequences from SpliceDB, Burset et al (2001).
15,155 canonical donor sites of length 9, with GT conserved at position 0 and 1.
Mammalian donor site
![Page 59: Biological Sequence Analysis](https://reader036.vdocuments.us/reader036/viewer/2022062803/56814789550346895db4b9f6/html5/thumbnails/59.jpg)
59
Beginning of the splicing process
Splice donorExon U1snRNP(RNA+proteins)
![Page 60: Biological Sequence Analysis](https://reader036.vdocuments.us/reader036/viewer/2022062803/56814789550346895db4b9f6/html5/thumbnails/60.jpg)
60
Base -3 -2 -1 0 +1 +2 +3 +4 +5
A 33 61 10 0 0 53 71 7 16
C 37 13 3 0 0 3 8 6 16
G 18 12 80 100 0 42 12 81 22
T 12 14 7 0 100 2 9 6 46 U1 snRNA G U C C A U U C A
Interpretation of PVLMM model selectedMammalian donor site
![Page 61: Biological Sequence Analysis](https://reader036.vdocuments.us/reader036/viewer/2022062803/56814789550346895db4b9f6/html5/thumbnails/61.jpg)
61
U1 sn RNA: G U C C A U U C A Optimal permutation: +2 +3 +4 -1 -2 +5 -3
“Long-range” dependence in the model:5’/3’ compensation
![Page 62: Biological Sequence Analysis](https://reader036.vdocuments.us/reader036/viewer/2022062803/56814789550346895db4b9f6/html5/thumbnails/62.jpg)
62
Integrating into a gene finder
SLAM (Pachter et al, 2002): a eukaryotic cross-species gene finder (generalized pair hidden Markov model (GPHMM)).
Results at the exon level
VLMM MDD PVLMM
Sensitivity 360/465 365/465 371/465
Specificity 360/470 365/465 371/464
![Page 63: Biological Sequence Analysis](https://reader036.vdocuments.us/reader036/viewer/2022062803/56814789550346895db4b9f6/html5/thumbnails/63.jpg)
63
Conclusions
• Systematic framework for probabilistic modeling of short DNA motifs
• Statistical modeling issues• Better prediction performance by modeling local & non-local dependence• Good biological interpretations
![Page 64: Biological Sequence Analysis](https://reader036.vdocuments.us/reader036/viewer/2022062803/56814789550346895db4b9f6/html5/thumbnails/64.jpg)
64
Future work
• Model the dependence in protein motifs• Extend to find de novo motifs• Find a way to deal with indels• Incorporate local sequence information into
PVLMM-based predictions• Jointly model multiple motifs in one species and
sites across species
![Page 65: Biological Sequence Analysis](https://reader036.vdocuments.us/reader036/viewer/2022062803/56814789550346895db4b9f6/html5/thumbnails/65.jpg)
65
ReferencesBiological Sequence AnalysisR Durbin, S Eddy, A Krogh and G MitchisonCambridge University Press, 1998.
Bioinformatics The machine learning approachP Baldi and S BrunakThe MIT Press, 1998
Post-Genome InformaticsM KanehisaOxford University Press, 2000
Find short DNA motifs using permuted Markov modelsX Zhao, H Huang and T Speed, RECOMB2004.
![Page 66: Biological Sequence Analysis](https://reader036.vdocuments.us/reader036/viewer/2022062803/56814789550346895db4b9f6/html5/thumbnails/66.jpg)
66
Acknowledgements Terry Speed, UCB & WEHI Haiyan Huang, UCB
The SLAM team: Simon Cawley, Affymetrix Lior Pachter, UCB Marina Alexandersson, FCC
Sourav Chatterji, UCB Mauro Delorenzi (ISREC) WEHI bioinformatics lab