algorithms for regulatory motif discovery
DESCRIPTION
Algorithms for Regulatory Motif Discovery. Xiaohui Xie, University of California, Irvine. Positional weight matrix representation. Lambda cI/cro binding sites. Sequence Logo. Probabilistic model for motif discovery.
![Page 1: Algorithms for Regulatory Motif Discovery](https://reader035.vdocuments.us/reader035/viewer/2022080900/56812d14550346895d91f563/html5/thumbnails/1.jpg)
Algorithms for Regulatory Motif Discovery
Xiaohui Xie
University of California, Irvine
![Page 2: Algorithms for Regulatory Motif Discovery](https://reader035.vdocuments.us/reader035/viewer/2022080900/56812d14550346895d91f563/html5/thumbnails/2.jpg)
Lambda cI/cro binding sites
A:   9   9  94  25   1  71   1   1   1   9  17  32   9  17   1  48   1  71  63
C:  17  17   1  25  94   9  86  55   9  40  71   9   1   1   1   1   1   1   9
G:   9   1   1   1   1   1   1   9  71  40   9  55  86   9  94  25   1  17  17
T:  63  71   1  48   1  17   9  32  17   9   1   1   1  71   1  25  94   9   9
Positional weight matrix representation
$W_{ij}$
Sequence Logo
![Page 3: Algorithms for Regulatory Motif Discovery](https://reader035.vdocuments.us/reader035/viewer/2022080900/56812d14550346895d91f563/html5/thumbnails/3.jpg)
Probabilistic model for motif discovery
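• Input: a set of sequences $\{S_1, S_2, \ldots, S_N\}$, each of which is of length $w$, i.e. $|S_i| = w$ for all $i$.
• Two models:
  – Motif model: $P(y \mid \theta)$
  – Background model: $P(y \mid \theta_0)$
• Hidden variables: $\{z_1, z_2, \ldots, z_N\}$, where $z_i = 1$ if $S_i$ is a motif site and $0$ otherwise.
  – $P(S_i \mid z_i = 1) = P(S_i \mid \theta)$ and $P(S_i \mid z_i = 0) = P(S_i \mid \theta_0)$
  – $P(z_i = 1 \mid S_i) = \dfrac{P(S_i \mid z_i = 1)\,P(z_i = 1)}{P(S_i \mid z_i = 1)\,P(z_i = 1) + P(S_i \mid z_i = 0)\,P(z_i = 0)}$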
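The posterior in the last bullet can be evaluated directly once the two models are given. The sketch below assumes the motif model is a position-specific probability matrix and the background model is a single letter distribution applied independently at each position; the names `theta`, `theta0`, `prior` and the toy numbers are illustrative assumptions, not values from the slides.

```python
from math import prod

ALPHABET = "ACGT"

def site_likelihood(site, theta):
    """P(site | motif model): product of per-position letter probabilities."""
    return prod(theta[i][ALPHABET.index(ch)] for i, ch in enumerate(site))

def background_likelihood(site, theta0):
    """P(site | background): every position drawn independently from theta0."""
    return prod(theta0[ALPHABET.index(ch)] for ch in site)

def posterior_motif(site, theta, theta0, prior):
    """P(z = 1 | site) by Bayes' rule, where prior = P(z = 1)."""
    num = site_likelihood(site, theta) * prior
    den = num + background_likelihood(site, theta0) * (1.0 - prior)
    return num / den

# Hypothetical toy parameters for a motif of width w = 3 (rows: positions, columns: A, C, G, T).
theta = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
]
theta0 = [0.25, 0.25, 0.25, 0.25]

print(posterior_motif("ACG", theta, theta0, prior=0.1))
print(posterior_motif("TTT", theta, theta0, prior=0.1))
```

With these made-up numbers the site "ACG", which agrees with the high-probability letter at every position, receives a much larger posterior than "TTT".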
![Page 4: Algorithms for Regulatory Motif Discovery](https://reader035.vdocuments.us/reader035/viewer/2022080900/56812d14550346895d91f563/html5/thumbnails/4.jpg)
Probabilistic model for motif discovery
• Input: a set of sequences $\{S_1, S_2, \ldots, S_N\}$, each of which is of length $w$, i.e. $|S_i| = w$ for all $i$.
• Parameter estimation problem:
  – Find the $\theta$ and $\theta_0$ that maximize $P(S_1, S_2, \ldots, S_N \mid \theta, \theta_0)$
• If $z_i = 1$ for all $i$, then $\theta_{ij} = c_{ij}/N$, where $c_{ij}$ is the number of times letter $j$ occurs at position $i$ in the set of sequences.
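For the special case in the last bullet, where every input sequence is a motif site, the estimate is just a table of column-wise letter frequencies. A minimal sketch, assuming DNA input; the function name `estimate_pwm` and the four example sites are made up for illustration.

```python
ALPHABET = "ACGT"

def estimate_pwm(sites):
    """Return a matrix with entry [i][j] = c_ij / N for aligned, equal-length sites."""
    n = len(sites)
    w = len(sites[0])
    if any(len(s) != w for s in sites):
        raise ValueError("all sites must have the same length w")
    # counts[i][j] = number of sites with letter j at position i
    counts = [[0] * len(ALPHABET) for _ in range(w)]
    for site in sites:
        for i, ch in enumerate(site):
            counts[i][ALPHABET.index(ch)] += 1
    return [[c / n for c in row] for row in counts]

# Hypothetical aligned sites (N = 4, w = 3).
sites = ["ACG", "ACG", "ATG", "CCG"]
for row in estimate_pwm(sites):
    print(row)
```

Each row of the result sums to 1. In practice a pseudocount is often added to avoid zero probabilities, but that goes beyond what the slide states.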
![Page 5: Algorithms for Regulatory Motif Discovery](https://reader035.vdocuments.us/reader035/viewer/2022080900/56812d14550346895d91f563/html5/thumbnails/5.jpg)
String matching
![Page 6: Algorithms for Regulatory Motif Discovery](https://reader035.vdocuments.us/reader035/viewer/2022080900/56812d14550346895d91f563/html5/thumbnails/6.jpg)
Exact String Matching
• Input: Two strings T[1…n] and P[1…m], containing symbols from an alphabet Σ.
Example: Σ = {A,C,G,T}
T[1…12] = “CAGTACATCGAT”
P[1…3] = “AGT”
• Goal: find all “shifts” 0 ≤ s ≤ n-m such that T[s+1…s+m] = P
![Page 7: Algorithms for Regulatory Motif Discovery](https://reader035.vdocuments.us/reader035/viewer/2022080900/56812d14550346895d91f563/html5/thumbnails/7.jpg)
Simple Algorithm
for s ← 0 to n-m
    Match ← 1
    for j ← 1 to m
        if T[s+j] ≠ P[j] then
            Match ← 0
            exit loop
    if Match = 1 then output s
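The pseudocode translates almost line by line into Python. This is a sketch with 0-based indexing, so shift s corresponds to the slice `text[s:s+m]`; the function name `naive_match` is an assumption.

```python
def naive_match(text, pattern):
    """Return all shifts s with text[s:s+m] == pattern, comparing letter by letter."""
    n, m = len(text), len(pattern)
    shifts = []
    for s in range(n - m + 1):
        for j in range(m):
            if text[s + j] != pattern[j]:
                break  # first mismatch: give up on this shift
        else:
            # inner loop finished without a mismatch
            shifts.append(s)
    return shifts

print(naive_match("CAGTACATCGAT", "AGT"))  # prints [1]
```

On the example from the earlier slide the only occurrence of "AGT" starts at shift 1.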
![Page 8: Algorithms for Regulatory Motif Discovery](https://reader035.vdocuments.us/reader035/viewer/2022080900/56812d14550346895d91f563/html5/thumbnails/8.jpg)
Analysis
• Running time of the simple algorithm:
  – Worst-case: O(nm)
  – Average-case (random text): O(n) (in expectation)
• $T_s$ = time spent on checking shift s (the number of comparisons until the first mismatch)
• $E[T_s] < 2$ (why?)
• $E[\sum_s T_s] = \sum_s E[T_s] = O(n)$
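One way to answer the "why" is sketched below. It assumes the text letters are independent and uniform over an alphabet of size $\sigma \ge 2$ (the symbol $\sigma$ is introduced here, it is not in the slides). Comparison $j$ at shift $s$ happens only if the first $j-1$ letters matched, which has probability at most $\sigma^{-(j-1)}$.

```latex
\begin{align*}
E[T_s] &= \sum_{j=1}^{m} \Pr[T_s \ge j]
        \le \sum_{j=1}^{m} \sigma^{-(j-1)}
        < \sum_{j=0}^{\infty} 2^{-j} = 2, \\
E\Big[\sum_{s=0}^{n-m} T_s\Big] &= \sum_{s=0}^{n-m} E[T_s] < 2\,(n-m+1) = O(n).
\end{align*}
```

The second line uses only linearity of expectation, so it does not require the shifts to be independent of each other.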
![Page 9: Algorithms for Regulatory Motif Discovery](https://reader035.vdocuments.us/reader035/viewer/2022080900/56812d14550346895d91f563/html5/thumbnails/9.jpg)
Worst-case
• Is it possible to achieve O(n) for any input?
  – Knuth-Morris-Pratt ’77: deterministic
  – Karp-Rabin ’81: randomized
• Digression: what is the difference between
  – Algorithm that is fast on a random input (as seen on the previous slide)
  – Randomized algorithm (as in the rest of this lecture)
![Page 10: Algorithms for Regulatory Motif Discovery](https://reader035.vdocuments.us/reader035/viewer/2022080900/56812d14550346895d91f563/html5/thumbnails/10.jpg)
Karp-Rabin Algorithm
• Idea: semi-numerical approach:
  – Consider all m-mers: T[1…m], T[2…m+1], …, T[n-m+1…n]
  – Map each T[s+1…s+m] into a number $t_s$
  – Map the pattern P[1…m] into a number p
  – Report the m-mers that map to the same value as p
• Problem: how to map all m-mers in O(n) time?
![Page 11: Algorithms for Regulatory Motif Discovery](https://reader035.vdocuments.us/reader035/viewer/2022080900/56812d14550346895d91f563/html5/thumbnails/11.jpg)
Implementation
• Attempt I:
  – Assume Σ = {0,1} (for {A,C,G,T} convert: A:00, C:01, G:10, T:11)
  – Think about each T[s+1…s+m] as a number in binary representation, i.e.,
    $t_s = T[s+1]\,2^{m-1} + T[s+2]\,2^{m-2} + \ldots + T[s+m]\,2^{0}$
  – Find a fast way of computing $t_{s+1}$ given $t_s$
  – Output all s such that $t_s$ is equal to the number p represented by P
![Page 12: Algorithms for Regulatory Motif Discovery](https://reader035.vdocuments.us/reader035/viewer/2022080900/56812d14550346895d91f563/html5/thumbnails/12.jpg)
Magic formula
• How to transform
  $t_s = T[s+1]\,2^{m-1} + T[s+2]\,2^{m-2} + \ldots + T[s+m]\,2^{0}$
  into
  $t_{s+1} = T[s+2]\,2^{m-1} + T[s+3]\,2^{m-2} + \ldots + T[s+m+1]\,2^{0}$ ?
• Three steps:
  – Subtract $T[s+1]\,2^{m-1}$
  – Multiply by 2 (i.e., shift the bits by one position)
  – Add $T[s+m+1]\,2^{0}$
• Therefore: $t_{s+1} = (t_s - T[s+1]\,2^{m-1}) \cdot 2 + T[s+m+1]\,2^{0}$
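The three-step update can be checked on a small bit string. The sketch below uses exact Python integers and 0-based indexing, so the slide's T[s+1] is `bits[s]` and T[s+m+1] is `bits[s + m]`; the function name `window_values` is an assumption.

```python
def window_values(bits, m):
    """Return [t_0, t_1, ..., t_{n-m}] for a list of 0/1 values, via the rolling update."""
    n = len(bits)
    if m > n:
        return []
    high = 1 << (m - 1)  # 2^(m-1), the weight of the leading bit
    t = 0
    for b in bits[:m]:
        t = 2 * t + b  # build t_0 by Horner's rule
    values = [t]
    for s in range(n - m):
        # subtract the leading bit, shift left by one, add the incoming bit
        t = (t - bits[s] * high) * 2 + bits[s + m]
        values.append(t)
    return values

print(window_values([1, 0, 1, 1, 0], 3))  # prints [5, 3, 6]
```

The three windows are 101, 011 and 110, i.e. 5, 3 and 6 in decimal.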
![Page 13: Algorithms for Regulatory Motif Discovery](https://reader035.vdocuments.us/reader035/viewer/2022080900/56812d14550346895d91f563/html5/thumbnails/13.jpg)
Algorithm
$t_{s+1} = (t_s - T[s+1]\,2^{m-1}) \cdot 2 + T[s+m+1]\,2^{0}$
• Can compute $t_{s+1}$ from $t_s$ using 3 arithmetic operations
• Therefore, we can compute all $t_0, t_1, \ldots, t_{n-m}$ using O(n) arithmetic operations
• We can compute a number corresponding to P using O(m) arithmetic operations
• Are we done ?
![Page 14: Algorithms for Regulatory Motif Discovery](https://reader035.vdocuments.us/reader035/viewer/2022080900/56812d14550346895d91f563/html5/thumbnails/14.jpg)
Problem
• To get O(n) time, we would need to perform each arithmetic operation in O(1) time
• However, the arguments are m bits long!
• If m is large, it is unreasonable to assume that operations on such big numbers can be done in O(1) time
• We need to reduce the number range to something more manageable
![Page 15: Algorithms for Regulatory Motif Discovery](https://reader035.vdocuments.us/reader035/viewer/2022080900/56812d14550346895d91f563/html5/thumbnails/15.jpg)
Hashing
• We will instead compute
  $t'_s = \left(T[s+1]\,2^{m-1} + T[s+2]\,2^{m-2} + \ldots + T[s+m]\,2^{0}\right) \bmod q$
  where q is an “appropriate” prime number
• One can still compute $t'_{s+1}$ from $t'_s$:
  $t'_{s+1} = \left((t'_s - T[s+1]\,2^{m-1}) \cdot 2 + T[s+m+1]\,2^{0}\right) \bmod q$
• If q is not large, we can compute all $t'_s$ (and $p'$) in O(n) time
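A minimal sketch of the modular version over the binary alphabet, with q passed in as a fixed parameter. It reports every shift whose fingerprint equals the pattern's fingerprint; the function name `fingerprint_candidates` and the choice q = 13 in the example are assumptions for illustration.

```python
def fingerprint_candidates(text_bits, pattern_bits, q):
    """Return all shifts s whose fingerprint t'_s equals the pattern fingerprint p'."""
    n, m = len(text_bits), len(pattern_bits)
    if m > n:
        return []
    high = pow(2, m - 1, q)  # 2^(m-1) mod q, computed once
    p = t = 0
    for i in range(m):
        p = (2 * p + pattern_bits[i]) % q  # p' by Horner's rule
        t = (2 * t + text_bits[i]) % q     # t'_0 by Horner's rule
    candidates = []
    for s in range(n - m + 1):
        if t == p:
            candidates.append(s)
        if s < n - m:
            # rolling update; Python's % always returns a value in [0, q)
            t = ((t - text_bits[s] * high) * 2 + text_bits[s + m]) % q
    return candidates

text = [0, 0, 0, 0, 0, 1, 0, 1]
pattern = [0, 1, 0, 1]
print(fingerprint_candidates(text, pattern, q=13))  # prints [4]
```

All intermediate values stay below 2q, so each update costs O(1) when q fits in a machine word.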
![Page 16: Algorithms for Regulatory Motif Discovery](https://reader035.vdocuments.us/reader035/viewer/2022080900/56812d14550346895d91f563/html5/thumbnails/16.jpg)
Problem
• Unfortunately, we can have false positives, i.e., T[s+1…s+m] ≠ P but $t_s \bmod q = p \bmod q$
• Our approach:
  – Use a random q
  – Show that the probability of a false positive is small
  → randomized algorithm
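The sketch below illustrates both points on a tiny input: a poorly chosen q produces false positives, and drawing q as a random prime is one way to make them unlikely. The helper names, the bound `upper`, and the optional `verify` step (an explicit comparison of each candidate, which is an addition here and not part of the slides) are all assumptions.

```python
import random

def is_prime(k):
    """Trial division; adequate for the small bounds used in this sketch."""
    if k < 2:
        return False
    i = 2
    while i * i <= k:
        if k % i == 0:
            return False
        i += 1
    return True

def random_prime(upper, rng=random):
    """Uniformly random prime in [2, upper], by rejection sampling."""
    while True:
        k = rng.randint(2, upper)
        if is_prime(k):
            return k

def karp_rabin(text_bits, pattern_bits, q=None, verify=True, upper=10**6):
    """Fingerprint matching over 0/1 lists; q defaults to a random prime."""
    n, m = len(text_bits), len(pattern_bits)
    if m > n:
        return []
    if q is None:
        q = random_prime(upper)
    high = pow(2, m - 1, q)
    p = t = 0
    for i in range(m):
        p = (2 * p + pattern_bits[i]) % q
        t = (2 * t + text_bits[i]) % q
    shifts = []
    for s in range(n - m + 1):
        # with verify=True a fingerprint hit is confirmed by direct comparison
        if t == p and (not verify or text_bits[s:s + m] == pattern_bits):
            shifts.append(s)
        if s < n - m:
            t = ((t - text_bits[s] * high) * 2 + text_bits[s + m]) % q
    return shifts

text = [0, 0, 0, 0, 0, 1, 0, 1]
pattern = [0, 1, 0, 1]
print(karp_rabin(text, pattern, q=5, verify=False))  # prints [0, 1, 4]
print(karp_rabin(text, pattern, q=5))                # prints [4]
print(karp_rabin(text, pattern))                     # prints [4]
```

With q = 5 the pattern 0101 (decimal 5) and the window 0000 both reduce to 0, so shifts 0 and 1 are false positives; only shift 4 is a real match.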
![Page 17: Algorithms for Regulatory Motif Discovery](https://reader035.vdocuments.us/reader035/viewer/2022080900/56812d14550346895d91f563/html5/thumbnails/17.jpg)
Aligning two sequences
• Longest common substring (LCS) problem (no gaps):
  – Input: Two strings T[1…n] and P[1…m], n ≥ m
  – Goal: Largest k such that T[i+1…i+k] = P[j+1…j+k] for some i, j
• How can we solve this problem efficiently?
  – Hint: How can we check if LCS has length k?
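One possible reading of the hint: if a procedure can decide whether the two strings share a common substring of length k, then the largest such k can be found by binary search, because a common substring of length k always contains one of length k-1. The sketch below implements the check with a plain Python set of k-mers rather than with fingerprints, so it is not the linear-time check; the function names are assumptions.

```python
def has_common_kmer(t, p, k):
    """True if some length-k substring occurs in both t and p."""
    if k == 0:
        return True
    kmers = {t[i:i + k] for i in range(len(t) - k + 1)}
    return any(p[j:j + k] in kmers for j in range(len(p) - k + 1))

def longest_common_substring_length(t, p):
    """Binary search for the largest k accepted by has_common_kmer."""
    lo, hi = 0, min(len(t), len(p))
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if has_common_kmer(t, p, mid):
            lo = mid      # a common mid-mer exists, try longer
        else:
            hi = mid - 1  # none exists, so no longer one exists either
    return lo

print(longest_common_substring_length("CAGTACATCGAT", "TACAGG"))  # prints 4
```

In the example the longest shared substring is "TACA". The binary search calls the check O(log m) times, so the total cost is dominated by how fast a single check can be made.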
![Page 18: Algorithms for Regulatory Motif Discovery](https://reader035.vdocuments.us/reader035/viewer/2022080900/56812d14550346895d91f563/html5/thumbnails/18.jpg)
Checking for a common m-mer