Counting position weight matrices in a sequence & an application to discriminative motif finding
Saurabh Sinha
Computer Science
University of Illinois, Urbana-Champaign
Transcriptional Regulation
[Figure: a GENE with the binding site ACAGTGA and a TRANSCRIPTION FACTOR PROTEIN]
Binding sites and motifs
Transcription factor binding sites in a gene’s neighborhood are the fundamental units of the regulatory network
Transcription factor binding is specific, so the binding sites of a factor are similar to each other, but variability is often seen
A motif is the common sequence pattern among the binding sites of a transcription factor
Motif models
Consensus string, e.g., ACGWGT
Position Weight Matrix (PWM)
Position Weight Matrix
Binding sites
ACCCGTT
ACCGGTT
ACAGGAT
ACCGGTT
ACATGAT
PWM
A 5 0 2 0 0 2 0
C 0 5 3 1 0 0 0
G 0 0 0 3 5 0 0
T 0 0 0 1 0 3 5
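As a minimal sketch of how the count matrix on this slide can be derived, the snippet below reads the binding-site string as five sites of length 7 (an assumption about how the string splits) and tallies each base per column. The function name `count_matrix` is made up for illustration.

```python
from collections import Counter

def count_matrix(sites):
    """Return {base: [count in column 1, column 2, ...]} for equal-length sites."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    # One Counter per column; a Counter returns 0 for bases it never saw.
    columns = [Counter(s[k] for s in sites) for k in range(length)]
    return {base: [col[base] for col in columns] for base in "ACGT"}

# The 35-letter string from the slide, read as five 7-mers (assumed split).
sites = ["ACCCGTT", "ACCGGTT", "ACAGGAT", "ACCGGTT", "ACATGAT"]
pwm = count_matrix(sites)
for base in "ACGT":
    print(base, *pwm[base])
```

Under that split, the tallies agree with the four rows of the PWM shown on the slide.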
Databases of PWMs
Transfac has hundreds of PWMs for human
Jaspar: a smaller, perhaps better curated database of PWMs
Organism-specific databases are appearing frequently
PWMs in databases are often derived from experimentally validated binding sites
Bioinformatics of PWMs
The PWM is a popular motif model, i.e., several motif-finding algorithms attempt to find PWMs from sequences
Gibbs sampling: one of the earliest; tries to sample PWMs with high “relative entropy”
MEME: another early algorithm; uses expectation maximization to find PWMs that best “model the sequences”
Many more algorithms exist to find PWMs from a set of sequences
Problem: counting motifs
Given a DNA sequence and a consensus motif (say “ACGWGT”), count the occurrences of the motif in the sequence
Trivial solution
What if the motif is a Position Weight Matrix (PWM)?
Why hasn’t this problem been looked at?
Because previous algorithms used different scores of PWMs: how “sharp” they are, how well they explain the data, etc.
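The “trivial solution” for a consensus motif might look like the sketch below. It assumes the standard IUPAC meaning of W (A or T), scans the forward strand only, and counts overlapping occurrences separately; the slides specify none of these choices, and `count_consensus` is a hypothetical name.

```python
# Subset of the IUPAC nucleotide codes, enough for motifs such as ACGWGT.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "W": "AT", "S": "CG", "R": "AG", "Y": "CT",
    "K": "GT", "M": "AC", "N": "ACGT",
}

def count_consensus(sequence, consensus):
    """Count windows of the sequence that match the consensus string."""
    n = len(consensus)
    return sum(
        all(sequence[i + k] in IUPAC[consensus[k]] for k in range(n))
        for i in range(len(sequence) - n + 1)
    )

# ACGAGT and ACGTGT both match ACGWGT.
print(count_consensus("TTACGAGTCCACGTGTAA", "ACGWGT"))
```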
Counting matches to a PWM: A possibility
For each site s in the sequence, compute Pr(s | W)
If Pr(s | W) > some threshold, call s a site
Count the number of sites in the sequence
No distinction between strong and weak sites, as long as they are above the threshold: a binary scheme, not realistic
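A rough sketch of this thresholding scheme follows. It assumes Pr(s | W) is the product of per-column base frequencies, and it turns the count matrix from the earlier slide into frequencies with a pseudocount of 1. The pseudocount, the threshold value and all function names are illustrative assumptions.

```python
# Count matrix from the "Position Weight Matrix" slide.
COUNTS = {
    "A": [5, 0, 2, 0, 0, 2, 0],
    "C": [0, 5, 3, 1, 0, 0, 0],
    "G": [0, 0, 0, 3, 5, 0, 0],
    "T": [0, 0, 0, 1, 0, 3, 5],
}

def frequencies(counts, pseudocount=1.0):
    """Turn per-column counts into per-column probabilities (assumed pseudocount)."""
    freq = []
    for k in range(len(counts["A"])):
        total = sum(counts[b][k] + pseudocount for b in "ACGT")
        freq.append({b: (counts[b][k] + pseudocount) / total for b in "ACGT"})
    return freq

def site_probability(site, freq):
    """Pr(s | W): product over columns of the probability of the observed base."""
    p = 1.0
    for column, base in zip(freq, site):
        p *= column[base]
    return p

def count_above_threshold(sequence, freq, threshold):
    """Binary scheme: every window above the threshold counts as exactly one site."""
    n = len(freq)
    return sum(
        site_probability(sequence[i:i + n], freq) > threshold
        for i in range(len(sequence) - n + 1)
    )

freq = frequencies(COUNTS)
print(count_above_threshold("TTACCGGTTAA", freq, 0.01))
```

A window scoring just above the threshold and a perfect match each add exactly one to the count, which is the limitation the slide points out.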
A wish-list (for the score)
Score should consider both strong and weak occurrences of motif
Score should assign appropriate weights to strong and weak occurrences
Score should be aware that there may also be sites of other known motifs in the sequence
The list goes on: score should be efficiently computable, score should be differentiable, score should …
The “w-score”
Defined by a probabilistic model of sequence generation
Given one or more motifs, and a background distribution, defines a probability space on sequences
A simple (zeroth order) Hidden Markov model (HMM)
Probabilistic Model: toy example
Given two motifs W1, W2, a “background” motif Wb, and a sequence length L
Pr(Wi → Wj) = pj (transition probability)
When in state Wi, emit a substring s chosen with probability Pr(s | Wi) (emission probability)
Stop when length of emitted sequence is L
[Figure: states W1, W2 and Wb]
A stochastic process generating sequences of length L
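The generative process in the toy example can be sketched as below. The assumptions are loud ones: the background “motif” Wb has length 1, an emission that overshoots L is simply truncated, and the example PWMs and probabilities are invented for illustration (only ACAGTGA appears on an earlier slide).

```python
import random

BASES = "ACGT"

def emit(pwm, rng):
    """Draw one substring, choosing the base of each column independently."""
    return "".join(
        rng.choices(BASES, weights=[col[b] for b in BASES])[0] for col in pwm
    )

def generate(motifs, probs, length, rng):
    """Pick a state with probability p_j regardless of the previous state
    (zeroth order), emit a substring from it, and repeat until length is reached."""
    names = list(motifs)
    weights = [probs[name] for name in names]
    sequence, path = "", []
    while len(sequence) < length:
        state = rng.choices(names, weights=weights)[0]
        sequence += emit(motifs[state], rng)
        path.append(state)
    return sequence[:length], path  # truncation of any overshoot is an assumption

def column(favored, p=0.85):
    """A PWM column that prefers one base (illustrative values)."""
    return {b: p if b == favored else (1 - p) / 3 for b in BASES}

motifs = {
    "W1": [column(b) for b in "ACAGTGA"],
    "W2": [column(b) for b in "TTGACA"],   # hypothetical second motif
    "Wb": [{b: 0.25 for b in BASES}],      # uniform background, length 1
}
probs = {"W1": 0.05, "W2": 0.05, "Wb": 0.90}

sequence, path = generate(motifs, probs, 50, random.Random(0))
print(sequence)
print(path)
```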
A “path” through the HMM
[Figure: two possible paths, T1 and T2, each a series of states drawn from W1, W2 and Wb]
Likelihood of sequence & paths
A path of the HMM defines the locations of motif matches
For a sequence S & a path T, the joint probability Pr(S,T) is easy to compute
Conditional probability of a path T, given the data S, is: $\Pr(T \mid S) = \Pr(S, T) / \sum_{T'} \Pr(S, T')$
Strong matches make the probability higher
Paths with weak matches have lower conditional probabilities
[Figure: an example path through states W1, W2 and Wb]
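One way to write out the joint probability, as a sketch under the toy model's assumptions: suppose path T visits states t_1, …, t_m and state t_i emits substring s_i, so that s_1 s_2 … s_m = S. Ignoring the stopping condition, each step contributes one transition probability and one emission probability. The indices t_i, s_i and m are notation introduced here, not taken from the slides.

```latex
\Pr(S, T) \;=\; \prod_{i=1}^{m} p_{t_i} \, \Pr\!\left(s_i \mid W_{t_i}\right)
```

A strong match makes its emission term large, which is why paths that place a motif on a strong site receive a higher conditional probability.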
The “w-score”
Let the number of occurrences of a motif (say W1) in path T be $n_{W_1}(T)$
Compute: $\sum_T n_{W_1}(T) \, \Pr(T \mid S)$
In words: An average of the motif count $n_{W_1}(T)$, with weights equal to the probability of T given S
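The weighted average above can be computed without listing every path by using a forward/backward recursion over sequence positions. The sketch below is one such computation, not necessarily the author's implementation. It assumes paths tile the sequence exactly, that the joint probability is a product of p_j and Pr(s | W_j) terms, and it applies no numerical scaling, so it is only suitable for short sequences. All names and example parameters are illustrative.

```python
BASES = "ACGT"

def emission(site, pwm):
    """Pr(s | W): product of per-column base probabilities."""
    p = 1.0
    for col, base in zip(pwm, site):
        p *= col[base]
    return p

def w_score(seq, motifs, probs, target):
    """Expected number of occurrences of `target` over all paths, given seq."""
    L = len(seq)

    def weight(name, i):
        # Contribution of placing motif `name` on seq[i : i + len(motif)].
        pwm = motifs[name]
        if i + len(pwm) > L:
            return 0.0
        return probs[name] * emission(seq[i:i + len(pwm)], pwm)

    # forward[i]: total weight of all ways to generate seq[:i]
    forward = [0.0] * (L + 1)
    forward[0] = 1.0
    for i in range(1, L + 1):
        for name, pwm in motifs.items():
            start = i - len(pwm)
            if start >= 0:
                forward[i] += forward[start] * weight(name, start)

    # backward[i]: total weight of all ways to generate seq[i:]
    backward = [0.0] * (L + 1)
    backward[L] = 1.0
    for i in range(L - 1, -1, -1):
        for name, pwm in motifs.items():
            end = i + len(pwm)
            if end <= L:
                backward[i] += weight(name, i) * backward[end]

    # Posterior probability that `target` starts at i, summed over all i.
    m = len(motifs[target])
    expected = sum(
        forward[i] * weight(target, i) * backward[i + m] for i in range(L - m + 1)
    )
    return expected / forward[L]

def column(favored, p=0.85):
    return {b: p if b == favored else (1 - p) / 3 for b in BASES}

motifs = {"W1": [column(b) for b in "ACAGTGA"], "Wb": [{b: 0.25 for b in BASES}]}
probs = {"W1": 0.1, "Wb": 0.9}
strong = w_score("TTACAGTGATT", motifs, probs, "W1")  # perfect match to W1
weak = w_score("TTACCGTGATT", motifs, probs, "W1")    # one mismatch
print(strong, weak)
```

In this example the perfect match yields a higher score than the site with one mismatch, so the score reflects match quality rather than a yes/no call.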
The “w-score” (Cont’d)
Score depends both on number and quality of matches to motif.
Every substring is a potential binding site, and paths placing the motif there will contribute to the count
Pr(T | S) depends on the match strength of all motifs, not just the one being counted
The wish-list (again)
Score should consider both strong and weak occurrences of motif
Score should assign appropriate weights to strong and weak occurrences
Score should be aware that there may also be sites of other known motifs in the sequence
An exciting new feature of this motif score
Computational pros and cons
The time to compute the w-score depends on L, the sequence length, and lm, the motif length. This is relatively expensive
The w-score can be differentiated with respect to all of the PWM parameters with the same time complexity
An important feature for search algorithms
Using the “w-score” in discriminative motif finding
Discriminative motif finding
Suppose we have a set of co-regulated genes, i.e., we believe they have binding sites of the same transcription factor (in their regulatory control regions)
Traditionally, motif finding tries to find these binding sites, based on over-representation, conservation, etc.
Often we also know a set of genes that should NOT have binding sites of that transcription factor
Examples: ChIP-on-chip, in situ hybridization pictures of the Drosophila embryo, etc.
Problem formulation
Given two sets of sequences S+ and S-
Find a motif that has many occurrences in S+ and few occurrences in S-
Maximize the difference in the average counts of the motif in the two sets
Let W(S) = count of a motif W in sequence S
Maximize: $\frac{1}{|S^+|} \sum_{S \in S^+} W(S) \; - \; \frac{1}{|S^-|} \sum_{S \in S^-} W(S)$
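The objective is just a difference of two averages, so it can be sketched independently of how the count W(S) is obtained. Below, `count` is any function from a sequence to a number. The exact-substring counter in the example is only a stand-in for the w-score, and all names are illustrative.

```python
def objective(count, positives, negatives):
    """Average count over S+ minus average count over S-."""
    mean_pos = sum(count(s) for s in positives) / len(positives)
    mean_neg = sum(count(s) for s in negatives) / len(negatives)
    return mean_pos - mean_neg

# Stand-in count: non-overlapping occurrences of a fixed string.
def toy_count(sequence):
    return sequence.count("ACGT")

s_plus = ["ACGTACGT", "TTACGTTT"]   # counts 2 and 1, average 1.5
s_minus = ["TTTTTTTT", "ACGTTTTT"]  # counts 0 and 1, average 0.5
print(objective(toy_count, s_plus, s_minus))
```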
Optimization problem
Find motif W that maximizes the difference between its average count W(S) over S+ and its average count over S-
Derivatives of objective function
Let $W_{\alpha k}$ be the PWM entry for base $\alpha$ in column k
We can efficiently compute $\partial W(S) / \partial W_{\alpha k}$
We can efficiently differentiate our objective function
Algorithm
Search space: Set of n = 20 substrings of sequences in S+ (called “site-set”)
Objective function: Construct PWM W from site-set, compute score
Length of sites is user-defined
Algorithm
[Figure: S+ and the current site-set C]
Algorithm
[Figure: sequences S+]
Replace one site with any site from a sequence
Pick a replacement that improves the objective function
Algorithm
Current solution (site-set): C
Candidate new solution: C'
Many possibilities for C' (every substring of every sequence in S+ is a possible replacement)
Evaluate objective function on each candidate C': too slow!
Use derivative information!
Algorithm
Estimate the objective function value for each candidate C' using partial derivatives and a first-order approximation
Examine each candidate in decreasing order of estimated score
If a candidate C' is found with greater score than C, choose it.
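One step of this derivative-guided search could be sketched as follows. To stay self-contained it swaps in a much simpler differentiable score (the sum of Pr(window | W) over all windows) for the w-score, and forward differences for the efficient derivative computation. The pseudocount, the `max_exact` cap and every name here are assumptions made for illustration.

```python
BASES = "ACGT"

def pwm_from_sites(sites, pseudo=0.5):
    """Column-wise base frequencies of the site-set, with a pseudocount."""
    n = len(sites)
    return [[(sum(s[k] == b for s in sites) + pseudo) / (n + 4 * pseudo)
             for b in BASES] for k in range(len(sites[0]))]

def soft_count(seq, pwm):
    """Stand-in for the w-score: sum of Pr(window | W) over all windows."""
    m, total = len(pwm), 0.0
    for i in range(len(seq) - m + 1):
        p = 1.0
        for k in range(m):
            p *= pwm[k][BASES.index(seq[i + k])]
        total += p
    return total

def objective(pwm, pos, neg):
    """Average count over S+ minus average count over S-."""
    return (sum(soft_count(s, pwm) for s in pos) / len(pos)
            - sum(soft_count(s, pwm) for s in neg) / len(neg))

def gradient(pwm, pos, neg, eps=1e-6):
    """Forward-difference partial derivative for every PWM entry."""
    base = objective(pwm, pos, neg)
    grad = [[0.0] * 4 for _ in pwm]
    for k in range(len(pwm)):
        for b in range(4):
            old = pwm[k][b]
            pwm[k][b] = old + eps
            grad[k][b] = (objective(pwm, pos, neg) - base) / eps
            pwm[k][b] = old  # restore the entry before moving on
    return grad

def improve_once(sites, pos, neg, max_exact=50):
    """Try single-site replacements, best first-order estimate first."""
    m = len(sites[0])
    pwm = pwm_from_sites(sites)
    current = objective(pwm, pos, neg)
    grad = gradient(pwm, pos, neg)
    pool = sorted({s[i:i + m] for s in pos for i in range(len(s) - m + 1)})
    candidates = []
    for idx in range(len(sites)):
        for new in pool:
            if new == sites[idx]:
                continue
            trial = sites[:idx] + [new] + sites[idx + 1:]
            trial_pwm = pwm_from_sites(trial)
            # First-order estimate: f(W') ~ f(W) + grad . (W' - W)
            estimate = current + sum(
                grad[k][b] * (trial_pwm[k][b] - pwm[k][b])
                for k in range(m) for b in range(4))
            candidates.append((estimate, trial))
    candidates.sort(key=lambda c: c[0], reverse=True)
    for _, trial in candidates[:max_exact]:
        exact = objective(pwm_from_sites(trial), pos, neg)
        if exact > current:  # accept the first candidate that truly improves
            return trial, exact
    return sites, current

pos = ["TTACGTACTT", "GGACGTACGG", "CCACGTACAA"]
neg = ["TTTTTTTTTT", "GGGGGGGGGG", "CCAACCAACC"]
start = ["ACGTAC", "ACGTAC", "TTACGT"]
start_score = objective(pwm_from_sites(start), pos, neg)
new_sites, new_score = improve_once(start, pos, neg)
print(start_score, "->", new_score, new_sites)
```

Calling `improve_once` repeatedly until the site-set stops changing would give a simple hill-climbing loop. The exact objective is only evaluated on candidates in estimate order, which is where the derivative saves work.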
Algorithm illustration
[Figure: candidates listed by estimated score, with accurate scores 11, 10 and 13; current score = 12]
Algorithm Properties
The objective function has many desirable properties, but is expensive to compute
Derivative computation has the same time complexity, and is used to guide the search
Avoids local optima by searching in a discretized PWM space
Performs significantly better and/or faster than Gibbs sampling and Conjugate Gradients, for this particular score
Discriminative PWM Search (DIPS)
Software available
Can easily handle data sets of ~100 sequences
Can find multiple motifs iteratively, but without masking: find a PWM, then include it in the model as a known PWM, find another PWM, and so on
Performance tests
Tested on synthetic data
Compared to a traditional motif finder as well as two discriminative motif finders
Superior performance in the presence of “distractor” motifs: it really helps to be able to count a motif in the presence of other known motifs
Tests on Drosophila Enhancers
[Figure: BICOID (ACTIVATOR) protein concentration along the embryo, from HEAD to TAIL]
Tests on Drosophila Enhancers
[Figure: CAUDAL (ACTIVATOR) protein concentration along the embryo, from HEAD to TAIL]
DIPS runs
S+ = promoters of genes expressed in anterior half of embryo
S- = promoters of genes expressed in posterior half of embryo
Top motif: Bicoid!
[Figure: BICOID (ACTIVATOR) protein concentration along the embryo, from HEAD to TAIL]
DIPS runs
S+ = promoters of genes expressed in posterior half of embryo
S- = promoters of genes expressed in anterior half of embryo
Top motif: Caudal!
[Figure: CAUDAL (ACTIVATOR) protein concentration along the embryo, from HEAD to TAIL]
Summary of results

| S+ | S- | Found motif | Best match | P-value of match |
|---|---|---|---|---|
| anterior 50% | posterior 50% | anterior.1 | bicoid | 0.00 |
| posterior 50% | anterior 50% | posterior.1 | caudal | 0.03 |
| terminal 40% | middle 80% | terminal.1 | torRE | 0.00 |
| middle 40% | 0-30%, 70-100% | centrall.1 | hunchback | 0.00 |
| 80-100% EL | 60-80% EL | 5.1.1 | torRE | 0.10 |
| 60-80% EL | 80-100%, 40-60% | 5.2.1 | caudal | 0.06 |
| 40-60% EL | 60-80%, 20-40% | 5.3.1 | kruppel | 0.25 |
| 20-40% EL | 0-20%, 40-60% | 5.4.1 | knirps | 0.03 |
| 0-20% EL | 20-40% EL | 5.5.1 | Dichaete | 0.07 |
| anterior 50% | posterior 50% | anterior.2 | huckebein | 0.02 |
| posterior 50% | anterior 50% | posterior.2 | pdm1_2 | 0.00 |
| terminal 40% | middle 80% | terminal.2 | caudal | 0.06 |
| middle 40% | 0-30%, 70-100% | centrall.2 | huckebein | 0.12 |
| 80-100% EL | 60-80% EL | 5.1.2 | knirps | 0.01 |
| 60-80% EL | 80-100%, 40-60% | 5.2.2 | torRE | 0.01 |
| 40-60% EL | 60-80%, 20-40% | 5.3.2 | giant | 0.09 |
| 20-40% EL | 0-20%, 40-60% | 5.4.2 | giant | 0.05 |
| 0-20% EL | 20-40% EL | 5.5.2 | bicoid | 0.18 |

Phase: 1 (activator), 2 (repressor), 3 (activator), 4 (repressor)
Social regulation in honey bee
Transition from nursing in the hive to foraging for food is age-related, but also regulated by the needs of the colony
32 genes demonstrated to be significantly differentially expressed in brains of nurses and foragers (21 active in foragers only, 11 active in nurses only)
DIPS run on 2 kbp promoters of these social behavior-related genes
Results on honey bee genes
Conclusion
Discriminative motif finding is increasingly becoming a necessary analysis
Motif finding in the presence of other known motifs is also becoming relevant
A search algorithm that maximizes any objective function of the motif counts in the sequences (as long as it is differentiable)
Several extensions and variations possible
Acknowledgements
Eric Siggia, Eran Segal
Yoseph Barash (“LearnPSSM”)
Andrew Smith (“DME”)
Reference
ISMB 2006 (Brazil); Bioinformatics journal.