Pattern Discovery and Recognition for Understanding Genetic Regulation
Timothy L. Bailey
Institute for Molecular Bioscience
University of Queensland
Recent Work
Identifying statistically significant regulatory modules
Computing motif statistics
Evaluation of motif discovery algorithms
Future directions: motif discovery in sets of orthologous sequences
Identifying Statistically Significant Regulatory Modules
Overview of the problem
Previous research
The MCAST algorithm
Validation
Discussion
Problem Statement
Given a set of one or more motifs, can we identify the genes that they regulate by searching a genomic database?
The Problem is Hard
The futility theorem: the vast majority of potential TF binding sites are false positives (Wasserman).
This is because TF binding sites are short and degenerate, so they occur frequently at random in DNA.
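A rough back-of-the-envelope sketch of why short, degenerate sites match so often by chance. It assumes independent, equally likely bases, which real genomes do not satisfy; the function name `expected_random_hits` and its parameters are hypothetical and chosen only for illustration.

```python
from math import prod

def expected_random_hits(genome_length, allowed_bases_per_position, both_strands=True):
    """Expected number of chance matches to a degenerate motif in random DNA.

    allowed_bases_per_position: for each motif column, how many of the four
    bases are accepted (1 = fully specified, 4 = any base).
    Assumes independent, equally likely bases (an illustrative simplification).
    """
    width = len(allowed_bases_per_position)
    # Probability that a random window of this width matches the motif.
    match_prob = prod(k / 4 for k in allowed_bases_per_position)
    # Number of windows that can be placed on one strand.
    positions = max(genome_length - width + 1, 0)
    strands = 2 if both_strands else 1
    return strands * positions * match_prob

# A 6 bp site with two two-fold degenerate positions, scanned over 1 Mb.
hits = expected_random_hits(1_000_000, [1, 1, 2, 1, 2, 1])
print(f"{hits:.0f} chance matches expected")
```

Under these assumptions, a single megabase already yields roughly two thousand chance matches on the two strands, far more than the number of functional sites one would expect there.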
The Approach
Groups of transcription factors often operate in concert, binding near each other.
Multiple binding sites for the same TF often occur close together.
Whereas individual binding sites cannot be statistically significant, clusters may be.
MCAST
Hybrid of Cisanalyst and COMET.
Based on Meta-MEME (Grundy et al., CABIOS 13:397-406, 1997).
MCAST has two input parameters: the motif p-value threshold (p) and the maximum gap size (L).
MCAST builds a motif-based HMM and uses the Viterbi algorithm to find clusters.
Definition of a Motif Cluster
A “cluster” is a collection of “hits” (matches to motifs) with no gaps longer than L.
Hits are shown schematically as beads on a string. The number is the motif identifier. +/- indicates which DNA strand the hit is on.
+3 -2 +1 +1
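The definition above can be illustrated with a small greedy grouping of hits. This is only a sketch of the definition, not of MCAST itself, which finds clusters with an HMM and the Viterbi algorithm; the `Hit` record, the `group_hits` function and the coordinates are hypothetical.

```python
from typing import List, NamedTuple

class Hit(NamedTuple):
    start: int      # 0-based start of the match
    end: int        # end of the match (exclusive)
    motif: int      # motif identifier
    strand: str     # "+" or "-"
    score: float    # hit score

def group_hits(hits: List[Hit], max_gap: int) -> List[List[Hit]]:
    """Split hits into clusters so that no gap inside a cluster exceeds max_gap."""
    clusters: List[List[Hit]] = []
    for hit in sorted(hits, key=lambda h: h.start):
        # Extend the current cluster if the gap to its last hit is small enough.
        if clusters and hit.start - clusters[-1][-1].end <= max_gap:
            clusters[-1].append(hit)
        else:
            clusters.append([hit])
    return clusters

# Made-up hits: four close together, one far downstream.
hits = [Hit(100, 110, 3, "+", 5.0), Hit(130, 140, 2, "-", 4.0),
        Hit(160, 170, 1, "+", 6.0), Hit(175, 185, 1, "+", 3.5),
        Hit(900, 910, 2, "+", 4.2)]
for cluster in group_hits(hits, max_gap=50):
    print(" ".join(f"{h.strand}{h.motif}" for h in cluster))
```

With a maximum gap of 50, the first four hits form one cluster written in the same beads-on-a-string notation as the slide, and the distant fifth hit forms a cluster of its own.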
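Cluster Scoring Function
[Figure: one cluster on genomic DNA, with hits h1, h2, h3, h4 separated by gaps of width d2, d3, d4.]
$$M = h_1 + h_2 + h_3 + h_4 - g\,(d_2 + d_3 + d_4)$$
where h1 ... h4 are the hit scores, g is the gap penalty and d2 ... d4 are the gap widths.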
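A minimal sketch of the scoring function, reading the slide's labels (hit scores, gap penalty, gap widths) as "sum of hit scores minus the gap penalty times the total gap width". That reading, the function name `cluster_score` and the numbers below are assumptions for illustration.

```python
def cluster_score(hit_scores, gap_widths, gap_penalty):
    """Sum of hit scores minus gap_penalty times the total gap width.

    hit_scores: scores h_1..h_k of the hits in one cluster.
    gap_widths: widths d_2..d_k of the gaps between consecutive hits.
    """
    if len(gap_widths) != len(hit_scores) - 1:
        raise ValueError("need exactly one gap between each pair of consecutive hits")
    return sum(hit_scores) - gap_penalty * sum(gap_widths)

# Four hits separated by three gaps (made-up values).
score = cluster_score([5.0, 4.0, 6.0, 3.5], [20, 20, 5], gap_penalty=0.1)
print(score)
```

Under this reading, tightly packed strong hits score highest, and a larger gap penalty favours shorter, denser clusters.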
Performance metrics
ROC50 measures the area under a curve that plots true positive rate as a function of false positive rate, up to the 50th false positive.
KB60 is the average number of kilobases per false positive at a threshold that yields 60% sensitivity.
For both metrics, larger is better.
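Both metrics can be sketched from a ranked list of database sequences. The normalisation used for `roc_n` below (the common ROC-n convention of dividing by the number of false positives seen times the number of positives) and the handling of ties and of zero false positives are assumptions of this sketch; the function names are hypothetical.

```python
import math

def roc_n(ranked_labels, n=50):
    """Normalised area under the ROC curve up to the n-th false positive.

    ranked_labels: True for positives, False for negatives, best score first.
    """
    total_pos = sum(ranked_labels)
    if total_pos == 0:
        raise ValueError("need at least one positive")
    tp = fp = area = 0
    for is_pos in ranked_labels:
        if is_pos:
            tp += 1
        else:
            fp += 1
            area += tp          # true positives ranked above this false positive
            if fp == n:
                break
    # With no false positives at all, the ranking is treated as perfect.
    return area / (fp * total_pos) if fp else 1.0

def kb_per_false_positive(ranked_labels, total_kb, sensitivity=0.6):
    """Kilobases searched per false positive at the given sensitivity."""
    total_pos = sum(ranked_labels)
    if total_pos == 0:
        raise ValueError("need at least one positive")
    needed = math.ceil(sensitivity * total_pos)
    tp = fp = 0
    for is_pos in ranked_labels:
        if is_pos:
            tp += 1
            if tp >= needed:
                break
        else:
            fp += 1
    return total_kb / fp if fp else math.inf

ranking = [True, False, True, False, True]
print(roc_n(ranking, n=2), kb_per_false_positive(ranking, total_kb=100))
```

When no false positive occurs before the sensitivity target is reached, this sketch returns infinity; a results table would more naturally report a lower bound such as "greater than the database size".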
Four Data sets
Drosophila Eve regulators (Bcd, Cad, Hb, Kr, Kni). 19 positives and 2039 putative negatives.
Human LSF-regulated promoters (LSF, Sp1, Ets, TATA). 9 positives and 2005 putative negatives.
Human muscle-specific promoters (Mef-2, Myf, SRF, Tef, Sp1). 27 positives and 2005 putative negatives.
Muscle*: motifs generated without muscle-specific genes.
Comparison with COMET
            MCAST           COMET
            KB60    ROC50   KB60    ROC50
Drosoph     >4041   0.68    1010    0.61
LSF         167     0.44    85      0.35
muscle      30      0.38    69      0.46
muscle*     14      0.16    6       0.25
Red indicates better performance.
Computing motif statistics
Looking for fast ways to compute the probability of a local, multiple alignment.
Objective function of the latest version of the MEME algorithm.
Computing the statistics of random alignments
Knowing the statistical significance of motifs makes it possible to distinguish “real” motifs from patterns that can be explained by chance.
Computing motif significance is therefore critical to any motif discovery approach.
Measuring the goodness of DNA regulatory motifs: IC
[Figure: from sequences to alignment to counts to frequencies to information content.]
Sequences
5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT …HIS7
5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG …ARO4
5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT …ILV6
5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC …THR4
5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA …ARO1
5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA …HOM2
5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …PRO3
Alignment (columns 1, 2, …, w)
1  GACATCGAAA
2  GCACTTCGGC
   GAGTCATTAC
   GTAAATTGTC
   CCACAGTCCG
N  TGTGAAGCAC
Counts: nij (indexed by i and j)
Frequencies: fij = nij/N
Information Content: IC = IC1 + … + ICw
POP: product of IC p-values
IC is the sum of the information contents of the motif columns.
POP is an alternative measure of motif quality: the product of the p-values of the column information contents.
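The slides do not spell out the per-column formula, so the sketch below assumes a common choice: the relative entropy (in bits) of the column frequencies against a uniform background, with no pseudocounts or small-sample correction. The function name `column_information` is hypothetical; the six sites are the alignment rows shown on the IC slide.

```python
import math
from collections import Counter

def column_information(alignment, background=None):
    """Per-column information content (bits) of an ungapped DNA alignment.

    Assumes relative entropy against the background frequencies
    (uniform by default); this formula is an assumption of the sketch.
    """
    background = background or {base: 0.25 for base in "ACGT"}
    n_seqs = len(alignment)
    ics = []
    for column in zip(*alignment):            # iterate over alignment columns
        counts = Counter(column)              # counts of each letter in the column
        ic = sum((n / n_seqs) * math.log2((n / n_seqs) / background[base])
                 for base, n in counts.items())
        ics.append(ic)
    return ics

sites = ["GACATCGAAA", "GCACTTCGGC", "GAGTCATTAC",
         "GTAAATTGTC", "CCACAGTCCG", "TGTGAAGCAC"]
per_column = column_information(sites)
print([round(ic, 2) for ic in per_column])
print("IC =", round(sum(per_column), 2))      # IC is the sum over the w columns
```

POP would replace the final sum with a product of per-column p-values, which requires the null distribution of each column's information content and is not computed here.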
Statistics of IC scores
A large deviation (LD) method for computing the distribution of the IC of random alignments is known (Hertz and Stormo, Bioinformatics, 15:563-577, 1999).
The time to compute the p-value of one IC score is O(N^2).
MEME computes O(w^2 N) IC scores per motif, so the total time, O(w^2 N^3), is prohibitive.
POP p-values can be computed efficiently.
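As a hedged illustration of why a product of p-values is cheap to evaluate: for w independent p-values that are exactly uniform on (0, 1), the probability that their product is at most p has the standard closed form $p \sum_{i=0}^{w-1} (-\ln p)^i / i!$. The sketch below evaluates it in O(w) time. It is not the corrected POP p-value described in the talk, which additionally accounts for the discreteness of IC p-values; the function name `product_pvalue` is hypothetical.

```python
import math

def product_pvalue(p_values):
    """P(product of n independent Uniform(0,1) variables <= observed product).

    Each p-value must be greater than 0.
    """
    n = len(p_values)
    log_prod = sum(math.log(p) for p in p_values)   # ln of the product, <= 0
    # Accumulate sum_{i=0}^{n-1} (-ln p)^i / i! term by term.
    term = total = 1.0
    for i in range(1, n):
        term *= -log_prod / i
        total += term
    return math.exp(log_prod) * total

# Two columns with p-values of 0.5 each: the product is 0.25.
print(product_pvalue([0.5, 0.5]))
```

Working with the logarithm of the product avoids underflow when many small column p-values are multiplied together.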
Correction factor for POP p-values
The p-value of a POP score, p, is roughly:
Because of the discrete nature of IC p-values, it is necessary to correct the POP p-values.
Empirically, the p-value error for POP, p, letting x = ln(p), is about
Estimating the POP p-value correction factor parameters
To estimate the correction factor parameters we:
  estimate the right tail of the distribution using a convolution method,
  fit the (non-linear) correction function to the tail of the distribution using a least squares approach.
The CPU time per motif to compute POP p-values is negligible once the correction factor parameters are known.
[Figure: CPU time per motif using the LD method to compute p-values (w=16).]
[Figure: CPU time to estimate correction factor parameters (w=16).]
[Figure: Speedup using the POP statistic.]
Discovering regulatory elements in orthologous genes
De novo discovery of most known regulatory elements in yeast has been demonstrated using four closely related yeast genomes (Kellis et al., Nature 423:241-254, 2003).
We are exploring the possibility of extending their approach to the human genome using orthologous genes from mouse.
Evaluation of motif discovery algorithms
Joint work with Martin Tompa and others.
Eighteen motif discovery algorithms were evaluated on DNA regulatory motifs in four organisms.
Each algorithm was run by experts in that particular algorithm.
The ability of the algorithm to discover motifs in sets of DNA sequences was measured.
Performance of Motif Discovery Algorithms Finding Regulatory Motifs
[Figure: nCC categorized by species. Vertical axis: combined nCC, from -0.05 to 0.45. Categories: whole set, fly, human, mouse, yeast.]
Conservation of known regulatory elements in sets of orthologous genes
[Figure: two panels, "Human vs. Mouse" and "Four yeast species", each comparing regulatory elements with background sequences.]
Source: Liu et al., Genome Res 14:451-458, 2004.
Large-scale discovery of human regulatory elements
Compared with yeast, regulatory elements make up less of human intergenic DNA (3% vs. 15%).
The relative difference in conservation rate (window percent identity) between human and mouse regulatory elements and background sequence is higher than among the four yeast species.
Large-scale motif discovery should be possible using human and mouse orthologous genes.
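One plausible reading of "window percent identity" is sketched below for a pairwise alignment split into non-overlapping windows. The windowing scheme, the treatment of gap columns as mismatches and the function name `window_percent_identity` are assumptions of this sketch and may differ from the definition used in the cited study.

```python
def window_percent_identity(seq_a, seq_b, window):
    """Percent identity in each non-overlapping window of a pairwise alignment.

    seq_a, seq_b: equal-length aligned strings; '-' marks a gap.
    Gap columns count as mismatches (an assumption of this sketch).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    result = []
    for start in range(0, len(seq_a) - window + 1, window):
        pairs = zip(seq_a[start:start + window], seq_b[start:start + window])
        matches = sum(a == b and a != "-" for a, b in pairs)
        result.append(100.0 * matches / window)
    return result

# Toy alignment: the first window is fully conserved, the second only half.
print(window_percent_identity("ACGTACGT", "ACGTTCGA", window=4))
```

Comparing such window scores inside annotated regulatory elements with scores in flanking background sequence gives the kind of contrast the slide describes.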