Pattern Discovery and Recognition for Understanding Genetic Regulation

Timothy L. Bailey

Institute for Molecular Bioscience

University of Queensland

Recent Work

Identifying statistically significant regulatory modules

Computing motif statistics

Evaluation of motif discovery algorithms

Future directions: motif discovery in sets of orthologous sequences

Identifying Statistically Significant Regulatory Modules

Overview of the problem

Previous research

The MCAST algorithm

Validation

Discussion

Problem Statement

Given a set of one or more motifs, can we identify the genes that they regulate by searching a genomic database?

The Problem is Hard

The futility theorem: the vast majority of potential TF binding sites are false positives (Wasserman).

This is because TF binding sites are short and degenerate, so they occur frequently at random in DNA.

The Approach

Groups of transcription factors often operate in concert, binding near each other.

Multiple binding sites for the same TF often occur close together.

Whereas individual binding sites cannot be statistically significant, clusters may be.

MCAST

Hybrid of Cisanalyst and COMET

Based on Meta-MEME (Grundy et al., CABIOS 13:397-406, 1997)

MCAST has two input parameters:

Motif p-value threshold (p)

Maximum gap size (L)

MCAST builds a motif-based HMM and uses the Viterbi algorithm to find clusters.

Definition of a Motif Cluster

A “cluster” is a collection of “hits” (matches to motifs) with no gaps longer than L.

Hits are shown schematically as beads on a string. The number is the motif identifier. +/- indicates which DNA strand the hit is on.

+3 -2 +1 +1
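The cluster definition above can be illustrated with a small sketch. It is not MCAST's method (MCAST uses an HMM and the Viterbi algorithm); it is only a greedy grouping of hits under the "no gap longer than L" rule. The `Hit` type, the function name, and the hit positions are hypothetical; only the motif identifiers and strands (+3 -2 +1 +1) come from the slide.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    start: int   # 0-based start position on the sequence (hypothetical layout)
    end: int     # end position, exclusive
    motif: int   # motif identifier
    strand: str  # "+" or "-"


def group_into_clusters(hits, max_gap):
    """Greedily group hits so that no gap inside a cluster exceeds max_gap."""
    clusters = []
    cluster_end = None  # rightmost position covered by the current cluster
    for hit in sorted(hits, key=lambda h: h.start):
        if clusters and hit.start - cluster_end <= max_gap:
            clusters[-1].append(hit)
            cluster_end = max(cluster_end, hit.end)
        else:
            clusters.append([hit])
            cluster_end = hit.end
    return clusters


# Hits +3 -2 +1 +1 as on the slide; the coordinates are made up.
example_hits = [
    Hit(100, 110, 3, "+"),
    Hit(130, 140, 2, "-"),
    Hit(150, 160, 1, "+"),
    Hit(400, 410, 1, "+"),
]
example_clusters = group_into_clusters(example_hits, max_gap=50)
```

With a maximum gap of 50, the first three hits form one cluster and the last hit, 240 positions further on, starts a new one.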

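Cluster Scoring Function

[Figure: one cluster on genomic DNA, with hits h1 to h4 separated by gaps d2, d3, d4.]

$$M = h_1 + h_2 + h_3 + h_4 - g\,(d_2 + d_3 + d_4)$$

Here h_1 to h_4 are the hit scores, g is the gap penalty, and d_2, d_3, d_4 are the gap widths.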
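The labels on this slide (hit scores, gap penalty, gap widths) suggest a score equal to the sum of the hit scores minus the gap penalty times the total gap width. Assuming that form, a minimal sketch follows; the function name and the numbers are illustrative and do not come from MCAST.

```python
def cluster_score(hit_scores, gap_widths, gap_penalty):
    """Sum of hit scores minus gap_penalty times the total gap width.

    Assumes one gap between each pair of adjacent hits, so there must be
    exactly len(hit_scores) - 1 gap widths.
    """
    if len(gap_widths) != len(hit_scores) - 1:
        raise ValueError("expected one gap width between each pair of hits")
    return sum(hit_scores) - gap_penalty * sum(gap_widths)


# Four hits and three gaps, as in the slide's schematic (values are made up).
example_score = cluster_score([12, 9, 15, 8], [10, 20, 5], gap_penalty=0.5)
```

Under this form, widely spaced hits are penalized, so a tight group of moderate hits can outscore a loose group of strong ones.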

Performance metrics

ROC50 measures the area under a curve that plots true positive rate as a function of false positive rate, up to the 50th false positive.

KB60 is the average number of kilobases per false positive at a threshold that yields 60% sensitivity.

For both metrics, larger is better.
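As a rough illustration of the ROC50 idea, the sketch below computes a truncated ROC area from a ranked list of labels. It assumes the common ROC-n convention (for each of the first n false positives, count the true positives ranked above it, then normalize by n times the number of positives); the slides do not say which normalization was used, and the function name is hypothetical.

```python
def roc_n(labels_ranked, n=50):
    """Truncated ROC area, between 0 and 1.

    labels_ranked: booleans ordered from best score to worst,
    True for a known positive and False for a (putative) negative.
    """
    total_pos = sum(labels_ranked)
    total_neg = len(labels_ranked) - total_pos
    if total_pos == 0 or total_neg == 0:
        raise ValueError("need at least one positive and one negative")
    n_eff = min(n, total_neg)  # fewer than n negatives: use all of them
    true_pos = false_pos = area = 0
    for is_positive in labels_ranked:
        if is_positive:
            true_pos += 1
        else:
            false_pos += 1
            area += true_pos  # positives ranked above this false positive
            if false_pos == n_eff:
                break
    return area / (n_eff * total_pos)


example_roc = roc_n([True, True, False, True, False], n=2)
```

A perfect ranking gives 1.0 and a ranking with all negatives first gives 0.0, in line with "larger is better".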

Four Data sets

Drosophila Eve regulators (Bcd, Cad, Hb, Kr, Kni). 19 positives and 2039 putative negatives.

Human LSF-regulated promoters (LSF, Sp1, Ets, TATA). 9 positives and 2005 putative negatives.

Human muscle-specific promoters (Mef-2, Myf, SRF, Tef, Sp1). 27 positives and 2005 putative negatives.

Muscle* - motifs generated without muscle-specific genes.

Comparison with COMET

Data set    MCAST KB60    MCAST ROC50    COMET KB60    COMET ROC50
Drosoph     >4041         0.68           1010          0.61
LSF         167           0.44           85            0.35
muscle      30            0.38           69            0.46
muscle*     14            0.16           6             0.25

Red indicates better performance.

Computing motif statistics

Looking for fast ways to compute the probability of a local, multiple alignment.

Objective function of the latest version of the MEME algorithm.

Computing the statistics of random alignments

Knowing the statistical significance of motifs makes it possible to distinguish “real” motifs from patterns that can be explained by chance.

Computing motif significance is therefore critical to any motif discovery approach.

Measuring the goodness of DNA regulatory motifs: IC

[Figure: from sequences to alignment to counts to frequencies to information content.]

Sequences

5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT …HIS7
5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG …ARO4
5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT …ILV6
5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC …THR4
5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA …ARO1
5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA …HOM2
5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …PRO3

Alignment (rows 1, 2, ..., N; columns 1, 2, ..., w)

1 GACATCGAAA
2 GCACTTCGGC
  GAGTCATTAC
  GTAAATTGTC
  CCACAGTCCG
N TGTGAAGCAC

Counts: n_ij (letter i, column j)

Frequencies: f_ij = n_ij / N

Information Content: IC = IC_1 + ... + IC_w
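The pipeline on this slide (alignment, counts, frequencies, information content) can be sketched in a few lines. The slide gives IC as the sum of the column values but not the per-column formula, so the sketch assumes the usual relative-entropy form, IC_j = sum over letters i of f_ij log2(f_ij / b_i), with a uniform background b_i = 0.25. The log base, the background, and the function names are assumptions; MEME's actual objective may differ (for example in small-sample corrections). The six sites are the alignment rows shown on the slide.

```python
import math
from collections import Counter


def column_ic(column, background):
    """Relative entropy (in bits) of one alignment column versus background."""
    n = len(column)
    ic = 0.0
    for letter, count in Counter(column).items():
        f = count / n  # f_ij = n_ij / N
        ic += f * math.log2(f / background[letter])
    return ic


def information_content(sites, background=None):
    """Return (IC, [IC_1, ..., IC_w]) for equal-length aligned sites."""
    if background is None:
        background = {base: 0.25 for base in "ACGT"}  # assumed uniform
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same width")
    per_column = [
        column_ic([site[j] for site in sites], background) for j in range(width)
    ]
    return sum(per_column), per_column


slide_sites = [
    "GACATCGAAA", "GCACTTCGGC", "GAGTCATTAC",
    "GTAAATTGTC", "CCACAGTCCG", "TGTGAAGCAC",
]
total_ic, column_ics = information_content(slide_sites)
```

With a uniform DNA background each column contributes between 0 and 2 bits, so a perfectly conserved column scores 2 and a column with all four bases equally frequent scores 0.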

POP: product of IC p-values

IC is the sum of the information contents of the motif columns.

POP is an alternative measure of motif quality: the product of the p-values of the column information contents.

Statistics of IC scores

A large deviation method for computing the distribution of IC of random alignments is known (Hertz and Stormo, Bioinformatics, 15:563-577, 1999).

Time to compute the p-value of one IC score is O(N^2).

MEME computes O(w^2 N) IC scores per motif, so the total time, O(w^2 N^3), is prohibitive.

POP p-values can be computed efficiently.

Correction factor for POP p-values

The p-value of POP score, p, is roughly:

Because of the discrete nature of IC p-values, it is necessary to correct the POP p-values.

Empirically, the p-value error for POP, p, letting x = ln(p), is about
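The formulas on this slide did not survive, and the sketch below does not attempt to restore them. It shows one standard result that is relevant to a product of p-values: if the n column p-values were independent and continuous Uniform(0,1), the probability that their product is at most p would be p times the sum, for i from 0 to n-1, of (-ln p)^i / i!. Whether the slide used this expression is an assumption, and because IC p-values are discrete it can only be approximate, which is consistent with the need for a correction factor. The function name is hypothetical.

```python
import math


def product_pvalue(p, n):
    """P(product of n independent Uniform(0,1) variables <= p).

    Uses p * sum_{i=0}^{n-1} (-ln p)^i / i!, valid for 0 < p <= 1.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    if n < 1:
        raise ValueError("n must be at least 1")
    x = -math.log(p)
    term = 1.0   # (-ln p)^i / i! for i = 0
    total = 1.0
    for i in range(1, n):
        term *= x / i
        total += term
    return p * total


# A product of 0.01 over two columns is far less surprising than 0.01 itself.
example_pop_pvalue = product_pvalue(0.01, 2)
```

For a single column the function returns p unchanged; for more columns the same product becomes less significant, since many moderate p-values multiply to a small number by chance.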

Estimating the POP p-value correction factor parameters

To estimate the correction factor parameters we:

estimate the right tail of the distribution using a convolution method,

fit the (non-linear) correction function to the tail of the distribution using a least squares approach.

The CPU time per motif to compute POP p-values is negligible once the correction factor parameters are known.

[Figure: CPU time per motif using LD method to compute p-values, w=16]

[Figure: CPU time to estimate correction factor parameters, w=16]

[Figure: Speedup using POP statistic]

Discovering regulatory elements in orthologous genes

De novo discovery of most known regulatory elements in yeast has been demonstrated using four closely related yeast genomes (Kellis et al., Nature 423:241-254, 2003).

We are exploring the possibility of extending their approach to the human genome using orthologous genes from mouse.

Evaluation of motif discovery algorithms

Joint work with Martin Tompa and others.

Eighteen motif discovery algorithms were evaluated on DNA regulatory motifs in four organisms.

Each algorithm was run by experts in that particular algorithm.

The ability of the algorithm to discover motifs in sets of DNA sequences was measured.

Performance of Motif Discovery Algorithms Finding Regulatory Motifs

[Figure: nCC categorized by species. Vertical axis: combined nCC, from -0.05 to 0.45. Categories: whole set, fly, human, mouse, yeast.]
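The slides do not define nCC. Assuming it is the nucleotide-level correlation coefficient, that is, the correlation between known and predicted site positions computed from per-nucleotide counts of true and false positives and negatives, it could be computed as in the sketch below. The function name, the input encoding, and the choice to return 0.0 when the denominator is zero are assumptions.

```python
import math


def nucleotide_cc(known, predicted):
    """Correlation between known and predicted site membership per nucleotide.

    known, predicted: equal-length sequences of booleans, one per position,
    True where the position lies inside a (known or predicted) site.
    """
    if len(known) != len(predicted):
        raise ValueError("inputs must have the same length")
    tp = sum(1 for k, p in zip(known, predicted) if k and p)
    fp = sum(1 for k, p in zip(known, predicted) if not k and p)
    fn = sum(1 for k, p in zip(known, predicted) if k and not p)
    tn = sum(1 for k, p in zip(known, predicted) if not k and not p)
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # undefined case; returning 0.0 is a convention assumed here
    return (tp * tn - fn * fp) / denom


example_ncc = nucleotide_cc(
    [True, True, False, False],
    [True, False, False, False],
)
```

The value ranges from -1 to 1, with 1 for a perfect prediction, which fits an axis running from slightly below zero to 0.45.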

Conservation of known regulatory elements in sets of orthologous genes

[Figure: two panels, "Human vs. Mouse" and "Four yeast species", each comparing regulatory elements with background sequences.]

Source: Liu et al., Genome Res 14:451-458, 2004.

Large-scale discovery of human regulatory elements

Compared with yeast, regulatory elements make up less of human intergenic DNA (3% vs. 15%).

The relative difference in conservation rate (window percent identity) between human and mouse regulatory elements and background sequence is higher than among the four yeast species.

Large-scale motif discovery should be possible using human and mouse orthologous genes.
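"Window percent identity" is named above but not defined. One plausible reading, sketched here as an assumption, slides a fixed-width window along a pairwise alignment and reports the percentage of columns in which both sequences carry the same base. The gap character "-", the function name, and the toy sequences are illustrative and are not taken from Liu et al.

```python
def window_percent_identity(aligned_a, aligned_b, window):
    """Percent identity in each sliding window over a pairwise alignment.

    aligned_a, aligned_b: equal-length aligned strings, with "-" for gaps.
    Returns one percentage per window start position.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not 1 <= window <= len(aligned_a):
        raise ValueError("window must be between 1 and the alignment length")
    # A column counts as identical only if both bases match and are not gaps.
    matches = [a == b and a != "-" for a, b in zip(aligned_a, aligned_b)]
    current = sum(matches[:window])
    result = [100.0 * current / window]
    for i in range(window, len(matches)):
        current += matches[i] - matches[i - window]  # slide the window by one
        result.append(100.0 * current / window)
    return result


example_identity = window_percent_identity("ACGTAC", "ACGAAC", window=2)
```

Comparing such window scores inside known regulatory elements with scores in background sequence is the kind of contrast the conservation comparison relies on.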
