Pattern Discovery and Recognition for Understanding Genetic Regulation Timothy L. Bailey Institute for Molecular Bioscience University of Queensland

Upload: christian-branden-lane

Post on 18-Jan-2018



TRANSCRIPT

Page 1:

Pattern Discovery and Recognition for Understanding Genetic Regulation

Timothy L. Bailey

Institute for Molecular Bioscience

University of Queensland

Page 2:

Recent Work

Identifying statistically significant regulatory modules

Computing motif statistics

Evaluation of motif discovery algorithms

Future directions: motif discovery in sets of orthologous sequences

Page 3:

Identifying Statistically Significant Regulatory Modules

Overview of the problem

Previous research

The MCAST algorithm

Validation

Discussion

Page 4:

Problem Statement

Given a set of one or more motifs, can we identify the genes that they regulate by searching a genomic database?

Page 5:

The Problem is Hard

The futility theorem: the vast majority of potential TF binding sites are false positives (Wasserman).

This is because TF binding sites are short and degenerate, so they occur frequently at random in DNA.
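To make the point about short, degenerate sites concrete, here is a rough back-of-the-envelope sketch. It assumes a uniform, independent background (each base with probability 0.25), which real genomes do not follow exactly, and the function name and the example site are hypothetical.

```python
def expected_random_hits(allowed_bases_per_position, sequence_length, both_strands=True):
    """Expected number of chance matches to a site in random DNA.

    allowed_bases_per_position: for each site position, how many of the four
    bases are tolerated (1 = fixed base, 4 = any base).
    Assumes a uniform i.i.d. background, which is only an approximation.
    """
    width = len(allowed_bases_per_position)
    positions = max(sequence_length - width + 1, 0)

    # Probability that one random window matches the site.
    p_match = 1.0
    for allowed in allowed_bases_per_position:
        p_match *= allowed / 4.0

    strands = 2 if both_strands else 1
    return strands * positions * p_match


# An exact 6-bp site versus a degenerate one of the same width, in 1 Mb of DNA.
exact = expected_random_hits([1, 1, 1, 1, 1, 1], 1_000_000)
degenerate = expected_random_hits([1, 2, 1, 4, 2, 1], 1_000_000)
print(f"exact 6-mer: {exact:.0f} chance hits, degenerate site: {degenerate:.0f}")
```

Even the exact 6-mer is expected hundreds of times by chance in a megabase under this model, and each tolerated substitution multiplies that count.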

Page 6:

The Approach

Groups of transcription factors often operate in concert, binding near each other.

Multiple binding sites for the same TF often occur close together.

Whereas individual binding sites cannot be statistically significant, clusters may be.

Page 7:

MCAST

Hybrid of Cisanalyst and COMET

Based on Meta-MEME (Grundy et al., CABIOS 13:397-406, 1997)

MCAST has two input parameters:

Motif p-value threshold (p)

Maximum gap size (L)

MCAST builds a motif-based HMM and uses the Viterbi algorithm to find clusters.

Page 8:

Definition of a Motif Cluster

A “cluster” is a collection of “hits” (matches to motifs) with no gaps longer than L.

Hits are shown schematically as beads on a string. The number is the motif identifier. +/- indicates which DNA strand the hit is on.

+3 -2 +1 +1
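The cluster definition can be illustrated with a simple grouping pass over sorted hits. This is only a sketch of the definition, not of MCAST itself, which the slides say uses a motif-based HMM and the Viterbi algorithm. The hit tuple layout, the coordinates, and the function name are hypothetical.

```python
def find_clusters(hits, max_gap):
    """Group hits into clusters with no gap longer than max_gap (the slide's L).

    hits: list of (start, end, motif_id, strand) tuples, end exclusive.
    """
    clusters = []
    current_end = None
    for hit in sorted(hits, key=lambda h: h[0]):
        start, end = hit[0], hit[1]
        if clusters and start - current_end <= max_gap:
            # Gap to the previous hit is short enough: extend the cluster.
            clusters[-1].append(hit)
            current_end = max(current_end, end)
        else:
            # Gap too long (or first hit): start a new cluster.
            clusters.append([hit])
            current_end = end
    return clusters


# Hits labelled as on the slide (+3 -2 +1 +1); the positions are made up.
hits = [(100, 110, 3, "+"), (130, 140, 2, "-"), (150, 160, 1, "+"), (400, 410, 1, "+")]
clusters = find_clusters(hits, max_gap=50)
for cluster in clusters:
    print(" ".join(f"{strand}{motif}" for _, _, motif, strand in cluster))
```

With these made-up positions and L = 50, the first three hits form one cluster and the last hit stands alone.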

Page 9:

Cluster Scoring Function

[Figure: one cluster on genomic DNA, with hits h1, h2, h3, h4 separated by gaps d2, d3, d4.]

$M = h_1 + h_2 + h_3 + h_4 - g\,(d_2 + d_3 + d_4)$

where $h_1, \dots, h_4$ are the hit scores, $g$ is the gap penalty, and $d_2, d_3, d_4$ are the gap widths.
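A minimal sketch of this scoring function, reading it as the sum of the hit scores minus the gap penalty times the total gap width. The subtraction is an assumption, since the operators did not survive in the transcript, and the numeric values below are invented for illustration.

```python
def cluster_score(hit_scores, gap_widths, gap_penalty):
    """Score one cluster: sum of hit scores minus a per-base gap penalty.

    Assumes n hits are separated by exactly n - 1 gaps.
    """
    if len(gap_widths) != len(hit_scores) - 1:
        raise ValueError("expected one gap between each pair of adjacent hits")
    return sum(hit_scores) - gap_penalty * sum(gap_widths)


# Four hits (h1..h4) and three gaps (d2..d4), as in the slide's diagram.
score = cluster_score([5.0, 4.0, 6.0, 3.0], [20, 10, 30], gap_penalty=0.05)
print(score)
```

Under this reading, widely spaced hits are penalized, so a tight cluster of moderate hits can outscore a loose cluster of strong ones.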

Page 10:

Performance metrics

ROC50 measures the area under a curve that plots true positive rate as a function of false positive rate, up to the 50th false positive.

KB60 is the average number of kilobases per false positive at a threshold that yields 60% sensitivity.

For both metrics, larger is better.
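Both metrics can be sketched from a ranked list of scored sequences. The normalization of ROC50 (dividing by the number of false positives considered times the number of positives) and the use of the total kilobases searched for KB60 are assumptions, as the slides do not spell them out. Function names are hypothetical, and ties in score are not treated specially.

```python
import math


def roc_n(scores, labels, n=50):
    """Area under the ROC curve up to the n-th false positive, scaled to [0, 1]."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    total_pos = sum(1 for _, is_pos in ranked if is_pos)
    true_pos = false_pos = area = 0
    for _, is_pos in ranked:
        if is_pos:
            true_pos += 1
        else:
            false_pos += 1
            area += true_pos  # positives ranked above this false positive
            if false_pos == n:
                break
    if total_pos == 0 or false_pos == 0:
        return 0.0
    # If fewer than n negatives exist, normalize by the number actually seen.
    return area / (false_pos * total_pos)


def kb_per_false_positive(scores, labels, total_kb, sensitivity=0.6):
    """Kilobases searched per false positive at the given sensitivity.

    Assumes at least one positive; returns infinity if no false positive
    ranks above the threshold.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    total_pos = sum(1 for _, is_pos in ranked if is_pos)
    needed = math.ceil(sensitivity * total_pos)
    true_pos = false_pos = 0
    for _, is_pos in ranked:
        if is_pos:
            true_pos += 1
        else:
            false_pos += 1
        if true_pos >= needed:
            break
    return total_kb / false_pos if false_pos else math.inf


example_scores = [0.9, 0.8, 0.7, 0.6]
example_labels = [1, 0, 1, 0]
print(roc_n(example_scores, example_labels, n=2))
print(kb_per_false_positive(example_scores, example_labels, total_kb=10))
```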

Page 11:

Four Data sets

Drosophila Eve regulators (Bcd, Cad, Hb, Kr, Kni). 19 positives and 2039 putative negatives.

Human LSF-regulated promoters (LSF, Sp1, Ets, TATA). 9 positives and 2005 putative negatives.

Human muscle-specific promoters (Mef-2, Myf, SRF, Tef, Sp1). 27 positives and 2005 putative negatives.

Muscle* - motifs generated without muscle-specific genes.

Page 12:

Comparison with COMET

            MCAST           COMET
Data set    KB60    ROC50   KB60    ROC50
Drosoph     >4041   0.68    1010    0.61
LSF         167     0.44    85      0.35
muscle      30      0.38    69      0.46
muscle*     14      0.16    6       0.25

Red indicates better performance.

Page 13:

Computing motif statistics

Looking for fast ways to compute the probability of a local, multiple alignment.

Objective function of the latest version of the MEME algorithm.

Page 14:

Computing the statistics of random alignments

Knowing the statistical significance of motifs makes it possible to distinguish “real” motifs from patterns that can be explained by chance.

Computing motif significance is therefore critical to any motif discovery approach.

Page 15:

Measuring the goodness of DNA regulatory motifs: IC

[Figure: sequences are reduced to an alignment of N sites (columns 1, 2, …, w), then to counts, frequencies, and information content.]

Sequences

5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT …HIS7

5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG …ARO4

5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT …ILV6

5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC …THR4

5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA …ARO1

5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA …HOM2

5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …PRO3

Alignment

1 GACATCGAAA
2 GCACTTCGGC
  GAGTCATTAC
  GTAAATTGTC
  CCACAGTCCG
N TGTGAAGCAC

Counts: $n_{ij}$

Frequencies: $f_{ij} = n_{ij}/N$

Information Content: $IC = IC_1 + \dots + IC_w$
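The counts, frequencies, and information content pipeline can be sketched on the six aligned sites shown on the slide. The per-column formula used here, the relative entropy of the column frequencies against a uniform background in bits, is a common definition and an assumption about what the talk uses. The slides do not give the log base, background model, or any small-sample correction.

```python
import math
from collections import Counter

# Aligned sites as listed on the slide (N = 6 sites, w = 10 columns).
sites = [
    "GACATCGAAA",
    "GCACTTCGGC",
    "GAGTCATTAC",
    "GTAAATTGTC",
    "CCACAGTCCG",
    "TGTGAAGCAC",
]


def column_information(aligned_sites, background=None):
    """Return the information content (bits) of each alignment column.

    Assumes a uniform background unless one is supplied.
    """
    if background is None:
        background = {base: 0.25 for base in "ACGT"}
    n_sites = len(aligned_sites)
    width = len(aligned_sites[0])
    column_ics = []
    for j in range(width):
        counts = Counter(site[j] for site in aligned_sites)  # counts for column j
        ic = 0.0
        for base, count in counts.items():
            freq = count / n_sites  # frequency = count / N
            ic += freq * math.log2(freq / background[base])
        column_ics.append(ic)
    return column_ics


column_ics = column_information(sites)
total_ic = sum(column_ics)  # IC = IC_1 + ... + IC_w
print([round(ic, 2) for ic in column_ics], round(total_ic, 2))
```

Under this definition a fully conserved column scores 2 bits and a column with all four bases equally frequent scores 0.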

Page 16:

POP: product of IC p-values

IC is the sum of the information contents of the motif columns.

POP is an alternative measure of motif quality: the product of the p-values of the column information contents.
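As a rough sketch, POP is a product of per-column p-values, and a product of p-values has a simple closed-form p-value when the inputs are independent and continuous uniform: for a product p of n such values, P(product ≤ p) = p · Σ_{i=0}^{n-1} (−ln p)^i / i!. Using that formula here is an assumption, not something confirmed by the slides, and it ignores the correction the talk describes for the discrete nature of IC p-values. The function names and example p-values are hypothetical.

```python
import math


def pop_score(column_pvalues):
    """POP: the product of the p-values of the column information contents."""
    return math.prod(column_pvalues)


def product_pvalue(product, n):
    """P-value of a product of n independent, continuous uniform p-values.

    Uses p * sum_{i=0}^{n-1} (-ln p)^i / i!. No discreteness correction.
    """
    if product <= 0.0:
        return 0.0
    if product >= 1.0:
        return 1.0
    log_term = -math.log(product)
    total, term = 0.0, 1.0
    for i in range(n):
        total += term              # adds (-ln p)^i / i!
        term *= log_term / (i + 1)
    return product * total


column_pvalues = [0.01, 0.2, 0.05, 0.5]  # made-up column p-values
pop = pop_score(column_pvalues)
print(pop, product_pvalue(pop, len(column_pvalues)))
```

The p-value of the product is always at least as large as the product itself, because many combinations of column p-values can yield the same small product.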

Page 17:

Statistics of IC scores

A large deviation method for computing the distribution of IC of random alignments is known (Hertz and Stormo, Bioinformatics, 15:563-577, 1999).

Time to compute the p-value of one IC score is O(N^2).

MEME computes O(w^2 N) IC scores per motif, so the total time, O(w^2 N^3), is prohibitive.

POP p-values can be computed efficiently.

Page 18:

Correction factor for POP p-values

The p-value of POP score, p, is roughly:

Because of the discrete nature of IC p-values, it is necessary to correct the POP p-values.

Empirically, the p-value error for POP, p, letting x = ln(p), is about

Page 19:

Estimating the POP p-value correction factor parameters

To estimate the correction factor parameters we:

estimate the right tail of the distribution using a convolution method,

fit the (non-linear) correction function to the tail of the distribution using a least squares approach.

The CPU time per motif to compute POP p-values is negligible once the correction factor parameters are known.

Page 20:

CPU time per motif using LD method to compute p-values

w=16

Page 21:

CPU time to estimate correction factor parameters

w=16

Page 22:

Speedup using POP statistic

Page 23:

Discovering regulatory elements in orthologous genes

De novo discovery of most known regulatory elements in yeast has been demonstrated using four closely related yeast genomes (Kellis et al., Nature 423:241-254, 2003).

We are exploring the possibility of extending their approach to the human genome using orthologous genes from mouse.

Page 24:

Evaluation of motif discovery algorithms

Joint work with Martin Tompa and others.

Eighteen motif discovery algorithms were evaluated on DNA regulatory motifs in four organisms.

Each algorithm was run by experts in that particular algorithm.

The ability of the algorithm to discover motifs in sets of DNA sequences was measured.

Page 25:

Performance of Motif Discovery Algorithms Finding Regulatory Motifs

nCC categorized by species

[Bar chart: combined nCC (axis from -0.05 to 0.45) for the whole set, fly, human, mouse, and yeast.]

Page 26:

Conservation of known regulatory elements in sets of orthologous genes

[Figure: two panels, "Human vs. Mouse" and "Four yeast species", each comparing regulatory elements with background sequences.]

Source: Liu et al., Genome Res 14:451-458, 2004.

Page 27:

Large-scale discovery of human regulatory elements

Compared with yeast, regulatory elements make up less of human intergenic DNA (3% vs. 15%).

The relative difference in conservation rate (window percent identity) between human and mouse regulatory elements and background sequence is higher than among the four yeast species.

Large-scale motif discovery should be possible using human and mouse orthologous genes.