LOGOS: a modular Bayesian model for
de novo motif detection
Eric Xing
�Wei Wu
�
Michael Jordan�
Richard Karp
�
�
University of California, Berkeley�
epxing,jordan,karp
�@cs.berkeley.edu
�
Lawrance Berkeley National [email protected]
13 Aug 2003
CSB2003
The cis-regulatory machinery of eukaryotic organisms
CSB2003 Eric Xing et al. — 1
A DNA motif (Gcn4)
Motif features:
� PSMD: position-specific multinomial distribution� PWM: position weight matrix
CSB2003 Eric Xing et al. — 2
Meta-sequence features
� Shape bias: typical patterns of site-conservativeness due to binding surfaceconstraints.
� Spatial organization of motifs: dependencies between motif instances.
— Believed to be crucial in distinguishing biologically meaningful motifs from random
background or trivial recurring patterns.
Our goal: Build statistical models for motif-bearing sequences that:
� exploit the meta-sequence features and provide improved sensitivity tobiologically plausible motifs
� can generalize easily to various motifs and handle complex cis-regulatorystructures
� can make use of biologically identified motifs for de novo detection
CSB2003 Eric Xing et al. — 3
Introduction
� Using a hierarchical Bayesian Markovian model to posit hiddendependency structures of motifs
� Exploit this model to predict locations of novel motifs
� Talk outline
� Modeling generation of motif instances with a hidden MarkovDirichlet-multinomial model� Modeling motif instance distribution with a hidden Markov grammar� The LOGOS motif model and a variational EM algorithm for motifdetection� Experiments� Future work
CSB2003 Eric Xing et al. — 4
The Local Alignment Model— for multiple-aligned motif instances
CSB2003 Eric Xing et al. — 5
Signature internal structures of motifs
Shape bias:
gal4:
0
1
2
bit
s
5′
1
C2
G3
CG
4CGA
5TA
CG
6
T
A
GC
7
TCGA
8
TGC
9
AT
10
T
CG
11
GCT
12
T
G
AC
13
T
CG
14
G
ACT
15
GC
16
C
17
G3′
gal4-permute:
0
1
2
bit
s
5′
1
TA
CG
2
C
3
C
4
CGA
5
T
CG
6
CG
7
T
CG
8
TCGA
9
AT
10
T
G
AC
11
G
12
G
ACT
13
G
14
T
A
GC
15
GC
16
GCT
17
TGC
3′
pho4:
0
1
2
bit
s
5′
1
CAT
2 3
GCAT
4 5
CGA
6
AC
7
GA
8
C9
G10
T11
TG
12
TCG
13
TAG
3′
pho4-permute:
0
1
2
bit
s
5′
1
TAG
2
G
3
GA
4
TG
5
CAT
6
T
7 8
CGA
9
C
10
GCAT
11 12
AC
13
TCG
3′
Two yeast motifs and their column-permuted versions.
� The conservativeness of sites in the motif may be spatially dependent.
� The conserved sites are more likely to occur consecutively and possibly followed (or
preceded) by heterogeneous sites that are also consecutive.� The “shape” is only associated with the conservativeness pattern of a motif PWM, but
not with any specific consensus sequence of the motif.
CSB2003 Eric Xing et al. — 6
A standard motif alignment model
The Product Multinomial (PM) Model:
A 1 A 2 A M...={ }
1: AAAAGAGTCA2: AAATGACTCA. AAATGAGTCA . GAATGAGTCAM: AAAAGAGTCA
A=
θLθ2θ1
M M M
...A Am,2 Am,Lm,1
� The nucleotide distributions at different sites within the motif are assumedto be independent.
� Limitation:
� Unable to capture site dependence and shape preference in motifs
CSB2003 Eric Xing et al. — 7
The PSMD simplex and the Dirichlet Distribution
Each possible position-specific nucleotide distribution corresponds to a point in the simplex.
CSB2003 Eric Xing et al. — 8
A Markov model for PSMD-simplex sequence
α 1 α 3
α 2
α 4
The PSMD-simplex sequence defines a structural prior for the PWM.
CSB2003 Eric Xing et al. — 9
Hidden Markov Dirichlet-multinomial model (HMDM)
I
s s
θ θ θ
M M M
1 2 L
1 2 L
...
...
s
A Am,2 Am,Lm,1
α
� Parameters (for all motifs):� �
sets of 4-dimensional Dirichlet parameter � � � � �� � � � � �
.� Initial probability � and transition probability
�
of an HMM.
� Random variables (for each motif site):� The site prototype indicator indexing different distributions on thePSMD simplex is represented by a multinomial r.v. �� .� The PSMD associated with the site is represented by a Dirichlet r.v.
�� .� The nucleotide content of the site is represented as a multinomial r.v.��� �� .
CSB2003 Eric Xing et al. — 10
The Global Distribution Model— for patterns of motif occurrences in genomic
sequences
CSB2003 Eric Xing et al. — 11
Global organization of motif instances
cis-regulatory modules (Davidson [2001]):
� There exist instance (resp. cluster)-level dependencies between motifs(resp. modules).
CSB2003 Eric Xing et al. — 12
A conventional global model
The uniform and independent (UI) start-position model:
� The locations of motif instances are assumed to be independently sampledfrom a uniform distribution over possible positions.
� Limitations:
� instances can overlap� does not capture inter-instance, inter-motif dependences
CSB2003 Eric Xing et al. — 13
A hidden Markov model for motif occurrences
The state-transition grammar of the global HMM:
b(1)
(1) (1) (1)...1 2 L
(1’)... 1L(1’) 2 (1’)
β1
11/2
1/2
1
1
1
1
β1,i
β i,1
1,0
1
1
β
b(0)
0,0
αi,0β0,i αk,0 α1,0β0,1
β0,k
β
b(k)
(k) (k) (k)...1 2 L
(k’)...L(k’) 2 (k’) 1
k
k
1/2
1/2
1
1
1
1
βk,i
β i,k
k,0
1
1
...
� Latent states:
� all possible sites within motifs on the forward and reverse complementary strand� intra-cluster backgrounds� inter-cluster (global) background
� Observed output: unannotated sequence of nucleotides.
� Parameters: HMM parameters
��� �� � � �
; motif PWMs
�
; background frequencies
�! .
CSB2003 Eric Xing et al. — 14
A modular motif model for de novo motif detection
� The integrated LOcal and GlObal motif Sequence model
� The occurrences of motifs in a DNA sequence, as indicated by ", aregoverned by a global distribution model # � $ " %& ' ( .� For each type of motif, the nucleotide sequence pattern shared by all itsinstances admits a local alignment model #*) $+ $ " , ( % " & � (
.� Assume that the background non-motif sequences are modeled by asimple conditional model, # $ ,.- + $ , " ( % " �0/ (
.
� The likelihood of a regulatory sequence , is:
# $ , %& ( �1
# � $ " % & ' ( #) $ , % " & � (
�1
# � $ " %& ' ( #) $+ % " & � ( # $ ,- + % " �/ (
CSB2003 Eric Xing et al. — 15
Inference and parameter estimation
� The inference and learning problems:
� Maximum a posteriori (MAP) prediction of motif locations� Bayesian estimation of motif PWMs
A variational EM algorithm for LOGOS
Variational “E” step: Compute the
23 4, the expected count of each
nucleotide, for each motif and each site within it, via inference in theglobal motif model given
2576 � 4and sequence ,.
Variational “M” step: Compute the Bayesian estimation of the PWM, innatural parameter form
2 56 � 4, via inference in the local motif alignment
model given
23 4
.
CSB2003 Eric Xing et al. — 16
Experiments:
� Semi-synthetic sequence: random background (300 8600bp) with genuinemotif instance(s) planted at known positions.
� Real genomic sequences: 500-5000 bp fragments in the genomicsequences flanking the motif instances.
CSB2003 Eric Xing et al. — 17
Detecting a single motif using HMDM model
0 20 40 600
1
2
abf1 (21)
0 20 40 600
1
2
gal4 (14)
0 20 40 600
1
2
gcn4 (24)
0 20 40 600
1
2
gcr1 (17)
0 20 40 600
1
2
mat−a2 (12)
0 20 40 600
1
2
mcb (16)
0 20 40 600
1
2
mig1 (11)
0 20 40 600
1
2
crp (24)
abf1 gal4 gcn4 gcr1 mat mcb mig1 crpHMDM(f) 0.81 0.82 0.71 0.65 0 0 0.73 0.58HMDM(e) 0.77 0.89 0.63 0.12 0.67 0 1.00 0.54HMDM(b) 0.12 0 0 0.59 0.92 0.72 0.91 0.04
PM 0.14 0 0 0.41 0.79 0.13 0.64 0.04
� Result: different HMDMs can be used for different “classes” of motifs.
� can use #-value to identify best-performing model.
CSB2003 Eric Xing et al. — 18
Detecting multiple motif instances using LOGOS
� We compare three variants of LOGOS, ordered with decreasing modelexpressiveness,
� HMDM+HMM (LOGOS 9 9)� PM+HMM (LOGOS : 9)� PM+UI (LOGOS :; )
motif LOGOS < < LOGOS = < LOGOS =>
name FP FN FP FN FP FNabf1 0.31 0.21 0.68 0.20 0.79 0.91gal4 0.16 0.16 0.19 0.15 0.29 0.79gcn4 0.18 0.24 0.61 0.28 0 0.96gcr1 0.20 0.21 0.34 0.20 0.33 0.94mat 0.07 0.03 0.36 0 0.50 0.96mcb 0.37 0.09 0.36 0.08 0.33 0.94mig1 0.08 0 0.09 0 0.98 0.10crp 0.38 0.34 0.27 0.53 0 0.95
CSB2003 Eric Xing et al. — 19
Analyzing Yeast promoter sequences using LOGOS
set LOGOS MEME AlignACE seqname FP FN FP FN FP FN no.abf1 0.79 0.65 1.00 1.00 0.53 0.61 20csre 0.44 0.17 0.78 0.50 0.80 0.50 4gal4 0.13 0.07 0.17 0.29 0.33 0.14 6gcn4 0.35 0.18 1.00 1.00 0.33 0.56 9gcr1 0.29 0.61 1.00 1.00 0.45 0.46 6hstf 0.86 0.55 0.60 0.56 0.85 0.67 6mat 0.42 0 0.38 0.56 0.25 0.25 7mcb 0.47 0.25 0.20 0.33 0.25 0.25 6mig1 0.81 0.28 1.00 1.00 0.83 0.79 22pho2 0.90 0.50 1.00 1.00 1.00 1.00 3swi5 0.76 0.50 1.00 1.00 0.94 0.75 2uash 0.83 0.68 1.00 1.00 0.92 0.95 18
CSB2003 Eric Xing et al. — 20
Analyzing Drosophila promoter sequences
� Motif patterns derived from biologically identified motif instances (Berman et al. [2002]).
0
1
2
bit
s
5′
1
A
TC
2
CAT
3
GA
4
A
5
GT
6
A
TC
7
GTC
8
T
CG
3′
0
1
2
bit
s
5′1
T
G
A
2
G
3 4
G
A
TC
5
A
6
GCT
7
A
8
A
9
A
10
C
TGA
3′
0
1
2
bit
s
5′
1
GCAT
2
T
3
G
CT
4
T
5
T
6
T
7
T
CGA
8
G
CAT
9
CTG
10 11
C
G
T
3′
0
1
2
bit
s
5′
1
T
CGA
2
GCA
3
TCA
4
G
TCA
5
T
G
6
AG
7
CAG
8
GT
9
A
CGT
10
C
TGA
3′
0
1
2
bit
s
5′
1
A
2
A
3
GAC
4
GT
5
CA
6
G
7
GCA
8
CTG
9
C
10
A3′
bcd cad hb kni kr
� Motif patterns detected by LOGOS in the promoter regions (600-5000bp) of 9 Drosophilagenes.
0
1
2
bit
s
5′
1
AT
2
CTA
3
ATC
4
A
TGC
5
T
G
AC
6
TGCA
7
GTA
8
GC
9
C
10
C
11
TGA
12
CTA
13
CAT
14
C
15
GC
3′
0
1
2
bit
s
5′
1
TGA
2
ACT
3 4
G
T
AC
5
CTGA
6
G
TCA
7
GCAT
8
A
C
T
9
GTA
10
G
AT
11
A
C
GT
12
TAC
13
TAG
14 15
TAC
16
TGA
17
AT
18
T
GCA
19TA
20
CTA
3′
0
1
2
bit
s
5′
1 2
AGT
3
CAT
4
A
GCT
5
A
CGT
6
A
CGT
7
GCAT
8
GTA
9
C
GAT
10
TG
11
T
G
12
GCT
13
G
A
CT
14
G
CT
15
A
G
CT
3′
0
1
2
bit
s
5′
1
A
G
T
2
GT
3
GC
4
GC
5
G
CTA
6
T
C
AG
7
A
CG
8
C
A
T
9 10
C
G
TA
11 12
GC
13
GTC
14
GAC
15
T
G
A
16
AT
17
GACT
18
C
3′
0
1
2
bit
s
5′
1
CT
2
GTC
3
T
4
TC
5
A
G
6
ATC
7
GACT
8
CGT
9
ACT
10 3′
0
1
2
bit
s
5′
1
A
T
2
T
A
GC
3 4
A
GC
5
CG
6
G
7
AC
8 9
CTA
10
A
C
GT
11
TCG
12
AGC
13
G
C
TA
14
TCGA
15
GAT
3′
0
1
2
bit
s
5′1
AT
2
CTA
3
GTA
4
TCA
5
TA
6
TA
7
AT
8
CGT
9
GAT
10
T
GA
3′
0
1
2
bit
s
5′
1
CTG
2
C
3
C
4
TAG
5
CG
6
T
A
GC
7
A
TCG
8
GCA
9
AGT
10
C3′
� Motif patterns detected using conventional models.
0
1
2
bit
s
5′
1
GCT
2
AT
3
A
GT
4
G
AT
5
CGT
6
GTC
7
ACTG
8
A
TC
9
GTC
10
CG
11
C
G
A
12
T
ACG
13
A
TC
14
A
CGT
15
G
T
CA
3′
0
1
2
bit
s
5′
1
CGA
2
T
CAG
3
T
GA
4
CA
5
TA
6
TCA
7
GCT
8
GCT
9
G
T
A
10
CAG
11
T
CGA
12
ACT
13
T
AGC
14
GTA
15
A
TCG
16
A
C
TG
17
GT
18
C
G
A
T
19
AGT
20
GTC
3′
0
1
2
bit
s
5′
1
C
ATG
2
TGC
3T
G
AC
4
T
GA
5
GAT
6
C
TGA
7
G
T
CA
8
CTA
9
C
T
GA
10
GCA
3′
CSB2003 Eric Xing et al. — 21
Future work
� Background model: higher order HMM
� Incorporating expression, binding, comparative genomic data
� Analyzing eucaryotic cis-regulatory networks
CSB2003 Eric Xing et al. — 22