A hierarchical regression mixture modelfor inferring gene regulatory networks
Mayetri [email protected]
University of North Carolina at Chapel Hill
[email protected] -- p.1/30
Upstream regulation ↔ Downstream expression
Fundamental question: how can we understand the biologicalmechanisms leading to disease?
...gtggtTAGAATagcgactgttttt... gene 1
...taggTATAATacagtctgacaaaa... gene 2
...cagcaacattgaTATAATtgccat... gene 3
...ctaaaacaatTATTATttatcagg... gene 4
0
1
2
bits | 1 T 2 A 3 GT 4 TA 5 GCA 6 T|
Co-regulated genes sharesimilar upstream patterns
Identify genes that are differentially expressed under differenttreatments or conditions
[email protected] -- p.2/30
Gene Regulation: DNA Motifs
Proteins bind to DNA to activate gene transcription
0
1
2
bits |
1 T 2 A 3 GT
4 TA 5 GCA 6 T|Position specific weight
matrix (PSWM)
or Motif
[email protected] -- p.3/30
Gene regulation in complex genomes
Harder problem: many transcription factors working inco-ordination
LARGE sequence search space: using sequence data only→ many false positives?
[email protected] -- p.4/30
Upstream regulation ↔ Downstream expression
Gene expression contains information about sequence motifs
Sequence may contain information on gene co-regulation
Expression clustering→ Motif discovery
or
Expression clustering↔ Motif discovery ?
What if initial clustering is inaccurate?
[email protected] -- p.5/30
Cell-cycle data set (Spellman, Mol Biol Cell. 1998)
28 63 98
−2−1
01
23
G1G2/MM/G1SS/G2
PS
fragreplacem
ents
gene
expr
essi
on
time (minutes)
Measurements over 18 time points, 3 different experiments∼ 800 genes known to be cell-cycle dependent
Do clusters of genes share common TFs?Do certain TFs work combinatorially on groups of genes?
color: time when gene is active
[email protected] -- p.6/30
Using Gene Expression in Motif Discovery
REDUCER (Bussemaker, Nat. Genet. 2001) correlatesexpression of gene with number of motif occurrences
MDScan (Liu, Nat Biotech. 2002) Most strongly differentiallyexpressed genes → candidate motifs.
Motif Regressor (Conlon, PNAS 2003)Multiple regression model: Sum of motif effects explainsgene expression
Yg = α+M
∑m=1
βmSmg + εg
Yg: expression of gene g; Smg: motif-match score
[email protected] -- p.7/30
Using Gene Expression in Motif Discovery
Non-parametric approaches
Phuong et al. (Bioinformatics, 2004): Classification Trees(CART)
Arbitrary decision criterion based on the number ofoccurrences of a motif type
Multivariate adaptive regression splines (Das et al, PNAS2004)
Joint sequence-expression model without parametric connection
Holmes and Bruno (ISMB proc.,2000) joint likelihood forsequence and expression data
[email protected] -- p.8/30
Using motif information in gene clustering
Infer sets of transcription factors involved in regulating groups ofgenes
Higher transcriptional activity → greater presence of TFbinding sites, more pronounced expression changes
Genes within a “cluster” may be correlated, with or withoutsharing common transcription factors
Measurements on the same gene in different conditions maybe correlated due to sharing the same upstream transcriptionfactor binding sites
[email protected] -- p.9/30
Linear mixed effects model
y = Xβ+Zb+ ε
β: fixed effectsb: random effects
ε ∼ N(0,τ−10 I)
y: gene expression
Fixed effects: Sequence MotifLevels of factor are reproduced exactly if experiment isrepeated
Random effects: Expression clusterLevels of factor (expression + clusters) may not bereproduced exactly if experiment is repeated
[email protected] -- p.10/30
Joint Model for Sequence-Expression
Complication: gene cluster identity cannot be assumed knownZ matrix not “fixed”
Conditional on cluster k, (k = 1, . . . ,K), vector of log-expressionvalues of gene g generated from mixed-effects model:
Yg|zg = k,X , parameters ∼ N(ξg +XTg βk1,σ2
kI) (≡ fk)
[email protected] -- p.11/30
Joint Model for Sequence-Expression
Complication: gene cluster identity cannot be assumed knownZ matrix not “fixed”
Conditional on cluster k, (k = 1, . . . ,K), vector of log-expressionvalues of gene g generated from mixed-effects model:
Yg|zg = k,X , parameters ∼ N(ξg +XTg βk1,σ2
kI) (≡ fk)
Unconditionally, a mixture model
P(Y |X , parameters) = ∏genes g
[∑
cluster kπk fk(Yg| parameters)
]
Which β’s significant, in which [email protected] -- p.12/30
Bayesian hierarchical formulation
Prior distributions for cluster k parameters:“Expression” model:
µk ∼ N(·,v2k0I)
σ2k ∼ InvGamma(·, ·)
ξg|zg = k ∼ N(µk,τ0σ2kI)
“Sequence” model: βk ∼ N(β0,V k)Probabilities of cluster membership:P(zg = k) = πk
(π1, . . . ,πK) ∼ Dirichlet(α1, . . . ,αK)
G genes; T measurements per gene
[email protected] -- p.13/30
Bayesian hierarchical formulation
For each βk, we use multivariate extension of g-prior (Zellner,1986), so that
V k =cσ2
k
TS−1
k , where Sk = ∑zg=k
XgXTg
Why use g-prior?
Computational efficiency, varying c −→ more/less informative
Induces dependence among genes in a cluster due tosequence effects
Cov(Yg,Yg′|Zg,Zg′ = k) = v2k0I +
cσ2k
T1[XT
g S−1k Xg
]1T
[email protected] -- p.14/30
Regulatory Motif Model
· · ·θ0 θ0 θ1 · · · θ6 θ0 θ0 · · ·
Every non-siteposition multinomialwithθ0 = (θ01, . . . ,θ04)
Every motif position imultinomial withθi = (θi1, . . . ,θi4)
Product Multinomial modelChallenge: Find position of sites and θ’s
MEME (Bailey, ISMB 1994) Gibbs Motif Sampler (Liu, JASA 1995), AlignAce (Roth, Nat. BioTech 1998),
BioProspector (Liu, Pac. Symp. Biocomp. 2001), Stochastic Dictionary (Gupta, JASA 2003)
[email protected] -- p.15/30
Sequence Motif Scoring
Starting set of motifs
De-novo: Different clusters of genes exhibiting “strong” up- ordown-regulation (MDScan)
Databases: Derived from experimental data
Motif score for w-width motif j and upstream sequence g:
Xg j = ∑positions i
P(seq(i, i+w−1)|motif j)P(seq(i, i+w−1)| null)
[email protected] -- p.16/30
Sequence motif selectionInitial set of D motif candidates (D can be large!)In regression model, want to know which motifs correlated“significantly” with response (gene expression)
u = (u1, . . . ,uD) where
u j =
{1 if motif j is in model
0 otherwise.
Prior probability of motif inclusion
P(u) =D
∏j=1
ηu j(1−η)1−u j
Variable selection from LARGE potential [email protected] -- p.17/30
Outline of method
select motifs from
given motifs update clusters
update model parametersgiven cluster membership
and functional motifs ineach class
large initial set
active in each class
Complications
Cluster membership z unknown
Number of clusters K may be unknown (assume fixed for now)
Number of motif candidates is large
[email protected] -- p.18/30
Parameter updating using MCMC
Update from joint posterior distribution
P(θ,β,u,z|Y ,X ,K)
θ = (µ,σ2,π)
For updating steps for parameters θ,β and z, marginalize overother parameters for efficiency (Conjugate forms permit this)
For updating u, use evolutionary Monte Carlo (Liang andWong, JASA 2001)
Select motifs that have most effect on expression, and differentiatemost among clusters
[email protected] -- p.19/30
Three simulated data sets
Regression coeffs. for motifs cor-
responding to 3 PSWMs from
JASPAR database SAP1, SRF,
amd MEF2
200 “genes” in K = 2 clusters
2 measurements each
Data 1 Data 2 Data 3
Coef. C1 C2 C1 C2 C1 C2
βM1 2 0 2 0 2 -2
βM2 0 2 0 -2 0 0
βM3 0 0 0 0 0 0
(Motif 3 not present in data)
+
+
+
+
+
+
+
+
+
+
+
+
+
++
+
+
+
+
+
+
+
++
+
+
+
+
+
+
+
+
+
++
++
++
+
+
+
+
++
+
+
+
++
+
++++
+
+
+
+
+
+
+
+
+
+
+
+
+
++
+
+
+
+
+
+
++
+
+
+
+
+
+
+
+
++
+
+
+
+
+
+
+
+
+
+
+
+
o
o
o
o
o o
o
o
o
o
o
o
o
oo
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
oo
o
o
o
o
o
o
o
o o
oo
oo
o
o
o
o
o
o
o
oo
o
o
o
o
o
o
o
o
o
o
o
o
o o o
ooo
o o
o
o
o
o
o
o
ooo
o
o
oo
o
o
o
o
o
o
o
o
o
o
−1 0 1 2
01
23
4
x11
+
+
+
+
+
+
+
+
+
+
+
+
+
++
+
+
+
+
+
+
+
++
+
+
+
+
+
+
+
+
+
++
++
++
+
+
+
+
++
+
+
+
++
+
+++
++
+
+
+
+
+
+
+
+
+
+
+
+
++
+
+
+
+
+
+
+ +
+
+
+
+
+
+
+
+
++
+
+
+
+
+
+
+
+
+
+
+
+
o
o
o
o
oo
o
o
o
o
o
o
o
oo
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
oo
o
o
o
o
o
o
o
oo
oo
o o
o
o
o
o
o
o
o
oo
o
o
o
o
o
o
o
o
o
o
o
o
ooo
ooo
oo
o
o
o
o
o
o
ooo
o
o
oo
o
o
o
o
o
o
o
o
o
o
−1.0 −0.5 0.0 0.5 1.0 1.5 2.0
01
23
4
x12
Y1 +
+
+
+
+
+
+
+
+
+
+
+
+
++
+
+
+
+
+
+
+
++
+
+
+
+
+
+
+
+
+
++
++
++
+
+
+
+
++
+
+
+
++
+
+++
++
+
+
+
+
+
+
+
+
+
+
+
+
++
+
+
+
+
+
+
++
+
+
+
+
+
+
+
+
++
+
+
+
+
+
+
+
+
+
+
+
+
o
o
o
o
o o
o
o
o
o
o
o
o
oo
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
o
oo
o
o
o
o
o
o
o
o o
oo
o o
o
o
o
o
o
o
o
oo
o
o
o
o
o
o
o
o
o
o
o
o
o o o
ooo
o o
o
o
o
o
o
o
o oo
o
o
oo
o
o
o
o
o
o
o
o
o
o
−1.0 −0.5 0.0 0.5 1.0
01
23
4
x13
Y1
++
+
+
++
+
+ ++
++
+
+
++
+
+
+
+
+
+
+
+
+++
+
+
+
+
+
+
++
+
+
++
+
++
+ ++
+
+
+
++
+
+
++
+
+++
+
+
+
++
+
++
++
+
+
+
+++
+
+
++
+
+++
+
+
+
+
+
++
+
+
+
+
++
+
+
+
++
o
oo
o
o
oo
oo
o
o
o
o
o
o
o o
o
o
oo
o oo
o
oo
o
oo
oo
o
o
ooooo
oo
o
o
o
oo
o
o
o
oo o
o
o
o
o
o
o
oo
o o
o
o
oo
o
o
o
o
oooo
o
o
oo
o
o
o
o
oo
o
o oo
o
o
o
o
o
o
o
o
o
o
o
o
−1 0 1 2
−10
12
34
5
x31
++
+
+
++
+
+ ++
++
+
+
++
+
+
+
+
+
+
+
+
+ ++
+
+
+
+
+
+
++
+
+
++
+
++
+ + +
+
+
+
++
+
+
+ +
+
++++
+
+
++
+
+ +
++
+
+
+
++ +
+
+
+ +
+
++ +
+
+
+
+
+
+++
+
+
+
++
+
+
+
++
o
oo
o
o
oo
oo
o
o
o
o
o
o
oo
o
o
oo
ooo
o
oo
o
oo
oo
o
o
oo ooo
oo
o
o
o
oo
o
o
o
ooo
o
o
o
o
o
o
oo
oo
o
o
oo
o
o
o
o
oooo
o
o
oo
o
o
o
o
oo
o
ooo
o
o
o
o
o
o
o
o
o
o
o
o
−1.0 −0.5 0.0 0.5 1.0 1.5 2.0
−10
12
34
5
x32
Y3
++
+
+
++
+
+ ++
+ +
+
+
++
+
+
+
+
+
+
+
+
+++
+
+
+
+
+
+
++
+
+
++
+
+ +
+++
+
+
+
++
+
+
+ +
+
+ ++
+
+
+
++
+
++
++
+
+
+
+++
+
+
+ +
+
++ +
+
+
+
+
+
++
+
+
+
+
++
+
+
+
++
o
oo
o
o
oo
oo
o
o
o
o
o
o
o o
o
o
oo
ooo
o
oo
o
o o
oo
o
o
o oo oo
oo
o
o
o
oo
o
o
o
oo o
o
o
o
o
o
o
oo
o o
o
o
oo
o
o
o
o
o oo o
o
o
o o
o
o
o
o
oo
o
o oo
o
o
o
o
o
o
o
o
o
o
o
o
−1.0 −0.5 0.0 0.5 1.0
−10
12
34
5
x33
Y3
++
++
+
++
+
+
+
+
+
++
+
+
+
++
+
++
+
+
+
+
++
+
+
+
+++
+
+ +
+
+
++
+
+
+
++
+
+
++
+
+
+
+
++
+
+
+
++
++
+
+
+ +
+
++
+
+
++
+ ++
+
++
+
++
+
++
+
++
+
+
++
+
+
+
+
+
+
+
o
o
o
o
o
o
o
oo
o
o
o
o
ooo
o
o
oo o
o
oo
o
o
o
o
oo
ooo
o
o
oo
o
oo
oo
o
o
o
oo
o
o oo
o
o
o
o
o
o
o o
o
o
o
oo
o
oo
ooo
o
oo
o
oo
oo
o
o
o
o
o o
o
o
oo
o
o
o
o
ooooo
oo
o
−1.0 −0.5 0.0 0.5 1.0 1.5 2.0 2.5
−20
24
++
++
+
++
+
+
+
+
+
+ +
+
+
+
++
+
+ +
+
+
+
+
++
+
+
+
+++
+
++
+
+
++
+
+
+
++
+
+
++
+
+
+
+
++
+
+
+
++
++
+
+
++
+
++
+
+
++
+ ++
+
++
+
+ +
+
++
+
++
+
+
++
+
+
+
+
+
+
+
o
o
o
o
o
o
o
oo
o
o
o
o
ooo
o
o
oo o
o
oo
o
o
o
o
oo
ooo
o
o
oo
o
oo
oo
o
o
o
oo
o
o oo
o
o
o
o
o
o
o o
o
o
o
oo
o
oo
ooo
o
oo
o
oo
o o
o
o
o
o
oo
o
o
oo
o
o
o
o
o oooo
o o
o
−0.5 0.0 0.5 1.0 1.5 2.0
−20
24
Y2
++
++
+
++
+
+
+
+
+
++
+
+
+
++
+
+ +
+
+
+
+
++
+
+
+
+ ++
+
+ +
+
+
++
+
+
+
++
+
+
++
+
+
+
+
++
+
+
+
++
++
+
+
++
+
+ +
+
+
+++ +++
++
+
++
+
++
+
+ +
+
+
++
+
+
+
+
+
+
+
o
o
o
o
o
o
o
oo
o
o
o
o
ooo
o
o
oo o
o
oo
o
o
o
o
o o
ooo
o
o
oo
o
oo
oo
o
o
o
oo
o
o oo
o
o
o
o
o
o
oo
o
o
o
oo
o
oo
oo o
o
oo
o
oo
o o
o
o
o
o
oo
o
o
oo
o
o
o
o
o ooo
o
o o
o
−1.5 −1.0 −0.5 0.0 0.5 1.0 1.5
−20
24
Y2
S1 S2 S3Y
(3)
1Y
(2)
1Y
(1)
1
Y1 with motif 1,2,3 scores (columns)
in 3 data sets (rows)
[email protected] -- p.20/30
Simulation study results: Bayes factors
Data 1 Data 2 Data 3
1 2 3 4 5 6
−3
00
−2
50
−2
00
PS
fragreplacem
ents
K
P(M
K)
1 2 3 4 5 6
−7
00
−6
00
−5
00
−4
00
data3
PS
fragreplacem
ents
K
P(M
K)
1 2 3 4 5 6
−7
00
−6
00
−5
00
−4
00
−3
00
data2
PS
fragreplacem
ents
K
P(M
K)
Optimal choice: K = 2
Marginal model probability for MK through Double mixtureimportance sampling
P(Y |MK)=̂1
NtNs∑
t∑
sP(Y |Z(t)
,θ(s),K)
π1(θ(s)|z(t))
f1(θ(s)|z(t))
π2(z(t))
f2(z(t))
[email protected] -- p.21/30
Simulation: β estimates
Data 1 Data 2 Data 3
1 2
−0
.50
.00
.51
.01
.52
.0
1 2
−0
.50
.00
.51
.01
.52
.0
PS
fragreplacem
ents
motif type
coef
ficie
nt
c1c2
1 2
−3
−2
−1
01
2
1 2
−3
−2
−1
01
2
PS
fragreplacem
entsmotif type
coef
ficie
nt
c1c2
−2
−1
01
2−
2−
10
12
PS
fragreplacem
ents
motif type
coef
ficie
nt c1c2
Data sets 1 and 2: Motifs 1,2 selectedData set 3: Motif 1 selected
[email protected] -- p.22/30
Spellman (1998) data
28 63 98
−2−1
01
23
G1G2/MM/G1SS/G2
PS
fragreplacem
ents
gene
expr
essi
on
time (minutes)
Pick different groups of genes that are highly differentially expressed
Sets of motif candidates of widths 7-12 bp (using MDscan)
Motif overlaps lead to collinearity: remove motifs with correlation > 0.5 with a
higher-ranked one → 32 candidate motifs
Two consecutive time points: 1-2, 3-4, . . ., 17-18
[email protected] -- p.23/30
Number of clusters K∗
log(B̂F) compared to 1-component model
22 2
2
2
2 2
2 2−2
00
2040
time interval
log(
BF)
1 2 3 4 5 6 7 8 9
3
3
3
3
3
3
3
3 3
Interval 1 2 3 4 5 6 7 8 9K∗ 1 3 1 2 2 2 2 1 1
0
1
2
bits
1
TA
C
2
T
AG
3
AGC
4
A
CG
5
TA
6
TA
7
TA
8
G
CTA
9
GTA
0
1
2
bits
1
TA
2
A
C
3
T
AG
4
A
TC
5
A
G
6
AT
7
CTA
0
1
2
bits
1
TA
2
A
C
3
T
AG
4
A
TC
5
A
G
6
AT
7
CTA
0
1
2
bits
1
GTA
2
CGTA
3
GTC
4GTA
5CTGA
6
GTC
7
CGTA
0
1
2
bits
1
TA
2
A
C
3
T
AG
4
A
TC
5
A
G
6
AT
7
CTA
0
1
2
bits
1
C
TAG
2
C
TAG
3
TCGA
4
T
GAC
5
G
TCA
6
CATG
7
CTGA
0
1
2
bits
1
TGAC
2
TCAG
3
TGCA
4
GCTA
5
T
CAG
6
GCTA
7
TCAG
0
1
2
bits
1
G
TAC
2
AT
3
GCTA
4
CTA
5
AT
6
A
CT
7
GCTA
0
1
2
bits
1
TA
2
A
C
3
T
AG
4
A
TC
5
A
G
6
AT
7
CTA
SCB MCB MCB SFF MCB MCM1 MCM1 MCB
0
1
2
bits
1
GTA
2
CGTA
3
GTC
4
GTA
5
CTGA
6
GTC
7
CGTA
0
1
2bi
ts
1
GTA
2
CGTA
3GTC
4
GTA
5
CTGA
6
GTC
7
CGTA
0
1
2
bits
1
T
A
2
A
C
3
T
AG
4
A
TC
5
A
G
6
AT
7
CTA
0
1
2
bits
1
T
A
2
A
C
3
T
AG
4
A
TC
5
A
G
6
AT
7
CTA
SFF SFF MCB MCB
0
1
2
bits
1GTA
2CGTA
3
GTC
4
GTA
5
CTGA
6
GTC
7
CGTA
SFFSignificant motifs over time [email protected] -- p.24/30
Motif influence at different time intervalsMotif index −→
0 5 10 15 20 25 30
0.0
0.2
0.4
0.6
0.8
1.0
PS
fragreplacem
ents
0 5 10 15 20 25 30
0.0
0.2
0.4
0.6
0.8
1.0
PS
fragreplacem
ents
0 5 10 15 20 25 30
0.0
0.2
0.4
0.6
0.8
1.0
PS
fragreplacem
ents
0 5 10 15 20 25 30
0.0
0.2
0.4
0.6
0.8
1.0
PS
fragreplacem
ents
0 5 10 15 20 25 30
0.0
0.2
0.4
0.6
0.8
1.0
PS
fragreplacem
ents
0 5 10 15 20 25 30
0.0
0.2
0.4
0.6
0.8
1.0
PS
fragreplacem
ents
0 5 10 15 20 25 30
0.0
0.2
0.4
0.6
0.8
1.0
PS
fragreplacem
ents
0 5 10 15 20 25 30
0.0
0.2
0.4
0.6
0.8
1.0
PS
fragreplacem
ents0 5 10 15 20 25 30
0.0
0.2
0.4
0.6
0.8
1.0
PS
fragreplacem
ents
Post
erio
rpro
b.of
sele
ctio
n
Selected motif types for 9 time intervals for optimal K
[email protected] -- p.25/30
Significant motifs match experimental PSWMs
Index TF name Consensus Expt. Phase4 MCM1 CGAAGAG/CTCTTCG CCNNNWWRGG M6 GCN1 TCAGTCA/TGACTGA TCAGTCA7 [CSRE] GGACAGA/TCTGTCC [YCGGAYRRAWGG]
10 MCB ACGCGTA/TACGCGT WCGCGW G116 SFF AACAACA/TGTTGTT GTMAACAA M18 MCM1 CCAATTAGG/CCTAATTGG CCNNNWWRGG M20 [RME1] TTCAGGTAC/GTACCTGAA [GAACCTCAA]22 SCB CGCGAAAAA/TTTTTCGCG CNCGAAA G125 PHO4 CGTACGTAC/GTACGTACG CACGTK29 − CTTCGCATC/GATGCGAAG
K ≡ {G or T} ; M ≡ {A or C} ; N ≡ {A or C or G or T} ; R ≡ {A or G} ; W ≡ {A or T} ; Y ≡ {C or T}
[email protected] -- p.26/30
Summary
Treating gene expression clustering as a variable may help indiscovering relationships between functional sequence motifs, andgroups of genes they regulate
Different groups of genes may behave as a cluster at differenttime points
Upstream sequence motifs “constant” but effects/interactionsover time may vary
[email protected] -- p.27/30
Further extensions
Motif scoring issues: sensitivity, co-occurrence of sites
Efficient model selection
Extension to high density ChIP tiling arrays
Acknowledgement:Joseph G. Ibrahim (UNC), Jason Lieb (UNC)
UNC high-performance scientific computing group
[email protected] -- p.28/30
Model Selection: number of clusters K
Likelihood-based methods not valid (BIC, etc.)
Bayes factor: ratio of model marginal probabilitiesMarginal probability for model MK
P(Y |MK) = ∑z
∫
θP(Y |Z,K)p(θ|z)p(Z|K)dθ
Double mixture importance sampling
P(Y |MK)=̂1
NtNs∑
t∑
sP(Y |Z(t)
,θ(s),K)
π1(θ(s)|z(t))
f1(θ(s)|z(t))
π2(z(t))
f2(z(t))
[email protected] -- p.29/30
Model Selection: Number of clusters
Double mixture importance sampling
P(Y |MK)=̂1
NtNs∑
t∑
sP(Y |Z(t)
,θ(s),K)
π1(θ(s)|z(t))
f1(θ(s)|z(t))
π2(z(t))
f2(z(t))
Challenge: Good sampling densities f (·) for z and θ
For a simpler case (θ marginalized), Raftery et al (TR, 2003)propose permutation-based methods to find good candidatesampling densities for z
[email protected] -- p.30/30