A non-Gaussian model for causal discovery in the presence of hidden common causes
TRANSCRIPT
Shohei Shimizu
Shiga University / Osaka University
Japan
A non-Gaussian model for causal
discovery in the presence of hidden
common causes
2016 Munich Workshop on
Causal Inference and Information Theory
Abstract
• Managing hidden common causes is
essential in causal discovery
• Non-causally-related observed variables
can be correlated due to hidden common
causes
• Propose a linear non-Gaussian model for
estimating causal direction in cases with
hidden common causes
Motivation
Illustrative example
Strong correlation between chocolate
consumption and number of Nobel
laureates (Messerli12NEJM)
[Figure: scatter plot, 2002-2011, of chocolate consumption (kg/yr/capita) against number of Nobel laureates per 10 million population]
Corr. 0.791
P-value < 0.001
Does eating more chocolate increase the
number of Nobel laureates?
• Interpretational drift (Maurage+13, J. Nutrition)
[Figure: three candidate causal graphs relating Chocolate (Chclt) and Nobel, each involving GDP as a hidden common cause, set against the observed correlation (Corr. 0.791, P-value < 0.001)]
Manage this gap!
Formulating the problem
Structural causal models (Pearl, 2000,2009; cf. Bollen, 1989)
• A framework for describing causal relations
• Generally speaking, if the value of 𝑥1 has
been changed and then that of 𝑥2 changes,
then 𝑥1 causes 𝑥2
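$$x_1 = g_1(f, e_1), \qquad x_2 = g_2(x_1, f, e_2)$$
[Figure: causal graph over x1 and x2 with hidden common cause f and errors e1, e2; example labels: Chclt, Nobel, GDP]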
Challenge in causal discovery
• Assume that one of the following three models, each with a hidden common cause f, generated the data
– x1 → x2:
$$x_1 = g_1(f, e_1), \qquad x_2 = g_2(x_1, f, e_2)$$
– x1 ← x2:
$$x_1 = g_1(x_2, f, e_1), \qquad x_2 = g_2(f, e_2)$$
– No direct effect between x1 and x2:
$$x_1 = g_1(f, e_1), \qquad x_2 = g_2(f, e_2)$$
– Each model comes with distributions $p(e_1)$, $p(e_2)$, $p(f)$
• Data matrix: observations obs.1, obs.2, …, obs.n of (x1, x2), drawn i.i.d. from $p(x_1, x_2)$
• Estimate which of the three models generated the data
Under what conditions
can we manage the gap?
• We have shown that it is possible under three
assumptions: i) linearity; ii) acyclicity;
iii) non-Gaussianity (Hoyer+08IJAR; Shimizu+14JMLR)
• The classical Bayesian network approach cannot do this
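[Figure: three candidate graphs over x1 and x2 sharing a hidden common cause f1: x1 → x2, x1 ← x2, or no direct effect]
x1 → x2:
$$x_1 = \lambda_{11} f_1 + e_1, \qquad x_2 = b_{21} x_1 + \lambda_{21} f_1 + e_2$$
x1 ← x2:
$$x_1 = b_{12} x_2 + \lambda_{11} f_1 + e_1, \qquad x_2 = \lambda_{21} f_1 + e_2$$
No direct effect:
$$x_1 = \lambda_{11} f_1 + e_1, \qquad x_2 = \lambda_{21} f_1 + e_2$$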
Basic non-Gaussian model
(No hidden common cause)
S. Shimizu, P. O. Hoyer, A. Hyvärinen
and A. Kerminen
Journal of Machine Learning Research
2006
Linear Non-Gaussian Acyclic
Model (LiNGAM) (Shimizu et al., 2006)
• Identifiable: causal directions and coefficients
• Various extensions including nonlinear (Hoyer+08NIPS,
Zhang+09UAI) and cyclic (Lacerda+08UAI) models
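$$x_i = \sum_{j \neq i} b_{ij} x_j + e_i$$
[Figure: example graph over x1, x2, x3 with coefficients b21, b23, b13 and errors e1, e2, e3]
• Linearity
• Acyclicity
• Non-Gaussian errors $e_i$
• Independence of errors $e_i$
(no hidden common causes)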
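As a minimal sketch of how data arise under this model, the snippet below draws samples from a three-variable LiNGAM with the edges of the example graph (x3 → x1, x3 → x2, x1 → x2). The coefficient values, the uniform error distribution, and the function name `sample_lingam` are illustrative assumptions, not values from the talk.

```python
import random


def sample_lingam(n, b21=0.8, b23=-0.5, b13=0.6, seed=0):
    """Draw n samples (x1, x2, x3) from x_i = sum_j b_ij * x_j + e_i."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        # Independent, non-Gaussian errors (uniform is an assumed choice).
        e1, e2, e3 = (rng.uniform(-1.0, 1.0) for _ in range(3))
        # Acyclicity lets us generate the variables in a causal order: x3, x1, x2.
        x3 = e3
        x1 = b13 * x3 + e1
        x2 = b21 * x1 + b23 * x3 + e2
        samples.append((x1, x2, x3))
    return samples


for x1, x2, x3 in sample_lingam(5):
    print(f"{x1:+.3f} {x2:+.3f} {x3:+.3f}")
```

Because the graph is acyclic, each variable can be computed from variables earlier in the causal order plus its own error, which is all the sampler relies on.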
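Different directions give
different data distributions
Model 1:
$$x_1 = e_1, \qquad x_2 = 0.8\,x_1 + e_2$$
Model 2:
$$x_1 = 0.8\,x_2 + e_1, \qquad x_2 = e_2$$
$$E(e_1) = E(e_2) = 0, \qquad \mathrm{var}(x_1) = \mathrm{var}(x_2) = 1$$
[Figure: scatter plots of x1 against x2 under Model 1 and Model 2, for Gaussian errors and for non-Gaussian (ex. uniform) errors]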
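The contrast between Gaussian and non-Gaussian errors can be checked numerically. The sketch below simulates Model 1 (x1 = e1, x2 = 0.8 x1 + e2, both variances equal to 1), regresses in both directions, and measures dependence between the regressor and the residual through the correlation of their squares. That dependence score is an illustrative choice made here, not the estimation method of the talk, and all function names are hypothetical.

```python
import random


def simulate_model1(n, noise, seed=0):
    """Model 1: x1 = e1, x2 = 0.8*x1 + e2, scaled so var(x1) = var(x2) = 1."""
    rng = random.Random(seed)

    def draw(sd):
        if noise == "gaussian":
            return rng.gauss(0.0, sd)
        half_width = sd * 3 ** 0.5  # uniform on [-w, w] has sd w / sqrt(3)
        return rng.uniform(-half_width, half_width)

    x1 = [draw(1.0) for _ in range(n)]
    x2 = [0.8 * a + draw(0.6) for a in x1]  # var(e2) = 1 - 0.8**2 = 0.36
    return x1, x2


def corr(u, v):
    """Pearson correlation of two equal-length lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cuv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    cuu = sum((a - mu) ** 2 for a in u)
    cvv = sum((b - mv) ** 2 for b in v)
    return cuv / (cuu * cvv) ** 0.5


def residual_dependence(cause, effect):
    """Regress effect on cause by least squares; return corr(cause^2, residual^2).

    The residual is uncorrelated with the regressor by construction, so the
    squares serve as a crude check for dependence beyond correlation.
    """
    n = len(cause)
    mc, me = sum(cause) / n, sum(effect) / n
    c = [a - mc for a in cause]
    e = [b - me for b in effect]
    slope = sum(a * b for a, b in zip(c, e)) / sum(a * a for a in c)
    resid = [b - slope * a for a, b in zip(c, e)]
    return corr([a * a for a in c], [r * r for r in resid])


for noise in ("gaussian", "uniform"):
    x1, x2 = simulate_model1(20000, noise)
    print(noise,
          "x1->x2:", round(residual_dependence(x1, x2), 3),
          "x2->x1:", round(residual_dependence(x2, x1), 3))
```

With Gaussian errors both directions should give a score near zero, so the direction cannot be told apart this way. With uniform errors only the true direction (x1 → x2) should leave the residual independent of the regressor.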
Independent Component Analysis
(ICA) (Jutten & Herault, 1991; Comon, 1994; Hyvärinen et al., 2001)
• Observed variables $x_i$ are modeled by
$$x_i = \sum_{j=1}^{p} a_{ij} s_j \quad \text{or} \quad \mathbf{x} = \mathbf{A}\mathbf{s}$$
where
– Hidden variables $s_j$ ($j = 1, \ldots, p$) are non-Gaussian and independent
• Then, the mixing matrix A is identifiable up to permutation and scaling of the columns
Sketch of the identifiability proof
• Different directions give different zero/non-
zero patterns of the mixing matrices
– No zeros on the diagonal in the causal model
– No permutation indeterminacy
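Model 1:
$$x_1 = e_1, \quad x_2 = b_{21} x_1 + e_2 \quad\Longleftrightarrow\quad \underbrace{\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}}_{\mathbf{x}} = \underbrace{\begin{pmatrix} 1 & 0 \\ b_{21} & 1 \end{pmatrix}}_{\mathbf{A}} \underbrace{\begin{pmatrix} e_1 \\ e_2 \end{pmatrix}}_{\mathbf{s}}$$
Model 2:
$$x_1 = b_{12} x_2 + e_1, \quad x_2 = e_2 \quad\Longleftrightarrow\quad \underbrace{\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}}_{\mathbf{x}} = \underbrace{\begin{pmatrix} 1 & b_{12} \\ 0 & 1 \end{pmatrix}}_{\mathbf{A}} \underbrace{\begin{pmatrix} e_1 \\ e_2 \end{pmatrix}}_{\mathbf{s}}$$
[Figure: graphs of Model 1 and Model 2 over x1 and x2 with errors e1 and e2; the zero entry of A is highlighted in each model]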
LiNGAM with hidden
common causes
P. O. Hoyer, S. Shimizu, A. Kerminen,
and M. Palviainen
Int. J. Approximate Reasoning
2008
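LiNGAM with hidden
common causes (Hoyer+08IJAR)
• Extension to incorporate non-Gaussian hidden
common causes
$$x_i = \sum_{q=1}^{Q} \lambda_{iq} f_q + \sum_{j \neq i} b_{ij} x_j + e_i$$
where $f_q$ ($q = 1, \ldots, Q$) are independent
Example:
$$x_1 = \sum_{q=1}^{Q} \lambda_{1q} f_q + e_1, \qquad x_2 = \sum_{q=1}^{Q} \lambda_{2q} f_q + b_{21} x_1 + e_2$$
[Figure: graph over x1 and x2 with hidden common causes f1 and f2 and errors e1 and e2]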
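Without loss of generality, hidden common causes
are assumed to be independent
[Figure: two graphs over x1 and x2 with errors e1 and e2: one with dependent hidden common causes f1, f2, and one with independent hidden common causes e_f1, e_f2]
$$\begin{pmatrix} f_1 \\ f_2 \end{pmatrix} = \begin{pmatrix} a_{11} & 0 \\ a_{21} & a_{22} \end{pmatrix} \begin{pmatrix} e_{f_1} \\ e_{f_2} \end{pmatrix}$$
[Figure: scatter plots of x1 against x2, labelled Gaussian and Non-Gaussian e1, e2, f1]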
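Different directions give different
zero/non-zero patterns (Hoyer+08IJAR)
• Faithfulness on $x_i$, $f_i$ + Number of $f_i$ given
[Figure: three models over x1 and x2 with one hidden common cause f1]
Zero/non-zero patterns of the mixing matrix A (rows x1, x2; columns written here in the order e1, e2, f1; * marks a non-zero entry):
– x1 → x2:
$$\mathbf{A} = \begin{pmatrix} * & 0 & * \\ * & * & * \end{pmatrix}$$
– x1 ← x2:
$$\mathbf{A} = \begin{pmatrix} * & * & * \\ 0 & * & * \end{pmatrix}$$
– No direct effect:
$$\mathbf{A} = \begin{pmatrix} * & 0 & * \\ 0 & * & * \end{pmatrix}$$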
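The zero/non-zero patterns can be derived by solving the two structural equations for x1 and x2 in terms of the independent sources. The sketch below does this for one hidden common cause f1, assuming the equations x1 = b12 x2 + lam1 f1 + e1 and x2 = b21 x1 + lam2 f1 + e2 with at most one of b21, b12 non-zero. The names `mixing_matrix`, `pattern`, `lam1`, `lam2`, the numeric values, and the column order (e1, e2, f1) are assumptions made for illustration.

```python
def mixing_matrix(b21, b12, lam1, lam2):
    """Return A such that (x1, x2)^T = A (e1, e2, f1)^T.

    Structural equations (assumed form):
        x1 = b12*x2 + lam1*f1 + e1
        x2 = b21*x1 + lam2*f1 + e2
    """
    if b21 != 0 and b12 != 0:
        raise ValueError("acyclic model: at most one of b21, b12 may be non-zero")
    # Substituting one equation into the other gives the reduced form below.
    return [
        [1.0, b12, lam1 + b12 * lam2],
        [b21, 1.0, lam2 + b21 * lam1],
    ]


def pattern(A, tol=1e-12):
    """Render each row of A as a string of '*' (non-zero) and '0' (zero)."""
    return ["".join("*" if abs(a) > tol else "0" for a in row) for row in A]


models = {
    "x1 -> x2": mixing_matrix(0.8, 0.0, 0.5, 0.7),
    "x1 <- x2": mixing_matrix(0.0, 0.8, 0.5, 0.7),
    "no direct effect": mixing_matrix(0.0, 0.0, 0.5, 0.7),
}
for name, A in models.items():
    print(name, pattern(A))
```

Faithfulness matters here: if lam2 + b21*lam1 happened to cancel to zero, an entry that is structurally non-zero would look like a zero and the pattern would be misleading.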
Previous estimation methods (Hoyer+08IJAR; Henao+11JMLR)
• Explicitly model hidden common causes
• Compare models by the maximum likelihood
principle or by a Bayesian approach
• Need to specify the number and distributions of the hidden common causes,
which is difficult in general
[Figure: candidate graphs over x1 and x2 with hidden common causes f1, …, fQ and errors e1, e2]
Our proposal:
A Bayesian LiNGAM
approach
S. Shimizu and K. Bollen.
Journal of Machine Learning Research,
2014
and something extra
Key idea (1/2)
• Transform the model to a model with
no hidden common causes
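[Figure: a LiNGAM with hidden common causes (x1, x2, f1, …, fQ, errors e1, e2, coefficient b21) is rewritten as a LiNGAM with no hidden common causes but with possibly different intercepts over obs.: each observation, from the 1st to the m-th, has its own variables $x_1^{(m)}$, $x_2^{(m)}$, intercepts $\mu_1^{(m)}$, $\mu_2^{(m)}$ and errors $e_1^{(m)}$, $e_2^{(m)}$, with the same coefficient b21]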
Key idea (2/2)
• Include the sums of hidden common causes as
model parameters, i.e., observation-specific
intercepts. For the m-th obs.:
$$x_2^{(m)} = \underbrace{\sum_{q=1}^{Q} \lambda_{2q} f_q^{(m)}}_{\mu_2^{(m)}:\ \text{obs.-specific intercept}} + b_{21} x_1^{(m)} + e_2^{(m)}$$
• Hidden common causes are not modeled explicitly
– No need to specify the number of hidden
common causes Q or to estimate their coefficients
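To make the role of the observation-specific intercepts concrete, the sketch below simulates the direction x1 → x2 with Q hidden common causes and records, for each observation m, the sums mu1 = Σ_q lam1[q] f_q and mu2 = Σ_q lam2[q] f_q. The function names, the coefficient values, the uniform distribution of the hidden causes, and the Laplace errors are all illustrative assumptions.

```python
import random


def simulate_with_intercepts(n, lam1, lam2, b21, seed=0):
    """Simulate x1 -> x2 with Q = len(lam1) hidden common causes.

    Returns one dict per observation holding the data (x1, x2), the
    observation-specific intercepts (mu1, mu2) and the errors (e1, e2).
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        # Hidden common causes for this observation (assumed uniform).
        f = [rng.uniform(-1.0, 1.0) for _ in lam1]
        # The difference of two unit exponentials is a Laplace variate.
        e1 = rng.expovariate(1.0) - rng.expovariate(1.0)
        e2 = rng.expovariate(1.0) - rng.expovariate(1.0)
        # Sums of hidden common causes act as intercepts specific to this obs.
        mu1 = sum(l * fq for l, fq in zip(lam1, f))
        mu2 = sum(l * fq for l, fq in zip(lam2, f))
        x1 = mu1 + e1
        x2 = mu2 + b21 * x1 + e2
        rows.append({"x1": x1, "x2": x2, "mu1": mu1, "mu2": mu2, "e1": e1, "e2": e2})
    return rows


def corr(u, v):
    """Pearson correlation of two equal-length lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cuv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    cuu = sum((a - mu) ** 2 for a in u)
    cvv = sum((b - mv) ** 2 for b in v)
    return cuv / (cuu * cvv) ** 0.5


rows = simulate_with_intercepts(5000, lam1=[1.0, 0.5], lam2=[0.8, -0.3], b21=0.8)
print("corr(mu1, mu2):", round(corr([r["mu1"] for r in rows], [r["mu2"] for r in rows]), 3))
```

Only mu1 and mu2 enter the equations for x1 and x2, so a method that treats them as parameters never needs Q or the individual coefficients. The two intercepts are correlated across observations because they share the same hidden causes.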
Bayesian model selection
Model 3 ($x_1 \rightarrow x_2$):
$$x_1^{(m)} = \mu_1^{(m)} + e_1^{(m)}, \qquad x_2^{(m)} = \mu_2^{(m)} + b_{21} x_1^{(m)} + e_2^{(m)}$$
Model 4 ($x_1 \leftarrow x_2$):
$$x_1^{(m)} = \mu_1^{(m)} + b_{12} x_2^{(m)} + e_1^{(m)}, \qquad x_2^{(m)} = \mu_2^{(m)} + e_2^{(m)}$$
• Compare the marginal likelihoods of the two models, with the data standardized
• Once a direction has been estimated, compute the
posterior of the connection strength b21 or b12
• Many obs.-specific intercepts $\mu_i^{(m)}$ ($i = 1, 2$; $m = 1, \ldots, n$)
– Similar to mixed models and multi-level models
– Informative prior
Prior for the observation-specific
intercepts
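• Motivation: Central limit theorem
– Sums of independent variables tend to be more Gaussian
• Approximate the density by a bell-shaped distribution
– The two intercepts are dependent due to hidden common causes
$$\begin{pmatrix} \mu_1^{(m)} \\ \mu_2^{(m)} \end{pmatrix} = \begin{pmatrix} \sum_{q=1}^{Q} \lambda_{1q} f_q^{(m)} \\ \sum_{q=1}^{Q} \lambda_{2q} f_q^{(m)} \end{pmatrix} \sim \text{t-distribution with sd } \tau_1, \tau_2, \text{ correlation } \sigma_{12}, \text{ and DOF } v$$
(here, $v = 8$)
• Select the hyper-parameter values
that maximize the marginal likelihood
$$\tau_1, \tau_2 \in \{0.4, 0.6, 0.8\}$$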
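A prior of this kind can be sampled with the standard construction of a multivariate t variate: a correlated Gaussian pair divided by the square root of a scaled chi-square draw. The sketch below is one way to do it under the assumption that "sd" means the actual standard deviation of each intercept, which requires rescaling by sqrt((dof - 2) / dof). The names `sample_intercepts`, `tau1`, `tau2`, `sigma12` and the example values are hypothetical.

```python
import math
import random


def sample_intercepts(n, tau1, tau2, sigma12, dof=8, seed=0):
    """Draw n pairs (mu1, mu2) from a bivariate t-distribution.

    tau1, tau2: target standard deviations; sigma12: correlation; dof > 2.
    """
    rng = random.Random(seed)
    # A standard t variate has variance dof / (dof - 2); undo that inflation.
    scale = math.sqrt((dof - 2) / dof)
    pairs = []
    for _ in range(n):
        # Correlated standard Gaussian pair.
        z1 = rng.gauss(0.0, 1.0)
        z2 = sigma12 * z1 + math.sqrt(1.0 - sigma12 ** 2) * rng.gauss(0.0, 1.0)
        # chi-square(dof) / dof, drawn as a gamma variate with shape dof/2, scale 2.
        w = rng.gammavariate(dof / 2.0, 2.0) / dof
        t = scale / math.sqrt(w)  # shared by both coordinates
        pairs.append((tau1 * z1 * t, tau2 * z2 * t))
    return pairs


draws = sample_intercepts(5, tau1=0.6, tau2=0.8, sigma12=0.5)
for mu1, mu2 in draws:
    print(f"{mu1:+.3f} {mu2:+.3f}")
```

Sharing the same chi-square draw across both coordinates is what makes the pair jointly t-distributed rather than two independent t variates that merely happen to be correlated.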
Error distributions and other
priors used in the experiment
• Error distributions $p(e_1)$, $p(e_2)$
– Fixed to be the Laplace distribution
– Could instead be estimated by assuming a family such as the
generalized Gaussian distributions
• Priors for the other parameters
$$b_{12} \sim N(0, 0.75^2), \qquad b_{21} \sim N(0, 0.75^2), \qquad \sigma_{12} \sim U(-1, 1)$$
$$\mathrm{std}(e_1) \sim U(0, 1), \qquad \mathrm{std}(e_2) \sim U(0, 1)$$
Experiment on sociology data
Sociology data
• Source: General Social Survey (n=1380)
– Non-farm background, ages 35-44, white, male, in the labor
force, no missing data for any of the covariates, 1972-2006
• 15 pairs with known temporal directions (Duncan+1972)
Status attainment model (Duncan et al., 1972)
[Figure: status attainment model showing the known (temporal) orderings of 15 pairs among Father's Education, Son's Education, Son's Occupation, Son's Income, …, with a hidden common cause f1]
x2: Son's Income
[Table: Numbers of successes (n=1380)]
Cf. LiNGAM-GU-UK (Chen+13NECO): 0.20; PNL (Zhang+09UAI): 0.60
Conclusion
• Estimation of causal direction in the presence of
hidden common causes is a major challenge in
causal discovery
• Proposed a linear non-Gaussian SEM approach
– Not necessary to model individual hidden common
causes
• Future directions
– Cyclic cases: Using some prior for forcing the
identifiability condition of Lacerda+08UAI?
– Non-stationarity: Combining with Kun’s method
(Huang+15IJCAI)?