
Page 1:

Part V
Causal Discovery 2: Linear, non-Gaussian Models

• “Independence mechanism” instantiation 2: Independent noise condition

• Causal discovery based on structural equation models: linear non-Gaussian case

• Extensions to deal with confounders/cycles

Page 2:

Fully Identifiable Causal Structure? Two-Variable Case.

• Structural equation model / functional causal model

• Related to this type of “independence”:

• Start with the linear case

• Determine causal direction in the two-variable case? Identifiability!

(Figure: a functional causal model X → Y with noise E, the factorization P(X) → X → P(Y|X) → Y, and the candidate structures for an observed pair: X → Y, Y → X, or a common cause Z.)

Y = f(X, E), where E ⫫ X

Y = aX + E, where E ⫫ X

Sample data:

  X     Y
 -1.1   1.0
  2.1   2.0
  3.1   4.2
  2.3  -0.6
  1.3   2.2
 -1.8   0.9
  ...   ...

Page 3:

Gaussian vs. Non-Gaussian Distributions

(Figure: "Three distributions with zero mean and unit variance": density curves of the Gaussian, Laplacian, and uniform distributions, each with a sample time series of 500 points.)
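The three densities are easy to reproduce. A minimal sketch (my own, not from the slides): draw samples from each distribution, scaled so that all three have zero mean and unit variance.

```python
# Zero-mean, unit-variance versions of the three distributions in the figure.
import numpy as np

rng = np.random.default_rng(0)
n = 500
gaussian = rng.standard_normal(n)                      # variance 1
laplacian = rng.laplace(scale=1 / np.sqrt(2), size=n)  # var = 2*scale^2 = 1
uniform = rng.uniform(-np.sqrt(3), np.sqrt(3), n)      # var = (b-a)^2/12 = 1

for name, x in [("Gaussian", gaussian), ("Laplacian", laplacian), ("Uniform", uniform)]:
    kurt = ((x - x.mean()) ** 4).mean() / x.var() ** 2 - 3   # excess kurtosis
    print(f"{name}: mean={x.mean():.2f}, var={x.var():.2f}, excess kurtosis={kurt:.2f}")
```

The excess kurtosis separates them: positive for the Laplacian (super-Gaussian), negative for the uniform (sub-Gaussian), and zero for the Gaussian.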

Page 4:

Causal Asymmetry in the Linear Case: Illustration

Data generated by Y = aX + E (i.e., X → Y):

• Linear regression Y = aX + E_Y (causal direction)

• Linear regression X = bY + E_X (reverse direction)

(Figure: scatter plots with both regression lines, and plots of the residual E_Y against X and of E_X against Y, for the Gaussian case and the uniform case. In the Gaussian case the residual looks independent of the regressor in both directions; in the uniform case E_X is visibly dependent on Y.)
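A minimal sketch (my own, not from the slides) of this asymmetry: generate Y = aX + E with uniform noise, fit a linear regression in both directions, and probe the dependence between residual and regressor. The residual is uncorrelated with the regressor by construction, so a nonlinear check (here, correlation of squares) is used.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
X = rng.uniform(-1, 1, n)           # non-Gaussian cause
E = rng.uniform(-1, 1, n)           # non-Gaussian noise, independent of X
Y = 1.0 * X + E                     # true model: X -> Y

def ols_residual(target, regressor):
    b = np.cov(target, regressor)[0, 1] / np.var(regressor)
    return target - b * regressor

E_Y = ols_residual(Y, X)            # residual in the causal direction
E_X = ols_residual(X, Y)            # residual in the reverse direction

def sq_corr(u, v):                  # a crude nonlinear dependence check
    return np.corrcoef(u ** 2, v ** 2)[0, 1]

print("corr(E_Y^2, X^2):", round(sq_corr(E_Y, X), 3))   # close to 0
print("corr(E_X^2, Y^2):", round(sq_corr(E_X, Y), 3))   # clearly nonzero
```

With Gaussian X and E, both numbers would be close to zero, matching the non-identifiable Gaussian case in the figure.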

Page 5:

Super-Gaussian Case

Data generated by Y = aX + E (X → Y):

(Figure: the same illustration with super-Gaussian noise; again the reverse-direction residual E_X is visibly dependent on Y, while E_Y is independent of X.)

Page 6:

More Generally: the LiNGAM Model

• Linear, Non-Gaussian, Acyclic causal Model (LiNGAM) (Shimizu et al., 2006):

  X_i = Σ_{j: parents of i} b_ij·X_j + E_i,   or   X = BX + E

• Disturbances (errors) E_i are non-Gaussian (or at most one is Gaussian) and mutually independent

• Example (graph: X2 → X3 with weight 0.5, X2 → X1 with weight −0.2, X3 → X1 with weight 0.3):

  X2 = E2,
  X3 = 0.5·X2 + E3,
  X1 = −0.2·X2 + 0.3·X3 + E1.

Shimizu et al. (2006). A linear non-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7:2003–2030.
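A minimal sketch (my own) of the example above: simulate the three structural equations with uniform disturbances and verify the matrix form X = BX + E.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# Mutually independent, non-Gaussian disturbances
E1, E2, E3 = rng.uniform(-1, 1, (3, n))

# Structural equations, following the causal ordering X2, X3, X1
X2 = E2
X3 = 0.5 * X2 + E3
X1 = -0.2 * X2 + 0.3 * X3 + E1

# Matrix form X = BX + E with variables ordered (X1, X2, X3)
B = np.array([[0.0, -0.2, 0.3],
              [0.0,  0.0, 0.0],
              [0.0,  0.5, 0.0]])
X = np.vstack([X1, X2, X3])
E = np.vstack([E1, E2, E3])
assert np.allclose(X, B @ X + E)    # the SEM holds exactly
```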

Page 7:

Identifiability of Causal Direction in the Linear Case

• Supported by the “independent component analysis” theory

• Later this will be approached in a more general nonlinear setting, and you'll see the linear-Gaussian case is one of the few non-identifiable situations

Page 8:

Darmois-Skitovich Theorem & Identifiability of Causal Direction in the Two-Variable Case

Darmois-Skitovich theorem: Define two random variables, Y1 and Y2, as linear combinations of independent random variables S_i, i = 1, ..., n:

  Y1 = α1·S1 + α2·S2 + ... + αn·Sn,
  Y2 = β1·S1 + β2·S2 + ... + βn·Sn.

If Y1 and Y2 are statistically independent, then all variables S_j for which αj·βj ≠ 0 are Gaussian.

Kagan et al., Characterization Problems in Mathematical Statistics. New York: Wiley, 1973

Generated by Y = aX + E (X → Y). Assuming Y → X (fitting X = bY + E_Y):

  [E_Y; Y] = [1, −b; 0, 1] · [X; Y] = [1 − ab, −b; a, 1] · [X; E]

So E_Y and Y are both linear combinations of the independent variables X and E, with all four coefficients non-zero whenever a ≠ 0 (which also gives b ≠ 0 and ab ≠ 1). By the Darmois-Skitovich theorem, E_Y ⫫ Y would then force both X and E to be Gaussian; hence, if at most one of them is Gaussian, the reverse model cannot have an independent error, and the causal direction is identifiable.

Page 9:

Independent Component Analysis

(Figure: independent sources S1, ..., Sn pass through an unknown mixing system A to give the observed signals X1, ..., Xm; the ICA de-mixing system W produces outputs Y1, ..., Yn that are as independent as possible, thereby estimating A.)

  Mixing: X = A·S.  De-mixing: Y = W·X.

• Assumptions in ICA:

  • At most one of the S_i is Gaussian

  • #Sensors ≥ #Sources (m ≥ n), and A is of full column rank

Only the observed signals are given; both factors on the right are unknown:

  [0.5, 0.3, 1.1, −0.3, ...; 0.8, −0.7, 0.3, 0.5, ...] = [?, ?; ?, ?] · [?, ?, ?, ?, ...; ?, ?, ?, ?, ...]

Then A can be estimated up to column scale and permutation indeterminacies.

Hyvärinen et al., Independent Component Analysis, 2001
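A minimal sketch (assuming scikit-learn is available; my own example, not from the slides): two uniform sources, a 2×2 mixing matrix, and FastICA recovering the mixing up to column scale and permutation.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
n = 5000
S = rng.uniform(-1, 1, (n, 2))            # independent non-Gaussian sources
A = np.array([[1.0, 0.5],
              [0.3, 1.0]])                # unknown mixing matrix
X = S @ A.T                               # observed mixtures: X = A·S

ica = FastICA(n_components=2, random_state=0)
Y = ica.fit_transform(X)                  # estimated sources Y = W·X

# The estimated mixing matrix matches A up to column scale and permutation.
print(np.round(ica.mixing_, 2))
```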

Page 10:

Recovering the Data-Generating Process

(Figure: sources S1, S2 mixed by A into the observed signals X1, ..., Xm; X = A·S.)

• This is a causal process, but it may be less interesting because the S_k are usually not observable/manipulable

• When the S_k are identical to the X_j, we have causal relations among the X_j

Page 11:

Intuition: Why Does ICA Work?

• (After preprocessing) ICA aims to find a rotation transformation Y = W·X that makes the Y_i independent

• Via maximum likelihood log p(X|A), mutual information MI(Y1, ..., Ym) minimization, infomax, ...

(Figure: scatter plots of (X1, X2) and of candidate rotated outputs (Y1, Y2); only one rotation makes the joint scatter factorize into independent components.)

Page 12:

How ICA Works: By Maximum Likelihood

• From a maximum likelihood perspective:

  X = A·S,  Y = W·X,  p_S = ∏_{i=1}^n p_{S_i}
  ⇒ p_X = ∏_{i=1}^n p_{S_i}(w_i^T X) / |det A|
  ⇒ Σ_{t=1}^T log p_X(X^t) = Σ_{t=1}^T Σ_{i=1}^n log p_{S_i}(w_i^T X^t) + T·log|det W|

  (X^t: the t-th data point of X; w_i^T: the i-th row of W.)

• To be maximized by gradient-based or natural-gradient-based methods

Reminder: the product and sum rules of probability are fundamental:
  1. P(X, Y | I) = P(X | Y, I)·P(Y | I)
  2. P(X | I) = Σ_Y P(X, Y | I)
  (I: background information. In the continuous case the sum is replaced by an integral.)
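A minimal sketch (my own) of the natural-gradient maximum-likelihood update, W ← W + η·(I − φ(Y)·Yᵀ/T)·W with φ(y) = tanh(y), which corresponds to a super-Gaussian prior on the sources; it assumes the data have been centered and whitened.

```python
import numpy as np

def natural_gradient_ica(X, n_iter=500, eta=0.1, seed=0):
    """X: (m, T) centered, whitened data. Returns a de-mixing matrix W."""
    m, T = X.shape
    rng = np.random.default_rng(seed)
    W = np.eye(m) + 0.01 * rng.standard_normal((m, m))
    for _ in range(n_iter):
        Y = W @ X                        # current source estimates
        phi = np.tanh(Y)                 # score function of the assumed prior
        W += eta * (np.eye(m) - phi @ Y.T / T) @ W   # natural-gradient step
    return W
```

With sub-Gaussian sources (e.g., uniform), a different score function such as φ(y) = y³ should be used instead.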

Page 13:

How ICA Works: By Mutual Information Minimization

• Mutual information I(Y1, ..., Yn) is the Kullback-Leibler divergence from P_Y to ∏_i P_{Y_i}:

  I(Y1, ..., Yn) = ∫···∫ p_{Y1,...,Yn} · log [ p_{Y1,...,Yn} / (p_{Y1} ··· p_{Yn}) ] dy1 ··· dyn
                 = ∫···∫ p_{Y1,...,Yn} · log p_{Y1,...,Yn} dy1 ··· dyn − Σ_{i=1}^n ∫ p_{Y_i} · log p_{Y_i} dy_i
                 = Σ_i H(Y_i) − H(Y)
                 = Σ_i H(Y_i) − H(X) − log|det W|,  because Y = WX

• Nonnegative, and zero iff the Y_i are independent

• H(·): differential entropy, a measure of how random the variable is

Hyvärinen et al., Independent Component Analysis, 2001

Page 14:

How ICA Works: Some Interpretation

• Some methods (e.g., FastICA, JADE) pre-whiten the data and then aim to find a rotation, for which |det W| = 1:

  I(Y1, ..., Yn) = Σ_i H(Y_i) − H(X) − log|det W| = Σ_i H(Y_i) + const.

• Minimizing I ⇔ minimizing the entropies H(Y_i)

• Given the variance, the Gaussian distribution has the largest entropy (among all continuous distributions)

• ⇒ Maximizing non-Gaussianity!

• FastICA adopts approximations of the negentropy of each output Y_i

Page 15:

Non-Gaussianity is Informative in the Linear Case

• Smaller entropy: more structured, more interesting

• “Purer” according to the central limit theorem

Excerpt from Hyvärinen et al. (2001), Ch. 8, "ICA by Maximization of Nongaussianity":

Fig. 8.26 An illustration of projection pursuit and the "interesting" directions. The data in this figure is clearly divided into two clusters. The goal in projection pursuit is to find the projection (here, on the horizontal axis) that reveals the clustering or other structure of the data.

"... the other hand, the projection on the vertical direction, which is also the direction of the first principal component, fails to show this structure. This also shows that PCA does not use the clustering structure. In fact, clustering structure is not visible in the covariance or correlation matrix on which PCA is based.

Thus projection pursuit is usually performed by finding the most nongaussian projections of the data. This is the same thing that we did in this chapter to estimate the ICA model. This means that all the nongaussianity measures and the corresponding ICA algorithms presented in this chapter could also be called projection pursuit "indices" and algorithms.

It should be noted that in the formulation of projection pursuit, no data model or assumption about independent components is made. If the ICA model holds, optimizing the ICA nongaussianity measures produces independent components; if the model does not hold, then what we get are the projection pursuit directions.

8.6 CONCLUDING REMARKS AND REFERENCES

A fundamental approach to ICA is given by the principle of nongaussianity. The independent components can be found by finding directions in which the data is maximally nongaussian. Nongaussianity can be measured by entropy-based measures or cumulant-based measures like kurtosis. Estimation of the ICA model can then be performed by maximizing such nongaussianity measures; this can be done by gradient methods or by fixed-point algorithms. Several independent components can be found by finding several directions of maximum nongaussianity under the constraint of decorrelation."

Which direction is more interesting?

Hyvärinen et al., Independent Component Analysis, 2001

Page 16:

A Demo of the ICA Procedure

Page 17:

(ICA demo figures; no text.)

Page 18:

(ICA demo figures; no text.)

Page 19:

Why Gaussianity Was Widely Used?

• Central limit theorem: An illustration

• “Simplicity” of the form; completely characterized by mean and covariance

• Marginal and conditionals are also Gaussian

• Has maximum entropy, given values of the mean and the covariance matrix

E. T. Jaynes. Probability Theory: The Logic of Science. 1994. Chapter 7.

(Figure: histograms illustrating the central limit theorem with independent uniform U_i: hist(U_i), hist((U1+U2)/sqrt(2)), and hist((U1+U2+U3)/sqrt(3)); the standardized sums look increasingly Gaussian.)
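A minimal sketch (my own) reproducing the histogram illustration: standardized sums of independent uniform variables approach a Gaussian shape.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
U = rng.uniform(-0.5, 0.5, (3, 100_000))    # U1, U2, U3

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].hist(U[0], bins=60)
axes[0].set_title("hist(U_i)")
axes[1].hist(U[:2].sum(axis=0) / np.sqrt(2), bins=60)
axes[1].set_title("hist((U1+U2)/sqrt(2))")
axes[2].hist(U.sum(axis=0) / np.sqrt(3), bins=60)
axes[2].set_title("hist((U1+U2+U3)/sqrt(3))")
plt.show()
```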

Page 20:

Gaussianity or Non-Gaussianity?

• Non-Gaussianity is actually ubiquitous

• Linear closure property of the Gaussian distribution: if a sum of finitely many independent variables is Gaussian, then all summands must be Gaussian (Cramér, 1970)

• The Gaussian distribution is "special" in the linear case

• Practical issue: how non-Gaussian are they?

Page 21:

LiNGAM Analysis by ICA

• LiNGAM:  X_i = Σ_{j: parents of i} b_ij·X_j + E_i,  or  X = BX + E  ⇒  E = (I−B)X

• B has special structure: acyclic relations

• ICA: Y = WX

• B can be seen from W by permutation and re-scaling

• Faithfulness assumption avoided

• E.g.,

  [E1; E3; E2] = [1, 0, 0; −0.5, 1, 0; 0.2, −0.3, 1] · [X2; X3; X1]

  ⇔  X2 = E1,  X3 = 0.5·X2 + E3,  X1 = −0.2·X2 + 0.3·X3 + E2

  So we have the causal relations (graph: X2 → X3 (0.5), X2 → X1 (−0.2), X3 → X1 (0.3)); the matrix above plays the role of W.

Question 1. How to find W?

Question 2. How to see B from W?

Page 22:

(Duplicate of Page 21.)

Page 23:

Can You See Causal Relations from W?

• ICA gives Y = WX with, e.g.,

  W = [0.6, −0.4, 2, 0; 1.5, 0, 0, 0; 0, 0.2, 0, 0.5; 1.5, 3, 0, 0]

• Can we find the causal model?

• LiNGAM: X = BX + E, or E = (I−B)X; ICA gives Y = WX

• So W is a row-permuted and re-scaled version of (I−B). Can we determine B uniquely?

• All diagonal entries of (I−B) are 1, so we have the procedure (a code sketch follows):
  1. First permute the rows of W to make all diagonal entries non-zero, yielding W̃.
  2. Then divide each row of W̃ by its diagonal entry, giving W̃′.
  3. B = I − W̃′.

• Uniqueness? Implied by acyclicity of the causal relations: B can be permuted to strict lower-triangularity, i.e., if the X_i follow the causal ordering, B is strictly lower-triangular.
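A minimal sketch (my own) of the three-step procedure, using an assignment solver for the row permutation; applied to the W above it returns a strictly lower-triangular B.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def b_from_w(W):
    """Recover B from an exact ICA de-mixing matrix W (noise-free case)."""
    # Step 1: permute rows so the diagonal is non-zero; maximizing the
    # product of |diagonal| entries is an assignment problem on -log|W_ij|.
    cost = -np.log(np.abs(W) + 1e-12)
    rows, cols = linear_sum_assignment(cost)
    W_perm = np.empty_like(W)
    W_perm[cols] = W[rows]                 # row i of the result has W_perm[i, i] != 0
    # Step 2: divide each row by its diagonal entry.
    W_scaled = W_perm / np.diag(W_perm)[:, None]
    # Step 3: B = I - W'.
    return np.eye(W.shape[0]) - W_scaled

W = np.array([[0.6, -0.4, 2.0, 0.0],
              [1.5,  0.0, 0.0, 0.0],
              [0.0,  0.2, 0.0, 0.5],
              [1.5,  3.0, 0.0, 0.0]])
print(np.round(b_from_w(W), 2))            # a strictly lower-triangular B
```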

Page 24:

ICA-Based LiNGAM: A Real Example

(Figure: scatter plot of height vs. arm span with the estimated ICA basis vectors A(:,1) and A(:,2) overlaid.)

  [Y1; Y2] = W · [height; arm span],  where W = [1.33, −0.39; −1.56, 2.02],
  or A = W⁻¹ = [0.97, 0.19; 0.75, 0.64].

If W_{1,2} = 0, then

  height = (1/1.33)·Y1,
  arm span = (1.56/2.02)·height + (1/2.02)·Y2.

For small-scale problems, we can compare the dependence between the residual & hypothetical cause in both directions!

Page 25:

Independence Test / Dependence Measure

• Measure: mutual information MI(Y1, Y2) ≥ 0, with equality iff Y1 ⫫ Y2

• Statistical test for independence

• Y1 ⫫ Y2 if and only if f(Y1) and g(Y2) are uncorrelated for all functions f, g

• The functional space can be narrowed down to the reproducing kernel Hilbert space

• HSIC independence test; Kernel-based (conditional) independence test; other tests also exist

Gretton et al. (2008). A kernel statistical test of independence. In Advances in Neural Information Processing Systems, 585–592.

Zhang et al. (2011). Kernel-based conditional independence test and application in causal discovery. In Proc. UAI, 804–813.
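A minimal sketch (my own) of an HSIC-style dependence measure with Gaussian kernels, plus a permutation p-value; this is a simplified illustration, not the calibrated test of Gretton et al. (2008).

```python
import numpy as np

def gram(x, sigma=1.0):
    d2 = (x[:, None] - x[None, :]) ** 2
    return np.exp(-d2 / (2 * sigma ** 2))

def hsic(x, y):
    n = len(x)
    H = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    return np.trace(gram(x) @ H @ gram(y) @ H) / n ** 2   # biased estimate

def perm_pvalue(x, y, n_perm=200, seed=0):
    rng = np.random.default_rng(seed)
    stat = hsic(x, y)
    null = [hsic(x, rng.permutation(y)) for _ in range(n_perm)]
    return float(np.mean([s >= stat for s in null]))
```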

Page 26:

Real Examples: By Checking Independence in Both Directions

(Figure: for each direction, the regression scatter plot and the plot of the residuals against the hypothetical cause.)

• One direction: p-value for the independence test is 0.93; MI = 0.002

• The other direction: p-value for the independence test is 0.53; MI = 0.042

The direction in which residual ⫫ hypothetical cause is better supported, height → arm span, is preferred, consistent with the previous slide.

Page 27:

Some Estimation Methods for LiNGAM

• ICA-LiNGAM

• ICA with Sparse Connections

• DirectLiNGAM...

Shimizu et al. (2006). A linear non-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7:2003–2030.

Zhang et al. (2009). ICA with sparse connections: Revisited. Lecture Notes in Computer Science, 5441:195–202.

Shimizu, et al. (2011). DirectLiNGAM: A direct method for learning a linear non-Gaussian structural equation model. Journal of Machine Learning Research, 12:1225–1248.

Page 28:

ICA-Based LiNGAM: A Demonstration

(Graph over X1, ..., X4 with coefficients 0.5, −0.6, 0.3, 0.8, and −0.5, and uniformly distributed noise.)

What is the output of PC? Of LiNGAM?

The data are in 'LiNGAM_4variables.txt'.

Page 29:

Practical Issues beyond LiNGAM...

• Confounding (SGS 1993; Hoyer et al., 2008)

• Feedback (Lacerda et al., 2008; Richardson 1996)

• Nonlinearities (Zhang & Chan, ICONIP'06; Hoyer et al., NIPS'08; Zhang & Hyvärinen, UAI'09; Huang et al., KDD'18)

• Causality in time series

  • Time-delayed + instantaneous relations (Hyvärinen ICML'08; Zhang et al., ECML'09; Hyvärinen et al., JMLR'10)

  • Subsampling / temporal aggregation (Danks & Plis, NIPS WS'14; Gong et al., ICML'15 & UAI'17)

  • From partially observable time series (Geiger et al., ICML'15)

• Nonstationary/heterogeneous data (Zhang et al., IJCAI'17; Huang et al., ICDM'17)

• Measurement error (Zhang et al., UAI'18; PSA'18)

• Selection bias (Zhang et al., UAI'16)

Page 30:

Are They Confounders?

(Graph over X1, ..., X5 with noises E1, ..., E5 and coefficients 1.2, 2, 3, −0.3, and 0.5. Question: which of X1, X4, X2, X5 act as confounders?)

Page 31:

Identifiability of Overcomplete ICA

• More independent sources than observed variables, i.e., n > m

Theorem: Suppose the random vector X = (X1, ..., Xm)ᵀ is generated by X = AS, where the components of S, namely S1, ..., Sn, are statistically independent. Even when n > m, the columns of A are still identifiable up to a scale transformation if

• all S_i are non-Gaussian, or

• A is of full column rank and at most one of the S_i is Gaussian.

(Figure: mixing diagram with independent sources s1, ..., sn entering an unknown mixing system A that outputs observed signals X1, ..., Xm; here n > m.)

Only the left-hand side is observed:

  [0.5, 0.3, 1.1, −0.3, ...; 0.8, −0.7, 0.3, 0.5, ...] = [?, ?, ?; ?, ?, ?] · [?, ?, ?, ?, ...; ?, ?, ?, ?, ...; ?, ?, ?, ?, ...]

Kagan et al., Characterization Problems in Mathematical Statistics. New York: Wiley, 1973.

Eriksson and Koivunen (2004). Identifiability, separability, and uniqueness of linear ICA models. IEEE Signal Processing Letters, 11(7):601–604.

Page 32:

Overcomplete ICA: Illustration

(Figure: scatter plot of X1 vs. X2; four directions, one per source, are visible in the data cloud.)

  [X1; X2] = [0.8, 0.4, −0.9, 0; 0.3, 0.8, 0.8, 1] · [S1; S2; S3; S4]

What if they are Gaussian?

Page 33:

Causal Discovery under Confounders

(Graph: confounder Z with Z → X1 (a1) and Z → X2 (a2), direct edge X1 → X2 (a3), and noises E1, E2.)

• Can we see the causal direction? Can we determine a3? a1 and a2?

  [X1; X2] = [1, 0, a1; a3, 1, a1·a3 + a2] · [E1; E2; Z] = [1, 0, 1; a3, 1, a3 + a2/a1] · [E1; E2; a1·Z]

• Observationally equivalent model: treat E1 as the confounder (with coefficients 1 into X1 and −a2/a1 into X2), a1·Z as the noise of X1, and a direct effect X1 → X2 of strength a3 + a2/a1:

  [X1; X2] = [1, 0, 1; (a3 + a2/a1) − a2/a1, 1, a3 + a2/a1] · [E1; E2; a1·Z]

Since (a3 + a2/a1) − a2/a1 = a3, the two decompositions share the same mixing matrix, so the two causal models cannot be distinguished from observational data.

Hoyer et al. (2008). Estimation of causal effects using linear nonGaussian causal models with hidden variables. International Journal of Approximate Reasoning, 49(2):362– 378.

Page 34:

Confounders: Example

(Graph: Z → X1 (a1 = 0.6), Z → X2 (a2 = 0.8), X1 → X2 (a3 = 0.5), with noises E1, E2.)

  [X1; X2] = [1, 0, a1; a3, 1, a1·a3 + a2] · [E1; E2; Z] = [1, 0, 1; a3, 1, a3 + a2/a1] · [E1; E2; a1·Z]

(Figure: 3-D scatter of the independent sources (Z, E1, E2) and the resulting 2-D scatter of (X1, X2), in which the source directions are visible.)

Page 35:

What If There Are Cycles/Loops?

(Graph: a feedback loop among X2, X3, and X4 with coefficients 2, −1, and −0.3.)

Page 36:

Cycles

• Causal relations may have cycles; consider an example (graph: X1 → X2 (1.2), X2 → X3 (2), X3 → X4 (−1), X4 → X2 (−0.3), X2 → X5 (3), with noises E1, ..., E5):

  X1 = E1
  X2 = 1.2·X1 − 0.3·X4 + E2
  X3 = 2·X2 + E3
  X4 = −X3 + E4
  X5 = 3·X2 + E5

Or in matrix form, X = BX + E, where

  B = [0, 0, 0, 0, 0; 1.2, 0, 0, −0.3, 0; 0, 2, 0, 0, 0; 0, 0, −1, 0, 0; 0, 3, 0, 0, 0]

Lacerda, Spirtes, Ramsey and Hoyer (2008). Discovering cyclic causal models by independent component analysis. In Proc. UAI.

A conditional-independence-based method is given in T. Richardson (1996). A polynomial-time algorithm for deciding Markov equivalence of directed cyclic graphical models. In Proc. UAI.

Page 37:

Why Cycles?

• Some situations where we can recover cycles with ICA:

  • Each process reaches its equilibrium state & we observe the equilibrium states of multiple processes

  • On temporally aggregated data

(Figure: the two-variable feedback loop X1 ⇄ X2 unrolled over time slices X_{t−1}, X_t, X_{t+1} with transition matrix B; the instantaneous edges are crossed out.)

Equilibrium: the process is X_t = B·X_{t−1} + E_t. At convergence we have X_t = X_{t−1} for each dynamical process, so

  X_t = B·X_t + E_t,  or  E_t = (I − B)·X_t.

Temporal aggregation: suppose the underlying process is X_t = B·X_{t−1} + E_t, but we just observe X̄_t = (1/L)·Σ_{k=1}^L X_{t+k}. Since

  (1/L)·Σ_{k=1}^L X_{t+k} = B·(1/L)·Σ_{k=1}^L X_{t+k−1} + (1/L)·Σ_{k=1}^L E_{t+k},

we have X̄_t = B·X̄_t + Ē_t as L → ∞.
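A minimal sketch (my own) of the aggregation argument: simulate a two-variable VAR(1) with a feedback loop, average over long windows, and check that the aggregated data approximately satisfy the instantaneous model X̄ = B·X̄ + Ē.

```python
import numpy as np

rng = np.random.default_rng(0)
B = np.array([[0.0, 0.4],
              [0.3, 0.0]])               # cross-lagged coefficients: a cycle
T, L = 200_000, 200                      # series length, aggregation window

E = rng.laplace(size=(2, T))             # non-Gaussian innovations
X = np.zeros((2, T))
for t in range(1, T):
    X[:, t] = B @ X[:, t - 1] + E[:, t]

Xbar = X.reshape(2, -1, L).mean(axis=2)  # non-overlapping window averages

# If Xbar = B*Xbar + Ebar, then (I - B)*Xbar should behave like Ebar, whose
# covariance is cov(E)/L; so the rescaled covariance is ~ cov(E) = 2*I here.
R = (np.eye(2) - B) @ Xbar
print(np.round(np.cov(R) * L, 2))
```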

Page 38:

Examples

• Some situations where we can recover cycles with ICA:

  • Each process reaches its equilibrium state & we observe the equilibrium states of multiple processes. Consider the price and demand of the same product in different states:

    price_t = b1·price_{t−1} + b2·demand_{t−1} + E1
    demand_t = b3·price_{t−1} + b4·demand_{t−1} + E2

  • On temporally aggregated data: the underlying process is X_t = B·X_{t−1} + E_t, but we just observe X̄_t = (1/L)·Σ_{k=1}^L X_{t+k}. Consider the causal relation between two stocks: the causal influence takes place very quickly (~1–2 minutes), but we only have daily returns.

(Figure: the unrolled two-variable loop X1 ⇄ X2 over time slices, as on the previous page.)

Page 39:

Can We Recover Cyclic Relations?

• E = (I−B)X; ICA can give Y = WX

• Without cycles: unique solution for B

• With cycles: solutions for B are no longer unique; why? :-(

• A 2-D example (graph: X1 → X2 with coefficient a, X2 → X1 with coefficient b, noises E1, E2). Suppose we have the process

  X_t = [0, b; a, 0]·X_t + E_t,  with B = [0, b; a, 0].

That is, (I − B)X = E, or

  [1, −b; −a, 1]·X_t = E_t
  ⇒ [−a, 1; 1, −b]·X_t = [0, 1; 1, 0]·E_t   (permute rows)
  ⇒ [1, −1/a; −1/b, 1]·X_t = [0, −1/a; −1/b, 0]·E_t   (re-scale rows)
  ⇒ X_t = [0, 1/a; 1/b, 0]·X_t + [0, −1/a; −1/b, 0]·E_t,  with B′ = [0, 1/a; 1/b, 0].

The first de-mixing matrix is W = I − B; the second is W′ = I − B′. Equivalently,

  X1 = b·X2 + E1, X2 = a·X1 + E2   ⇔   X2 = (1/b)·X1 − (1/b)·E1, X1 = (1/a)·X2 − (1/a)·E2.

Summary:
1. Still m independent components;
2. W cannot be permuted to be lower-triangular.

• Only one solution is stable (assuming no self-loops), i.e., such that |product of coefficients over the cycle| < 1 :-)

Page 40:

(Build-up of the next slide; its content is repeated in full on Page 41.)

Page 41:

Can You Find the Alternative Causal Model?

• For this example (graph: X1 → X2 (1.2), X2 → X3 (2), X3 → X4 (−1), X4 → X2 (−0.3), X2 → X5 (3), with noises E1, ..., E5):

  X1 = E1
  X2 = 1.2·X1 − 0.3·X4 + E2
  X3 = 2·X2 + E3
  X4 = −X3 + E4
  X5 = 3·X2 + E5

Or in matrix form, X = BX + E, where

  B = [0, 0, 0, 0, 0; 1.2, 0, 0, −0.3, 0; 0, 2, 0, 0, 0; 0, 0, −1, 0, 0; 0, 3, 0, 0, 0]

  I − B = [1, 0, 0, 0, 0; −1.2, 1, 0, 0.3, 0; 0, −2, 1, 0, 0; 0, 0, 1, 1, 0; 0, −3, 0, 0, 1].

A different row permutation with non-zero diagonal is

  W′ = [1, 0, 0, 0, 0; 0, −2, 1, 0, 0; 0, 0, 1, 1, 0; −1.2, 1, 0, 0.3, 0; 0, −3, 0, 0, 1].

That is, after re-scaling each row to unit diagonal,

  B′ = [0, 0, 0, 0, 0; 0, 0, 0.5, 0, 0; 0, 0, 0, −1, 0; 4, −3.3, 0, 0, 0; 0, 3, 0, 0, 0].

(Graph of the alternative model: X1 → X4 (4), X2 → X4 (−3.3), X3 → X2 (0.5), X4 → X3 (−1), X2 → X5 (3), with noises E1, E′2, E′3, E′4, E5.)

For instance, X2 = 1.2·X1 − 0.3·X4 + E2 can be rewritten as X4 = (1.2/0.3)·X1 − (1/0.3)·X2 + (1/0.3)·E2 ...
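A minimal sketch (my own) of this ambiguity: two row permutations of W = I − B admit a non-zero diagonal, and re-scaling each yields a candidate causal model; only the original B has cycle product 2·(−1)·(−0.3) = 0.6 with magnitude below 1, so it is the stable solution.

```python
import numpy as np

B = np.array([[0.0, 0.0,  0.0,  0.0, 0.0],
              [1.2, 0.0,  0.0, -0.3, 0.0],
              [0.0, 2.0,  0.0,  0.0, 0.0],
              [0.0, 0.0, -1.0,  0.0, 0.0],
              [0.0, 3.0,  0.0,  0.0, 0.0]])
W = np.eye(5) - B

def induced_b(W, perm):
    """Reorder rows of W by perm, re-scale to unit diagonal, return I - W'."""
    Wp = W[list(perm)]
    Wp = Wp / np.diag(Wp)[:, None]
    return np.eye(len(W)) - Wp

print(np.round(induced_b(W, (0, 1, 2, 3, 4)), 2))  # recovers B (stable cycle)
print(np.round(induced_b(W, (0, 2, 3, 1, 4)), 2))  # alternative B' (unstable cycle)
```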

Page 42:

Some Simulation Result

• Simulate 15000 data points with non-Gaussian noise using this model: (graph: X1 → X2 (1.2), X2 → X3 (2), X3 → X4 (−1), X4 → X2 (−0.3), X2 → X5 (3))

• Output of the algorithm: (Fig. 3 of Lacerda et al. (2008): the output of LiNG-D, Candidate #1 and Candidate #2.)

Excerpt from Lacerda et al. (2008), assuming linearity and no dependence between error terms:

• DGs G1 and G2 are zero partial correlation equivalent if and only if the set of zero partial correlations entailed for all values of the free parameters (non-zero linear coefficients, distribution of the error terms) of a linear SEM with DG G1 is the same as the set of zero partial correlations entailed for all values of the free parameters of a linear SEM with G2. For linear models, this is the same as d-separation equivalence. [13]

• DGs G1 and G2 are covariance equivalent if and only if for every set of parameter values for the free parameters of a linear SEM with DG G1, there is a set of parameter values for the free parameters of a linear SEM with DG G2 such that the two SEMs entail the same covariance matrix over the substantive variables, and vice-versa.

• DGs G1 and G2 are distribution equivalent if and only if for every set of parameter values for the free parameters of a linear SEM with DG G1, there is a set of parameter values for the free parameters of a linear SEM with DG G2 such that the two SEMs entail the same distribution over the substantive variables, and vice-versa. Do not confuse this with the notion of distribution-entailment equivalence between SEMs: two SEMs with fixed parameters are distribution-entailment equivalent iff they entail the same distribution.

It follows from well-known theorems about the Gaussian case [13], and some trivial consequences of known results about the non-Gaussian case [12], that the following relationships exist among the different senses of equivalence for acyclic graphs: If all of the error terms are assumed to be Gaussian, distribution equivalence is equivalent to covariance equivalence, which in turn is equivalent to d-separation equivalence. If not all of the error terms are assumed to be Gaussian, then distribution equivalence entails (but is not entailed by) covariance equivalence, which entails (but is not entailed by) d-separation equivalence.

So for example, given Gaussian error terms, A ← B and A → B are zero partial correlation equivalent, covariance equivalent, and distribution equivalent. But given non-Gaussian error terms, A ← B and A → B are zero-partial-correlation equivalent and covariance equivalent, but not distribution equivalent. So for Gaussian errors and this pair of DGs, no algorithm that relies only on observational data can reliably select a unique acyclic graph that fits the population distribution as the correct causal graph without making further assumptions; but for all (or all except one) non-Gaussian errors there will always be a unique acyclic graph that fits the population distribution.

While there are theorems about the case of cyclic graphs and Gaussian errors, we are not aware of any such theorems about cyclic graphs with non-Gaussian errors with respect to distribution equivalence. In the case of cyclic graphs with all Gaussian errors, distribution equivalence is equivalent to covariance equivalence, which entails (but is not entailed by) d-separation equivalence [14]. In the case of cyclic graphs in which at most one error term is non-Gaussian, distribution equivalence entails (but is not entailed by) covariance equivalence, which in turn entails (but is not entailed by) d-separation equivalence. However, given at most one Gaussian error term, the important difference between acyclic graphs and cyclic graphs is that no two different acyclic graphs are distribution equivalent, but there are different cyclic graphs that are distribution equivalent.

Hence, no algorithm that relies only on observational data can reliably select a unique cyclic graph that fits the data as the correct causal graph without making further assumptions. For example, the two cyclic graphs in Fig. 3 are distribution equivalent.

5.2 The output of LiNG-D is correct and as fine as possible

Theorem 1. The output of LiNG-D is a set of SEMs that comprise a distribution-entailment equivalence class.

Proof: First, we show that any two SEMs in the output of LiNG-D entail the same distribution. The weight matrix output by ICA is determined only up to scaling and row permutation. Intuitively, then, permuting the error terms does not change the mixture. Now, more formally:

Lacerda, Spirtes, Ramsey and Hoyer (2008). Discovering cyclic causal models by independent component analysis. In Proc. UAI.