Independent Component Analysis
Background paper:
http://www-stat.stanford.edu/~hastie/Papers/ica.pdf
ICA Problem
X = AS
where
• X is a random p-vector representing multivariate input measurements.
• S is a latent source p-vector whose components are independently distributed random variables.
• A is a p × p mixing matrix.
Given realizations x_1, x_2, . . . , x_N of X, the goals of ICA are to
• Estimate A
• Estimate the source distributions S_j ∼ f_{S_j}, j = 1, . . . , p.
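A minimal sketch of this generative model in R (assuming p = 2, uniform sources scaled to unit variance, and an arbitrary hypothetical choice of mixing matrix A):

    set.seed(1)
    N <- 1000; p <- 2
    # independent, non-Gaussian sources with mean 0 and variance 1
    S <- matrix(runif(N * p, -sqrt(3), sqrt(3)), N, p)
    A <- matrix(c(1, 1, 0, 2), p, p)   # hypothetical mixing matrix
    X <- S %*% t(A)                    # rows are the realizations x_i = A s_i

ICA then tries to recover A and the two source distributions from X alone.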
Cocktail Party Problem
In a room there are p independent sources of sound, and p microphones placed around the room hear different mixtures.
[Figure: four time-series panels: Source Signals, Measured Signals, PCA Solution, ICA Solution]
Here the measurements x_ij = x_j(t_i) and the recovered sources are time series sampled uniformly at times t_i.
Independent vs Uncorrelated
WLOG we can assume that E(S) = 0 and Cov(S) = I, and hence
Cov(X) = Cov(AS) = A A^T.
Suppose X = AS with S unit-variance and uncorrelated.
Let R be any orthogonal p × p matrix. Then
X = AS = A R^T R S = A* S*
with A* = A R^T, S* = R S, and Cov(S*) = I.
It is not enough to find uncorrelated variables, as they are not unique under rotations.
Hence methods based on second-order moments, like principal components and Gaussian factor analysis, cannot recover A.
ICA uses independence, and non-Gaussianity of S, to recover A — e.g. via higher-order moments.
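Continuing the R sketch above, one can check numerically that a rotated version of the sources has exactly the same second moments (the angle is an arbitrary choice):

    theta <- pi / 6
    R <- matrix(c(cos(theta), sin(theta), -sin(theta), cos(theta)), 2, 2)
    Sstar <- S %*% t(R)     # S* = R S, applied to every realization
    round(cov(Sstar), 2)    # still ~ identity, yet S* is no longer independent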
Independent vs Uncorrelated Demo
[Figure: four scatterplot panels: Source S, Data X, PCA Solution, ICA Solution]
Principal components are uncorrelated linear combinations of X, chosen to successively maximize variance.
Independent components are also uncorrelated linear combinations of X, chosen to be as independent as possible.
Example: ICA representation of Natural Images
Pixel blocks are treated as vectors, and the collection of such vectors for an image forms an image database. ICA can lead to a sparse coding for the image, using a natural basis.
See http://www.cis.hut.fi/projects/ica/imageica/ (Patrik Hoyer and Aapo Hyvarinen, Helsinki University of Technology)
Example: ICA and EEG data
See http://www.cnl.salk.edu/~tewon/ica_cnl.html (Scott Makeig, Salk Institute)
Approaches to ICA
The ICA literature is HUGE. The recent book by Hyvarinen, Karhunen & Oja (Wiley, 2001) is a great source for learning about ICA, and some good computational tricks.
• Mutual information and entropy, maximizing non-Gaussianity — FastICA (HKO 2001), Infomax (Bell and Sejnowski, 1995)
• Likelihood methods — ProDenICA (Hastie and Tibshirani), covered later
• Nonlinear decorrelation — Y_1 is independent of Y_2 iff max_{f,g} Corr[f(Y_1), g(Y_2)] = 0 (Herault-Jutten, 1984); KernelICA (Bach and Jordan, 2001)
• Tensorial moment methods
Simplification
Let Σ = Cov(X) and suppose X = AS, with B = A^{-1}. Then we can write B = W Σ^{-1/2} for some nonsingular W, so that
S = BX = W Σ^{-1/2} X, with Cov(S) = I and W orthonormal.
So operationally we sphere the data, X ← Σ^{-1/2} X, and then seek an orthogonal matrix W so that the components of S = WX are independent.
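A sketch of the sphering step in R, via the eigendecomposition of the sample covariance (X is the N x p data matrix from the earlier sketch):

    Sigma <- cov(X)
    e <- eigen(Sigma)
    Sig.inv.sqrt <- e$vectors %*% diag(1 / sqrt(e$values)) %*% t(e$vectors)
    Xw <- scale(X, center = TRUE, scale = FALSE) %*% Sig.inv.sqrt
    round(cov(Xw), 2)   # ~ identity; it remains to find an orthogonal W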
Entropy and Mutual Information
Entropy: H(Y) = −∫ f(y) log f(y) dy — maximized by f(y) = φ(y), the Gaussian (for fixed variance).
Mutual Information: I(Y) = Σ_{j=1}^p H(Y_j) − H(Y)
• Y is a random vector with joint density f(y) and entropy H(Y).
• H(Y_j) is the (marginal) entropy of component Y_j, with marginal density f_j(y_j).
• I(Y) is the Kullback-Leibler divergence between f(y) and its independence version ∏_{j=1}^p f_j(y_j) (which is the KL-closest of all independence densities to f(y)).
• Hence I(Y) is a measure of dependence between the components of a random vector Y.
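For a Gaussian with variance σ^2 the entropy is (1/2) log(2πeσ^2), and no other density with the same variance exceeds it. A quick numerical check against the unit-variance uniform:

    H.gauss   <- 0.5 * log(2 * pi * exp(1))  # N(0,1) entropy: ~1.419
    H.uniform <- log(2 * sqrt(3))            # uniform(-sqrt(3), sqrt(3)): ~1.242
    H.gauss > H.uniform                      # TRUE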
Entropy, Mutual Information, and ICA
If Cov(X) = I and W is orthogonal, then simple calculations show
I(WX) = Σ_{j=1}^p H(w_j^T X) − H(X)
Hence (since H(X) does not depend on W)
min_W I(WX) ⟺ min_W {dependence between the w_j^T X}
⟺ min_W {the sum of the entropies of the w_j^T X}
⟺ max_W {departures from Gaussianity of the w_j^T X}
• Many methods for ICA look for low-entropy or non-Gaussian projections (Hyvarinen, Karhunen & Oja, 2001)
• Strong similarities with projection pursuit (Friedman and Tukey, 1974)
Negentropy and FastICA
Negentropy: J(Y_j) = H(Z_j) − H(Y_j), where Z_j is a Gaussian RV with the same variance as Y_j. It measures the departure from Gaussianity.
FastICA uses simple approximations to negentropy:
J(w_j^T X) ≈ [E G(w_j^T X) − E G(Z_j)]^2,
with expectations replaced by sample averages when fitting to data. They use
G(y) = (1/a) log cosh(ay), 1 ≤ a ≤ 2.
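A minimal one-unit sketch of the FastICA fixed-point iteration with this G, assuming the sphered data Xw from the earlier sketch. It uses the standard update w ← E[X g(w^T X)] − E[g'(w^T X)] w followed by normalization, where g = G'; this is a sketch, not the authors' code:

    a  <- 1
    g  <- function(y) tanh(a * y)               # G'(y)
    gp <- function(y) a * (1 - tanh(a * y)^2)   # G''(y)
    w <- rnorm(ncol(Xw)); w <- w / sqrt(sum(w^2))
    for (it in 1:100) {
      y <- drop(Xw %*% w)
      w.new <- colMeans(Xw * g(y)) - mean(gp(y)) * w
      w.new <- w.new / sqrt(sum(w.new^2))
      converged <- abs(sum(w.new * w)) > 1 - 1e-8
      w <- w.new
      if (converged) break
    }
    s.hat <- drop(Xw %*% w)   # one recovered source, up to sign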
ICA Density Model
Under the ICA model, the density of S is
f_S(s) = ∏_{j=1}^p f_j(s_j)
with each f_j a univariate density with mean 0 and variance 1.
Next we develop two direct ways of fitting this density model.
Discrimination from Gaussian
• “Unsupervised vs supervised” idea.
• Consider the case of p = 2 variables. Idea: the observed data are assigned to class G = 1, and a background sample is generated from a spherical Gaussian g_0(x) and assigned to class G = 0.
• We fit to these data a projection pursuit model of the form
log [Pr(G = 1) / (1 − Pr(G = 1))] = f_1(a_1^T X) + f_2(a_2^T X)   (1)
where [a_1, a_2] is an orthonormal basis.
• This implies the data density model
g(X) = g_0(X) · exp{f_1(a_1^T X)} · exp{f_2(a_2^T X)}.
One can also show that g_0(X) factors into a product h_1(a_1^T X) · h_2(a_2^T X), since [a_1, a_2] is an orthonormal basis (a spherical Gaussian is a product density in any orthonormal coordinates).
• Hence model (1) is a product density for the data. It also looks like a neural network.
• Can fit it by (a) sphering (“whitening”) the data and then (b) applying the ppr function in R, combined with IRLS (iteratively reweighted least squares).
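A sketch of the two-class setup in R (the linear-logit glm below is only a stand-in; the actual fit alternates ppr, projection pursuit regression, with IRLS to fit the additive model (1)):

    Nw <- nrow(Xw); p <- ncol(Xw)
    X0 <- matrix(rnorm(Nw * p), Nw, p)   # background sample from spherical g0
    dat <- data.frame(rbind(Xw, X0))
    dat$G <- rep(c(1, 0), each = Nw)     # observed data = 1, background = 0
    fit <- glm(G ~ ., data = dat, family = binomial)  # placeholder for (1)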
Projection Pursuit Solutions
ICA Density Model — direct fitting
Density of X = AS:
f_X(x) = |B| ∏_{j=1}^p f_j(b_j^T x)
with B = A^{-1}.
Log-likelihood, given x_1, x_2, . . . , x_N:
ℓ(B, {f_j}_1^p) = N log |B| + Σ_{i=1}^N Σ_{j=1}^p log f_j(b_j^T x_i)
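This log-likelihood is direct to code; a sketch in R, with the unknown f_j stood in by a fixed non-Gaussian density (the standard logistic), purely for illustration:

    loglik <- function(B, X) {
      S <- X %*% t(B)                 # rows are s_i = B x_i
      N <- nrow(X)
      N * log(abs(det(B))) + sum(dlogis(S, log = TRUE))
    }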
ICA Density Model — fitting ctd
• As before: let Σ = Cov(X); then for any B we can write B = W Σ^{-1/2} for some nonsingular W. Then S = BX and Cov(S) = I ⟹ W is orthonormal. So we sphere the data, X ← Σ^{-1/2} X, and seek an orthonormal W.
• Let
f_j(s_j) = φ(s_j) e^{g_j(s_j)},
a tilted Gaussian density. Here φ is the standard Gaussian density, and g_j satisfies the normalization conditions.
Then
ℓ(W, Σ, {g_j}_1^p) = −(N/2) log |Σ| + Σ_{i=1}^N Σ_{j=1}^p [log φ(w_j^T x_i) + g_j(w_j^T x_i)]
• We estimate Σ by the sample covariance of the x_i, and ignore it thereafter — “pre-whitening” in the ICA literature.
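The tilted Gaussian is convenient to compute with; a tiny sketch (the null tilt g = 0 recovers the Gaussian itself, and a fitted g from the algorithm below would be plugged in instead):

    tilted <- function(s, g) dnorm(s) * exp(g(s))
    g0 <- function(s) 0 * s              # null tilt
    curve(tilted(x, g0), -3, 3)          # plots phi(s) * exp(g(s))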
ProDenICA algorithm
Maximizes the log-likelihood on the previous slide, over W and g_j(·), j = 1, 2, . . . , p, with smoothness constraints on the g_j(·).
Details on the next 3 slides.
Restrictions on gj
Our model is still over-parameterized. We maximize instead a penalized log-likelihood:
Σ_{j=1}^p [ (1/N) Σ_{i=1}^N [log φ(w_j^T x_i) + g_j(w_j^T x_i)] − λ_j ∫ {g_j'''(t)}^2 dt ]
w.r.t. W^T = (w_1, w_2, . . . , w_p) and g_j, j = 1, . . . , p,
subject to
• W^T W = I
• each f_j(s) = φ(s) e^{g_j(s)} is a density, with mean 0 and variance 1.
• the solutions g_j are piecewise quartic splines
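The density, mean, and variance constraints are ordinary integrals, so they can be checked numerically for any candidate tilt; a sketch with a deliberately invalid (hypothetical) tilt, to show what the constraints rule out:

    f <- function(s, g) dnorm(s) * exp(g(s))
    g.try <- function(s) 0.1 * s                       # violates the constraints
    integrate(function(s) f(s, g.try), -Inf, Inf)      # mass ~1.005, not 1
    integrate(function(s) s * f(s, g.try), -Inf, Inf)  # mean ~0.1, not 0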
ProDenICA: Product Density ICA algorithm
1. Initialize W (random Gaussian matrix followed by orthogonalization).
2. Alternate until convergence of W, using the Amari metric:
(a) Given W, optimize the penalized log-likelihood w.r.t. g_j (separately for each j), using the penalized density estimation algorithm.
(b) Given g_j, j = 1, . . . , p, perform one step of a fixed point algorithm towards finding the optimal W.
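Step 1 is two lines of R (qr.Q(qr(·)) orthogonalizes the random start):

    W0 <- matrix(rnorm(p * p), p, p)   # random Gaussian matrix
    W  <- qr.Q(qr(W0))                 # orthogonalized: t(W) %*% W = I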
Penalized density estimation for step (a)
• Discretize the data
• Fit a generalized additive model with Poisson family of distributions (see text chapter 9)
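A sketch of step (a) as a Poisson GAM in R: bin the current projection, and regress the counts on the bin midpoints with log φ as an offset, so the smooth term estimates g_j. It assumes the sphered data Xw and direction w from the earlier sketches; mgcv::gam is used for convenience, and the paper's own penalized algorithm differs in details:

    library(mgcv)
    sj  <- drop(Xw %*% w)                      # current projection w^T x
    br  <- seq(min(sj), max(sj), length = 51)  # 50 bins
    cnt <- hist(sj, breaks = br, plot = FALSE)$counts
    mid <- (br[-1] + br[-length(br)]) / 2      # bin midpoints
    # log E(cnt) = const + log phi(mid) + g(mid)
    fit <- gam(cnt ~ s(mid) + offset(dnorm(mid, log = TRUE)), family = poisson)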
Simulations
[Figure: the 18 source densities, labeled a through r, taken from Bach and Jordan (2001)]
• Each distribution was used to generate 2-dim S and X (with random A) — 30 reps each.
• 300 reps with 4-dim S — the distributions of the S_j picked at random.
Compared FastICA (homegrown Splus version), KernelICA (Francis Bach's matlab version), and ProDenICA (Splus).
Amari Metric
We evaluate solutions by comparing their estimated W with the true W_0, using the Amari metric (HKO 2001; Bach and Jordan, 2001):
d(W_0, W) = (1/(2p)) Σ_{i=1}^p ( Σ_{j=1}^p |r_ij| / max_j |r_ij| − 1 ) + (1/(2p)) Σ_{j=1}^p ( Σ_{i=1}^p |r_ij| / max_i |r_ij| − 1 )
where r_ij = (W_0 W^{-1})_ij.
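A direct transcription of this metric into R:

    amari <- function(W0, W) {
      R <- abs(W0 %*% solve(W))   # |r_ij|, with r = W0 W^{-1}
      p <- nrow(R)
      (sum(rowSums(R) / apply(R, 1, max) - 1) +
       sum(colSums(R) / apply(R, 2, max) - 1)) / (2 * p)
    }
    amari(diag(2), diag(2))       # 0 when W recovers W0 exactly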
2-Dim examples
[Figure: Amari distance from the true W (log scale) for distributions a through r, comparing FastICA, KernelICA, and ProDenICA]
4-dim examples
[Figure: Amari distance from the true W for the 4-dim simulations. Means (standard errors): FastICA 0.25 (0.014), KernelICA 0.08 (0.004), ProDenICA 0.14 (0.011)]
[Figure: pairwise scatterplots of Amari distances (log scale) for FastICA, KernelICA, and ProDenICA across the simulations]