TRANSCRIPT
Subspace estimation in linear dimension reduction
Hannu Oja (with Klaus Nordhausen and David E. Tyler)
BIRS workshop, Banff, November 2015
The plan
• Linear and nonlinear dimension reduction
• Supervised and unsupervised dimension reduction
• Similarities between PCA, FOBI and SIR
• Signal and noise subspaces
• Bootstrap tests for the dimension of the signal subspace
• Estimation of the dimension of the signal subspace
Introduction
• Let x be a p-variate random vector with cumulative distribution Fx.
• Linear dimension reduction.
Find a projection matrix P such that you do not lose information
if you transform x → z = Px:
(i) x|Px is not “interesting” (unsupervised)
(ii) y ⊥⊥ x |Px for some “interesting” y (supervised)
• Nonlinear dimension reduction - not discussed here.
Find a (nonlinear) function H : Rp → Rk such that you do not lose information
if you transform x → z = H(x):
(i) x|H(x) is not “interesting” (unsupervised)
(ii) y ⊥⊥ x |H(x) for some “interesting” y (supervised)
Linear dimension reduction
• The dimension of x is reduced using a k × p matrix B.
Then
x → z = Bx
or
x → z = PBx where PB = B′(BB′)−1B.
• The idea is that k << p and that “no information is lost” in the transformation.
• Dimension reduction methods (unsupervised and supervised):
PCA, ICA, ICS, SIR, SAVE, etc.
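The projection PB = B′(BB′)−1B above can be checked numerically: it is symmetric, idempotent, and has rank (trace) k. A minimal NumPy sketch, with an arbitrary illustrative 2 × 5 matrix B (not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((2, 5))          # k x p matrix, here k = 2, p = 5

# P_B = B'(BB')^{-1} B  projects onto the row space of B
P = B.T @ np.linalg.inv(B @ B.T) @ B

# A projection matrix is symmetric, idempotent, and has trace k
assert np.allclose(P, P.T)
assert np.allclose(P @ P, P)
assert np.isclose(np.trace(P), 2.0)
```

Applying P to a data vector x then yields the reduced representation z = PBx discussed above.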
Looking for similarities: PCA, FOBI, SIR
• Assume that E(x) = 0. In PCA, one then finds the p× p transformation matrix W
such that
WW′ = Ip and WE(xx′)W′ = D
where D is a diagonal matrix with diagonal elements d1 ≥ ... ≥ dp ≥ 0.
• In independent component analysis (ICA), FOBI finds a transformation matrix W such
that
WE(xx′)W′ = Ip and WE(xx′E(xx′)−1xx′)W′ = D
where the diagonal elements of D are ordered so that
|d1 − (p+ 2)| ≥ ... ≥ |dp − (p+ 2)|.
• Sliced inverse regression (SIR) uses a dependent variable y, and finds a
transformation matrix W which satisfies
WE(xx′)W′ = Ip and WE(E(x|y)E(x|y)′)W′ = D
where the diagonal elements of D satisfy d1 ≥ ... ≥ dp ≥ 0.
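All three decompositions above are eigendecompositions of estimated matrices. As one illustration, here is a minimal NumPy sketch of the FOBI step: whiten with the inverse square root of the covariance, eigendecompose the fourth-moment matrix E(‖z‖²zz′), and combine. This is an illustrative helper of ours, not the authors' implementation (packaged versions exist, e.g. in R); note that for Gaussian components the eigenvalues concentrate around p + 2, and that proper FOBI ordering is by |di − (p + 2)|.

```python
import numpy as np

def fobi(X):
    """FOBI sketch: find W with W Cov(x) W' = I and
    W E(xx' Cov(x)^{-1} xx') W' = D (diagonal)."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    # symmetric inverse square root of the covariance matrix
    evals, evecs = np.linalg.eigh(cov)
    cov_isqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    Z = Xc @ cov_isqrt                       # whitened data
    # fourth-moment matrix E(||z||^2 zz'); Gaussian components give p + 2
    B = (Z * (Z ** 2).sum(axis=1, keepdims=True)).T @ Z / len(Z)
    d, U = np.linalg.eigh(B)
    W = U.T @ cov_isqrt
    return W, d[::-1]                        # eigenvalues, largest first
```

For purely Gaussian data every eigenvalue is close to p + 2, so no direction stands out, in line with the FOBI ordering by |di − (p + 2)|.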
• The idea in dimension reduction is then that W = (W1′, W2′)′ where
– the k-dimensional W1x represents the information (signal), and
– the (p − k)-dimensional W2x represents the noise.
Figure 1: Data set 1, Fisher’s Iris Data: Original variables.
[Scatterplot matrix of Sepal.Length, Sepal.Width, Petal.Length and Petal.Width.]
Figure 2: Data set 1, Fisher’s Iris Data: Principal components.
[Scatterplot matrix of Comp.1, ..., Comp.4.]
Figure 3: Data set 1, Fisher’s Iris Data: FOBI coordinates.
[Scatterplot matrix of IC.1, ..., IC.4.]
Figure 4: Data set 2: Original variables.
[Scatterplot matrix of V1, ..., V6.]
Figure 5: Data set 2: Principal components.
[Scatterplot matrix of the six principal components, panels labelled V1, ..., V6.]
Figure 6: Data set 2: FOBI coordinates.
[Scatterplot matrix of the six FOBI coordinates, panels labelled V1, ..., V6.]
Figure 7: Data set 3: Original variables.
[Scatterplot matrix of V1, ..., V6.]
Figure 8: Data set 3: Principal components.
[Scatterplot matrix of the six principal components, panels labelled V1, ..., V6.]
Figure 9: Data set 3: FOBI coordinates.
[Scatterplot matrix of the six FOBI coordinates, panels labelled V1, ..., V6.]
Figure 10: Data set 4: Original variables.
[Scatterplot matrix of X.1, ..., X.5 and the response y.]
Figure 11: Data set 4: SIR coordinates.
[Scatterplot matrix of the SIR coordinates Z.1, ..., Z.5 and the response y.]
Testing whether W2x is noise
• In dimension reduction W = (W1′, W2′)′, and the k-variate W1x is assumed to carry the
relevant information. We then wish to test the following null hypotheses, each saying that
W2x represents noise:
– PCA:
(i) H0 : W2x ∼ Np−k(0, σ²Ip−k),
(ii) H0 : W2x is spherically symmetric, or
(iii) H0 : W2x has exchangeable components.
– FOBI:
H0 : W2x ∼ Np−k(0, Ip−k).
– SIR:
H0 : (y, W1x) ⊥⊥ W2x (implies that y ⊥⊥ W2x |W1x and the linearity condition).
• An unconventional semiparametric bootstrap is used in the following to test these
hypotheses.
Test statistics for the dimension of W1x
• Let X = (x1, ..., xn)′ (or (y, X)) be a random sample from the distribution of x (or of
(y, x)), and let Ŵ and D̂ be natural estimates of W and D, respectively. We then have the
following.
• PCA: H0 implies that d1 ≥ ... ≥ dk > dk+1 = ... = dp. We choose
T(X) = − log[ ( ∏_{i=k+1}^p d̂i )^{1/(p−k)} / ( ∑_{i=k+1}^p d̂i /(p − k) ) ],
the log-ratio of the geometric and arithmetic means of the p − k smallest eigenvalues.
• FOBI: H0 implies that d1 ≥ ... ≥ dk > dk+1 = ... = dp = p + 2. We choose
T(X) = ∑_{i=k+1}^p (d̂i − p − 2)².
• SIR: H0 implies that d1 ≥ ... ≥ dk > dk+1 = ... = dp = 0. We choose
T(y, X) = log[ ( ∏_{i=k+1}^p d̂i )^{1/(p−k)} ]  (or ∑_{i=k+1}^p d̂i²).
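The PCA statistic can be computed directly from the sample eigenvalues. A minimal NumPy sketch (the function name pca_stat is ours):

```python
import numpy as np

def pca_stat(X, k):
    """PCA test statistic for H0: the p - k smallest eigenvalues are equal.
    Minus log of (geometric mean / arithmetic mean) of those eigenvalues;
    by the AM-GM inequality it is >= 0, and 0 iff they are all equal."""
    d = np.sort(np.linalg.eigvalsh(np.cov(X, rowvar=False)))[::-1]
    tail = d[k:]                         # the p - k smallest eigenvalues
    geo_mean = np.exp(np.log(tail).mean())
    return -np.log(geo_mean / tail.mean())
```

Under H0 the statistic is close to zero; how close it should be is what the bootstrap calibrates.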
Tests based on limiting distributions
• Let X = (x1, ..., xn)′ (or (y, X)) be a random sample from the distribution of x (or of
(y, x)), and let Ŵ and D̂ be natural estimates of W and D, respectively. Tests based on the
limiting distributions of the test statistics are then available:
• PCA: Tyler (1981), Schott (2006), etc.
• FOBI: ?
• SIR: Li (1991), Bura and Cook (2001)
PCA: Strategies for bootstrapping
• Write Z = (X − 1nμ′)W′, that is, Z = (Z1, Z2) = ((X − 1nμ′)W1′, (X − 1nμ′)W2′).
• Our bootstrap samples X∗ under the null model are then obtained as follows.
1. Write Z̄ = (Z̄1, Z̄2) for a bootstrap sample of size n from {z1, ..., zn}.
2. Set Z∗1 = Z̄1 and either
2.1 Z∗2 = (O1z̄21, ..., Onz̄2n)′ for n independent random orthogonal
(p − k) × (p − k) matrices O1, ..., On (subsphericity of W2x), or
2.2 Z∗2 = (P1z̄21, ..., Pnz̄2n)′ for n independent random (p − k) × (p − k)
permutation matrices P1, ..., Pn (exchangeability of W2x),
where z̄2i denotes the i-th row of Z̄2.
3. Write Z∗ = (Z∗1, Z∗2).
4. Write X∗ = Z∗(W′)−1 + 1nμ′.
• An estimated p-value for a bootstrap test with the test statistic T(X) is then obtained as
M−1 #{j : T(X∗j) ≥ T(X)}, where X∗1, ..., X∗M are M independent bootstrap samples.
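Steps 1 and 2.1 can be sketched as follows, under the subsphericity null (a minimal illustration; the function names are ours, and a random orthogonal matrix is drawn via QR of a Gaussian matrix):

```python
import numpy as np

def rand_orthogonal(m, rng):
    """Haar-distributed random orthogonal m x m matrix via QR."""
    Q, R = np.linalg.qr(rng.standard_normal((m, m)))
    return Q * np.sign(np.diag(R))           # fix column signs

def pca_null_bootstrap(Z1, Z2, rng):
    """One bootstrap sample under subsphericity of W2 x (step 2.1):
    resample the rows of (Z1, Z2) jointly, then rotate each resampled
    noise row by an independent random orthogonal matrix."""
    n, q = Z2.shape                          # q = p - k
    idx = rng.integers(0, n, size=n)         # bootstrap row indices
    Z2_star = np.vstack([rand_orthogonal(q, rng) @ Z2[i] for i in idx])
    return Z1[idx], Z2_star
```

Step 2.2 would replace the orthogonal matrices with random permutation matrices; the back-transformation X∗ = Z∗(W′)−1 + 1nμ′ is a single matrix product.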
PCA: Simulation results
• 500 repetitions (random samples) for sample sizes n = 50, 100, 150, 200 were
generated from N5(0, diag(3, 2, 1, 1, 1)).
• For each random sample, M = 200 bootstrap samples were generated for null
hypotheses k = 3, 2, 1 under the assumptions of subsphericity (O) and
subexchangeability (P).
• The proportion of bootstrap p-values below 0.05 is reported in the following. The true
value is k = 2.
          k = 3          k = 2          k = 1
  n       O      P       O      P       O      P
  50    0.026  0.020   0.032  0.028   0.416  0.392
  100   0.016  0.016   0.044  0.050   0.828  0.814
  150   0.016  0.016   0.054  0.060   0.970  0.972
  200   0.022  0.018   0.036  0.034   0.998  0.998
FOBI: Strategies for bootstrapping
• Write Z = (X − 1nμ′)W′, that is, Z = (Z1, Z2) = ((X − 1nμ′)W1′, (X − 1nμ′)W2′).
• Our bootstrap samples X∗ under the null model are then obtained as follows.
1. Write Z∗1 for a matrix of componentwise bootstrap samples of size n from Z1.
2. Let Z∗2 be a random sample of size n from Np−k(0, Ip−k).
3. Write Z∗ = (Z∗1, Z∗2).
4. Write X∗ = Z∗(W′)−1 + 1nμ′.
• An estimated p-value for a bootstrap test with the test statistic T(X) is then obtained as
M−1 #{j : T(X∗j) ≥ T(X)}, where X∗1, ..., X∗M are M independent bootstrap samples.
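Steps 1 and 2 of the FOBI null resampling in a minimal sketch (names are ours; q = p − k is passed in explicitly):

```python
import numpy as np

def fobi_null_bootstrap(Z1, q, rng):
    """One bootstrap sample under the FOBI null:
    componentwise bootstrap of the k signal columns, plus an
    independent N_{p-k}(0, I_{p-k}) draw for the noise part."""
    n, k = Z1.shape
    # each signal column is resampled independently of the others
    Z1_star = np.column_stack(
        [rng.choice(Z1[:, j], size=n, replace=True) for j in range(k)])
    Z2_star = rng.standard_normal((n, q))    # Gaussian noise part
    return np.hstack([Z1_star, Z2_star])
```

The componentwise resampling reflects the independence of the signal components in the IC model, while the Gaussian draw imposes the null on the noise part.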
FOBI: Simulation results
• 500 repetitions (random samples) for sample sizes n = 50, 100, 200, ..., 1000 were
generated from a 5-variate independent component model (Setting 1 and 2 below).
• For each random sample, M = 200 bootstrap samples were generated for null
hypotheses k = 2, 3, 4 under the assumption of subgaussianity.
• The proportion of bootstrap p-values below 0.05 is reported in the following.
• FOBI, Setting 1: The distributions of the independent components are χ²₃, N(0, 1),
N(0, 1), N(0, 1) and U(0, 1), and the mixing matrix is I5. The true value is k = 2.

  n      k = 3   k = 2   k = 1
  50     0.026   0.030   0.044
  100    0.028   0.050   0.104
  200    0.030   0.062   0.236
  500    0.018   0.062   0.890
  1000   0.028   0.044   1.000
• FOBI, Setting 2: The distributions of the independent components are Exp(1), t₆,
N(0, 1), N(0, 1), N(0, 1), and the mixing matrix is I5. The true value is k = 2.

  n      k = 3   k = 2   k = 1
  50     0.038   0.034   0.102
  100    0.040   0.048   0.206
  200    0.058   0.094   0.468
  500    0.024   0.058   0.798
  1000   0.028   0.070   0.962
SIR: Strategies for bootstrapping
• Write Z = (X − 1nμ′)W′, that is, Z = (Z1, Z2) = ((X − 1nμ′)W1′, (X − 1nμ′)W2′).
• Our bootstrap samples (y∗, X∗) under the null model are then obtained as follows.
1. Let (y∗, Z∗1) be a bootstrap sample of size n from (y, Z1).
2. Let Z∗2 be a bootstrap sample of size n from Z2.
(The two bootstrap samples are drawn independently.)
3. Write Z∗ = (Z∗1, Z∗2).
4. Write X∗ = Z∗(W′)−1 + 1nμ′.
• An estimated p-value for a bootstrap test with the test statistic T(y, X) is then obtained
as M−1 #{j : T((y∗, X∗)j) ≥ T(y, X)}, where (y∗, X∗)1, ..., (y∗, X∗)M are M
independent bootstrap samples.
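Steps 1 and 2 amount to resampling (y, Z1) and Z2 with two independent sets of indices, which breaks any dependence between them. A minimal sketch (the function name is ours):

```python
import numpy as np

def sir_null_bootstrap(y, Z1, Z2, rng):
    """One bootstrap sample under H0: (y, W1 x) independent of W2 x.
    (y, Z1) and Z2 are resampled with independent index vectors."""
    n = len(y)
    i = rng.integers(0, n, size=n)           # indices for (y, Z1)
    j = rng.integers(0, n, size=n)           # independent indices for Z2
    return y[i], Z1[i], Z2[j]
```

Resampling y and Z1 with the same indices preserves their joint distribution, which the null hypothesis leaves unrestricted.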
SIR: Simulation results
• 500 repetitions (random samples) for sample sizes n = 50, 100, 200, ..., 1000 were
generated from a nonlinear model for response y and 5-variate x (Setting 1 and 2 below).
• For each random sample, M = 200 bootstrap samples were generated for null
hypotheses k = 2, 3, 4 under the assumption that (y, W1x) and W2x are
independent.
• The proportion of bootstrap p-values below 0.05 is reported in the following.
• SIR, Setting 1: Now x ∼ N5(0, I5) and y = x1(x1 + x2 + 1) + ε, where
ε ∼ N(0, 0.25) and ε ⊥⊥ x. Again, the true value is k = 2.

  n       k = 3   k = 2   k = 1
  100     0.010   0.024   0.162
  200     0.004   0.034   0.298
  500     0.004   0.042   0.552
  1000    0.012   0.038   0.740
  2000    0.010   0.040   0.908
  5000    0.006   0.046   0.982
  10000   0.010   0.052   0.996
• SIR, Setting 2: Now x ∼ N5(0, I5) and y = x1/(0.5 + (x2 + 1.5)²) + ε, where
ε ∼ N(0, 0.25) and ε ⊥⊥ x. The true value is k = 2.

  n       k = 3   k = 2   k = 1
  100     0.006   0.030   0.242
  200     0.010   0.036   0.398
  500     0.006   0.060   0.710
  1000    0.004   0.032   0.856
  2000    0.006   0.030   0.950
  5000    0.002   0.028   0.986
  10000   0.008   0.060   0.996
Final remarks
• FOBI and SIR serve here only as first examples of ICA (ICS) methods and supervised dimension
reduction methods. Our approach works for other methods as well.
• Comparison: asymptotic tests vs. bootstrap tests.
• How to robustify?
– PCA: Replace the covariance matrix by a robust scatter matrix (elliptic case).
– FOBI: Use two robust scatter matrices with the independence property.
– SIR: Robustify both Cov(x) and Cov(E(x|y)). Gather et al. (2001, 2002), Yohai and Noste
(2005).
• Estimation of k: test H0,0, H0,1, H0,2, ... in turn; if H0,k is rejected, move on to
H0,k+1, and if H0,k is accepted, take k as the estimate (k = 0 if H0,0 is accepted,
k = 1 if H0,1 is accepted, and so on).

Figure 12: Estimation through a stepwise testing procedure.
[Flowchart: each test either rejects, leading to the next test, or accepts, fixing k.]
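The flowchart above reduces to a simple loop. A minimal sketch, where p_value is a stand-in for any routine returning the bootstrap p-value of H0,k:

```python
def estimate_k(p_value, p, alpha=0.05):
    """Stepwise estimate of the signal dimension: test H_{0,0}, H_{0,1}, ...
    and return the first k for which H_{0,k} is accepted (p-value >= alpha).

    p_value : callable k -> bootstrap p-value for H_{0,k} (hypothetical hook)
    p       : dimension of x, the largest candidate k
    """
    for k in range(p):
        if p_value(k) >= alpha:          # H_{0,k} accepted: stop here
            return k
    return p                             # every null rejected: full dimension
```

For example, if H0,0 and H0,1 are rejected and H0,2 is accepted, the estimate is k = 2.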
Some references

Bura, E. and Cook, R.D. (2001). Extending sliced inverse regression: the weighted chi-squared test.
Journal of the American Statistical Association, 96, 996-1003.
Dray, S. (2008). On the number of principal components: A test of dimensionality based on
measurements of similarity between matrices. Computational Statistics & Data Analysis, 52, 2228-2237.
Ilmonen, P., Serfling, R., and Oja, H. (2012). Invariant coordinate selection (ICS) functionals.
International Statistical Review, 80, 93-110.
Li, K.C. (1991). Sliced inverse regression for dimension reduction. Journal of the American Statistical
Association, 86, 316-342.
Liski, E., Nordhausen, K., and Oja, H. (2013). Supervised invariant coordinate selection. Statistics: A
Journal of Theoretical and Applied Statistics, 48, 711-731.
Miettinen, J., Nordhausen, K., Oja, H. and Taskinen, S. (2015). Fourth moments and independent
component analysis. Statistical Science, 30, 372-390.
Tyler, D.E. (1981). Asymptotic inference for eigenvectors. The Annals of Statistics, 9, 725-736.
Tyler, D.E., Critchley, F., Dümbgen, L. and Oja, H. (2009). Invariant coordinate selection. Journal of the
Royal Statistical Society B, 71, 549-592.
THANK YOU FOR YOUR INTEREST!