TRANSCRIPT
Principal Component Analysis Based on L1-Norm Maximization
Nojun Kwak, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008
Outline
• Introduction
• Background Knowledge
• Problem Description
• Algorithms
• Experiments
• Conclusion
2
Introduction
• In data analysis problems, why do we need dimensionality reduction?
• Principal Component Analysis (PCA)
• PCA based on the L2-Norm is prone to the presence of outliers.
3
Introduction
• Some algorithms for this problem:
– L1-PCA
  • Weighted median method
  • Convex programming method
  • Maximum likelihood estimation method
– R1-PCA
4
Background Knowledge
• L1-Norm, L2-Norm
• Principal Component Analysis (PCA)
5
Lp-Norm
• Consider an n-dimensional vector $x = [x_1, x_2, \ldots, x_n]$.
• Define the p-Norm: $\|x\|_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}$
• L1-Norm is $\|x\|_1 = \sum_{i=1}^{n} |x_i|$
• L2-Norm is $\|x\|_2 = \left( \sum_{i=1}^{n} x_i^2 \right)^{1/2}$
6
Lp-Norm
• For example, x = [1, 2, 3] (checked numerically in the sketch below):
• Special case: $\|x\|_\infty = \max_i |x_i|$

name      symbol  value  approximation
L1-Norm   |x|1    6      6.000
L2-Norm   |x|2    √14    3.742
L3-Norm   |x|3    ³√36   3.302
L4-Norm   |x|4    ⁴√98   3.146
L∞-Norm   |x|∞    3      3.000
7
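As a quick check of the values in this table, here is a minimal sketch (assuming NumPy is available; not part of the original slides) that computes the Lp-Norms of x = [1, 2, 3]:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])

# Lp-Norm: (sum_i |x_i|^p)^(1/p)
for p in (1, 2, 3, 4):
    print(f"L{p}-Norm = {np.sum(np.abs(x) ** p) ** (1.0 / p):.3f}")

# Special case, the L-infinity norm: max_i |x_i|
print(f"Linf-Norm = {np.max(np.abs(x)):.3f}")
```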
Principal Component Analysis
• Principal component analysis (PCA) is a technique to seek projections that best preserve the data in a least-squares sense.
• The projections constitute a low-dimensional linear subspace.
8
Principal Component Analysis
• The projection vectors w_1, …, w_m are the eigenvectors of the scatter matrix with the m largest eigenvalues (see the sketch below).
9
Scatter matrix: $S = \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^T$, where $\bar{x}$ is the sample mean.
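A minimal sketch of this eigenvector view of L2-PCA (NumPy; the function name and variable names are illustrative, not from the paper):

```python
import numpy as np

def pca_l2(X, m):
    """L2-PCA: the m eigenvectors of the scatter matrix with the largest eigenvalues.

    X : (d, n) data matrix, one sample per column.
    """
    Xc = X - X.mean(axis=1, keepdims=True)   # center the samples
    S = Xc @ Xc.T                            # scatter matrix, (d, d)
    eigvals, eigvecs = np.linalg.eigh(S)     # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]        # largest eigenvalues first
    return eigvecs[:, order[:m]], eigvals[order[:m]]

# usage: W, lams = pca_l2(np.random.randn(5, 100), m=2)
```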
Principal Component Analysis
• Rotational invariance is a fundamental property of Euclidean space with the L2-Norm.
• So, PCA has the rotational invariance property.
10
Problem Description
• Traditional PCA is sensitive to the presence of outliers.
• The effect of outliers with a large norm is exaggerated by the use of the L2-Norm.
• Is there another method?
11
Problem Description
• If we use the L1-Norm instead of the L2-Norm, the problem becomes
  $\min_{W,V} \|X - WV\|_1 = \sum_{i=1}^{n} \sum_{j=1}^{d} \Bigl| x_{ji} - \sum_{k=1}^{m} w_{jk} v_{ki} \Bigr|$
where $X \in \mathbb{R}^{d \times n}$ is the dataset, $W \in \mathbb{R}^{d \times m}$ is the projection matrix, and $V \in \mathbb{R}^{m \times n}$ is the coefficient matrix.
12
Problem Description
• However, it is very hard to obtain the exact solution of this problem.
• To resolve this, Ding et al. proposed the R1-Norm and an approximate solution.
13
We call it R1-PCA.
Problem Description
• The solution of R1-PCA depends on the dimension m of the subspace being found.
• The optimal solution for m = m1 is not necessarily a subspace of the optimal solution for m = m2 > m1.
• The proposed method: PCA-L1
14
Algorithms
• Instead, we consider the following problem:
  $W^* = \arg\max_{W^T W = I_m} \|W^T X\|_1 = \arg\max_{W^T W = I_m} \sum_{i=1}^{n} \sum_{k=1}^{m} |w_k^T x_i|$
• The maximization is done in the feature space.
15
The constraint $W^T W = I_m$ ensures the orthonormality of the projection matrix.
Algorithms
• However, it is difficult to find a global solution of this problem for m > 1.
• The optimal i-th projection vector varies with the number of extracted features m, as in R1-PCA.
• How to solve it?
16
Algorithms
• We simplify it into a series of m = 1 problems using a greedy search method.
• If we set m = 1, the problem becomes
  $w^* = \arg\max_{\|w\|_2 = 1} \|w^T X\|_1 = \arg\max_{\|w\|_2 = 1} \sum_{i=1}^{n} |w^T x_i|$
17
Although the successive greedy solutions may differ from the optimal solution, they are expected to provide a good approximation.
Algorithms
• The optimization is still difficult because it contains the absolute value operation, which is nonlinear.
18
Algorithms
19
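Slide 19 shows the PCA-L1 procedure itself (as a figure in the original deck). The following is a minimal NumPy sketch of that single-vector procedure, written to follow the steps referenced in the proof below (initialization, polarity check, flipping/maximization, and the convergence check of Step 4); it is an illustrative reimplementation, not the authors' code.

```python
import numpy as np

def pca_l1_single(X, w0=None, max_iter=1000, rng=None):
    """One PCA-L1 projection vector: maximize sum_i |w^T x_i| over unit vectors w.

    X : (d, n) data matrix, one sample per column.
    """
    rng = np.random.default_rng() if rng is None else rng
    d, n = X.shape
    w = rng.standard_normal(d) if w0 is None else np.asarray(w0, dtype=float)
    w = w / np.linalg.norm(w)                    # Step 1: initialization, ||w|| = 1

    for _ in range(max_iter):
        p = np.where(w @ X >= 0.0, 1.0, -1.0)    # Step 2: polarity check, p_i = sign(w^T x_i)
        w_new = X @ p                            # Step 3: flipping and maximization
        w_new = w_new / np.linalg.norm(w_new)

        if np.allclose(w_new, w):                # Step 4a: has w stopped changing?
            if np.any(np.isclose(w_new @ X, 0.0)):   # Step 4b: avoid w^T x_i = 0
                w_new = w_new + 1e-6 * rng.standard_normal(d)
                w_new = w_new / np.linalg.norm(w_new)
            else:
                return w_new                     # Step 4c: local maximum reached
        w = w_new
    return w
```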
Algorithms
• However, does the PCA-L1 procedure find a local maximum point w*?
• We should prove it.
20
Theorem
• Theorem: With the PCA-L1 procedure, the projection vector w(t) converges to w*, which is a local maximum point of $\|w^T X\|_1$.
• The proof includes two parts:
– $\sum_i |w(t)^T x_i|$ is a non-decreasing function of t.
– The objective function has a local maximum value at w*.
21
Proof
• $\sum_i |w(t)^T x_i|$ is a non-decreasing function of t:
  $\sum_i |w(t+1)^T x_i| \ge w(t+1)^T \Bigl( \sum_i p_i(t) x_i \Bigr) \ge w(t)^T \Bigl( \sum_i p_i(t) x_i \Bigr) = \sum_i |w(t)^T x_i|$
Here {p_i(t)} is the set of optimal polarities corresponding to w(t): for all i, $p_i(t)\, w(t)^T x_i = |w(t)^T x_i| \ge 0$.
22
Proof
• This holds because w(t+1) and $\sum_i p_i(t) x_i$ are parallel (by Step 3), and the inner product of two vectors with fixed norms is maximized when they are parallel.
23
Proof
• So, the objective function is non-decreasing and there are a finite number of data points.
⇒ The PCA-L1 procedure converges to a projection vector w*.
24
Proof
• The objective function has a local maximum value at w*.
• Because w(t) converges to w* by the PCA-L1 procedure, w* is parallel to $\sum_i p_i^* x_i$, where $p_i^* = \mathrm{sign}(w^{*T} x_i)$ for all i.
• By Step 4b, $w^{*T} x_i \neq 0$ for all i.
25
Proof
• There exists a small neighborhood N(w*) of w*, such that if w ∈ N(w*), then $\mathrm{sign}(w^T x_i) = p_i^*$ for all i.
• Then, since w* is parallel to $\sum_i p_i^* x_i$, the inequality $\sum_i |w^{*T} x_i| \ge \sum_i |w^T x_i|$ holds for all w ∈ N(w*).
⇒ w* is a local maximum point.
26
Algorithms
• So, the PCA-L1 procedure finds a local maximum point w*.
• Because w* is a linear combination of the data points x_i, i.e., $w^* = \sum_i p_i x_i / \|\sum_i p_i x_i\|_2$, it is invariant to rotations.
Under a rotational transformation R: X → RX, we have W → RW.
27
Algorithms
• Computational complexity: $O(n\, d\, n_{it})$.
• $n_{it}$ is the number of iterations for convergence; $n_{it}$ does not depend on the dimension d.
28
Algorithms
• The PCA-L1 procedure only finds a local maximum solution; it may not be the global solution.
• We can set the initial projection vector w(0) appropriately:
– by setting w(0) to a suitable starting vector, or
– by running PCA-L1 with different initial vectors w(0) and keeping the best solution.
29
Algorithms
• Extracting multiple features (m > 1) with a greedy search algorithm (sketched in code below):
– Set $x_i^0 = x_i$ for all i and $w_0 = 0$.
– For j = 1, …, m: update $x_i^j = x_i^{j-1} - w_{j-1}(w_{j-1}^T x_i^{j-1})$ for all i, then apply the PCA-L1 procedure to $\{x_i^j\}$ to obtain $w_j$.
30
This follows the idea of the original PCA: run the PCA-L1 procedure once for each feature dimension.
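A sketch of this greedy multi-feature extraction under the same assumptions; it reuses the `pca_l1_single` routine sketched after slide 19 and the deflation update given above.

```python
import numpy as np

def pca_l1(X, m, **kwargs):
    """Greedy PCA-L1: extract m projection vectors one at a time.

    X : (d, n) data matrix, one sample per column.
    Returns W of shape (d, m); its columns should be orthonormal.
    """
    Xj = np.array(X, dtype=float)            # working copy, deflated in place
    W = np.zeros((X.shape[0], m))
    for j in range(m):
        if j > 0:
            w_prev = W[:, j - 1]
            # deflation: x_i^j = x_i^{j-1} - w_{j-1} (w_{j-1}^T x_i^{j-1})
            Xj = Xj - np.outer(w_prev, w_prev @ Xj)
        W[:, j] = pca_l1_single(Xj, **kwargs)   # from the earlier sketch
    return W
```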
Algorithms
• How can we guarantee the orthonormality of the projection vectors?
• We should show that $w_j$ is orthogonal to $w_{j-1}$.
31
Proof
• The projection vector $w_j$ is a linear combination of the samples $x_i^j$ (from the greedy search algorithm).
⇒ It lies in the subspace spanned by $\{x_i^j\}$.
• Then, we consider $w_{j-1}^T x_i^j$:
  $w_{j-1}^T x_i^j = w_{j-1}^T \bigl( x_i^{j-1} - w_{j-1} w_{j-1}^T x_i^{j-1} \bigr) = (1 - w_{j-1}^T w_{j-1})\, w_{j-1}^T x_i^{j-1} = 0$
since $w_{j-1}$ is a unit vector ($w_{j-1}^T w_{j-1} = 1$).
32
Proof
• Because $w_{j-1}^T x_i^j = 0$ for all i, $w_{j-1}$ is orthogonal to every $x_i^j$.
⇒ $w_{j-1}$ is orthogonal to $w_j$.
33
The orthonormality of the projection vectors is guaranteed.
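A quick numerical check of this orthonormality guarantee, reusing the `pca_l1` sketch from slide 30 (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 200))       # 200 random 10-dimensional samples
W = pca_l1(X, m=3, rng=rng)              # greedy PCA-L1 from the earlier sketch
print(np.round(W.T @ W, 6))              # should be close to the 3x3 identity matrix
```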
Algorithms
• Even if the greedy search algorithm does not provide the optimal solution, it provides a set of good projections that maximize L1 dispersion.
34
Algorithms
• For data analysis, we need to decide how much of the data variation is captured by the extracted features.
• In PCA, we can use the eigenvalues of the scatter matrix:
35
The j-th eigenvalue $\lambda_j$ is equivalent to the variance of the j-th feature.
We can compute the ratio of the captured variance to the total variance, $\sum_{j=1}^{m} \lambda_j / \sum_{j=1}^{d} \lambda_j$.
The number of features m is set to the smallest value for which this ratio exceeds, e.g., 95% of the total variance.
Algorithms
• In PCA-L1, once $w_j$ is obtained, we can likewise compute the variance of the j-th feature, $s_j = \sum_{i=1}^{n} (w_j^T x_i)^2$ (for zero-mean data).
• The sum of the variances of the m extracted features: $\sum_{j=1}^{m} s_j$.
• The total variance: $\sum_{i=1}^{n} \|x_i\|_2^2$.
36
We can set the appropriate number of extracted features as in the original PCA, by requiring the ratio of captured to total variance to exceed a threshold (see the sketch below).
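A small sketch of this feature-count selection (NumPy; zero-mean data assumed, and the 95 percent threshold is just an example value):

```python
import numpy as np

def choose_num_features(X, W, threshold=0.95):
    """Smallest m whose captured-variance ratio exceeds `threshold`.

    X : (d, n) zero-mean data matrix; W : (d, M) PCA-L1 projection vectors.
    """
    total_var = np.sum(X ** 2)                   # sum_i ||x_i||^2
    feat_var = np.sum((W.T @ X) ** 2, axis=1)    # s_j = sum_i (w_j^T x_i)^2
    ratios = np.cumsum(feat_var) / total_var     # captured / total variance
    m = int(np.searchsorted(ratios, threshold)) + 1
    return min(m, W.shape[1])                    # cap at the number of available vectors
```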
Experiments
• In the experiments, we apply the PCA-L1 algorithm and compare it with R1-PCA and the original PCA.
• Three experiments:
– A Toy Problem with an Outlier
– UCI Data Sets
– Face Reconstruction
37
A Toy Problem with an Outlier
• Consider a set of data points in a 2D space, one of which is an outlier.
• If we discard the outlier, the projection vector should follow the direction of the remaining (inlier) points.
38
A Toy Problem with an Outlier
• The projection vectors found by each method (figure; the outlier is marked):
39
A Toy Problem with an Outlier
• The residual errors (figure; the outlier is marked):
40

Average residual error
PCA-L1   L2-PCA   R1-PCA
1.200    1.401    1.206

L2-PCA was much influenced by the outlier.
UCI Data Sets
• Data sets from the UCI machine learning repository.
• Compare the classification performances.
• A 1-NN classifier was used, with 10-fold cross-validation for the average classification rate.
• For PCA-L1, we set the initial projection vector w(0) as specified on the slide.
41
UCI Data Sets
• The data sets:
42
UCI Data Sets
• The average correct classification rates:
43
UCI Data Sets
• The average correct classification rates:
44
UCI Data Sets
• The average correct classification rates:
45
In many cases, PCA-L1 outperformed L2-PCA and R1-PCA when the number of extracted features was small.
UCI Data Sets
• Average Classification rate on UCI Data Sets:
46
PCA-L1 outperformed the other methods by 1% on average.
UCI Data Sets
• Computation cost:
47
Face Reconstruction
• The Yale face database:
– 15 individuals.
– 11 face images per person.
• Among 165 images, 20% were selected randomly and occluded with a noise block.
48
Face Reconstruction
• For these image sets, we applied:
– L2-PCA (eigenface)
– R1-PCA
– PCA-L1
• Then, we used the extracted features to reconstruct the images.
49
Face Reconstruction
50
• Experimental results:
Face Reconstruction
• The average reconstruction error is
  $e(m) = \frac{1}{n} \sum_{i=1}^{n} \bigl\| x_i^{\mathrm{orig}} - \hat{x}_i(m) \bigr\|_2$,
where $x_i^{\mathrm{orig}}$ is the original (unoccluded) image and $\hat{x}_i(m)$ is the image reconstructed from m extracted features (see the sketch below).
51
From 10 to 20 features, the difference became apparent and PCA-L1 outperformed the other methods.
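A sketch of this reconstruction-error measurement (NumPy; the variable names and the mean handling are assumptions of this sketch, since the slide does not spell them out). `W` holds the projection vectors of any of the three methods, `X_occ` the occluded input images, and `X_orig` the clean originals, one image per column.

```python
import numpy as np

def average_reconstruction_error(X_orig, X_occ, W):
    """Reconstruct each occluded image from its features and compare to the clean original."""
    mean = X_occ.mean(axis=1, keepdims=True)
    X_rec = mean + W @ (W.T @ (X_occ - mean))        # reconstruction from m features
    return np.mean(np.linalg.norm(X_orig - X_rec, axis=0))
```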
Face Reconstruction
• We added 30 dummy images consisting of random black and white dots to the original 165 Yale images.
• We applied:
– L2-PCA (eigenface)
– R1-PCA
– PCA-L1
• We reconstructed the images with the extracted features.
52
Face Reconstruction
• Experimental results:
53
Face Reconstruction
• The average reconstruction error:
54
From 6 to 36 features, the error of L2-PCA remains constant: the dummy images seriously affect its projection vectors.
From 14 to 36 features, the error of R1-PCA increases: the dummy images seriously affect its projection vectors.
Conclusion
• The PCA-L1 procedure was proven to find a local maximum point.
• The computational complexity is proportional to:
– the number of samples,
– the dimension of the input space,
– the number of iterations.
• The method is usually faster and is robust to outliers.
55
Principal Component Analysis
• Given a dataset of l samples: $D = \{ x_i \in \mathbb{R}^d \}_{i=1}^{l}$.
• We represent D by projecting the data onto a line running through the sample mean $\bar{x}$, in the direction of a unit vector $e$ ($\|e\|_2 = 1$): $x = \bar{x} + a e$.
56
Principal Component Analysis
• Then, the squared-error criterion is
  $J(a_1, \ldots, a_l, e) = \sum_{k=1}^{l} \| (\bar{x} + a_k e) - x_k \|_2^2$,
which is minimized with respect to the coefficients by $a_k = e^T (x_k - \bar{x})$.
57
Principal Component Analysis
• To look for the best direction e, substitute $a_k = e^T(x_k - \bar{x})$ back into J:
  $J(e) = - e^T S e + \sum_{k=1}^{l} \| x_k - \bar{x} \|_2^2$,
where $S = \sum_{k=1}^{l} (x_k - \bar{x})(x_k - \bar{x})^T$ is the scatter matrix.
58
Principal Component Analysis
• We want to minimize J(e):
  ⇒ Maximize $e^T S e$, subject to $\|e\|_2 = 1$.
• We use Lagrange multipliers:
  $u = e^T S e - \lambda (e^T e - 1)$, and setting $\partial u / \partial e = 2 S e - 2 \lambda e = 0$ gives $S e = \lambda e$.
59
Principal Component Analysis
• Since $e^T S e = \lambda e^T e = \lambda$, minimizing J can be achieved by choosing e as the eigenvector of S with the largest eigenvalue.
• Similarly, we can extend the 1-d projection to an m-d projection.
60