
Principal Component Analysis With Missing Data and Outliers

Haifeng Chen

Electrical and Computer Engineering Department, Rutgers University, Piscataway, NJ 08854

[email protected]

1 Introduction

Principal component analysis (PCA) [10] is a well established technique for dimensionality reduction,

and a chapter on the subject may be found in numerous texts on multivariate analysis. Examples of its

many applications include data compression, image processing, visualisation, exploratory data analysis,

pattern recognition and time series prediction. The popularity of PCA comes from three important prop-

erties. First, it is the optimal (in terms of mean squared error) linear scheme for compressing a set of high

dimensional vectors into a set of lower dimensional vectors and then reconstructing. Second, the model

parameters can be computed directly from the data – for example by diagonalizing the sample covari-

ance. Third, compression and decompression are easy operations to perform given the model parameters

– they require only matrix multiplications.

Despite these attractive features, however, PCA models have several shortcomings. One is that naive

methods for finding the principal component directions have trouble with high dimensional data or large

numbers of data points. Consider attempting to diagonalize the sample covariance matrix of $N$ vectors in a space of $d$ dimensions when $N$ and $d$ are several hundred or several thousand. Difficulties arise both from computational complexity and from data scarcity. Computing the sample covariance itself is very costly, requiring $O(Nd^2)$ operations. In general it is best to avoid computing the sample

covariance explicitly.

Another shortcoming of standard approaches to PCA is that it is not obvious how to deal properly

with incomplete data sets, in which some of the points are missing. Currently the incomplete points are

either discarded or completed using a variety of interpolation methods. However, such approaches are

no longer valid when a significant portion of the measurement matrix is unknown.

Typically, the training data for PCA is pre-processed in some way. But in some realistic problems

where the amount of training data is huge, it becomes impractical to manually verify that all the data is


’good’. In general, training data may contain some errors from the underlying data generation method.

We view these error points as “outliers”. However, the standard PCA algorithm is based on the assump-

tion that the data have not been spoiled by outliers. In the presence of outliers, a robust version of PCA has to be

developed.

To address these drawbacks of standard PCA, many methods have been proposed in the fields of statistics, computer engineering, neural networks, etc. The purpose of this project is to give an overview of those methods and to perform some experiments showing how the improved PCA algorithms deal with missing data and outliers in high dimensional data sets. In Section 2, a brief introduction to standard PCA is presented. To deal with high dimensional data, we describe an EM algorithm for calculating principal components in Section 3.2. Section 4 presents PCA for data sets containing missing points. In Section

5, we give a detailed description of current robust PCA algorithms. Some experimental results are

provided in Section 6.

2 Principal component analysis (PCA)

The most common derivation of PCA is in terms of a standardized linear projection which maximizes the variance in the projected space [10]. For a set of observed $d$-dimensional data vectors $\{t_n\}$, $n = 1, \dots, N$, the $k$ principal axes $w_j$, $j = 1, \dots, k$, are those orthonormal axes onto which the retained variance under projection is maximal. It can be shown that the vectors $w_j$ are given by the $k$ dominant eigenvectors (i.e. those with the largest associated eigenvalues $\lambda_j$) of the sample covariance matrix

$$S = \frac{1}{N} \sum_{n=1}^{N} (t_n - \mu)(t_n - \mu)^T, \qquad (1)$$

where $\mu$ is the data sample mean, such that

$$S w_j = \lambda_j w_j. \qquad (2)$$

The $k$ principal components of the observed vector $t_n$ are given by the vector

$$x_n = W^T (t_n - \mu), \qquad (3)$$

where $W = (w_1, w_2, \dots, w_k)$. The variables $x_n$ are then uncorrelated, so that the covariance matrix $\frac{1}{N} \sum_{n=1}^{N} x_n x_n^T$ is diagonal with elements $\lambda_j$.

A complementary property of PCA, and the one most closely related to the original discussion of [17], is that, of all orthogonal linear projections of the form (3), the principal component projection minimizes the squared reconstruction error $\sum_{n=1}^{N} \| t_n - \hat{t}_n \|^2$, where the optimal linear reconstruction of $t_n$ is given by

$$\hat{t}_n = W x_n + \mu. \qquad (4)$$
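As a concrete illustration of equations (1)–(4), the following sketch computes the principal axes by diagonalizing the sample covariance with NumPy. It is a minimal sketch, not code from the report; the names `T` (an $N \times d$ data matrix) and `k` (the number of retained components) are illustrative.

```python
import numpy as np

def standard_pca(T, k):
    """Standard PCA for an (N, d) data matrix T, keeping k components."""
    mu = T.mean(axis=0)                            # sample mean in eq. (1)
    S = np.cov(T, rowvar=False, bias=True)         # sample covariance, eq. (1)
    lam, W = np.linalg.eigh(S)                     # eigen-decomposition, eq. (2)
    order = np.argsort(lam)[::-1][:k]              # keep the k largest eigenvalues
    W, lam = W[:, order], lam[order]
    X = (T - mu) @ W                               # principal components, eq. (3)
    T_hat = X @ W.T + mu                           # optimal reconstruction, eq. (4)
    return W, lam, X, T_hat
```

Note that forming $S$ explicitly costs $O(Nd^2)$, which is exactly the expense the EM algorithm of the next section avoids.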

3 EM Algorithm for PCA

In this section, we describe a version of the expectation maximization (EM) algorithm [18] for learning the principal components of a data set. The algorithm does not require computing the sample covariance, and it can deal with high dimensional data more efficiently than traditional PCA. In Section 3.1 a probabilistic model for PCA is given. Based on that model, the EM algorithm is presented in Section 3.2, and its advantages are discussed.

3.1 Probabilistic Model of PCA

Principal component analysis can be viewed as a limiting case of a particular class of linear Gaussian models. The goal of such models is to capture the covariance structure of an observed $d$-dimensional variable $t$ using fewer than the $d(d+1)/2$ free parameters required in a full covariance matrix. Linear Gaussian models do this by assuming that $t$ was produced as a linear transformation of some $k$-dimensional latent variable $x$ plus additive Gaussian noise. Denoting the transformation by the $d \times k$ matrix $W$, and the $d$-dimensional noise vector by $\epsilon$ (with covariance matrix $\Psi$), the generative model can be written as

$$t = W x + \epsilon. \qquad (5)$$

Conventionally, $x \sim \mathcal{N}(0, I)$; that is, the latent variables are defined to be independent and Gaussian with unit variance. By additionally specifying the error, or noise, model to be likewise Gaussian, $\epsilon \sim \mathcal{N}(0, \Psi)$, equation (5) induces a corresponding Gaussian distribution for the observations,

$$t \sim \mathcal{N}(0, W W^T + \Psi). \qquad (6)$$

In order to save parameters over the direct covariance representation in $d$-dimensional space, it is necessary to choose $k < d$ and also to restrict the covariance structure of the Gaussian noise $\epsilon$ by constraining the matrix $\Psi$. For example, if the shape of the noise distribution is restricted to be axis aligned (its covariance matrix is diagonal), the model is known as factor analysis.

For the case of isotropic noise $\Psi = \sigma^2 I$, equation (5) implies a probability distribution over $t$-space for a given $x$ of the form

$$p(t \mid x) = (2\pi\sigma^2)^{-d/2} \exp\!\left( -\frac{1}{2\sigma^2} \| t - W x \|^2 \right). \qquad (7)$$

Using Bayes' rule, the posterior distribution of the latent variables $x$ given the observed $t$ may be calculated:

$$p(x \mid t) = (2\pi)^{-k/2}\, |\sigma^{-2} M|^{1/2} \exp\!\left\{ -\frac{1}{2} \left( x - M^{-1} W^T t \right)^T \sigma^{-2} M \left( x - M^{-1} W^T t \right) \right\}, \qquad (8)$$

where the posterior covariance matrix is given by

$$\sigma^2 M^{-1} = \sigma^2 \left( \sigma^2 I + W^T W \right)^{-1}, \qquad (9)$$

and $M = \sigma^2 I + W^T W$ is a $k \times k$ matrix.
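The generative model (5)–(6) and the posterior moments (8)–(9) can be checked numerically in a few lines of NumPy. This is a minimal sketch under the isotropic-noise assumption; the dimensions and variable names (`d`, `k`, `N`, `sigma2`) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, N, sigma2 = 5, 2, 1000, 0.1
W = rng.normal(size=(d, k))                       # d x k transformation matrix

# Generative model (5): t = W x + eps, with x ~ N(0, I) and eps ~ N(0, sigma2 I)
X = rng.normal(size=(k, N))
T = W @ X + np.sqrt(sigma2) * rng.normal(size=(d, N))

# Posterior (8)-(9): x | t ~ N(M^{-1} W^T t, sigma2 M^{-1}), with M = sigma2 I + W^T W
M = sigma2 * np.eye(k) + W.T @ W
post_mean = np.linalg.solve(M, W.T @ T)           # one posterior mean per column of T
post_cov = sigma2 * np.linalg.inv(M)              # posterior covariance (shared by all samples)
```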

3.2 EM Algorithm for PCA

Principal component analysis is a limiting case of the linear Gaussian model as the covariance of the noise $\epsilon$ becomes infinitely small and equal in all directions. Mathematically, PCA is obtained by taking the limit $\Psi = \lim_{\sigma^2 \to 0} \sigma^2 I$. This has the effect of making the likelihood of a point $t$ dominated solely by the squared distance between it and its reconstruction $W x$. The directions of the columns of $W$ which minimize this error are known as the principal axes. Inference now reduces to a simple least squares projection:

$$p(x \mid t) = \mathcal{N}\!\left( (W^T W)^{-1} W^T t, \; 0 \right) = \delta\!\left( x - (W^T W)^{-1} W^T t \right). \qquad (10)$$

Since the noise has become infinitesimal, the posterior over states collapses to a single point and the covariance becomes zero.

The key observation of [18] is that even though the principal components can be computed explicitly, there is still an EM algorithm for learning them. We can use formula (10) as the e-step to estimate the unknown states and then use (5) in the m-step to choose $W$. The algorithm is

e-step: $\quad X = (W^T W)^{-1} W^T Y$

m-step: $\quad W_{\text{new}} = Y X^T (X X^T)^{-1}$

where $Y$ is a $d \times N$ matrix of all the observed data and $X$ is a $k \times N$ matrix of the unknown states. The columns of $W$ will span the space of the first $k$ principal axes. To compute the corresponding eigenvectors and eigenvalues explicitly, the data can be projected into this $k$-dimensional subspace and an ordered orthogonal basis for the covariance in the subspace can be constructed. Notice that the algorithm can be performed online using only a single data point at a time, so its storage requirements are only $O(kd) + O(k^2)$.
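The e-step and m-step above amount to two small matrix solves per iteration. The following is a minimal sketch of the procedure, assuming the data matrix `Y` ($d \times N$) has already been centered; the final orthogonalization step, which recovers ordered eigenvectors from the learned subspace, follows the remark above.

```python
import numpy as np

def em_pca(Y, k, n_iter=50, seed=0):
    """EM algorithm for PCA [18]. Y is a d x N matrix of (centered) observations."""
    d, N = Y.shape
    W = np.random.default_rng(seed).normal(size=(d, k))   # random initial subspace
    for _ in range(n_iter):
        X = np.linalg.solve(W.T @ W, W.T @ Y)              # e-step: X = (W^T W)^{-1} W^T Y
        W = Y @ X.T @ np.linalg.inv(X @ X.T)               # m-step: W = Y X^T (X X^T)^{-1}
    # W spans the principal subspace; project into it and diagonalize the small
    # k x k covariance to obtain an ordered orthogonal basis.
    Q, _ = np.linalg.qr(W)
    lam, V = np.linalg.eigh(np.cov(Q.T @ Y))
    return Q @ V[:, np.argsort(lam)[::-1]]                 # ordered principal axes
```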


The intuition behind the algorithm is as follows: guess an orientation for the principal subspace. Fix

the guessed subspace and project the data into it to give the values of the hidden states $x$. Now fix the values of the hidden states and choose the subspace orientation which minimizes the squared reconstruction errors of the data points. For a simple two-dimensional example, we can give

a physical analogy. Imagine that we have a rod pinned at the origin which is free to rotate. Pick an

orientation for the rod. Holding the rod still, project every data point onto the rod, and attach each

projected point to its original point with a spring. Now release the rod. Repeat. The direction of the rod

represents our guess of the principal component of the dataset. The energy stored in the springs is the

reconstruction error we are trying to minimize.

In [18], it is shown that the EM algorithm always reaches a local maximum of the likelihood. Furthermore,

Tipping and Bishop have shown [21] that the only stable local extremum is the global maximum at which

the true principal subspace is found; so it converges to the correct result.

The EM learning algorithm for PCA amounts to an iterative procedure for finding the subspace spanned by the $k$ leading eigenvectors without explicit computation of the sample covariance. It is attractive for small $k$ because its complexity is limited by $O(kdN)$ per iteration and so depends only linearly on both the dimensionality of the data and the number of points. Methods that explicitly compute the sample covariance matrix have complexities limited by $O(d^2 N)$. The EM algorithm therefore scales more favorably in cases where $k$ is small and both $d$ and $N$ are large. For high dimensional data such as images, the EM algorithm is much more efficient than the traditional PCA algorithm.

4 PCA with Missing Data

During the e-step of the EM algorithm, we compute the hidden states $x$ by projecting the observed

data into the current subspace. This minimizes the model error given the observed data and the model

parameters. Unfortunately, the data matrix is sometimes incomplete in practice. When the percentage

of missing data is very small, it is possible to replace the missing elements with the mean or an extreme

value, which is a common strategy in multivariate statistics [6]. However, such an approach is no longer

valid when a significant portion of the measurement matrix is unknown. It is not unusual for a large

portion of the matrix to be unobservable. For example, in the computer vision field, consider modelling a dodecahedron (a 12-faced polyhedron) from a sequence of segmented images. Assume that we have tracked the 12 faces over four nonsingular views. The segmented range images provide trajectories of plane coordinates $\{ x_f^p \mid f = 1, \dots, 4, \; p = 1, \dots, 12 \}$, where each $x = (n^T, d)^T$ represents a plane equation with surface normal $n$ and normal distance $d$ to the origin. We may then form a $16 \times 12$ measurement matrix

$$Y = \begin{pmatrix} x_1^1 & x_1^2 & \cdots & * & \cdots \\ \vdots & & \ddots & & \vdots \\ * & \cdots & x_4^p & \cdots & x_4^{12} \end{pmatrix}, \qquad (11)$$

where every $*$ indicates an unobservable face, since only six faces are visible from each nonsingular view. For such data, principal component analysis with missing data (PCAMD) has to be used. Instead of estimating only $x$ as the value which minimizes the squared distance between the point and its reconstruction, PCAMD generalizes the e-step to:

generalized e-step: For each (possibly incomplete) point $t$, find the unique pair of points $x^*$ and $t^*$ (such that $x^*$ lies in the current principal subspace and $t^*$ lies in the subspace defined by the known information about $t$) which minimize the norm $\| W x^* - t^* \|$. Set the corresponding column of $X$ to $x^*$ and the corresponding column of $Y$ to $t^*$.

If $t$ is complete, then $t^* = t$ and $x^*$ is found exactly as before. If not, then $x^*$ and $t^*$ are the solution to a least squares problem and can be found by, for example, QR factorization of a particular constraint matrix.
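For a single incomplete point, the generalized e-step is just a least-squares solve restricted to the observed coordinates, after which the missing coordinates are filled in from the subspace. Below is a minimal sketch; the function name and the boolean mask `observed` are illustrative.

```python
import numpy as np

def generalized_e_step(W, t, observed):
    """Generalized e-step for one (possibly incomplete) point.
    W is the d x k basis, t a length-d point, observed a boolean mask of known entries."""
    Wo = W[observed]                                          # rows of W at the known coordinates
    x_star, *_ = np.linalg.lstsq(Wo, t[observed], rcond=None) # least-squares state estimate
    t_star = t.copy()
    t_star[~observed] = W[~observed] @ x_star                 # complete the point from the subspace
    return x_star, t_star
```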

In the above generalized EM algorithm, we still assume the measurements have already been cen-

tered. But in the case of missing data, especially when a significant portion of the measurement matrix

is unknown, the average of the data may not be a very reliable estimate of the mean. Instead of using

the centered data, some methods treat the mean as extra parameters in the optimization, such as Wiberg's method [22], described in the next section.

4.1 Wiberg’s Method

Suppose the�§J � measurement matrix

�has rank ¨ . If the data is complete and the measurement matrix

filled, the problem of principal component analysis is to determine ©ª ��©' ��©« , such thatC � 1l¬®­ 6 1}©ª ©' ©« 6 C (12)

is minimized, where ©ª and ©« are�¯J ¨ and � J ¨ matrices with orthogonal columns, ©' ( � ��°�±2� X � �

is a ¨ J ¨ diagonal matrix, ¬ is the maximum likelihood approximation of the mean vector, and ­ 6 (�7�����������f� � is an � -tuple with all ones. The solution of this problem is essentially the SVD of the centered

(or registered) data matrix ²³1l¬´­ 6 .


If the data is incomplete, we have the following minimization problem:

$$\min \phi = \frac{1}{2} \sum_{(i,j) \in I} \left( y_{ij} - m_i - u_i^T v_j \right)^2, \qquad I = \{ (i,j) : y_{ij} \text{ is observed}, \; 1 \le i \le d, \; 1 \le j \le N \}, \qquad (13)$$

where $u_i$ and $v_j$ are column vectors defined by

$$\begin{pmatrix} u_1^T \\ \vdots \\ u_d^T \end{pmatrix} = \hat{U} \hat{S}^{1/2} \qquad (14)$$

and

$$\begin{pmatrix} v_1^T \\ \vdots \\ v_N^T \end{pmatrix} = \hat{V} \hat{S}^{1/2}. \qquad (15)$$

It is trivially true that a $d \times N$ matrix of rank $r$ has at most $r(d + N - r)$ independent elements, as seen from its LU decomposition. Hence, a necessary condition to uniquely solve (13) is that the number of observable elements in $Y$, denoted $\ell$, satisfies $\ell \ge r(d + N - r)$. To sufficiently determine the problem (13), more constraints are needed to normalize either the left matrix $\hat{U}$ or the right matrix $\hat{V}$.

If we write the observable elements of the measurement matrix $Y$ as an $\ell$-dimensional vector $\tilde{y}$, the minimization problem can be written as

$$\min \phi = \frac{1}{2} \tilde{r}^T \tilde{r}, \qquad (16)$$

where

$$\tilde{r} = \tilde{y} - \tilde{m} - F \tilde{v} = \tilde{y} - G \tilde{u} \qquad (17)$$

and

$$\tilde{v} = \begin{pmatrix} v_1 \\ \vdots \\ v_N \end{pmatrix}, \qquad \tilde{u} = \begin{pmatrix} \tilde{u}_1 \\ \vdots \\ \tilde{u}_d \end{pmatrix}, \qquad \tilde{u}_i = \left( u_i^T, m_i \right)^T, \qquad (18)$$

where $\tilde{m}$ is an $\ell$-vector related to the mean estimate of $\tilde{y}$, $F$ is an $\ell \times rN$ matrix built entirely from the values of $\tilde{u}$, and $G$ is an $\ell \times (r+1)d$ matrix built entirely from the values of $\tilde{v}$. To solve the minimization problem stated by (16), the derivative (with respect to $\tilde{v}$ and $\tilde{u}$) should be zero, i.e.,

$$\nabla \phi = \begin{pmatrix} F^T F \tilde{v} - F^T (\tilde{y} - \tilde{m}) \\ G^T G \tilde{u} - G^T \tilde{y} \end{pmatrix} = 0. \qquad (19)$$

Obviously (19) is nonlinear, because $F$ is a function of $\tilde{u}$ and $G$ is a function of $\tilde{v}$. In theory, any appropriate nonlinear optimization method can be applied to solve it. However, the dimension is very high in practice, and [22] used the following algorithm to solve it:


- For a given $\tilde{u}$, we can build the matrix $F$ and the vector $\tilde{m}$. Then $\tilde{v}$ is updated by solving a least-squares problem,

$$\tilde{v} = F^{+} (\tilde{y} - \tilde{m}), \qquad (20)$$

where $F^{+}$ is the pseudo-inverse of $F$.

- For a given $\tilde{v}$, we can also build the matrix $G$. Then $\tilde{u}$ is updated as

$$\tilde{u} = G^{+} \tilde{y}, \qquad (21)$$

where $G^{+}$ is the pseudo-inverse of $G$.
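The two pseudo-inverse updates can also be written directly in terms of the $d \times N$ measurement matrix and an observation mask, which avoids assembling the large sparse matrices $F$ and $G$ explicitly. The sketch below follows the same alternation under that reformulation; all names are illustrative, and each half-step is the least-squares solve corresponding to (20) or (21).

```python
import numpy as np

def wiberg_als(Y, mask, r, n_iter=100, seed=0):
    """Rank-r factorization with a mean term for an incomplete d x N matrix Y.
    mask[i, j] is True where y_ij is observed."""
    d, N = Y.shape
    U = np.random.default_rng(seed).normal(size=(d, r))    # rows u_i^T
    m = np.zeros(d)                                         # per-row mean estimates m_i
    V = np.zeros((N, r))                                    # rows v_j^T
    for _ in range(n_iter):
        for j in range(N):                                  # update v_j, cf. (20)
            o = mask[:, j]
            V[j], *_ = np.linalg.lstsq(U[o], Y[o, j] - m[o], rcond=None)
        for i in range(d):                                  # update (u_i, m_i), cf. (21)
            o = mask[i, :]
            A = np.hstack([V[o], np.ones((o.sum(), 1))])
            sol, *_ = np.linalg.lstsq(A, Y[i, o], rcond=None)
            U[i], m[i] = sol[:r], sol[r]
    return U, m, V                                          # Y ~ m 1^T + U V^T on the observed entries
```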

5 PCA with Outliers

All the PCA algorithms mentioned before are based on the assumptions that data have not been spoiled

by outliers. In practice, real data often contain some outliers, and usually they are not easy to separate from the data set. In Section 2, we showed that traditional PCA constructs the rank-$k$ subspace approximation to zero-mean training data that is optimal in a least-squares sense. It is commonly known that least squares techniques are not robust, in the sense that outlying measurements can arbitrarily skew the solution away from the desired solution [11]. Overcoming this drawback of the original PCA is still an open research problem. Several methods have been proposed in the fields of statistics, neural networks, and computer engineering, but they all have certain limitations.

5.1 Robust PCA by Robustifying the Covariance Matrix

To cope with outliers, the most commonly used approaches in statistics [4][11][19] replace the standard estimate of the covariance matrix, $S$, with a robust estimator of the covariance matrix, $S^*$. This formulation weights the mean and the outer products which form the covariance matrix. Calculating the eigenvalues and eigenvectors of this robust covariance matrix gives principal components that are robust to sample outliers. The mean and the robust covariance matrix can be calculated as

$$\mu = \frac{\sum_{n=1}^{N} w_1(d_n^2)\, t_n}{\sum_{n=1}^{N} w_1(d_n^2)}, \qquad (22)$$

$$S^* = \frac{\sum_{n=1}^{N} w_2(d_n^2)\, (t_n - \mu)(t_n - \mu)^T}{\sum_{n=1}^{N} w_2(d_n^2) - 1}, \qquad (23)$$

where $w_1(d_n^2)$ and $w_2(d_n^2)$ are scalar weights, which are functions of the Mahalanobis distance

$$d_n^2 = (t_n - \mu)^T (S^*)^{-1} (t_n - \mu), \qquad (24)$$

and $S^*$ is iteratively estimated. Numerous possible weight functions have been proposed (e.g. Huber's weighting coefficients [11], or $w_2(d_n^2) = (w_1(d_n^2))^2$ [4]). These approaches, however, weight entire data samples and are not appropriate for cases in which only a few individual elements are corrupted by outliers. Another related approach would be to robustly estimate each element of the covariance matrix. This is not guaranteed to result in a positive definite matrix [4].
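A minimal sketch of the iteration behind (22)–(24) is given below. The specific weight function (a Huber-style downweighting of large Mahalanobis distances, with $w_1 = w_2$) is an illustrative choice rather than the particular weights of [4] or [11].

```python
import numpy as np

def robust_covariance(T, n_iter=20, c=3.0):
    """Iteratively reweighted estimates of the mean and covariance, eqs. (22)-(24).
    T is an (N, d) data matrix; c controls where downweighting starts."""
    mu, S = T.mean(axis=0), np.cov(T, rowvar=False)
    for _ in range(n_iter):
        diff = T - mu
        d2 = np.einsum('ij,ij->i', diff @ np.linalg.inv(S), diff)   # Mahalanobis distances, eq. (24)
        w = np.minimum(1.0, c / np.sqrt(d2 + 1e-12))                # downweight distant samples
        mu = (w[:, None] * T).sum(axis=0) / w.sum()                 # weighted mean, eq. (22)
        diff = T - mu
        S = (w[:, None] * diff).T @ diff / (w.sum() - 1.0)          # weighted covariance, eq. (23)
    return mu, S
```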

These methods, based on robust estimation of the full covariance matrix, are computationally imprac-

tical for high dimensional data such as images. Note that just computing the covariance matrix requires $O(Nd^2)$ operations. Also, in some practical applications it is difficult to gather sufficient training data to

guarantee that the covariance matrix is full rank.

5.2 Robust PCA by Projection Pursuit

Li and Chen [12] proposed a solution based on projection pursuit (PP). Dealing with high dimensional

data, PP searches for low dimensional projections that maximize (minimize) an objective function called

projection index. By working in the low dimensional projections, it manages to avoid the difficulty

caused by sparseness of the high-dimensional data.

Principal component analysis is actually a special PP procedure. Let $t$ be a $d$-dimensional random vector with covariance $\Sigma$, and let $F_a$ be the distribution function of $a^T t$, where $a$ is a $d$-vector. Denote the eigenvalues of $\Sigma$ by $\lambda_1, \dots, \lambda_d$. Recall that the first principal component is the projection of $t$ onto a certain direction; that is,

$$S(F_{a_1}) = \max_{\|a\| = 1} S(F_a) = \max_{\|a\| = 1} \left( a^T \Sigma a \right)^{1/2}, \qquad (25)$$

where $S(\cdot)$ denotes the scale of the projected distribution. It is well known that $S^2(F_{a_1}) = a_1^T \Sigma a_1$ is the largest eigenvalue $\lambda_1$ and that $a_1$ is the associated eigenvector. In the subsequent steps, each new direction is constrained to be orthogonal to all previous directions. For example, the second principal component $a_2^T t$ is determined by

$$S(F_{a_2}) = \max_{\|a\| = 1, \; a \perp a_1} S(F_a). \qquad (26)$$

When the measurement matrix contains outliers, [12] used a robust scale estimator instead of the standard scale estimator (the standard deviation) to deal with the outliers.
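The idea can be sketched with a crude random search over candidate directions, replacing the standard deviation in (25) by a robust scale such as the median absolute deviation (MAD); both the search strategy and the choice of MAD are illustrative simplifications of [12], not the authors' algorithm.

```python
import numpy as np

def mad(x):
    """Median absolute deviation: a robust scale estimator."""
    return 1.4826 * np.median(np.abs(x - np.median(x)))

def pp_robust_axes(T, q, n_candidates=2000, seed=0):
    """Projection-pursuit PCA: pick unit directions maximizing a robust projection
    index, each constrained to be orthogonal to the previous ones (cf. (25)-(26))."""
    rng = np.random.default_rng(seed)
    T = T - np.median(T, axis=0)                            # robust centering
    axes = []
    for _ in range(q):
        A = rng.normal(size=(n_candidates, T.shape[1]))     # candidate directions
        for w in axes:                                      # orthogonality to previous axes
            A -= np.outer(A @ w, w)
        A /= np.linalg.norm(A, axis=1, keepdims=True) + 1e-12
        scores = np.array([mad(T @ a) for a in A])          # robust scale of each projection
        axes.append(A[np.argmax(scores)])
    return np.array(axes)                                   # q x d matrix of robust axes
```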

Ammann proposed a similar idea for robust PCA using projection pursuit [1]. In his approach, the projection pursuit estimate of the eigenvectors of the covariance matrix can be expressed as follows. Determine the last principal axis $a_d$ by minimizing

$$\sum_{j=1}^{N} \rho\!\left( t_j^T a_d \right) \qquad (27)$$

subject to the constraint $\| a_d \| = 1$, where $t_j$ denotes the $j$-th measurement vector. Then, for $m = d-1, \dots, 1$, determine $a_m$ to minimize

$$\sum_{j=1}^{N} \rho\!\left( t_j^T a_m \right) \qquad (28)$$

subject to the constraints $\| a_m \| = 1$ and $a_m^T a_i = 0$, $m+1 \le i \le d$. Here $\rho(\cdot)$ is a robust loss function that bounds the influence of outliers. Ordinary eigenvectors are obtained by setting $\rho(x) = x^2$.

5.3 Robust PCA by Self-Organizing Neural Networks

The standard PCA solution is computed after all the data have been collected and the sample covariance matrix $S$ has been calculated; i.e., the approach works in batch mode. When a new sample $t'$ is added, we have to recalculate the corresponding new covariance matrix

$$S' = \frac{N S + t' t'^T}{N + 1}, \qquad (29)$$

and then all the computations for solving (2) are repeated by solving

$$S' w = \lambda w. \qquad (30)$$

Such an approach is not suitable for real applications in which data arrive incrementally, i.e., online.

The problem can be solved by a number of existing self-organizing rules for PCA [15][16][23]. The commonly used rules are listed as follows:

$$w(k+1) = w(k) + \eta(k) \left( t y - w(k)\, y^2 \right), \qquad (31)$$

$$w(k+1) = w(k) + \eta(k) \left( t y - \frac{w(k)}{w(k)^T w(k)}\, y^2 \right), \qquad (32)$$

$$w(k+1) = w(k) + \eta(k) \left[ y \left( t - \hat{u} \right) + \left( y - y' \right) t \right], \qquad (33)$$

where $y = w(k)^T t$, $\hat{u} = y\, w(k)$, $y' = w(k)^T \hat{u}$, and $\eta(k)$ is the learning rate, which decreases to zero as $k \to \infty$ while satisfying certain conditions, e.g.,

$$\sum_k \eta(k) = \infty, \qquad \sum_k \eta(k)^p < \infty \;\text{ for some } p > 1. \qquad (34)$$

Each of the three rules converges to the principal component vector $w$ almost surely under some mild conditions, which are studied in detail in [15][16][23]. By regarding $w(k)$ as the weight vector (i.e., the vector of synapses) of a linear neuron with output $y = w(k)^T t$, all three rules can be considered modifications of the well-known Hebbian rule

$$w(k+1) = w(k) + \eta(k)\, t y \qquad (35)$$

for self-organizing the synapses of a neuron.
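Rule (31), Oja's rule, is particularly easy to implement online. The following is a minimal sketch for a stream of samples; the learning-rate schedule is an illustrative choice satisfying (34).

```python
import numpy as np

def oja_first_axis(samples, d, seed=0):
    """Online estimate of the first principal axis using Oja's rule, eq. (31)."""
    w = np.random.default_rng(seed).normal(size=d)
    w /= np.linalg.norm(w)
    for k, t in enumerate(samples, start=1):
        eta = 1.0 / (100.0 + k)               # decreasing learning rate, cf. (34)
        y = w @ t                             # neuron output y = w(k)^T t
        w = w + eta * (t * y - w * y ** 2)    # Oja update, eq. (31)
    return w / np.linalg.norm(w)
```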

From the viewpoint of statistical physics, all these rules (31)(32)(33) are connected to certain energy functions. For example, rule (33) is an adaptive rule for minimizing the following energy function in a gradient descent manner:

$$E_{LS}(W) = \sum_{n=1}^{N} e_{LS}(r_n) = \sum_{n=1}^{N} \| t_n - W x_n \|^2 = \sum_{n=1}^{N} \| t_n - W W^T t_n \|^2 = \sum_{n=1}^{N} \sum_{i=1}^{d} \left( t_{in} - \sum_{j=1}^{k} w_{ij}\, y_{jn} \right)^2, \qquad (36)$$

where $x_n = W^T t_n$ are the linear coefficients obtained by projecting the training data onto the principal subspace, $y_{jn} = \sum_{i=1}^{d} w_{ij} t_{in}$, $r_n = t_n - W W^T t_n$ is the reconstruction error vector, and $e_{LS}(r_n) = r_n^T r_n$ is the reconstruction error of $t_n$.

In case of outliers, Xu and Yuille [13] have proposed an algorithm that generalizes the energy function (36) by introducing additional binary variables that are zero when a data sample is considered an outlier. They minimize

$$E_{XY}(W, V) = \sum_{n=1}^{N} \left[ V_n \| t_n - W W^T t_n \|^2 + \theta (1 - V_n) \right] = \sum_{n=1}^{N} \left[ V_n \sum_{i=1}^{d} \left( t_{in} - \sum_{j=1}^{k} w_{ij}\, y_{jn} \right)^2 + \theta (1 - V_n) \right], \qquad (37)$$

where each $V_n$ in $V = [V_1, V_2, \dots, V_N]$ is a binary random variable. If $V_n = 1$, the sample $t_n$ is taken into consideration; otherwise it is equivalent to discarding $t_n$ as an outlier. The second term in (37) is a penalty term, or prior, that discourages the trivial solution where all $V_n$ are zero. Given $W$, if the energy $e_{LS}(r_n) = \| t_n - W W^T t_n \|^2$ is smaller than a threshold $\theta$, the algorithm prefers to set $V_n = 1$, considering the sample $t_n$ an inlier, and sets $V_n = 0$ if the energy is greater than or equal to $\theta$. Minimization of (37) involves a combination of discrete and continuous optimization problems, and Xu and Yuille [13] derive


a mean field approximation to the problem which, after marginalizing the binary variables, can be solved by minimizing

$$E_{XY}(W) = -\sum_{n=1}^{N} \frac{1}{\beta}\, F_{XY}(r_n, \beta, \theta), \qquad (38)$$

where $F_{XY}(r_n, \beta, \theta) = \log\!\left( 1 + e^{-\beta \left( e_{LS}(r_n) - \theta \right)} \right)$ is a function that is related to robust statistical estimators [2]. The parameter $\beta$ can be varied as an annealing parameter in an attempt to avoid local minima.

Based on this reformulation of the energy function, we can obtain corresponding robust versions of the adaptive self-organizing rules (31)(32)(33). For example, rule (32) changes into

$$w(k+1) = w(k) + \eta(k)\, \frac{1}{1 + e^{\beta \left( e_{LS}(r) - \theta \right)}} \left( t y - \frac{w(k)}{w(k)^T w(k)}\, y^2 \right), \qquad (39)$$

and rule (33) changes into

$$w(k+1) = w(k) + \eta(k)\, \frac{1}{1 + e^{\beta \left( e_{LS}(r) - \theta \right)}} \left[ y \left( t - \hat{u} \right) + \left( y - y' \right) t \right]. \qquad (40)$$

Finally, the converged vector $w$ is taken as the resulting principal component vector, which avoids the effects of outliers. In addition, a byproduct can easily be obtained:

$$V_n = \begin{cases} 1, & e_{LS}(r_n) < \theta, \\ 0, & \text{otherwise}, \end{cases} \qquad (41)$$

which indicates whether $t_n$ is an outlier ($V_n = 0$) or not ($V_n = 1$).
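In code, the robust rules differ from their non-robust counterparts only by the soft inlier weight. The sketch below gates the update of rule (32) as in (39); `beta` and the threshold `theta` are illustrative values, and the learning-rate schedule is again a simple decreasing sequence.

```python
import numpy as np

def robust_oja(samples, d, beta=5.0, theta=1.0, seed=0):
    """Online robust PCA update in the spirit of eq. (39): the step is scaled by a
    soft inlier weight computed from the reconstruction error of the current sample."""
    w = np.random.default_rng(seed).normal(size=d)
    w /= np.linalg.norm(w)
    for k, t in enumerate(samples, start=1):
        eta = 1.0 / (100.0 + k)                              # decreasing learning rate
        y = w @ t
        r = t - w * (y / (w @ w))                            # reconstruction error vector
        e = r @ r                                            # e_LS(r_n)
        g = 1.0 / (1.0 + np.exp(beta * (e - theta)))         # soft inlier weight
        w = w + eta * g * (t * y - w / (w @ w) * y ** 2)     # gated update, cf. (39)
    return w / np.linalg.norm(w)
```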

5.4 Robust PCA by Weighted SVD

The approaches to robust PCA by neural networks are of limited use in some practical problems because they reject entire data samples as outliers. In some applications, outliers typically correspond

to small groups of points in the measurement vector and we seek a method that is robust to this type

of outlier yet does not reject the good points in the data samples. Gabriel and Zamir [8] give a partial

solution. They propose a weighted Singular Value Decomposition (SVD) technique that can be used to

construct the principal subspace. In their approach, they minimize

$$E_{GZ}(W, X) = \sum_{n=1}^{N} \sum_{i=1}^{d} m_{in} \left( t_{in} - (w^i)^T x_n \right)^2, \qquad (42)$$

where $w^i$ is a column vector containing the elements of the $i$-th row of $W$. This effectively puts a weight $m_{in}$ on every point in the training data. In related work, Greenacre [9] gives a partial solution

to the problem of factorizing matrices with known weighting data by introducing Generalized Singular


Value Decomposition (GSVD). This approach applies when the known weights in (42) are separable;

that is, one weight for each row and one for each column: $m_{in} = m_i m_n$. The basic idea is to first whiten

the data using the weights, perform SVD, and then un-whiten the bases. The benefit of this approach is

that it takes advantage of efficient implementations of the SVD algorithm. The disadvantages are that the

weights must somehow already be known and that individual point outliers are not allowed.

In the general robust case, where the weights are unknown and there may be a different weight at every point in every training sample, there is no such solution that leverages the SVD [8][9], and one must solve the minimization problem with "criss-cross regressions", which involve iteratively computing dyadic (rank 1) fits using weighted least squares. The approach alternates between solving for $w^i$ or $x_n$ while the other is fixed; this is similar to the EM approach we discussed before, but without a probabilistic interpretation.
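A minimal sketch of this alternation for the general weighted problem (42) is shown below: with $W$ fixed, each column of $X$ solves a small weighted least-squares problem, and with $X$ fixed, each row of $W$ does; the small ridge term only guards against singular systems. All names are illustrative.

```python
import numpy as np

def weighted_factorization(T, M, k, n_iter=50, seed=0):
    """Alternating weighted least squares for eq. (42):
    minimize sum_{i,n} m_in (t_in - (w^i)^T x_n)^2, with T and M of shape d x N."""
    d, N = T.shape
    W = np.random.default_rng(seed).normal(size=(d, k))
    X = np.zeros((k, N))
    for _ in range(n_iter):
        for n in range(N):                    # solve for column x_n with W fixed
            A = W * M[:, n, None]             # rows of W scaled by the weights of sample n
            X[:, n] = np.linalg.solve(W.T @ A + 1e-9 * np.eye(k), A.T @ T[:, n])
        for i in range(d):                    # solve for row w^i with X fixed
            B = X * M[i]                      # columns of X scaled by the weights of row i
            W[i] = np.linalg.solve(X @ B.T + 1e-9 * np.eye(k), B @ T[i])
    return W, X
```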

In this spirit, Gabriel and Odoroff [8] note how the quadratic formulation in (36) is not robust to outliers and propose making the rank-1 fitting process in (42) robust. They propose a number of methods to make the criss-cross regressions robust, but they apply the approach to very low dimensional data, and their optimization methods do not scale well to very high dimensional data such as images. In related work, Croux and Filzmoser [5] use a similar idea to construct a robust matrix factorization based on a weighted $L_1$ norm.

5.5 Torre and Black's Algorithm

5.5 Torre and Black’s Algorithm

In the computer vision field, PCA is a popular technique for parameterizing shape, appearance, and

motion [3][20][14]. Learned PCA representations have proven useful for solving problems such as face

and object recognition, tracking, detection, and background modeling [20][14]. Typically, the training

data for PCA is pre-processed in some way (e.g. faces are aligned [14]) or is generated by some other

vision algorithm (e.g. optical flow is computed from training data [3]). As automated learning methods

are applied to more realistic problems, and the amount of training data increases, it becomes impractical

to manually verify that all the data is good. In general, training data may contain undesirable artifacts

due to occlusion (e.g. a hand in front of a face), illumination (e.g. specular reflections), image noise

(e.g. from scanning archival data), or errors from the underlying data generation method (e.g. incorrect

optical flow vectors). We view these artifacts as statistical outliers.

Due to the high dimensionality of the image data, we cannot rely on the calculation of a robust covariance

matrix to get the principal components. The projection based approach also suffers from the high com-

putational cost. The approach of Xu and Yuille described in the previous section suffers from three main

problems: First, a single bad pixel value can make an image lie far enough from the subspace that the


entire sample is treated as an outlier (i.e. $V_n = 0$) and has no influence on the estimate of $W$. Second,

Xu and Yuille use a least squares projection of the data $t_n$ for computing the distance to the subspace; that is, the coefficients that reconstruct the data $t_n$ are $x_n = W^T t_n$. These reconstruction coefficients

can be arbitrarily biased by an outlier. Finally, a binary outlier process is used which either completely

rejects or includes a sample.

To make robust PCA work efficiently for image data, Torre and Black [7] proposed a more general analog outlier process that has computational advantages and provides a connection to robust M-estimation. To address these issues they reformulate (37) as

$$E(W, X, \mu, L) = \sum_{n=1}^{N} \sum_{i=1}^{d} \left[ L_{in}\, \frac{\hat{e}_{in}^2}{\sigma_i^2} + P(L_{in}) \right], \qquad (43)$$

where $0 \le L_{in} \le 1$ is now an analog outlier process that depends on both images and pixel locations, and $P(L_{in})$ is a penalty function. The error is $\hat{e}_{in} = t_{in} - \mu_i - \sum_{j=1}^{k} w_{ij} x_{jn}$, and $\sigma = [\sigma_1, \sigma_2, \dots, \sigma_d]^T$ specifies a scale parameter for each of the $d$ pixel locations.

Observe that they explicitly solve for the mean $\mu$ in the estimation process. In the least-squares formulation the mean can be computed in closed form and subtracted from each column of the data matrix $Y$. In the robust case, outliers are defined with respect to the error in the reconstructed images, which includes the mean. The mean can no longer be computed by a simple averaging procedure; instead, it is estimated (robustly) analogously to the other bases. Also, recall that PCA assumes an isotropic noise model. In the formulation here the noise is allowed to vary for every row (pixel) of the data ($e_{in} \sim \mathcal{N}(0, \sigma_i^2)$).

Exploiting the relationship between outlier processes and robust statistics [2], minimizing (43) is equivalent to minimizing the following robust energy function

$$E(W, X, \mu, \sigma) = \sum_{n=1}^{N} e_{\rho}(t_n - \mu - W x_n, \sigma) = \sum_{n=1}^{N} \sum_{i=1}^{d} \rho\!\left( t_{in} - \mu_i - \sum_{j=1}^{k} w_{ij} x_{jn}, \; \sigma_i \right) \qquad (44)$$

for a particular class of robust $\rho$-functions [2]. The robust magnitude of a vector $x$ is defined as the sum of the robust error values for each component, that is,

$$e_{\rho}(x, \sigma) = \sum_{i=1}^{d} \rho(x_i, \sigma_i). \qquad (45)$$

The work in [7] uses the Geman-McClure error function, given by

$$\rho(x, \sigma_i) = \frac{x^2}{x^2 + \sigma_i^2}, \qquad (46)$$


where $\sigma_i$ is a scale parameter that controls the convexity of the robust function and determines the inlier/outlier separation. Unlike some other $\rho$-functions, (46) is twice differentiable, which is useful for

optimization methods based on gradient descent.

While many optimization methods exist, it is useful to formulate the minimization of equation (44) as a weighted least squares problem and solve it using iteratively reweighted least squares (IRLS). Define the residual error in matrix notation as

$$\hat{E} = Y - \mu\, 1_N^T - W X. \qquad (47)$$

Then, for a given $\sigma$, a matrix $M \in \mathbb{R}^{d \times N}$ can be defined such that it contains positive weights for each pixel and each image. $M$ is calculated at each iteration as a function of the previous residuals $\hat{e}_{in} = t_{in} - \mu_i - \sum_{j=1}^{k} w_{ij} x_{jn}$, and it is related to the influence of the pixels on the solution. Each element $m_{in}$ of $M$ will be equal to

$$m_{in} = \frac{\psi(\hat{e}_{in}, \sigma_i)}{\hat{e}_{in}}, \qquad (48)$$

where

$$\psi(\hat{e}_{in}, \sigma_i) = \frac{\partial \rho(\hat{e}_{in}, \sigma_i)}{\partial \hat{e}_{in}} = \frac{2\, \hat{e}_{in}\, \sigma_i^2}{\left( \hat{e}_{in}^2 + \sigma_i^2 \right)^2} \qquad (49)$$

for the Geman-McClure $\rho$-function. For an iteration of IRLS, (44) can be transformed into a weighted least-squares problem and rewritten as

$$E_{wls}(W, X, \mu, M) = \sum_{n=1}^{N} (t_n - \mu - W x_n)^T M_n (t_n - \mu - W x_n) \qquad (50)$$

$$= \sum_{i=1}^{d} \left( t^i - \mu_i 1_N - X^T w^i \right)^T M^i \left( t^i - \mu_i 1_N - X^T w^i \right), \qquad (51)$$

where the $M_n \in \mathbb{R}^{d \times d} = \mathrm{diag}(m_n)$ are diagonal matrices containing the positive weighting coefficients for the data sample $t_n$ (recall that $m_n$ is the $n$-th column of $M$), and the $M^i \in \mathbb{R}^{N \times N} = \mathrm{diag}(m^i)$ are diagonal matrices containing the weighting factors for the $i$-th pixel over the whole training set. Note the symmetry of (51), where, recall, $t_n$ represents the $n$-th column of the data matrix $Y$ and $t^i$ is a column vector which contains the $i$-th row. Observe that (51) has non-unique solutions since, for any invertible linear transformation matrix $R$, the pair $(W R, R^{-1} X)$ would give the same solution (i.e. the reconstruction from the subspace would be the same). This ambiguity can be resolved by imposing the constraint of orthogonality between the bases, $W^T W = I$ (e.g. with Gram-Schmidt orthogonalization). In order to find a solution to $E_{wls}(W, X, \mu, M)$, we differentiate (51) w.r.t. $x_n$ and $\mu$, and differentiate (51) w.r.t. $w^i$, to find necessary, but not sufficient, conditions for the minimum. From these conditions, the following coupled system of equations is obtained:


$$\mu = \left( \sum_{n=1}^{N} M_n \right)^{-1} \sum_{n=1}^{N} M_n (t_n - W x_n), \qquad (52)$$

$$\left( W^T M_n W \right) x_n = W^T M_n (t_n - \mu), \qquad \forall\, n = 1, \dots, N, \qquad (53)$$

$$\left( X M^i X^T \right) w^i = X M^i \left( t^i - \mu_i 1_N \right), \qquad \forall\, i = 1, \dots, d. \qquad (54)$$

Given these updates of the parameters, an approximate algorithm for minimizing equation (44) can employ a two-step method that minimizes $E_{wls}(W, X, \mu)$ using alternated least squares (ALS).

Summarizing, the whole IRLS procedure works as follows,

1. First, an initial basis $W^{(0)}$ and a set of coefficients $X^{(0)}$ are given; then the initial error $\hat{E}^{(0)}$ can be calculated by (47).

2. The weighting matrix $M^{(1)}$ is computed by (48), and it is used to successively alternate between minimizing with respect to $x_n^{(1)}$, $(w^i)^{(1)}$ ($\forall\, n, i$) and $\mu^{(1)}$ in closed form using equations (52)(53)(54).

3. Once $x_n^{(1)}$, $(w^i)^{(1)}$ and $\mu^{(1)}$ have converged, recompute the error $\hat{E}^{(1)}$, calculate the weighting matrix $M^{(2)}$, and proceed in the same manner as in step 2 until convergence of the algorithm.

It is worth noting that there are several possible ways to update the parameters more efficiently than through these closed-form solutions.
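To make the procedure concrete, the following is a minimal sketch of the IRLS loop, with the per-pixel scales $\sigma_i$ supplied and held fixed (the full algorithm in [7] can also adapt them); the weights implement (48)–(49) for the Geman-McClure $\rho$-function, and the inner updates follow (52)–(54). Function and variable names are illustrative.

```python
import numpy as np

def robust_pca_irls(Y, k, sigma, n_outer=10, n_inner=5, seed=0):
    """Sketch of robust PCA by IRLS. Y is d x N; sigma is a length-d vector of
    per-pixel scales (held fixed here for brevity)."""
    d, N = Y.shape
    W = np.random.default_rng(seed).normal(size=(d, k))
    X = np.zeros((k, N))
    mu = Y.mean(axis=1)
    s2 = sigma[:, None] ** 2
    for _ in range(n_outer):
        E = Y - mu[:, None] - W @ X                            # residuals, eq. (47)
        M = 2.0 * s2 / (E ** 2 + s2) ** 2                      # weights m_in = psi(e)/e, eqs. (48)-(49)
        for _ in range(n_inner):                               # alternated weighted least squares
            mu = (M * (Y - W @ X)).sum(axis=1) / M.sum(axis=1) # eq. (52)
            D = Y - mu[:, None]
            for i in range(d):                                 # eq. (54): one row of W at a time
                B = X * M[i]
                W[i] = np.linalg.solve(X @ B.T + 1e-9 * np.eye(k), B @ D[i])
            W, _ = np.linalg.qr(W)                             # keep W^T W = I (Gram-Schmidt)
            for n in range(N):                                 # eq. (53): one column of X at a time
                A = W * M[:, n, None]
                X[:, n] = np.linalg.solve(W.T @ A + 1e-9 * np.eye(k), A.T @ D[:, n])
    return W, X, mu
```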

6 Experimental Results

Experiments were performed to test some of the algorithms discussed in the previous sections. In Section

6.1, we use 2 dimensional and 40 dimensional data separately to show the efficiency of the EM algorithm.

In Section 6.2, we use 40 dimensional data in which 20% are missing and show how Wiberg’s method

works on such incomplete data. The same kind of 40-dimensional data are used in the experiment of

Section 6.3, but some of them are corrupted by outliers. For such data, we compare the results of robust

PCA with those of standard PCA. Another experiment with real images is also provided in that section.

6.1 Test the EM algorithm

First we use 2D synthetic data with Gaussian distribution to test the EM algorithm introduced in Section

3.2 (Figure 1). The data and the initial principal axis are shown in Figure 1(a). The first and second iterations of the principal axis are shown in Figures 1(b) and 1(c). Comparing with the results of standard PCA



Figure 1: The EM based PCA for 2D data. (a) The data and initial value of the principal axis; (b) The first iteration; (c) The second iteration; (d) The data and the principal axis found by standard PCA.

(Figure 1(d)), we find that the EM algorithm converges to the correct solution in only two steps, which is

very efficient.

In the second example, 10 data vectors were used for the PCA algorithm. Each vector contains 40-dimensional data sampled from a shifted sinusoid curve. The whole data set is plotted

in Figure 2(a), in which each sinusoid curve is related to one data vector. Figure 2(b)(c) show the

results of standard PCA. The two principal axes found by standard PCA are shown in Figure 2(b). The

reconstructed signals by those two principal components are shown in Figure 2(c). Figure 2(d)(e) give the

principal axes and reconstructed signals found in the first iteration of the EM algorithm. Figure 2(f)(g)

show the principal axes and reconstructed signals found in the fourth iteration of the EM algorithm.

6.2 PCA with missing data

Here we also use a set of vectors formed by 10 shifted harmonic sinusoid functions. After randomly removing 20% of the data points, the resulting data set is shown in Figure 3(a). Obviously, standard PCA cannot deal with such data because some of the pixels are unknown. We use Wiberg's algorithm to

extract the two principal axes and reconstruct the data by those two principal axes. Figure 3(b)(c) show


the results in the third iteration of the algorithm. Figure 3(d)(e) show the results in the fifth iteration. Note

that the functions representing the estimated principal axes are getting smoother after every iteration. Fi-

nally in the seventh iteration, we obtain the very smooth principal axes and a perfect reconstruction of

the input vectors, which are shown in Figure 3(f)(g).

6.3 PCA with outliers

Although several robust PCA methods were described in Section 5, we use Torre and Black's algorithm, introduced in Section 5.5, to show that robust PCA performs better than traditional PCA in the presence of outliers.

In the first experiment, we still use the data sampled from sinusoid functions. But 10% of the el-

ements are contaminated with outliers (Figure 4(a)). Figure 4(b)(c) depict the two principal axes and

the reconstructed signals by standard PCA. Figure 4(d)(e) depict the two principal axes and the recon-

structed signals by robust PCA after 30 iterations. Obviously the robust PCA gives much more reliable

reconstruction than standard PCA.

In the second experiment, we use a collection of images gathered from a static camera over the course of a day as the training set for PCA (from 'http://web.salleurl.edu/~ftorre/'). There are changes in the illumination of the static background, and 45% of the images contain people in different locations. Our purpose is to build a model of the background using PCA. We treat the people in the images as outliers and use PCA to extract the background model. The left column of Figure 5 shows some examples of the training images. The middle column shows the reconstruction of each illustrated training image using the standard PCA basis vectors. The right column shows the reconstruction obtained with the same number of robust PCA basis vectors. We find that robust PCA is able to capture the illumination changes while ignoring the people. Once we obtain the desired background model, which accounts for illumination variation, we can use it in applications such as person detection and tracking.

References

[1] L. P. Ammann. Robust singular value decompositions: A new approach to projection pursuit. J. of

Amer. Stat. Assoc., 88(422):505–514, 1993.

[2] M. J. Black and A. Rangarajan. On the unification of line process, outlier detection, and robust

statistics with applications in early vision. International J. of Computer Vision, 25(19):57–92,

1996.


[3] M. J. Black, Y. Yacoob, A. Jepson, and D. J. Fleet. Learning parameterized models of image motion.

In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, San Juan, Puerto Rico, volume I,

pages 561– 567, 1997.

[4] N. A. Campbell. Robust procedures in multivariate analysis I : Robust covariance estimation.

Applied Statistics, 29(3):231–237, 1980.

[5] C. Croux and P. Filzmoser. Robust factorization of a data matrix. In Proc. in Computational Statistics (COMPSTAT), pages 245–249, 1998.

[6] Y. Dodge. Analysis of Experiments with Missing Data. Wiley, 1985.

[7] F. de la Torre and M. J. Black. Robust principal component analysis for computer vision. In 8th International Conference on Computer Vision, volume I, pages 362–369, Vancouver, Canada, July 2001.

[8] K.R. Gabriel and S. Zamir. Lower rank approximation of matrices by least squares with any choice

of weights. Technometrics, 21:489–498, 1979.

[9] M. J. Greenacre. Theory and Applications of Correspondence Analysis. Academic Press : London,

1984.

[10] H. Hotelling. Analysis of a complex of statistical variables into principal components. Journal of

Educational Psychology, 24:417–441, 1933.

[11] P. J. Huber. Robust Statistics. New York:Wiley, first edition, 1981.

[12] G. Li and Z. Chen. Projection-pursuit approach to robust dispersion matrices and principal compo-

nents: Primary theory and monte carlo. J. of Amer. Stat. Assoc., 80(391):759–766, 1985.

[13] L. Xu and A. L. Yuille. Robust principal component analysis by self-organizing rules based on

statistical physics approach. IEEE Trans. Neural Networks, 6(1):131–143, 1995.

[14] B. Moghaddam and A. Pentland. Probabilistic visual learning for object representation. IEEE

Trans. Pattern Anal. Machine Intell., 19(7):137–143, 1997.

[15] E. Oja. A simplified neuron model as a principal component analyzer. J. Math. Biol., 16:267–273,

1982.


[16] E. Oja and J. Karhunen. On stochastic approximation of eigenvectors and eigenvalues of the ex-

pectation of a random matrix. J. Math. Anal. Appl., 106:69–84, 1985.

[17] K. Pearson. On lines and planes of closest fit to systems of points in space. The London, Edinburgh

and Dublin Philosophical Magazine and Journal of Sciences, 6:559–572, 1901.

[18] S. Roweis. EM algorithms for PCA and SPCA. In Neural Information Processing Systems, pages

626–632, 1997.

[19] F. H. Ruymagaart. A robust principal component analysis. Journal of Multivariate Analysis,

11:485–497, 1981.

[20] T. F. Cootes, G. J. Edwards, and C. J. Taylor. Active appearance models. In Proc. European Conf.

on Computer Vision, volume I, pages 484– 498, 1998.

[21] M. E. Tipping and C. M. Bishop. Probabilistic principal component analysis. Technical Report

NCRG/97/010, Microsoft Research, September 1999.

[22] T. Wiberg. Computation of principal components when data is missing. In Proc. Second Symp.

Computational Statistics, pages 229–236, 1976.

[23] L. Xu. Least mean square error reconstruction for self-organizing neural nets. Neural Networks,

6:627–648, 1993.



Figure 2: The EM based PCA for 40 dimensional data. (a) Input data; (b) Two principal axes found by the standard PCA; (c) The reconstructed signals by the standard PCA; (d) Two principal axes found in the first iteration by EM based PCA; (e) The reconstructed signals in the first iteration; (f) Two principal axes found in the fourth iteration; (g) The reconstructed signals in the fourth iteration.



Figure 3: PCA for the incomplete data set. (a) Input data, with some pixels missing; (b) Two principal axes found in the third iteration; (c) The reconstructed signals in the third iteration; (d) Two principal axes found in the fifth iteration; (e) The reconstructed signals in the fifth iteration; (f) Two principal axes found in the seventh iteration; (g) The reconstructed signals in the seventh iteration.



Figure 4: Robust PCA. (a) Input data; (b) Two principal axes found by standard PCA; (c) The reconstructed signals by standard PCA; (d) Two principal axes found by robust PCA; (e) The reconstructed signals by robust PCA.



Figure 5: Robust PCA for the image data. (a) Some of the original data; (b) PCA reconstruction; (c) Robust PCA reconstruction.
