TRANSCRIPT
Machine Perception of Human Behavior with CA F. De la Torre/J. Cohn PAVIS school on CV and PR 7
Component Analysis for PR & HS
• Computer Vision & Image Processing
– Structure from motion.
– Spectral graph methods for segmentation.
– Appearance and shape models.
– Fundamental matrix estimation and calibration.
– Compression.
– Classification.
– Dimensionality reduction and visualization.
• Signal Processing
– Spectral estimation, system identification (e.g. Kalman filter), sensor array processing (e.g. cocktail-party problem, echo cancellation), blind source separation, etc.
• Computer Graphics
– Compression (BRDF), synthesis, etc.
• Speech, bioinformatics, combinatorial problems.
Examples from the list above highlighted on the following slides:
• Structure from motion
• Spectral graph methods for segmentation
• Appearance and shape models
• Dimensionality reduction and visualization
• The cocktail-party problem
Independent Component Analysis (ICA)
[Diagram: three sound sources are mixed into three observed mixtures; ICA recovers three separated outputs.]
Why CA for PR & HS?
• Learn from high-dimensional data and few samples.
– Useful for dimensionality reduction, especially when functions are smooth.
• Efficient methods, O(dn²) with d ≪ n. (Everitt, 1984)
• Easy to formulate, solve, and extend:
– Non-linearities (kernel methods) (Scholkopf & Smola, 2002; Shawe-Taylor & Cristianini, 2004)
– Probabilistic (latent variable models)
– Multi-factorial (tensors) (Paatero & Tapper, 1994; O'Leary & Peleg, 1983; Vasilescu & Terzopoulos, 2002; Vasilescu & Terzopoulos, 2003)
– Exponential family PCA (Gordon, 2002; Collins et al., 2001)
• Natural geometric interpretation
Are CA methods popular/useful/used?
• Still work to do:
– Results 1 - 10 of about 83,000 for "Spanish crisis"
– Results 1 - 10 of about 287,000,000 for "Britney Spears"
• About 28% of CVPR-07 papers use CA.
• Google:
– Results 1 - 10 of about 1,870,000 for "principal component analysis"
– Results 1 - 10 of about 506,000 for "independent component analysis"
– Results 1 - 10 of about 273,000 for "linear discriminant analysis"
– Results 1 - 10 of about 46,100 for "negative matrix factorization"
– Results 1 - 10 of about 491,000 for "kernel methods"
Outline
• Introduction (15 min)
• Generative models (40 min) – PCA, k-means, spectral clustering, NMF, ICA, MDS
• Discriminative models (40 min) – LDA, SVM, OCA, CCA
• Standard extensions of linear models (30 min) – kernel methods, latent variable models, tensor factorization
• Unified view (20 min)

Generative models (PCA, k-means, spectral clustering, NMF, ICA, MDS)
Generative models
• Principal Component Analysis/Singular Value Decomposition
• Non-Negative Matrix Factorization
• Independent Component Analysis
• K-means and spectral clustering
• Multi-dimensional Scaling
D ≈ BC
Principal Component Analysis (PCA)
• PCA finds the directions of maximum variation of the data
• PCA decorrelates the original variables
(Pearson, 1901; Hotelling, 1933;Mardia et al., 1979; Jolliffe, 1986; Diamantaras, 1996)
[Figure: 2-D data (x₁, x₂) with correlated covariance Σ = [0.8 0.75; 0.75 0.9] is decorrelated by PCA to Σ = [2.1 0; 0 1.1].]
PCA
d ≈ µ + c₁b₁ + c₂b₂ + … + c_k b_k
D = [d₁ d₂ … dₙ] ≈ BC + µ1ₙᵀ
D ∈ ℝ^{d×n} (d = pixels, n = images), B ∈ ℝ^{d×k}, C ∈ ℝ^{k×n}, µ ∈ ℝ^{d×1}
• Assuming zero-mean data, the basis B that preserves the maximum variation of the signal is given by the eigenvectors of DDᵀ:
DDᵀB = BΛ
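As a concrete sketch of this eigenvector view of PCA in numpy (synthetic random data standing in for images; the dimensions d, n, k are hypothetical choices of this example): B comes from the eigenvectors of DDᵀ, the coefficients are C = BᵀD, and D is reconstructed at rank k.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, k = 5, 200, 2
# Synthetic data with most variance in the first two coordinates
D = rng.normal(size=(d, n)) * np.array([3.0, 2.0, 0.5, 0.1, 0.1])[:, None]

mu = D.mean(axis=1, keepdims=True)
Dc = D - mu                                   # zero-mean data

# Eigenvectors of D D^T give the basis B; eigenvalues give the variances
eigvals, eigvecs = np.linalg.eigh(Dc @ Dc.T)
order = np.argsort(eigvals)[::-1]             # sort by decreasing variance
B = eigvecs[:, order[:k]]                     # d x k basis
C = B.T @ Dc                                  # k x n coefficients

# Rank-k reconstruction D ≈ B C + mu 1^T
D_hat = B @ C + mu
```

Note that the rows of C are decorrelated: C Cᵀ is diagonal (the top-k eigenvalues), which is the "PCA decorrelates the original variables" property from the previous slide.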
Snap-shot method & SVD
• If d ≫ n (e.g., 100×100-pixel images vs. 300 samples), do not form DDᵀ.
• DDᵀ and DᵀD have the same non-zero eigenvalues (energy) and related eigenvectors (related through D).
• B is a linear combination of the data: B = Dα. (Sirovich, 1987)
• [α, Λ] = eig(DᵀD), then B = DαΛ^{−1/2}:
DᵀDα = αΛ ⇒ DDᵀ(Dα) = (Dα)Λ, i.e. DDᵀB = BΛ
• SVD factorizes the data matrix D as D = UΣVᵀ, with UᵀU = I, VᵀV = I and Σ diagonal. (Beltrami, 1873; Schmidt, 1907; Golub & Van Loan, 1989)
• PCA–SVD connection (D ≈ BC with BᵀB = I, CCᵀ = Λ):
DDᵀ = UΛUᵀ, DᵀD = VΛVᵀ
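A numpy sketch of the snapshot trick under the d ≫ n assumption (random data standing in for vectorized images): eigendecompose the small n×n matrix DᵀD and map back with B = DαΛ^{−1/2}.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 1000, 30                      # many pixels, few images (d >> n)
D = rng.normal(size=(d, n))
D -= D.mean(axis=1, keepdims=True)   # zero-mean data (drops rank by one)

# Snapshot trick: eigendecompose the small n x n matrix D^T D
L, alpha = np.linalg.eigh(D.T @ D)
keep = L > 1e-8 * L.max()            # discard numerically zero eigenvalues
L, alpha = L[keep], alpha[:, keep]

# Map back: B = D alpha L^{-1/2} are unit-norm eigenvectors of D D^T
B = D @ alpha / np.sqrt(L)
```

The d×d matrix DDᵀ (here 1000×1000) is never formed; only the 30×30 matrix DᵀD is decomposed.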
Error function for PCA
• PCA minimizes the following function (Baldi & Hornik, 1989):
E(B, C) = Σᵢ₌₁ⁿ ||dᵢ − Bcᵢ||²₂ = ||D − BC||²_F
• The solution is not unique: BC = (BR)(R⁻¹C) for any invertible R ∈ ℝ^{k×k}. (De la Torre, 2012)
(Eckart & Young, 1936; Gabriel & Zamir, 1979; Baldi & Hornik, 1989; Shum et al., 1995; De la Torre & Black, 2003a)
PCA/SVD in Computer Vision
• PCA/SVD has been applied to:
– Recognition (eigenfaces:Turk & Pentland, 1991; Sirovich & Kirby, 1987; Leonardis & Bischof, 2000; Gong et al., 2000; McKenna et al., 1997a)
– Parameterized motion models (Yacoob & Black, 1999; Black et al., 2000; Black, 1999; Black & Jepson, 1998)
– Appearance/shape models (Cootes & Taylor, 2001; Cootes et al., 1998; Pentland et al., 1994; Jones & Poggio, 1998; La Cascia & Sclaroff, 1999; Black & Jepson, 1998; Blanz & Vetter, 1999; Cootes et al., 1995; McKenna et al., 1997; de la Torre et al., 1998b)
– Dynamic appearance models (Soatto et al., 2001; Rao, 1997; Orriols & Binefa, 2001; Gong et al., 2000)
– Structure from Motion (Tomasi & Kanade, 1992; Bregler et al., 2000; Sturm & Triggs, 1996; Brand, 2001)
– Illumination based reconstruction (Hayakawa, 1994)
– Visual servoing (Murase & Nayar, 1995; Murase & Nayar, 1994)
– Visual correspondence (Zhang et al., 1995; Jones & Malik, 1992)
– Camera motion estimation (Hartley, 1992; Hartley & Zisserman, 2000)
– Image watermarking (Liu & Tan, 2000)
– Signal processing (Moonen & de Moor, 1995)
– Neural approaches (Oja, 1982; Sanger, 1989; Xu, 1993)
– Bilinear models (Tenenbaum & Freeman, 2000; Marimont & Wandell, 1992)
– Direct extensions (Welling et al., 2003; Penev & Atick, 1996)
"Intercorrelations among variables are the bane of the multivariate researcher's struggle for meaning." (Cooley and Lohnes, 1971)
Part-based representation
• The firing rates of neurons are never negative.
• Independent representations.
NMF & ICA
Non-negative Matrix Factorization (NMF)
• Non-negative factorization.
• Leads to a part-based representation.
E(B, C) = ||D − BC||_F, subject to B ≥ 0, C ≥ 0
(Lee & Seung, 1999)
NMF
• The multiplicative algorithm can be interpreted as diagonally rescaled gradient descent:
min_{B≥0, C≥0} F = Σᵢⱼ (dᵢⱼ − (BC)ᵢⱼ)²
• Derivatives:
∂F/∂Cᵢⱼ = (BᵀBC)ᵢⱼ − (BᵀD)ᵢⱼ
∂F/∂Bᵢⱼ = (BCCᵀ)ᵢⱼ − (DCᵀ)ᵢⱼ
• Inference: Cᵢⱼ ← Cᵢⱼ (BᵀD)ᵢⱼ / (BᵀBC)ᵢⱼ
• Learning: Bᵢⱼ ← Bᵢⱼ (DCᵀ)ᵢⱼ / (BCCᵀ)ᵢⱼ
(Lee & Seung, 1999; Lee & Seung, 2000)
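A sketch of these multiplicative updates on synthetic non-negative data. The small epsilon and the choice of k one larger than the true rank are implementation details of this example, not part of the slide:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, r, k = 20, 50, 4, 5
D = rng.random((d, r)) @ rng.random((r, n))   # non-negative data of rank r

B = rng.random((d, k))                        # k > r eases the fit (example choice)
C = rng.random((k, n))

# Lee & Seung multiplicative updates for min ||D - BC||_F^2 s.t. B, C >= 0
eps = 1e-9                                    # guard against division by zero
for _ in range(500):
    C *= (B.T @ D) / (B.T @ B @ C + eps)      # inference: update coefficients
    B *= (D @ C.T) / (B @ C @ C.T + eps)      # learning: update basis

err = np.linalg.norm(D - B @ C) / np.linalg.norm(D)
```

Because the updates are multiplicative, B and C stay non-negative if initialized non-negative, which is why no explicit projection step is needed.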
Independent Component Analysis (ICA)
• We need more than second-order statistics to represent the signal.
ICA
• Look for sources sᵢ that are independent.
• PCA finds uncorrelated variables.
• For Gaussian distributions, independence and uncorrelatedness are equivalent.
• Uncorrelated: E(sᵢsⱼ) = E(sᵢ)E(sⱼ).
• Independent: E(g(sᵢ)f(sⱼ)) = E(g(sᵢ))E(f(sⱼ)) for any non-linear f, g.
D = BC, S = WD with W ≈ B⁻¹
(Hyvärinen et al., 2001)
[Figure: PCA vs. ICA bases for data generated from sources S = (1,0), (0,1), (0.5,0.5), (−0.5,−0.5).]
ICA vs PCA (Hyvärinen et al., 2001)
Many optimization criteria
• Minimize high-order moments, e.g. the kurtosis of S = WD:
kurt(s) = E{s⁴} − 3(E{s²})²
• Many other information criteria.
• Also an error function with a sparseness penalty (e.g. S(·) = |·|) (Olshausen & Field, 1996):
min_{B,C} Σᵢ₌₁ⁿ ||dᵢ − Bcᵢ||² + Σᵢ₌₁ⁿ S(cᵢ)
• Other sparse PCA (Chennubhotla & Jepson, 2001b; Zou et al., 2005; d'Aspremont et al., 2004).
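A quick numerical check of this kurtosis measure on synthetic samples: it is near zero for a Gaussian, positive for a super-Gaussian (sparse) source, and negative for a sub-Gaussian source.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

def kurt(s):
    """Sample kurtosis: E{s^4} - 3 (E{s^2})^2 (zero for a Gaussian)."""
    s = s - s.mean()
    return np.mean(s**4) - 3 * np.mean(s**2) ** 2

gaussian = rng.normal(size=n)          # kurtosis ~ 0
laplacian = rng.laplace(size=n)        # super-Gaussian: kurtosis > 0
uniform = rng.uniform(-1, 1, size=n)   # sub-Gaussian: kurtosis < 0
```

Maximizing |kurt| of the projected signal is one classical way to drive the unmixing matrix W away from Gaussian (hence toward independent) outputs.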
Basis of natural images
Denoising
[Figure: original image; noisy image (30% noise); denoising with a Wiener filter vs. ICA.]
The clustering problem
• Partition the data set in c-disjoint “clusters” of data points.
• Number of possible partitions (Stirling number of the second kind):
S(n, c) = (1/c!) Σᵢ₌₁ᶜ (−1)^{c−i} C(c, i) iⁿ
• NP-hard; approximate algorithms (k-means, hierarchical clustering, mixtures of Gaussians, etc.).
K-means
E₀(M, G) = ||D − MGᵀ||²_F (Ding et al., 2002; De la Torre et al., 2006)
[Figure: data matrix D of ten 2-D points (x, y), indicator matrix Gᵀ, and cluster means M, with D ≈ MGᵀ.]
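A minimal Lloyd-iteration sketch of k-means in this matrix form, D ≈ MGᵀ with indicator matrix G (synthetic 2-D clusters; the deterministic initialization from one point per cluster is a choice of this example, not part of the slide):

```python
import numpy as np

rng = np.random.default_rng(4)
# Two well-separated 2-D clusters; columns of D are points
A_pts = rng.normal(loc=[0, 0], scale=0.3, size=(10, 2))
B_pts = rng.normal(loc=[5, 5], scale=0.3, size=(10, 2))
D = np.vstack([A_pts, B_pts]).T             # 2 x 20
n, c = D.shape[1], 2

# Lloyd iterations minimizing E0(M, G) = ||D - M G^T||_F^2
M = D[:, [0, n - 1]].copy()                 # init means from one point per cluster
for _ in range(10):
    dist = ((D[:, :, None] - M[:, None, :]) ** 2).sum(axis=0)   # n x c distances
    labels = dist.argmin(axis=1)            # assignment step (choose G)
    for j in range(c):
        M[:, j] = D[:, labels == j].mean(axis=1)                # update means M

G = np.eye(c)[labels]                       # n x c indicator matrix
err = np.linalg.norm(D - M @ G.T)
```

Each Lloyd step decreases ||D − MGᵀ||²_F: first over the indicator G with M fixed, then over M with G fixed.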
Spectral Clustering
(Dhillon et al., 2004; Zass & Shashua, 2005; Ding et al., 2005; De la Torre et al., 2006)
[Figure: affinity matrix and its eigenvectors.]
Multi-Dimensional Scaling (MDS)
• MDS takes a matrix of pair-wise distances and finds an embedding that preserves the inter-point distances.
[Figure: pair-wise distances of US cities and the spatial layout of the cities in an embedded space (North/South vs. East/West axes).]
MDS (II)
min_{y₁,…,yₙ} Σᵢ Σⱼ (f(δᵢⱼ) − dᵢⱼ)²
• δᵢⱼ: pair-wise distances in the original data space (given).
• dᵢⱼ = ||yᵢ − yⱼ||²₂: pair-wise distances in the embedded space (unknown).
[Figure: original data space vs. embedded space.]
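A sketch of classical (Torgerson) MDS, which assumes the given dissimilarities are squared Euclidean distances: double-center the distance matrix and eigendecompose. The ground-truth layout here is hypothetical synthetic data used only to generate the distances.

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 12, 2
Y_true = rng.normal(size=(n, k))                  # hidden layout (to make distances)

# Given: squared pair-wise distances delta_ij in the original space
delta2 = ((Y_true[:, None, :] - Y_true[None, :, :]) ** 2).sum(-1)

# Classical MDS: Bmat = -1/2 J delta2 J with J = I - 11^T/n (double centering)
J = np.eye(n) - np.ones((n, n)) / n
Bmat = -0.5 * J @ delta2 @ J
w, V = np.linalg.eigh(Bmat)
order = np.argsort(w)[::-1][:k]                   # top-k eigenpairs
Y = V[:, order] * np.sqrt(w[order])               # embedded coordinates

# Distances in the embedded space
d2 = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
```

For exactly Euclidean input distances the embedding reproduces them perfectly (up to rotation and translation of the layout).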
MDS (III)
Discriminative models
• Linear Discriminant Analysis (LDA)
• Support Vector Machines (SVM)
• Oriented Component Analysis (OCA)
• Canonical Correlation Analysis (CCA)
Linear Discriminant Analysis (LDA)
• Optimal linear dimensionality reduction if the classes are Gaussian with equal covariance matrices.
J(B) = |BᵀS_b B| / |BᵀS_t B|, maximized by the generalized eigenvalue problem S_b B = S_t BΛ
S_b = Σᵢ₌₁ᶜ Σⱼ₌₁ᶜ (µᵢ − µⱼ)(µᵢ − µⱼ)ᵀ (between-class scatter)
S_t = DDᵀ = Σᵢ₌₁ⁿ dᵢdᵢᵀ (total scatter)
S_w = Σᵢ₌₁ᶜ Σ_{j ∈ class i} (dⱼ − µᵢ)(dⱼ − µᵢ)ᵀ (within-class scatter)
(Fisher, 1938; Mardia et al., 1979; Bishop, 1995)
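A numpy sketch of the two-class case via the generalized eigenvalue problem S_b b = λ S_t b (hypothetical synthetic Gaussian classes with shared covariance; for two classes, S_b reduces to the outer product of the mean difference):

```python
import numpy as np

rng = np.random.default_rng(6)
# Two classes with shared isotropic covariance; columns are samples
D1 = rng.normal(size=(2, 100))
D2 = rng.normal(size=(2, 100)) + np.array([[4.0], [1.0]])
D = np.hstack([D1, D2])

mu1, mu2 = D1.mean(axis=1), D2.mean(axis=1)
Sb = np.outer(mu1 - mu2, mu1 - mu2)          # between-class scatter (2 classes)
Dc = D - D.mean(axis=1, keepdims=True)
St = Dc @ Dc.T                               # total scatter

# Generalized eigenvalue problem S_b b = lambda S_t b
w, V = np.linalg.eig(np.linalg.solve(St, Sb))
b = np.real(V[:, np.argmax(np.real(w))])     # leading discriminant direction
b /= np.linalg.norm(b)

p1, p2 = b @ D1, b @ D2                      # 1-D projections of each class
```

The projected class means end up well separated relative to the projected within-class spread, which is exactly what J(B) rewards.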
Support Vector Machines (SVM)
• Linear classifier: f(x) = sign(wᵀx + b)
[Figure: 2-D points labeled +1 and −1. How would you classify this data?]
• Infinitely many lines classify the data well, but which one is the best?
Large margin linear classifier
• The linear discriminant function with the maximum margin is the best.
• Why is it the best?
– Robust to outliers, and shown to improve generalization.
• The margin width is M = d₋ + d₊ = 2 / ||w||₂.
[Figure: two classes (+1, −1), the margin ("safe zone"), and the support vectors.]
• Maximizing the margin gives a quadratic optimization problem with linear constraints:
min_w ½||w||₂²
s.t. wᵀxᵢ + b ≥ +1 for yᵢ = +1
     wᵀxᵢ + b ≤ −1 for yᵢ = −1
SVM formulation
• But what if we have errors or non-linear decision boundaries?
• Slack variables ξᵢ can be added to allow misclassification of difficult or noisy data points:
min_w ½||w||₂² + C Σᵢ₌₁ⁿ ξᵢ
s.t. wᵀxᵢ + b ≥ +1 − ξᵢ for yᵢ = +1
     wᵀxᵢ + b ≤ −1 + ξᵢ for yᵢ = −1
     ξᵢ ≥ 0
• C balances the trade-off between the margin and the classification error, and controls over-fitting.
[Figure: slack variables ξ₁, ξ₂.]
SVM is a regularized network
• In the noiseless case, LDA on the support vectors is equivalent to SVM. (Shashua, 1999)
• The SVM classifier can also be optimized in the primal, without constraints:
min_w C Σᵢ₌₁ᴺ max(0, 1 − yᵢ(wᵀxᵢ + b)) + ||w||₂²
• The first term is the training error under the hinge loss; the second is the regularizer.
[Figure: hinge loss as a function of yᵢ f(xᵢ), for y = −1 and y = +1.]
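A subgradient-descent sketch of this unconstrained primal on synthetic separable data. Two simplifications of this example (not of the slide): the bias is folded into w through a constant feature, and it is regularized along with the weights.

```python
import numpy as np

rng = np.random.default_rng(7)
# Linearly separable 2-D data with labels y in {-1, +1}
X = np.vstack([rng.normal([-2, -2], 0.5, (50, 2)),
               rng.normal([2, 2], 0.5, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)
Xb = np.hstack([X, np.ones((100, 1))])      # constant feature absorbs the bias

# Subgradient descent on C * sum_i max(0, 1 - y_i w^T x_i) + ||w||^2 / 2
w = np.zeros(3)
C, lr = 1.0, 0.01
for _ in range(2000):
    active = y * (Xb @ w) < 1               # points inside or violating the margin
    grad = w - C * (y[active, None] * Xb[active]).sum(axis=0)
    w -= lr * grad

acc = np.mean(np.sign(Xb @ w) == y)
```

Only the "active" points (those with margin below 1) contribute a subgradient, which is the optimization-side view of support vectors.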
Other classifiers
• Several other loss functions yield other classifiers (e.g., logistic regression, AdaBoost).
• Logistic loss: log(1 + e^{−yᵢ f(xᵢ)}). Unlike the 0/1 loss, it also penalizes examples that are well classified but close to the margin.
[Figure: hinge loss vs. 0/1 loss.]
Oriented Component Analysis (OCA)
• OCA maximizes a signal-to-noise ratio:
λ = (b_OCAᵀ Σ_signal b_OCA) / (b_OCAᵀ Σ_noise b_OCA)
• Generalized eigenvalue problem: Σ_signal b_k = λ Σ_noise b_k
• b_OCA is steered by the distribution of the noise.
OCA for face recognition
[Slide: OCA bases for face recognition, comparing generalized eigenvalue ratios of the form (bᵀΣb)/(bᵀΣₑb) for two different noise covariances Σₑ¹ and Σₑ².]
(De la Torre et al., 2005)
Canonical Correlation Analysis (CCA)
• Perform PCA independently on each data set and learn a mapping between the projections.
• Independent dimensionality reduction per set can lose signals that have small energy but are highly correlated across sets.
Canonical Correlation Analysis (CCA)
• Learn relations between multiple data sets (e.g., find features in one set related to another data set).
• Given two sets X ∈ ℝ^{d₁×n} and Y ∈ ℝ^{d₂×n}, CCA finds the pair of directions w_x and w_y that maximize the correlation between the projections (assuming zero-mean data):
ρ = (w_xᵀ X Yᵀ w_y) / sqrt((w_xᵀ X Xᵀ w_x)(w_yᵀ Y Yᵀ w_y))
• Several ways of optimizing it; a stationary point of ρ is the solution of the generalized eigenvalue problem A w = λ B w, with
A = [0, XYᵀ; YXᵀ, 0], B = [XXᵀ, 0; 0, YYᵀ], w = [w_x; w_y], A, B ∈ ℝ^{(d₁+d₂)×(d₁+d₂)}
(Mardia et al., 1979; Borga, 1998)
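A numpy sketch of CCA through this generalized eigenvalue problem, on hypothetical synthetic data sets that share one latent signal:

```python
import numpy as np

rng = np.random.default_rng(8)
n = 500
s = rng.normal(size=n)                     # shared latent signal
X = np.vstack([s + 0.1 * rng.normal(size=n),
               rng.normal(size=n)])        # d1 x n
Y = np.vstack([rng.normal(size=n),
               -s + 0.1 * rng.normal(size=n),
               rng.normal(size=n)])        # d2 x n
X -= X.mean(axis=1, keepdims=True)         # zero-mean data, as assumed
Y -= Y.mean(axis=1, keepdims=True)

d1, d2 = X.shape[0], Y.shape[0]
# Block matrices A w = lambda B w from the slide
A = np.block([[np.zeros((d1, d1)), X @ Y.T],
              [Y @ X.T, np.zeros((d2, d2))]])
Bm = np.block([[X @ X.T, np.zeros((d1, d2))],
               [np.zeros((d2, d1)), Y @ Y.T]])
lam, V = np.linalg.eig(np.linalg.solve(Bm, A))
w = np.real(V[:, np.argmax(np.real(lam))])
wx, wy = w[:d1], w[d1:]

rho = (wx @ X @ Y.T @ wy) / np.sqrt((wx @ X @ X.T @ wx) * (wy @ Y @ Y.T @ wy))
```

The largest eigenvalue equals the first canonical correlation; here it is close to 1 because one coordinate of each set carries the same latent signal.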
Virtual avatars with CCA (De la Torre & Black, 2001)
Linear methods fail
• Data points lie on a non-linear manifold.
• There is no good linear mapping to a plane.
• Linear methods only rotate/translate/scale the data.
Linear methods fail
• Learning a non-linear representation for classification.
[Figure: examples where the cosine similarity is close to 0, −1, and 1.]
Linear methods not enough
• Linear methods:
– Unique optimal solutions
– Fast learning algorithms
– Better statistical analysis
• Problem:
– Insufficient capacity, as Minsky and Papert pointed out in their book Perceptrons.
– Neural networks add non-linear layers (e.g., MLP), which solves the capacity problem but is hard to train (local minima).
• Kernel methods:
– Use linear techniques but work in a high-dimensional feature space: x → Φ(x)
Kernel methods
• The kernel defines an implicit mapping (usually high-dimensional) from the input space to a feature space, so that the data becomes linearly separable.
• Computation in the feature space can be costly because it is (usually) high-dimensional.
– The feature space can be infinite-dimensional!
• Example mapping from input space to feature space: (x₁, x₂) → (x₁², √2·x₁x₂, x₂²) = (z₁, z₂, z₃)
Machine Perception of Human Behavior with CA F. De la Torre/J. Cohn PAVIS school on CV and PR 62
Kernel methods (II)
• Suppose φ(·) is given as above.
• An inner product in the feature space is ⟨φ(x), φ(z)⟩; for the mapping above, ⟨φ(x), φ(z)⟩ = (xᵀz)².
• So, if we define the kernel function k(x, z) = ⟨φ(x), φ(z)⟩, there is no need to carry out φ(·) explicitly.
• This use of the kernel function to avoid carrying out φ(·) explicitly is known as the kernel trick: any linear algorithm that can be expressed through inner products can be made nonlinear by going to the feature space.
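A two-line check of the kernel trick for this quadratic map: the explicit feature map and the kernel (xᵀz)² give the same inner product.

```python
import numpy as np

def phi(x):
    """Explicit feature map (x1, x2) -> (x1^2, sqrt(2) x1 x2, x2^2)."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def k(x, z):
    """Kernel trick: the same inner product without computing phi."""
    return (x @ z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
```

Here the feature space is only 3-dimensional, but the same identity is what makes infinite-dimensional feature spaces (e.g. the RBF kernel's) computable.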
Kernel PCA (Scholkopf et al., 1998)
Kernel PCA (II)
• Eigenvectors of the covariance matrix in feature space:
C = (1/n) Σᵢ₌₁ⁿ Φ(dᵢ)Φ(dᵢ)ᵀ, C b₁ = λ₁ b₁
• Eigenvectors lie in the span of the data in feature space:
b₁ = Σᵢ₌₁ⁿ αᵢ Φ(dᵢ)
• Substituting and projecting onto the data gives an eigenproblem on the kernel matrix Kᵢⱼ = Φ(dᵢ)ᵀΦ(dⱼ):
Kα = λα
(Scholkopf et al., 1998)
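A numpy sketch of kernel PCA with an RBF kernel (a common choice, assumed here), including the kernel-matrix centering that the zero-mean-in-feature-space assumption requires, and the normalization that makes each implicit basis vector b unit length:

```python
import numpy as np

rng = np.random.default_rng(9)
n = 60
D = rng.normal(size=(2, n))                     # columns are data points

# RBF kernel matrix K_ij = exp(-||d_i - d_j||^2 / 2)
sq = ((D[:, :, None] - D[:, None, :]) ** 2).sum(axis=0)
K = np.exp(-sq / 2.0)

# Center the kernel matrix (data centered implicitly in feature space)
J = np.eye(n) - np.ones((n, n)) / n
Kc = J @ K @ J

lam, alpha = np.linalg.eigh(Kc)
lam, alpha = lam[::-1], alpha[:, ::-1]          # decreasing eigenvalue order
k = 3
# Scale alpha so that b = sum_i alpha_i phi(d_i) has unit norm: ||b||^2 = alpha^T K alpha
alpha_k = alpha[:, :k] / np.sqrt(np.maximum(lam[:k], 1e-12))

# Projections of the training data onto the first k kernel principal components
Z = Kc @ alpha_k                                # n x k embedding
```

Everything is expressed through K; the (possibly infinite-dimensional) Φ is never evaluated.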
Outline
• Introduction (5 min)
• Generative models (20 min) – PCA, k-means, spectral clustering, NMF, ICA, MDS
• Discriminative models (20 min) – LDA, SVM, OCA, CCA
• Standard extensions of linear models (15 min) – kernel methods, latent variable models, tensor factorization
• Unified view (15 min)
• Extended generative models (50 min) – RPCA, PaCA, ACA
• Extended discriminative models (1 hour) – MODA, Parda, CTW, seg-SVM
Factor Analysis
• Adding a Gaussian prior on the coefficients and Gaussian noise to PCA gives Factor Analysis:
d = µ + Bc + η
p(c) = N(c | 0, I_k), p(d | c) = N(d | µ + Bc, Ψ), Ψ = diag(η₁, η₂, …, η_d)
cov(d) = E((d − µ)(d − µ)ᵀ) = BBᵀ + Ψ
(Mardia et al., 1979)
• Inference (Roweis & Ghahramani, 1999; Tipping & Bishop, 1999a):
p(c | d) = N(c | m, V)
m = Bᵀ(BBᵀ + Ψ)⁻¹(d − µ)
V = (I + BᵀΨ⁻¹B)⁻¹
• p(c, d) is jointly Gaussian.
• [Figure: a sample can have a low PCA reconstruction error yet a high FA reconstruction error (low likelihood).]
PPCA
• If Ψ = E(ηηᵀ) = εI_d, Factor Analysis becomes PPCA.
• If ε → 0, PPCA is equivalent to PCA: Bᵀ(BBᵀ + εI_d)⁻¹ → (BᵀB)⁻¹Bᵀ, and inference reduces to cᵢ = Bᵀdᵢ for orthonormal B.
• Probabilistic visual learning (Moghaddam & Pentland, 1997) evaluates the marginal density p(d) = ∫ p(d | c) p(c) dc as the product of a distance in the k-dimensional principal subspace and a distance from that subspace:
p(d) = [exp(−½ Σᵢ₌₁ᵏ yᵢ²/λᵢ) / ((2π)^{k/2} Πᵢ₌₁ᵏ λᵢ^{1/2})] · [exp(−ε²(d)/(2ρ)) / (2πρ)^{(d−k)/2}]
where yᵢ are the projections onto the principal directions, λᵢ the eigenvalues, ε²(d) the residual reconstruction error, and ρ the average of the discarded eigenvalues.
More on PPCA
• Extension to mixtures of PPCA (mixtures of subspaces) (Tipping & Bishop, 1999b; Black et al., 1998; Jebara et al., 1998)
• Tracking (Yang et al., 1999; Yang et al., 2000a; Lee et al., 2005; de la Torre et al., 2000b)
• Recognition/detection (Moghaddam et al., 2000; Shakhnarovich & Moghaddam, 2004; Everingham & Zisserman, 2006)
• PCA for the exponential family (Collins et al., 2001)
Tensor faces (Vasilescu & Terzopoulos, 2002; Vasilescu & Terzopoulos, 2003)
[Figure: face images varying across views, illuminations, expressions, and people.]
Eigenfaces
• Facial images (identity change).
• Eigenface basis vectors capture the variability in facial appearance (they do not decouple pose, illumination, etc.).
Data Organization
• Linear/PCA: data matrix
– ℝ^{pixels × images}
– a matrix of image vectors
• Multilinear: data tensor
– ℝ^{people × views × illums × express × pixels}
– an N-dimensional matrix
– 28 people, 45 images/person: 5 views, 3 illuminations, 3 expressions per person
N-Mode SVD Algorithm
• For N = 3: d_ijk = Σ_l Σ_p Σ_s z_lps u_il u_jp u_ks
• TensorFaces (N = 5): D = Z ×₁ U_people ×₂ U_views ×₃ U_illums ×₄ U_express ×₅ U_pixels
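An einsum sketch of the N = 3 mode-product expansion above, with a random core tensor and orthonormal factor matrices standing in for the learned ones:

```python
import numpy as np

rng = np.random.default_rng(10)
L, P, S = 3, 4, 5
Z = rng.normal(size=(L, P, S))                    # core tensor
U1 = np.linalg.qr(rng.normal(size=(6, L)))[0]     # mode-1 factor (orthonormal cols)
U2 = np.linalg.qr(rng.normal(size=(7, P)))[0]     # mode-2 factor
U3 = np.linalg.qr(rng.normal(size=(8, S)))[0]     # mode-3 factor

# d_ijk = sum_{l,p,s} z_lps u1_il u2_jp u3_ks  (three mode products)
D = np.einsum('lps,il,jp,ks->ijk', Z, U1, U2, U3)

# With orthonormal factors, contracting D against them recovers the core
Z_rec = np.einsum('ijk,il,jp,ks->lps', D, U1, U2, U3)
```

In the full N-mode SVD each Uₙ would come from the SVD of the mode-n unfolding of the data tensor; here they are random orthonormal matrices used only to illustrate the contraction.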
[Figure: PCA basis vectors vs. TensorFaces basis vectors.]
TensorFaces
• Original: 176 basis vectors.
• PCA: mean sq. err. = 85.75; 33 parameters, 33 basis vectors.
• TensorFaces: mean sq. err. = 409.15; 3 illumination + 11 people parameters, 33 basis vectors.
• TensorFaces: 6 illumination + 11 people parameters, 66 basis vectors.
• Strategic data compression = perceptual quality:
– TensorFaces data reduction in illumination space primarily degrades illumination effects (cast shadows, highlights).
– PCA has a lower mean square error but a higher perceptual error.
The fundamental equation of CA (De la Torre & Kanade, 2006; De la Torre, 2012)
• Given two datasets D ∈ ℝ^{d×n} and X ∈ ℝ^{x×n}:
E₀(A, B) = ||W_r(Γ − ABᵀΨ)W_c||_F
where Γ = φ(D) and Ψ = θ(X), W_r and W_c are weights for rows and columns, and A, B are the regression matrices.
Properties of the cost function
• E₀(A, B) has a unique global minimum (Baldi & Hornik, 1989).
• Closed-form solutions for A and B exist: substituting the optimum of one matrix into E₀ leaves a trace expression in the other (in terms of Γ, Ψ, W_r, W_c) whose optimum is a generalized eigenvalue problem.
Principal Component Analysis (PCA)
• PCA finds the directions of maximum variation of the data based on linear correlation. (Pearson, 1901; Hotelling, 1933; Mardia et al., 1979; Jolliffe, 1986; Diamantaras, 1996)
• Kernel PCA finds the directions of maximum variation of the data in the feature space, e.g. (x₁, x₂) → (x₁², √2·x₁x₂, x₂²) = (z₁, z₂, z₃).
PCA – Kernel PCA
• The primal problem: E₀(A, B) = ||W_r(Γ − ABᵀΨ)W_c||_F
• Error function for KPCA: E_kpca(A, B) = ||φ(D) − ABᵀφ(D)||_F (Eckart & Young, 1936; Gabriel & Zamir, 1979; Baldi & Hornik, 1989; Shum et al., 1995; de la Torre & Black, 2003a)
• Eliminating one factor gives the dual problem, expressed through the kernel matrix φ(D)ᵀφ(D).
Error function for LDA
E_LDA(A, B) = ||(GᵀG)^{−1/2}(Gᵀ − ABᵀD)||_F
• G ∈ {0,1}^{n×c} is the label indicator matrix (G1_c = 1_n), with c = classes and n = samples; D = [d₁ d₂ … dₙ], d = pixels.
• This gives n×c equations in d×c unknowns: for d ≫ n, an UNDERDETERMINED system of equations (over-fitting).
(de la Torre & Kanade, 2006)
Canonical Correlation Analysis (CCA)
• CCA is the same as LDA, replacing the label matrix G by a new data set X:
E_CCA(A, B) = ||(DᵀD)^{−1/2}(Dᵀ − ABᵀX)||_F
K-means
E₀(A, B) = ||W_r(D − ABᵀΨ)W_c||_F (Ding et al., 2002; De la Torre et al., 2006)
[Figure: data matrix D of ten 2-D points, cluster means A, and indicator matrix Bᵀ, with D ≈ ABᵀ.]
Normalized cuts
E₀(A, B) = ||W_r(Γ − ABᵀΨ)W_c||_F
Γ = φ(D) = [φ(d₁) φ(d₂) … φ(dₙ)] (affinity matrix)
• Special cases include Normalized Cuts (Shi & Malik, 2000) and Ratio cuts (Hagen & Kahng, 1992).
(Dhillon et al., 2004; Zass & Shashua, 2005; Ding et al., 2005; De la Torre et al., 2006)
Other Connections
• The LS-KRRR (E₀) is also the generative model for:
– Laplacian eigenmaps, locality preserving projections, MDS, partial least-squares, etc.
• Benefits of the LS framework:
– A common framework to understand differences and commonalities between CA methods (e.g. KPCA, KLDA, KCCA, NCuts).
– Better understanding of normalization factors and generalizations.
– Efficient numerical optimization in less than O(n³) or O(d³), where n is the number of samples and d the number of dimensions.