CS 231A Section 1: Linear Algebra & Probability Review
Jonathan Krause
9/28/2012
Topics
• Support Vector Machines
• Boosting
  – Viola-Jones face detector
• Linear Algebra Review
  – Notation
  – Operations & Properties
  – Matrix Calculus
• Probability
  – Axioms
  – Basic Properties
  – Bayes Theorem, Chain Rule
Linear classifiers
• Find a linear function (hyperplane) to separate positive and negative examples:
    w · x_i + b ≥ 0: positive
    w · x_i + b < 0: negative
• Which hyperplane (w, b) is best?
Support vector machines
• Find the hyperplane that maximizes the margin between the positive and negative examples
[Figure: the margin, with the support vectors lying on its boundary]
Support Vector Machines (SVM)
• We wish to perform binary classification, i.e. find a linear classifier
• Given data x_1, …, x_n and labels y_i ∈ {−1, +1}
• When the data is linearly separable, we can find our linear classifier by solving the optimization problem
    min_{w,b} ½‖w‖²  subject to  y_i(wᵀx_i + b) ≥ 1 for all i
Nonlinear SVMs
• Datasets that are linearly separable work out great
• But what if the dataset is just too hard?
• We can map it to a higher-dimensional space, e.g. x → x²
Slide credit: Andrew Moore
Nonlinear SVMs
• General idea: the original input space can always be mapped to some higher-dimensional feature space where the training set is separable, via a lifting transformation Φ: x → φ(x)
Slide credit: Andrew Moore
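To make the lifting idea concrete, here is a small sketch with made-up 1D data: no single threshold separates the classes on the line, but the lifted points φ(x) = (x, x²) are separated by a horizontal line in the plane.

```python
import numpy as np

# Toy 1D dataset (hypothetical values): positives cluster near 0,
# negatives lie on both sides, so no single threshold separates them.
x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
y = np.array([-1, -1, 1, 1, 1, -1, -1])

# Lifting transformation: phi(x) = (x, x^2). In the lifted space the
# positives have small x^2 and the negatives large x^2.
phi = np.stack([x, x ** 2], axis=1)

# A separating hyperplane in the lifted space: w = (0, -1), b = 2,
# i.e. classify by the sign of 2 - x^2.
w, b = np.array([0.0, -1.0]), 2.0
pred = np.sign(phi @ w + b)
separable = bool(np.all(pred == y))
```

The hyperplane (w, b) here is chosen by hand for illustration; an SVM trained on the lifted points would find a similar separator automatically.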
SVM – l1 regularization
• What if the data is not linearly separable?
• We can use regularization to solve this problem
• We solve a new optimization problem and "tune" the regularization parameter C:
    min_{w,b,ξ} ½‖w‖² + C Σ_i ξ_i  subject to  y_i(wᵀx_i + b) ≥ 1 − ξ_i, ξ_i ≥ 0
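As a rough illustration of the soft-margin objective (not the course's liblinear solver), the sketch below minimizes the standard hinge-loss reformulation ½‖w‖² + C Σ_i max(0, 1 − y_i(wᵀx_i + b)) by subgradient descent; the 2D Gaussian-blob data is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2D training data: two well-separated Gaussian blobs,
# labels in {-1, +1}.
X = np.vstack([rng.normal(loc=[2.0, 2.0], size=(50, 2)),
               rng.normal(loc=[-2.0, -2.0], size=(50, 2))])
y = np.hstack([np.ones(50), -np.ones(50)])

# Minimize (1/2)||w||^2 + C * sum_i max(0, 1 - y_i (w.x_i + b)),
# the hinge-loss form of the slide's problem, by subgradient descent.
C, lr, epochs = 1.0, 0.01, 200
w, b = np.zeros(2), 0.0
for _ in range(epochs):
    margins = y * (X @ w + b)
    viol = margins < 1                     # margin violators (xi_i > 0)
    grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
    grad_b = -C * y[viol].sum()
    w -= lr * grad_w
    b -= lr * grad_b

accuracy = np.mean(np.sign(X @ w + b) == y)
```

In practice the packages on the next slide solve this far more carefully; the point here is only that "tuning C" trades margin width against slack on the violators.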
Section 1 -
Jonathan Krause
Solving the SVM
• There are many different packages for solving SVMs
• In PS0 we have you use the liblinear package. It is an efficient implementation, but it can only use a linear kernel: http://www.csie.ntu.edu.tw/~cjlin/liblinear/
• If you wish to have more flexibility in your choice of kernel, you can use the LibSVM package: http://www.csie.ntu.edu.tw/~cjlin/libsvm/
Topics
• Support Vector Machines
• Boosting
  – Viola-Jones face detector
• Linear Algebra Review
  – Notation
  – Operations & Properties
  – Matrix Calculus
• Probability
  – Axioms
  – Basic Properties
  – Bayes Theorem, Chain Rule
Boosting
• It is a sequential procedure. Each data point x_t has a class label y_t ∈ {+1, −1} and a weight, initialized to w_t = 1.
Y. Freund and R. Schapire, A short introduction to boosting, Journal of Japanese Society for Artificial Intelligence, 14(5):771-780, September 1999.
Toy example
• Weak learners come from the family of lines
• A line h with p(error) = 0.5 is at chance
• Each data point has a class label y_t ∈ {+1, −1} and a weight w_t = 1
Toy example
• This one seems to be the best
• This is a 'weak classifier': it performs slightly better than chance
• Each data point has a class label y_t ∈ {+1, −1} and a weight w_t = 1
Toy example
• Each data point has a class label y_t ∈ {+1, −1}
• We update the weights: w_t ← w_t exp{−y_t H_t}
• (The same reweighting step repeats on each round of boosting.)
Toy example
• The strong (non-linear) classifier is built as the combination of all the weak (linear) classifiers f_1, f_2, f_3, f_4
Boosting
• Defines a classifier using an additive model:
    H(x) = α_1 h_1(x) + α_2 h_2(x) + α_3 h_3(x) + …
  where H(x) is the strong classifier, each h_i(x) is a weak classifier, each α_i is its weight, and x is the features vector
Boosting
• Defines a classifier using an additive model:
    H(x) = α_1 h_1(x) + α_2 h_2(x) + α_3 h_3(x) + …
  where H(x) is the strong classifier, each h_i(x) is a weak classifier, each α_i is its weight, and x is the features vector
• We need to define a family of weak classifiers from which the h_i(x) are drawn
Why boosting?
• A simple algorithm for learning robust classifiers
  – Freund & Schapire, 1995
  – Friedman, Hastie, Tibshirani, 1998
• Provides an efficient algorithm for sparse visual feature selection
  – Tieu & Viola, 2000
  – Viola & Jones, 2003
• Easy to implement; doesn't require external optimization tools
Boosting - mathematics
• Weak learners: given a rectangle feature with value f_j(x), a threshold θ_j, and a parity p_j,
    h_j(x) = 1 if p_j f_j(x) < p_j θ_j, and 0 otherwise
• Final strong classifier:
    h(x) = 1 if Σ_{t=1}^T α_t h_t(x) ≥ ½ Σ_{t=1}^T α_t, and 0 otherwise
Weak classifier
• 4 kinds of rectangle filters
• Value = ∑(pixels in white area) − ∑(pixels in black area)
Slide credit: S. Lazebnik
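The rectangle sums above are what Viola & Jones compute with an integral image, so each filter costs only a handful of array lookups. A minimal sketch on a made-up 4x4 "image"; the two-rectangle left/right feature is assumed here purely for illustration.

```python
import numpy as np

def integral_image(img):
    """Cumulative sums so any rectangle sum needs only 4 lookups."""
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, r0, c0, r1, c1):
    """Sum of img[r0:r1, c0:c1], computed from the integral image ii."""
    total = ii[r1 - 1, c1 - 1]
    if r0 > 0:
        total -= ii[r0 - 1, c1 - 1]
    if c0 > 0:
        total -= ii[r1 - 1, c0 - 1]
    if r0 > 0 and c0 > 0:
        total += ii[r0 - 1, c0 - 1]
    return total

# Two-rectangle (left/right) feature on a toy 4x4 "image":
# value = sum(white area) - sum(black area)
img = np.arange(16, dtype=float).reshape(4, 4)
ii = integral_image(img)
white = rect_sum(ii, 0, 0, 4, 2)   # left half of the window
black = rect_sum(ii, 0, 2, 4, 4)   # right half of the window
value = white - black
```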
Weak classifier
[Figure: a rectangle filter applied to a source image and the resulting response]
Slide credit: S. Lazebnik
Viola & Jones algorithm
• Training examples (x_1, 1), (x_2, 1), (x_3, 0), (x_4, 0), (x_5, 0), (x_6, 0), …, (x_n, y_n)
• Weak classifier with feature value f_j(x), threshold θ_j, and parity p_j:
    h_j(x) = 1 if p_j f_j(x) < p_j θ_j, and 0 otherwise
1. Evaluate each rectangle filter on each example (e.g. filter values 0.8, 0.7, 0.2, 0.3, 0.8, 0.1)
P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. CVPR 2001.
Viola & Jones algorithm
• For a 24x24 detection region, the number of possible rectangle features is ~160,000
P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. CVPR 2001.
Viola & Jones algorithm
2. Select the best filter/threshold combination:
   a. Normalize the weights:
        w_{t,i} ← w_{t,i} / Σ_{j=1}^n w_{t,j}
   b. For each feature j, compute the weighted error:
        ε_j = Σ_i w_i |h_j(x_i) − y_i|,  where h_j(x) = 1 if p_j f_j(x) < p_j θ_j, and 0 otherwise
   c. Choose the classifier h_t with the lowest error ε_t
3. Reweight the examples:
        w_{t+1,i} = w_{t,i} β_t^{1−e_i},  where β_t = ε_t / (1 − ε_t),
        and e_i = 0 if example x_i is classified correctly by h_t, e_i = 1 otherwise
P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. CVPR 2001.
Viola & Jones algorithm
4. The final strong classifier is
     h(x) = 1 if Σ_{t=1}^T α_t h_t(x) ≥ ½ Σ_{t=1}^T α_t, and 0 otherwise,  where α_t = log(1/β_t)
• The final hypothesis is a weighted linear combination of the T hypotheses, where the weights are inversely proportional to the training errors
P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. CVPR 2001.
Boosting for face detection
• For each round of boosting:
  1. Evaluate each rectangle filter on each example
  2. Select the best filter/threshold combination
  3. Reweight the examples
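The three steps above, together with the final weighted vote, can be sketched end to end. The 1D training set and the threshold "stumps" below are hypothetical stand-ins for rectangle-feature classifiers; the weighting scheme follows the Viola-Jones update.

```python
import numpy as np

# Toy training set (hypothetical feature values), labels in {0, 1}
# as in Viola-Jones: three "faces" with large filter response,
# three "non-faces" with small response.
x = np.array([0.8, 0.7, 0.2, 0.3, 0.8, 0.1])
y = np.array([1, 1, 0, 0, 1, 0])
n = len(x)

w = np.ones(n) / n                # example weights
T = 3                             # number of boosting rounds
alphas, stumps = [], []

for t in range(T):
    w = w / w.sum()                              # a. normalize the weights
    best = None
    for theta in np.unique(x):                   # b. weighted error of each
        for p in (1, -1):                        #    threshold/parity stump
            h = (p * x < p * theta).astype(int)
            err = np.sum(w * np.abs(h - y))
            if best is None or err < best[0]:
                best = (err, theta, p)
    err, theta, p = best                         # c. lowest-error stump
    beta = max(err, 1e-10) / (1 - err)           # clamp so beta > 0
    h = (p * x < p * theta).astype(int)
    e = np.abs(h - y)                            # 0 if correct, 1 if not
    w = w * beta ** (1 - e)                      # 3. reweight the examples
    alphas.append(np.log(1 / beta))
    stumps.append((theta, p))

def strong(xq):
    """Final strong classifier: weighted vote of the weak stumps."""
    s = sum(a * (p * xq < p * th).astype(int)
            for a, (th, p) in zip(alphas, stumps))
    return (s >= 0.5 * sum(alphas)).astype(int)

train_acc = np.mean(strong(x) == y)
```

Real Viola-Jones training differs mainly in scale: hundreds of thousands of rectangle features in place of one scalar, and a cascade of such classifiers.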
Section 1 -
Jonathan Krause
The implemented system
• Training data
  – 5000 faces
    • All frontal, rescaled to 24x24 pixels
  – 300 million non-faces
    • 9500 non-face images
  – Faces are normalized
    • Scale, translation
• Many variations
  – Across individuals
  – Illumination
  – Pose
P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. CVPR 2001.
System performance
• Training time: "weeks" on a 466 MHz Sun workstation
• 38 layers, 6061 features in total
• An average of 10 features evaluated per window on the test set
• "On a 700 Mhz Pentium III processor, the face detector can process a 384 by 288 pixel image in about .067 seconds"
  – 15 Hz
  – 15 times faster than the previous detector of comparable accuracy (Rowley et al., 1998)
P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. CVPR 2001.
Output of Face Detector on Test Images
[Figure: face-detector outputs on test images]
P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. CVPR 2001.
Topics
• Support Vector Machines
• Boosting
  – Viola-Jones face detector
• Linear Algebra Review
  – Notation
  – Operations & Properties
  – Matrix Calculus
• Probability
  – Axioms
  – Basic Properties
  – Bayes Theorem, Chain Rule
Linear Algebra in Computer Vision
• Representation
  – 3D points in the scene
  – 2D points in the image (images are matrices)
• Transformations
  – Mapping 2D to 2D
  – Mapping 3D to 2D
Notation
• We adopt the notation A ∈ ℝ^(m×n) for a real-valued matrix with m rows and n columns
• We adopt the notation x ∈ ℝ^n for a column vector, and xᵀ for a row vector
Notation
• To indicate the element in the ith row and jth column of a matrix A we use A_ij
• Similarly, to indicate the ith entry in a vector x we use x_i
Norms
• Intuitively, the norm of a vector is a measure of its "length"
• The l2 norm is defined as ‖x‖₂ = √(Σ_i x_i²). In this class we will use the l2 norm unless otherwise noted, so we drop the 2 subscript on the norm for convenience.
• Note that ‖x‖² = xᵀx
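A quick numerical check of these definitions; the example vector is arbitrary.

```python
import numpy as np

x = np.array([3.0, 4.0])

# l2 norm from the definition: sqrt of the sum of squared entries
l2 = np.sqrt(np.sum(x ** 2))

# NumPy's built-in (np.linalg.norm defaults to the l2 norm)
builtin = np.linalg.norm(x)

# The squared l2 norm equals the inner product x^T x
sq = x @ x
```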
Linear Independence and Rank
• A set of vectors is linearly independent if no vector in the set can be represented as a linear combination of the remaining vectors in the set
• The rank of a matrix is the maximal number of linearly independent columns (or rows) of the matrix
• For A ∈ ℝ^(m×n), rank(A) ≤ min(m, n); if rank(A) = min(m, n), A is said to be full rank
• rank(A) = rank(Aᵀ)
• rank(AB) ≤ min(rank(A), rank(B))
• rank(A + B) ≤ rank(A) + rank(B)
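A small check of the rank definition with NumPy; the matrix is a made-up example whose third column is the sum of the first two, so only two columns are linearly independent.

```python
import numpy as np

# Column 3 = column 1 + column 2, so the columns are linearly
# dependent and the rank is 2, not 3.
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 2.0]])

rank = np.linalg.matrix_rank(A)
```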
Range and Nullspace
• The range of a matrix A ∈ ℝ^(m×n) is the span of the columns of the matrix, denoted by the set R(A) = {v ∈ ℝ^m : v = Ax for some x ∈ ℝ^n}
• The nullspace of a matrix A is the set of vectors that when multiplied by the matrix result in 0, given by the set N(A) = {x ∈ ℝ^n : Ax = 0}
Eigenvalues and Eigenvectors
• Given a square matrix A, λ and x ≠ 0 are said to be an eigenvalue and the corresponding eigenvector of the matrix if Ax = λx
• We can solve for the eigenvalues by solving for the roots of the characteristic polynomial det(λI - A) = 0
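The definitions are easy to verify numerically. The 2x2 matrix below is an arbitrary example; its characteristic polynomial (λ - 2)² - 1 has roots 1 and 3.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# np.linalg.eig returns eigenvalues and eigenvectors (as columns of vecs)
vals, vecs = np.linalg.eig(A)

# Each pair satisfies the defining equation A v = lambda v
ok = all(np.allclose(A @ v, lam * v) for lam, v in zip(vals, vecs.T))

# The eigenvalues are the roots of det(lambda I - A) = 0; for this A
# the characteristic polynomial is (lambda - 2)^2 - 1, with roots 1, 3.
roots = np.sort(vals.real)
```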
Eigenvalue Properties
• The trace of A equals the sum of its eigenvalues, and the determinant of A equals their product
• The rank of a matrix is equal to the number of its non-zero eigenvalues
• The eigenvalues of a diagonal matrix D = diag(d_1, …, d_n) are simply the diagonal entries d_1, …, d_n
• A matrix A is said to be diagonalizable if we can write A = PDP⁻¹, with D diagonal
Eigenvalues & Eigenvectors of Symmetric Matrices
• Eigenvalues of symmetric matrices are real
• Eigenvectors of symmetric matrices are orthonormal
• Consider the optimization problem max_x xᵀAx subject to ‖x‖ = 1, where A is symmetric: the maximizing x is the eigenvector corresponding to the largest eigenvalue
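A numerical illustration of these properties on a random symmetric matrix; sampling random unit vectors is used here as a crude stand-in for solving the constrained problem, just to show that nothing beats the top eigenvector.

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.normal(size=(4, 4))
A = (B + B.T) / 2                    # make A symmetric

# eigh is specialized for symmetric matrices: real eigenvalues,
# orthonormal eigenvectors, eigenvalues returned in ascending order
vals, vecs = np.linalg.eigh(A)

# Eigenvectors are orthonormal: V^T V = I
orthonormal = np.allclose(vecs.T @ vecs, np.eye(4))

# The top eigenvector maximizes x^T A x over unit vectors;
# 200 random unit vectors never exceed that value
top = vals[-1]
x_star = vecs[:, -1]
rand = rng.normal(size=(200, 4))
rand = rand / np.linalg.norm(rand, axis=1, keepdims=True)
best_random = np.max(np.einsum('ij,jk,ik->i', rand, A, rand))
```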
Generalized Eigenvalues
• Generalized eigenvalue problem: Ax = λBx
• Generalized eigenvalues must satisfy det(A - λB) = 0
• This reduces to the original eigenvalue problem when B⁻¹ exists
• Generalized eigenvalues are used in Fisherfaces
Singular Value Decomposition (SVD)
• The SVD of a matrix A ∈ ℝ^(m×n) is given by A = UΣVᵀ
• The columns u_i of U are called the left singular vectors; Σ is a diagonal matrix whose values σ_i are called the singular values; the columns v_i of V are called the right singular vectors
SVD
• If the matrix A has rank r, then A has r non-zero singular values
• u_1, …, u_r are an orthonormal basis for the range R(A)
• v_{r+1}, …, v_n are an orthonormal basis for the nullspace N(A)
• The singular values of A are the square roots of the non-zero eigenvalues of AᵀA or AAᵀ
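These relationships are easy to confirm numerically on an arbitrary random matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))

# Thin SVD: A = U Sigma V^T
U, s, Vt = np.linalg.svd(A, full_matrices=False)
reconstructed = np.allclose(U @ np.diag(s) @ Vt, A)

# Singular values are the square roots of the eigenvalues of A^T A
# (eigvalsh returns ascending order; svd returns descending)
eigs = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]
match = np.allclose(s, np.sqrt(np.maximum(eigs, 0)))
```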
Matlab
• [V,D] = eig(A)   The eigenvectors of A are the columns of V. D is a diagonal matrix whose entries are the eigenvalues of A.
• [V,D] = eig(A,B) The generalized eigenvectors are the columns of V. D is a diagonal matrix whose entries are the generalized eigenvalues.
• [U,S,V] = svd(X) The columns of U are the left singular vectors of X. S is a diagonal matrix whose entries are the singular values of X. The columns of V are the right singular vectors of X. Recall X = U*S*V';
Matrix Calculus -- Gradient
• Let f : ℝ^(m×n) → ℝ; then the gradient of f with respect to A is the m×n matrix of partial derivatives, (∇_A f(A))_ij = ∂f(A)/∂A_ij
• ∇_A f(A) is always the same size as A; thus, if we just have a vector x, the gradient is simply the vector with entries (∇_x f(x))_i = ∂f(x)/∂x_i
Gradients
• The gradient collects all the partial derivatives of f
• Some common gradients:
    ∇_x (aᵀx) = a
    ∇_x (xᵀAx) = (A + Aᵀ)x   (= 2Ax when A is symmetric)
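The two common gradients can be sanity-checked against a central-difference approximation; the matrix, vector, and evaluation point below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
a = rng.normal(size=3)
x = rng.normal(size=3)

def numerical_grad(f, x, eps=1e-6):
    """Central-difference approximation of the gradient of f at x."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        d = np.zeros_like(x)
        d[i] = eps
        g[i] = (f(x + d) - f(x - d)) / (2 * eps)
    return g

# gradient of a^T x is a
ok1 = np.allclose(numerical_grad(lambda v: a @ v, x), a, atol=1e-5)

# gradient of x^T A x is (A + A^T) x
ok2 = np.allclose(numerical_grad(lambda v: v @ A @ v, x),
                  (A + A.T) @ x, atol=1e-5)
```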
Topics
• Support Vector Machines
• Boosting
  – Viola-Jones face detector
• Linear Algebra Review
  – Notation
  – Operations & Properties
  – Matrix Calculus
• Probability
  – Axioms
  – Basic Properties
  – Bayes Theorem, Chain Rule
Probability in Computer Vision
• Foundation for algorithms to solve
  – Tracking problems
  – Human activity recognition
  – Object recognition
  – Segmentation
Probability Axioms
• Sample space: the set of all the outcomes of a random experiment, denoted by Ω
• Event space: a set F whose elements A ∈ F (called events) are subsets of Ω
• Probability measure: a function P : F → ℝ that satisfies
  – P(A) ≥ 0, for all A ∈ F
  – P(Ω) = 1
  – If A_1, A_2, … are disjoint events, then P(∪_i A_i) = Σ_i P(A_i)
Basic Properties
• If A ⊆ B, then P(A) ≤ P(B)
• P(A ∩ B) ≤ min(P(A), P(B))
• P(A ∪ B) ≤ P(A) + P(B)   (union bound)
• P(Ω \ A) = 1 - P(A)
• If A_1, …, A_k are disjoint events with ∪_i A_i = Ω, then Σ_i P(A_i) = 1 (law of total probability)
Conditional Probability
• P(A|B) = P(A ∩ B) / P(B), defined when P(B) > 0
• Two events are independent if P(A ∩ B) = P(A)P(B), or equivalently P(A|B) = P(A)
• Conditional independence: A and B are conditionally independent given C if P(A ∩ B | C) = P(A|C) P(B|C)
Product Rule
• From the definition of conditional probability we can write P(A ∩ B) = P(A|B) P(B)
• From the product rule we can derive the chain rule of probability:
    P(A_1 ∩ … ∩ A_n) = P(A_1) P(A_2|A_1) P(A_3|A_1 ∩ A_2) … P(A_n|A_1 ∩ … ∩ A_{n-1})
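A tiny worked check of the product rule, using a hypothetical draw of two cards without replacement from a four-card deck (2 red, 2 black):

```python
# Product rule: P(1st red AND 2nd red) = P(2nd red | 1st red) * P(1st red)
p_first_red = 2 / 4                     # 2 red cards out of 4
p_second_red_given_first = 1 / 3        # 1 red left out of 3
product_rule = p_second_red_given_first * p_first_red   # 1/6

# Brute force over all equally likely ordered draws (i, j), i != j
deck = ['R', 'R', 'B', 'B']
draws = [(i, j) for i in range(4) for j in range(4) if i != j]
both_red = sum(deck[i] == 'R' and deck[j] == 'R' for i, j in draws)
empirical = both_red / len(draws)
```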
Bayes Theorem
• P(A|B) = P(B|A) P(A) / P(B)
  – P(A|B): posterior probability
  – P(B|A): likelihood
  – P(A): prior probability
  – P(B): normalizing constant
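A short worked example with made-up diagnostic-test numbers, combining Bayes theorem with the law of total probability to get the normalizing constant:

```python
# Hypothetical numbers: prior P(D) = 0.01, likelihood P(+|D) = 0.9,
# false-positive rate P(+|not D) = 0.05.
p_d = 0.01
p_pos_given_d = 0.9
p_pos_given_not_d = 0.05

# Normalizing constant P(+) via the law of total probability
p_pos = p_pos_given_d * p_d + p_pos_given_not_d * (1 - p_d)

# Posterior P(D|+) = P(+|D) P(D) / P(+)
posterior = p_pos_given_d * p_d / p_pos
```

Even with a fairly accurate test, the small prior keeps the posterior around 0.15, which is the classic illustration of why the prior term matters.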