CS 231A Section 1: Linear Algebra & Probability Review
Jonathan Krause
9/28/2012
Topics
• Support Vector Machines
• Boosting
  – Viola-Jones face detector
• Linear Algebra Review
  – Notation
  – Operations & Properties
  – Matrix Calculus
• Probability
  – Axioms
  – Basic Properties
  – Bayes Theorem, Chain Rule
Linear classifiers
• Find a linear function (hyperplane) to separate positive and negative examples:
    w · x_i + b ≥ 0: positive
    w · x_i + b < 0: negative
• Which hyperplane (w, b) is best?
Support vector machines
• Find the hyperplane that maximizes the margin between the positive and negative examples
[Figure: the margin, with the support vectors lying on its boundary]
Support Vector Machines (SVM)
• We wish to perform binary classification, i.e. find a linear classifier
• Given data x_1, …, x_n and labels y_i ∈ {−1, +1}
• When the data is linearly separable, we can find our linear classifier by solving the optimization problem
    min_{w,b} ½‖w‖²  subject to  y_i(wᵀx_i + b) ≥ 1 for all i
Nonlinear SVMs
• Datasets that are linearly separable work out great
• But what if the dataset is just too hard?
• We can map it to a higher-dimensional space, e.g. x → x²
Slide credit: Andrew Moore
Nonlinear SVMs
• General idea: the original input space can always be mapped to some higher-dimensional feature space where the training set is separable, via a lifting transformation Φ: x → φ(x)
Slide credit: Andrew Moore
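To make the lifting idea concrete, here is a small sketch with made-up 1D data: no single threshold separates the classes on the line, but the lifted points φ(x) = (x, x²) are separated by a horizontal line in the plane.

```python
import numpy as np

# Toy 1D dataset (hypothetical values): positives cluster near 0,
# negatives lie on both sides, so no single threshold separates them.
x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
y = np.array([-1, -1, 1, 1, 1, -1, -1])

# Lifting transformation: phi(x) = (x, x^2). In the lifted space the
# positives have small x^2 and the negatives large x^2.
phi = np.stack([x, x ** 2], axis=1)

# A separating hyperplane in the lifted space: w = (0, -1), b = 2,
# i.e. classify by the sign of 2 - x^2.
w, b = np.array([0.0, -1.0]), 2.0
pred = np.sign(phi @ w + b)
separable = bool(np.all(pred == y))
```

The hyperplane (w, b) here is chosen by hand for illustration; an SVM trained on the lifted points would find a similar separator automatically.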
SVM – l1 regularization
• What if the data is not linearly separable?
• We can use regularization to solve this problem
• We solve a new optimization problem and "tune" the regularization parameter C:
    min_{w,b,ξ} ½‖w‖² + C Σ_i ξ_i  subject to  y_i(wᵀx_i + b) ≥ 1 − ξ_i, ξ_i ≥ 0
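As a rough illustration of the soft-margin objective (not the course's liblinear solver), the sketch below minimizes the standard hinge-loss reformulation ½‖w‖² + C Σ_i max(0, 1 − y_i(wᵀx_i + b)) by subgradient descent; the 2D Gaussian-blob data is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2D training data: two well-separated Gaussian blobs,
# labels in {-1, +1}.
X = np.vstack([rng.normal(loc=[2.0, 2.0], size=(50, 2)),
               rng.normal(loc=[-2.0, -2.0], size=(50, 2))])
y = np.hstack([np.ones(50), -np.ones(50)])

# Minimize (1/2)||w||^2 + C * sum_i max(0, 1 - y_i (w.x_i + b)),
# the hinge-loss form of the slide's problem, by subgradient descent.
C, lr, epochs = 1.0, 0.01, 200
w, b = np.zeros(2), 0.0
for _ in range(epochs):
    margins = y * (X @ w + b)
    viol = margins < 1                     # margin violators (xi_i > 0)
    grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
    grad_b = -C * y[viol].sum()
    w -= lr * grad_w
    b -= lr * grad_b

accuracy = np.mean(np.sign(X @ w + b) == y)
```

In practice the packages on the next slide solve this far more carefully; the point here is only that "tuning C" trades margin width against slack on the violators.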
Section 1 -
Jonathan Krause
Solving the SVM
• There are many different packages for solving SVMs
• In PS0 we have you use the liblinear package. It is an efficient implementation, but it can only use a linear kernel: http://www.csie.ntu.edu.tw/~cjlin/liblinear/
• If you wish to have more flexibility in your choice of kernel, you can use the LibSVM package: http://www.csie.ntu.edu.tw/~cjlin/libsvm/
Topics
• Support Vector Machines
• Boosting
  – Viola-Jones face detector
• Linear Algebra Review
  – Notation
  – Operations & Properties
  – Matrix Calculus
• Probability
  – Axioms
  – Basic Properties
  – Bayes Theorem, Chain Rule
Boosting
• It is a sequential procedure. Each data point x_t has a class label y_t ∈ {+1, −1} and a weight, initialized to w_t = 1.
Y. Freund and R. Schapire, A short introduction to boosting, Journal of Japanese Society for Artificial Intelligence, 14(5):771-780, September 1999.
Toy example
• Weak learners come from the family of lines
• A line h with p(error) = 0.5 is at chance
• Each data point has a class label y_t ∈ {+1, −1} and a weight w_t = 1
Toy example
• This one seems to be the best
• This is a 'weak classifier': it performs slightly better than chance
• Each data point has a class label y_t ∈ {+1, −1} and a weight w_t = 1
Toy example
• Each data point has a class label y_t ∈ {+1, −1}
• We update the weights: w_t ← w_t exp{−y_t H_t}
• (The same reweighting step repeats on each round of boosting.)
Toy example
• The strong (non-linear) classifier is built as the combination of all the weak (linear) classifiers f_1, f_2, f_3, f_4
Boosting
• Defines a classifier using an additive model:
    H(x) = α_1 h_1(x) + α_2 h_2(x) + α_3 h_3(x) + …
  where H(x) is the strong classifier, each h_i(x) is a weak classifier, each α_i is its weight, and x is the features vector
Boosting
• Defines a classifier using an additive model:
    H(x) = α_1 h_1(x) + α_2 h_2(x) + α_3 h_3(x) + …
  where H(x) is the strong classifier, each h_i(x) is a weak classifier, each α_i is its weight, and x is the features vector
• We need to define a family of weak classifiers from which the h_i(x) are drawn
Why boosting?
• A simple algorithm for learning robust classifiers
  – Freund & Schapire, 1995
  – Friedman, Hastie, Tibshirani, 1998
• Provides an efficient algorithm for sparse visual feature selection
  – Tieu & Viola, 2000
  – Viola & Jones, 2003
• Easy to implement; doesn't require external optimization tools
Boosting - mathematics
• Weak learners: given a rectangle feature with value f_j(x), a threshold θ_j, and a parity p_j,
    h_j(x) = 1 if p_j f_j(x) < p_j θ_j, and 0 otherwise
• Final strong classifier:
    h(x) = 1 if Σ_{t=1}^T α_t h_t(x) ≥ ½ Σ_{t=1}^T α_t, and 0 otherwise
Weak classifier
• 4 kinds of rectangle filters
• Value = ∑(pixels in white area) − ∑(pixels in black area)
Slide credit: S. Lazebnik
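The rectangle sums above are what Viola & Jones compute with an integral image, so each filter costs only a handful of array lookups. A minimal sketch on a made-up 4x4 "image"; the two-rectangle left/right feature is assumed here purely for illustration.

```python
import numpy as np

def integral_image(img):
    """Cumulative sums so any rectangle sum needs only 4 lookups."""
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, r0, c0, r1, c1):
    """Sum of img[r0:r1, c0:c1], computed from the integral image ii."""
    total = ii[r1 - 1, c1 - 1]
    if r0 > 0:
        total -= ii[r0 - 1, c1 - 1]
    if c0 > 0:
        total -= ii[r1 - 1, c0 - 1]
    if r0 > 0 and c0 > 0:
        total += ii[r0 - 1, c0 - 1]
    return total

# Two-rectangle (left/right) feature on a toy 4x4 "image":
# value = sum(white area) - sum(black area)
img = np.arange(16, dtype=float).reshape(4, 4)
ii = integral_image(img)
white = rect_sum(ii, 0, 0, 4, 2)   # left half of the window
black = rect_sum(ii, 0, 2, 4, 4)   # right half of the window
value = white - black
```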
Weak classifier
[Figure: a rectangle filter applied to a source image and the resulting response]
Slide credit: S. Lazebnik
Viola & Jones algorithm
• Training examples (x_1, 1), (x_2, 1), (x_3, 0), (x_4, 0), (x_5, 0), (x_6, 0), …, (x_n, y_n)
• Weak classifier with feature value f_j(x), threshold θ_j, and parity p_j:
    h_j(x) = 1 if p_j f_j(x) < p_j θ_j, and 0 otherwise
1. Evaluate each rectangle filter on each example (e.g. filter values 0.8, 0.7, 0.2, 0.3, 0.8, 0.1)
P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. CVPR 2001.
Viola & Jones algorithm
• For a 24x24 detection region, the number of possible rectangle features is ~160,000
P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. CVPR 2001.
Viola & Jones algorithm
2. Select the best filter/threshold combination:
   a. Normalize the weights:
        w_{t,i} ← w_{t,i} / Σ_{j=1}^n w_{t,j}
   b. For each feature j, compute the weighted error:
        ε_j = Σ_i w_i |h_j(x_i) − y_i|,  where h_j(x) = 1 if p_j f_j(x) < p_j θ_j, and 0 otherwise
   c. Choose the classifier h_t with the lowest error ε_t
3. Reweight the examples:
        w_{t+1,i} = w_{t,i} β_t^{1−e_i},  where β_t = ε_t / (1 − ε_t),
        and e_i = 0 if example x_i is classified correctly by h_t, e_i = 1 otherwise
P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. CVPR 2001.
Viola & Jones algorithm
4. The final strong classifier is
     h(x) = 1 if Σ_{t=1}^T α_t h_t(x) ≥ ½ Σ_{t=1}^T α_t, and 0 otherwise,  where α_t = log(1/β_t)
• The final hypothesis is a weighted linear combination of the T hypotheses, where the weights are inversely proportional to the training errors
P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. CVPR 2001.
Boosting for face detection
• For each round of boosting:
  1. Evaluate each rectangle filter on each example
  2. Select the best filter/threshold combination
  3. Reweight the examples
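The three steps above, together with the final weighted vote, can be sketched end to end. The 1D training set and the threshold "stumps" below are hypothetical stand-ins for rectangle-feature classifiers; the weighting scheme follows the Viola-Jones update.

```python
import numpy as np

# Toy training set (hypothetical feature values), labels in {0, 1}
# as in Viola-Jones: three "faces" with large filter response,
# three "non-faces" with small response.
x = np.array([0.8, 0.7, 0.2, 0.3, 0.8, 0.1])
y = np.array([1, 1, 0, 0, 1, 0])
n = len(x)

w = np.ones(n) / n                # example weights
T = 3                             # number of boosting rounds
alphas, stumps = [], []

for t in range(T):
    w = w / w.sum()                              # a. normalize the weights
    best = None
    for theta in np.unique(x):                   # b. weighted error of each
        for p in (1, -1):                        #    threshold/parity stump
            h = (p * x < p * theta).astype(int)
            err = np.sum(w * np.abs(h - y))
            if best is None or err < best[0]:
                best = (err, theta, p)
    err, theta, p = best                         # c. lowest-error stump
    beta = max(err, 1e-10) / (1 - err)           # clamp so beta > 0
    h = (p * x < p * theta).astype(int)
    e = np.abs(h - y)                            # 0 if correct, 1 if not
    w = w * beta ** (1 - e)                      # 3. reweight the examples
    alphas.append(np.log(1 / beta))
    stumps.append((theta, p))

def strong(xq):
    """Final strong classifier: weighted vote of the weak stumps."""
    s = sum(a * (p * xq < p * th).astype(int)
            for a, (th, p) in zip(alphas, stumps))
    return (s >= 0.5 * sum(alphas)).astype(int)

train_acc = np.mean(strong(x) == y)
```

Real Viola-Jones training differs mainly in scale: hundreds of thousands of rectangle features in place of one scalar, and a cascade of such classifiers.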
Section 1 -
Jonathan Krause
The implemented system
• Training data
  – 5000 faces
    • All frontal, rescaled to 24x24 pixels
  – 300 million non-faces
    • 9500 non-face images
  – Faces are normalized
    • Scale, translation
• Many variations
  – Across individuals
  – Illumination
  – Pose
P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. CVPR 2001.
System performance
• Training time: "weeks" on a 466 MHz Sun workstation
• 38 layers, 6061 features in total
• An average of 10 features evaluated per window on the test set
• "On a 700 Mhz Pentium III processor, the face detector can process a 384 by 288 pixel image in about .067 seconds"
  – 15 Hz
  – 15 times faster than the previous detector of comparable accuracy (Rowley et al., 1998)
P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. CVPR 2001.
Output of Face Detector on Test Images
[Figure: face-detector outputs on test images]
P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. CVPR 2001.
Topics
• Support Vector Machines
• Boosting
  – Viola-Jones face detector
• Linear Algebra Review
  – Notation
  – Operations & Properties
  – Matrix Calculus
• Probability
  – Axioms
  – Basic Properties
  – Bayes Theorem, Chain Rule
Linear Algebra in Computer Vision
• Representation
  – 3D points in the scene
  – 2D points in the image (images are matrices)
• Transformations
  – Mapping 2D to 2D
  – Mapping 3D to 2D
Notation
• We adopt the notation A ∈ ℝ^(m×n) for a real-valued matrix with m rows and n columns
• We adopt the notation x ∈ ℝ^n for a column vector, and xᵀ for a row vector
Notation
• To indicate the element in the ith row and jth column of a matrix A we use A_ij
• Similarly, to indicate the ith entry in a vector x we use x_i
Norms
• Intuitively, the norm of a vector is a measure of its "length"
• The l2 norm is defined as ‖x‖₂ = √(Σ_i x_i²). In this class we will use the l2 norm unless otherwise noted, so we drop the 2 subscript on the norm for convenience.
• Note that ‖x‖² = xᵀx
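A quick numerical check of these definitions; the example vector is arbitrary.

```python
import numpy as np

x = np.array([3.0, 4.0])

# l2 norm from the definition: sqrt of the sum of squared entries
l2 = np.sqrt(np.sum(x ** 2))

# NumPy's built-in (np.linalg.norm defaults to the l2 norm)
builtin = np.linalg.norm(x)

# The squared l2 norm equals the inner product x^T x
sq = x @ x
```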
Linear Independence and Rank
• A set of vectors is linearly independent if no vector in the set can be represented as a linear combination of the remaining vectors in the set
• The rank of a matrix is the maximal number of linearly independent columns (or rows) of the matrix
• For A ∈ ℝ^(m×n), rank(A) ≤ min(m, n); if rank(A) = min(m, n), A is said to be full rank
• rank(A) = rank(Aᵀ)
• rank(AB) ≤ min(rank(A), rank(B))
• rank(A + B) ≤ rank(A) + rank(B)
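A small check of the rank definition with NumPy; the matrix is a made-up example whose third column is the sum of the first two, so only two columns are linearly independent.

```python
import numpy as np

# Column 3 = column 1 + column 2, so the columns are linearly
# dependent and the rank is 2, not 3.
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 2.0]])

rank = np.linalg.matrix_rank(A)
```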
Range and Nullspace
• The range of a matrix A ∈ ℝ^(m×n) is the span of the columns of the matrix, denoted by the set R(A) = {v ∈ ℝ^m : v = Ax for some x ∈ ℝ^n}
• The nullspace of a matrix A is the set of vectors that when multiplied by the matrix result in 0, given by the set N(A) = {x ∈ ℝ^n : Ax = 0}
Eigenvalues and Eigenvectors
• Given a square matrix A, λ and x ≠ 0 are said to be an eigenvalue and the corresponding eigenvector of the matrix if Ax = λx
• We can solve for the eigenvalues by solving for the roots of the characteristic polynomial det(λI - A) = 0
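The definitions are easy to verify numerically. The 2x2 matrix below is an arbitrary example; its characteristic polynomial (λ - 2)² - 1 has roots 1 and 3.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# np.linalg.eig returns eigenvalues and eigenvectors (as columns of vecs)
vals, vecs = np.linalg.eig(A)

# Each pair satisfies the defining equation A v = lambda v
ok = all(np.allclose(A @ v, lam * v) for lam, v in zip(vals, vecs.T))

# The eigenvalues are the roots of det(lambda I - A) = 0; for this A
# the characteristic polynomial is (lambda - 2)^2 - 1, with roots 1, 3.
roots = np.sort(vals.real)
```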
Eigenvalue Properties
• The trace of A equals the sum of its eigenvalues, and the determinant of A equals their product
• The rank of a matrix is equal to the number of its non-zero eigenvalues
• The eigenvalues of a diagonal matrix D = diag(d_1, …, d_n) are simply the diagonal entries d_1, …, d_n
• A matrix A is said to be diagonalizable if we can write A = PDP⁻¹, with D diagonal
Eigenvalues & Eigenvectors of Symmetric Matrices
• Eigenvalues of symmetric matrices are real
• Eigenvectors of symmetric matrices are orthonormal
• Consider the optimization problem max_x xᵀAx subject to ‖x‖ = 1, where A is symmetric: the maximizing x is the eigenvector corresponding to the largest eigenvalue
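A numerical illustration of these properties on a random symmetric matrix; sampling random unit vectors is used here as a crude stand-in for solving the constrained problem, just to show that nothing beats the top eigenvector.

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.normal(size=(4, 4))
A = (B + B.T) / 2                    # make A symmetric

# eigh is specialized for symmetric matrices: real eigenvalues,
# orthonormal eigenvectors, eigenvalues returned in ascending order
vals, vecs = np.linalg.eigh(A)

# Eigenvectors are orthonormal: V^T V = I
orthonormal = np.allclose(vecs.T @ vecs, np.eye(4))

# The top eigenvector maximizes x^T A x over unit vectors;
# 200 random unit vectors never exceed that value
top = vals[-1]
x_star = vecs[:, -1]
rand = rng.normal(size=(200, 4))
rand = rand / np.linalg.norm(rand, axis=1, keepdims=True)
best_random = np.max(np.einsum('ij,jk,ik->i', rand, A, rand))
```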
Generalized Eigenvalues
• Generalized eigenvalue problem: Ax = λBx
• Generalized eigenvalues must satisfy det(A - λB) = 0
• This reduces to the original eigenvalue problem when B⁻¹ exists
• Generalized eigenvalues are used in Fisherfaces
Singular Value Decomposition (SVD)
• The SVD of a matrix A ∈ ℝ^(m×n) is given by A = UΣVᵀ
• The columns u_i of U are called the left singular vectors; Σ is a diagonal matrix whose values σ_i are called the singular values; the columns v_i of V are called the right singular vectors
SVD
• If the matrix A has rank r, then A has r non-zero singular values
• u_1, …, u_r are an orthonormal basis for the range R(A)
• v_{r+1}, …, v_n are an orthonormal basis for the nullspace N(A)
• The singular values of A are the square roots of the non-zero eigenvalues of AᵀA or AAᵀ
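These relationships are easy to confirm numerically on an arbitrary random matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))

# Thin SVD: A = U Sigma V^T
U, s, Vt = np.linalg.svd(A, full_matrices=False)
reconstructed = np.allclose(U @ np.diag(s) @ Vt, A)

# Singular values are the square roots of the eigenvalues of A^T A
# (eigvalsh returns ascending order; svd returns descending)
eigs = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]
match = np.allclose(s, np.sqrt(np.maximum(eigs, 0)))
```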
Matlab
• [V,D] = eig(A)   The eigenvectors of A are the columns of V. D is a diagonal matrix whose entries are the eigenvalues of A.
• [V,D] = eig(A,B) The generalized eigenvectors are the columns of V. D is a diagonal matrix whose entries are the generalized eigenvalues.
• [U,S,V] = svd(X) The columns of U are the left singular vectors of X. S is a diagonal matrix whose entries are the singular values of X. The columns of V are the right singular vectors of X. Recall X = U*S*V';
Matrix Calculus -- Gradient
• Let f : ℝ^(m×n) → ℝ; then the gradient of f with respect to A is the m×n matrix of partial derivatives, (∇_A f(A))_ij = ∂f(A)/∂A_ij
• ∇_A f(A) is always the same size as A; thus, if we just have a vector x, the gradient is simply the vector with entries (∇_x f(x))_i = ∂f(x)/∂x_i
Gradients
• The gradient collects all the partial derivatives of f
• Some common gradients:
    ∇_x (aᵀx) = a
    ∇_x (xᵀAx) = (A + Aᵀ)x   (= 2Ax when A is symmetric)
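The two common gradients can be sanity-checked against a central-difference approximation; the matrix, vector, and evaluation point below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
a = rng.normal(size=3)
x = rng.normal(size=3)

def numerical_grad(f, x, eps=1e-6):
    """Central-difference approximation of the gradient of f at x."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        d = np.zeros_like(x)
        d[i] = eps
        g[i] = (f(x + d) - f(x - d)) / (2 * eps)
    return g

# gradient of a^T x is a
ok1 = np.allclose(numerical_grad(lambda v: a @ v, x), a, atol=1e-5)

# gradient of x^T A x is (A + A^T) x
ok2 = np.allclose(numerical_grad(lambda v: v @ A @ v, x),
                  (A + A.T) @ x, atol=1e-5)
```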
Topics
• Support Vector Machines
• Boosting
  – Viola-Jones face detector
• Linear Algebra Review
  – Notation
  – Operations & Properties
  – Matrix Calculus
• Probability
  – Axioms
  – Basic Properties
  – Bayes Theorem, Chain Rule
Probability in Computer Vision
• Foundation for algorithms to solve
  – Tracking problems
  – Human activity recognition
  – Object recognition
  – Segmentation
Probability Axioms
• Sample space: the set of all the outcomes of a random experiment, denoted by Ω
• Event space: a set F whose elements A ∈ F (called events) are subsets of Ω
• Probability measure: a function P : F → ℝ that satisfies
  – P(A) ≥ 0, for all A ∈ F
  – P(Ω) = 1
  – If A_1, A_2, … are disjoint events, then P(∪_i A_i) = Σ_i P(A_i)
Basic Properties
• If A ⊆ B, then P(A) ≤ P(B)
• P(A ∩ B) ≤ min(P(A), P(B))
• P(A ∪ B) ≤ P(A) + P(B)   (union bound)
• P(Ω \ A) = 1 - P(A)
• If A_1, …, A_k are disjoint events with ∪_i A_i = Ω, then Σ_i P(A_i) = 1 (law of total probability)
Conditional Probability
• P(A|B) = P(A ∩ B) / P(B), defined when P(B) > 0
• Two events are independent if P(A ∩ B) = P(A)P(B), or equivalently P(A|B) = P(A)
• Conditional independence: A and B are conditionally independent given C if P(A ∩ B | C) = P(A|C) P(B|C)
Product Rule
• From the definition of conditional probability we can write P(A ∩ B) = P(A|B) P(B)
• From the product rule we can derive the chain rule of probability:
    P(A_1 ∩ … ∩ A_n) = P(A_1) P(A_2|A_1) P(A_3|A_1 ∩ A_2) … P(A_n|A_1 ∩ … ∩ A_{n-1})
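A tiny worked check of the product rule, using a hypothetical draw of two cards without replacement from a four-card deck (2 red, 2 black):

```python
# Product rule: P(1st red AND 2nd red) = P(2nd red | 1st red) * P(1st red)
p_first_red = 2 / 4                     # 2 red cards out of 4
p_second_red_given_first = 1 / 3        # 1 red left out of 3
product_rule = p_second_red_given_first * p_first_red   # 1/6

# Brute force over all equally likely ordered draws (i, j), i != j
deck = ['R', 'R', 'B', 'B']
draws = [(i, j) for i in range(4) for j in range(4) if i != j]
both_red = sum(deck[i] == 'R' and deck[j] == 'R' for i, j in draws)
empirical = both_red / len(draws)
```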
Bayes Theorem
• P(A|B) = P(B|A) P(A) / P(B)
  – P(A|B): posterior probability
  – P(B|A): likelihood
  – P(A): prior probability
  – P(B): normalizing constant
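A short worked example with made-up diagnostic-test numbers, combining Bayes theorem with the law of total probability to get the normalizing constant:

```python
# Hypothetical numbers: prior P(D) = 0.01, likelihood P(+|D) = 0.9,
# false-positive rate P(+|not D) = 0.05.
p_d = 0.01
p_pos_given_d = 0.9
p_pos_given_not_d = 0.05

# Normalizing constant P(+) via the law of total probability
p_pos = p_pos_given_d * p_d + p_pos_given_not_d * (1 - p_d)

# Posterior P(D|+) = P(+|D) P(D) / P(+)
posterior = p_pos_given_d * p_d / p_pos
```

Even with a fairly accurate test, the small prior keeps the posterior around 0.15, which is the classic illustration of why the prior term matters.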