Fisher kernels for image representation & generative classification models
Jakob Verbeek
December 11, 2009
Plan for this course
1) Introduction to machine learning
2) Clustering techniques: k-means, Gaussian mixture density
3) Gaussian mixture density continued: parameter estimation with EM
4) Classification techniques 1: introduction, generative methods, semi-supervised, Fisher kernels
5) Classification techniques 2: discriminative methods, kernels
6) Decomposition of images: topic models, …
Classification
• Training data consists of “inputs”, denoted x, and corresponding output “class labels”, denoted y.
• The goal is to correctly predict the class label of a test data input.
• Learn a “classifier” f(x) from the input data that outputs the class label, or a probability over the class labels.
• Example:
  – Input: image
  – Output: category label, e.g. “cat” vs. “no cat”
• Classification can be binary (two classes) or over a larger number of classes (multi-class).
  – In binary classification we often refer to one class as “positive” and the other as “negative”.
• A binary classifier creates a boundary in the input space between the areas assigned to each class.
Example of classification
• Given: training images and their categories.
• What are the categories of these test images?
Discriminative vs. generative methods
• Generative probabilistic methods
  – Model the density of inputs x from each class: p(x|y)
  – Estimate the class prior probability p(y)
  – Use Bayes’ rule to infer the distribution over classes given the input
• Discriminative (probabilistic) methods
  – Directly estimate the class probability given the input: p(y|x)
  – Some methods do not have a probabilistic interpretation,
    • e.g. they fit a function f(x), and assign to class 1 if f(x) > 0, and to class 2 if f(x) < 0
p(y|x) = \frac{p(x|y)\,p(y)}{p(x)}, \qquad p(x) = \sum_y p(y)\,p(x|y)
Generative classification methods
• Generative probabilistic methods
  – Model the density of inputs x from each class: p(x|y)
  – Estimate the class prior probability p(y)
  – Use Bayes’ rule to infer the distribution over classes given the input
• Modeling class-conditional densities over the inputs x
  – Selection of the model class:
    • Parametric models: e.g. Gaussian (for continuous data), Bernoulli (for binary data), …
    • Semi-parametric models: mixtures of Gaussians, mixtures of Bernoullis, …
    • Non-parametric models: histograms over one- or multi-dimensional data, nearest-neighbor method, kernel density estimator
• Given the class-conditional model, classification is trivial: just apply Bayes’ rule
• Adding a new class amounts to adding a new class-conditional model
  – Existing class-conditional models stay as they are
p(y|x) = \frac{p(x|y)\,p(y)}{p(x)}, \qquad p(x) = \sum_c p(y=c)\,p(x|y=c)
Histogram methods
• Suppose we
  – have N data points
  – use a histogram with C cells
• How to set the density level in each cell?
  – Maximum (log-)likelihood estimator:
  – proportional to the number of points n_c in the cell
  – inversely proportional to the volume V_c of the cell
• Problems with the histogram method:
  – the number of cells scales exponentially with the dimension of the data
  – discontinuous density estimate
  – how to choose the cell size?
p_c = \frac{n_c}{N V_c}
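A minimal numerical sketch of this estimator on 1D toy data (not part of the original slides; the cell count and the random data are illustrative assumptions):

```python
# Histogram density estimate p_c = n_c / (N * V_c), sketched in 1D with numpy.
import numpy as np

def histogram_density(x, num_cells=10):
    """Return cell edges and the density level in each cell."""
    lo, hi = x.min(), x.max()
    edges = np.linspace(lo, hi, num_cells + 1)
    counts, _ = np.histogram(x, bins=edges)          # n_c: points per cell
    volume = (hi - lo) / num_cells                   # V_c: cell width in 1D
    density = counts / (len(x) * volume)             # p_c = n_c / (N * V_c)
    return edges, density

x = np.random.randn(1000)                            # toy 1D data
edges, p = histogram_density(x, num_cells=20)
print(p.sum() * (edges[1] - edges[0]))               # ~1.0: the estimate integrates to one
```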
The ‘curse of dimensionality’
• The number of bins increases exponentially with the dimensionality of the data.
  – Fine division of each dimension: many empty bins
  – Rough division of each dimension: poor density model
• A probability distribution over D discrete variables takes at least 2^D values
  – at least 2 values for each variable
• The number of cells can be reduced by assuming independence between the components of x: the naïve Bayes model
  – The model is “naïve” since it assumes that all variables are independent…
  – Unrealistic for high-dimensional data, where variables tend to be dependent
  – Poor density estimator
  – Classification performance can still be good using the derived p(y|x)
Example of generative classification
• Hand-written digit classification
  – Input: binary 28×28 scanned digit images, collected in a 784-dimensional vector
  – Desired output: class label of the image
• Generative model
  – Independent Bernoulli model for each class
  – One probability per pixel per class
  – The maximum likelihood estimator is the average pixel value per class
• Classify using Bayes’ rule:
p(x|y=c) = \prod_{d=1}^{D} p(x_d|y=c), \qquad p(x_d=1|y=c) = \theta_{dc}, \quad p(x_d=0|y=c) = 1 - \theta_{dc}

p(y|x) = \frac{p(x|y)\,p(y)}{p(x)}, \qquad p(x) = \sum_c p(y=c)\,p(x|y=c)
k-nearest-neighbor estimation method
• Idea: fix the number of samples in the cell, and find the right cell size.
• The probability to find a point in a sphere A centered on x with volume v is P = \int_A p(x')\,dx'
• A smooth density is approximately constant in a small region, and thus P \approx p(x)\,v
• Alternatively, estimate P from the fraction of training data in the sphere centered on x: P \approx k/N
• Combining the above gives the estimate p(x) \approx \frac{k}{N v}
k-nearest-neighbor estimation method
• Method in practice:
  – Choose k
  – For a given x, compute the volume v that contains k samples
  – Estimate the density as p(x) = \frac{k}{N v}
• The volume of a sphere with radius r in d dimensions is v(r,d) = \frac{\pi^{d/2}\, r^d}{\Gamma(d/2 + 1)}
• What effect does k have?
  – Data sampled from a mixture of Gaussians is plotted in green
  – Larger k: larger region, smoother estimate
• Selection of k
  – Leave-one-out cross-validation
  – Select the k that maximizes the data log-likelihood
k-nearest-neighbor classification rule
• Use k-nearest-neighbor density estimation to find p(x|category)
• Apply Bayes’ rule for classification: the k-nearest-neighbor classification rule
  – Find the sphere volume v that captures k data points around x
  – Use the same sphere for each class to estimate the class-conditional densities
  – Estimate the global class priors from the class frequencies
  – Calculate the class posterior distribution:
p(x|y=c) = \frac{k_c}{N_c v}, \qquad p(y=c) = \frac{N_c}{N}, \qquad p(x) = \frac{k}{N v}

p(y=c|x) = \frac{p(x|y=c)\,p(y=c)}{p(x)} = \frac{k_c}{N_c v}\cdot\frac{N_c}{N}\cdot\frac{N v}{k} = \frac{k_c}{k}
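The sketch below (not part of the original slides) implements both the k-NN density estimate p(x) ≈ k/(Nv) and the resulting classification rule p(y=c|x) = k_c/k with numpy and scipy; the training arrays X_train, y_train and the query point x are assumed inputs.

```python
# k-nearest-neighbor density estimation and classification, sketched with numpy/scipy.
import numpy as np
from scipy.special import gammaln

def ball_volume(r, d):
    # v(r, d) = pi^(d/2) r^d / Gamma(d/2 + 1), computed in log-space for stability
    return np.exp(0.5 * d * np.log(np.pi) + d * np.log(r) - gammaln(0.5 * d + 1))

def knn_density(x, X_train, k=5):
    N, d = X_train.shape
    dist = np.linalg.norm(X_train - x, axis=1)
    r = np.sort(dist)[k - 1]                   # radius of the sphere containing k points (assumed > 0)
    return k / (N * ball_volume(r, d))         # p(x) ~ k / (N v)

def knn_classify(x, X_train, y_train, k=5):
    dist = np.linalg.norm(X_train - x, axis=1)
    idx = np.argsort(dist)[:k]                 # the k nearest neighbors define the sphere
    labels, counts = np.unique(y_train[idx], return_counts=True)
    posterior = counts / k                     # p(y=c|x) = k_c / k
    return labels[np.argmax(posterior)]
```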
k-nearest-neighbor classification rule
• Effect of k on the classification boundary
  – Larger number of neighbors
  – Larger regions
  – Smoother class boundaries
Kernel density estimation methods
• Consider a simple estimator of the cumulative distribution function: F(x) = \frac{1}{N}\sum_{n=1}^{N} \mathbf{1}[x_n \le x]
• Its derivative gives an estimator of the density function, but this is just a set of delta peaks.
• The derivative is defined as p(x) = \lim_{h \to 0} \frac{F(x+h) - F(x-h)}{2h}
• Consider a non-limiting value of h: p(x) \approx \frac{1}{2hN}\sum_{n=1}^{N} \mathbf{1}[\,|x - x_n| \le h\,]
• Each data point adds 1/(2hN) in a region of size h on either side of it; the sum of these “blocks” gives the estimate.
Kernel density estimation methods
• Other functions than the “block” can be used to obtain a smooth estimator.
• A widely used kernel function is the (multivariate) Gaussian
  – The contribution decreases smoothly as a function of the distance to the data point.
• Choice of the smoothing parameter
  – A larger “kernel” size gives a smoother density estimator
  – Use the average distance between samples.
  – Use cross-validation.
• The method can also be used for multivariate data
  – or within a naïve Bayes model
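A minimal 1D sketch of a Gaussian kernel density estimator (not part of the original slides); the bandwidth h and the toy data are illustrative assumptions.

```python
# Gaussian kernel density estimation in 1D: each training point contributes a smooth bump.
import numpy as np

def gaussian_kde(x_query, x_train, h=0.3):
    diffs = (x_query[:, None] - x_train[None, :]) / h          # shape (Q, N)
    kernels = np.exp(-0.5 * diffs ** 2) / np.sqrt(2 * np.pi)   # standard normal kernel
    return kernels.mean(axis=1) / h                            # p(x) = (1/(N h)) sum_n K((x - x_n)/h)

x_train = np.random.randn(500)                                 # toy 1D data
grid = np.linspace(-4, 4, 200)
p = gaussian_kde(grid, x_train, h=0.3)
print(np.trapz(p, grid))                                       # ~1: the estimate integrates to one
```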
Summary of generative classification methods
• (Semi-)parametric models (e.g. p(data|category) is a Gaussian or a mixture)
  – No need to store the data, but possibly too strong assumptions on the data density
  – Can lead to a poor fit on the data, and poor classification results
• Non-parametric models (histogram, k-NN, kernel density estimation)
  – Histograms:
    • only practical in low-dimensional spaces (dimension < 5 or so)
    • a high-dimensional space leads to many cells, many of which will be empty
    • naïve Bayes modeling helps in higher-dimensional cases
  – k-nearest-neighbor & kernel density estimation:
    • need to store all training data
    • need to find the nearest neighbors, or the points with non-zero kernel evaluation (costly)
Discriminative vs. generative methods
• Generative probabilistic methods
  – Model the density of inputs x from each class: p(x|y)
  – Estimate the class prior probability p(y)
  – Use Bayes’ rule to infer the distribution over classes given the input
• Discriminative (probabilistic) methods (next week)
  – Directly estimate the class probability given the input: p(y|x)
  – Some methods do not have a probabilistic interpretation,
    • e.g. they fit a function f(x), and assign to class 1 if f(x) > 0, and to class 2 if f(x) < 0
• Hybrid generative-discriminative models
  – Fit a density model to the data
  – Use properties of this model as input for a classifier
  – Example: Fisher vectors for image representation
p(y|x) = \frac{p(x|y)\,p(y)}{p(x)}, \qquad p(x) = \sum_y p(y)\,p(x|y)
Clustering for visual vocabulary construction
• Clustering of local image descriptors
  – using k-means or a mixture of Gaussians
• Recap of the image representation pipeline
  – Extract image regions at various locations and scales
  – Compute a descriptor for each region (e.g. SIFT)
  – (Softly) assign each descriptor to the clusters
  – Make a histogram for the complete image by summing the vector representations of all descriptors
  – This histogram is the input to the image classification method (illustrated below)
Assignments of image regions (rows) to cluster indexes (columns); the final row is the per-image histogram obtained by summing over regions.

Hard assignment (k-means):
  0    1    0    0    0    0    0    0
  0    0    0    0    1    0    0    0
  …    …    …    …    …    …    …    …
  1   20    0    0   10    3    2    1   (histogram)

Soft assignment (mixture of Gaussians):
  0    .5   .5   0    0    0    0    0
  0    0    0    0    .9   0    .1   0
  …    …    …    …    …    …    …    …
  1.2 20.0  0.2  0.6 10.0  2.5  2.5  1.0  (histogram)
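The sketch below (an illustration, not the original course code) reproduces this hard/soft assignment step with scikit-learn, which is an assumed dependency; the random descriptors stand in for the SIFT features of one image.

```python
# Bag-of-visual-words: assign region descriptors to K visual words and sum into one histogram.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

descriptors = np.random.rand(1000, 128)          # toy stand-in for the SIFT descriptors of one image
K = 8                                            # vocabulary size (illustrative)

# Hard assignment (k-means): each descriptor contributes a count of 1 to its nearest cluster
kmeans = KMeans(n_clusters=K, n_init=10).fit(descriptors)
hard_hist = np.bincount(kmeans.predict(descriptors), minlength=K)

# Soft assignment (mixture of Gaussians): posterior responsibilities summed per cluster
gmm = GaussianMixture(n_components=K, covariance_type='diag').fit(descriptors)
soft_hist = gmm.predict_proba(descriptors).sum(axis=0)

print(hard_hist)       # integer counts, sums to the number of regions
print(soft_hist)       # fractional counts, also sums to the number of regions
```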
Fisher Vector motivation
• Feature vector quantization is computationally expensive in practice
• The run-time is linear in
  – N: nr. of feature vectors, ~10^3 per image
  – D: nr. of dimensions, ~10^2 (SIFT)
  – K: nr. of clusters, ~10^3 for recognition
• So in total on the order of 10^8 multiplications per image, just to assign the SIFT descriptors to visual words
• And we then only use the histogram of visual word counts
• Can we do this more efficiently?
• Reading material: “Fisher Kernels on Visual Vocabularies for Image Categorization”, F. Perronnin and C. Dance, CVPR 2007, Xerox Research Centre Europe, Meylan
Fisher vector image representation
• MoG / k-means stores the number of points per cell
  – Many clusters are needed to represent the distribution of descriptors in an image
  – But this increases the computational cost
• The Fisher vector adds 1st & 2nd order moments
  – More precise description of the regions assigned to each cluster
  – Fewer clusters are needed for the same accuracy
  – The representation is (2D+1) times larger, at the same computational cost
  – The required terms are already calculated when computing the soft assignment
\sum_n q_{nk}, \qquad \sum_n q_{nk}\,(x_n - m_k), \qquad \sum_n q_{nk}\,(x_n - m_k)^2

q_{nk}: soft assignment of image region n to cluster k (Gaussian mixture component)
Image representation using Fisher kernels
• General idea of the Fisher vector representation
  – Fit a probabilistic model to the data
  – Use the derivative of the data log-likelihood as the data representation, e.g. for classification
    [Jaakkola & Haussler, “Exploiting generative models in discriminative classifiers”, Advances in Neural Information Processing Systems 11, 1999]
• Here, we use a mixture of Gaussians to cluster the region descriptors
• Concatenate the derivatives to obtain the data representation:
L = \sum_{n=1}^{N} \log p(x_n) = \sum_{n=1}^{N} \log \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_n; m_k, C_k)

\frac{\partial L}{\partial \pi_k} = \frac{1}{\pi_k} \sum_{n=1}^{N} q_{nk}

\frac{\partial L}{\partial m_k} = C_k^{-1} \sum_{n=1}^{N} q_{nk}\,(x_n - m_k)

\frac{\partial L}{\partial C_k^{-1}} = \sum_{n=1}^{N} q_{nk} \left( \tfrac{1}{2} C_k - \tfrac{1}{2} (x_n - m_k)(x_n - m_k)^T \right)
Image representation using Fisher kernels
• Extended representation of the image descriptors using a MoG
  – Displacement of the descriptor from the cluster center
  – Squares of the displacement from the center
  – From 1 number per descriptor per cluster to 1 + D + D^2 (D = data dimension)
• A simplified version is obtained when
  – using this representation with a linear classifier
  – using diagonal covariance matrices, with the variance in each dimension given by the vector v_k
  – For a single image region descriptor, the terms are given below
• Summed over all descriptors this gives, per cluster:
  • 1: soft count of the regions assigned to the cluster
  • D: weighted average of the assigned descriptors
  • D: weighted variance of the assigned descriptors in all dimensions
q_{nk}, \qquad q_{nk}\,(x_n - m_k), \qquad q_{nk}\,(x_n - m_k)^2
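The sketch below (an illustration under assumptions, not the paper's implementation) computes exactly these per-cluster statistics, the soft count, the weighted first-order moment, and the weighted second-order moment, using a diagonal-covariance Gaussian mixture fitted with scikit-learn, which is an assumed dependency.

```python
# Fisher-vector style per-cluster statistics from the soft assignments q_nk of a fitted MoG.
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector_stats(descriptors, gmm):
    """descriptors: (N, D) region descriptors of one image; gmm: fitted GaussianMixture."""
    q = gmm.predict_proba(descriptors)                       # q_nk, shape (N, K)
    diff = descriptors[:, None, :] - gmm.means_[None, :, :]  # x_n - m_k, shape (N, K, D)
    s0 = q.sum(axis=0)                                       # K soft counts: sum_n q_nk
    s1 = np.einsum('nk,nkd->kd', q, diff)                    # (K, D): sum_n q_nk (x_n - m_k)
    s2 = np.einsum('nk,nkd->kd', q, diff ** 2)               # (K, D): sum_n q_nk (x_n - m_k)^2
    return np.concatenate([s0, s1.ravel(), s2.ravel()])      # (2D + 1) numbers per cluster

descriptors = np.random.rand(500, 64)                        # toy stand-in for region descriptors
gmm = GaussianMixture(n_components=16, covariance_type='diag').fit(descriptors)
fv = fisher_vector_stats(descriptors, gmm)
print(fv.shape)                                              # (16 * (2*64 + 1),) = (2064,)
```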
Fisher vector image representation
• MoG / k-means stores the number of points per cell
  – Many clusters are needed to represent the distribution of descriptors in an image
• The Fisher vector adds 1st & 2nd order moments
  – More precise description of the regions assigned to each cluster
  – Fewer clusters are needed for the same accuracy
  – The representation is (2D+1) times larger, at the same computational cost
  – The required terms are already calculated when computing the soft assignment
  – The computational cost is O(NKD): we need the differences between all clusters and all data points
\sum_n q_{nk}, \qquad \sum_n q_{nk}\,(x_n - m_k), \qquad \sum_n q_{nk}\,(x_n - m_k)^2
Images from the categorization task PASCAL VOC
• Yearly “competition” for image classification (also object localization, segmentation, and body-part localization)
Fisher Vector: results
• BOV-supervised learns a separate mixture model for each image class, so that some of the visual words are class-specific
• MAP: assign the image to the class whose MoG assigns maximum likelihood to the region descriptors
• Other results: based on a linear classifier over the image descriptions
• Similar performance, using 16× fewer Gaussians
• The unsupervised/universal representation works well
Plan for this course
1) Introduction to machine learning
2) Clustering techniques: k-means, Gaussian mixture density
3) Gaussian mixture density continued: parameter estimation with EM
4) Classification techniques 1: introduction, generative methods, semi-supervised
   – Reading for next week: previous papers (!), nothing new; available on the course website http://lear.inrialpes.fr/~verbeek/teaching
5) Classification techniques 2: discriminative methods, kernels
6) Decomposition of images: topic models, …