machine learning in machine vision - ernetval.serc.iisc.ernet.in/dav/ml_in_vision.pdf · machine...

Machine Learning in Machine

Vision

R. Venkatesh Babu

Video Analytics Lab, SERC

Indian Institute of Science, Bangalore

Can Machines Replace Humans?

Semantic Gap

How do we interpret image data?

What is an Image?

What do we see?

What is an Image?

What do machines see?

Semantic Gap

Organization

• Machine Vision – Challenges

• Discriminative and Generative Approaches

• ML Applications in Vision

• Deep Learning

• Inspiration from Neuroscience

• Deep Architecture

• Applications

Machine Vision - Challenges

Challenges 1: view point variation

Michelangelo 1475-1564

Challenges 2: illumination

slide credit: S. Ullman

Challenges 3: occlusion

Magritte, 1957

Challenges 4: scale

slide by Fei Fei, Fergus & Torralba

Challenges 5: deformation

Xu, Beihong 1943

Challenges 6: background clutter

Klimt, 1913

Challenges 7: object intra-class variation

slide by Fei-Fei, Fergus & Torralba

Object Categorization

Discriminative models: p(Object | image)

Generative models: p(image | Object)

Slides from: Fei-Fei Li

Discriminative

Generative

p(image | zebra) p(image | no zebra)
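The two views are linked by Bayes' rule: a generative model supplies p(image | zebra) and p(image | no zebra), and combining them with a prior gives the posterior p(zebra | image) that a discriminative model would estimate directly. A minimal sketch follows; the function name and all numbers are illustrative assumptions, not values from the slides.

```python
def posterior_from_likelihoods(lik_obj, lik_bg, prior_obj):
    """Return p(object | image) given p(image | object), p(image | no object)
    and the prior p(object), using Bayes' rule."""
    evidence = lik_obj * prior_obj + lik_bg * (1.0 - prior_obj)
    return lik_obj * prior_obj / evidence


# Hypothetical values: the image is 20x more likely under the zebra model,
# but zebras are assumed to be rare (prior 0.1).
p_zebra = posterior_from_likelihoods(lik_obj=0.02, lik_bg=0.001, prior_obj=0.1)
print(round(p_zebra, 3))
```

When both likelihoods are equal the image carries no evidence and the posterior falls back to the prior.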

Object Detection Pipeline

Object Representation: which features are suitable for the task?

Learning: which machine learning algorithm to choose?

Bag-of-words Approach

Features

Pixels

Texture

Color Histograms

SIFT/SURF

HoG …

Requirements: invariance to the challenges (illumination, scale, orientation …) and a low computational and memory burden

Machine Learning Algorithms

Nearest Neighbor

Naïve Bayes

ANN

SVM

AdaBoost

CNN …

Face Detection

Neural Network-Based Face Detection

Rowley, Baluja and Kanade, PAMI ’98

Object Detection Using the Statistics of Parts

H. Schneiderman, & T. Kanade, CVPR’00, IJCV’04

Robust Real-time Object Detection

Paul Viola and Michael Jones (IJCV’04)

Neural Network-Based Face

Detection

(Henry A. Rowley, Shumeet Baluja, and Takeo Kanade, PAMI ‘98)

System

Stage 1: Applies a set of neural network-based filters to an image.

The filters examine each location in the image at several scales.

Stage 2: Uses an arbitrator to combine the outputs.

The arbitrator merges detections from individual filters and eliminates overlapping detections.

Overview

Detection Time

• Number of networks: 2

• Image size: 320 x 240 pixels

• 246,766 (20x20) windows

• Machine: 200 MHz R4400 SGI Indigo 2

• Time taken: approx. 383 seconds (> 6 min!)

Object Detection Using the Statistics of Parts H. Schneiderman, & T. Kanade, CVPR’00, IJCV’04

Object Detection Using the Statistics of

Parts

•Represent appearance statistics as a product of histograms

•Each histogram represents the joint statistics of a subset of

wavelet coefficients and their position on the object.

•Use many such histograms representing a wide variety of visual

attributes

Number of orientations

Face – 2

Cars – 8

There are too many parameters to learn

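$$\frac{P(\text{Object} \mid x_1, \ldots, x_n)}{P(\overline{\text{Object}} \mid x_1, \ldots, x_n)} > 1\,?$$

$$\frac{P(x_1, \ldots, x_n \mid \text{Object})\,P(\text{Object})}{P(x_1, \ldots, x_n \mid \overline{\text{Object}})\,P(\overline{\text{Object}})} > 1\,?$$

$$\frac{P(x_1, \ldots, x_n \mid \text{Object})}{P(x_1, \ldots, x_n \mid \overline{\text{Object}})} > \frac{P(\overline{\text{Object}})}{P(\text{Object})}\,?$$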

Bayes optimal classifier

Image is defined by n attrs: x1,x2,…,xn
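The Bayes decision compares the likelihood ratio of the attributes under "object" and "non-object" against the ratio of the priors. Since the full joint over x1,…,xn has too many parameters, the product-of-histograms idea replaces it with a product of per-attribute terms. The sketch below assumes each attribute is already quantized to a discrete bin and that each histogram is stored as a dict from bin to probability; the function names, the `eps` floor for unseen bins, and the toy histograms are hypothetical.

```python
import math


def log_likelihood_ratio(attr_values, obj_hists, bg_hists, eps=1e-6):
    """Sum over attributes of log P(x_k | object) - log P(x_k | non-object).

    A product of histograms becomes a sum in the log domain.
    eps is an assumed floor for bins that never occurred in training.
    """
    total = 0.0
    for x, h_obj, h_bg in zip(attr_values, obj_hists, bg_hists):
        total += math.log(h_obj.get(x, eps)) - math.log(h_bg.get(x, eps))
    return total


def is_object(attr_values, obj_hists, bg_hists, prior_obj=0.5):
    """Decide 'object' when the likelihood ratio exceeds P(non-object) / P(object)."""
    threshold = math.log((1.0 - prior_obj) / prior_obj)
    return log_likelihood_ratio(attr_values, obj_hists, bg_hists) > threshold


# Toy histograms for two binary attributes (made-up numbers).
obj_hists = [{0: 0.8, 1: 0.2}, {0: 0.3, 1: 0.7}]
bg_hists = [{0: 0.5, 1: 0.5}, {0: 0.5, 1: 0.5}]
print(is_object((0, 1), obj_hists, bg_hists))
```

Lowering `prior_obj` raises the threshold, which trades detections for fewer false positives.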

SE 263 R. Venkatesh Babu

Reported results for faces

Kodak dataset: Test set: 17 images, 46 faces, 36 profile views.

ϒ=λ2


A bigger dataset

From multiple sources: 208 images, 441 faces, about 347 profiles.

Robust Real-time Object Detection Paul Viola and Michael Jones (IJCV’04)

Integral Image with Haar Features

Training via AdaBoost

Speed-up through Attentional cascades

Integral Image

The integral image at location (x,y), is the sum

of the pixel values above and to the left of (x,y),

inclusive.

Rapid evaluation of rectangular

features

Using the integral image

representation one can compute the

value of any rectangular sum in

constant time.

For example the integral sum inside

rectangle D we can compute as:

ii(4) + ii(1) – ii(2) – ii(3)

As a result, two-, three-, and four-rectangle features can be computed with 6, 8 and 9 array references respectively.
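The integral image and the four-lookup rectangle sum can be sketched in a few lines. This is a plain-Python illustration, not the authors' implementation; the function names are hypothetical, and the sign convention of the two-rectangle feature (left half minus right half) is an assumption.

```python
def integral_image(img):
    """ii[y][x] = sum of img over rows 0..y and columns 0..x, inclusive."""
    h, w = len(img), len(img[0])
    ii = [[0] * w for _ in range(h)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y][x] = row_sum + (ii[y - 1][x] if y > 0 else 0)
    return ii


def rect_sum(ii, top, left, bottom, right):
    """Sum of pixels in rows top..bottom, columns left..right (inclusive).

    Corresponds to ii(4) + ii(1) - ii(2) - ii(3) with 4 = bottom-right,
    1 = just outside top-left, 2 = above top-right, 3 = left of bottom-left.
    """
    def at(y, x):
        return ii[y][x] if y >= 0 and x >= 0 else 0

    return (at(bottom, right) + at(top - 1, left - 1)
            - at(top - 1, right) - at(bottom, left - 1))


def two_rect_horizontal(ii, top, left, height, width):
    """Left-half sum minus right-half sum; width is the width of one half.

    This naive version makes 8 lookups; sharing the two corners on the
    common edge brings it down to the 6 quoted on the slide.
    """
    left_sum = rect_sum(ii, top, left, top + height - 1, left + width - 1)
    right_sum = rect_sum(ii, top, left + width,
                         top + height - 1, left + 2 * width - 1)
    return left_sum - right_sum


img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
ii = integral_image(img)
print(rect_sum(ii, 1, 1, 2, 2))  # 5 + 6 + 8 + 9
```

The cost of `rect_sum` does not depend on the rectangle size, which is what makes exhaustive feature evaluation feasible.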

Haar Features

Three rectangular feature types:

• two-rectangle feature type (horizontal/vertical)

• three-rectangle feature type

• four-rectangle feature type

Using a 24x24 pixel base detection window, all possible combinations of horizontal and vertical location and scale of these feature types give a full set of 49,396 features.

Rectangular features are used instead of more expressive steerable filters because of their extreme computational efficiency.

Scanning at many Scales

At base scale objects are detected at 24x24 size

Scanned at 11 scales with a factor of 1.25 (24x24, 30x30, 38x38,

47x47 ….)

Conventional Approach:

• Compute a pyramid of 11 images, each 1.25 times

smaller than the previous

• Requires significant time (< 15fps)
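The window sizes listed above follow from multiplying the 24x24 base size by 1.25 repeatedly and rounding. A small sketch, assuming round-to-nearest (the slides do not state the rounding rule) and a hypothetical function name:

```python
def detector_sizes(base=24, factor=1.25, n_scales=11):
    """Side length of the square detection window at each scale.

    Python's round() rounds exact halves to the nearest even integer,
    so 37.5 becomes 38, matching the sizes quoted on the slide.
    """
    return [round(base * factor ** k) for k in range(n_scales)]


sizes = detector_sizes()
print(sizes[:4])
```

Viola and Jones rescale the rectangle features to these sizes instead of resampling the image, so the same integral image serves every scale and no pyramid is needed.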

AdaBoost: Intuition

K. Grauman, B. Leibe

Figure adapted from Freund and Schapire

Consider a 2-d feature

space with positive and

negative examples.

Each weak classifier splits

the training examples with

at least 50% accuracy.

Examples misclassified by

a previous weak learner

are given more emphasis

at future rounds.


AdaBoost: Intuition

AdaBoost Algorithm

• Start with uniform weights on training examples.

• Evaluate the weighted error for each feature and pick the best.

• Re-weight the examples: incorrectly classified -> more weight, correctly classified -> less weight.

• Final classifier is a combination of the weak ones, each weighted according to the error it had.

Freund & Schapire 1995
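The loop described above can be sketched as discrete AdaBoost with single-feature threshold classifiers ("stumps") as weak learners. The stump form, the exhaustive threshold search, and all names below are assumptions made for illustration; the weight alpha = 0.5 * ln((1 - err) / err) is the standard discrete AdaBoost choice.

```python
import math


def stump_predict(x, feature, threshold, polarity):
    """Weak classifier: +polarity if x[feature] >= threshold, else -polarity."""
    return polarity if x[feature] >= threshold else -polarity


def train_adaboost(X, y, n_rounds):
    """X: list of feature vectors, y: labels in {-1, +1}.
    Returns a list of (alpha, feature, threshold, polarity)."""
    n = len(X)
    weights = [1.0 / n] * n                      # uniform initial weights
    ensemble = []
    for _ in range(n_rounds):
        best = None
        for f in range(len(X[0])):               # evaluate every feature
            for thr in sorted({x[f] for x in X}):
                for pol in (1, -1):
                    err = sum(w for w, x, t in zip(weights, X, y)
                              if stump_predict(x, f, thr, pol) != t)
                    if best is None or err < best[0]:
                        best = (err, f, thr, pol)
        err, f, thr, pol = best
        err = min(max(err, 1e-10), 1 - 1e-10)    # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # low error -> large vote
        ensemble.append((alpha, f, thr, pol))
        # Misclassified examples gain weight, correct ones lose weight.
        weights = [w * math.exp(-alpha * t * stump_predict(x, f, thr, pol))
                   for w, x, t in zip(weights, X, y)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Sign of the alpha-weighted vote of all weak classifiers."""
    score = sum(a * stump_predict(x, f, thr, pol) for a, f, thr, pol in ensemble)
    return 1 if score >= 0 else -1


# Toy 1-D data that no single stump can separate.
X = [[1], [2], [3], [4], [5], [6]]
y = [1, 1, -1, -1, 1, 1]
ensemble = train_adaboost(X, y, n_rounds=3)
print([predict(ensemble, x) for x in X])
```

On this toy set each stump alone misclassifies two points, while the weighted vote of three stumps fits all six.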

{x1,…xn}

Boosting Example

First classifier

First 2 classifiers

First 3 classifiers

Final Classifier learned by Boosting

-0.42-0.65+0.92 = -0.15

-0.42+0.65+0.92 = 1.15

Recall: Perceptron Operation

Equations of “thresholded” operation:

$$o(x_1, x_2, \ldots, x_d) = \begin{cases} 1 & \text{if } w_1 x_1 + \cdots + w_d x_d + w_{d+1} > 0 \\ -1 & \text{otherwise} \end{cases}$$
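The thresholded operation is a weighted sum followed by a sign test, the same form as the final boosted classifier, where the inputs are the weak classifier outputs. A minimal sketch; the function name and the example weights are hypothetical:

```python
def perceptron_output(x, w):
    """x has d entries; w has d + 1 entries, the last being the bias w_{d+1}."""
    activation = sum(wi * xi for wi, xi in zip(w, x)) + w[-1]
    return 1 if activation > 0 else -1


# Weights chosen by hand so the unit fires only when both inputs are 1.
and_weights = [1.0, 1.0, -1.5]
print(perceptron_output([1, 1], and_weights), perceptron_output([1, 0], and_weights))
```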

Performance of 200 feature face detector

The ROC curve of the constructed classifier indicates that a reasonable detection rate of 0.95 can be achieved while maintaining an extremely low false positive rate of approximately 10^-4 (1 in 14084).

• First features selected by AdaBoost are meaningful and have high

discriminative power

• By varying the threshold of the final classifier one can construct a

two-feature classifier which has a detection rate of 1 and a false

positive rate of 0.4.

• Requires 0.7 sec to scan a 384x288 image!

Speed-up through the Attentional Cascade

• Simple boosted classifiers can reject many of the negative sub-windows while detecting all positive instances.

• A series of such simple classifiers can achieve good detection performance while eliminating the need for further processing of negative sub-windows.

more difficult examples faced by deeper classifiers
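The early-rejection logic can be sketched as a loop over stages that stops at the first failure. In a real detector each stage would be a boosted classifier over Haar features; here the stage scoring functions, thresholds, and names are hypothetical stand-ins.

```python
def cascade_classify(window, stages):
    """stages: list of (score_fn, threshold) ordered from cheap to expensive.

    Returns (accepted, stages_evaluated). A window is rejected as soon as
    one stage scores below its threshold, so later stages never run on it.
    """
    for index, (score_fn, threshold) in enumerate(stages):
        if score_fn(window) < threshold:
            return False, index + 1
    return True, len(stages)


# Two made-up stages on a flat list of pixel values.
stages = [
    (lambda w: sum(w) / len(w), 0.2),   # stage 1: mean brightness
    (lambda w: max(w) - min(w), 0.5),   # stage 2: contrast
]
print(cascade_classify([0.0, 0.1, 0.1], stages))
print(cascade_classify([0.1, 0.9, 0.5], stages))
```

Because most sub-windows in an image are negatives, the average cost per window stays close to the cost of the first stage.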

Single Vs Cascade Classifier

The cascaded classifier is nearly 10 times faster!

Experiments (dataset for training)

4916 positive training examples were hand-picked, aligned, normalized, and scaled to a base resolution of 24x24.

10,000 negative examples were selected by randomly picking sub-windows from 9500 images which did not contain faces.

Results cont.

More Detection Examples

Practical implementation

Details discussed in Viola-Jones paper

• Training time = weeks (with 5k faces and 9.5k non-face images)

• Final detector has 32 layers in the cascade, 4297 features

• 700 MHz Pentium III processor: can process a 384 x 288 image in 0.067 seconds (in 2002 when paper was written)

Ensemble Tracking Shai Avidan – CVPR 05

(Adaboost in Tracking)

Object Localization

Ensemble of weak learners is used to create a per-pixel

confidence map

Optimal location found by mean shift algorithm

Ensemble is updated in new location
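The localization step can be sketched as mean shift on the confidence map: move a window to the confidence-weighted centroid of the pixels it covers and repeat until it stops moving. The flat (uniform) kernel, the square window, integer rounding, and the function name are simplifying assumptions rather than details taken from the paper.

```python
def mean_shift(conf, center, half_size, max_iter=20):
    """conf: 2-D list of per-pixel confidences (>= 0).
    center: (row, col) start position. Returns the converged (row, col)."""
    h, w = len(conf), len(conf[0])
    cy, cx = center
    for _ in range(max_iter):
        total = sum_y = sum_x = 0.0
        for y in range(max(0, cy - half_size), min(h, cy + half_size + 1)):
            for x in range(max(0, cx - half_size), min(w, cx + half_size + 1)):
                c = conf[y][x]
                total += c
                sum_y += c * y
                sum_x += c * x
        if total == 0:                 # no evidence inside the window
            break
        ny, nx = round(sum_y / total), round(sum_x / total)
        if (ny, nx) == (cy, cx):       # converged
            break
        cy, cx = ny, nx
    return cy, cx


# Toy confidence map with a single confident pixel at row 4, column 5.
conf = [[0.0] * 7 for _ in range(7)]
conf[4][5] = 1.0
print(mean_shift(conf, center=(3, 3), half_size=2))
```

In the tracker, the starting center would be the object position from the previous frame, and the ensemble is retrained at the returned location.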

Weak Classifiers Linear classifiers are used as weak classifiers

Find the best hyperplane to separate data

Strong classifier calculated using AdaBoost

Determines weights of each weak classifier

Trains iteratively on “harder” examples

Experimental Results

SVMs in Machine Vision

Ensemble of Exemplar-SVMs for Object

Detection and Beyond (Malisiewicz et al.,

ICCV’11)

Discriminative Object Detectors

Linear SVM on HOG

Hard-Negative Mining

Sliding Window Detection

Exemplar SVMs

Learn a separate linear SVM for each instance

(exemplar) in the dataset

Exemplar SVM

Advantages: we can use different features for each exemplar

Adapt features to each exemplar’s aspect ratio

Ensemble of Exemplar SVMs

Results

Image Parsing

Tighe et al., Finding Things: Image Parsing with Regions and Per-Exemplar Detectors,

CVPR’13

Results

Representation Learning

using CNNs

Video Analytics Lab, SERC, IISc

Why Deep Learning??

❖ To learn feature hierarchies

❖ In Vision

➢Mainly for recognition

➢ But is being applied to almost all vision tasks

Conventional Recognition approach

Image/Video Pixels → Hand-designed feature extraction → Trainable classifier → Object Class

Features are not learned

Conventional Recognition approach

❖ Classifiers are often generic

❖ Features have been the key to progress in recognition until now

❖ Multitude of hand-designed features

➢ SIFT, HOG, LBP, MSER, Color-SIFT etc.

But, Why learn features ??

❖ Better performance

❖ Other new domains (unclear how to hand engineer)

➢ Kinect

➢ Video

➢ Multi spectral

❖ Feature computation time

Deep Learning??

Learning

multiple levels of representation and abstraction

that help to make sense of data

such as images, sound, and text.

Hierarchical Structure of Visual Cortex

N. Kruger et al.

Lateral Geniculate Nucleus (LGN)

Primary Visual Cortex (V1)

David Hubel and Torsten Wiesel won the Nobel prize for discovering

the functional organization and basic physiology of neurons in V1.

• Simple Cells

• Complex Cells

• Hypercomplex Cells

Simple Cell: Hubel-Wiesel Model

Complex Cell

Deep Architecture

Theoretical:

“Many functions can be much more efficiently represented with deeper

architectures…” [Bengio & LeCun 2007]

f_l takes as input a datum x_l and a parameter set w_l, and outputs x_{l+1}
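A deep architecture in this notation is a chain of such functions, each consuming the previous output. A minimal sketch; the two layer functions (a scaling and a shifted rectifier) are hypothetical placeholders chosen only to keep the example short.

```python
def forward(x, layers):
    """layers: list of (f_l, w_l). Applies x_{l+1} = f_l(x_l, w_l) in order."""
    for f, w in layers:
        x = f(x, w)
    return x


def scale(x, w):
    """Hypothetical layer: multiply every element by the scalar parameter w."""
    return [w * v for v in x]


def shifted_rectifier(x, w):
    """Hypothetical layer: add the scalar w, then clamp negatives to zero."""
    return [max(0.0, v + w) for v in x]


output = forward([1.0, -2.0], [(scale, 2.0), (shifted_rectifier, 1.0)])
print(output)
```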

Learning a Hierarchy of Feature

Extractors

❖ Each layer extracts features from output of previous layer

❖ All the way from pixels to classifier

❖ Layers have (nearly) the same structure

❖ Train all layers jointly

Image/Video Pixels → Layer 1 → Layer 2 → Layer 3 → Simple Classifier

Learning a Hierarchy of Feature

Extractors

❖ Stack multiple stages of simple cells / complex cells layers

❖ Higher stages compute more global, more invariant features

❖ Classification layer on top

Natural progression from

low level to high level structures.

Can share the lower-level

representations for multiple tasks.

Deep architectures can be

representationally efficient.

Typical CNN Operations

❖ Filtering (Convolution)

❖ Contrast Normalization

❖ Local Pooling (Sub-sampling)

2D Convolution

Image from http://developer.amd.com

Image Convolution / Filtering

❖ Convolutional

➢ Translation equivariance

➢ Tied filter weights

(same at each position: few

parameters)
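Tied weights mean one small kernel is slid over every position of the input. The sketch below is a plain-Python "valid" 2-D filtering loop; like most CNN libraries it does not flip the kernel (strictly a cross-correlation), and the function name and example kernel are assumptions.

```python
def conv2d_valid(image, kernel):
    """Slide kernel over image without padding.

    Output size is (H - kh + 1) x (W - kw + 1). The same kh * kw weights
    are reused at every position, so the parameter count is independent
    of the image size.
    """
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for y in range(ih - kh + 1):
        row = []
        for x in range(iw - kw + 1):
            row.append(sum(image[y + i][x + j] * kernel[i][j]
                           for i in range(kh) for j in range(kw)))
        out.append(row)
    return out


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]          # responds to diagonal intensity change
feature_map = conv2d_valid(image, kernel)
print(feature_map)
```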

Feature Maps

Translation Equivariance

❖ Input translation results in translation of features

➢ Fewer filters needed: no translated replications

➢ But still need to cover orientation/frequency
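Equivariance can be checked numerically: shifting the input by one sample shifts the filter response by one sample. The sketch uses a 1-D signal to keep it short; the signal, the kernel, and the function name are made up for illustration.

```python
def correlate1d(signal, kernel):
    """Valid-mode 1-D filtering with tied weights (no kernel flip)."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]


signal = [0, 0, 1, 2, 1, 0, 0, 0]
shifted = [0] + signal[:-1]            # same pattern, one step to the right
kernel = [1, -1]                       # simple edge detector

out = correlate1d(signal, kernel)
out_shifted = correlate1d(shifted, kernel)

# Away from the border, the response moves with the input.
print(out_shifted[1:] == out[:-1])
```

This is why one filter suffices for a pattern wherever it appears, while different orientations or frequencies still need their own filters.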

Convolutional Filters

CNN: Convolution in 3D

Image from http://deeplearning.net

Normalization

❖ Contrast normalization

➢ Across feature maps or within the maps

❖ Each feature is scaled by

❖ α and β are parameters, n: size of the local region

❖ Induces local competition between features to explain input
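The scaling expression itself did not survive on the slide. One common form with exactly these parameters, assumed here and not confirmed by the source, divides each activation by (1 + (α/n) Σ a_j²)^β, where the sum runs over a local region of n neighbouring feature maps. A sketch of that across-map variant at a single pixel; the function name and default values are assumptions.

```python
def local_response_norm(acts, n=5, alpha=1e-4, beta=0.75):
    """acts: activations of all feature maps at one spatial position.

    Assumed form: b_i = a_i / (1 + (alpha / n) * sum_j a_j^2) ** beta,
    with j ranging over the n maps centred on i (clipped at the ends).
    """
    half = n // 2
    out = []
    for i, a in enumerate(acts):
        lo, hi = max(0, i - half), min(len(acts), i + half + 1)
        sq_sum = sum(v * v for v in acts[lo:hi])
        out.append(a / (1.0 + (alpha / n) * sq_sum) ** beta)
    return out


# A strong response suppresses its weaker neighbours (exaggerated alpha).
print(local_response_norm([1.0, 10.0, 1.0], n=3, alpha=1.0, beta=1.0))
```

Large neighbouring activations inflate the denominator, which is the local competition mentioned above.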

Local Pooling

Images by Zhu et al. and http://vaaaaaanquish.hatenablog.com

Pooling

❖ Spatial Pooling

❖ Non-overlapping / overlapping regions

❖ Sum or max

❖ Invariance to small transformations

Sum Max
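Spatial pooling with either operator can be sketched with one function: the stride decides whether regions overlap, and the reduction decides between sum and max. The function name and the example feature map are hypothetical.

```python
def pool2d(fmap, size, stride, op=max):
    """Apply op (e.g. max or sum) to each size x size region of fmap.

    stride == size gives non-overlapping regions; stride < size overlaps.
    """
    h, w = len(fmap), len(fmap[0])
    out = []
    for y in range(0, h - size + 1, stride):
        row = []
        for x in range(0, w - size + 1, stride):
            region = [fmap[y + i][x + j]
                      for i in range(size) for j in range(size)]
            row.append(op(region))
        out.append(row)
    return out


fmap = [[1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]]
print(pool2d(fmap, size=2, stride=2, op=max))
print(pool2d(fmap, size=2, stride=2, op=sum))
```

Moving a peak by one pixel inside a pooling region leaves the max output unchanged, which is the invariance to small transformations listed above.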

Example Nets

CNN Applications

❖ Image recognition, speech recognition, photo taggers

❖ Have won several competitions

➢ ImageNet, Kaggle Facial Expression and Multimodal Learning,

German Traffic Signs, Connectomics, Handwriting etc.

❖ Applicable to array data where nearby values are correlated

➢ Images, sound, time-frequency representations, video, volumetric

images, RGB-Depth images etc.

❖ Reading Text in the Wild

❖ One of the few models that can be trained purely supervised

Software Tools

Caffe: From Berkeley

Torch7: www.torch.ch

OverFeat: From NYU

Cuda-Convnet: http://code.google.com/p/cuda-convnet/

MatConvnet: CNNs for MATLAB

Theano: http://deeplearning.net/software/theano/


