Principled Asymmetric Boosting Approaches to Rapid Training and Classification in Face Detection


Principled Asymmetric Boosting Approaches to Rapid Training and Classification in Face Detection

presented by

Minh-Tri Pham
Ph.D. Candidate and Research Associate
Nanyang Technological University, Singapore

Outline

• Motivation
• Contributions
  – Automatic Selection of Asymmetric Goal
  – Fast Weak Classifier Learning
  – Online Asymmetric Boosting
  – Generalization Bounds on the Asymmetric Error
• Future Work
• Summary

Problem

Application

Application

Face recognition

Application

3D face reconstruction

Application

Camera auto-focusing

Application

Windows face logon

• Lenovo Veriface Technology

Appearance-based Approach

• Scan image with a probe window patch (x, y, s)
  – at different positions and scales
  – Binary classify each patch into
    • face, or
    • non-face
• Desired output state:
  – the (x, y, s) windows containing a face

Most popular approach

• Viola-Jones '01–'04, Li et al. '02, Wu et al. '04, Brubaker et al. '04, Liu et al. '04, Xiao et al. '04,
• Bourdev-Brandt '05, Mita et al. '05, Huang et al. '05–'07, Wu et al. '05, Grabner et al. '05–'07,
• And many more

Appearance-based Approach

• Statistics:
  – 6,950,440 patches in a 320×240 image
  – P(face) < 10^-5
• Key requirement:
  – A very fast classifier

A very fast classifier

• Cascade of non-face rejectors:

[Figure: cascade F1 → F2 → … → FN; each stage either passes the patch on to the next stage or rejects it as non-face; a patch that passes all stages is labeled face]

• F1, F2, …, FN : asymmetric classifiers
  – FRR(Fk) ≈ 0
  – FAR(Fk) as small as possible (e.g. 0.5 – 0.8)



Non-face Rejector

• A strong combination of weak classifiers:
  – f1,1, f1,2, …, f1,K : weak classifiers
  – θ : threshold

[Figure: F1 passes the patch if f1,1(x) + f1,2(x) + … + f1,K(x) > θ, and rejects it otherwise]
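To make the structure concrete, below is a minimal Python sketch of a boosted rejector and of a cascade of such rejectors. The function names and the callable interface of the weak classifiers are illustrative assumptions, not code from the presentation.

```python
def evaluate_rejector(patch, weak_classifiers, theta):
    """Boosted non-face rejector F1: pass iff the summed weak scores exceed theta.

    weak_classifiers: iterable of callables f(patch) -> real-valued score (assumed interface)
    theta: rejection threshold
    """
    total = sum(f(patch) for f in weak_classifiers)
    return total > theta          # True = pass (possibly a face), False = reject


def evaluate_cascade(patch, stages):
    """Cascade F1, F2, ..., FN: a patch is labeled face only if every stage passes it.

    stages: list of (weak_classifiers, theta) pairs, one per rejector Fk.
    """
    return all(evaluate_rejector(patch, fs, theta) for fs, theta in stages)
```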

Boosting

[Figure: two boosting stages. Stage 1: a weak classifier learner is run on the weighted examples; some positives and negatives are wrongly classified, the rest correctly classified. Stage 2: the wrongly classified examples receive larger weights before the next weak classifier is learned. Legend: negative example, positive example]

Asymmetric Boosting

[Figure: the same two-stage illustration, but with positive examples weighted more heavily than negative examples in both stages]

• Weight positives γ times more than negatives
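A minimal sketch of what this asymmetric weighting could look like, assuming a standard AdaBoost-style reweighting; the exact scheme used in the presentation (for instance, whether the factor γ is applied once at initialization or at every round) may differ, and the names below are illustrative.

```python
import numpy as np

def init_asymmetric_weights(labels, gamma):
    """Initial example weights: each positive (face) counts gamma times a negative.

    labels: array of +1 (face) / -1 (non-face); gamma: asymmetry factor (assumed name).
    """
    w = np.where(labels == 1, gamma, 1.0)
    return w / w.sum()

def adaboost_reweight(w, labels, predictions):
    """Generic AdaBoost-style update: wrongly classified examples gain weight."""
    err = float(np.clip(w[labels != predictions].sum(), 1e-12, 1 - 1e-12))
    alpha = 0.5 * np.log((1.0 - err) / err)
    w = w * np.exp(-alpha * labels * predictions)
    return w / w.sum()
```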


Weak classifier

• Classify a Haar-like feature value

[Figure: input patch → feature value v → classify v → score]
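For illustration, a small sketch of how a Haar-like feature value can be computed from an integral image and classified by a threshold stump; the rectangle layout and the helper names are assumptions, not the presentation's code.

```python
import numpy as np

def integral_image(patch):
    """Summed-area table: any rectangle sum then costs at most 4 lookups."""
    return np.asarray(patch, dtype=float).cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, r0, c0, r1, c1):
    """Sum of patch[r0:r1, c0:c1], computed from the integral image ii."""
    total = ii[r1 - 1, c1 - 1]
    if r0 > 0:
        total -= ii[r0 - 1, c1 - 1]
    if c0 > 0:
        total -= ii[r1 - 1, c0 - 1]
    if r0 > 0 and c0 > 0:
        total += ii[r0 - 1, c0 - 1]
    return total

def two_rect_haar_value(ii, white, black):
    """One example edge-type feature: white rectangle minus black rectangle.

    white, black: (r0, c0, r1, c1) rectangle coordinates (illustrative layout).
    """
    return rect_sum(ii, *white) - rect_sum(ii, *black)

def stump_score(v, threshold, polarity=1):
    """Feature classifier: +1 (face-like) or -1 (non-face-like) from feature value v."""
    return polarity if v > threshold else -polarity
```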


Main issues

• Requires too much intervention from experts

A very fast classifier

• Cascade of non-face rejectors:
• F1, F2, …, FN : asymmetric classifiers
  – FRR(Fk) ≈ 0
  – FAR(Fk) as small as possible (e.g. 0.5 – 0.8)

How to choose bounds for FRR(Fk) and FAR(Fk)?

Asymmetric Boosting

• Weight positives γ times more than negatives

How to choose γ?

Non-face Rejector

• A strong combination of weak classifiers:
  – f1,1, f1,2, …, f1,K : weak classifiers
  – θ : threshold

How to choose θ?

Main issues

• Requires too much intervention from experts
• Very long learning time

Weak classifier

• Classify a Haar-like feature value

[Figure: input patch → feature value v → classify v → score]

…10 minutes to learn a weak classifier

Main issues

• Requires too much intervention from experts
• Very long learning time
  – To learn a face detector (≈ 4,000 weak classifiers):
    • 4,000 × 10 minutes ≈ 1 month
• Only suitable for objects with small shape variance

Outline

• Motivation
• Contributions
  – Automatic Selection of Asymmetric Goal
  – Fast Weak Classifier Learning
  – Online Asymmetric Boosting
  – Generalization Bounds on the Asymmetric Error
• Future Work
• Summary

Detection with Multi-exit Asymmetric Boosting

CVPR'08 poster paper:
Minh-Tri Pham, Viet-Dung D. Hoang, and Tat-Jen Cham. Detection with Multi-exit Asymmetric Boosting. In Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Anchorage, Alaska, 2008.

• Won Travel Grant Award

Problem overview

• Common appearance-based approach:
  – F1, F2, …, FN : boosted classifiers
  – f1,1, f1,2, …, f1,K : weak classifiers
  – θ : threshold

[Figure: cascade F1 → F2 → … → FN with pass/reject decisions at each stage (object vs. non-object); within a stage, F1 sums its weak classifier scores f1,1 + f1,2 + … + f1,K and compares the sum against θ]

Objective

• Find f1,1, f1,2, …, f1,K, and θ such that:
  – FAR(F1) ≤ α0
  – FRR(F1) ≤ β0
  – K is minimized (K is proportional to F1's evaluation time)

where

  F1(x) = sign( Σ_{i=1..K} f1,i(x) − θ )

Existing trends (1)

Idea
• For k from 1 until convergence:
  – Let F1(x) = sign( Σ_{i=1..k} f1,i(x) )
  – Learn a new weak classifier f1,k(x):
      f̂1,k = argmin_{f1,k} FAR(F1) + FRR(F1)
  – Let F1(x) = sign( Σ_{i=1..k} f1,i(x) − θ )
  – Adjust θ to see if we can achieve FAR(F1) ≤ α0 and FRR(F1) ≤ β0
    • Break the loop if such a θ exists

Issues
• Weak classifiers are sub-optimal w.r.t. the training goal.
• Too many weak classifiers are required in practice.

Existing trends (2)

Idea
• For k from 1 until convergence:
  – Let F1(x) = sign( Σ_{i=1..k} f1,i(x) − θ )
  – Learn a new weak classifier f1,k(x):
      f̂1,k = argmin_{f1,k} FAR(F1) + γ·FRR(F1)
  – Break the loop if FAR(F1) ≤ α0 and FRR(F1) ≤ β0

Pros
• Reduces FRR at the cost of increasing FAR – acceptable for cascades
• Fewer weak classifiers

Cons
• How to choose γ?
• Much longer training time

Solution to the con
• Trial and error: choose γ such that K is minimized.
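As a rough sketch of this stage-training loop, assuming hypothetical helpers weak_learner, far and frr (the proposed method described next simply fixes γ from the desired bounds rather than searching for it):

```python
def train_rejector(weak_learner, far, frr, alpha0, beta0, gamma, max_rounds=200):
    """Sketch of training one boosted rejector with an asymmetric goal gamma.

    weak_learner(gamma, fs) is assumed to return a new weak classifier chosen to
    (approximately) minimize FAR(F1) + gamma * FRR(F1) given the current ones;
    far(fs) and frr(fs) are assumed to evaluate the current boosted classifier.
    """
    fs = []
    for _ in range(max_rounds):
        fs.append(weak_learner(gamma, fs))
        if far(fs) <= alpha0 and frr(fs) <= beta0:
            break                       # desired error bounds reached
    return fs
```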

Our solution

Learn every weak classifier f1,k(x) using the same asymmetric goal:

  f̂1,k = argmin_{f1,k} FAR(F1) + γ·FRR(F1),

where γ = α0 / β0.

Why?

Because…

• Consider two desired bounds (or targets) for learning a boosted classifier FM(x):
  – Exact bound (1):        FAR(FM) ≤ α0  and  FRR(FM) ≤ β0
  – Conservative bound (2): FAR(FM)/α0 + FRR(FM)/β0 ≤ 1
• (2) is more conservative than (1), because (2) ⇒ (1).
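To spell out why (2) implies (1), using the bound symbols α0, β0 as reconstructed above and the fact that FAR and FRR are non-negative, each term of the conservative bound is at most the whole sum:

```latex
\[
\frac{\mathrm{FAR}(F_M)}{\alpha_0}
   \;\le\; \frac{\mathrm{FAR}(F_M)}{\alpha_0} + \frac{\mathrm{FRR}(F_M)}{\beta_0} \;\le\; 1
   \;\Longrightarrow\; \mathrm{FAR}(F_M) \le \alpha_0,
\qquad
\frac{\mathrm{FRR}(F_M)}{\beta_0} \;\le\; 1
   \;\Longrightarrow\; \mathrm{FRR}(F_M) \le \beta_0 .
\]
```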

[Figure: two ROC plots of FAR vs. FRR, each showing the exact bound and the conservative bound together with the sequence of operating points reached as weak classifiers are added. Left plot (γ = 1): operating points H1, H2, …, H201 and thresholds Q1, Q2, …, Q201. Right plot (γ = α0/β0): operating points H1, H2, …, H41 and Q1, Q2, …, Q41.]

At γ = α0/β0, for every new weak classifier learned, the ROC operating point moves the fastest toward the conservative bound.

Implication

• When the ROC operating point lies inside the conservative bound:
  – FAR(F1) ≤ α0
  – FRR(F1) ≤ β0
  – Conditions met, therefore θ = 0.

[Figure: rejector F1 sums the weak classifier scores f1,1 + f1,2 + … + f1,K and passes if the sum exceeds θ]

  F1(x) = sign( Σ_{i=1..K} f1,i(x) − θ )

Multi-exit Boosting

A method to train a single boosted classifier with multiple exit nodes:

[Figure: a single chain of weak classifiers f1 + f2 + … + f8; some nodes are plain weak classifiers, others are exit nodes where a pass/reject decision is made; a patch rejected at any exit is labeled non-object, a patch passing all exits is labeled object]

  fi : a weak classifier
  fi followed by a decision to continue or reject : an exit node

• Features:
  • Weak classifiers are trained with the same asymmetric goal (γ = α0/β0).
  • Every pass/reject decision is guaranteed with FAR ≤ α0 and FRR ≤ β0.
  • The classifier is a cascade.
  • Score is propagated from one node to another.

• Main advantages:
  • Weak classifiers are learned (approximately) optimally.
  • No training of multiple boosted classifiers.
  • Much fewer weak classifiers are needed than in traditional cascades.
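A minimal sketch of how such a multi-exit classifier could be evaluated, with the accumulated score propagated across exit nodes; the data structures and names are illustrative assumptions, not the presentation's code.

```python
def evaluate_multi_exit(patch, weak_classifiers, exits):
    """Multi-exit evaluation sketch: one boosted score, checked at several exit nodes.

    weak_classifiers: list of callables f_i(patch) -> score
    exits: dict {index_of_weak_classifier: rejection_threshold}  (the exit nodes)
    """
    score = 0.0
    for i, f in enumerate(weak_classifiers, start=1):
        score += f(patch)                  # score is propagated from node to node
        if i in exits and score <= exits[i]:
            return False                   # rejected as non-object at this exit
    return True                            # passed every exit: object
```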

Results

Goal (γ) vs. Number of weak classifiers (K)

• Toy problem: learn a (single-exit) boosted classifier F for classifying face/non-face patches such that FAR(F) < 0.8 and FRR(F) < 0.01
  – Empirically best goal: γ ∈ [10, 100]
  – Our method chooses: γ = α0/β0 = 0.8/0.01 = 80
• Similar results were obtained for tests on other desired error rates.

Ours vs. Others (in Face Detection)

• Fast StatBoost is used as the base method for fast-training a weak classifier.

Method                            No. of weak classifiers   No. of exit nodes   Total training time
Viola-Jones [3]                   4,297                     32                  6h 20m
Viola-Jones [4]                   3,502                     29                  4h 30m
Boosting chain [7]                959                       22                  2h 10m
Nested cascade [5]                894                       20                  2h
Soft cascade [1]                  4,871                     4,871               6h 40m
Dynamic cascade [6]               1,172                     1,172               2h 50m
Multi-exit Asymmetric Boosting    575                       24                  1h 20m

Ours vs. Others (in Face Detection)

• MIT+CMU Frontal Face Test set:

Conclusion

• Multi-exit Asymmetric Boosting trains every weak classifier approximately optimally.
  – Better accuracy
  – Much fewer weak classifiers
  – Significantly reduced training time
• No more trial-and-error for training a boosted classifier

Outline

• Motivation
• Contributions
  – Automatic Selection of Asymmetric Goal
  – Fast Weak Classifier Learning
  – Online Asymmetric Boosting
  – Generalization Bounds on the Asymmetric Error
• Future Work
• Summary

Fast Training and Selection of Haar-like Features using Statistics

ICCV'07 oral paper:
Minh-Tri Pham and Tat-Jen Cham. Fast Training and Selection of Haar Features using Statistics in Boosting-based Face Detection. In Proc. International Conference on Computer Vision (ICCV), Rio de Janeiro, Brazil, 2007.

• Won Travel Grant Award
• Won Second Prize, Best Student Paper in Year 2007 Award, Pattern Recognition and Machine Intelligence Association (PREMIA), Singapore

Motivation

• Face detectors today
  – Real-time detection speed
  …but…
  – Weeks of training time

Why is Training so Slow?

Factor   Description                            Common value
N        number of examples                     10,000
M        number of weak classifiers in total    4,000 – 6,000
T        number of Haar-like features           40,000

• Time complexity: O(MNT log N)
  – 15 ms to train a feature classifier
  – 10 minutes to train a weak classifier
  – 27 days to train a face detector

A view of a face detector training algorithm

for weak classifier m from 1 to M:
    …
    update weights – O(N)
    for feature t from 1 to T:
        compute N feature values – O(N)
        sort N feature values – O(N log N)
        train feature classifier – O(N)
    select best feature classifier – O(T)
    …

Why Should the Training Time be Improved?

• Trade-off between time and generalization
  – E.g. training becomes 100 times slower if we increase both N and T by 10 times
• Trial and error to find key parameters for training
  – Much longer training time needed
• Online-learning face detectors have the same problem

Existing Approaches to Reduce the Training Time

• Sub-sample the Haar-like feature set
  – Simple, but loses generalization
• Use histograms and real-valued boosting (B. Wu et al. '04)
  – Pro: reduces O(MNT log N) to O(MNT)
  – Con: raises overfitting concerns:
    • Real AdaBoost is not known to be overfitting resistant
    • A weak classifier may overfit if too many histogram bins are used
• Pre-compute the feature values' sorting orders (J. Wu et al. '07)
  – Pro: reduces O(MNT log N) to O(MNT)
  – Con: requires huge memory storage
    • For N = 10,000 and T = 40,000, a total of 800 MB is needed (one sorted order of N 2-byte indices per feature: 10,000 × 40,000 × 2 bytes).

Why is Training so Slow?

• Time complexity: O(MNT log N)
  – 15 ms to train a feature classifier
  – 10 min to train a weak classifier
  – 27 days to train a face detector
• Bottleneck:
  – At least O(NT) to train a weak classifier
• Can we avoid O(NT)?

Our Proposal

• Fast StatBoost: train feature classifiers using statistics rather than the input data
  – Con:
    • Less accurate… but not critical for a feature classifier
  – Pro:
    • Much faster training time: constant time instead of linear time

Fast StatBoost

• Training feature classifiers using statistics:
  – Assumption: the feature value v(t) is normally distributed given the face class c
  – Closed-form solution for the optimal threshold

[Figure: two 1-D Gaussians over the feature value, one for Face and one for Non-face, with the optimal threshold at their intersection]

• Fast linear projection of the statistics of a window's integral image into 1-D statistics of a feature value:

  μ(t) = g(t)ᵀ mJ        σ(t)² = g(t)ᵀ ΣJ g(t)

  J : random vector representing a window's integral image
  mJ, ΣJ : mean vector and covariance matrix of J
  g(t) : Haar-like feature, a sparse vector with fewer than 20 non-zero elements
  μ(t), σ(t)² : mean and variance of feature value v(t)

→ constant time to train a feature classifier
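For illustration, a small Python sketch of these two steps: projecting the integral-image statistics onto a Haar-like feature, and solving for a threshold where the two fitted Gaussians intersect. It assumes dense NumPy vectors and an equal-prior intersection criterion, so it is a sketch of the idea rather than the paper's exact procedure.

```python
import numpy as np

def feature_stats(g, m_J, S_J):
    """Project integral-image statistics onto a Haar-like feature vector g:
    mu = g^T m_J, var = g^T Sigma_J g (g is sparse in the paper; dense here for clarity)."""
    mu = float(g @ m_J)
    var = float(g @ S_J @ g)
    return mu, var

def gaussian_threshold(mu_f, var_f, mu_n, var_n):
    """Threshold where the two class-conditional Gaussian densities are equal
    (equal-prior sketch; the presentation's exact criterion may differ)."""
    a = 1.0 / var_f - 1.0 / var_n
    b = -2.0 * (mu_f / var_f - mu_n / var_n)
    c = mu_f**2 / var_f - mu_n**2 / var_n + np.log(var_f / var_n)
    if abs(a) < 1e-12:                       # (near-)equal variances: midpoint
        return 0.5 * (mu_f + mu_n)
    roots = np.roots([a, b, c])
    lo, hi = sorted((mu_f, mu_n))
    between = [r.real for r in roots if lo <= r.real <= hi]
    return between[0] if between else float(roots[0].real)
```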

Fast StatBoost

• The integral image's statistics are obtained directly from the weighted input data
  – Input: N training integral images, their class labels, and their current weights w(m):

      (J1, c1, w1(m)), (J2, c2, w2(m)), …, (JN, cN, wN(m))

  – We compute, for each class c:
    • Sample total weight:        ωc = Σ_{n: cn = c} wn(m)
    • Sample mean vector:         m̂c = (1/ωc) Σ_{n: cn = c} wn(m) Jn
    • Sample covariance matrix:   Σ̂c = (1/ωc) Σ_{n: cn = c} wn(m) Jn Jnᵀ − m̂c m̂cᵀ
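A small sketch of how these weighted class-conditional statistics could be computed with NumPy; the array shapes and names are assumptions for illustration.

```python
import numpy as np

def weighted_class_stats(J, c, w, cls):
    """Weighted statistics of the integral images of one class (sketch).

    J: (N, d) array of flattened integral images, c: (N,) labels, w: (N,) boosting weights.
    Returns the sample total weight, mean vector and covariance matrix for class `cls`
    (the quantities written omega_c, m_c, Sigma_c on the slide).
    """
    mask = (c == cls)
    Jc, wc = J[mask], w[mask]
    omega = wc.sum()
    m = (wc[:, None] * Jc).sum(axis=0) / omega
    second = (Jc * wc[:, None]).T @ Jc / omega   # (1/omega) * sum_n w_n J_n J_n^T
    Sigma = second - np.outer(m, m)
    return omega, m, Sigma
```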

Factor   Description                            Common value
N        number of examples                     10,000
M        number of weak classifiers in total    4,000 – 6,000
T        number of Haar-like features           40,000
d        number of pixels of a window           300 – 500

Fast StatBoost

• To train a weak classifier:
  – Extract the class-conditional integral image statistics
    • Time complexity: O(Nd²)
    • The factor d² is negligible because fast algorithms exist, hence in practice: O(N)
  – Train T feature classifiers by projecting the statistics into 1-D
    • Time complexity: O(T)
  – Select the best feature classifier
    • Time complexity: O(T)
• Total time complexity: O(N + T)

A view of our face detector training algorithm

for weak classifier m from 1 to M:
    …
    update weights – O(N)
    extract statistics of integral image – O(Nd²)
    for feature t from 1 to T:
        project statistics into 1-D – O(1)
        train feature classifier – O(1)
    select best feature classifier – O(T)
    …
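Tying the pieces together, a rough O(N + T) sketch of one such round, reusing the helper sketches above (weighted_class_stats, feature_stats, gaussian_threshold); the Gaussian-overlap error estimate used to rank features is an illustrative stand-in for the paper's actual selection criterion, and it assumes faces score above the threshold.

```python
import math

def gaussian_cdf(x, mu, var):
    """Phi((x - mu) / sigma) for a normal distribution with mean mu and variance var."""
    return 0.5 * (1.0 + math.erf((x - mu) / math.sqrt(2.0 * var)))

def train_weak_classifier(J, c, w, features):
    """Sketch of one boosting round in the spirit of Fast StatBoost.

    J, c, w: integral images, labels, current weights (see weighted_class_stats above).
    features: list of Haar-like feature vectors g (dense here for simplicity).
    """
    om_f, m_f, S_f = weighted_class_stats(J, c, w, +1)   # class statistics, done once
    om_n, m_n, S_n = weighted_class_stats(J, c, w, -1)
    best = None
    for g in features:                                    # constant work per feature
        mu_f, var_f = feature_stats(g, m_f, S_f)
        mu_n, var_n = feature_stats(g, m_n, S_n)
        thr = gaussian_threshold(mu_f, var_f, mu_n, var_n)
        # Estimated weighted error under the two fitted Gaussians
        err = om_f * gaussian_cdf(thr, mu_f, var_f) + om_n * (1.0 - gaussian_cdf(thr, mu_n, var_n))
        if best is None or err < best[0]:
            best = (err, g, thr)
    return best
```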

Experimental Results

• Setup:
  – Intel Pentium IV 2.8 GHz
  – 19 types, 295,920 Haar-like features
• Time for extracting the statistics:
  – Main factor: the covariance matrices
    • GotoBLAS: 0.49 seconds per matrix
• Time for training T feature classifiers:
  – 2.1 seconds

[Figure: the nineteen Haar-like feature types (1)–(19) used in our experiments, grouped into edge features, line features, corner features, diagonal line features, and center-surround features]

• Total training time: 3.1 seconds per weak classifier with 300K features
• Existing methods: up to 10 minutes with 40K features or fewer

Experimental Results

• Comparison with Fast AdaBoost (J. Wu et al. '07), the fastest known implementation of Viola-Jones' framework:

[Chart: training time of a weak classifier (in seconds) versus the number of features T, from 0 to 300,000, for Fast AdaBoost and Fast StatBoost]

Experimental Results

• Performance of a cascade:

[Figure: ROC curves of the final cascades for face detection]

Method                       Total training time   Memory requirement
Fast AdaBoost (T = 40K)      13h 20m               800 MB
Fast StatBoost (T = 40K)     2h 13m                30 MB
Fast StatBoost (T = 300K)    3h 02m                30 MB

Conclusions

• Fast StatBoost: use of statistics instead of the input data to train feature classifiers
• Time:
  – Reduces the face detector training time from up to a month to about 3 hours
  – Significant gains in both N and T with little increase in training time
    • Due to the O(N + T) cost per weak classifier
• Accuracy:
  – Even better accuracy for the face detector
    • Due to the much larger set of Haar-like features explored

Outline

• Motivation
• Contributions
  – Automatic Selection of Asymmetric Goal
  – Fast Weak Classifier Learning
  – Online Asymmetric Boosting
  – Generalization Bounds on the Asymmetric Error
• Future Work
• Summary

• Cascade of non-face rejectors:

Weak classifier

Outline

• Motivation
• Contributions
  – Automatic Selection of Asymmetric Goal
  – Fast Weak Classifier Learning
  – Online Asymmetric Boosting
  – Generalization Bounds on the Asymmetric Error
• Future Work
• Summary

Summary

• Online Asymmetric Boosting
  – Integrates asymmetric boosting with online learning
• Fast Training and Selection of Haar-like Features using Statistics
  – Dramatically reduces training time from weeks to a few hours
• Multi-exit Asymmetric Boosting
  – Approximately minimizes the number of weak classifiers

Thank You
