template matching techniques in computer...

OverviewDetection as hypothesis testing

Training and testingBibliography

Template Matching Techniques in Computer Vision

Roberto Brunelli

FBK - Fondazione Bruno Kessler

1 Settembre 2008

Roberto Brunelli Template Matching Techniques in Computer Vision



Table of contents

1 Overview

2 Detection as hypothesis testing

3 Training and testing

4 Bibliography




The BasicsAdvanced

Template matching

template/pattern

1 anything fashioned, shaped, or designed to serveas a model from which something is to be made:a model, design, plan, outline;

2 something formed after a model or prototype, acopy; a likeness, a similitude;

3 an example, an instance; esp. a typical model ora representative instance;

matching to compare in respect of similarity; to examine thelikeness of difference of.




The BasicsAdvanced

... template variability ...




The BasicsAdvanced

... and Computer Vision

Many important computer vision tasks can be solved with templatematching techniques:

Objectdetection/recognition

Object comparison

Depth computation

and template matchingdepends on

Physics (imaging)

Probability and statistics

Signal processing


Imaging

Perspective camera

Telecentric camera

Imaging

Photon noise (Poisson)

Quantum nature of light results in appreciable photon noisea

p(n) = e−(r∆t) (r∆t)n

n!

SNR ≤ I

σI=

n√n

=√

n

ar photons per unit time, ∆t gathering time



The BasicsAdvanced

Finding them ...

d(xxx ,yyy) =1

N

N∑i=1

(xi − yi )2

s(xxx ,yyy) =1

1 + d(xxx ,yyy)

A sliding window approach




The BasicsAdvanced

... robustly

Specularities and noise canresult in outliers: abnormallylarge differences that mayadversely affect thecomparison.

Specularities outliers




The BasicsAdvanced

... robustly

We downweight outliers changing the metrics:

N∑i=1

(zi )2 →

N∑i=1

ρ(zi ), zi = xi − yi

with one that has a more favourable influence function

ψ(z) =dρ(z)

dz

ρ(z) = z2 ψ(z) = z

ρ(z) = |z | ψ(z) = signz

ρ(z) = log

(1 +

z2

a2

)ψ(z) =

z

a2 + z2




The BasicsAdvanced

Illumination effects 1/3

Additional Template variability

Illumination variations affect images in a complex way, reducingthe effectiveness of template matching techniques




The BasicsAdvanced

Contrast and edge maps 2/3

Image transforms such as localcontrast can reduce the effectof illumination:

N ′ =I

I ∗ Kσ

N =

N ′ if N ′ ≤ 12− 1

N′ if N ′ > 1

(f ∗ g)(x) =

Zf (y)g(x − y) dy

Local contrast and edge maps




The BasicsAdvanced

Ordinal Transforms 3/3

Let us consider a pixel I (xxx) and itsneighborhood of W (xxx , l) of size l .Denoting with ⊗ the operation ofconcatenation, the Census transformis defined as

C (xxx) =⊗

xxx ′∈W (xxx ,l)\xxx

θ(I (xxx)− I (xxx ′))

CT invariance




The BasicsAdvanced

Matching variable patterns 1/2

a

Patterns of a single class mayspan a complex manifold of ahigh dimensional space: wemay try to find a compactspace enclosing it, possiblyattempting multiple locallinear descriptions.

astep edge, orientation θ andaxial distance ρ

Different criteria, different basis




The BasicsAdvanced

Subspaces approaches 2/2

PCA the eigenvectors of thecovariance matrix;

ICA the directions ontowhich data projectswith maximal nonGaussianity;

LDA the directionsmaximizing betweenclass scatter overwithin class scatter.

PCA, ICA (I and II), LDA




The BasicsAdvanced

Deformable templates 1/2

1 The circle representing the iris, characterizedby its radius r and its center xxxc . The interiorof the circle is attracted to the low intensityvalues while its boundary is attracted toedges in image intensity.

kv =

1 1 11 −8 11 1 1

Eyes potentials




The BasicsAdvanced

Deformable templates 2/2

Diffeomorphic matchinga:

A uuu(xxx) = A(u(xxx)) ≈ B(xxx)

uuu = argminuuu

∫Ω

∆(A uuu,B;xxx)dxxx + ∆(uuu)

∆(A,B) =

∫Ω

(A(xxx)− B(xxx))2 dxxx

∆(uuu) = ‖uuu − IIIuuu‖H1

Ω

‖aaa‖H1

Ω =

∫xxx∈Ω

‖aaa(xxx)‖2 + ‖∂(uuu)/∂(xxx)‖2Fdxxx

aa bijective map uuu(xxx) such that both it and itsinverse uuu−1 are differentiable

Brain warping


Linear structures: Radon/Hough Transforms

Rs(qqq)(I ;qqq) =

∫Rd

δ(K(xxx ;qqq))I (xxx) dxxx

In the Radon approach (left), the supporting evidence for a shapewith parameter qqq is collected by integrating over s(qqq). In theHough approach (right), each potentially supporting pixel (e.g.edge pixels AAA, BBB, CCC ) votes for all shapes to which it can potentiallybelong (all circles whose centers lay respectively on circles aaa, bbb, ccc).



The BasicsAdvanced

Detection as Learning

Given a set (xxx i , yi )i , we search a function f minimizing theempirical (approximation) squared error

EMSEemp =

1

N

∑i

(yi − f (xxx i ))2

f (xxx) = argminf

EMSEemp (f ; (xxx i , yi )i )

This ill posed problem can be regularized, turning the optimizationproblem of Equation 1 into

f (λ) = argminf ∈H

1

N

∑i

(yi − f (xxx i ))2 + λ‖f ‖H

where ‖f ‖H is the norm of f in the (function) space H to which werestrict our quest for a solution.




Hypothesis TestingBayes RiskNeyman Pearson testingCorrelationEstimation

Detection as testing

The problem of template detection fits within game theory.

The game proceeds along thefollowing steps:

1 nature chooses a state θ ∈ Θ;

2 a hint x is generated accordingto the conditional distributionPX (x |θ);

3 the computational agent makesits guess φ(x) = δ;

4 the agent experiences a lossC (θ, δ).

Gaming with nature





Hypothesis testing and Templates

Two cases are relevant to the problem of template matching:

1 ∆ = δ0, δ1, . . . , δK−1, that corresponds to hypothesistesting, and in particular the case K = 2, corresponding tobinary hypothesis testing. Many problems of patternrecognition fall within this category.

2 ∆ = Rn, corresponding to the problem of point estimation ofa real parameter vector: a typical problem being that of modelparameter estimation.

Template detection can be formalized as a binary hypothesis test:

H0 : xxx ∼ pθ(xxx), θ ∈ Θ0

H1 : xxx ∼ pθ(xxx), θ ∈ Θ1





Signal vs. Noise

Template detection in the presence of additive white Gaussiannoise ηηη ∼ N(000, σ2I )

H0 : xxx = ηηη

H1 : xxx =

fff + ηηη simpleαfff + ooo + ηηη composite

An hypothesis test (or classifier) is a mapping φ

φ : (Rnd )N → 0, . . . ,M − 1.

The test φ returns an hypothesis for every possible input,partitioning the input space into a disjoint collection R0, . . . ,RM−1

of decision regions:

Rk = (xxx1, . . . ,xxxN)|φ(xxx1, . . . ,xxxN) = k.





Error types

The probability of a type I (falsealarm) PF (size or α)

α = PF = P(φ = 1|H0)

The detection probability PD (poweror β):

β(θ) = PD = P(φ = 1|θ ∈ Θ1),

The probability of a type II error, ormiss probability PM is

PM = 1− PD .

False alarms and detection





The Bayes Risk

The Bayes approach is characterized by the assumption that theoccurrence probability of each hypothesis πi is known a priori.

The optimal test is the one that minimizes the Bayes risk CB :

CB =∑i ,j

CijP(φ(XXX ) = i |Hj)πj

=∑i ,j

Cij

(∫Ri

pj(xxx)dxxx

)πj

=

∫R0

(C00π0p0(xxx) + C01π1p1(xxx)) dxxx +∫R1

(C10π0p0(xxx) + C11π1p1(xxx)) dxxx .





The likelihood ratio

We may minimize the Bayes risk assigning each possible xxx to theregion whose integrand at xxx is smaller:

L(xxx) ≡ p1(xxx)

p0(xxx)

H1

≷H0

π0(C10 − C00)

π1(C01 − C11)≡ ν

where L(xxx) is called the likelihood ratio.When C00 = C11 = 0 and C10 = C01 = 1

L(xxx) ≡ p1(xxx)

p0(xxx)

H1

≷H0

π0

π1≡ ν

equivalent to the maximum a posteriori (MAP) rule

φ(xxx) = argmaxi∈0,1

πipi (xxx)





Frequentist testing

The alternative to Bayesian hypothesis testing is based on theNeyman-Pearson criterion and follows a classic, frequentistapproach based on

PF =

∫R1

p0(xxx)dxxx

PD =

∫R1

p1(xxx)dxxx .

we should design the decision rule in order to maximize PD

without exceeding a predefined bound on PF :

R1 = argmaxPDR1:PF≤α

.





... likelihood ratio again

The problem can be solved with the method of Lagrangemultipliers:

E = PD + λ(PF − α′)

=

∫R1

p1(xxx)dxxx + λ

(∫R1

p0(xxx)dxxx − α′)

= −λα′ +∫

R1

(p1(xxx) + λp0(xxx)) dxxx

where α′ ≤ α. In order to maximize E , the integrand should bepositive leading to the following condition:

p1(xxx)

p0(xxx)

H1> −λ

as we are considering region R1.Roberto Brunelli Template Matching Techniques in Computer Vision




The Neyman Pearson Lemma

In the binary hypothesis testing problem, if α0 ∈ [0, 1) is the sizeconstraint, the most powerful test of size α ≤ α0 is given by thedecision rule

φ(xxx) =

1 if L(xxx) > νγ if L(xxx) = ν0 if L(xxx) < ν

where ν is the largest constant for which

P0 (L(xxx) ≥ ν) ≥ α0 and P0 (L(xxx) ≤ ν) ≥ 1− α0

The test is unique up to sets of probability zero under H0 and H1.





An important example

Discriminate two deterministic multidimensional signals corruptedby zero average Gaussian noise:

H0 : xxx ∼ N(µµµ0,Σ),

H1 : xxx ∼ N(µµµ1,Σ),

Using the Mahalanobis distance

d2Σ(xxx ,yyy) = (xxx − yyy)TΣ−1(xxx − yyy)

we get

p0(xxx) =1

(2π)n/2|Σ|1/2exp

[−1

2d2Σ(xxx ,µµµ0)

]p1(xxx) =

1

(2π)n/2|Σ|1/2exp

[−1

2d2Σ(xxx ,µµµ1)

]Roberto Brunelli Template Matching Techniques in Computer Vision




... with an explicit solution.

The decision based on the log-likelihood ratio is

φ(xxx) =

1 wwwT (xxx − xxx0) ≥ νΛ

0 wwwT (xxx − xxx0) < νΛ

with

www = Σ−1(µµµ1 −µµµ0), xxx0 =1

2(µµµ1 +µµµ0)

and PF ,PD depend only on the distance of the means of the twoclasses normalized by the amount of noise, which is a measure ofthe SNR of the classification problem. When Σ = σ2I and µµµ0 = 000we have matching by projection:

ru = µµµT1 xxx

H1

≷H0

ν ′Λ





... more details

PF = P0(Λ(xxx) ≥ ν) = Q

(ν + σ2

0/2

σ0

)= Q(z)

PD = P1(Λ(xxx) ≥ ν) = Q

(ν − σ2

0/2

σ0

)= Q(z − σ0)

σ20(Λ(xxx)) = σ2

1(Λ(xxx)) = wwwTΣwww

z = ν/σ0 + σ0/2





Variable patterns ...

A common source of signal variability is its scaling by an unknowngain factor α possibly coupled to a signal offset β

xxx ′ = αxxx + β111

A practical strategy is to normalize both the reference signal andthe pattern to be classified to zero average and unit variance:

xxx ′ =(xxx − x)

σx

x =1

nd

nd∑i=1

xi

σx =1

nd

nd∑i=1

(xi − x)2 =1

nd

nd∑i=1

x2i − x2





Correlation

or, equivalently, replacing matching by projection with

rP(xxx ,yyy) =

∑i (xi − µx)(yi − µy )√∑

i (xi − µx)2√∑

i (yi − µy )2

which is related to the fraction of the variance in y accounted forby a linear fit of x to y yyy = axxx + b

r2P = 1−

s2y |x

s2y

s2y |x =

nd∑i=1

(yi − yi )2 =

nd∑i=1

(yi − axi − b

)2

s2y =

nd∑i=1

(y − y)2





(Maximum likelihood) estimation

The likelihood function is defined as

l(θθθ|xxx iNi=1) =

N∏i=1

p(xxx i |θθθ)

where xxxN = xxx iNi=1 is our (fixed) dataset and it is considered to

be a function of θθθ. The maximum likelihood estimator (MLE) θθθ isdefined as

θθθ = argmaxθθθ

l(θθθ|xxxN)

resulting in the parameter that maximizes the likelihood of ourobservations.





Bias and Variance

Definition

The bias of an estimator θ is

bias(θ) = E (θ)− θ

where θ represents the true value. If bias(θ) = 0 the operator issaid to be unbiased.

Definition

The mean squared error (MSE) of an estimator is

MSE(θ) = E ((θ − θ)2)





MLE properties

1 The MLE is asymptotically unbiased, i.e., its bias tends tozero as the number of samples increases to infinity.

2 The MLE is asymptotically efficient: asymptotically, nounbiased estimator has lower mean squared error than theMLE.

3 The MLE is asymptotically normal.





Shrinkage (James-Stein estimators)

MSE(θ) = var(θ) + bias2(θ)

We may reduce MSE tradingoff bias for variance, using alinear combination ofestimators T and S

Ts = λT + (1− λ)S

shrinking S towards T .

Shrinkage





James-Stein Theorem

Let XXX be distributed according to a nd -variate normal distributionN(θθθ, σ2I ). Under the squared loss, the usual estimator δδδ(XXX ) = XXXexhibits a higher loss for any θθθ, being therefore dominated, than

δδδa(X ) = θθθ0 +

(1− aσ2

‖XXX − θθθ0‖2

)(XXX − θθθ0)

for nd ≥ 3 and 0 < a < 2(nd − 2) and a = nd − 2 gives theuniformly best estimator in the class. The risk of δnd−2 at θθθ0 isconstant and equal to 2σ2 (instead of ndσ

2 of the usual estimator).





JS estimation of covariance matrices

The unbiased sample estimate of the covariance matrix is

Σ =1

N − 1

∑i

(xxx i − xxx)(xxx i − xxx)T

and it benefits from shrinking in the small sample, highdimensionality case, avoiding the singularity problem. The optimalshrinking parameter can be obtained in closed form for many usefulshrinking targets.

Significant improvements are reported in template (face) detectiontasks using similar approaches.




How good is ... goodUnbiased training and testingPerformance analysisOracles

Error breakdown

Detailed error breakdown canbe exploited to improve systemperformance.Error measures should beinvariant to translation,scaling, rotation.

Eyes localization errors





Error scoring

Error weighting or scoring functionscan be tuned to tasks: errors aremapped into the range [0, 1], thelower the score, the worse the error.

A single face detection system canbe scored differently when consideredas a detection or localization systemby changing the parameterscontrolling the weighting functions,using more peaked scoring functionsfor localization.

Task selective penalties





Error impact

The final verification error ∆v

∆v (xxx i) =∑

i

f (δδδ(xxx i );θθθ)

must be expressed as afunction of the detailed errorinformation that can beassociated to each localizationxxx i :

(δx1(xxx i ), δx2(xxx i ), δs(xxx i ), δα(xxx i )).

System impact

δx1, face verification systems, FAR=false

acceptance/impostors, FRR=false rejections/true client,

HTER= (FAR+FRR)/2





Training and testing: concepts

Let X be the space of possible inputs (without label), L the set oflabels, S = X × L the space of labeled samples, andD = sss1, . . . , sssN, where sss i = (xxx i , li ) ∈ S, be our dataset.

A classifier is a function C : X → L, while an inducer is anoperator I : D → C that maps a dataset into a classifier.





... and methods

The accuracy ε of a classifier is the probabilityp(C(xxx) = l , (xxx , l) ∈ S) that its label attribution is correct. Theproblem is to find a low bias and low variance estimate ε(C) of ε.There are three main different approaches to accuracy estimationand model selection:

1 hold-out,

2 bootstrap,

3 k-fold cross validation.





Hold Out

A subset Dh of nh points is extracted from the complete datasetand used as testing set while the remaining set Dt = D \ Dh ofN − nh points is provided to the inducer to train the classifier.The accuracy is estimated as

εh =1

nh

∑xxx i∈Dh

δ[J(Dt ;xxx i ), li ]

where δ(i , j) = 1 when i = j and 0 otherwise. It (approximately)follows a Gaussian distribution N(ε, ε(1− ε)/nh), from which anestimate of the variance (of ε) follows.


Bootstrap

The accuracy and its variance are estimated from the results of theclassifier over a sequence of bootstrap samples, each of themobtained by random sampling with replacement N instances fromthe original dataset.The accuracy εboot is then estimated as

εboot = 0.632εb + 0.368εr

where εr is the re-substitution accuracy, and eb is the accuracy onthe bootstrap subset. Multiple bootstrap subsets Db,i must begenerated, the corresponding values being used to estimate theaccuracy by averaging the results:

εboot =1

nε

nε∑i=1

εboot(Db,i )

and its variance.




Cross validation

k-fold cross validation is based on the subdivision of the datasetinto k mutually exclusive subsets of (approximately) equal size:each one of them is used in turn for testing while the remainingk − 1 groups are given to the inducer to estimate the parametersof the classifier. If we denote with Di the set that includesinstance i

εk =1

N

∑i

δ[J(D \ Di;xxx i ), li ]

Complete cross validation would require averaging over all( NN/k

)possible choices of the N/k testing instances out of N and is tooexpensive with the exception of the case k = 1 which is also knownas leave-one-out (LOO).





ROC representation

The ROC curve describes the performanceof a classifier when varying theNeyman-Pearson constraint on PF :

PD = f (PF ) or Tp = f (Fp)

ROC diagrams are not affected by classskewness, and are invariant also to errorcosts.

ROC points and curves





ROC convex hull

The expected cost of aclassifier can be computedfrom its ROC coordinates:

C = p(p)(1−Tp)Cηp+p(n)FpCπn

Proposition

For any set of cost (Cηp,Cπn)and class distributions(p(p), p(n)), there is a pointon the ROC convex hull(ROCCH) with minimumexpected cost.

Operating conditions





ROC interpolation

Proposition

ROC convex hull hybrid Given twoclassifiers J1 and J2 represented withinROC space by the points aaa1 = (Fp1,Tp1)and aaa2 = (Fp2,Tp2), it is possible togenerate a classifier for each point aaax onthe segment joining aaa1 and aaa1 with arandomized decision rule that samples J1

with probability

p(J1) =‖aaa2 − aaax‖‖aaa2 − aaa1‖

Satisfying operatingconstraints


AUC

The area under the curve (AUC) gives theprobability that the classifier will score, arandomly given positive instance higherthat a randomly chosen one. This value isequivalent to the Wilcoxon rank teststatistic W

W =1

NPNN

∑i :li=p

∑j :lj=n

w(s(xxx i ), s(xxx j))

where, assuming no ties,

w(s(xxx i ), s(xxx j)) = 1 if s(xxx i ) > s(xxx j)

The closer the area to 1, the better theclassifier.

Scoring classifiers




Rendering

The appearance of a surface point is determined by solving therendering equation:

Lo(xxx ,−III , λ) = Le(xxx ,−III , λ)+

∫Ω

fr (xxx , LLL,−III , λ)Li (xxx ,−LLL, λ)(−LLL·NNN)dLLL


Describing reality: RenderMan R©

Projection "perspective" "fov" 35WorldBeginLightSource "pointlight" 1 "intensity" 40 "from" [4 2 4]Translate 0 0 5Color 1 0 0Surface "roughMetal" "roughness" 0.01Cylinder 1 0 1.5 360

WorldEnd

A simple shader

color roughMetal(normal Nf; color basecolor;float Ka, Kd, Ks, roughness;)

extern vector I;return basecolor * (Ka*ambient() + Kd*diffuse(Nf) +

Ks*specular(Nf,-normalize(I),roughness));




How realistic is it?

basic phenomena, including straight propagation, specularreflection, diffuse reflection (Lambertian surfaces), selectivereflection, refraction, reflection and polarization (Fresnel’slaw), exponential absorption of light (Bouguer’s law);

complex phenomena, including non-Lambertian surfaces,anisotropic surfaces, multilayered surfaces, complex volumes,translucent materials, polarization;

spectral effects, including spiky illumination, dispersion,inteference, diffraction, Rayleigh scattering, fluorescence, andphosphorescence.





Thematic rendering

We can shade a pixel so thatits color represents

the temperature of thesurface,

its distance from theobserver,

its surface coordinates,

the material,

an object uniqueidentification code.

Automatic ground truth




References

R. Brunelli and T. Poggio, 1997, Template matching: Matched spatialfilters and beyond. Pattern Recognition 30, 751–768.

R. Brunelli, 2009 Template Matching Techniques in Computer Vision:Theory and Practice. J. Wiley & Sons

T. Moon and W. Stirling, 2000 Mathematical Methods and Algorithmsfor Signal Processing. Prentice-Hall.

J Piper, I. Poole and A. Carothers A, 1994, Stein’s paradox and improvedquadratic discrimination of real and simulated data by covarianceweighting Proc. of the 12th IAPR International Conference on PatternRecognition (ICPR’94), vol. 2, pp. 529–532.


template matching techniques in computer...

Documents