VU Pattern Recognition II
Helmut A. Mayer
Department of Computer Sciences, University of Salzburg
WS 13/14
VU Pattern Recognition II
Outline
1 Introduction
2 Statistical Classifiers: Bayesian Decision Theory
3 Nonparametric Techniques: Density Estimation, k–Nearest–Neighbor Estimation
4 Linear Discriminant Functions: Decision Surfaces
5 Neural Networks
6 Nonmetric Methods: Classification and Regression Trees
7 Stochastic Methods: Simulated Annealing
8 Projects
VU Pattern Recognition II
Introduction
Human vs. Machine
Human Perception
Senses to neural patterns
Machine Perception
Sensors to value patterns
Patterns are everywhere...
Images, Time Series, Medical Diagnosis, Customer Analysis (only a few examples)
Features build Model
Fish Example
VU Pattern Recognition II
Introduction
Salmon or Sea Bass
FIGURE 1.1. The objects to be classified are first sensed by a transducer (camera), whose signals are preprocessed. Next the features are extracted and finally the classification is emitted, here either "salmon" or "sea bass." Although the information flow is often chosen to be from the source to the classifier, some systems employ information flow in which earlier levels of processing can be altered based on the tentative or preliminary response in later levels (gray arrows). Yet others combine two or more stages into a unified step, such as simultaneous segmentation and feature extraction. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Introduction
Fish Length Histogram
FIGURE 1.2. Histograms for the length feature for the two categories. No single threshold value of the length will serve to unambiguously discriminate between the two categories; using length alone, we will have some errors. The value marked l* will lead to the smallest number of errors, on average. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Introduction
Fish Lightness Histogram
FIGURE 1.3. Histograms for the lightness feature for the two categories. No single threshold value x* (decision boundary) will serve to unambiguously discriminate between the two categories; using lightness alone, we will have some errors. The value x* marked will lead to the smallest number of errors, on average. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Introduction
Decision Theory
Cost of an Error?
Salmon tastes better.. ;-)
Minimization of cost (risk)
Decision Rule/Boundary
Improving Recognition
Feature Vector ~x = (lightness, width)ᵗ
2D Decision Boundary
VU Pattern Recognition II
Introduction
2D Feature Space
FIGURE 1.4. The two features of lightness and width for sea bass and salmon. The dark line could serve as a decision boundary of our classifier. Overall classification error on the data shown is lower than if we use only one feature as in Fig. 1.3, but there will still be some errors. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Introduction
Overfitting
FIGURE 1.5. Overly complex models for the fish will lead to decision boundaries that are complicated. While such a decision may lead to perfect classification of our training samples, it would lead to poor performance on future patterns. The novel test point marked ? is evidently most likely a salmon, whereas the complex decision boundary shown leads it to be classified as a sea bass. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Introduction
Generalization
FIGURE 1.6. The decision boundary shown might represent the optimal tradeoff between performance on the training set and simplicity of classifier, thereby giving the highest accuracy on new patterns. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Introduction
Related Fields
Statistical Hypothesis Testing
Image Processing
Regression (age ↔ weight)
Interpolation
Density Estimation
VU Pattern Recognition II
Introduction
Pattern Recognition Systems
[Figure: processing pipeline: input → sensing → segmentation → feature extraction → classification → post-processing → decision; with adjustments for missing features, adjustments for context, and costs]
FIGURE 1.7. Many pattern recognition systems can be partitioned into components such as the ones shown here. A sensor converts images or sounds or other physical inputs into signal data. The segmentor isolates sensed objects from the background or from other objects. A feature extractor measures object properties that are useful for classification. The classifier uses these features to assign the sensed object to a category. Finally, a post processor can take account of other considerations, such as the effects of context and the costs of errors, to decide on the appropriate action. Although this description stresses a one-way or "bottom-up" flow of data, some systems employ feedback from higher levels back down to lower levels (gray arrows). From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Introduction
Feature Extraction
Features ↔ Classification
Invariant Features (translation, rotation, scale)
Deformation (e.g. Cropping)
Feature Selection (Filter, Wrapper)
VU Pattern Recognition II
Introduction
Post Processing
Error Rate, Risk (weighted error)
Context (IC* *IN)
Multiple Classifiers (subspaces, fusion)
VU Pattern Recognition II
Introduction
Design Cycle
[Figure: design cycle: start → collect data → choose features → choose model → train classifier → evaluate classifier → end, with prior knowledge (e.g., invariances)]
FIGURE 1.8. The design of a pattern recognition system involves a design cycle similar to the one shown here. Data must be collected, both to train and to test the system. The characteristics of the data impact both the choice of appropriate discriminating features and the choice of models for the different categories. The training process uses some or all of the data to determine the system parameters. The results of evaluation may call for repetition of various steps in this process in order to obtain satisfactory results. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Introduction
Learning and Adaptation
Learning is Parameter Tuning
Supervised Learning (teacher)
Reinforcement Learning (critic)
Unsupervised Learning (clustering)
VU Pattern Recognition II
Statistical Classifiers
Bayesian Decision Theory
Probabilities
State of Nature ω = ω1 (class)
A Priori Probability P(ω1) (prior)
Decision Rule P(ω1) > P(ω2) → ω1
Class–Conditional Probability Density Function p(x|ω)
VU Pattern Recognition II
Statistical Classifiers
Bayesian Decision Theory
Class–Conditional Probability Density
FIGURE 2.1. Hypothetical class-conditional probability density functions show the probability density of measuring a particular feature value x given the pattern is in category ωi. If x represents the lightness of a fish, the two curves might describe the difference in lightness of populations of two types of fish. Density functions are normalized, and thus the area under each curve is 1.0. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Statistical Classifiers
Bayesian Decision Theory
Bayes Decision Rule
Joint Probability Density p(ωj, x) = P(ωj|x) p(x) = p(x|ωj) P(ωj)
Bayes Formula P(ωj|x) = p(x|ωj) P(ωj) / p(x)
Decision Rule P(ω1|x) > P(ω2|x) → ω1
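A minimal Python sketch (not from the slides) of this two-class decision rule; the priors match the Fig. 2.2 example, while the Gaussian class-conditional densities, their parameters, and the function names are illustrative assumptions.

import numpy as np

def gauss_pdf(x, mu, sigma):
    # p(x|w) = 1/(sqrt(2*pi)*sigma) * exp(-0.5*((x-mu)/sigma)**2)
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

priors = np.array([2 / 3, 1 / 3])        # P(w1), P(w2) as in Fig. 2.2
params = [(11.0, 1.0), (13.0, 1.5)]      # assumed (mu, sigma) of p(x|w1), p(x|w2)

def posteriors(x):
    # Bayes formula: P(wj|x) = p(x|wj) P(wj) / p(x)
    joint = np.array([gauss_pdf(x, m, s) for m, s in params]) * priors
    return joint / joint.sum()            # divide by the evidence p(x)

def decide(x):
    # Decision rule: choose w1 if P(w1|x) > P(w2|x), else w2
    return int(np.argmax(posteriors(x))) + 1

print(posteriors(12.0), decide(12.0))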
VU Pattern Recognition II
Statistical Classifiers
Bayesian Decision Theory
Posterior Probabilities
FIGURE 2.2. Posterior probabilities for the particular priors P(ω1) = 2/3 and P(ω2) = 1/3 for the class-conditional probability densities shown in Fig. 2.1. Thus in this case, given that a pattern is measured to have feature value x = 14, the probability it is in category ω2 is roughly 0.08, and that it is in ω1 is 0.92. At every x, the posteriors sum to 1.0. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Statistical Classifiers
Bayesian Decision Theory
Error Probabilities
Error P(error|x) = P(ω1|x) if ω2 is decided, P(ω2|x) if ω1 is decided
Average Error Probability P(error) = ∫ p(error, x) dx = ∫ P(error|x) p(x) dx
Bayes Rule minimizes P(error)
VU Pattern Recognition II
Statistical Classifiers
Bayesian Decision Theory
Generalized Bayes Rule
Feature vector ~x ∈ R^d
Classes ω1, . . . , ωc
Bayes Formula P(ωj|~x) = p(~x|ωj) P(ωj) / p(~x)
VU Pattern Recognition II
Statistical Classifiers
Bayesian Decision Theory
Dichotomizer
FIGURE 2.6. In this two-dimensional two-category classifier, the probability densities are Gaussian, the decision boundary consists of two hyperbolas, and thus the decision region R2 is not simply connected. The ellipses mark where the density is 1/e times that at the peak of the distribution. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Statistical Classifiers
Bayesian Decision Theory
The Normal Density
Randomized Prototype Vectors with Mean ~µ → Normal Distribution
Expected Value E[f(x)] = ∫ f(x) p(x) dx  (continuous)
E[f(x)] = Σx∈D f(x) P(x)  (discrete)
Univariate Normal Density p(x) = 1/(√(2π) σ) · e^(−(1/2)((x−µ)/σ)²)
E[x] = µ,  E[(x − µ)²] = σ²
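A short numerical check of the univariate normal density, assuming standard NumPy; the values of µ and σ are arbitrary examples.

import numpy as np

def normal_pdf(x, mu, sigma):
    # p(x) = 1/(sqrt(2*pi)*sigma) * exp(-(1/2)*((x-mu)/sigma)**2)
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

mu, sigma = 1.5, 0.7                       # assumed example parameters
x = np.linspace(mu - 6 * sigma, mu + 6 * sigma, 20001)
p = normal_pdf(x, mu, sigma)

area = np.trapz(p, x)                      # should be ~1 (normalized density)
mean = np.trapz(x * p, x)                  # E[x] = mu
var = np.trapz((x - mu) ** 2 * p, x)       # E[(x-mu)^2] = sigma^2
print(area, mean, var)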
VU Pattern Recognition II
Statistical Classifiers
Bayesian Decision Theory
Normal Distribution
FIGURE 2.7. A univariate normal distribution has roughly 95% of its area in the range |x − µ| ≤ 2σ, as shown. The peak of the distribution has value p(µ) = 1/(√(2π) σ). From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Statistical Classifiers
Bayesian Decision Theory
Multivariate Density
Multivariate Normal Density p(~x) = 1/((2π)^(d/2) |Σ|^(1/2)) · e^(−(1/2)(~x−~µ)ᵗ Σ⁻¹ (~x−~µ))
Covariance Matrix Σ (d × d)
E[~x] = ~µ,  E[(~x − ~µ)(~x − ~µ)ᵗ] = Σ
E[xi] = µi,  E[(xi − µi)(xj − µj)] = σij
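A small NumPy sketch of the multivariate normal density as written above; the mean vector and covariance matrix are assumed example values.

import numpy as np

def mvn_pdf(x, mu, cov):
    # p(x) = exp(-0.5*(x-mu)^t Sigma^-1 (x-mu)) / ((2*pi)^(d/2) * |Sigma|^(1/2))
    d = len(mu)
    diff = x - mu
    norm_const = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(cov))
    return np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff) / norm_const

mu = np.array([0.0, 1.0])                  # assumed mean vector
cov = np.array([[2.0, 0.5],                # assumed covariance matrix Sigma
                [0.5, 1.0]])
print(mvn_pdf(np.array([0.5, 1.5]), mu, cov))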
VU Pattern Recognition II
Statistical Classifiers
Bayesian Decision Theory
2D Gaussian
FIGURE 2.9. Samples drawn from a two-dimensional Gaussian lie in a cloud centered on the mean ~µ. The ellipses show lines of equal probability density of the Gaussian. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Statistical Classifiers
Bayesian Decision Theory
Four Categories
FIGURE 2.16. The decision regions for four normal distributions. Even with such a low number of categories, the shapes of the boundary regions can be rather complex. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Statistical Classifiers
Bayesian Decision Theory
Classification Errors
Bayes Error: overlapping densities, inherent problem property
Model Error: incorrect model
Estimation Error: finite sample of training data
VU Pattern Recognition II
Statistical Classifiers
Bayesian Decision Theory
Bayes Error and Dimensionality
FIGURE 3.3. Two three-dimensional distributions have nonoverlapping densities, and thus in three dimensions the Bayes error vanishes. When projected to a subspace, here the two-dimensional x1–x2 subspace or a one-dimensional x1 subspace, there can be greater overlap of the projected distributions, and hence greater Bayes error. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Nonparametric Techniques
Density Estimation
Unknown Densities
Real problems: multi–modal; parametric densities: uni–modal → estimation of densities directly from data
P that pattern ~x falls in region R: P = ∫R p(~x) d~x
n patterns, probability that k patterns are in R: Pk = (n choose k) P^k (1 − P)^(n−k),  E[k] = nP
Assuming small region R → p(~x) ≈ const → ∫R p(~x) d~x ≈ p(~x) V → p(~x) ≈ k/(nV)
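A toy illustration of the estimate p(~x) ≈ k/(nV): count how many of the n samples fall in a small region around x and divide by n times the region's volume. The standard-normal data and the interval half-width h are assumptions made for the example.

import numpy as np

rng = np.random.default_rng(0)
n = 10000
samples = rng.normal(0.0, 1.0, n)          # assumed data from a standard normal

def density_estimate(x, h=0.2):
    # p(x) ~= k/(n*V): k samples fall in the interval [x-h, x+h] of volume V = 2h
    k = np.sum(np.abs(samples - x) < h)
    return k / (n * 2 * h)

print(density_estimate(0.0))               # compare with 1/sqrt(2*pi) ~= 0.399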
VU Pattern Recognition II
Nonparametric Techniques
Density Estimation
Relative Probability
FIGURE 4.1. The relative probability an estimate given by Eq. 4 will yield a particular value for the probability density, here where the true probability was chosen to be 0.7. Each curve is labeled by the total number of patterns n sampled, and is scaled to give the same maximum (at the true probability). The form of each curve is binomial, as given by Eq. 2. For large n, such binomials peak strongly at the true probability. In the limit n → ∞, the curve approaches a delta function, and we are guaranteed that our estimate will give the true probability. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Nonparametric Techniques
Density Estimation
Sample Size
Estimate p(~x) ≈ k/(nV) is dependent on size of V
If V → 0, p(~x) would be exact, but no more samples in V
Assuming infinite pattern set with decreasing Vn
n–th estimate pn(~x) = kn/(n Vn)
For convergence of pn(~x) → p(~x): lim n→∞ Vn = 0,  lim n→∞ kn = ∞,  lim n→∞ kn/n = 0
Decreasing Vn, e.g., Vn = 1/√n → Parzen Windows
Increasing kn, e.g., kn = √n → kn–Nearest–Neighbors
VU Pattern Recognition II
Nonparametric Techniques
Density Estimation
Point Density Estimation
FIGURE 4.2. There are two leading methods for estimating the density at a point, here at the center of each square (shown for n = 1, 4, 9, 16, 100). The one shown in the top row is to start with a large volume centered on the test point and shrink it according to a function such as Vn = 1/√n. The other method, shown in the bottom row, is to decrease the volume in a data-dependent way, for instance letting the volume enclose some number kn = √n of sample points. The sequences in both cases represent random variables that generally converge and allow the true density at the test point to be calculated. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Nonparametric Techniques
k–Nearest–Neighbor Estimation
Prototype Estimation
Estimate density at arbitrary ~x by kn nearest neighbors of ~x
pn(~x) = kn/(n Vn)  (neighbors are training patterns)
Dense neighbors → small Vn → good resolution
Sparse neighbors → large Vn → bad resolution
Problem: often ∫ pn(~x) d~x > 1
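A 1D sketch of the kn-nearest-neighbor density estimate pn(~x) = kn/(n Vn), where Vn is taken as the interval that just reaches the kn-th nearest training pattern; the data and the choice k = 16 are assumptions.

import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(0.0, 1.0, 256)           # assumed 1D training patterns

def knn_density(x, k=16):
    # p_n(x) = k_n / (n * V_n), V_n = length of the interval reaching the k-th neighbor
    r = np.sort(np.abs(data - x))[k - 1]   # distance to the k-th nearest neighbor
    return k / (len(data) * 2 * r)

for x in (-2.0, 0.0, 2.0):
    print(x, knn_density(x))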
VU Pattern Recognition II
Nonparametric Techniques
k–Nearest–Neighbor Estimation
1D kNN Estimate
FIGURE 4.10. Eight points in one dimension and the k-nearest-neighbor density estimates, for k = 3 and 5. Note especially that the discontinuities in the slopes in the estimates generally lie away from the positions of the prototype points. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Nonparametric Techniques
k–Nearest–Neighbor Estimation
2D kNN Estimate
FIGURE 4.11. The k-nearest-neighbor estimate of a two-dimensional density for k = 5. Notice how such a finite n estimate can be quite "jagged," and notice that discontinuities in the slopes generally occur along lines away from the positions of the points themselves. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Nonparametric Techniques
k–Nearest–Neighbor Estimation
Unimodal and Bimodal 1D kNN Estimates
FIGURE 4.12. Several k-nearest-neighbor estimates of two unidimensional densities: a Gaussian and a bimodal distribution (shown for n = 1, 16, 256, and n → ∞ with kn = 1, 4, 16, ∞). Notice how the finite n estimates can be quite "spiky." From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Nonparametric Techniques
k–Nearest–Neighbor Estimation
Estimation of A Posteriori Probabilities
Samples of different classes, what is P(ωi|~x)?
Estimate for pn(~x, ωi) = ki/(nV)  (in arbitrary V)
Estimate for P(ωi|~x) = pn(~x, ωi) / Σj=1..c pn(~x, ωj) = ki/k
With n → ∞ and Bayes Rule: optimal performance (Parzen and kNN)
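A small sketch of the posterior estimate P(ωi|~x) ≈ ki/k, i.e. the fraction of the k nearest neighbors that belong to class i; deciding for the largest fraction is exactly the majority-vote k-NN rule of the following slides. The 2D toy data are an assumption.

import numpy as np

rng = np.random.default_rng(2)
# Assumed 2D toy data: 50 patterns per class
X = np.vstack([rng.normal([0.0, 0.0], 1.0, (50, 2)),
               rng.normal([2.0, 2.0], 1.0, (50, 2))])
labels = np.array([0] * 50 + [1] * 50)

def knn_posterior(x, k=5):
    # P(w_i|x) ~= k_i / k: fraction of the k nearest neighbors belonging to class i
    nearest = labels[np.argsort(np.linalg.norm(X - x, axis=1))[:k]]
    return np.bincount(nearest, minlength=2) / k

post = knn_posterior(np.array([1.0, 1.0]))
print(post, "-> decide class", int(post.argmax()) + 1)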
VU Pattern Recognition II
Nonparametric Techniques
k–Nearest–Neighbor Estimation
Nearest Neighbor Rule
Single nearest neighbor is ~x′ (k = 1)
Class label of ~x′ is θ′ (random variable)
P(θ′ = ωi) = P(ωi|~x′) ≈ P(ωi|~x)  (for large n)
Assumption of 1NN: P(ωi|~x′) is the largest probability
If true (e.g., P ≈ 1, or P ≈ 1/c), then 1NN close to Bayes Error
Average error probability P(e) = ∫ P(e|~x) p(~x) d~x
P(e|~x) = 1 − P(ωi|~x′) is the "minimum" P*(e|~x);  P*(e) = ∫ P*(e|~x) p(~x) d~x
1NN error P = lim n→∞ Pn(e);  P* ≤ P ≤ P*(2 − (c/(c−1)) P*)
VU Pattern Recognition II
Nonparametric Techniques
k–Nearest–Neighbor Estimation
Voronoi Tessellation
FIGURE 4.13. In two dimensions, the nearest-neighbor algorithm leads to a partitioning of the input space into Voronoi cells, each labeled by the category of the training point it contains. In three dimensions, the cells are three-dimensional, and the decision boundary resembles the surface of a crystal. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Nonparametric Techniques
k–Nearest–Neighbor Estimation
1NN Error Rate Bounds
FIGURE 4.14. Bounds on the nearest-neighbor error rate P in a c-category problem given infinite training data, where P* is the Bayes error (Eq. 52). At low error rates, the nearest-neighbor error rate is bounded above by twice the Bayes rate. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Nonparametric Techniques
k–Nearest–Neighbor Estimation
k–Nearest–Neighbor Rule
Straightforward extension: k neighbors
Majority voting: P(ωm|~x) is the largest probability (most prototypes in class m)
If k → ∞, then the k–NN rule becomes optimal
VU Pattern Recognition II
Nonparametric Techniques
k–Nearest–Neighbor Estimation
5NN in 2D
FIGURE 4.15. The k-nearest-neighbor query starts at the test point x and grows a spherical region until it encloses k training samples, and it labels the test point by a majority vote of these samples. In this k = 5 case, the test point x would be labeled the category of the black points. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Nonparametric Techniques
k–Nearest–Neighbor Estimation
kNN Error Rate Bounds
FIGURE 4.16. The error rate for the k-nearest-neighbor rule for a two-category problem is bounded by Ck(P*) in Eq. 54. Each curve is labeled by k (k = 1, 3, 5, 9, 15); when k = ∞, the estimated probabilities match the true probabilities and thus the error rate is equal to the Bayes rate, that is, P = P*. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Nonparametric Techniques
k–Nearest–Neighbor Estimation
Metrics
What is a distance?
Properties of Metrics
Nonnegativity: D(~a,~b) ≥ 0
Reflexivity: D(~a,~b) = 0 iff ~a = ~b
Symmetry: D(~a,~b) = D(~b,~a)
Triangle inequality: D(~a,~b) + D(~b,~c) ≥ D(~a,~c)
Scaling of feature values equivalent to changing the metric
VU Pattern Recognition II
Nonparametric Techniques
k–Nearest–Neighbor Estimation
Scaling is Change of Metric
FIGURE 4.18. Scaling the coordinates of a feature space can change the distance relationships computed by the Euclidean metric. Here we see how such scaling can change the behavior of a nearest-neighbor classifier. Consider the test point x and its nearest neighbor. In the original space (left), the black prototype is closest. In the figure at the right, the x1 axis has been rescaled by a factor 1/3; now the nearest prototype is the red one. If there is a large disparity in the ranges of the full data in each dimension, a common procedure is to rescale all the data to equalize such ranges, and this is equivalent to changing the metric in the original space. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Nonparametric Techniques
k–Nearest–Neighbor Estimation
Class of Metrics
Minkowski Metric (Lk Norm): Lk(~a, ~b) = (Σi=1..d |ai − bi|^k)^(1/k)
L1 Norm: Manhattan distance
L2 Norm: Euclidean distance
L∞ Norm: Maximum of projected distances
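A direct NumPy transcription of the Minkowski metric; the example vectors are arbitrary.

import numpy as np

def minkowski(a, b, k):
    # L_k(a, b) = (sum_i |a_i - b_i|^k)^(1/k)
    return np.sum(np.abs(a - b) ** k) ** (1.0 / k)

def l_inf(a, b):
    # L_inf norm: maximum of the projected (per-coordinate) distances
    return np.max(np.abs(a - b))

a = np.array([0.0, 0.0, 0.0])              # arbitrary example points
b = np.array([1.0, 2.0, 2.0])
print(minkowski(a, b, 1), minkowski(a, b, 2), minkowski(a, b, 4), l_inf(a, b))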
VU Pattern Recognition II
Nonparametric Techniques
k–Nearest–Neighbor Estimation
Minkowski Metric
FIGURE 4.19. Each colored surface consists of points a distance 1.0 from the origin, measured using different values for k in the Minkowski metric (k is printed in red). Thus the white surfaces correspond to the L1 norm (Manhattan distance), the light gray sphere corresponds to the L2 norm (Euclidean distance), the dark gray ones correspond to the L4 norm, and the pink box corresponds to the L∞ norm. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Linear Discriminant Functions
Decision Surfaces
Discriminant Functions
Assumption: we know the form of discriminant functions (not probability densities)
Problem: determine parameters of discriminant functions
Method: gradient descent of criterion functions (based on training set)
VU Pattern Recognition II
Linear Discriminant Functions
Decision Surfaces
Linear Classifier
FIGURE 5.1. A simple linear classifier having d input units, each corresponding to the values of the components of an input vector. Each input feature value xi is multiplied by its corresponding weight wi; the effective input at the output unit is the sum of all these products, Σ wi xi. We show in each unit its effective input-output function. Thus each of the d input units is linear, emitting exactly the value of its corresponding feature value. The single bias unit always emits the constant value 1.0. The single output unit emits a +1 if wᵗx + w0 > 0 or a −1 otherwise. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Linear Discriminant Functions
Decision Surfaces
Linear Discriminant Functions
Linear discriminant function g(~x) = ~wᵗ~x + w0  (weight vector ~w, bias w0)
Two classes: g(~x) > 0 → ω1, else ω2;  or ~wᵗ~x > −w0
Decision surface is a hyperplane; for ~x1, ~x2 on the boundary: ~wᵗ~x1 + w0 = ~wᵗ~x2 + w0 → ~wᵗ(~x1 − ~x2) = 0  (~w is normal vector)
Hyperplane H divides space in two half–spaces: R1 is positive side (g(~x) > 0), R2 is negative side (g(~x) < 0)
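A minimal sketch of the two-class linear discriminant g(~x) = ~wᵗ~x + w0 and the sign-based decision; the weight vector and bias are assumed example values.

import numpy as np

w = np.array([1.0, -2.0])                  # assumed weight vector (normal to H)
w0 = 0.5                                   # assumed bias

def g(x):
    # g(x) = w^t x + w0
    return w @ x + w0

def classify(x):
    # decide w1 if g(x) > 0, else w2; H = {x : g(x) = 0} is the decision hyperplane
    return 1 if g(x) > 0 else 2

print(classify(np.array([2.0, 0.0])), classify(np.array([0.0, 2.0])))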
VU Pattern Recognition II
Linear Discriminant Functions
Decision Surfaces
Multiple Classes
Variant: c dichotomizers (ωi, not ωi)
Variant: c(c−1)/2 dichotomizers (all class pairs)
Variant: linear machine, discriminant functions gi(~x), i = 1, . . . , c
Decision boundary gi(~x) = gj(~x) → (~wi − ~wj)ᵗ~x + (wi0 − wj0) = 0
(~wi − ~wj) ⊥ Hij,  r = (gi(~x) − gj(~x)) / ||~wi − ~wj||
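A sketch of the linear-machine variant: evaluate all gi(~x) = ~wiᵗ~x + wi0 and assign ~x to the class with the largest value; the weight matrix and biases are illustrative assumptions.

import numpy as np

# Assumed linear machine for c = 3 classes: row i of W is the weight vector w_i,
# and w0[i] is the corresponding bias w_i0.
W = np.array([[ 1.0,  0.0],
              [-1.0,  1.0],
              [ 0.0, -1.0]])
w0 = np.array([0.0, 0.5, -0.2])

def linear_machine(x):
    # assign x to the class with the largest discriminant g_i(x) = w_i^t x + w_i0
    return int(np.argmax(W @ x + w0)) + 1

print(linear_machine(np.array([2.0, 1.0])))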
VU Pattern Recognition II
Linear Discriminant Functions
Decision Surfaces
Dichotomizers in a Four–class Problem
FIGURE 5.3. Linear decision boundaries for a four-class problem. The top figure shows ωi/not ωi dichotomies while the bottom figure shows ωi/ωj dichotomies and the corresponding decision boundaries Hij. The pink regions have ambiguous category assignments. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Linear Discriminant Functions
Decision Surfaces
Linear Machines in Multi–class Problems
FIGURE 5.4. Decision boundaries produced by a linear machine for a three-class problem and a five-class problem. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Linear Discriminant Functions
Decision Surfaces
Generalized Linear Discriminant Functions
More complex decision boundaries, e.g., quadratic discriminant g(~x) = w0 + Σi=1..d wi xi + Σi=1..d Σj=1..d wij xi xj
Generalized LDF g(~x) = Σi=1..d̂ ai yi(~x) = ~aᵗ~y
The d̂ functions yi(~x) map points from d–dimensional ~x–space to d̂–dimensional ~y–space
Example g(x) = a1 + a2 x + a3 x²,  ~y = (1, x, x²)ᵗ
Decision boundary is linear in ~y–space
Transformed density (in ~y–space) is degenerate
If d̂ is large, huge number of parameters (requires a large training data set)
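A sketch of the worked example above: the mapping ~y = (1, x, x²)ᵗ makes g(x) = a1 + a2 x + a3 x² a linear function ~aᵗ~y of ~y; the coefficient values are an assumption chosen so that the positive region in x-space is not simply connected.

import numpy as np

a = np.array([-1.0, 0.0, 1.0])             # assumed coefficients: g(x) = -1 + x^2

def phi(x):
    # map scalar x to y = (1, x, x^2)^t; g is then linear in y
    return np.array([1.0, x, x * x])

def g(x):
    return a @ phi(x)

# R1 = {x : g(x) > 0} = {|x| > 1} is not simply connected in x-space
print([(x, bool(g(x) > 0)) for x in (-2.0, 0.0, 2.0)])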
VU Pattern Recognition II
Linear Discriminant Functions
Decision Surfaces
From 1D to 3D
FIGURE 5.5. The mapping y = (1, x, x²)ᵗ takes a line and transforms it to a parabola in three dimensions. A plane splits the resulting y-space into regions corresponding to two categories, and this in turn gives a nonsimply connected decision region in the one-dimensional x-space. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Linear Discriminant Functions
Decision Surfaces
From 2D to 3D
FIGURE 5.6. The two-dimensional input space x is mapped through a polynomial function f to y. Here the mapping is y1 = x1, y2 = x2 and y3 ∝ x1 x2. A linear discriminant in this transformed space is a hyperplane, which cuts the surface. Points to the positive side of the hyperplane H correspond to category ω1, and those beneath it correspond to category ω2. Here, in terms of the x space, R1 is not simply connected. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Linear Discriminant Functions
Decision Surfaces
Linearly Separable Dichotomy
Two classes, samples ~yi: ~aᵗ~yi > 0 → ω1, ~aᵗ~yi < 0 → ω2
"Normalization" of ω2: ~yi = −~yi → ~aᵗ~yi > 0 ∀~yi
Solution region defines all possible values of ~a: intersection of n half–spaces (~aᵗ~yi = 0)
Margin b > 0, ~aᵗ~yi ≥ b; the new solution region has distance b/||~yi|| from the old boundaries
VU Pattern Recognition II
Linear Discriminant Functions
Decision Surfaces
Solution Region and Normalization
FIGURE 5.8. Four training samples (black for ω1, red for ω2) and the solution region in feature space. The figure on the left shows the raw data; the solution vector leads to a plane that separates the patterns from the two categories. In the figure on the right, the red points have been "normalized", that is, changed in sign. Now the solution vector leads to a plane that places all "normalized" points on the same side. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Linear Discriminant Functions
Decision Surfaces
Solution Region with Margins
FIGURE 5.9. The effect of the margin on the solution region. At the left is the case of no margin (b = 0), equivalent to a case such as shown at the left in Fig. 5.8. At the right is the case b > 0, shrinking the solution region by margins b/||~yi||. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Linear Discriminant Functions
Decision Surfaces
Gradient Descent Solutions
Set of linear inequalities ~aᵗ~yi > 0; define criterion function J(~a), which is minimized for a solution vector ~a*
Minimizing a scalar function J(~a) by gradient descent: ~a(k+1) = ~a(k) − η(k) ∇J(~a(k))
Second–order expansion J(~a) ≈ J(~a(k)) + ∇Jᵗ(~a − ~a(k)) + (1/2)(~a − ~a(k))ᵗ H (~a − ~a(k)),  H is the Hessian Matrix
Minimize J(~a(k+1)) with η(k) = ||∇J||² / (∇Jᵗ H ∇J)
J(~a) ∼ ~a² → H = const. → η = const.
Minimize the second–order expansion with respect to ~a(k+1) → Newton Descent ~a(k+1) = ~a(k) − H⁻¹∇J  (expensive)
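A sketch of gradient descent with the optimal step length η(k) = ||∇J||²/(∇JᵗH∇J) on an assumed quadratic criterion; for a quadratic J the Hessian H is constant, and a single Newton step would reach the minimizer H⁻¹~b directly.

import numpy as np

# Assumed quadratic criterion J(a) = 0.5*a^t H a - b^t a with gradient H a - b
H = np.array([[4.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 1.0])

a = np.zeros(2)
for k in range(50):
    grad = H @ a - b
    if np.linalg.norm(grad) < 1e-12:
        break
    eta = (grad @ grad) / (grad @ H @ grad)   # optimal rate ||grad J||^2 / (grad J^t H grad J)
    a = a - eta * grad

print(a, np.linalg.solve(H, b))               # descent result vs. exact minimizer H^-1 b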
VU Pattern Recognition II
Linear Discriminant Functions
Decision Surfaces
Gradient and Newton Descent
FIGURE 5.10. The sequence of weight vectors given by a simple gradient descent method (red) and by Newton's (second order) algorithm (black). Newton's method typically leads to greater improvement per step, even when using optimal learning rates for both methods. However, the added computational burden of inverting the Hessian matrix used in Newton's method is not always justified, and simple gradient descent may suffice. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Linear Discriminant Functions
Decision Surfaces
Perceptron Criterion Function
"Normalized" inequalities ~aᵗ~yi > 0
Perceptron criterion Jp(~a) = Σ~y∈Y (−~aᵗ~y)  (Y is the set of misclassified patterns)
Gradient ∇Jp = Σ~y∈Y (−~y)
Update rule ~a(k+1) = ~a(k) + η(k) Σ~y∈Yk ~y
Batch vs. single–sample correction
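A sketch of the batch perceptron update ~a(k+1) = ~a(k) + η Σ~y∈Yk ~y on "normalized" augmented patterns; the toy data and the learning rate are assumptions.

import numpy as np

def perceptron_batch(Y, eta=1.0, max_iter=1000):
    # Y holds "normalized" augmented patterns as rows (omega_2 patterns sign-flipped),
    # so a solution vector must satisfy a^t y > 0 for every row y.
    a = np.zeros(Y.shape[1])
    for _ in range(max_iter):
        misclassified = Y[Y @ a <= 0]             # the set Y_k of misclassified patterns
        if len(misclassified) == 0:
            break                                 # all a^t y > 0: solution found
        a = a + eta * misclassified.sum(axis=0)   # a(k+1) = a(k) + eta * sum of misclassified y
    return a

# Assumed toy data: augmented patterns (1, x1, x2), omega_2 rows already negated
Y = np.array([[ 1.0, 2.0, 1.0],
              [ 1.0, 1.5, 2.0],
              [-1.0, 0.5, 0.2],
              [-1.0, 0.1, 1.0]])
print(perceptron_batch(Y))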
VU Pattern Recognition II
Linear Discriminant Functions
Decision Surfaces
Minimum Squared–Error Procedures
Set of equalities ~aᵗ~yi = bi;  bi > 0 are arbitrary constants
Solve Y~a = ~b;  Y is the n × (d+1) matrix containing all training vectors
If Y nonsingular, ~a = Y⁻¹~b; however, Y is mostly rectangular!
Minimizing ~e = Y~a − ~b leads to YᵗY~a = Yᵗ~b → ~a = (YᵗY)⁻¹Yᵗ~b = Y†~b;  Y† is the pseudoinverse, a (d+1) × n matrix
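A sketch of the minimum squared-error solution ~a = Y†~b using NumPy's pseudoinverse; the toy matrix Y and the choice ~b = (1, ..., 1)ᵗ are assumptions.

import numpy as np

# Assumed toy problem: n = 4 "normalized" augmented training vectors as rows of Y
Y = np.array([[ 1.0, 2.0, 1.0],
              [ 1.0, 1.5, 2.0],
              [-1.0, 0.5, 0.2],
              [-1.0, 0.1, 1.0]])
b = np.ones(Y.shape[0])                    # arbitrary positive margins b_i

a = np.linalg.pinv(Y) @ b                  # a = Y^dagger b = (Y^t Y)^-1 Y^t b
print(a, Y @ a)                            # Y a approximates b in the least-squares sense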
VU Pattern Recognition II
Linear Discriminant Functions
Decision Surfaces
Support Vector Machines
Transform patterns to a (much) higher dimension via a nonlinear mapping ϕ(·)
Linear discriminant g(~y) = ~a^t ~y
Distance of ~y_k to the hyperplane H: z_k g(~y_k)/||~a|| ≥ b
z_k = ±1 (normalization), b is the margin
Maximize b under the constraint b·||~a|| = 1 → minimize ||~a|| with inequality constraints
Kuhn–Tucker theorem: optimization with inequality constraints, a generalization of Lagrange multipliers
VU Pattern Recognition II
Linear Discriminant Functions
Decision Surfaces
Maximal Margin SVM
Maximize margin b using the Kuhn–Tucker functional L(~a, ~α) = 1/2 ||~a||² − ∑_{k=1}^n α_k [z_k ~a^t ~y_k − 1]
Resulting dual problem (quadratic optimization): L(~α) = ∑_{k=1}^n α_k − 1/2 ∑_{k,j}^n α_k α_j z_k z_j ~y_j^t ~y_k
with constraints ∑_{k=1}^n z_k α_k = 0 and α_k ≥ 0
Then ~a* = ∑_{i=1}^n z_i α_i* ~y_i (a non-zero α_i indicates a support vector)
Maximal margin b* = (∑_{i=1}^n α_i*)^{−1/2}
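A small numeric sketch of this dual problem, solved with a general-purpose constrained optimizer rather than a dedicated QP solver; the toy patterns and labels are made up, and the bias of the hyperplane is recovered afterwards from a support vector.

import numpy as np
from scipy.optimize import minimize

# made-up linearly separable toy data; labels z in {+1, -1}
Y = np.array([[2.0, 2.0], [2.5, 1.5], [0.5, 0.5], [0.0, 1.0]])
z = np.array([1.0, 1.0, -1.0, -1.0])

G = (z[:, None] * Y) @ (z[:, None] * Y).T         # G_kj = z_k z_j y_k . y_j

def neg_dual(alpha):                              # minimize the negative of L(alpha)
    return 0.5 * alpha @ G @ alpha - alpha.sum()

cons = ({'type': 'eq', 'fun': lambda a: a @ z},)  # sum_k z_k alpha_k = 0
bnds = [(0.0, None)] * len(z)                     # alpha_k >= 0
alpha = minimize(neg_dual, np.zeros(len(z)), bounds=bnds, constraints=cons).x

a_star = (alpha * z) @ Y                          # a* = sum_i z_i alpha_i* y_i
sv = int(np.argmax(alpha))                        # index of a support vector (alpha_i > 0)
bias = z[sv] - a_star @ Y[sv]                     # from z_k (a*.y_k + bias) = 1
margin = 1.0 / np.sqrt(alpha.sum())               # b* = (sum_i alpha_i*)^(-1/2)
print(alpha.round(3), a_star, bias, margin)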
VU Pattern Recognition II
Linear Discriminant Functions
Decision Surfaces
Maximal Margin Hyperplane
FIGURE 5.19. Training a support vector machine consists of finding the optimal hyperplane, that is, the one with the maximum distance from the nearest training patterns. The support vectors are those (nearest) patterns, a distance b from the hyperplane. The three support vectors are shown as solid dots. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Linear Discriminant Functions
Decision Surfaces
Soft Margin SVM
The maximal margin SVM is sensitive to outliers and demands linear separability for a solution
Soft margin SVM: introduce slack variables ξ, z_k g(~y_k) ≥ b − ξ_k (relaxed margin)
Maximize the relaxed margin b with the Kuhn–Tucker functional L(~a, ~α, ~ξ) = 1/2 ||~a||² + C/2 ∑_{k=1}^n ξ_k² − ∑_{k=1}^n α_k [z_k ~a^t ~y_k − 1 + ξ_k]
Again ~a* = ∑_{i=1}^n z_i α_i* ~y_i
Maximal margin b* = (∑_{i=1}^n α_i* − 1/C ∑_{i=1}^n (α_i*)²)^{−1/2}
Depends on the parameter C!
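With this quadratic slack penalty, carrying the stationarity conditions back into the functional leaves the dual in the same form as before except for an extra −1/(2C) ∑ α_k² term, i.e. 1/C added on the diagonal of the Gram matrix; this rederivation is not stated on the slides, so treat it as an assumption. A minimal sketch reusing the previous toy setup, with one deliberately mislabeled outlier and a made-up C:

import numpy as np
from scipy.optimize import minimize

Y = np.array([[2.0, 2.0], [2.5, 1.5], [0.5, 0.5], [0.0, 1.0], [1.8, 1.8]])
z = np.array([1.0, 1.0, -1.0, -1.0, -1.0])        # last pattern is an outlier of class -1
C = 1.0                                           # slack penalty weight

G = (z[:, None] * Y) @ (z[:, None] * Y).T + np.eye(len(z)) / C   # 1/C added on the diagonal

def neg_dual(alpha):
    return 0.5 * alpha @ G @ alpha - alpha.sum()

cons = ({'type': 'eq', 'fun': lambda a: a @ z},)
bnds = [(0.0, None)] * len(z)
alpha = minimize(neg_dual, np.zeros(len(z)), bounds=bnds, constraints=cons).x

a_star = (alpha * z) @ Y
margin = 1.0 / np.sqrt(alpha.sum() - (alpha @ alpha) / C)         # slide formula for b*
print(alpha.round(3), a_star, margin)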
VU Pattern Recognition II
Neural Networks
Multilayer Neural Networks
Real-world problems: a linear discriminant is often not sufficient
NNs also implement a nonlinear mapping to a higher dimension
Learning finds the mapping AND the linear discriminant
Error backpropagation is a least-squares fit to the Bayes discriminant functions
NNs are motivated by biology, but can be explained without it
VU Pattern Recognition II
Neural Networks
XOR Net
FIGURE 6.1. The two-bit parity or exclusive-OR problem can be solved by a three-layer network. At the bottom is the two-dimensional feature x1x2-space, along with thefour patterns to be classified. The three-layer network is shown in the middle. The inputunits are linear and merely distribute their feature values through multiplicative weightsto the hidden units. The hidden and output units here are linear threshold units, eachof which forms the linear sum of its inputs times their associated weight to yield net,and emits a +1 if this net is greater than or equal to 0, and −1 otherwise, as shownby the graphs. Positive or “excitatory” weights are denoted by solid lines, negative or“inhibitory” weights by dashed lines; each weight magnitude is indicated by the line’sthickness, and is labeled. The single output unit sums the weighted signals from thehidden units and bias to form its net, and emits a +1 if its net is greater than or equalto 0 and emits a −1 otherwise. Within each unit we show a graph of its input-outputor activation function—f (net) versus net. This function is linear for the input units, aconstant for the bias, and a step or sign function elsewhere. We say that this networkhas a 2-2-1 fully connected topology, describing the number of units (other than thebias) in successive layers. From: Richard O. Duda, Peter E. Hart, and David G. Stork,Pattern Classification. Copyright c© 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Neural Networks
Network Components
Neurons and synaptic connections (weights)
Net activation net_j = ∑_{i=1}^d x_i w_ji + w_j0 = ∑_{i=0}^d x_i w_ji ≡ ~w_j^t ~x
Neuron output z_k = f(net_k), with f the activation function
A common activation function class is the sigmoid, e.g., f(x) = 1/(1 + e^{−cx})
Basic topologies: feed-forward and recurrent
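A tiny sketch of a single unit's forward computation with a sigmoid activation; the weights and input values are made up.

import numpy as np

def sigmoid(x, c=1.0):
    # f(x) = 1 / (1 + exp(-c x))
    return 1.0 / (1.0 + np.exp(-c * x))

def unit_output(x, w_j):
    # net_j = w_j . [1, x] (x_0 = 1 carries the bias weight w_j0), output = f(net_j)
    net_j = w_j @ np.concatenate(([1.0], x))
    return sigmoid(net_j)

w_j = np.array([-0.5, 0.8, 0.3])      # [w_j0, w_j1, w_j2], made up
x = np.array([1.0, 2.0])
print(unit_output(x, w_j))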
VU Pattern Recognition II
Neural Networks
A 2–4–1 Network
FIGURE 6.2. A 2-4-1 network (with bias) along with the response functions at different units; each hidden and output unit has a sigmoidal activation function f(·). In the case shown, the hidden unit outputs are paired in opposition, thereby producing a "bump" at the output unit. Given a sufficiently large number of hidden units, any continuous function from input to output can be approximated arbitrarily well by such a network. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Neural Networks
NN Decision Boundaries
FIGURE 6.3. Whereas a two-layer network classifier can only implement a linear decision boundary, given an adequate number of hidden units, three-, four- and higher-layer networks can implement arbitrary decision boundaries. The decision regions need not be convex or simply connected. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Neural Networks
Network Learning
Learning as minimization (of the network error)
The error is a function of the network parameters
Gradient descent methods reduce the error
Problem with hidden layers: no direct target values for the hidden units
Backpropagation = iterative local gradient descent; Werbos (1974), Rumelhart, Hinton, Williams (1986)
Error backpropagation: the output error is transmitted backwards as weighted error, and the network weights are updated locally
Weight update ∆w_{j,i} = η δ_j a_i with a generalized error term δ
Common transfer functions: differentiable, nonlinear, monotonic, with an easily computable derivative
VU Pattern Recognition II
Neural Networks
Error–Backpropagation I
(Network sketch: inputs x_i feed hidden units with net activations H_j via weights v_{j,i}; hidden outputs y_j feed output units with net activations I_k via weights w_{k,j}; network outputs z_k.)
H_j = ∑_{i=1}^n v_{j,i} x_i, I_k = ∑_{j=1}^h w_{k,j} y_j, y_j = f(H_j), z_k = f(I_k)
Error E^(p) = 1/2 ∑_{k=1}^m (t_k^(p) − z_k^(p))²
Output layer: ∆w_{k,j} = −η ∂E/∂w_{k,j}
∂E/∂w_{k,j} = (∂E/∂I_k)(∂I_k/∂w_{k,j}) = (∂E/∂I_k) y_j
∂E/∂I_k = (∂E/∂z_k)(∂z_k/∂I_k) = −(t_k − z_k) f′(I_k)
∂E/∂w_{k,j} = −(t_k − z_k) f′(I_k) y_j, and with δ_k = (t_k − z_k) f′(I_k)
∆w_{k,j} = η δ_k y_j
VU Pattern Recognition II
Neural Networks
Error–Backpropagation II
Hidden layer: ∆v_{j,i} = −η ∂E/∂v_{j,i}
∂E/∂v_{j,i} = (∂E/∂H_j)(∂H_j/∂v_{j,i}) = (∂E/∂H_j) x_i
∂E/∂H_j = (∂E/∂y_j)(∂y_j/∂H_j) = (∂E/∂y_j) f′(H_j)
∂E/∂y_j = 1/2 ∑_{k=1}^m ∂(t_k − f(I_k))²/∂y_j = −∑_{k=1}^m (t_k − z_k) f′(I_k) w_{k,j}
With δ_j = f′(H_j) ∑_{k=1}^m δ_k w_{k,j}: ∆v_{j,i} = η δ_j x_i
Local update rules propagating the error from output to input
Presenting all p patterns of the training set once = 1 epoch (complete training takes e.g. 1,000 epochs)
Batch learning (off-line): accumulate the weight changes for all patterns, then update the weights
On-line learning: update the weights after each pattern (a complete training sketch follows below)
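A compact, self-contained sketch of on-line error backpropagation on the XOR problem, using sigmoid units and 0/1 targets rather than the ±1 threshold units of Figure 6.1; the topology (2-2-1 with bias), learning rate, random seed, and epoch count are illustrative assumptions, and an unlucky seed may require more epochs.

import numpy as np

rng = np.random.default_rng(0)

def f(x):  return 1.0 / (1.0 + np.exp(-x))      # sigmoid activation
def fp(x): return f(x) * (1.0 - f(x))           # its derivative f'(x)

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])   # XOR inputs
T = np.array([[0.], [1.], [1.], [0.]])                    # XOR targets

V = rng.normal(scale=0.5, size=(2, 3))          # hidden weights v_{j,i} (last column: bias)
W = rng.normal(scale=0.5, size=(1, 3))          # output weights w_{k,j} (last column: bias)
eta = 0.5

for epoch in range(10000):                      # on-line learning: update after each pattern
    for x, t in zip(X, T):
        xb = np.append(x, 1.0)                  # input plus bias unit
        H = V @ xb                              # hidden net activations H_j
        y = f(H)
        yb = np.append(y, 1.0)                  # hidden outputs plus bias unit
        I = W @ yb                              # output net activations I_k
        z = f(I)
        delta_k = (t - z) * fp(I)               # output error term delta_k
        delta_j = fp(H) * (W[:, :-1].T @ delta_k)   # back-propagated hidden error term delta_j
        W += eta * np.outer(delta_k, yb)        # Delta w_{k,j} = eta * delta_k * y_j
        V += eta * np.outer(delta_j, xb)        # Delta v_{j,i} = eta * delta_j * x_i

for x in X:
    yb = np.append(f(V @ np.append(x, 1.0)), 1.0)
    print(x, np.round(f(W @ yb), 2))            # outputs close to 0, 1, 1, 0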
VU Pattern Recognition II
Neural Networks
Learning Curves
FIGURE 6.6. A learning curve shows the criterion function as a function of the amount of training, typically indicated by the number of epochs or presentations of the full training set. We plot the average error per pattern, that is, 1/n ∑_{p=1}^n J_p. The validation error and the test or generalization error per pattern are virtually always higher than the training error. In some protocols, training is stopped at the first minimum of the validation set. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Neural Networks
XOR Learning Details
FIGURE 6.10. A 2-2-1 backpropagation network with bias and the four patterns of theXOR problem are shown at the top. The middle figure shows the outputs of the hid-den units for each of the four patterns; these outputs move across the y1y2-space asthe network learns. In this space, early in training (epoch 1) the two categories are notlinearly separable. As the input-to-hidden weights learn, as marked by the number ofepochs, the categories become linearly separable. The dashed line is the linear decisionboundary determined by the hidden-to-output weights at the end of learning; indeedthe patterns of the two classes are separated by this boundary. The bottom graph showsthe learning curves—the error on individual patterns and the total error as a functionof epoch. Note that, as frequently happens, the total training error decreases monoton-ically, even though this is not the case for the error on each individual pattern. From:Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyrightc© 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Neural Networks
Backpropagation Variants I
Standard backpropagation: ~w_t = ~w_{t−1} − η ~∇E
Gradient reuse: use ~∇E as long as the error drops
BP with a variable step size (learning rate) η
BP with momentum: ∆~w_t = −η ~∇E + α ∆~w_{t−1}
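A one-function sketch of the momentum update; the weight values, gradient, and the rates η and α are made-up examples.

import numpy as np

def momentum_step(w, grad_E, prev_dw, eta=0.1, alpha=0.9):
    # Delta w_t = -eta * grad E + alpha * Delta w_{t-1}
    dw = -eta * grad_E + alpha * prev_dw
    return w + dw, dw

w, dw = np.array([0.2, -0.3]), np.zeros(2)
w, dw = momentum_step(w, np.array([0.5, -0.1]), dw)
print(w, dw)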
VU Pattern Recognition II
Nonmetric Methods
Decision Trees
Real problems involve nominal data, e.g., car = {green, red, blue}
Rule-based or syntactic methods
Decision tree (DT): a series of questions (nodes) leads to an answer at a leaf (category)
A DT is interpretable (decisions and categories)
VU Pattern Recognition II
Nonmetric Methods
Monothetic Decision Tree
FIGURE 8.1. Classification in a basic decision tree proceeds from top to bottom. The questions asked at each node concern a particular property of the pattern, and the downward links correspond to the possible values. Successive nodes are visited until a terminal or leaf node is reached, where the category label is read. Note that the same question, Size?, appears in different places in the tree and that different questions can have different numbers of branches. Moreover, different leaf nodes, shown in pink, can be labeled by the same category (e.g., Apple). From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Nonmetric Methods
Classification and Regression Trees
CART
Goal: construct pure nodes (ideally, all leaf nodes are pure)
A pure leaf node contains only patterns of a single category
Design issues:
Branching factor = number of splits?
Which query (property) at which node?
When to terminate (declare a leaf node)?
Pruning (simplification)?
Missing data?
VU Pattern Recognition II
Nonmetric Methods
Classification and Regression Trees
Monothetic Decision Boundaries
FIGURE 8.3. Monothetic decision trees create decision boundaries with portions perpendicular to the feature axes. The decision regions are marked R1 and R2 in these two-dimensional and three-dimensional two-category examples. With a sufficiently large tree, any decision boundary can be approximated arbitrarily well in this way. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Nonmetric Methods
Classification and Regression Trees
Entropy Impurity
Every non-binary tree can be transformed into a binary tree
Monothetic (single feature per node) and polythetic (multiple features per node) trees
Any query at a node should gain maximal purity (or minimal impurity)
Entropy impurity of a node N with class "probabilities" P(ω_j): i(N) = −∑_j P(ω_j) ld P(ω_j)
i(N) = 0 → pure node
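A direct sketch of this impurity on the class labels of the patterns at a node (ld = log₂); the label lists are made-up examples.

import numpy as np

def entropy_impurity(labels):
    # i(N) = -sum_j P(w_j) * log2 P(w_j), estimated from the labels at node N
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

print(entropy_impurity(['a', 'a', 'b', 'b']))   # maximally impure for two classes
print(entropy_impurity(['a', 'a', 'a']))        # pure node: impurity 0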
VU Pattern Recognition II
Nonmetric Methods
Classification and Regression Trees
Other Impurity Measures
Gini impurity (a generalization of the variance impurity): i(N) = ∑_{i≠j} P(ω_i) P(ω_j) = 1/2 [1 − ∑_j P²(ω_j)]
This is the expected error rate at N (if the label is selected from the class distribution at N)
Misclassification impurity (its discontinuous derivative may cause problems): i(N) = 1 − max_j P(ω_j)
This is the minimal probability of a misclassified pattern at N
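Both measures in the same style as the entropy sketch above; the label list is again made up.

import numpy as np

def gini_impurity(labels):
    # i(N) = 1/2 * (1 - sum_j P(w_j)^2)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 0.5 * (1.0 - float((p ** 2).sum()))

def misclassification_impurity(labels):
    # i(N) = 1 - max_j P(w_j)
    _, counts = np.unique(labels, return_counts=True)
    return 1.0 - counts.max() / counts.sum()

print(gini_impurity(['a', 'a', 'b']), misclassification_impurity(['a', 'a', 'b']))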
VU Pattern Recognition II
Nonmetric Methods
Classification and Regression Trees
Greedy Query Search
Select the query with the largest impurity decrease from N to N_L (left child) and N_R (right child): ∆i(N) = i(N) − P_L i(N_L) − (1 − P_L) i(N_R)
Nominal features (exhaustive search), continuous features (gradient descent)
The specific choice of impurity measure is uncritical; more important are the stop-splitting and pruning methods
Multiway splits (B > 2): the simple impurity decrease favors large splits → scale the impurity decrease, gain ratio impurity ∆i′(N, B) = ∆i(N, B) / (−∑_k P_k ld P_k)
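A sketch of the greedy search for a single binary split on one continuous feature; it enumerates candidate thresholds exhaustively (rather than by gradient descent) and uses the entropy impurity from above; the feature values and labels are made up.

import numpy as np

def entropy_impurity(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def best_split(x, labels):
    # maximize delta_i(N) = i(N) - P_L * i(N_L) - (1 - P_L) * i(N_R) over thresholds on x
    order = np.argsort(x)
    x, labels = x[order], labels[order]
    i_N = entropy_impurity(labels)
    best_thr, best_drop = None, -np.inf
    for thr in (x[:-1] + x[1:]) / 2.0:          # candidate thresholds between sorted values
        left = x <= thr
        P_L = left.mean()
        if P_L in (0.0, 1.0):
            continue
        drop = i_N - P_L * entropy_impurity(labels[left]) - (1 - P_L) * entropy_impurity(labels[~left])
        if drop > best_drop:
            best_thr, best_drop = thr, drop
    return best_thr, best_drop

x = np.array([0.1, 0.4, 0.35, 0.8, 0.9])        # made-up feature values
labels = np.array(['a', 'a', 'a', 'b', 'b'])
print(best_split(x, labels))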
VU Pattern Recognition II
Nonmetric Methods
Classification and Regression Trees
Stop Splitting Methods
Naive stop: split until each leaf node has impurity 0 (perfect overfitting); the tree may degenerate to a look-up table (one leaf node per pattern)
Measure split performance with a separate validation set (stop at the minimal error on the validation set)
Impurity threshold ∆i(N) ≤ β: yields unbalanced trees; how to choose β?
Pattern threshold: stop when a node represents a certain (small) number (or percentage) of patterns
Minimum description length (regularization reduces complexity): J(DT) = α·#N + ∑_{LN} i(LN) (LN = leaf nodes)
Statistical significance of the impurity reduction (distribution of ∆i)
VU Pattern Recognition II
Nonmetric Methods
Classification and Regression Trees
Pruning
Stop splitting: insufficient look-ahead (horizon effect) due to the greedy search
Pruning: merge nodes; it starts at the leaf nodes, but any node is possible
Uses the complete data set, so the cost is huge with large data sets
Rule pruning: construct and simplify rules (conjunctions) for each leaf
Context pruning: prune specific rules for specific patterns
Improved interpretability
VU Pattern Recognition II
Nonmetric Methods
Classification and Regression Trees
Feature Extraction
FIGURE 8.5. If the class of node decisions does not match the form of the training data, a very complicated decision tree will result, as shown at the top. Here decisions are parallel to the axes while in fact the data is better split by boundaries along another direction. If, however, "proper" decision forms are used (here, linear combinations of the features), the tree can be quite simple, as shown at the bottom. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Nonmetric Methods
Classification and Regression Trees
Potential Improvements
At each node train a linear classifier → arbitrary linear decision boundaries
Long training, but (again) fast recall
Integrate priors and/or costs by weights
Weighted Gini impurity with costs λ_ij: i(N) = ∑_{ij} λ_ij P(ω_i) P(ω_j)
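A direct sketch of the weighted Gini impurity; the class probabilities and the cost matrix λ are made-up examples.

import numpy as np

def weighted_gini(p, lam):
    # i(N) = sum_{i,j} lambda_ij * P(w_i) * P(w_j)
    p = np.asarray(p, dtype=float)
    return float(p @ np.asarray(lam, dtype=float) @ p)

p = [0.7, 0.3]                        # class probabilities at node N
lam = [[0.0, 1.0],                    # lambda_ij: cost of deciding w_i when w_j is true
       [5.0, 0.0]]
print(weighted_gini(p, lam))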
VU Pattern Recognition II
Nonmetric Methods
Classification and Regression Trees
Multivariate Decision Trees
FIGURE 8.6. One form of multivariate tree employs general linear decisions at each node, giving splits along arbitrary directions in the feature space. In virtually all interesting cases the training data are not linearly separable, and thus the LMS algorithm is more useful than methods that require the data to be linearly separable, even though the LMS need not yield a minimum in classification error (Chapter 5). The tree at the bottom can be simplified by methods outlined in Section 8.4.2. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Nonmetric Methods
Classification and Regression Trees
Missing Attributes
Naive approach: use only non-deficient patterns
Better: use only non–deficient attributes
Works with training, but how to classify a deficient pattern?
Surrogate splits: find alternative splits using different featureshaving maximal predictive association (correlation)
Virtual values, e.g., mean value of non–deficient feature values
VU Pattern Recognition II
Nonmetric Methods
Classification and Regression Trees
ID3
The name ID3 stems from "third interactive dichotomizer"
Nominal features (real-valued features are binned)
The branching factor at a node equals the number of values (bins) of the queried attribute
Train until all nodes are pure or no more features are left
Results in a tree depth of (at most) the number of features
No pruning
VU Pattern Recognition II
Nonmetric Methods
Classification and Regression Trees
C4.5
Refinement of ID3
B > 2 with nominal features, B = 2 with real features
Pruning based on statistical significance of splits
Missing features: sample all subtrees of missing feature usingtraining data
Additional rule pruning, can prune any node (see Figure 8.6)
VU Pattern Recognition II
Stochastic Methods
Stochastic Search
Analytical methods are problematic in high dimensions or with complex models
A large number of local optima makes gradient descent very costly
Stochastic methods try to localize promising search regions
Pure random search is often not sufficient
Simulated annealing and Boltzmann learning are motivated by statistical mechanics
Evolutionary computation is motivated by evolutionary principles from biology
VU Pattern Recognition II
Stochastic Methods
Energy Minimization
Example: minimizing (model) energy in a (Hopfield) network
Energy E = -\frac{1}{2} \sum_{i,j=1}^{N} w_{ij} s_i s_j, \quad s_i = \pm 1
Minimize energy of spin–glass model
Probability of energy state, Boltzmann factor
P(\gamma) = \frac{e^{-E_\gamma / T}}{Z(T)}
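As a small illustration (not part of the slides), the following Python sketch enumerates every configuration γ of a tiny network, evaluates the energy above, and computes the Boltzmann probabilities P(γ) = e^{-E_γ/T} / Z(T); the network size N = 4, the temperature, and the random symmetric weights are assumptions made only for this example.

import itertools
import math
import random

N, T = 4, 1.0                                   # illustrative network size and temperature
random.seed(0)
w = [[0.0] * N for _ in range(N)]               # symmetric weights, no self-coupling
for i in range(N):
    for j in range(i + 1, N):
        w[i][j] = w[j][i] = random.uniform(-1.0, 1.0)

def energy(s):
    # E = -1/2 * sum_{i,j} w_ij * s_i * s_j  for states s_i = +-1
    return -0.5 * sum(w[i][j] * s[i] * s[j] for i in range(N) for j in range(N))

configs = list(itertools.product([-1, +1], repeat=N))        # all 2^N configurations gamma
energies = [energy(s) for s in configs]
Z = sum(math.exp(-E / T) for E in energies)                  # partition function Z(T)
probs = [math.exp(-E / T) / Z for E in energies]             # Boltzmann factors P(gamma)

best = min(range(len(configs)), key=lambda g: energies[g])
print("lowest-energy configuration:", configs[best], "P =", round(probs[best], 3))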
VU Pattern Recognition II
Stochastic Methods
Recurrent Net
FIGURE 7.1. The class of optimization problems of Eq. 1 can be viewed in terms of a network of nodes or units, each of which can be in the s_i = +1 or s_i = −1 state. Every pair of nodes i and j is connected by bidirectional weights w_ij; if a weight between two nodes is zero, then no connection is drawn. (Because the networks we shall discuss can have an arbitrary interconnection, there is no notion of layers as in multilayer neural networks.) The optimization problem is to find a configuration (i.e., an assignment of all s_i) that minimizes the energy described by Eq. 1. While our convention was to show functions inside each node's circle, our convention in so-called Boltzmann networks is to indicate the state of each node. The configuration of the full network is indexed by an integer γ, and because here there are 17 binary nodes, γ is bounded 0 ≤ γ < 2^17. When such a network is used for pattern recognition, the input and output nodes are said to be visible, and the remaining nodes are said to be hidden. The states of the visible nodes and hidden nodes are indexed by α and β, respectively, and in this case are bounded 0 ≤ α < 2^10 and 0 ≤ β < 2^7. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Stochastic Methods
Energy Landscape
[Figure 7.2: energy E over parameters x1, x2; a smooth landscape (left) and a "golf course" landscape (right).]
FIGURE 7.2. The energy function or energy "landscape" on the left is meant to suggest the types of optimization problems addressed by simulated annealing. The method uses randomness, governed by a control parameter or "temperature" T, to avoid getting stuck in local energy minima and thus to find the global minimum, like a small ball rolling in the landscape as it is shaken. The pathological "golf course" landscape at the right is, generally speaking, not amenable to solution via simulated annealing because the region of lowest energy is so small and is surrounded by energetically unfavorable configurations. The configuration spaces of the problems we shall address are discrete and are more accurately displayed in Fig. 7.6. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Stochastic Methods
Simulated Annealing
Simulated Annealing Basics
Stochastic search for state of lower energy
Basic idea: occasionally go to a higher-energy state to possibly escape local minima
After a random change of parameter s_i: \Delta E_{ab} = E_b - E_a
accept E_b if E_b < E_a, or
accept E_b with P = e^{-\Delta E_{ab}/T} otherwise
Annealing schedule, e.g., T(k+1) = c T(k), 0 < c < 1, typically 0.8 < c < 0.99 (see the sketch below)
High initial temperature, large c, and large k_max (number of iterations) lead to good results (but also high computational cost)
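A minimal Python sketch of this acceptance rule with the geometric annealing schedule, flipping a single randomly chosen s_i per step as in the network example above; the concrete values of T(0), c, and k_max are illustrative choices, not prescribed by the slides.

import math
import random

def simulated_annealing(energy, N, T0=10.0, c=0.9, k_max=1000, seed=1):
    # minimize energy(s) over s in {-1, +1}^N by single-state flips
    rng = random.Random(seed)
    s = [rng.choice([-1, +1]) for _ in range(N)]
    E_a, T = energy(s), T0
    for k in range(k_max):
        i = rng.randrange(N)                     # random change of a single parameter s_i
        s[i] = -s[i]
        E_b = energy(s)
        dE = E_b - E_a                           # Delta E_ab = E_b - E_a
        if dE < 0 or rng.random() < math.exp(-dE / T):
            E_a = E_b                            # accept E_b (always if lower, else with prob e^{-dE/T})
        else:
            s[i] = -s[i]                         # reject: undo the flip and keep E_a
        T *= c                                   # annealing schedule T(k+1) = c * T(k)
    return s, E_a

# usage with the energy() and N defined in the previous sketch:
# s_min, E_min = simulated_annealing(energy, N)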
VU Pattern Recognition II
Stochastic Methods
Simulated Annealing
Simulated Annealing Experiment
[Figure 7.3: annealing schedule T(k), the trajectory through the 2^6 = 64 configurations γ of s1 to s6 from the beginning to the end of the anneal, and the energy E(k) over iterations k.]
FIGURE 7.3. Stochastic simulated annealing (Algorithm 1) uses randomness, governed by a control parameter or "temperature" T(k), to search through a discrete space for a minimum of an energy function. In this example there are N = 6 variables; the 2^6 = 64 configurations are shown along the bottom as a column of + and − symbols. The plot shows the associated energy of each configuration, given by Eq. 1 for randomly chosen weights. Every transition corresponds to the change of just a single s_i. (The configurations have been arranged so that adjacent ones differ by the state of just a single node; nevertheless, most transitions corresponding to a single node appear far apart in this ordering.) Because the system energy is invariant with respect to a global interchange s_i ↔ −s_i, there are two "global" minima. The graph at the upper left shows the annealing schedule, the decreasing temperature versus iteration number k. The middle portion shows the configuration versus iteration number generated by Algorithm 1. The trajectory through the configuration space is colored red for transitions that increase the energy and black for those that decrease the energy. Such energetically unfavorable (red) transitions become rarer later in the anneal. The graph at the right shows the full energy E(k), which decreases to the global minimum. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Stochastic Methods
Simulated Annealing
Empirical Energy States
[Figure 7.4: estimated probability P(γ) over all 64 configurations of s1 to s6 at four temperatures T(k), together with the expected energy E[E] versus iteration k.]
FIGURE 7.4. An estimate of the probability P(γ) of being in a configuration denoted by γ is shown for four temperatures during a slow anneal. (These estimates, based on a large number of runs, are nearly the theoretical values e^{−E_γ/T}.) Early, at high T, each configuration is roughly equal in probability, while late, at low T, the probability is strongly concentrated at the global minima. The expected value of the energy, E[E] (i.e., averaged at temperature T), decreases gradually during the anneal. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Projects
Project Teams
2 students form a team
Implementation of a pattern classification method
Use existing (free) software (e.g., WEKA)
Data import, pre–processing, results (graphics, tables)
Methods must be understood, in particular their parameters!
Project report (February 10, 2014)
VU Pattern Recognition II
Projects
Project Topics
k-NN Classifier (different metrics): Kauba, Mayer
Artificial Neural Networks (Boone): Reissig, DiStolfo
Support Vector Machine: Linortner, N.N.
Decision Tree (C4.5)
Simulated Annealing (meta)
Genetic Algorithm (meta, JEvolution): Auracher, Herzog, Kirchgasser
Genetic Programming (optional, JEvolution)
VU Pattern Recognition II
Projects
Project Data Sets
Data Sets
UCI Machine Learning Archive: http://www.ics.uci.edu/~mlearn/MLRepository.html
Ionosphere: Radar Signals
Semeion Handwritten Digit: Digit Recognition
Wine Quality: Wine Critic
Leave–one–out validation (common partitioning)
Confusion matrix
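A minimal, library-free Python sketch (the actual projects would rather use WEKA or a comparable toolkit) of leave-one-out validation with a simple 1-NN classifier and a confusion-matrix tally; the Euclidean metric and the toy two-class data are assumptions made only for illustration.

import math
from collections import defaultdict

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nn_predict(train, query):
    # 1-NN: return the label of the closest training sample
    return min(train, key=lambda xy: euclidean(xy[0], query))[1]

def leave_one_out(data):
    # hold out each sample once and tally (true class, predicted class) counts
    confusion = defaultdict(int)
    for i, (x, y_true) in enumerate(data):
        train = data[:i] + data[i + 1:]
        confusion[(y_true, nn_predict(train, x))] += 1
    return confusion

# toy two-class data set: (feature vector, class label)
data = [((1.0, 1.1), "A"), ((0.9, 1.0), "A"), ((1.2, 0.8), "A"),
        ((3.0, 3.2), "B"), ((2.9, 3.1), "B"), ((3.1, 2.8), "B")]
for (t, p), count in sorted(leave_one_out(data).items()):
    print("true =", t, " predicted =", p, " count =", count)

The confusion matrix then has one row per true class and one column per predicted class, with the off-diagonal counts showing the classifier's errors.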