VU Pattern Recognition II
Helmut A. Mayer
Department of Computer Sciences, University of Salzburg
WS 13/14
VU Pattern Recognition II
Outline
1 Introduction
2 Statistical Classifiers: Bayesian Decision Theory
3 Nonparametric Techniques: Density Estimation, k–Nearest–Neighbor Estimation
4 Linear Discriminant Functions: Decision Surfaces
5 Neural Networks
6 Nonmetric Methods: Classification and Regression Trees
7 Stochastic Methods: Simulated Annealing
8 Projects
VU Pattern Recognition II
Introduction
Human vs. Machine
Human Perception
Senses to neural patterns
Machine Perception
Sensors to value patterns
Patterns are everywhere...
Images, Time Series, Medical Diagnosis, Customer Analysis (only a few examples)
Features build Model
Fish Example
VU Pattern Recognition II
Introduction
Salmon or Sea Bass
FIGURE 1.1. The objects to be classified are first sensed by a transducer (camera), whose signals are preprocessed. Next the features are extracted and finally the classification is emitted, here either "salmon" or "sea bass." Although the information flow is often chosen to be from the source to the classifier, some systems employ information flow in which earlier levels of processing can be altered based on the tentative or preliminary response in later levels (gray arrows). Yet others combine two or more stages into a unified step, such as simultaneous segmentation and feature extraction. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Introduction
Fish Length Histogram
FIGURE 1.2. Histograms for the length feature for the two categories. No single threshold value of the length will serve to unambiguously discriminate between the two categories; using length alone, we will have some errors. The value marked l* will lead to the smallest number of errors, on average. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Introduction
Fish Lightness Histogram
FIGURE 1.3. Histograms for the lightness feature for the two categories. No single threshold value x* (decision boundary) will serve to unambiguously discriminate between the two categories; using lightness alone, we will have some errors. The value x* marked will lead to the smallest number of errors, on average. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Introduction
Decision Theory
Cost of an Error?
Salmon tastes better.. ;-)
Minimization of cost (risk)
Decision Rule/Boundary
Improving Recognition
Feature Vector ~x = (lightness, width)ᵗ
2D Decision Boundary
VU Pattern Recognition II
Introduction
2D Feature Space
FIGURE 1.4. The two features of lightness and width for sea bass and salmon. The dark line could serve as a decision boundary of our classifier. Overall classification error on the data shown is lower than if we use only one feature as in Fig. 1.3, but there will still be some errors. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Introduction
Overfitting
FIGURE 1.5. Overly complex models for the fish will lead to decision boundaries that are complicated. While such a decision may lead to perfect classification of our training samples, it would lead to poor performance on future patterns. The novel test point marked ? is evidently most likely a salmon, whereas the complex decision boundary shown leads it to be classified as a sea bass. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Introduction
Generalization
FIGURE 1.6. The decision boundary shown might represent the optimal tradeoff between performance on the training set and simplicity of classifier, thereby giving the highest accuracy on new patterns. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Introduction
Related Fields
Statistical Hypothesis Testing
Image Processing
Regression (age ↔ weight)
Interpolation
Density Estimation
VU Pattern Recognition II
Introduction
Pattern Recognition Systems
[Figure: processing pipeline: input → sensing → segmentation → feature extraction → classification → post-processing → decision; with adjustments for missing features, adjustments for context, and costs]
FIGURE 1.7. Many pattern recognition systems can be partitioned into components such as the ones shown here. A sensor converts images or sounds or other physical inputs into signal data. The segmentor isolates sensed objects from the background or from other objects. A feature extractor measures object properties that are useful for classification. The classifier uses these features to assign the sensed object to a category. Finally, a post processor can take account of other considerations, such as the effects of context and the costs of errors, to decide on the appropriate action. Although this description stresses a one-way or "bottom-up" flow of data, some systems employ feedback from higher levels back down to lower levels (gray arrows). From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Introduction
Feature Extraction
Features ↔ Classification
Invariant Features (translation, rotation, scale)
Deformation (e.g. Cropping)
Feature Selection (Filter, Wrapper)
VU Pattern Recognition II
Introduction
Post Processing
Error Rate, Risk (weighted error)
Context (IC* *IN)
Multiple Classifiers (subspaces, fusion)
VU Pattern Recognition II
Introduction
Design Cycle
[Figure: design cycle: start → collect data → choose features → choose model → train classifier → evaluate classifier → end, with prior knowledge (e.g., invariances)]
FIGURE 1.8. The design of a pattern recognition system involves a design cycle similar to the one shown here. Data must be collected, both to train and to test the system. The characteristics of the data impact both the choice of appropriate discriminating features and the choice of models for the different categories. The training process uses some or all of the data to determine the system parameters. The results of evaluation may call for repetition of various steps in this process in order to obtain satisfactory results. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Introduction
Learning and Adaptation
Learning is Parameter Tuning
Supervised Learning (teacher)
Reinforcement Learning (critic)
Unsupervised Learning (clustering)
VU Pattern Recognition II
Statistical Classifiers
Bayesian Decision Theory
Probabilities
State of Nature ω = ω1 (class)
A Priori Probability P(ω1) (prior)
Decision Rule P(ω1) > P(ω2) → ω1
Class–Conditional Probability Density Function p(x|ω)
VU Pattern Recognition II
Statistical Classifiers
Bayesian Decision Theory
Class–Conditional Probability Density
FIGURE 2.1. Hypothetical class-conditional probability density functions show the probability density of measuring a particular feature value x given the pattern is in category ωi. If x represents the lightness of a fish, the two curves might describe the difference in lightness of populations of two types of fish. Density functions are normalized, and thus the area under each curve is 1.0. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Statistical Classifiers
Bayesian Decision Theory
Bayes Decision Rule
Joint Probability Density p(ωj, x) = P(ωj|x) p(x) = p(x|ωj) P(ωj)
Bayes Formula P(ωj|x) = p(x|ωj) P(ωj) / p(x)
Decision Rule P(ω1|x) > P(ω2|x) → ω1
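A minimal Python sketch (not from the slides) of this two-class decision rule; the priors match the Fig. 2.2 example, while the Gaussian class-conditional densities, their parameters, and the function names are illustrative assumptions.

import numpy as np

def gauss_pdf(x, mu, sigma):
    # p(x|w) = 1/(sqrt(2*pi)*sigma) * exp(-0.5*((x-mu)/sigma)**2)
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

priors = np.array([2 / 3, 1 / 3])        # P(w1), P(w2) as in Fig. 2.2
params = [(11.0, 1.0), (13.0, 1.5)]      # assumed (mu, sigma) of p(x|w1), p(x|w2)

def posteriors(x):
    # Bayes formula: P(wj|x) = p(x|wj) P(wj) / p(x)
    joint = np.array([gauss_pdf(x, m, s) for m, s in params]) * priors
    return joint / joint.sum()            # divide by the evidence p(x)

def decide(x):
    # Decision rule: choose w1 if P(w1|x) > P(w2|x), else w2
    return int(np.argmax(posteriors(x))) + 1

print(posteriors(12.0), decide(12.0))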
VU Pattern Recognition II
Statistical Classifiers
Bayesian Decision Theory
Posterior Probabilities
FIGURE 2.2. Posterior probabilities for the particular priors P(ω1) = 2/3 and P(ω2) = 1/3 for the class-conditional probability densities shown in Fig. 2.1. Thus in this case, given that a pattern is measured to have feature value x = 14, the probability it is in category ω2 is roughly 0.08, and that it is in ω1 is 0.92. At every x, the posteriors sum to 1.0. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Statistical Classifiers
Bayesian Decision Theory
Error Probabilities
Error P(error|x) = P(ω1|x) if ω2 is decided, P(ω2|x) if ω1 is decided
Average Error Probability P(error) = ∫ p(error, x) dx = ∫ P(error|x) p(x) dx
Bayes Rule minimizes P(error)
VU Pattern Recognition II
Statistical Classifiers
Bayesian Decision Theory
Generalized Bayes Rule
Feature vector ~x ∈ R^d
Classes ω1, . . . , ωc
Bayes Formula P(ωj|~x) = p(~x|ωj) P(ωj) / p(~x)
VU Pattern Recognition II
Statistical Classifiers
Bayesian Decision Theory
Dichotomizer
FIGURE 2.6. In this two-dimensional two-category classifier, the probability densities are Gaussian, the decision boundary consists of two hyperbolas, and thus the decision region R2 is not simply connected. The ellipses mark where the density is 1/e times that at the peak of the distribution. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Statistical Classifiers
Bayesian Decision Theory
The Normal Density
Randomized Prototype Vectors with Mean ~µ → Normal Distribution
Expected Value E[f(x)] = ∫ f(x) p(x) dx  (continuous)
E[f(x)] = Σx∈D f(x) P(x)  (discrete)
Univariate Normal Density p(x) = 1/(√(2π) σ) · e^(−(1/2)((x−µ)/σ)²)
E[x] = µ,  E[(x − µ)²] = σ²
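A short numerical check of the univariate normal density, assuming standard NumPy; the values of µ and σ are arbitrary examples.

import numpy as np

def normal_pdf(x, mu, sigma):
    # p(x) = 1/(sqrt(2*pi)*sigma) * exp(-(1/2)*((x-mu)/sigma)**2)
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

mu, sigma = 1.5, 0.7                       # assumed example parameters
x = np.linspace(mu - 6 * sigma, mu + 6 * sigma, 20001)
p = normal_pdf(x, mu, sigma)

area = np.trapz(p, x)                      # should be ~1 (normalized density)
mean = np.trapz(x * p, x)                  # E[x] = mu
var = np.trapz((x - mu) ** 2 * p, x)       # E[(x-mu)^2] = sigma^2
print(area, mean, var)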
VU Pattern Recognition II
Statistical Classifiers
Bayesian Decision Theory
Normal Distribution
FIGURE 2.7. A univariate normal distribution has roughly 95% of its area in the range |x − µ| ≤ 2σ, as shown. The peak of the distribution has value p(µ) = 1/(√(2π) σ). From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Statistical Classifiers
Bayesian Decision Theory
Multivariate Density
Multivariate Normal Density p(~x) = 1/((2π)^(d/2) |Σ|^(1/2)) · e^(−(1/2)(~x−~µ)ᵗ Σ⁻¹ (~x−~µ))
Covariance Matrix Σ (d × d)
E[~x] = ~µ,  E[(~x − ~µ)(~x − ~µ)ᵗ] = Σ
E[xi] = µi,  E[(xi − µi)(xj − µj)] = σij
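A small NumPy sketch of the multivariate normal density as written above; the mean vector and covariance matrix are assumed example values.

import numpy as np

def mvn_pdf(x, mu, cov):
    # p(x) = exp(-0.5*(x-mu)^t Sigma^-1 (x-mu)) / ((2*pi)^(d/2) * |Sigma|^(1/2))
    d = len(mu)
    diff = x - mu
    norm_const = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(cov))
    return np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff) / norm_const

mu = np.array([0.0, 1.0])                  # assumed mean vector
cov = np.array([[2.0, 0.5],                # assumed covariance matrix Sigma
                [0.5, 1.0]])
print(mvn_pdf(np.array([0.5, 1.5]), mu, cov))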
VU Pattern Recognition II
Statistical Classifiers
Bayesian Decision Theory
2D Gaussian
FIGURE 2.9. Samples drawn from a two-dimensional Gaussian lie in a cloud centered on the mean ~µ. The ellipses show lines of equal probability density of the Gaussian. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Statistical Classifiers
Bayesian Decision Theory
Four Categories
FIGURE 2.16. The decision regions for four normal distributions. Even with such a low number of categories, the shapes of the boundary regions can be rather complex. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Statistical Classifiers
Bayesian Decision Theory
Classification Errors
Bayes Error: overlapping densities, inherent problem property
Model Error: incorrect model
Estimation Error: finite sample of training data
VU Pattern Recognition II
Statistical Classifiers
Bayesian Decision Theory
Bayes Error and Dimensionality
FIGURE 3.3. Two three-dimensional distributions have nonoverlapping densities, and thus in three dimensions the Bayes error vanishes. When projected to a subspace, here the two-dimensional x1–x2 subspace or a one-dimensional x1 subspace, there can be greater overlap of the projected distributions, and hence greater Bayes error. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Nonparametric Techniques
Density Estimation
Unknown Densities
Real problems: multi–modal; parametric densities: uni–modal → estimation of densities directly from data
P that pattern ~x falls in region R: P = ∫R p(~x) d~x
n patterns, probability that k patterns are in R: Pk = (n choose k) P^k (1 − P)^(n−k),  E[k] = nP
Assuming small region R → p(~x) ≈ const → ∫R p(~x) d~x ≈ p(~x) V → p(~x) ≈ k/(nV)
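A toy illustration of the estimate p(~x) ≈ k/(nV): count how many of the n samples fall in a small region around x and divide by n times the region's volume. The standard-normal data and the interval half-width h are assumptions made for the example.

import numpy as np

rng = np.random.default_rng(0)
n = 10000
samples = rng.normal(0.0, 1.0, n)          # assumed data from a standard normal

def density_estimate(x, h=0.2):
    # p(x) ~= k/(n*V): k samples fall in the interval [x-h, x+h] of volume V = 2h
    k = np.sum(np.abs(samples - x) < h)
    return k / (n * 2 * h)

print(density_estimate(0.0))               # compare with 1/sqrt(2*pi) ~= 0.399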
VU Pattern Recognition II
Nonparametric Techniques
Density Estimation
Relative Probability
FIGURE 4.1. The relative probability an estimate given by Eq. 4 will yield a particular value for the probability density, here where the true probability was chosen to be 0.7. Each curve is labeled by the total number of patterns n sampled, and is scaled to give the same maximum (at the true probability). The form of each curve is binomial, as given by Eq. 2. For large n, such binomials peak strongly at the true probability. In the limit n → ∞, the curve approaches a delta function, and we are guaranteed that our estimate will give the true probability. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Nonparametric Techniques
Density Estimation
Sample Size
Estimate p(~x) ≈ k/(nV) is dependent on size of V
If V → 0, p(~x) would be exact, but no more samples in V
Assuming infinite pattern set with decreasing Vn
n–th estimate pn(~x) = kn/(n Vn)
For convergence of pn(~x) → p(~x): lim n→∞ Vn = 0,  lim n→∞ kn = ∞,  lim n→∞ kn/n = 0
Decreasing Vn, e.g., Vn = 1/√n → Parzen Windows
Increasing kn, e.g., kn = √n → kn–Nearest–Neighbors
VU Pattern Recognition II
Nonparametric Techniques
Density Estimation
Point Density Estimation
FIGURE 4.2. There are two leading methods for estimating the density at a point, here at the center of each square (shown for n = 1, 4, 9, 16, 100). The one shown in the top row is to start with a large volume centered on the test point and shrink it according to a function such as Vn = 1/√n. The other method, shown in the bottom row, is to decrease the volume in a data-dependent way, for instance letting the volume enclose some number kn = √n of sample points. The sequences in both cases represent random variables that generally converge and allow the true density at the test point to be calculated. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Nonparametric Techniques
k–Nearest–Neighbor Estimation
Prototype Estimation
Estimate density at arbitrary ~x by kn nearest neighbors of ~x
pn(~x) = kn/(n Vn)  (neighbors are training patterns)
Dense neighbors → small Vn → good resolution
Sparse neighbors → large Vn → bad resolution
Problem: often ∫ pn(~x) d~x > 1
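A 1D sketch of the kn-nearest-neighbor density estimate pn(~x) = kn/(n Vn), where Vn is taken as the interval that just reaches the kn-th nearest training pattern; the data and the choice k = 16 are assumptions.

import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(0.0, 1.0, 256)           # assumed 1D training patterns

def knn_density(x, k=16):
    # p_n(x) = k_n / (n * V_n), V_n = length of the interval reaching the k-th neighbor
    r = np.sort(np.abs(data - x))[k - 1]   # distance to the k-th nearest neighbor
    return k / (len(data) * 2 * r)

for x in (-2.0, 0.0, 2.0):
    print(x, knn_density(x))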
VU Pattern Recognition II
Nonparametric Techniques
k–Nearest–Neighbor Estimation
1D kNN Estimate
FIGURE 4.10. Eight points in one dimension and the k-nearest-neighbor density estimates, for k = 3 and 5. Note especially that the discontinuities in the slopes in the estimates generally lie away from the positions of the prototype points. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Nonparametric Techniques
k–Nearest–Neighbor Estimation
2D kNN Estimate
FIGURE 4.11. The k-nearest-neighbor estimate of a two-dimensional density for k = 5. Notice how such a finite n estimate can be quite "jagged," and notice that discontinuities in the slopes generally occur along lines away from the positions of the points themselves. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Nonparametric Techniques
k–Nearest–Neighbor Estimation
Unimodal and Bimodal 1D kNN Estimates
FIGURE 4.12. Several k-nearest-neighbor estimates of two unidimensional densities: a Gaussian and a bimodal distribution (shown for n = 1, 16, 256, and n → ∞ with kn = 1, 4, 16, ∞). Notice how the finite n estimates can be quite "spiky." From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Nonparametric Techniques
k–Nearest–Neighbor Estimation
Estimation of A Posteriori Probabilities
Samples of different classes, what is P(ωi|~x)?
Estimate for pn(~x, ωi) = ki/(nV)  (in arbitrary V)
Estimate for P(ωi|~x) = pn(~x, ωi) / Σj=1..c pn(~x, ωj) = ki/k
With n → ∞ and Bayes Rule: optimal performance (Parzen and kNN)
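A small sketch of the posterior estimate P(ωi|~x) ≈ ki/k, i.e. the fraction of the k nearest neighbors that belong to class i; deciding for the largest fraction is exactly the majority-vote k-NN rule of the following slides. The 2D toy data are an assumption.

import numpy as np

rng = np.random.default_rng(2)
# Assumed 2D toy data: 50 patterns per class
X = np.vstack([rng.normal([0.0, 0.0], 1.0, (50, 2)),
               rng.normal([2.0, 2.0], 1.0, (50, 2))])
labels = np.array([0] * 50 + [1] * 50)

def knn_posterior(x, k=5):
    # P(w_i|x) ~= k_i / k: fraction of the k nearest neighbors belonging to class i
    nearest = labels[np.argsort(np.linalg.norm(X - x, axis=1))[:k]]
    return np.bincount(nearest, minlength=2) / k

post = knn_posterior(np.array([1.0, 1.0]))
print(post, "-> decide class", int(post.argmax()) + 1)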
VU Pattern Recognition II
Nonparametric Techniques
k–Nearest–Neighbor Estimation
Nearest Neighbor Rule
Single nearest neighbor is ~x′ (k = 1)
Class label of ~x′ is θ′ (random variable)
P(θ′ = ωi) = P(ωi|~x′) ≈ P(ωi|~x)  (for large n)
Assumption of 1NN: P(ωi|~x′) is the largest probability
If true (e.g., P ≈ 1, or P ≈ 1/c), then 1NN close to Bayes Error
Average error probability P(e) = ∫ P(e|~x) p(~x) d~x
P(e|~x) = 1 − P(ωi|~x′) is the "minimum" P*(e|~x);  P*(e) = ∫ P*(e|~x) p(~x) d~x
1NN error P = lim n→∞ Pn(e);  P* ≤ P ≤ P*(2 − (c/(c−1)) P*)
VU Pattern Recognition II
Nonparametric Techniques
k–Nearest–Neighbor Estimation
Voronoi Tessellation
FIGURE 4.13. In two dimensions, the nearest-neighbor algorithm leads to a partitioning of the input space into Voronoi cells, each labeled by the category of the training point it contains. In three dimensions, the cells are three-dimensional, and the decision boundary resembles the surface of a crystal. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Nonparametric Techniques
k–Nearest–Neighbor Estimation
1NN Error Rate Bounds
FIGURE 4.14. Bounds on the nearest-neighbor error rate P in a c-category problem given infinite training data, where P* is the Bayes error (Eq. 52). At low error rates, the nearest-neighbor error rate is bounded above by twice the Bayes rate. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Nonparametric Techniques
k–Nearest–Neighbor Estimation
k–Nearest–Neighbor Rule
Straightforward extension: k neighbors
Majority voting: P(ωm|~x) is the largest probability (most prototypes in class m)
If k → ∞, then the k–NN rule becomes optimal
VU Pattern Recognition II
Nonparametric Techniques
k–Nearest–Neighbor Estimation
5NN in 2D
FIGURE 4.15. The k-nearest-neighbor query starts at the test point x and grows a spherical region until it encloses k training samples, and it labels the test point by a majority vote of these samples. In this k = 5 case, the test point x would be labeled the category of the black points. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Nonparametric Techniques
k–Nearest–Neighbor Estimation
kNN Error Rate Bounds
FIGURE 4.16. The error rate for the k-nearest-neighbor rule for a two-category problem is bounded by Ck(P*) in Eq. 54. Each curve is labeled by k (k = 1, 3, 5, 9, 15); when k = ∞, the estimated probabilities match the true probabilities and thus the error rate is equal to the Bayes rate, that is, P = P*. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Nonparametric Techniques
k–Nearest–Neighbor Estimation
Metrics
What is a distance?
Properties of Metrics
Nonnegativity: D(~a,~b) ≥ 0
Reflexivity: D(~a,~b) = 0 iff ~a = ~b
Symmetry: D(~a,~b) = D(~b,~a)
Triangle inequality: D(~a,~b) + D(~b,~c) ≥ D(~a,~c)
Scaling of feature values equivalent to changing the metric
VU Pattern Recognition II
Nonparametric Techniques
k–Nearest–Neighbor Estimation
Scaling is Change of Metric
FIGURE 4.18. Scaling the coordinates of a feature space can change the distance relationships computed by the Euclidean metric. Here we see how such scaling can change the behavior of a nearest-neighbor classifier. Consider the test point x and its nearest neighbor. In the original space (left), the black prototype is closest. In the figure at the right, the x1 axis has been rescaled by a factor 1/3; now the nearest prototype is the red one. If there is a large disparity in the ranges of the full data in each dimension, a common procedure is to rescale all the data to equalize such ranges, and this is equivalent to changing the metric in the original space. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Nonparametric Techniques
k–Nearest–Neighbor Estimation
Class of Metrics
Minkowski Metric (Lk Norm): Lk(~a, ~b) = (Σi=1..d |ai − bi|^k)^(1/k)
L1 Norm: Manhattan distance
L2 Norm: Euclidean distance
L∞ Norm: Maximum of projected distances
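A direct NumPy transcription of the Minkowski metric; the example vectors are arbitrary.

import numpy as np

def minkowski(a, b, k):
    # L_k(a, b) = (sum_i |a_i - b_i|^k)^(1/k)
    return np.sum(np.abs(a - b) ** k) ** (1.0 / k)

def l_inf(a, b):
    # L_inf norm: maximum of the projected (per-coordinate) distances
    return np.max(np.abs(a - b))

a = np.array([0.0, 0.0, 0.0])              # arbitrary example points
b = np.array([1.0, 2.0, 2.0])
print(minkowski(a, b, 1), minkowski(a, b, 2), minkowski(a, b, 4), l_inf(a, b))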
VU Pattern Recognition II
Nonparametric Techniques
k–Nearest–Neighbor Estimation
Minkowski Metric
FIGURE 4.19. Each colored surface consists of points a distance 1.0 from the origin, measured using different values for k in the Minkowski metric (k is printed in red). Thus the white surfaces correspond to the L1 norm (Manhattan distance), the light gray sphere corresponds to the L2 norm (Euclidean distance), the dark gray ones correspond to the L4 norm, and the pink box corresponds to the L∞ norm. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Linear Discriminant Functions
Decision Surfaces
Discriminant Functions
Assumption: we know the form of discriminant functions (not probability densities)
Problem: determine parameters of discriminant functions
Method: gradient descent of criterion functions (based on training set)
VU Pattern Recognition II
Linear Discriminant Functions
Decision Surfaces
Linear Classifier
FIGURE 5.1. A simple linear classifier having d input units, each corresponding to the values of the components of an input vector. Each input feature value xi is multiplied by its corresponding weight wi; the effective input at the output unit is the sum of all these products, Σ wi xi. We show in each unit its effective input-output function. Thus each of the d input units is linear, emitting exactly the value of its corresponding feature value. The single bias unit always emits the constant value 1.0. The single output unit emits a +1 if wᵗx + w0 > 0 or a −1 otherwise. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Linear Discriminant Functions
Decision Surfaces
Linear Discriminant Functions
Linear discriminant function g(~x) = ~wᵗ~x + w0  (weight vector ~w, bias w0)
Two classes: g(~x) > 0 → ω1, else ω2;  or ~wᵗ~x > −w0
Decision surface is a hyperplane; for ~x1, ~x2 on the boundary: ~wᵗ~x1 + w0 = ~wᵗ~x2 + w0 → ~wᵗ(~x1 − ~x2) = 0  (~w is normal vector)
Hyperplane H divides space in two half–spaces: R1 is positive side (g(~x) > 0), R2 is negative side (g(~x) < 0)
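A minimal sketch of the two-class linear discriminant g(~x) = ~wᵗ~x + w0 and the sign-based decision; the weight vector and bias are assumed example values.

import numpy as np

w = np.array([1.0, -2.0])                  # assumed weight vector (normal to H)
w0 = 0.5                                   # assumed bias

def g(x):
    # g(x) = w^t x + w0
    return w @ x + w0

def classify(x):
    # decide w1 if g(x) > 0, else w2; H = {x : g(x) = 0} is the decision hyperplane
    return 1 if g(x) > 0 else 2

print(classify(np.array([2.0, 0.0])), classify(np.array([0.0, 2.0])))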
VU Pattern Recognition II
Linear Discriminant Functions
Decision Surfaces
Multiple Classes
Variant: c dichotomizers (ωi, not ωi)
Variant: c(c−1)/2 dichotomizers (all class pairs)
Variant: linear machine, discriminant functions gi(~x), i = 1, . . . , c
Decision boundary gi(~x) = gj(~x) → (~wi − ~wj)ᵗ~x + (wi0 − wj0) = 0
(~wi − ~wj) ⊥ Hij,  r = (gi(~x) − gj(~x)) / ||~wi − ~wj||
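A sketch of the linear-machine variant: evaluate all gi(~x) = ~wiᵗ~x + wi0 and assign ~x to the class with the largest value; the weight matrix and biases are illustrative assumptions.

import numpy as np

# Assumed linear machine for c = 3 classes: row i of W is the weight vector w_i,
# and w0[i] is the corresponding bias w_i0.
W = np.array([[ 1.0,  0.0],
              [-1.0,  1.0],
              [ 0.0, -1.0]])
w0 = np.array([0.0, 0.5, -0.2])

def linear_machine(x):
    # assign x to the class with the largest discriminant g_i(x) = w_i^t x + w_i0
    return int(np.argmax(W @ x + w0)) + 1

print(linear_machine(np.array([2.0, 1.0])))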
VU Pattern Recognition II
Linear Discriminant Functions
Decision Surfaces
Dichotomizers in a Four–class Problem
FIGURE 5.3. Linear decision boundaries for a four-class problem. The top figure shows ωi/not ωi dichotomies while the bottom figure shows ωi/ωj dichotomies and the corresponding decision boundaries Hij. The pink regions have ambiguous category assignments. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Linear Discriminant Functions
Decision Surfaces
Linear Machines in Multi–class Problems
FIGURE 5.4. Decision boundaries produced by a linear machine for a three-class problem and a five-class problem. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Linear Discriminant Functions
Decision Surfaces
Generalized Linear Discriminant Functions
More complex decision boundaries, e.g., quadratic discriminant g(~x) = w0 + Σi=1..d wi xi + Σi=1..d Σj=1..d wij xi xj
Generalized LDF g(~x) = Σi=1..d̂ ai yi(~x) = ~aᵗ~y
The d̂ functions yi(~x) map points from d–dimensional ~x–space to d̂–dimensional ~y–space
Example g(x) = a1 + a2 x + a3 x²,  ~y = (1, x, x²)ᵗ
Decision boundary is linear in ~y–space
Transformed density (in ~y–space) is degenerate
If d̂ is large, huge number of parameters (requires a large training data set)
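A sketch of the worked example above: the mapping ~y = (1, x, x²)ᵗ makes g(x) = a1 + a2 x + a3 x² a linear function ~aᵗ~y of ~y; the coefficient values are an assumption chosen so that the positive region in x-space is not simply connected.

import numpy as np

a = np.array([-1.0, 0.0, 1.0])             # assumed coefficients: g(x) = -1 + x^2

def phi(x):
    # map scalar x to y = (1, x, x^2)^t; g is then linear in y
    return np.array([1.0, x, x * x])

def g(x):
    return a @ phi(x)

# R1 = {x : g(x) > 0} = {|x| > 1} is not simply connected in x-space
print([(x, bool(g(x) > 0)) for x in (-2.0, 0.0, 2.0)])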
VU Pattern Recognition II
Linear Discriminant Functions
Decision Surfaces
From 1D to 3D
FIGURE 5.5. The mapping y = (1, x, x²)ᵗ takes a line and transforms it to a parabola in three dimensions. A plane splits the resulting y-space into regions corresponding to two categories, and this in turn gives a nonsimply connected decision region in the one-dimensional x-space. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Linear Discriminant Functions
Decision Surfaces
From 2D to 3D
FIGURE 5.6. The two-dimensional input space x is mapped through a polynomial function f to y. Here the mapping is y1 = x1, y2 = x2 and y3 ∝ x1 x2. A linear discriminant in this transformed space is a hyperplane, which cuts the surface. Points to the positive side of the hyperplane H correspond to category ω1, and those beneath it correspond to category ω2. Here, in terms of the x space, R1 is not simply connected. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Linear Discriminant Functions
Decision Surfaces
Linearly Separable Dichotomy
Two classes, samples ~yi: ~aᵗ~yi > 0 → ω1, ~aᵗ~yi < 0 → ω2
"Normalization" of ω2: ~yi = −~yi → ~aᵗ~yi > 0 ∀~yi
Solution region defines all possible values of ~a: intersection of n half–spaces (~aᵗ~yi = 0)
Margin b > 0, ~aᵗ~yi ≥ b; the new solution region has distance b/||~yi|| from the old boundaries
VU Pattern Recognition II
Linear Discriminant Functions
Decision Surfaces
Solution Region and Normalization
FIGURE 5.8. Four training samples (black for ω1, red for ω2) and the solution region in feature space. The figure on the left shows the raw data; the solution vector leads to a plane that separates the patterns from the two categories. In the figure on the right, the red points have been "normalized", that is, changed in sign. Now the solution vector leads to a plane that places all "normalized" points on the same side. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Linear Discriminant Functions
Decision Surfaces
Solution Region with Margins
FIGURE 5.9. The effect of the margin on the solution region. At the left is the case of no margin (b = 0), equivalent to a case such as shown at the left in Fig. 5.8. At the right is the case b > 0, shrinking the solution region by margins b/||~yi||. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Linear Discriminant Functions
Decision Surfaces
Gradient Descent Solutions
Set of linear inequalities ~aᵗ~yi > 0; define criterion function J(~a), which is minimized for a solution vector ~a*
Minimizing a scalar function J(~a) by gradient descent: ~a(k+1) = ~a(k) − η(k) ∇J(~a(k))
Second–order expansion J(~a) ≈ J(~a(k)) + ∇Jᵗ(~a − ~a(k)) + (1/2)(~a − ~a(k))ᵗ H (~a − ~a(k)),  H is the Hessian Matrix
Minimize J(~a(k+1)) with η(k) = ||∇J||² / (∇Jᵗ H ∇J)
J(~a) ∼ ~a² → H = const. → η = const.
Minimize the second–order expansion with respect to ~a(k+1) → Newton Descent ~a(k+1) = ~a(k) − H⁻¹∇J  (expensive)
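A sketch of gradient descent with the optimal step length η(k) = ||∇J||²/(∇JᵗH∇J) on an assumed quadratic criterion; for a quadratic J the Hessian H is constant, and a single Newton step would reach the minimizer H⁻¹~b directly.

import numpy as np

# Assumed quadratic criterion J(a) = 0.5*a^t H a - b^t a with gradient H a - b
H = np.array([[4.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 1.0])

a = np.zeros(2)
for k in range(50):
    grad = H @ a - b
    if np.linalg.norm(grad) < 1e-12:
        break
    eta = (grad @ grad) / (grad @ H @ grad)   # optimal rate ||grad J||^2 / (grad J^t H grad J)
    a = a - eta * grad

print(a, np.linalg.solve(H, b))               # descent result vs. exact minimizer H^-1 b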
VU Pattern Recognition II
Linear Discriminant Functions
Decision Surfaces
Gradient and Newton Descent
FIGURE 5.10. The sequence of weight vectors given by a simple gradient descent method (red) and by Newton's (second order) algorithm (black). Newton's method typically leads to greater improvement per step, even when using optimal learning rates for both methods. However, the added computational burden of inverting the Hessian matrix used in Newton's method is not always justified, and simple gradient descent may suffice. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Linear Discriminant Functions
Decision Surfaces
Perceptron Criterion Function
"Normalized" inequalities ~aᵗ~yi > 0
Perceptron criterion Jp(~a) = Σ~y∈Y (−~aᵗ~y)  (Y is the set of misclassified patterns)
Gradient ∇Jp = Σ~y∈Y (−~y)
Update rule ~a(k+1) = ~a(k) + η(k) Σ~y∈Yk ~y
Batch vs. single–sample correction
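A sketch of the batch perceptron update ~a(k+1) = ~a(k) + η Σ~y∈Yk ~y on "normalized" augmented patterns; the toy data and the learning rate are assumptions.

import numpy as np

def perceptron_batch(Y, eta=1.0, max_iter=1000):
    # Y holds "normalized" augmented patterns as rows (omega_2 patterns sign-flipped),
    # so a solution vector must satisfy a^t y > 0 for every row y.
    a = np.zeros(Y.shape[1])
    for _ in range(max_iter):
        misclassified = Y[Y @ a <= 0]             # the set Y_k of misclassified patterns
        if len(misclassified) == 0:
            break                                 # all a^t y > 0: solution found
        a = a + eta * misclassified.sum(axis=0)   # a(k+1) = a(k) + eta * sum of misclassified y
    return a

# Assumed toy data: augmented patterns (1, x1, x2), omega_2 rows already negated
Y = np.array([[ 1.0, 2.0, 1.0],
              [ 1.0, 1.5, 2.0],
              [-1.0, 0.5, 0.2],
              [-1.0, 0.1, 1.0]])
print(perceptron_batch(Y))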
VU Pattern Recognition II
Linear Discriminant Functions
Decision Surfaces
Minimum Squared–Error Procedures
Set of equalities ~aᵗ~yi = bi;  bi > 0 are arbitrary constants
Solve Y~a = ~b;  Y is the n × (d+1) matrix containing all training vectors
If Y nonsingular, ~a = Y⁻¹~b; however, Y is mostly rectangular!
Minimizing ~e = Y~a − ~b leads to YᵗY~a = Yᵗ~b → ~a = (YᵗY)⁻¹Yᵗ~b = Y†~b;  Y† is the pseudoinverse, a (d+1) × n matrix
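A sketch of the minimum squared-error solution ~a = Y†~b using NumPy's pseudoinverse; the toy matrix Y and the choice ~b = (1, ..., 1)ᵗ are assumptions.

import numpy as np

# Assumed toy problem: n = 4 "normalized" augmented training vectors as rows of Y
Y = np.array([[ 1.0, 2.0, 1.0],
              [ 1.0, 1.5, 2.0],
              [-1.0, 0.5, 0.2],
              [-1.0, 0.1, 1.0]])
b = np.ones(Y.shape[0])                    # arbitrary positive margins b_i

a = np.linalg.pinv(Y) @ b                  # a = Y^dagger b = (Y^t Y)^-1 Y^t b
print(a, Y @ a)                            # Y a approximates b in the least-squares sense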
VU Pattern Recognition II
Linear Discriminant Functions
Decision Surfaces
Support Vector Machines
Transform patterns to a (much) higher dimension via a nonlinear mapping ϕ(·)
Linear discriminant g(~y) = ~a^t ~y
Distance of ~y_k to the hyperplane H: z_k g(~y_k)/||~a|| ≥ b
z_k = ±1 (normalization), b is the margin
Maximize b under the constraint b·||~a|| = 1 → minimize ||~a|| with inequality constraints
Kuhn–Tucker theorem: optimization with inequality constraints, a generalization of Lagrange multipliers
VU Pattern Recognition II
Linear Discriminant Functions
Decision Surfaces
Maximal Margin SVM
Maximize margin b using the Kuhn–Tucker functional L(~a, ~α) = 1/2 ||~a||² − ∑_{k=1}^n α_k [z_k ~a^t ~y_k − 1]
Resulting dual problem (quadratic optimization): L(~α) = ∑_{k=1}^n α_k − 1/2 ∑_{k,j}^n α_k α_j z_k z_j ~y_j^t ~y_k
with constraints ∑_{k=1}^n z_k α_k = 0 and α_k ≥ 0
Then ~a* = ∑_{i=1}^n z_i α_i* ~y_i (a non-zero α_i indicates a support vector)
Maximal margin b* = (∑_{i=1}^n α_i*)^{−1/2}
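A small numeric sketch of this dual problem, solved with a general-purpose constrained optimizer rather than a dedicated QP solver; the toy patterns and labels are made up, and the bias of the hyperplane is recovered afterwards from a support vector.

import numpy as np
from scipy.optimize import minimize

# made-up linearly separable toy data; labels z in {+1, -1}
Y = np.array([[2.0, 2.0], [2.5, 1.5], [0.5, 0.5], [0.0, 1.0]])
z = np.array([1.0, 1.0, -1.0, -1.0])

G = (z[:, None] * Y) @ (z[:, None] * Y).T         # G_kj = z_k z_j y_k . y_j

def neg_dual(alpha):                              # minimize the negative of L(alpha)
    return 0.5 * alpha @ G @ alpha - alpha.sum()

cons = ({'type': 'eq', 'fun': lambda a: a @ z},)  # sum_k z_k alpha_k = 0
bnds = [(0.0, None)] * len(z)                     # alpha_k >= 0
alpha = minimize(neg_dual, np.zeros(len(z)), bounds=bnds, constraints=cons).x

a_star = (alpha * z) @ Y                          # a* = sum_i z_i alpha_i* y_i
sv = int(np.argmax(alpha))                        # index of a support vector (alpha_i > 0)
bias = z[sv] - a_star @ Y[sv]                     # from z_k (a*.y_k + bias) = 1
margin = 1.0 / np.sqrt(alpha.sum())               # b* = (sum_i alpha_i*)^(-1/2)
print(alpha.round(3), a_star, bias, margin)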
VU Pattern Recognition II
Linear Discriminant Functions
Decision Surfaces
Maximal Margin Hyperplane
FIGURE 5.19. Training a support vector machine consists of finding the optimal hyperplane, that is, the one with the maximum distance from the nearest training patterns. The support vectors are those (nearest) patterns, a distance b from the hyperplane. The three support vectors are shown as solid dots. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Linear Discriminant Functions
Decision Surfaces
Soft Margin SVM
The maximal margin SVM is sensitive to outliers and demands linear separability for a solution
Soft margin SVM: introduce slack variables ξ, z_k g(~y_k) ≥ b − ξ_k (relaxed margin)
Maximize the relaxed margin b with the Kuhn–Tucker functional L(~a, ~α, ~ξ) = 1/2 ||~a||² + C/2 ∑_{k=1}^n ξ_k² − ∑_{k=1}^n α_k [z_k ~a^t ~y_k − 1 + ξ_k]
Again ~a* = ∑_{i=1}^n z_i α_i* ~y_i
Maximal margin b* = (∑_{i=1}^n α_i* − 1/C ∑_{i=1}^n (α_i*)²)^{−1/2}
Depends on the parameter C!
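With this quadratic slack penalty, carrying the stationarity conditions back into the functional leaves the dual in the same form as before except for an extra −1/(2C) ∑ α_k² term, i.e. 1/C added on the diagonal of the Gram matrix; this rederivation is not stated on the slides, so treat it as an assumption. A minimal sketch reusing the previous toy setup, with one deliberately mislabeled outlier and a made-up C:

import numpy as np
from scipy.optimize import minimize

Y = np.array([[2.0, 2.0], [2.5, 1.5], [0.5, 0.5], [0.0, 1.0], [1.8, 1.8]])
z = np.array([1.0, 1.0, -1.0, -1.0, -1.0])        # last pattern is an outlier of class -1
C = 1.0                                           # slack penalty weight

G = (z[:, None] * Y) @ (z[:, None] * Y).T + np.eye(len(z)) / C   # 1/C added on the diagonal

def neg_dual(alpha):
    return 0.5 * alpha @ G @ alpha - alpha.sum()

cons = ({'type': 'eq', 'fun': lambda a: a @ z},)
bnds = [(0.0, None)] * len(z)
alpha = minimize(neg_dual, np.zeros(len(z)), bounds=bnds, constraints=cons).x

a_star = (alpha * z) @ Y
margin = 1.0 / np.sqrt(alpha.sum() - (alpha @ alpha) / C)         # slide formula for b*
print(alpha.round(3), a_star, margin)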
VU Pattern Recognition II
Neural Networks
Multilayer Neural Networks
Real-world problems: a linear discriminant is often not sufficient
NNs also implement a nonlinear mapping to a higher dimension
Learning finds the mapping AND the linear discriminant
Error backpropagation is a least-squares fit to the Bayes discriminant functions
NNs are motivated by biology, but can be explained without it
VU Pattern Recognition II
Neural Networks
XOR Net
FIGURE 6.1. The two-bit parity or exclusive-OR problem can be solved by a three-layer network. At the bottom is the two-dimensional feature x1x2-space, along with thefour patterns to be classified. The three-layer network is shown in the middle. The inputunits are linear and merely distribute their feature values through multiplicative weightsto the hidden units. The hidden and output units here are linear threshold units, eachof which forms the linear sum of its inputs times their associated weight to yield net,and emits a +1 if this net is greater than or equal to 0, and −1 otherwise, as shownby the graphs. Positive or “excitatory” weights are denoted by solid lines, negative or“inhibitory” weights by dashed lines; each weight magnitude is indicated by the line’sthickness, and is labeled. The single output unit sums the weighted signals from thehidden units and bias to form its net, and emits a +1 if its net is greater than or equalto 0 and emits a −1 otherwise. Within each unit we show a graph of its input-outputor activation function—f (net) versus net. This function is linear for the input units, aconstant for the bias, and a step or sign function elsewhere. We say that this networkhas a 2-2-1 fully connected topology, describing the number of units (other than thebias) in successive layers. From: Richard O. Duda, Peter E. Hart, and David G. Stork,Pattern Classification. Copyright c© 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Neural Networks
Network Components
Neurons and synaptic connections (weights)
Net activation net_j = ∑_{i=1}^d x_i w_ji + w_j0 = ∑_{i=0}^d x_i w_ji ≡ ~w_j^t ~x
Neuron output z_k = f(net_k), with f the activation function
A common activation function class is the sigmoid, e.g., f(x) = 1/(1 + e^{−cx})
Basic topologies: feed-forward and recurrent
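A tiny sketch of a single unit's forward computation with a sigmoid activation; the weights and input values are made up.

import numpy as np

def sigmoid(x, c=1.0):
    # f(x) = 1 / (1 + exp(-c x))
    return 1.0 / (1.0 + np.exp(-c * x))

def unit_output(x, w_j):
    # net_j = w_j . [1, x] (x_0 = 1 carries the bias weight w_j0), output = f(net_j)
    net_j = w_j @ np.concatenate(([1.0], x))
    return sigmoid(net_j)

w_j = np.array([-0.5, 0.8, 0.3])      # [w_j0, w_j1, w_j2], made up
x = np.array([1.0, 2.0])
print(unit_output(x, w_j))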
VU Pattern Recognition II
Neural Networks
A 2–4–1 Network
FIGURE 6.2. A 2-4-1 network (with bias) along with the response functions at different units; each hidden and output unit has a sigmoidal activation function f(·). In the case shown, the hidden unit outputs are paired in opposition, thereby producing a "bump" at the output unit. Given a sufficiently large number of hidden units, any continuous function from input to output can be approximated arbitrarily well by such a network. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Neural Networks
NN Decision Boundaries
FIGURE 6.3. Whereas a two-layer network classifier can only implement a linear decision boundary, given an adequate number of hidden units, three-, four- and higher-layer networks can implement arbitrary decision boundaries. The decision regions need not be convex or simply connected. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Neural Networks
Network Learning
Learning as minimization (of the network error)
The error is a function of the network parameters
Gradient descent methods reduce the error
Problem with hidden layers: no direct target values for the hidden units
Backpropagation = iterative local gradient descent; Werbos (1974), Rumelhart, Hinton, Williams (1986)
Error backpropagation: the output error is transmitted backwards as weighted error, and the network weights are updated locally
Weight update ∆w_{j,i} = η δ_j a_i with a generalized error term δ
Common transfer functions: differentiable, nonlinear, monotonic, with an easily computable derivative
VU Pattern Recognition II
Neural Networks
Error–Backpropagation I
(Network sketch: inputs x_i feed hidden units with net activations H_j via weights v_{j,i}; hidden outputs y_j feed output units with net activations I_k via weights w_{k,j}; network outputs z_k.)
H_j = ∑_{i=1}^n v_{j,i} x_i, I_k = ∑_{j=1}^h w_{k,j} y_j, y_j = f(H_j), z_k = f(I_k)
Error E^(p) = 1/2 ∑_{k=1}^m (t_k^(p) − z_k^(p))²
Output layer: ∆w_{k,j} = −η ∂E/∂w_{k,j}
∂E/∂w_{k,j} = (∂E/∂I_k)(∂I_k/∂w_{k,j}) = (∂E/∂I_k) y_j
∂E/∂I_k = (∂E/∂z_k)(∂z_k/∂I_k) = −(t_k − z_k) f′(I_k)
∂E/∂w_{k,j} = −(t_k − z_k) f′(I_k) y_j, and with δ_k = (t_k − z_k) f′(I_k)
∆w_{k,j} = η δ_k y_j
VU Pattern Recognition II
Neural Networks
Error–Backpropagation II
Hidden layer: ∆v_{j,i} = −η ∂E/∂v_{j,i}
∂E/∂v_{j,i} = (∂E/∂H_j)(∂H_j/∂v_{j,i}) = (∂E/∂H_j) x_i
∂E/∂H_j = (∂E/∂y_j)(∂y_j/∂H_j) = (∂E/∂y_j) f′(H_j)
∂E/∂y_j = 1/2 ∑_{k=1}^m ∂(t_k − f(I_k))²/∂y_j = −∑_{k=1}^m (t_k − z_k) f′(I_k) w_{k,j}
With δ_j = f′(H_j) ∑_{k=1}^m δ_k w_{k,j}: ∆v_{j,i} = η δ_j x_i
Local update rules propagating the error from output to input
Presenting all p patterns of the training set once = 1 epoch (complete training takes e.g. 1,000 epochs)
Batch learning (off-line): accumulate the weight changes for all patterns, then update the weights
On-line learning: update the weights after each pattern (a complete training sketch follows below)
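A compact, self-contained sketch of on-line error backpropagation on the XOR problem, using sigmoid units and 0/1 targets rather than the ±1 threshold units of Figure 6.1; the topology (2-2-1 with bias), learning rate, random seed, and epoch count are illustrative assumptions, and an unlucky seed may require more epochs.

import numpy as np

rng = np.random.default_rng(0)

def f(x):  return 1.0 / (1.0 + np.exp(-x))      # sigmoid activation
def fp(x): return f(x) * (1.0 - f(x))           # its derivative f'(x)

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])   # XOR inputs
T = np.array([[0.], [1.], [1.], [0.]])                    # XOR targets

V = rng.normal(scale=0.5, size=(2, 3))          # hidden weights v_{j,i} (last column: bias)
W = rng.normal(scale=0.5, size=(1, 3))          # output weights w_{k,j} (last column: bias)
eta = 0.5

for epoch in range(10000):                      # on-line learning: update after each pattern
    for x, t in zip(X, T):
        xb = np.append(x, 1.0)                  # input plus bias unit
        H = V @ xb                              # hidden net activations H_j
        y = f(H)
        yb = np.append(y, 1.0)                  # hidden outputs plus bias unit
        I = W @ yb                              # output net activations I_k
        z = f(I)
        delta_k = (t - z) * fp(I)               # output error term delta_k
        delta_j = fp(H) * (W[:, :-1].T @ delta_k)   # back-propagated hidden error term delta_j
        W += eta * np.outer(delta_k, yb)        # Delta w_{k,j} = eta * delta_k * y_j
        V += eta * np.outer(delta_j, xb)        # Delta v_{j,i} = eta * delta_j * x_i

for x in X:
    yb = np.append(f(V @ np.append(x, 1.0)), 1.0)
    print(x, np.round(f(W @ yb), 2))            # outputs close to 0, 1, 1, 0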
VU Pattern Recognition II
Neural Networks
Learning Curves
FIGURE 6.6. A learning curve shows the criterion function as a function of the amount of training, typically indicated by the number of epochs or presentations of the full training set. We plot the average error per pattern, that is, 1/n ∑_{p=1}^n J_p. The validation error and the test or generalization error per pattern are virtually always higher than the training error. In some protocols, training is stopped at the first minimum of the validation set. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Neural Networks
XOR Learning Details
FIGURE 6.10. A 2-2-1 backpropagation network with bias and the four patterns of theXOR problem are shown at the top. The middle figure shows the outputs of the hid-den units for each of the four patterns; these outputs move across the y1y2-space asthe network learns. In this space, early in training (epoch 1) the two categories are notlinearly separable. As the input-to-hidden weights learn, as marked by the number ofepochs, the categories become linearly separable. The dashed line is the linear decisionboundary determined by the hidden-to-output weights at the end of learning; indeedthe patterns of the two classes are separated by this boundary. The bottom graph showsthe learning curves—the error on individual patterns and the total error as a functionof epoch. Note that, as frequently happens, the total training error decreases monoton-ically, even though this is not the case for the error on each individual pattern. From:Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyrightc© 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Neural Networks
Backpropagation Variants I
Standard backpropagation: ~w_t = ~w_{t−1} − η ~∇E
Gradient reuse: use ~∇E as long as the error drops
BP with a variable step size (learning rate) η
BP with momentum: ∆~w_t = −η ~∇E + α ∆~w_{t−1}
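A one-function sketch of the momentum update; the weight values, gradient, and the rates η and α are made-up examples.

import numpy as np

def momentum_step(w, grad_E, prev_dw, eta=0.1, alpha=0.9):
    # Delta w_t = -eta * grad E + alpha * Delta w_{t-1}
    dw = -eta * grad_E + alpha * prev_dw
    return w + dw, dw

w, dw = np.array([0.2, -0.3]), np.zeros(2)
w, dw = momentum_step(w, np.array([0.5, -0.1]), dw)
print(w, dw)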
VU Pattern Recognition II
Nonmetric Methods
Decision Trees
Real problems involve nominal data, e.g., car = {green, red, blue}
Rule-based or syntactic methods
Decision tree (DT): a series of questions (nodes) leads to an answer at a leaf (category)
A DT is interpretable (decisions and categories)
VU Pattern Recognition II
Nonmetric Methods
Monothetic Decision Tree
FIGURE 8.1. Classification in a basic decision tree proceeds from top to bottom. The questions asked at each node concern a particular property of the pattern, and the downward links correspond to the possible values. Successive nodes are visited until a terminal or leaf node is reached, where the category label is read. Note that the same question, Size?, appears in different places in the tree and that different questions can have different numbers of branches. Moreover, different leaf nodes, shown in pink, can be labeled by the same category (e.g., Apple). From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Nonmetric Methods
Classification and Regression Trees
CART
Goal: construct pure nodes (ideally, all leaf nodes are pure)
A pure leaf node contains only patterns of a single category
Design issues:
Branching factor = number of splits?
Which query (property) at which node?
When to terminate (declare a leaf node)?
Pruning (simplification)?
Missing data?
VU Pattern Recognition II
Nonmetric Methods
Classification and Regression Trees
Monothetic Decision Boundaries
FIGURE 8.3. Monothetic decision trees create decision boundaries with portions perpendicular to the feature axes. The decision regions are marked R1 and R2 in these two-dimensional and three-dimensional two-category examples. With a sufficiently large tree, any decision boundary can be approximated arbitrarily well in this way. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Nonmetric Methods
Classification and Regression Trees
Entropy Impurity
Every non-binary tree can be transformed into a binary tree
Monothetic (single feature per node) and polythetic (multiple features per node) trees
Any query at a node should gain maximal purity (or minimal impurity)
Entropy impurity of a node N with class "probabilities" P(ω_j): i(N) = −∑_j P(ω_j) ld P(ω_j)
i(N) = 0 → pure node
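A direct sketch of this impurity on the class labels of the patterns at a node (ld = log₂); the label lists are made-up examples.

import numpy as np

def entropy_impurity(labels):
    # i(N) = -sum_j P(w_j) * log2 P(w_j), estimated from the labels at node N
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

print(entropy_impurity(['a', 'a', 'b', 'b']))   # maximally impure for two classes
print(entropy_impurity(['a', 'a', 'a']))        # pure node: impurity 0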
VU Pattern Recognition II
Nonmetric Methods
Classification and Regression Trees
Other Impurity Measures
Gini impurity (a generalization of the variance impurity): i(N) = ∑_{i≠j} P(ω_i) P(ω_j) = 1/2 [1 − ∑_j P²(ω_j)]
This is the expected error rate at N (if the label is selected from the class distribution at N)
Misclassification impurity (its discontinuous derivative may cause problems): i(N) = 1 − max_j P(ω_j)
This is the minimal probability of a misclassified pattern at N
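Both measures in the same style as the entropy sketch above; the label list is again made up.

import numpy as np

def gini_impurity(labels):
    # i(N) = 1/2 * (1 - sum_j P(w_j)^2)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 0.5 * (1.0 - float((p ** 2).sum()))

def misclassification_impurity(labels):
    # i(N) = 1 - max_j P(w_j)
    _, counts = np.unique(labels, return_counts=True)
    return 1.0 - counts.max() / counts.sum()

print(gini_impurity(['a', 'a', 'b']), misclassification_impurity(['a', 'a', 'b']))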
VU Pattern Recognition II
Nonmetric Methods
Classification and Regression Trees
Greedy Query Search
Select the query with the largest impurity decrease from N to N_L (left child) and N_R (right child): ∆i(N) = i(N) − P_L i(N_L) − (1 − P_L) i(N_R)
Nominal features (exhaustive search), continuous features (gradient descent)
The specific choice of impurity measure is uncritical; more important are the stop-splitting and pruning methods
Multiway splits (B > 2): the simple impurity decrease favors large splits → scale the impurity decrease, gain ratio impurity ∆i′(N, B) = ∆i(N, B) / (−∑_k P_k ld P_k)
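A sketch of the greedy search for a single binary split on one continuous feature; it enumerates candidate thresholds exhaustively (rather than by gradient descent) and uses the entropy impurity from above; the feature values and labels are made up.

import numpy as np

def entropy_impurity(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def best_split(x, labels):
    # maximize delta_i(N) = i(N) - P_L * i(N_L) - (1 - P_L) * i(N_R) over thresholds on x
    order = np.argsort(x)
    x, labels = x[order], labels[order]
    i_N = entropy_impurity(labels)
    best_thr, best_drop = None, -np.inf
    for thr in (x[:-1] + x[1:]) / 2.0:          # candidate thresholds between sorted values
        left = x <= thr
        P_L = left.mean()
        if P_L in (0.0, 1.0):
            continue
        drop = i_N - P_L * entropy_impurity(labels[left]) - (1 - P_L) * entropy_impurity(labels[~left])
        if drop > best_drop:
            best_thr, best_drop = thr, drop
    return best_thr, best_drop

x = np.array([0.1, 0.4, 0.35, 0.8, 0.9])        # made-up feature values
labels = np.array(['a', 'a', 'a', 'b', 'b'])
print(best_split(x, labels))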
VU Pattern Recognition II
Nonmetric Methods
Classification and Regression Trees
Stop Splitting Methods
Naive stop: split until each leaf node has impurity 0 (perfect overfitting); the tree may degenerate to a look-up table (one leaf node per pattern)
Measure split performance with a separate validation set (stop at the minimal error on the validation set)
Impurity threshold ∆i(N) ≤ β: yields unbalanced trees; how to choose β?
Pattern threshold: stop when a node represents a certain (small) number (or percentage) of patterns
Minimum description length (regularization reduces complexity): J(DT) = α·#N + ∑_{LN} i(LN) (LN = leaf nodes)
Statistical significance of the impurity reduction (distribution of ∆i)
VU Pattern Recognition II
Nonmetric Methods
Classification and Regression Trees
Pruning
Stop splitting: insufficient look-ahead (horizon effect) due to the greedy search
Pruning: merge nodes; it starts at the leaf nodes, but any node is possible
Uses the complete data set, so the cost is huge with large data sets
Rule pruning: construct and simplify rules (conjunctions) for each leaf
Context pruning: prune specific rules for specific patterns
Improved interpretability
VU Pattern Recognition II
Nonmetric Methods
Classification and Regression Trees
Feature Extraction
FIGURE 8.5. If the class of node decisions does not match the form of the training data, a very complicated decision tree will result, as shown at the top. Here decisions are parallel to the axes while in fact the data is better split by boundaries along another direction. If, however, "proper" decision forms are used (here, linear combinations of the features), the tree can be quite simple, as shown at the bottom. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Nonmetric Methods
Classification and Regression Trees
Potential Improvements
At each node train a linear classifier → arbitrary linear decision boundaries
Long training, but (again) fast recall
Integrate priors and/or costs by weights
Weighted Gini impurity with costs λ_ij: i(N) = ∑_{ij} λ_ij P(ω_i) P(ω_j)
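A direct sketch of the weighted Gini impurity; the class probabilities and the cost matrix λ are made-up examples.

import numpy as np

def weighted_gini(p, lam):
    # i(N) = sum_{i,j} lambda_ij * P(w_i) * P(w_j)
    p = np.asarray(p, dtype=float)
    return float(p @ np.asarray(lam, dtype=float) @ p)

p = [0.7, 0.3]                        # class probabilities at node N
lam = [[0.0, 1.0],                    # lambda_ij: cost of deciding w_i when w_j is true
       [5.0, 0.0]]
print(weighted_gini(p, lam))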
VU Pattern Recognition II
Nonmetric Methods
Classification and Regression Trees
Multivariate Decision Trees
FIGURE 8.6. One form of multivariate tree employs general linear decisions at each node, giving splits along arbitrary directions in the feature space. In virtually all interesting cases the training data are not linearly separable, and thus the LMS algorithm is more useful than methods that require the data to be linearly separable, even though the LMS need not yield a minimum in classification error (Chapter 5). The tree at the bottom can be simplified by methods outlined in Section 8.4.2. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Nonmetric Methods
Classification and Regression Trees
Missing Attributes
Naive approach: use only non-deficient patterns
Better: use only non–deficient attributes
Works with training, but how to classify a deficient pattern?
Surrogate splits: find alternative splits using different featureshaving maximal predictive association (correlation)
Virtual values, e.g., mean value of non–deficient feature values
VU Pattern Recognition II
Nonmetric Methods
Classification and Regression Trees
ID3
The name ID3 stems from "third interactive dichotomizer"
Nominal features (real-valued features are binned)
The branching factor at a node equals the number of values (bins) of the queried attribute
Train until all nodes are pure or no more features are left
Results in a tree depth of (at most) the number of features
No pruning
VU Pattern Recognition II
Nonmetric Methods
Classification and Regression Trees
C4.5
Refinement of ID3
B > 2 with nominal features, B = 2 with real features
Pruning based on statistical significance of splits
Missing features: sample all subtrees of missing feature usingtraining data
Additional rule pruning, can prune any node (see Figure 8.6)
VU Pattern Recognition II
Stochastic Methods
Stochastic Search
Analytical methods are problematic in high dimensions or with complex models
A large number of local optima makes gradient descent very costly
Stochastic methods try to localize promising search regions
Pure random search is often not sufficient
Simulated annealing and Boltzmann learning are motivated by statistical mechanics
Evolutionary computation is motivated by evolutionary principles from biology
VU Pattern Recognition II
Stochastic Methods
Energy Minimization
Example: minimizing (model) energy in a (Hopfield) network
Energy E = -\frac{1}{2} \sum_{i,j=1}^{N} w_{ij} s_i s_j, \quad s_i = \pm 1
Minimize energy of spin–glass model
Probability of energy state, Boltzmann factor
P(\gamma) = \frac{e^{-E_\gamma / T}}{Z(T)}
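As a small illustration (not part of the slides), the following Python sketch enumerates every configuration γ of a tiny network, evaluates the energy above, and computes the Boltzmann probabilities P(γ) = e^{-E_γ/T} / Z(T); the network size N = 4, the temperature, and the random symmetric weights are assumptions made only for this example.

import itertools
import math
import random

N, T = 4, 1.0                                   # illustrative network size and temperature
random.seed(0)
w = [[0.0] * N for _ in range(N)]               # symmetric weights, no self-coupling
for i in range(N):
    for j in range(i + 1, N):
        w[i][j] = w[j][i] = random.uniform(-1.0, 1.0)

def energy(s):
    # E = -1/2 * sum_{i,j} w_ij * s_i * s_j  for states s_i = +-1
    return -0.5 * sum(w[i][j] * s[i] * s[j] for i in range(N) for j in range(N))

configs = list(itertools.product([-1, +1], repeat=N))        # all 2^N configurations gamma
energies = [energy(s) for s in configs]
Z = sum(math.exp(-E / T) for E in energies)                  # partition function Z(T)
probs = [math.exp(-E / T) / Z for E in energies]             # Boltzmann factors P(gamma)

best = min(range(len(configs)), key=lambda g: energies[g])
print("lowest-energy configuration:", configs[best], "P =", round(probs[best], 3))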
VU Pattern Recognition II
Stochastic Methods
Recurrent Net
FIGURE 7.1. The class of optimization problems of Eq. 1 can be viewed in terms of a network of nodes or units, each of which can be in the s_i = +1 or s_i = −1 state. Every pair of nodes i and j is connected by bidirectional weights w_ij; if a weight between two nodes is zero, then no connection is drawn. (Because the networks we shall discuss can have an arbitrary interconnection, there is no notion of layers as in multilayer neural networks.) The optimization problem is to find a configuration (i.e., an assignment of all s_i) that minimizes the energy described by Eq. 1. While our convention was to show functions inside each node's circle, our convention in so-called Boltzmann networks is to indicate the state of each node. The configuration of the full network is indexed by an integer γ, and because here there are 17 binary nodes, γ is bounded 0 ≤ γ < 2^17. When such a network is used for pattern recognition, the input and output nodes are said to be visible, and the remaining nodes are said to be hidden. The states of the visible nodes and hidden nodes are indexed by α and β, respectively, and in this case are bounded 0 ≤ α < 2^10 and 0 ≤ β < 2^7. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Stochastic Methods
Energy Landscape
[Figure 7.2: energy E over parameters x1, x2; a smooth landscape (left) and a "golf course" landscape (right).]
FIGURE 7.2. The energy function or energy "landscape" on the left is meant to suggest the types of optimization problems addressed by simulated annealing. The method uses randomness, governed by a control parameter or "temperature" T, to avoid getting stuck in local energy minima and thus to find the global minimum, like a small ball rolling in the landscape as it is shaken. The pathological "golf course" landscape at the right is, generally speaking, not amenable to solution via simulated annealing because the region of lowest energy is so small and is surrounded by energetically unfavorable configurations. The configuration spaces of the problems we shall address are discrete and are more accurately displayed in Fig. 7.6. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Stochastic Methods
Simulated Annealing
Simulated Annealing Basics
Stochastic search for state of lower energy
Basic idea: occasionally go to a higher-energy state to possibly escape local minima
After a random change of parameter s_i: \Delta E_{ab} = E_b - E_a
accept E_b if E_b < E_a, or
accept E_b with P = e^{-\Delta E_{ab}/T} otherwise
Annealing schedule, e.g., T(k+1) = c T(k), 0 < c < 1, typically 0.8 < c < 0.99 (see the sketch below)
High initial temperature, large c, and large k_max (number of iterations) lead to good results (but also high computational cost)
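A minimal Python sketch of this acceptance rule with the geometric annealing schedule, flipping a single randomly chosen s_i per step as in the network example above; the concrete values of T(0), c, and k_max are illustrative choices, not prescribed by the slides.

import math
import random

def simulated_annealing(energy, N, T0=10.0, c=0.9, k_max=1000, seed=1):
    # minimize energy(s) over s in {-1, +1}^N by single-state flips
    rng = random.Random(seed)
    s = [rng.choice([-1, +1]) for _ in range(N)]
    E_a, T = energy(s), T0
    for k in range(k_max):
        i = rng.randrange(N)                     # random change of a single parameter s_i
        s[i] = -s[i]
        E_b = energy(s)
        dE = E_b - E_a                           # Delta E_ab = E_b - E_a
        if dE < 0 or rng.random() < math.exp(-dE / T):
            E_a = E_b                            # accept E_b (always if lower, else with prob e^{-dE/T})
        else:
            s[i] = -s[i]                         # reject: undo the flip and keep E_a
        T *= c                                   # annealing schedule T(k+1) = c * T(k)
    return s, E_a

# usage with the energy() and N defined in the previous sketch:
# s_min, E_min = simulated_annealing(energy, N)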
VU Pattern Recognition II
Stochastic Methods
Simulated Annealing
Simulated Annealing Experiment
[Figure 7.3: annealing schedule T(k), the trajectory through the 2^6 = 64 configurations γ of s1 to s6 from the beginning to the end of the anneal, and the energy E(k) over iterations k.]
FIGURE 7.3. Stochastic simulated annealing (Algorithm 1) uses randomness, governed by a control parameter or "temperature" T(k), to search through a discrete space for a minimum of an energy function. In this example there are N = 6 variables; the 2^6 = 64 configurations are shown along the bottom as a column of + and − symbols. The plot shows the associated energy of each configuration, given by Eq. 1 for randomly chosen weights. Every transition corresponds to the change of just a single s_i. (The configurations have been arranged so that adjacent ones differ by the state of just a single node; nevertheless, most transitions corresponding to a single node appear far apart in this ordering.) Because the system energy is invariant with respect to a global interchange s_i ↔ −s_i, there are two "global" minima. The graph at the upper left shows the annealing schedule, the decreasing temperature versus iteration number k. The middle portion shows the configuration versus iteration number generated by Algorithm 1. The trajectory through the configuration space is colored red for transitions that increase the energy and black for those that decrease the energy. Such energetically unfavorable (red) transitions become rarer later in the anneal. The graph at the right shows the full energy E(k), which decreases to the global minimum. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Stochastic Methods
Simulated Annealing
Empirical Energy States
[Figure 7.4: estimated probability P(γ) over all 64 configurations of s1 to s6 at four temperatures T(k), together with the expected energy E[E] versus iteration k.]
FIGURE 7.4. An estimate of the probability P(γ) of being in a configuration denoted by γ is shown for four temperatures during a slow anneal. (These estimates, based on a large number of runs, are nearly the theoretical values e^{−E_γ/T}.) Early, at high T, each configuration is roughly equal in probability, while late, at low T, the probability is strongly concentrated at the global minima. The expected value of the energy, E[E] (i.e., averaged at temperature T), decreases gradually during the anneal. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
VU Pattern Recognition II
Projects
Project Teams
2 students form a team
Implementation of a pattern classification method
Use existing (free) software (e.g., WEKA)
Data import, pre–processing, results (graphics, tables)
Methods must be understood, in particular their parameters!
Project report (February 10, 2014)
VU Pattern Recognition II
Projects
Project Topics
k-NN Classifier (different metrics): Kauba, Mayer
Artificial Neural Networks (Boone): Reissig, DiStolfo
Support Vector Machine: Linortner, N.N.
Decision Tree (C4.5)
Simulated Annealing (meta)
Genetic Algorithm (meta, JEvolution): Auracher, Herzog, Kirchgasser
Genetic Programming (optional, JEvolution)
VU Pattern Recognition II
Projects
Project Data Sets
Data Sets
UCI Machine Learning Archive: http://www.ics.uci.edu/~mlearn/MLRepository.html
Ionosphere: Radar Signals
Semeion Handwritten Digit: Digit Recognition
Wine Quality: Wine Critic
Leave–one–out validation (common partitioning)
Confusion matrix
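A minimal, library-free Python sketch (the actual projects would rather use WEKA or a comparable toolkit) of leave-one-out validation with a simple 1-NN classifier and a confusion-matrix tally; the Euclidean metric and the toy two-class data are assumptions made only for illustration.

import math
from collections import defaultdict

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nn_predict(train, query):
    # 1-NN: return the label of the closest training sample
    return min(train, key=lambda xy: euclidean(xy[0], query))[1]

def leave_one_out(data):
    # hold out each sample once and tally (true class, predicted class) counts
    confusion = defaultdict(int)
    for i, (x, y_true) in enumerate(data):
        train = data[:i] + data[i + 1:]
        confusion[(y_true, nn_predict(train, x))] += 1
    return confusion

# toy two-class data set: (feature vector, class label)
data = [((1.0, 1.1), "A"), ((0.9, 1.0), "A"), ((1.2, 0.8), "A"),
        ((3.0, 3.2), "B"), ((2.9, 3.1), "B"), ((3.1, 2.8), "B")]
for (t, p), count in sorted(leave_one_out(data).items()):
    print("true =", t, " predicted =", p, " count =", count)

The confusion matrix then has one row per true class and one column per predicted class, with the off-diagonal counts showing the classifier's errors.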