
VU Pattern Recognition II

VU Pattern Recognition II

Helmut A. Mayer

Department of Computer Sciences, University of Salzburg

WS 13/14


VU Pattern Recognition II

Outline

1 Introduction

2 Statistical Classifiers: Bayesian Decision Theory

3 Nonparametric Techniques: Density Estimation, k–Nearest–Neighbor Estimation

4 Linear Discriminant Functions: Decision Surfaces

5 Neural Networks

6 Nonmetric Methods: Classification and Regression Trees

7 Stochastic Methods: Simulated Annealing

8 Projects


VU Pattern Recognition II

Introduction

Human vs. Machine

Human Perception

Senses to neural patterns

Machine Perception

Sensors to value patterns

Patterns are everywhere...

Images, Time Series, Medical Diagnosis, Customer Analysis (only a few examples)

Features build Model

Fish Example

VU Pattern Recognition II

Introduction

Salmon or Sea Bass

FIGURE 1.1. The objects to be classified are first sensed by a transducer (camera), whose signals are preprocessed. Next the features are extracted and finally the classification is emitted, here either “salmon” or “sea bass.” Although the information flow is often chosen to be from the source to the classifier, some systems employ information flow in which earlier levels of processing can be altered based on the tentative or preliminary response in later levels (gray arrows). Yet others combine two or more stages into a unified step, such as simultaneous segmentation and feature extraction. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.


VU Pattern Recognition II

Introduction

Fish Length Histogram

[Figure 1.2: histograms of the length feature for salmon and sea bass (count vs. length), with the threshold l* marked]

FIGURE 1.2. Histograms for the length feature for the two categories. No single threshold value of the length will serve to unambiguously discriminate between the two categories; using length alone, we will have some errors. The value marked l* will lead to the smallest number of errors, on average. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.


VU Pattern Recognition II

Introduction

Fish Lightness Histogram

[Figure 1.3: histograms of the lightness feature for salmon and sea bass (count vs. lightness), with the threshold x* marked]

FIGURE 1.3. Histograms for the lightness feature for the two categories. No single threshold value x* (decision boundary) will serve to unambiguously discriminate between the two categories; using lightness alone, we will have some errors. The value x* marked will lead to the smallest number of errors, on average. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.


VU Pattern Recognition II

Introduction

Decision Theory

Cost of an Error?

Salmon tastes better.. ;-)

Minimization of cost (risk)

Decision Rule/Boundary

Improving Recognition

Feature Vector ~x = (lightness, width)

2D Decision Boundary

VU Pattern Recognition II

Introduction

2D Feature Space

[Figure 1.4: scatter of salmon and sea bass in the (lightness, width) feature space with a linear decision boundary]

FIGURE 1.4. The two features of lightness and width for sea bass and salmon. The dark line could serve as a decision boundary of our classifier. Overall classification error on the data shown is lower than if we use only one feature as in Fig. 1.3, but there will still be some errors. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.


VU Pattern Recognition II

Introduction

Overfitting

[Figure 1.5: salmon and sea bass in the (lightness, width) feature space with an overly complex decision boundary; the novel test point is marked ?]

FIGURE 1.5. Overly complex models for the fish will lead to decision boundaries that are complicated. While such a decision may lead to perfect classification of our training samples, it would lead to poor performance on future patterns. The novel test point marked ? is evidently most likely a salmon, whereas the complex decision boundary shown leads it to be classified as a sea bass. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.


VU Pattern Recognition II

Introduction

Generalization

[Figure 1.6: salmon and sea bass in the (lightness, width) feature space with a decision boundary of intermediate complexity]

FIGURE 1.6. The decision boundary shown might represent the optimal tradeoff between performance on the training set and simplicity of classifier, thereby giving the highest accuracy on new patterns. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.


VU Pattern Recognition II

Introduction

Related Fields

Statistical Hypothesis Testing

Image Processing

Regression (age ↔ weight)

Interpolation

Density Estimation

VU Pattern Recognition II

Introduction

Pattern Recognition Systems

[Figure 1.7 block diagram: input → sensing → segmentation → feature extraction → classification → post-processing → decision, with adjustments for missing features, adjustments for context, and costs]

FIGURE 1.7. Many pattern recognition systems can be partitioned into components such as the ones shown here. A sensor converts images or sounds or other physical inputs into signal data. The segmentor isolates sensed objects from the background or from other objects. A feature extractor measures object properties that are useful for classification. The classifier uses these features to assign the sensed object to a category. Finally, a post processor can take account of other considerations, such as the effects of context and the costs of errors, to decide on the appropriate action. Although this description stresses a one-way or “bottom-up” flow of data, some systems employ feedback from higher levels back down to lower levels (gray arrows). From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.


VU Pattern Recognition II

Introduction

Feature Extraction

Features ↔ Classification

Invariant Features (translation, rotation, scale)

Deformation (e.g. Cropping)

Feature Selection (Filter, Wrapper)

VU Pattern Recognition II

Introduction

Post Processing

Error Rate, Risk (weighted error)

Context (IC* *IN)

Multiple Classifiers (subspaces, fusion)

VU Pattern Recognition II

Introduction

Design Cycle

[Figure 1.8 flow chart: start → collect data → choose features → choose model → train classifier → evaluate classifier → end, with prior knowledge (e.g., invariances) feeding in]

FIGURE 1.8. The design of a pattern recognition system involves a design cycle similar to the one shown here. Data must be collected, both to train and to test the system. The characteristics of the data impact both the choice of appropriate discriminating features and the choice of models for the different categories. The training process uses some or all of the data to determine the system parameters. The results of evaluation may call for repetition of various steps in this process in order to obtain satisfactory results. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.


VU Pattern Recognition II

Introduction

Learning and Adaptation

Learning is Parameter Tuning

Supervised Learning (teacher)

Reinforcement Learning (critic)

Unsupervised Learning (clustering)

VU Pattern Recognition II

Statistical Classifiers

Bayesian Decision Theory

Probabilities

State of Nature ω = ω1 (class)

A Priori Probability P(ω1) (prior)

Decision Rule P(ω1) > P(ω2) → ω1

Class–Conditional Probability Density Function p(x|ω)

VU Pattern Recognition II

Statistical Classifiers

Bayesian Decision Theory

Class–Conditional Probability Density

[Figure 2.1: class-conditional densities p(x|ω1) and p(x|ω2) plotted over the feature value x]

FIGURE 2.1. Hypothetical class-conditional probability density functions show the probability density of measuring a particular feature value x given the pattern is in category ωi. If x represents the lightness of a fish, the two curves might describe the difference in lightness of populations of two types of fish. Density functions are normalized, and thus the area under each curve is 1.0. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.


VU Pattern Recognition II

Statistical Classifiers

Bayesian Decision Theory

Bayes Decision Rule

Joint Probability Density p(ωj, x) = P(ωj|x) p(x) = p(x|ωj) P(ωj)

Bayes Formula P(ωj|x) = p(x|ωj) P(ωj) / p(x)

Decision Rule P(ω1|x) > P(ω2|x) → ω1
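
To make the rule concrete, here is a minimal sketch (Python, not part of the slides): it evaluates the Bayes formula for two classes and picks the class with the larger posterior. The Gaussian class-conditional densities and their means and standard deviations are assumptions chosen purely for illustration; only the priors 2/3 and 1/3 are taken from Fig. 2.2.

```python
# Illustrative sketch: Bayes decision rule for two classes with assumed
# Gaussian class-conditional densities and the priors of Fig. 2.2.
import math

def normal_pdf(x, mu, sigma):
    """Univariate normal density p(x | mu, sigma)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (math.sqrt(2 * math.pi) * sigma)

# Assumed class-conditional parameters (hypothetical values for illustration).
classes = {
    "omega_1": {"prior": 2.0 / 3.0, "mu": 11.0, "sigma": 1.0},
    "omega_2": {"prior": 1.0 / 3.0, "mu": 13.0, "sigma": 1.0},
}

def posteriors(x):
    """P(omega_j | x) = p(x | omega_j) P(omega_j) / p(x)  (Bayes formula)."""
    joint = {c: normal_pdf(x, p["mu"], p["sigma"]) * p["prior"] for c, p in classes.items()}
    evidence = sum(joint.values())          # p(x)
    return {c: v / evidence for c, v in joint.items()}

def decide(x):
    """Decision rule: choose the class with the largest posterior."""
    post = posteriors(x)
    return max(post, key=post.get)

print(posteriors(12.0), decide(12.0))
```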

VU Pattern Recognition II

Statistical Classifiers

Bayesian Decision Theory

Posterior Probabilities

[Figure 2.2: posterior probabilities P(ω1|x) and P(ω2|x) plotted over the feature value x]

FIGURE 2.2. Posterior probabilities for the particular priors P(ω1) = 2/3 and P(ω2) = 1/3 for the class-conditional probability densities shown in Fig. 2.1. Thus in this case, given that a pattern is measured to have feature value x = 14, the probability it is in category ω2 is roughly 0.08, and that it is in ω1 is 0.92. At every x, the posteriors sum to 1.0. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.


VU Pattern Recognition II

Statistical Classifiers

Bayesian Decision Theory

Error Probabilities

Error P(error|x) = P(ω1|x) if we decide ω2, P(ω2|x) if we decide ω1

Average Error Probability P(error) = ∫_{−∞}^{∞} p(error, x) dx = ∫_{−∞}^{∞} P(error|x) p(x) dx

Bayes Rule minimizes P(error)
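
A short numerical sketch (reusing the assumed two-Gaussian example from above, not from the slides) illustrates why the Bayes rule minimizes P(error): at every x the rule errs with probability min_j P(ωj|x), so integrating min_j p(x|ωj) P(ωj) over x gives the Bayes error.

```python
# Crude numerical integration of P(error) = integral of min_j P(w_j|x) p(x) dx
# for two assumed Gaussian class-conditional densities.
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (math.sqrt(2 * math.pi) * sigma)

priors = (2.0 / 3.0, 1.0 / 3.0)          # P(w1), P(w2), as in Fig. 2.2
params = ((11.0, 1.0), (13.0, 1.0))      # assumed (mu, sigma) per class

dx, p_error, x = 0.01, 0.0, 5.0
while x < 20.0:                          # integration range covering both densities
    joints = [normal_pdf(x, m, s) * P for (m, s), P in zip(params, priors)]
    p_error += min(joints) * dx          # min_j p(x|w_j) P(w_j) = P(error|x) p(x)
    x += dx

print("Bayes error estimate:", round(p_error, 4))
```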

VU Pattern Recognition II

Statistical Classifiers

Bayesian Decision Theory

Generalized Bayes Rule

Feature vector ~x ∈ R^d

Classes ω1, …, ωc

Bayes Formula P(ωj|~x) = p(~x|ωj) P(ωj) / p(~x)

VU Pattern Recognition II

Statistical Classifiers

Bayesian Decision Theory

Dichotomizer

[Figure 2.6: weighted densities p(x|ω1)P(ω1) and p(x|ω2)P(ω2) over a two-dimensional feature space, with decision regions R1 and R2 separated by the decision boundary]

FIGURE 2.6. In this two-dimensional two-category classifier, the probability densities are Gaussian, the decision boundary consists of two hyperbolas, and thus the decision region R2 is not simply connected. The ellipses mark where the density is 1/e times that at the peak of the distribution. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.


VU Pattern Recognition II

Statistical Classifiers

Bayesian Decision Theory

The Normal Density

Randomized Prototype Vectors with Mean ~µ → Normal Distribution

Expected Value E[f(x)] = ∫_{−∞}^{∞} f(x) p(x) dx (continuous)

E[f(x)] = Σ_{x∈D} f(x) P(x) (discrete)

Univariate Normal Density

p(x) = (1/(√(2π) σ)) exp(−½ ((x − µ)/σ)²)

E[x] = µ, E[(x − µ)²] = σ²
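
A quick sketch (with arbitrarily assumed µ and σ, not from the slides) that checks E[x] = µ and E[(x − µ)²] = σ² by sampling from the normal distribution:

```python
# Sanity check of the normal density's moments via sampling (assumed mu, sigma).
import random, statistics

mu, sigma, n = 5.0, 2.0, 200_000
samples = [random.gauss(mu, sigma) for _ in range(n)]

mean = statistics.fmean(samples)                         # estimates E[x]
var = statistics.fmean((x - mu) ** 2 for x in samples)   # estimates E[(x - mu)^2]
print(round(mean, 3), round(var, 3))                     # close to 5.0 and 4.0
```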

VU Pattern Recognition II

Statistical Classifiers

Bayesian Decision Theory

Normal Distribution

[Figure 2.7: univariate normal density p(x) with µ − 2σ, µ − σ, µ, µ + σ, µ + 2σ marked; 2.5% of the area lies in each tail]

FIGURE 2.7. A univariate normal distribution has roughly 95% of its area in the range |x − µ| ≤ 2σ, as shown. The peak of the distribution has value p(µ) = 1/(√(2π)σ). From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.


VU Pattern Recognition II

Statistical Classifiers

Bayesian Decision Theory

Multivariate Density

Multivariate Normal Density

p(~x) = (1/((2π)^{d/2} |Σ|^{1/2})) exp(−½ (~x − ~µ)^t Σ^{−1} (~x − ~µ))

Covariance Matrix Σ (d × d)

E[~x] = ~µ, E[(~x − ~µ)(~x − ~µ)^t] = Σ

E[xi] = µi, E[(xi − µi)(xj − µj)] = σij
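
A minimal sketch (not from the slides; the mean and covariance values are assumed) that evaluates the multivariate normal density directly from the formula above, using numpy for the matrix algebra:

```python
# Evaluate p(x) = exp(-0.5 (x-mu)^t Sigma^{-1} (x-mu)) / ((2 pi)^{d/2} |Sigma|^{1/2}).
import numpy as np

def multivariate_normal_pdf(x, mu, sigma):
    d = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(sigma))
    mahal = diff @ np.linalg.inv(sigma) @ diff    # (x-mu)^t Sigma^{-1} (x-mu)
    return np.exp(-0.5 * mahal) / norm

mu = np.array([0.0, 0.0])
sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])                    # assumed covariance matrix
print(multivariate_normal_pdf(np.array([0.5, -0.2]), mu, sigma))
```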

VU Pattern Recognition II

Statistical Classifiers

Bayesian Decision Theory

2D Gaussian

[Figure 2.9: samples from a two-dimensional Gaussian in the (x1, x2) plane, centered on the mean µ]

FIGURE 2.9. Samples drawn from a two-dimensional Gaussian lie in a cloud centered on the mean µ. The ellipses show lines of equal probability density of the Gaussian. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.


VU Pattern Recognition II

Statistical Classifiers

Bayesian Decision Theory

Four Categories

[Figure 2.16: decision regions R1–R4 for four normal distributions]

FIGURE 2.16. The decision regions for four normal distributions. Even with such a low number of categories, the shapes of the boundary regions can be rather complex. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.


VU Pattern Recognition II

Statistical Classifiers

Bayesian Decision Theory

Classification Errors

Bayes Error: overlapping densities, inherent problem property

Model Error: incorrect model

Estimation Error: finite sample of training data

VU Pattern Recognition II

Statistical Classifiers

Bayesian Decision Theory

Bayes Error and Dimensionality

[Figure 3.3: two nonoverlapping densities in (x1, x2, x3) space and their overlapping projections onto lower-dimensional subspaces]

FIGURE 3.3. Two three-dimensional distributions have nonoverlapping densities, and thus in three dimensions the Bayes error vanishes. When projected to a subspace (here, the two-dimensional x1–x2 subspace or a one-dimensional x1 subspace) there can be greater overlap of the projected distributions, and hence greater Bayes error. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.


VU Pattern Recognition II

Nonparametric Techniques

Density Estimation

Unknown Densities

Real problems: multi–modal, parametric densities: uni–modal → estimation of densities directly from data

P that pattern ~x falls in region R: P = ∫_R p(~x) d~x

n patterns, probability that k patterns are in R: Pk = (n choose k) P^k (1 − P)^{n−k}, E[k] = nP

Assuming a small region R → p(~x) ≈ const → ∫_R p(~x) d~x ≈ p(~x) V → p(~x) ≈ k/(nV)
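
The estimate p(~x) ≈ k/(nV) can be tried directly on synthetic data. The following sketch (not from the slides; the standard-normal samples and the interval width V are assumptions for illustration) counts the k of n samples falling into a small region around a query point:

```python
# 1D density estimate at x0 via p(x0) ~ k / (n V), where R is a small interval
# of width V around x0 and k is the number of samples landing in R.
import random

random.seed(1)
n = 10_000
samples = [random.gauss(0.0, 1.0) for _ in range(n)]   # "unknown" density: standard normal

def density_estimate(x0, samples, V=0.2):
    """p(x0) estimated as k/(nV), with R = [x0 - V/2, x0 + V/2]."""
    k = sum(1 for x in samples if abs(x - x0) <= V / 2)
    return k / (len(samples) * V)

print(density_estimate(0.0, samples))   # close to 1/sqrt(2*pi) ~ 0.399
```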

VU Pattern Recognition II

Nonparametric Techniques

Density Estimation

Relative Probability

[Figure 4.1: relative probability of the estimate k/n for n = 20, 50, 100 samples, peaking at the true probability P = 0.7]

FIGURE 4.1. The relative probability an estimate given by Eq. 4 will yield a particular value for the probability density, here where the true probability was chosen to be 0.7. Each curve is labeled by the total number of patterns n sampled, and is scaled to give the same maximum (at the true probability). The form of each curve is binomial, as given by Eq. 2. For large n, such binomials peak strongly at the true probability. In the limit n → ∞, the curve approaches a delta function, and we are guaranteed that our estimate will give the true probability. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.


VU Pattern Recognition II

Nonparametric Techniques

Density Estimation

Sample Size

Estimate p(~x) ≈ k/(nV) is dependent on size of V

if V → 0, p(~x) would be exact, but no more samples in V

Assuming infinite pattern set with decreasing Vn

n–th estimate pn(~x) = kn/(n Vn)

For convergence of pn(~x) → p(~x): lim_{n→∞} Vn = 0, lim_{n→∞} kn = ∞, lim_{n→∞} kn/n = 0

Decreasing Vn, e.g., Vn = 1/√n → Parzen Windows

Increasing kn, e.g., kn = √n → kn–Nearest–Neighbors

VU Pattern Recognition II

Nonparametric Techniques

Density Estimation

Point Density Estimation

[Figure 4.2: shrinking regions around a test point for n = 1, 4, 9, 16, 100 samples; top row: Vn = 1/√n, bottom row: kn = √n]

FIGURE 4.2. There are two leading methods for estimating the density at a point, here at the center of each square. The one shown in the top row is to start with a large volume centered on the test point and shrink it according to a function such as Vn = 1/√n. The other method, shown in the bottom row, is to decrease the volume in a data-dependent way, for instance letting the volume enclose some number kn = √n of sample points. The sequences in both cases represent random variables that generally converge and allow the true density at the test point to be calculated. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.


VU Pattern Recognition II

Nonparametric Techniques

k–Nearest–Neighbor Estimation

Prototype Estimation

Estimate density at arbitrary ~x by kn nearest neighbors of ~x

pn(~x) = kn/(n Vn) (neighbors are training patterns)

Dense neighbors → small Vn → good resolution

Sparse neighbors → large Vn → bad resolution

Problem: often ∫ pn(~x) d~x > 1
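
A minimal 1D sketch of this idea (not from the slides; the training data and kn are assumed for illustration): the region around the query point is grown until it contains kn training patterns, so Vn is twice the distance to the kn-th nearest neighbor.

```python
# 1D k-nearest-neighbor density estimate: p(x0) ~ k_n / (n V_n), with
# V_n = 2 * (distance to the k_n-th nearest training pattern).
import random

random.seed(2)
train = sorted(random.gauss(0.0, 1.0) for _ in range(1000))   # assumed training patterns

def knn_density(x0, train, k=31):
    dists = sorted(abs(x - x0) for x in train)
    V = 2.0 * dists[k - 1]              # interval just enclosing the k nearest neighbors
    return k / (len(train) * V)

print(knn_density(0.0, train))          # roughly 0.4 for standard-normal data
```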

VU Pattern Recognition II

Nonparametric Techniques

k–Nearest–Neighbor Estimation

1D kNN Estimate

[Figure 4.10: p(x) estimates over x for k = 3 and k = 5]

FIGURE 4.10. Eight points in one dimension and the k-nearest-neighbor density estimates, for k = 3 and 5. Note especially that the discontinuities in the slopes in the estimates generally lie away from the positions of the prototype points. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.


VU Pattern Recognition II

Nonparametric Techniques

k–Nearest–Neighbor Estimation

2D kNN Estimate

[Figure 4.11: surface plot of p(x) over the (x1, x2) plane]

FIGURE 4.11. The k-nearest-neighbor estimate of a two-dimensional density for k = 5. Notice how such a finite n estimate can be quite “jagged,” and notice that discontinuities in the slopes generally occur along lines away from the positions of the points themselves. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.


VU Pattern Recognition II

Nonparametric Techniques

k–Nearest–Neighbor Estimation

Unimodal and Bimodal 1D kNN Estimates

[Figure 4.12: kNN density estimates of a Gaussian and a bimodal distribution for n = 1 (kn = 1), n = 16 (kn = 4), n = 256 (kn = 16), and n = ∞ (kn = ∞)]

FIGURE 4.12. Several k-nearest-neighbor estimates of two unidimensional densities: a Gaussian and a bimodal distribution. Notice how the finite n estimates can be quite “spiky.” From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.


VU Pattern Recognition II

Nonparametric Techniques

k–Nearest–Neighbor Estimation

Estimation of A Posteriori Probabilities

Samples of different classes, what is P(ωi |~x)?

Estimate for pn(~x, ωi) = ki/(nV) (in arbitrary V)

Estimate for P(ωi|~x) = pn(~x, ωi) / Σ_{j=1}^{c} pn(~x, ωj) = ki/k

With n → ∞ and Bayes Rule: optimal performance (Parzen and kNN)
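
A small sketch (not from the slides; the synthetic labeled data and the region size are assumptions) that estimates the posteriors as ki/k from the labeled patterns falling inside a fixed region around a query point:

```python
# Estimate P(w_i | x0) as k_i / k, the fraction of labeled training patterns
# inside a fixed region V around x0 that belong to class w_i.
import random
from collections import Counter

random.seed(3)
# Assumed labeled 1D training data: class "w1" ~ N(0,1), class "w2" ~ N(2,1).
train = [(random.gauss(0.0, 1.0), "w1") for _ in range(500)] + \
        [(random.gauss(2.0, 1.0), "w2") for _ in range(500)]

def posterior_estimates(x0, train, half_width=0.25):
    """Return {class: k_i / k} for the patterns with |x - x0| <= half_width."""
    inside = [label for x, label in train if abs(x - x0) <= half_width]
    counts = Counter(inside)
    k = sum(counts.values())
    return {label: k_i / k for label, k_i in counts.items()}

print(posterior_estimates(1.0, train))   # roughly equal posteriors near the overlap
```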

VU Pattern Recognition II

Nonparametric Techniques

k–Nearest–Neighbor Estimation

Nearest Neighbor Rule

Single nearest neighbor is ~x′ (k = 1)

Class label of ~x′ is θ′ (random variable)

P(θ′ = ωi) = P(ωi|~x′) ≈ P(ωi|~x) (for large n)

Assumption of 1NN: P(ωi|~x′) is the largest probability

If true (e.g., P ≈ 1, or P ≈ 1/c), then 1NN is close to the Bayes Error

Average error probability P(e) = ∫ P(e|~x) p(~x) d~x

P(e|~x) = 1 − P(ωi|~x′) is the "minimum" P∗(e|~x)

P∗(e) = ∫ P∗(e|~x) p(~x) d~x

1NN error P = lim_{n→∞} Pn(e)

P∗ ≤ P ≤ P∗(2 − (c/(c−1)) P∗)
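
A quick numeric check of the upper bound (the values of P∗ and c below are arbitrary examples, not from the slides):

```python
# Asymptotic 1NN bound: P* <= P <= P*(2 - (c/(c-1)) P*), for c classes and Bayes error P*.
def one_nn_upper_bound(p_star, c):
    return p_star * (2.0 - (c / (c - 1.0)) * p_star)

print(one_nn_upper_bound(0.05, 2))   # 0.095: at low error rates, roughly twice P*
print(one_nn_upper_bound(0.20, 4))   # about 0.347
```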

VU Pattern Recognition II

Nonparametric Techniques

k–Nearest–Neighbor Estimation

Voronoi Tessellation

[Figure 4.13: Voronoi cells around training points in two dimensions (x1, x2) and in three dimensions (x1, x2, x3)]

FIGURE 4.13. In two dimensions, the nearest-neighbor algorithm leads to a partitioning of the input space into Voronoi cells, each labeled by the category of the training point it contains. In three dimensions, the cells are three-dimensional, and the decision boundary resembles the surface of a crystal. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.


VU Pattern Recognition II

Nonparametric Techniques

k–Nearest–Neighbor Estimation

1NN Error Rate Bounds

[Figure 4.14: nearest-neighbor error rate P plotted against the Bayes error P*, bounded between the curves P = P* and P = 2P*, for P* up to (c − 1)/c]

FIGURE 4.14. Bounds on the nearest-neighbor error rate P in a c-category problem given infinite training data, where P* is the Bayes error (Eq. 52). At low error rates, the nearest-neighbor error rate is bounded above by twice the Bayes rate. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.


VU Pattern Recognition II

Nonparametric Techniques

k–Nearest–Neighbor Estimation

k–Nearest–Neighbor Rule

Straightforward extension: k neighbors

Majority voting: P(ωm|~x) is largest probability (most prototypes in class m)

If k → ∞ then k–NN rule becomes optimal
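
A minimal sketch of the k–NN rule as majority voting (not from the slides; the Euclidean distance and the toy prototypes below are assumptions for illustration):

```python
# k-nearest-neighbor classification: take the k training prototypes closest to the
# query point and return the majority class label among them.
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_classify(query, prototypes, k=5):
    """prototypes: list of (feature_vector, class_label); returns the majority label."""
    nearest = sorted(prototypes, key=lambda p: euclidean(p[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

prototypes = [((1.0, 1.2), "salmon"), ((1.1, 0.9), "salmon"), ((0.8, 1.0), "salmon"),
              ((3.0, 3.2), "sea bass"), ((2.9, 2.7), "sea bass"), ((3.3, 3.0), "sea bass")]
print(knn_classify((1.4, 1.3), prototypes, k=3))   # "salmon"
```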

VU Pattern Recognition II

Nonparametric Techniques

k–Nearest–Neighbor Estimation

5NN in 2D

[Figure 4.15: a test point x in the (x1, x2) plane with the spherical region enclosing its k = 5 nearest training samples]

FIGURE 4.15. The k-nearest-neighbor query starts at the test point x and grows a spherical region until it encloses k training samples, and it labels the test point by a majority vote of these samples. In this k = 5 case, the test point x would be labeled the category of the black points. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.


VU Pattern Recognition II

Nonparametric Techniques

k–Nearest–Neighbor Estimation

kNN Error Rate Bounds

[Figure 4.16: k-nearest-neighbor error rate P plotted against the Bayes rate P* for k = 1, 3, 5, 9, 15]

FIGURE 4.16. The error rate for the k-nearest-neighbor rule for a two-category problem is bounded by Ck(P*) in Eq. 54. Each curve is labeled by k; when k = ∞, the estimated probabilities match the true probabilities and thus the error rate is equal to the Bayes rate, that is, P = P*. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.


VU Pattern Recognition II

Nonparametric Techniques

k–Nearest–Neighbor Estimation

Metrics

What is a distance?

Properties of Metrics

Nonnegativity: D(~a,~b) ≥ 0

Reflexivity: D(~a,~b) = 0 iff ~a = ~b

Symmetry: D(~a,~b) = D(~b,~a)

Triangle inequality: D(~a,~b) + D(~b,~c) ≥ D(~a,~c)

Scaling of feature values equivalent to changing the metric

VU Pattern Recognition II

Nonparametric Techniques

k–Nearest–Neighbor Estimation

Scaling is Change of Metric

FIGURE 4.18. Scaling the coordinates of a feature space can change the distance relationships computed by the Euclidean metric. Here we see how such scaling can change the behavior of a nearest-neighbor classifier. Consider the test point x and its nearest neighbor. In the original space (left), the black prototype is closest. In the figure at the right, the x1 axis has been rescaled by a factor 1/3; now the nearest prototype is the red one. If there is a large disparity in the ranges of the full data in each dimension, a common procedure is to rescale all the data to equalize such ranges, and this is equivalent to changing the metric in the original space. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

VU Pattern Recognition II

Nonparametric Techniques

k–Nearest–Neighbor Estimation

Class of Metrics

Minkowski Metric (Lk Norm): Lk(~a, ~b) = (∑_{i=1}^d |ai − bi|^k)^{1/k}

L1 Norm: Manhattan distance

L2 Norm: Euclidean distance

L∞ Norm: Maximum of projected distances
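
A short sketch of the Minkowski (Lk) metric; the function name minkowski and the test vectors are illustrative:

import numpy as np

def minkowski(a, b, k):
    # Lk(a, b) = (sum_i |a_i - b_i|^k)^(1/k); k=1 Manhattan, k=2 Euclidean, k=inf max norm
    d = np.abs(np.asarray(a, dtype=float) - np.asarray(b, dtype=float))
    if np.isinf(k):
        return d.max()                                  # L-infinity: maximum of projected distances
    return (d ** k).sum() ** (1.0 / k)

a, b = [0.0, 0.0], [3.0, 4.0]
print(minkowski(a, b, 1), minkowski(a, b, 2), minkowski(a, b, np.inf))   # 7.0 5.0 4.0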

VU Pattern Recognition II

Nonparametric Techniques

k–Nearest–Neighbor Estimation

Minkowski Metric

FIGURE 4.19. Each colored surface consists of points a distance 1.0 from the origin, measured using different values for k in the Minkowski metric (k is printed in red). Thus the white surfaces correspond to the L1 norm (Manhattan distance), the light gray sphere corresponds to the L2 norm (Euclidean distance), the dark gray ones correspond to the L4 norm, and the pink box corresponds to the L∞ norm. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

VU Pattern Recognition II

Linear Discriminant Functions

Decision Surfaces

Discriminant Functions

Assumption: we know the form of the discriminant functions (not the probability densities)

Problem: determine the parameters of the discriminant functions

Method: gradient descent on criterion functions (based on the training set)

VU Pattern Recognition II

Linear Discriminant Functions

Decision Surfaces

Linear Classifier

FIGURE 5.1. A simple linear classifier having d input units, each corresponding to the values of the components of an input vector. Each input feature value xi is multiplied by its corresponding weight wi; the effective input at the output unit is the sum of all these products, ∑ wi xi. We show in each unit its effective input-output function. Thus each of the d input units is linear, emitting exactly the value of its corresponding feature value. The single bias unit always emits the constant value 1.0. The single output unit emits a +1 if w^t x + w0 > 0 or a −1 otherwise. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

VU Pattern Recognition II

Linear Discriminant Functions

Decision Surfaces

Linear Discriminant Functions

Linear discriminant function g(~x) = ~w^t ~x + w0

(weight vector ~w, bias w0)

Two classes: g(~x) > 0 → ω1, else ω2

or equivalently ~w^t ~x > −w0

Decision surface is a hyperplane; for ~x1, ~x2 on the boundary: ~w^t ~x1 + w0 = ~w^t ~x2 + w0 → ~w^t (~x1 − ~x2) = 0 (~w is a normal vector of the hyperplane)

Hyperplane H divides the space into two half–spaces: R1 is the positive side (g(~x) > 0), R2 is the negative side (g(~x) < 0)
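
A minimal sketch of the two–class decision rule g(~x) = ~w^t ~x + w0 (the weight vector, bias, and test point are made-up numbers):

import numpy as np

def decide(x, w, w0):
    # decide omega_1 if g(x) = w^t x + w0 > 0, else omega_2
    g = np.dot(w, x) + w0
    return 1 if g > 0 else 2

w = np.array([1.0, -2.0])      # normal vector of the separating hyperplane
w0 = 0.5                       # bias
print(decide(np.array([2.0, 0.0]), w, w0))   # g = 2.5 > 0 -> class 1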

VU Pattern Recognition II

Linear Discriminant Functions

Decision Surfaces

Multiple Classes

Variant: c dichotomizers (ωi vs. not ωi)

Variant: c(c−1)/2 dichotomizers (all class pairs)

Variant: linear machine with discriminant functions gi(~x), i = 1, . . . , c (assign ~x to the class with the largest gi(~x))

Decision boundary Hij: gi(~x) = gj(~x) → (~wi − ~wj)^t ~x + (wi0 − wj0) = 0

(~wi − ~wj) ⊥ Hij, distance of ~x to Hij: r = (gi(~x) − gj(~x)) / ||~wi − ~wj||
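
A small sketch of the linear–machine variant, which assigns ~x to the class with the largest gi(~x) (the weight matrix and test point are illustrative):

import numpy as np

def linear_machine(x, W, w0):
    # g_i(x) = w_i^t x + w_i0 for i = 1..c; return the index of the largest discriminant
    return int(np.argmax(W @ x + w0))

W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])   # three classes in 2D
w0 = np.zeros(3)
print(linear_machine(np.array([0.2, 0.9]), W, w0))     # -> 1 (second class)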

VU Pattern Recognition II

Linear Discriminant Functions

Decision Surfaces

Dichotomizers in a Four–class Problem

FIGURE 5.3. Linear decision boundaries for a four-class problem. The top figure shows ωi / not ωi dichotomies while the bottom figure shows ωi/ωj dichotomies and the corresponding decision boundaries Hij. The pink regions have ambiguous category assignments. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

VU Pattern Recognition II

Linear Discriminant Functions

Decision Surfaces

Linear Machines in Multi–class Problems

FIGURE 5.4. Decision boundaries produced by a linear machine for a three-class problem and a five-class problem. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

VU Pattern Recognition II

Linear Discriminant Functions

Decision Surfaces

Generalized Linear Discriminant Functions

More complex decision boundaries, e.g., the quadratic discriminant g(~x) = w0 + ∑_{i=1}^d wi xi + ∑_{i=1}^d ∑_{j=1}^d wij xi xj

Generalized LDF: g(~x) = ∑_{i=1}^{d̂} ai yi(~x) = ~a^t ~y, where the d̂ functions yi(~x) map points from the d–dimensional ~x–space to the d̂–dimensional ~y–space

Example: g(x) = a1 + a2 x + a3 x^2, ~y = (1, x, x^2)^t

Decision boundary is linear in ~y–space; the transformed density is degenerate (all probability mass lies on a lower–dimensional surface in ~y–space); if d̂ is large, a huge number of parameters must be estimated (requires a large training data set)
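
A tiny sketch of the quadratic example above: the mapping ~y = (1, x, x^2)^t makes g linear in ~y–space while the decision region in x–space need not be simply connected (the coefficients are illustrative):

import numpy as np

def phi(x):
    # map a scalar x to y = (1, x, x^2)^t
    return np.array([1.0, x, x * x])

a = np.array([-1.0, 0.0, 1.0])            # g(x) = -1 + x^2: decide omega_1 for |x| > 1
for x in (-2.0, 0.5, 1.5):
    print(x, np.dot(a, phi(x)) > 0)       # True, False, True: two disjoint omega_1 regions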

VU Pattern Recognition II

Linear Discriminant Functions

Decision Surfaces

From 1D to 3D

FIGURE 5.5. The mapping y = (1, x, x^2)^t takes a line and transforms it to a parabola in three dimensions. A plane splits the resulting y-space into regions corresponding to two categories, and this in turn gives a nonsimply connected decision region in the one-dimensional x-space. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

VU Pattern Recognition II

Linear Discriminant Functions

Decision Surfaces

From 2D to 3D

FIGURE 5.6. The two-dimensional input space x is mapped through a polynomial function f to y. Here the mapping is y1 = x1, y2 = x2 and y3 ∝ x1 x2. A linear discriminant in this transformed space is a hyperplane, which cuts the surface. Points to the positive side of the hyperplane H correspond to category ω1, and those beneath it correspond to category ω2. Here, in terms of the x space, R1 is not simply connected. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

VU Pattern Recognition II

Linear Discriminant Functions

Decision Surfaces

Linearly Separable Dichotomy

Two classes, samples ~yi: ~a^t ~yi > 0 → ω1, ~a^t ~yi < 0 → ω2

"Normalization" of the ω2 samples: replace ~yi by −~yi → seek ~a with ~a^t ~yi > 0 ∀~yi

Solution region contains all possible solution vectors ~a: intersection of n half–spaces bounded by the hyperplanes ~a^t ~yi = 0

Margin b > 0: require ~a^t ~yi ≥ b; the new solution region lies at distance b/||~yi|| from the old boundaries
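
A small sketch of the "normalization" trick: after flipping the sign of the ω2 samples, a solution vector ~a must satisfy ~a^t ~yi > 0 for every sample (the toy augmented samples and candidate ~a are illustrative):

import numpy as np

def normalize_samples(Y, labels):
    # flip the sign of all class-2 samples so that a^t y_i > 0 is required for every row
    Y = Y.astype(float).copy()
    Y[labels == 2] *= -1
    return Y

Y = np.array([[1.0, 2.0, 1.0], [1.0, 3.0, 2.0],        # class 1 (augmented with leading 1)
              [1.0, -1.0, -2.0], [1.0, -2.0, -1.0]])   # class 2
labels = np.array([1, 1, 2, 2])
Yn = normalize_samples(Y, labels)
a = np.array([0.0, 1.0, 1.0])                          # a candidate solution vector
print(bool(np.all(Yn @ a > 0)))                        # True: a lies in the solution region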

VU Pattern Recognition II

Linear Discriminant Functions

Decision Surfaces

Solution Region and Normalization

FIGURE 5.8. Four training samples (black for ω1, red for ω2) and the solution region in feature space. The figure on the left shows the raw data; the solution vector leads to a plane that separates the patterns from the two categories. In the figure on the right, the red points have been "normalized", that is, changed in sign. Now the solution vector leads to a plane that places all "normalized" points on the same side. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

VU Pattern Recognition II

Linear Discriminant Functions

Decision Surfaces

Solution Region with Margins

FIGURE 5.9. The effect of the margin on the solution region. At the left is the case of no margin (b = 0), equivalent to a case such as shown at the left in Fig. 5.8. At the right is the case b > 0, shrinking the solution region by margins b/||yi||. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

VU Pattern Recognition II

Linear Discriminant Functions

Decision Surfaces

Gradient Descent Solutions

Set of linear inequalities ~a^t ~yi > 0; define a criterion function J(~a) that is minimized by a solution vector ~a*

Minimizing the scalar function J(~a) by gradient descent: ~a(k+1) = ~a(k) − η(k) ∇J(~a(k))

Second–order expansion: J(~a) ≈ J(~a(k)) + ∇J^t (~a − ~a(k)) + (1/2) (~a − ~a(k))^t H (~a − ~a(k)), where H is the Hessian matrix

Minimizing J(~a(k+1)) with respect to the learning rate gives η(k) = ||∇J||^2 / (∇J^t H ∇J)

If J(~a) is quadratic in ~a → H = const. → η = const.

Minimizing the second–order expansion with respect to ~a(k+1) → Newton descent ~a(k+1) = ~a(k) − H^{−1} ∇J (expensive: H must be inverted)
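
A minimal sketch of fixed–step gradient descent ~a(k+1) = ~a(k) − η ∇J(~a(k)) on a quadratic criterion (the Hessian H, learning rate, and starting point are made up):

import numpy as np

def gradient_descent(grad, a0, eta=0.1, steps=100):
    # iterate a(k+1) = a(k) - eta * grad(a(k)) with a constant learning rate
    a = np.asarray(a0, dtype=float)
    for _ in range(steps):
        a = a - eta * grad(a)
    return a

H = np.array([[2.0, 0.0], [0.0, 10.0]])   # constant Hessian of J(a) = 0.5 a^t H a
grad = lambda a: H @ a                    # gradient of the quadratic criterion
print(gradient_descent(grad, [4.0, -3.0], eta=0.1))   # approaches the minimizer [0, 0]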

VU Pattern Recognition II

Linear Discriminant Functions

Decision Surfaces

Gradient and Newton Descent

FIGURE 5.10. The sequence of weight vectors given by a simple gradient descent method (red) and by Newton's (second order) algorithm (black). Newton's method typically leads to greater improvement per step, even when using optimal learning rates for both methods. However, the added computational burden of inverting the Hessian matrix used in Newton's method is not always justified, and simple gradient descent may suffice. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

VU Pattern Recognition II

Linear Discriminant Functions

Decision Surfaces

Perceptron Criterion Function

"Normalized" inequalities ~a^t ~yi > 0; Perceptron criterion Jp(~a) = ∑_{~y∈Y} (−~a^t ~y) (Y is the set of misclassified patterns)

Gradient ∇Jp = ∑_{~y∈Y} (−~y)

Update rule ~a(k+1) = ~a(k) + η(k) ∑_{~y∈Yk} ~y

Batch vs. single–sample correction
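
A compact sketch of the batch perceptron update on "normalized" samples (data and learning rate are illustrative; for linearly separable data the loop stops once all inequalities hold):

import numpy as np

def perceptron_train(Y, eta=1.0, epochs=100):
    # batch correction: a(k+1) = a(k) + eta * sum of the currently misclassified samples
    a = np.zeros(Y.shape[1])
    for _ in range(epochs):
        misclassified = Y[Y @ a <= 0]               # the set Y_k
        if len(misclassified) == 0:
            break                                   # all a^t y_i > 0: solution found
        a = a + eta * misclassified.sum(axis=0)
    return a

Y = np.array([[1.0, 2.0, 1.0], [1.0, 3.0, 2.0],
              [-1.0, 1.0, 2.0], [-1.0, 2.0, 1.0]])  # class-2 rows already sign-flipped
a = perceptron_train(Y)
print(a, bool(np.all(Y @ a > 0)))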

VU Pattern Recognition II

Linear Discriminant Functions

Decision Surfaces

Minimum Squared–Error Procedures

Set of equalities ~a^t ~yi = bi, where the bi > 0 are arbitrary positive constants

Solve Y~a = ~b, where Y is the n × (d+1) matrix whose rows are the (augmented) training vectors

If Y were square and nonsingular, ~a = Y^{−1}~b; however, Y is usually rectangular!

Minimizing the squared error ||~e||^2 = ||Y~a − ~b||^2 leads to Y^t Y ~a = Y^t ~b → ~a = (Y^t Y)^{−1} Y^t ~b = Y†~b, where Y† is the (d+1) × n pseudoinverse of Y
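
A short sketch of the MSE/pseudoinverse solution ~a = Y†~b using NumPy's pinv (toy data, margins bi = 1):

import numpy as np

Y = np.array([[1.0, 2.0, 1.0], [1.0, 3.0, 2.0],
              [-1.0, 1.0, 2.0], [-1.0, 2.0, 1.0]])    # n x (d+1), class-2 rows sign-flipped
b = np.ones(len(Y))                                   # arbitrary positive constants b_i
a = np.linalg.pinv(Y) @ b                             # a = (Y^t Y)^{-1} Y^t b when Y^t Y is invertible
print(a, Y @ a)                                       # Y a approximates b in the least-squares sense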

VU Pattern Recognition II

Linear Discriminant Functions

Decision Surfaces

Support Vector Machines

Transform patterns to a (much) higher dimension via a nonlinear mapping ϕ(·)

Linear discriminant g(~y) = ~a^t ~y

Distance of ~yk to the hyperplane H: zk g(~yk) / ||~a|| ≥ b

zk = ±1 (normalization by class label), b is the margin

Maximize b under the constraint ||~a|| = 1/b → equivalently, minimize ||~a|| subject to inequality constraints

Kuhn–Tucker theorem: optimization with inequality constraints, a generalization of Lagrange multipliers

VU Pattern Recognition II

Linear Discriminant Functions

Decision Surfaces

Maximal Margin SVM

Maximize the margin b using the Kuhn–Tucker functional L(~a, ~α) = (1/2) ||~a||^2 − ∑_{k=1}^n αk [zk ~a^t ~yk − 1]

Resulting dual problem (quadratic optimization): L(~α) = ∑_{k=1}^n αk − (1/2) ∑_{k,j}^n αk αj zk zj ~yj^t ~yk, with constraints ∑_{k=1}^n zk αk = 0 and αk ≥ 0

Then ~a* = ∑_{i=1}^n zi αi* ~yi (a non–zero αi indicates a support vector)

Maximal margin b* = (∑_{i=1}^n αi*)^{−1/2}
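
A sketch of the hard–margin dual on toy data, using scipy's generic SLSQP optimizer as a stand-in for a dedicated quadratic–programming solver (data, names, and the solver choice are illustrative, not from the lecture):

import numpy as np
from scipy.optimize import minimize

Y = np.array([[1.0, 1.0], [2.0, 2.5], [-1.0, -1.5], [-2.0, -1.0]])   # patterns y_k
z = np.array([1.0, 1.0, -1.0, -1.0])                                 # labels z_k = +-1
G = (z[:, None] * z[None, :]) * (Y @ Y.T)                            # G_kj = z_k z_j y_k^t y_j

def neg_dual(alpha):
    # maximize L(alpha) = sum_k alpha_k - 0.5 sum_kj alpha_k alpha_j z_k z_j y_j^t y_k
    return -(alpha.sum() - 0.5 * alpha @ G @ alpha)

cons = {"type": "eq", "fun": lambda alpha: np.dot(z, alpha)}          # sum_k z_k alpha_k = 0
res = minimize(neg_dual, np.zeros(len(Y)), method="SLSQP",
               bounds=[(0.0, None)] * len(Y), constraints=[cons])
alpha = res.x
a = (z * alpha) @ Y                                                   # a* = sum_i z_i alpha_i* y_i
print(alpha.round(3), a)                                              # non-zero alphas mark support vectors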

VU Pattern Recognition II

Linear Discriminant Functions

Decision Surfaces

Maximal Margin Hyperplane

FIGURE 5.19. Training a support vector machine consists of finding the optimal hyperplane, that is, the one with the maximum distance from the nearest training patterns. The support vectors are those (nearest) patterns, a distance b from the hyperplane. The three support vectors are shown as solid dots. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

VU Pattern Recognition II

Linear Discriminant Functions

Decision Surfaces

Soft Margin SVM

Maximal margin SVM is sensitive to outliers and demands linear separability for a solution

Soft margin SVM introduces slack variables ξk: zk g(~yk) ≥ b − ξk (relaxed margin)

Maximize the relaxed margin b with the Kuhn–Tucker functional L(~a, ~α, ~ξ) = (1/2) ||~a||^2 + (C/2) ∑_{k=1}^n ξk^2 − ∑_{k=1}^n αk [zk ~a^t ~yk − 1 + ξk]

Again ~a* = ∑_{i=1}^n zi αi* ~yi

Maximal margin b* = (∑_{i=1}^n [αi* − (1/C) |αi*|^2])^{−1/2}

Depends on the parameter C!
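
A brief illustration of the dependence on C, here using scikit-learn's SVC as a convenient soft–margin implementation (library choice and data are illustrative; SVC's hinge–loss formulation differs slightly from the quadratic–slack one on the slide):

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (30, 2)), rng.normal(1, 1, (30, 2))])
y = np.array([-1] * 30 + [1] * 30)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # small C tolerates many margin violations (many support vectors); large C penalizes them
    print(C, len(clf.support_), clf.coef_.round(2))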

VU Pattern Recognition II

Neural Networks

Multilayer Neural Networks

Real-world problems: linear discriminant often not sufficient

NNs also implement nonlinear mapping to higher dimension

Learning finds mapping AND linear discriminant

Error–backpropagation is a least–squares fit to the Bayes discriminant functions

NNs motivated by biology, but can be explained without it

VU Pattern Recognition II

Neural Networks

XOR Net

FIGURE 6.1. The two-bit parity or exclusive-OR problem can be solved by a three-layer network. At the bottom is the two-dimensional feature x1 x2-space, along with the four patterns to be classified. The three-layer network is shown in the middle. The input units are linear and merely distribute their feature values through multiplicative weights to the hidden units. The hidden and output units here are linear threshold units, each of which forms the linear sum of its inputs times their associated weight to yield net, and emits a +1 if this net is greater than or equal to 0, and −1 otherwise, as shown by the graphs. Positive or "excitatory" weights are denoted by solid lines, negative or "inhibitory" weights by dashed lines; each weight magnitude is indicated by the line's thickness, and is labeled. The single output unit sums the weighted signals from the hidden units and bias to form its net, and emits a +1 if its net is greater than or equal to 0 and emits a −1 otherwise. Within each unit we show a graph of its input-output or activation function, f(net) versus net. This function is linear for the input units, a constant for the bias, and a step or sign function elsewhere. We say that this network has a 2-2-1 fully connected topology, describing the number of units (other than the bias) in successive layers. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

VU Pattern Recognition II

Neural Networks

Network Components

Neurons and synaptic connections (weights)

Net activation netj = ∑_{i=1}^d xi wji + wj0 = ∑_{i=0}^d xi wji ≡ ~wj^t ~x

Neuron output zk = f(netk), where f is the activation function

A common class of activation functions are sigmoids, e.g., f(x) = 1 / (1 + e^{−cx})

Basic topologies: feed–forward and recurrent
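
A minimal sketch of a single neuron: net activation followed by a sigmoid activation function (the weights and input are made-up numbers):

import numpy as np

def sigmoid(net, c=1.0):
    # f(net) = 1 / (1 + exp(-c * net))
    return 1.0 / (1.0 + np.exp(-c * net))

def neuron(x, w, w0):
    # net = w^t x + w0, output z = f(net)
    return sigmoid(np.dot(w, x) + w0)

print(neuron(np.array([0.5, -1.0]), np.array([2.0, 1.0]), w0=0.3))   # f(0.3) ≈ 0.574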

VU Pattern Recognition II

Neural Networks

A 2–4–1 Network

FIGURE 6.2. A 2-4-1 network (with bias) along with the response functions at different units; each hidden and output unit has a sigmoidal activation function f(·). In the case shown, the hidden unit outputs are paired in opposition thereby producing a "bump" at the output unit. Given a sufficiently large number of hidden units, any continuous function from input to output can be approximated arbitrarily well by such a network. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

VU Pattern Recognition II

Neural Networks

NN Decision Boundaries

FIGURE 6.3. Whereas a two-layer network classifier can only implement a linear decision boundary, given an adequate number of hidden units, three-, four- and higher-layer networks can implement arbitrary decision boundaries. The decision regions need not be convex or simply connected. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc. (Panels: two-layer vs. three-layer networks in x1 x2-space.)

VU Pattern Recognition II

Neural Networks

Network Learning

Learning as minimization (of network error)

Error is a function of network parameters

Gradient descent methods reduce error

Problem: assigning error to the hidden–layer weights (credit assignment)

Backpropagation = iterative local gradient descent; Werbos (1974), Rumelhart, Hinton, Williams (1986)

Error–backpropagation: the output error is transmitted backwards as a weighted error, and the network weights are updated locally

Weight update ∆wj,i = η δj ai with a generalized error term δ

Common transfer functions: differentiable, nonlinear, monotonic, with an easily computable derivative

VU Pattern Recognition II

Neural Networks

Error–Backpropagation I

[Figure: a fully connected 2-2-2 feedforward network; inputs x_1, x_2, hidden units with net activations H_1, H_2 and outputs y_1, y_2, output units with net activations I_1, I_2 and outputs z_1, z_2; input-to-hidden weights v_{j,i}, hidden-to-output weights w_{k,j}]

H_j = ∑_{i=1}^n v_{j,i} x_i,   I_k = ∑_{j=1}^h w_{k,j} y_j,   y_j = f(H_j),   z_k = f(I_k)

Error E^(p) = ½ ∑_{k=1}^m (t_k^(p) − z_k^(p))²

Output layer: ∆w_{k,j} = −η ∂E/∂w_{k,j}

∂E/∂w_{k,j} = (∂E/∂I_k) (∂I_k/∂w_{k,j}) = (∂E/∂I_k) y_j

∂E/∂I_k = (∂E/∂z_k) (∂z_k/∂I_k) = −(t_k − z_k) f′(I_k)

∂E/∂w_{k,j} = −(t_k − z_k) f′(I_k) y_j   with   δ_k = (t_k − z_k) f′(I_k)

∆w_{k,j} = η δ_k y_j
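
A small numeric sketch of this output-layer update (illustration only; the array names and shapes are assumptions, and sigmoid_prime is the helper from the sketch above):

import numpy as np

def output_layer_update(t, z, I, y, eta=0.1):
    """Delta rule for the hidden-to-output weights w[k, j].

    t, z, I: targets, outputs, and net activations of the m output units
    y:       outputs of the h hidden units
    Returns the weight change Delta w with shape (m, h).
    """
    delta_k = (t - z) * sigmoid_prime(I)   # generalized error term per output unit
    return eta * np.outer(delta_k, y)      # Delta w[k, j] = eta * delta_k * y_j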


VU Pattern Recognition II

Neural Networks

Error–Backpropagation II

Hidden layer: ∆v_{j,i} = −η ∂E/∂v_{j,i}

∂E/∂v_{j,i} = (∂E/∂H_j) (∂H_j/∂v_{j,i}) = (∂E/∂H_j) x_i

∂E/∂H_j = (∂E/∂y_j) (∂y_j/∂H_j) = (∂E/∂y_j) f′(H_j)

∂E/∂y_j = ½ ∑_{k=1}^m ∂(t_k − f(I_k))²/∂y_j = −∑_{k=1}^m (t_k − z_k) f′(I_k) w_{k,j}

with δ_j = f′(H_j) ∑_{k=1}^m δ_k w_{k,j}:   ∆v_{j,i} = η δ_j x_i

Local update rules propagating the error from output to input

Present all p patterns of the training set = 1 epoch (complete training, e.g., 1,000 epochs)

Batch learning (off-line): accumulate the weight changes for all patterns, then update the weights

On-line learning: update the weights after each pattern
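
Putting both update rules together, a compact trainer for a single-hidden-layer network might look as follows. This is a sketch under the slide notation, with bias terms omitted for brevity; in practice a constant bias input is appended to x and to the hidden outputs y:

import numpy as np

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
sigmoid_prime = lambda a: sigmoid(a) * (1.0 - sigmoid(a))

def train_backprop(X, T, h=2, eta=0.1, epochs=1000, online=True, rng=None):
    """Error backpropagation for one hidden layer (no biases).

    X: (p, n) input patterns, T: (p, m) target patterns.
    Returns the weight matrices V (h, n) and W (m, h).
    """
    rng = np.random.default_rng(rng)
    n, m = X.shape[1], T.shape[1]
    V = rng.uniform(-0.5, 0.5, size=(h, n))   # input-to-hidden weights v[j, i]
    W = rng.uniform(-0.5, 0.5, size=(m, h))   # hidden-to-output weights w[k, j]

    for _ in range(epochs):
        dV, dW = np.zeros_like(V), np.zeros_like(W)
        for x, t in zip(X, T):
            # forward pass
            H = V @ x; y = sigmoid(H)          # hidden net activations and outputs
            I = W @ y; z = sigmoid(I)          # output net activations and outputs
            # backward pass
            delta_k = (t - z) * sigmoid_prime(I)
            delta_j = sigmoid_prime(H) * (W.T @ delta_k)
            if online:                          # update after each pattern
                W += eta * np.outer(delta_k, y)
                V += eta * np.outer(delta_j, x)
            else:                               # batch: accumulate, update per epoch
                dW += eta * np.outer(delta_k, y)
                dV += eta * np.outer(delta_j, x)
        if not online:
            W += dW; V += dV
    return V, W

Setting online=False switches from on-line updates to batch (off-line) learning.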


VU Pattern Recognition II

Neural Networks

Learning Curves

[Figure: learning curves, average error per pattern J/n versus epochs, for the training, validation, and test sets]

FIGURE 6.6. A learning curve shows the criterion function as a function of the amount of training, typically indicated by the number of epochs or presentations of the full training set. We plot the average error per pattern, that is, (1/n) ∑_{p=1}^n J_p. The validation error and the test or generalization error per pattern are virtually always higher than the training error. In some protocols, training is stopped at the first minimum of the validation set. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
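
If training is to be stopped at the first minimum of the validation error, a simple monitor suffices (a sketch; it assumes the per-epoch validation errors are already available):

def early_stopping(val_errors, patience=1):
    """Return the epoch index at which training should stop.

    val_errors: list of validation errors, one per epoch.
    Stops once the error has not improved for `patience` epochs.
    """
    best, best_epoch, waited = float("inf"), 0, 0
    for epoch, err in enumerate(val_errors):
        if err < best:
            best, best_epoch, waited = err, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch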


VU Pattern Recognition II

Neural Networks

XOR Learning Details

[Figure: the four XOR patterns in x1-x2 space with the network's nonlinear decision regions; the hidden-unit representation of the four patterns in y1-y2 space at epochs 1, 15, 30, 45, and 60 with the final linear decision boundary; and the learning curves (total error J and error on individual patterns) over 60 epochs]

FIGURE 6.10. A 2-2-1 backpropagation network with bias and the four patterns of the XOR problem are shown at the top. The middle figure shows the outputs of the hidden units for each of the four patterns; these outputs move across the y1y2-space as the network learns. In this space, early in training (epoch 1) the two categories are not linearly separable. As the input-to-hidden weights learn, as marked by the number of epochs, the categories become linearly separable. The dashed line is the linear decision boundary determined by the hidden-to-output weights at the end of learning; indeed the patterns of the two classes are separated by this boundary. The bottom graph shows the learning curves: the error on individual patterns and the total error as a function of epoch. Note that, as frequently happens, the total training error decreases monotonically, even though this is not the case for the error on each individual pattern. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.


VU Pattern Recognition II

Neural Networks

Backpropagation Variants I

Standard backpropagation: w_t = w_{t−1} − η ∇E

Gradient reuse: use the same ∇E as long as the error drops

BP with variable stepsize (learning rate) η

BP with momentum: ∆w_t = −η ∇E + α ∆w_{t−1}
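
A sketch of the momentum variant; grad_E stands for whatever routine computes the current error gradient, and the parameter names are illustrative:

import numpy as np

def momentum_step(w, grad_E, prev_dw, eta=0.1, alpha=0.9):
    """One weight update with momentum:
    Delta w_t = -eta * grad E + alpha * Delta w_{t-1}."""
    dw = -eta * grad_E(w) + alpha * prev_dw
    return w + dw, dw      # new weights and Delta w_t for the next step

prev_dw starts at zero; α is often chosen close to 0.9.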


VU Pattern Recognition II

Nonmetric Methods

Decision Trees

Real problems: nominal data, e.g., car = {green, red, blue}

Rule-based or syntactic methods

Decision tree (DT): a series of questions (nodes) leads to an answer at a leaf (category)

DT is interpretable (decisions and categories)


VU Pattern Recognition II

Nonmetric Methods

Monothetic Decision Tree

[Figure: a monothetic decision tree for classifying fruit; the root (level 0) asks Color? with branches green/yellow/red, deeper levels ask Size?, Shape?, and Taste?, and the leaves are categories such as Watermelon, Apple, Grape, Grapefruit, Banana, Lemon, and Cherry]

FIGURE 8.1. Classification in a basic decision tree proceeds from top to bottom. The questions asked at each node concern a particular property of the pattern, and the downward links correspond to the possible values. Successive nodes are visited until a terminal or leaf node is reached, where the category label is read. Note that the same question, Size?, appears in different places in the tree and that different questions can have different numbers of branches. Moreover, different leaf nodes, shown in pink, can be labeled by the same category (e.g., Apple). From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.


VU Pattern Recognition II

Nonmetric Methods

Classification and Regression Trees

CART

Goal: construct pure nodes (ideally, all leaf nodes are pure)

A pure leaf node contains only patterns of a single category

Design Issues

Branching factor = number of splits?
Which query (property) at which node?
Termination (leaf node)?
Pruning (simplification)?
Missing data?


VU Pattern Recognition II

Nonmetric Methods

Classification and Regression Trees

Monothetic Decision Boundaries

[Figure: axis-parallel decision regions R1 and R2 produced by monothetic trees in two-dimensional (x1, x2) and three-dimensional (x1, x2, x3) two-category examples]

FIGURE 8.3. Monothetic decision trees create decision boundaries with portions perpendicular to the feature axes. The decision regions are marked R1 and R2 in these two-dimensional and three-dimensional two-category examples. With a sufficiently large tree, any decision boundary can be approximated arbitrarily well in this way. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.


VU Pattern Recognition II

Nonmetric Methods

Classification and Regression Trees

Entropy Impurity

Each non-binary tree can be transformed into a binary tree

Monothetic (single feature per node) and polythetic (multiple features per node) trees

Any query at a node should gain maximal purity (or minimal impurity)

Entropy impurity of a node N with class "probabilities" P(ω_j):
i(N) = −∑_j P(ω_j) ld P(ω_j)

i(N) = 0 → pure node
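
The entropy impurity can be transcribed directly (a sketch; the input is assumed to be the vector of class counts at node N, and ld denotes the base-2 logarithm):

import numpy as np

def entropy_impurity(counts):
    """i(N) = -sum_j P(w_j) ld P(w_j) for the class counts at node N."""
    counts = np.asarray(counts, dtype=float)
    P = counts / counts.sum()
    P = P[P > 0]                     # 0 * ld 0 is taken as 0
    return -np.sum(P * np.log2(P))

For example, entropy_impurity([5, 5]) gives 1.0 (a maximally impure two-class node) and entropy_impurity([10, 0]) gives 0.0 (a pure node).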


VU Pattern Recognition II

Nonmetric Methods

Classification and Regression Trees

Other Impurity Measures

Gini impurity (generalization of variance impurity):
i(N) = ∑_{i≠j} P(ω_i) P(ω_j) = ½ [1 − ∑_j P²(ω_j)]

Expected error rate at N (if the pattern label is selected from the class distribution at N)

Misclassification impurity (its discontinuous derivative may cause problems):
i(N) = 1 − max_j P(ω_j)

Minimal probability of a misclassified pattern at N
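
The same class-count convention as above gives short sketches of these two measures (the ½ factor follows the slide's formula; the common definition of the Gini impurity drops it):

import numpy as np

def gini_impurity(counts):
    """i(N) = (1/2) [1 - sum_j P(w_j)^2], summing each unordered class pair once."""
    P = np.asarray(counts, dtype=float) / np.sum(counts)
    return 0.5 * (1.0 - np.sum(P ** 2))

def misclassification_impurity(counts):
    """i(N) = 1 - max_j P(w_j)."""
    P = np.asarray(counts, dtype=float) / np.sum(counts)
    return 1.0 - P.max()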


VU Pattern Recognition II

Nonmetric Methods

Classification and Regression Trees

Greedy Query Search

Select the query with the largest impurity decrease from N to N_L (left child) and N_R (right child):
∆i(N) = i(N) − P_L i(N_L) − (1 − P_L) i(N_R)

Nominal features (exhaustive search), continuous features (gradient descent)

The specific choice of impurity measure is uncritical; more important are stop-splitting and pruning methods

Multiway splits (B > 2): the simple impurity decrease favors large splits, so the decrease is scaled, Gain Ratio Impurity
∆i′(N,B) = ∆i(N,B) / (−∑_k P_k ld P_k)
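
A sketch of the greedy search for a single binary split on one continuous feature, scoring each candidate threshold by the impurity decrease ∆i(N); the helper names are illustrative, and any of the impurity functions above can be passed in:

import numpy as np

def best_threshold(x, labels, impurity):
    """Return (threshold, impurity decrease) maximizing
    Delta i(N) = i(N) - P_L i(N_L) - (1 - P_L) i(N_R)."""
    classes = np.unique(labels)
    def counts(mask):
        return [np.sum(labels[mask] == c) for c in classes]
    i_N = impurity(counts(np.ones(len(x), dtype=bool)))
    best = (None, 0.0)
    for thr in np.unique(x)[:-1]:          # candidate splits between observed values
        left = x <= thr
        P_L = left.mean()
        drop = i_N - P_L * impurity(counts(left)) - (1 - P_L) * impurity(counts(~left))
        if drop > best[1]:
            best = (thr, drop)
    return best

best_threshold(x, labels, entropy_impurity) returns the chosen threshold and its impurity decrease.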


VU Pattern Recognition II

Nonmetric Methods

Classification and Regression Trees

Stop Splitting Methods

Naive stop: each leaf node has impurity 0 (perfect overfitting), may degenerate to a look-up table (a leaf node for each pattern)

Measure split performance with a separate validation set (stop at minimal error on the validation set)

Impurity threshold ∆i(N) ≤ β: unbalanced trees, choice of β?

Pattern threshold: stop when a node represents a certain (small) number (percentage) of patterns

Minimum Description Length (regularization reduces complexity):
J(DT) = α · #N + ∑_{LN} i(LN)   (LN = leaf nodes)

Statistical significance of the impurity reduction (distribution of ∆i)
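
Two of these criteria written out as a sketch (the tree representation, a node count and a list of per-leaf class counts, is an assumption for illustration):

def should_stop(delta_i, n_patterns, beta=0.01, min_patterns=5):
    """Stop splitting if the impurity decrease or the node size falls below a threshold."""
    return delta_i <= beta or n_patterns <= min_patterns

def mdl_cost(n_nodes, leaf_counts, impurity, alpha=0.1):
    """J(DT) = alpha * #N + sum over leaf nodes of i(LN)."""
    return alpha * n_nodes + sum(impurity(c) for c in leaf_counts)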


VU Pattern Recognition II

Nonmetric Methods

Classification and Regression Trees

Pruning

Stop splitting: insufficient look-ahead (horizon effect) due to greedy search

Pruning: merge nodes, starts at leaf nodes, but any node is possible

Uses the complete data set, huge cost with large data sets

Rule pruning: construct and simplify rules (conjunctions) for each leaf

Context pruning: prune specific rules for specific patterns

Improved interpretability


VU Pattern Recognition II

Nonmetric Methods

Classification and Regression Trees

Feature Extraction

[Figure: top, two-class training data in the unit square split by an axis-parallel tree with queries x1 < 0.27, x2 < 0.32, x1 < 0.07, x2 < 0.6, x1 < 0.55, x2 < 0.86, x1 < 0.81 into regions R1 and R2; bottom, the same data split by the single linear query −1.2 x1 + x2 < 0.1]

FIGURE 8.5. If the class of node decisions does not match the form of the training data, a very complicated decision tree will result, as shown at the top. Here decisions are parallel to the axes while in fact the data is better split by boundaries along another direction. If, however, "proper" decision forms are used (here, linear combinations of the features), the tree can be quite simple, as shown at the bottom. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.


VU Pattern Recognition II

Nonmetric Methods

Classification and Regression Trees

Potential Improvements

At each node train a linear classifier: arbitrary linear decision boundaries

Long training, (again) fast recall

Integrate priors and/or costs by weights

Weighted Gini impurity with costs λ_ij:
i(N) = ∑_{i,j} λ_ij P(ω_i) P(ω_j)
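
The cost-weighted Gini impurity as a small sketch (the cost matrix λ is supplied by the user; the names are illustrative):

import numpy as np

def weighted_gini(counts, lam):
    """i(N) = sum_{i,j} lambda_ij P(w_i) P(w_j) with misclassification costs lambda_ij."""
    P = np.asarray(counts, dtype=float) / np.sum(counts)
    return float(P @ np.asarray(lam, dtype=float) @ P)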


VU Pattern Recognition II

Nonmetric Methods

Classification and Regression Trees

Multivariate Decision Trees

[Figure: top, an axis-parallel tree with queries x2 < 0.5, x1 < 0.95, x2 < 0.56, x2 < 0.54 and the resulting regions R1 and R2; bottom, a multivariate tree with general linear queries such as 0.04 x1 + 0.16 x2 < 0.11, 0.27 x1 − 0.44 x2 < −0.02, 0.96 x1 − 1.77 x2 < −0.45, and 5.43 x1 − 13.33 x2 < −6.03, and its oblique decision regions]

FIGURE 8.6. One form of multivariate tree employs general linear decisions at each node, giving splits along arbitrary directions in the feature space. In virtually all interesting cases the training data are not linearly separable, and thus the LMS algorithm is more useful than methods that require the data to be linearly separable, even though the LMS need not yield a minimum in classification error (Chapter 5). The tree at the bottom can be simplified by methods outlined in Section 8.4.2. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.


VU Pattern Recognition II

Nonmetric Methods

Classification and Regression Trees

Missing Attributes

Naive approach: use only non-deficient patterns

Better: use only non–deficient attributes

Works during training, but how do we classify a deficient pattern?

Surrogate splits: find alternative splits using different features having maximal predictive association (correlation)

Virtual values, e.g., mean value of non–deficient feature values
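
The simplest virtual-value scheme, mean imputation per feature, as a sketch (NaN marks a deficient value):

import numpy as np

def impute_means(X):
    """Replace each NaN entry by the mean of the non-deficient values of that feature."""
    X = np.array(X, dtype=float)
    means = np.nanmean(X, axis=0)            # per-feature mean over observed values
    idx = np.where(np.isnan(X))
    X[idx] = means[idx[1]]                   # idx[1] holds the column (feature) index
    return X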


VU Pattern Recognition II

Nonmetric Methods

Classification and Regression Trees

ID3

ID3: the name stems from the third in a series of "interactive dichotomizer" procedures

Nominal features (real-valued features are binned)

Branching factor of a node = number of values (bins) of the attribute queried there

Train until all leaf nodes are pure or no more features are left

Results in tree depth = number of features (each feature is used at most once along a path)

No pruning


VU Pattern Recognition II

Nonmetric Methods

Classification and Regression Trees

C4.5

Refinement of ID3

B > 2 with nominal features, B = 2 with real features

Pruning based on statistical significance of splits

Missing features: sample all subtrees of the missing feature using the training data

Additional rule pruning, can prune any node (see Figure 8.6)


VU Pattern Recognition II

Stochastic Methods

Stochastic Search

Analytical methods problematic in high dimensions or with complex models

Large number of local optima makes gradient descent very costly

Stochastic methods try to localize promising search regions

Pure random search is often not sufficient

Simulated Annealing and Boltzmann Learning motivated by statistical mechanics

Evolutionary Computation motivated by evolutionary principles from biology


VU Pattern Recognition II

Stochastic Methods

Energy Minimization

Example: minimizing (model) energy in a (Hopfield) network

Energy E = −½ ∑_{i,j=1}^N w_ij s_i s_j,   s_i = ±1

Minimize energy of spin–glass model

Probability of energy state, Boltzmann factor:
P(γ) = e^{−E_γ/T} / Z(T)
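
A sketch of the energy and the Boltzmann probability for a small network; the partition function Z(T) is computed here by brute-force enumeration of all 2^N configurations, which is only feasible for small N:

import numpy as np
from itertools import product

def energy(s, W):
    """E = -1/2 * sum_{i,j} w_ij s_i s_j for states s_i = +/-1."""
    s = np.asarray(s, dtype=float)
    return -0.5 * s @ W @ s

def boltzmann_prob(s, W, T):
    """P(gamma) = exp(-E_gamma / T) / Z(T), with Z(T) enumerated exactly."""
    N = len(s)
    Z = sum(np.exp(-energy(np.array(c), W) / T) for c in product([-1, 1], repeat=N))
    return np.exp(-energy(s, W) / T) / Z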


VU Pattern Recognition II

Stochastic Methods

Recurrent Net

[Figure: a recurrent network of 17 binary units s_1 ... s_17 connected by bi-directional weights w_ij; the input and output (visible) units are indexed by α, the hidden units by β]

FIGURE 7.1. The class of optimization problems of Eq. 1 can be viewed in terms of a network of nodes or units, each of which can be in the s_i = +1 or s_i = −1 state. Every pair of nodes i and j is connected by bi-directional weights w_ij; if a weight between two nodes is zero, then no connection is drawn. (Because the networks we shall discuss can have an arbitrary interconnection, there is no notion of layers as in multilayer neural networks.) The optimization problem is to find a configuration (i.e., assignment of all s_i) that minimizes the energy described by Eq. 1. While our convention was to show functions inside each node's circle, our convention in so-called Boltzmann networks is to indicate the state of each node. The configuration of the full network is indexed by an integer γ, and because here there are 17 binary nodes, γ is bounded 0 ≤ γ < 2^17. When such a network is used for pattern recognition, the input and output nodes are said to be visible, and the remaining nodes are said to be hidden. The states of the visible nodes and hidden nodes are indexed by α and β, respectively, and in this case are bounded 0 ≤ α < 2^10 and 0 ≤ β < 2^7. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

VU Pattern Recognition II

Stochastic Methods

Energy Landscape

FIGURE 7.2. The energy function or energy “landscape” on the left is meant to suggest the types of optimization problems addressed by simulated annealing. The method uses randomness, governed by a control parameter or “temperature” T, to avoid getting stuck in local energy minima and thus to find the global minimum, like a small ball rolling in the landscape as it is shaken. The pathological “golf course” landscape at the right is, generally speaking, not amenable to solution via simulated annealing because the region of lowest energy is so small and is surrounded by energetically unfavorable configurations. The configuration spaces of the problems we shall address are discrete and are more accurately displayed in Fig. 7.6. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

VU Pattern Recognition II

Stochastic Methods

Simulated Annealing

Simulated Annealing Basics

Stochastic search for a state of lower energy

Basic idea: occasionally go to higher energy to possibly escape local minima

After a random change of parameter $s_i$: $\Delta E_{ab} = E_b - E_a$

accept $E_b$, if $E_b < E_a$, or

accept $E_b$ with $P = e^{-\Delta E_{ab}/T}$

Annealing Schedule, e.g., $T(k+1) = c\,T(k)$ with $0 < c < 1$, typically $0.8 < c < 0.99$

High initial temperature, large $c$, and large $k_{\max}$ (number of iterations) lead to good results (but also high computational cost)
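The acceptance rule and the geometric annealing schedule above translate directly into a short loop. Below is a minimal sketch, not the lecture's algorithm; the energy function, neighborhood move, and parameter values in the usage example are placeholders.

```python
import math
import random

def simulated_annealing(energy, s0, neighbor, t0=10.0, c=0.9, k_max=1000):
    # Generic simulated annealing with geometric cooling T(k+1) = c * T(k).
    s, e = s0, energy(s0)
    t = t0
    for _ in range(k_max):
        s_new = neighbor(s)              # random change of one parameter
        e_new = energy(s_new)
        delta_e = e_new - e              # Delta E_ab = E_b - E_a
        # Always accept lower energy; accept higher energy with P = exp(-Delta E / T).
        if delta_e < 0 or random.random() < math.exp(-delta_e / t):
            s, e = s_new, e_new
        t *= c                           # annealing schedule
    return s, e

# Hypothetical usage: minimize a simple quadratic "energy" over one real parameter.
best_x, best_e = simulated_annealing(
    energy=lambda x: (x - 3.0) ** 2,
    s0=random.uniform(-10, 10),
    neighbor=lambda x: x + random.gauss(0.0, 0.5),
    t0=10.0, c=0.95, k_max=2000)
print(best_x, best_e)
```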

VU Pattern Recognition II

Stochastic Methods

Simulated Annealing

Simulated Annealing Experiment

FIGURE 7.3. Stochastic simulated annealing (Algorithm 1) uses randomness, governed by a control parameter or “temperature” T(k), to search through a discrete space for a minimum of an energy function. In this example there are N = 6 variables; the 2^6 = 64 configurations are shown along the bottom as columns of + and − symbols. The associated energy of each configuration is given by Eq. 1 for randomly chosen weights. Every transition corresponds to the change of just a single s_i. (The configurations have been arranged so that adjacent ones differ by the state of just a single node; nevertheless, most transitions corresponding to a single node appear far apart in this ordering.) Because the system energy is invariant with respect to a global interchange s_i ↔ −s_i, there are two “global” minima. The graph at the upper left shows the annealing schedule: the decreasing temperature versus iteration number k. The middle portion shows the configuration versus iteration number generated by Algorithm 1. The trajectory through the configuration space is colored red for transitions that increase the energy and black for those that decrease the energy. Such energetically unfavorable (red) transitions become rarer later in the anneal. The graph at the right shows the full energy E(k), which decreases to the global minimum. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

VU Pattern Recognition II

Stochastic Methods

Simulated Annealing

Empirical Energy States

FIGURE 7.4. An estimate of the probability P(γ) of being in a configuration denoted by γ is shown for four temperatures during a slow anneal. (These estimates, based on a large number of runs, are nearly the theoretical values e^{-E_γ/T}.) Early, at high T, each configuration is roughly equal in probability, while late, at low T, the probability is strongly concentrated at the global minima. The expected value of the energy, E[E] (i.e., averaged at temperature T), decreases gradually during the anneal. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
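To see the behavior described in the caption numerically, one can evaluate the exact Boltzmann distribution of a small toy model at a few temperatures. This sketch reuses the same kind of random toy weights as the earlier energy example; all values are illustrative, not from the lecture.

```python
import itertools
import numpy as np

# Toy spin-glass model (same construction as the earlier energy sketch).
rng = np.random.default_rng(0)
N = 6
w = rng.normal(size=(N, N)); w = (w + w.T) / 2.0; np.fill_diagonal(w, 0.0)

configs = np.array(list(itertools.product([-1, 1], repeat=N)))
energies = np.array([-0.5 * s @ w @ s for s in configs])

for T in (5.0, 1.0, 0.2):
    p = np.exp(-energies / T)
    p /= p.sum()                     # P(gamma) = exp(-E_gamma/T) / Z(T)
    print(f"T={T:4.1f}  E[E]={p @ energies:6.2f}  max P(gamma)={p.max():.3f}")
# At high T the distribution is nearly uniform; at low T the probability mass
# concentrates on the two symmetric global minima and E[E] approaches the minimum.
```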

VU Pattern Recognition II

Projects

Project Teams

2 students form a team

Implementation of a pattern classification method

Use existing (free) software (e.g., WEKA)

Data import, pre–processing, results (graphics, tables)

Methods must be understood (including their parameters)!

Project report (February 10, 2014)

VU Pattern Recognition II

Projects

Project Topics

k-NN Classifier (different metrics): Kauba, Mayer

Artificial Neural Networks (Boone): Reissig, DiStolfo

Support Vector Machine: Linortner, N.N.

Decision Tree (C4.5)

Simulated Annealing (meta)

Genetic Algorithm (meta, JEvolution): Auracher, Herzog, Kirchgasser

Genetic Programming (optional, JEvolution)

VU Pattern Recognition II

Projects

Project Data Sets

Data Sets

UCI Machine Learning Archive: http://www.ics.uci.edu/~mlearn/MLRepository.html

Ionosphere: Radar Signals
Semeion Handwritten Digit: Digit Recognition
Wine Quality: Wine Critic

Leave–one–out validation (common partitioning)

Confusion matrix
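For teams that prefer Python over WEKA, a leave-one-out run with a confusion matrix might look like the following sketch; scikit-learn is used purely for illustration, and the local file name and column layout of the Ionosphere data are assumptions.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

# Assumed local copy of the UCI Ionosphere data: 34 numeric features + class label.
data = np.genfromtxt("ionosphere.data", delimiter=",", dtype=str)
X = data[:, :-1].astype(float)
y = data[:, -1]                          # 'g' (good) / 'b' (bad) radar returns

clf = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
y_pred = cross_val_predict(clf, X, y, cv=LeaveOneOut())   # one test sample per fold
print(confusion_matrix(y, y_pred, labels=["g", "b"]))
```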
