
Document Analysis: Fundamentals of Pattern Recognition

Prof. Rolf Ingold, University of Fribourg

Master course, spring semester 2008


Outline

- Introduction
- Feature extraction and decision
- Role of training
- Feature selection
- Example: Font recognition
- Bayesian decision theory
- Evaluation


Goals of Pattern Recognition

Pattern recognition aims at discovering and identifying patterns in raw data:
- it consists of assigning symbols to data (patterns)
- it is based on a priori knowledge, often statistical information

Pattern recognition is used for computer perception (image/sound analysis):
- in a preliminary step, a sensor captures raw information
- this information is then interpreted to make decisions

Pattern recognition can be thought of as a methodical way of reducing the information in order to keep only the relevant meaning.


Pattern Recognition Applications

Pattern recognition is involved in many applications:
- seismological survey
- speech recognition
- scientific imagery (biology, health care, physics, ...)
- satellite-based observation (military and civil applications, ...)
- document analysis, with several components:
  - optical character recognition (OCR)
  - font identification
  - handwriting recognition (off-line)
  - graphics recognition
- computer vision (3D scene analysis)
- biometrics: person identification and authentication
- ...

Pattern recognition methodologies rely on other scientific domains: statistics, operations research, graph theory, artificial intelligence, ...


Origin of Difficulties

Pattern recognition is mainly an information overload problem. The difficulty arises from:
- variability of objects belonging to the same class
- distortion of captured data (noise, degradations, ...)


Steps Involved in Pattern Recognition

Pattern recognition is basically a two-stage process (a minimal pipeline sketch follows below):
- Feature extraction, aiming at removing redundancy while keeping significant information
- Classification, consisting in making a decision by associating a class label

[Diagram: an observation is reduced to a feature vector, e.g. (123.0, 789.6, 345.12), which is then mapped to a class]
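As an illustration, here is a minimal sketch of such a two-stage pipeline; the feature choices and the minimum-distance classifier are assumptions for the example, not the course's method.

```python
import numpy as np

def extract_features(image: np.ndarray) -> np.ndarray:
    """Stage 1: reduce a binary (0/1) character image to a small feature vector."""
    height, width = image.shape
    weight = image.sum()                 # number of black pixels
    density = weight / image.size        # black-pixel density
    aspect = width / height              # bounding-box aspect ratio
    return np.array([density, aspect, float(weight)])

def classify(features: np.ndarray, models: dict) -> str:
    """Stage 2: assign the class whose prototype is nearest (minimum-distance rule)."""
    return min(models, key=lambda label: np.linalg.norm(features - models[label]))
```

Here `models` maps each class label to a prototype feature vector, which is where training (next slides) comes in.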


Role of Training

[Diagram: extraction produces features, a decision stage assigns classes, and training builds the models used by the decision stage]

Classifiers (tools that perform classification tasks) are generally designed to be trained:
- each class is characterized by a model
- models are built from representative training data


Supervised vs. Unsupervised Training

Two different situations may occur regarding training material:

Supervised training is performed when the training samples are labeled with the class they belong to; each class is associated with a set of training samples

$$T_i = \{x_{i1}, x_{i2}, \ldots, x_{iN_i}\}$$

supposed to be statistically representative of the class.

Unsupervised training is performed when the training samples are statistically representative but mixed over all classes:

$$T = \{x_1, x_2, \ldots, x_n\}$$
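To make the distinction concrete, a small illustrative sketch (all numbers invented): supervised material keeps one sample set per class, while unsupervised material pools unlabeled samples.

```python
# Supervised: one training set T_i per class omega_i.
supervised = {
    "omega_1": [1.2, 1.4, 1.1],   # T_1 = {x_11, x_12, x_13}
    "omega_2": [3.8, 4.1, 3.9],   # T_2 = {x_21, x_22, x_23}
}

# Unsupervised: a single pooled set T with no class labels.
unsupervised = [1.2, 3.8, 1.4, 4.1, 1.1, 3.9]

# With labels, a per-class model (here simply the mean) is easy to estimate:
models = {label: sum(xs) / len(xs) for label, xs in supervised.items()}
```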


Feature Selection

Features are selected according to the application.

Features should be chosen carefully by considering:
- discrimination power between classes
- robustness to intra-class distortions and noise
- global statistical independence (spread over the entire feature space)
- fast computation
- reasonable dimension (number of features)


Features for Character Recognition

Given a binary image of a character, many features can be used for character recognition (a sketch of some of them follows below):
- size, i.e., width and height of the bounding box
- position of the baseline (if available)
- weight (number of black pixels)
- perimeter (length of the contours)
- center of gravity
- moments (second and third order in both directions)
- distributions of horizontal and vertical runs
- number of intersections with a (possibly random) set of lines
- length and structure (singular points, holes) of the skeleton
- local features computed on sub-images
- ...
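As an illustration, a minimal sketch computing a few of these features with NumPy; it assumes the character is a 0/1 array with 1 for black pixels.

```python
import numpy as np

def character_features(img: np.ndarray) -> dict:
    """Compute bounding-box size, weight, center of gravity, and second moments."""
    rows, cols = np.nonzero(img)                 # coordinates of black pixels
    width = cols.max() - cols.min() + 1          # bounding-box width
    height = rows.max() - rows.min() + 1         # bounding-box height
    weight = int(img.sum())                      # number of black pixels
    cy, cx = rows.mean(), cols.mean()            # center of gravity
    m2y = ((rows - cy) ** 2).mean()              # second-order moments
    m2x = ((cols - cx) ** 2).mean()
    return {"size": (width, height), "weight": weight,
            "center": (cx, cy), "moments2": (m2x, m2y)}
```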


Font Recognition: Goal

Goal: recognize fonts of synthetically generated isolated words, given as binary (black & white) or grey-level images at 300 dpi.

12 standard font classes are considered:
- 3 families: Arial, Courier New, Times New Roman
- 4 styles: Plain, Italic, Bold, Bold Italic
- a single size: 12 pt


Font Recognition: Extracted Features

Words are segmented with a surrounding white border of 1 pixel

Some preprocessing steps are used:
- horizontal projection profile (hp)
- derivative of the horizontal projection profile (hpd)

The following features are calculated (a computation sketch follows below):
- hp-mean (or density): mean of hp
- hpd-stdev (or slanting): standard deviation of hpd
- hr-mean: mean of horizontal runs (up to length 12)
- hr-stdev: standard deviation of horizontal runs (up to length 12)
- vr-mean: mean of vertical runs (up to length 12)
- vr-stdev: standard deviation of vertical runs (up to length 12)
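A minimal sketch of these computations, assuming the word is a 0/1 NumPy array (1 = black) already padded with the 1-pixel white border; the run clipping at length 12 follows the slide.

```python
import numpy as np

def run_lengths(img: np.ndarray, axis: int, max_len: int = 12) -> np.ndarray:
    """Lengths of consecutive black-pixel runs along rows (axis=1) or columns (axis=0)."""
    runs = []
    for line in (img if axis == 1 else img.T):
        length = 0
        for px in line:
            if px:
                length += 1            # extend the current black run
            elif length:
                runs.append(min(length, max_len))
                length = 0             # a white pixel ends the run
        if length:
            runs.append(min(length, max_len))
    return np.array(runs)

def font_features(img: np.ndarray) -> dict:
    hp = img.sum(axis=1)               # horizontal projection profile
    hpd = np.diff(hp)                  # its derivative
    hr = run_lengths(img, axis=1)      # horizontal runs
    vr = run_lengths(img, axis=0)      # vertical runs
    return {"hp-mean": hp.mean(), "hpd-stdev": hpd.std(),
            "hr-mean": hr.mean(), "hr-stdev": hr.std(),
            "vr-mean": vr.mean(), "vr-stdev": vr.std()}
```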


Font Recognition: Illustration of Features

Basic image processing features used are:
- horizontal projection profile
- distribution of horizontal runs (from 1 to 11)
- distribution of vertical runs (from 1 to 11)


Font Recognition: Decision Boundaries on a Single Feature (1)

Some single features are highly discriminant for some font sets

hpd-stdev discriminates ■ roman and ■ italic fonts; hr-mean discriminates ■ normal and ■ bold fonts.

[Figure: two single-feature distributions (feature values roughly 10-60), one per pair of font sets above, showing the decision boundaries]


Font Recognition: Decision Boundaries on a Single Feature (2)

Other features may partly discriminate font sets

hr-mean can partly discriminate ■ Arial, ■ Courier and ■ Times

[Figure: hr-mean distribution (values roughly 10-40) for the three font families]


Font Recognition: Decision Boundaries on Multiple Features (1)

By combining two features, font discrimination is improved

The feature pair (hpd-stdev, vr-stdev) discriminates ■ roman and ■ italic fonts.

[Figure: scatter plot of hpd-stdev (horizontal axis, about 2.5-15) vs. vr-stdev (vertical axis, about 0.5-3), separating roman from italic fonts]


Font Recognition: Decision Boundaries on Multiple Features (2)

Font family discrimination (■ Arial, ■ Courier and ■ Times) becomes possible by combining several couples of features.

[Figure: three scatter plots over couples of the features hp-mean, hr-mean, vr-mean, and vr-stdev, showing the separation of the three font families]


Bayesian Decision Theory

Bayesian decision makes the assumption that all information contributing to the decision can be stated in the form of probabilities:

- $P(\omega_i)$: the a priori probability (or prior) of each class
- $p(x|\omega_i)$: the class-conditional density function of the feature vector $x$, also called the likelihood of the class $\omega_i$ with respect to $x$

The goal is to determine the class $\omega_i$ for which the a posteriori probability (or posterior) $P(\omega_i|x)$ is the highest.


Bayesian Rule

The Bayes rule allows the a posteriori probability of each class to be calculated as a function of priors and likelihoods:

$$P(\omega_i|x) = \frac{p(x|\omega_i)\,P(\omega_i)}{p(x)}$$

where

$$p(x) = \sum_j p(x|\omega_j)\,P(\omega_j)$$

is called the evidence and can be considered as a normalization factor, i.e.,

$$\sum_i P(\omega_i|x) = \sum_i \frac{p(x|\omega_i)\,P(\omega_i)}{p(x)} = \frac{\sum_i p(x|\omega_i)\,P(\omega_i)}{\sum_j p(x|\omega_j)\,P(\omega_j)} = 1$$
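As a toy numeric illustration of the rule (the probability values are invented for the example):

```python
# Two classes, priors P(omega_i) and likelihoods p(x | omega_i) at one observation x.
priors = {"omega_1": 0.1, "omega_2": 0.9}
likelihoods = {"omega_1": 0.6, "omega_2": 0.2}

evidence = sum(likelihoods[w] * priors[w] for w in priors)             # p(x) = 0.24
posteriors = {w: likelihoods[w] * priors[w] / evidence for w in priors}
# posteriors == {'omega_1': 0.25, 'omega_2': 0.75}; they sum to 1, as expected
```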


Influence of Posterior Probabilities

Example with a single feature: posterior probabilities in two different cases regarding a priori probabilities.

[Figure: two columns of plots, for $P(\omega_1)=0.5$, $P(\omega_2)=0.5$ (left) and $P(\omega_1)=0.1$, $P(\omega_2)=0.9$ (right); the top row shows the prior-weighted likelihoods $p(x|\omega_i)P(\omega_i)$ and the bottom row the resulting posteriors $P(\omega_i|x)$]


Probability of Error

Given a feature $x$ of a given sample, the probability of error for a decision $\delta(x) = \omega_i$ is equal to

$$P(\mathrm{error}|x) = \sum_{j \neq i} P(\omega_j|x) = 1 - P(\omega_i|x)$$

The overall probability of error is given by

$$P(\mathrm{error}) = \int P(\mathrm{error}, x)\,dx = \int P(\mathrm{error}|x)\,p(x)\,dx$$


Optimal Decision Boundaries

The minimal error is obtained by the decision $\delta(x) = \omega_i$ with

$$P(\omega_i|x) \geq P(\omega_j|x) \quad \forall j$$


Decision Theory

In the simplest case, a decision consists in assigning to an observation $x$ a class label $\omega_i = \delta(x)$.

A natural extension consists in adding a "rejection class" $\omega_R$, so that $\delta(x) = \omega_R$ is also a possible outcome.

In the most general case, the decision results in an action $\alpha_i = \alpha(x)$.
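One common way to realize the rejection class (a sketch, not prescribed by the slides) is to reject whenever even the best posterior falls below a confidence threshold:

```python
def decide(posteriors: dict, reject_threshold: float = 0.8) -> str:
    """Return the most probable class, or the rejection class omega_R
    when the best posterior is too low."""
    best = max(posteriors, key=posteriors.get)
    return best if posteriors[best] >= reject_threshold else "omega_R"

decide({"omega_1": 0.55, "omega_2": 0.45})   # -> 'omega_R' (too uncertain)
decide({"omega_1": 0.95, "omega_2": 0.05})   # -> 'omega_1'
```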


Optimal Decision Theory

Let us consider a loss function $\lambda_{ij} = \lambda(\alpha_i|\omega_j)$ defining the loss incurred by taking action $\alpha_i$ when the true state of nature is $\omega_j$; usually

$$\lambda_{ii} = 0, \qquad \lambda_{ij} \geq 0 \;\;(i \neq j)$$

The risk of taking an action $\alpha_i$ for a particular sample $x$ is

$$R(\alpha_i|x) = \sum_j \lambda(\alpha_i|\omega_j)\,P(\omega_j|x)$$

The optimal decision consists in choosing the $\alpha_i$ that minimizes the risk:

$$R(\alpha_i|x) \leq R(\alpha_j|x) \quad \forall j \neq i$$


Optimal Decision

When $\lambda_{ii} = 0$ and $\lambda_{ij} = 1$ for all $j \neq i$, the optimal decision consists of minimizing the probability of error.

The minimal error is obtained by the decision $\delta(x) = \omega_i$ with

$$P(\omega_i|x) \geq P(\omega_j|x) \quad \forall j$$

or equivalently

$$p(x|\omega_i)\,P(\omega_i) \geq p(x|\omega_j)\,P(\omega_j) \quad \forall j$$

In the case when all a priori probabilities are equal, this reduces to

$$p(x|\omega_i) \geq p(x|\omega_j) \quad \forall j$$


Minimum Risk for Two Classes

Let $\lambda_{ij} = \lambda(\alpha_i|\omega_j)$ be the loss of action $\alpha_i$ when the true state is $\omega_j$.

The conditional risk of each decision is expressed as

$$R(\alpha_1|x) = \lambda_{11}\,P(\omega_1|x) + \lambda_{12}\,P(\omega_2|x)$$
$$R(\alpha_2|x) = \lambda_{21}\,P(\omega_1|x) + \lambda_{22}\,P(\omega_2|x)$$

Then, the optimal decision rule becomes: decide $\delta_1$ if

$$(\lambda_{21} - \lambda_{11})\,P(\omega_1|x) > (\lambda_{12} - \lambda_{22})\,P(\omega_2|x)$$

else decide $\delta_2$; or equivalently, decide $\delta_1$ if

$$\frac{p(x|\omega_1)}{p(x|\omega_2)} > \frac{\lambda_{12} - \lambda_{22}}{\lambda_{21} - \lambda_{11}} \cdot \frac{P(\omega_2)}{P(\omega_1)}$$

else decide $\delta_2$. In the case of $\lambda_{11} = \lambda_{22} = 0$, this simplifies to: decide $\delta_1$ if

$$\frac{p(x|\omega_1)}{p(x|\omega_2)} > \frac{\lambda_{12}}{\lambda_{21}} \cdot \frac{P(\omega_2)}{P(\omega_1)}$$

or, in terms of posteriors, if $\lambda_{21}\,P(\omega_1|x) > \lambda_{12}\,P(\omega_2|x)$, else decide $\delta_2$.
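As an illustration, a sketch of this two-class minimum-risk rule as a likelihood-ratio test; the Gaussian class-conditional densities and all numeric values are assumptions for the example.

```python
from math import exp, pi, sqrt

def gaussian(x: float, mu: float, sigma: float) -> float:
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))

def decide(x: float, p1=0.5, p2=0.5, l12=1.0, l21=1.0, l11=0.0, l22=0.0) -> str:
    """Decide delta_1 iff the likelihood ratio exceeds the loss-weighted prior ratio."""
    ratio = gaussian(x, mu=2.0, sigma=1.0) / gaussian(x, mu=5.0, sigma=1.0)
    threshold = ((l12 - l22) / (l21 - l11)) * (p2 / p1)
    return "delta_1" if ratio > threshold else "delta_2"

decide(4.0)             # -> 'delta_2' (x is closer to the omega_2 mean)
decide(4.0, l21=10.0)   # -> 'delta_1' (misclassifying omega_1 is now costly)
```

Raising $\lambda_{21}$ lowers the threshold, shifting the decision boundary toward the class whose misclassification is cheaper.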


Discriminant Functions

In the case of multiple classes, a pattern classifier can be specified by a set of discriminant functions $g_i(\mathbf{x})$ such that the decision $\omega_i$ corresponds to

$$g_i(\mathbf{x}) > g_j(\mathbf{x}) \quad \forall j \neq i$$

Thus, a Bayesian classifier is naturally represented by

$$g_i(\mathbf{x}) = -R(\alpha_i|\mathbf{x})$$

The choice of discriminant functions is not unique: $g_i(\mathbf{x})$ can be replaced by $f(g_i(\mathbf{x}))$ for any monotonically increasing function $f$. A minimum error-rate classifier can be obtained with

$$g_i(\mathbf{x}) = p(\mathbf{x}|\omega_i)\,P(\omega_i)$$

or, applying the logarithm,

$$g_i(\mathbf{x}) = \ln p(\mathbf{x}|\omega_i) + \ln P(\omega_i)$$
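A minimal sketch of such log-posterior discriminant functions, assuming (for the example only) Gaussian class-conditional densities in two dimensions:

```python
import numpy as np

def log_gaussian(x: np.ndarray, mean: np.ndarray, cov: np.ndarray) -> float:
    """ln p(x | omega_i) for a multivariate Gaussian density."""
    diff = x - mean
    return -0.5 * (diff @ np.linalg.solve(cov, diff)
                   + np.log(np.linalg.det(cov))
                   + len(mean) * np.log(2 * np.pi))

# Hypothetical class models: (mean, covariance, prior) per class.
classes = {
    "omega_1": (np.array([0.0, 0.0]), np.eye(2), 0.3),
    "omega_2": (np.array([2.0, 2.0]), np.eye(2), 0.7),
}

def g(x: np.ndarray, mean, cov, prior) -> float:
    # g_i(x) = ln p(x | omega_i) + ln P(omega_i)
    return log_gaussian(x, mean, cov) + np.log(prior)

def decide(x: np.ndarray) -> str:
    return max(classes, key=lambda w: g(x, *classes[w]))
```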


Bayesian Rule in Higher Dimensions

The Bayesian rule can easily be generalized to the multidimensional case, where features are represented by a vector $\mathbf{x}$:

$$P(\omega_i|\mathbf{x}) = \frac{p(\mathbf{x}|\omega_i)\,P(\omega_i)}{p(\mathbf{x})}$$

where

$$p(\mathbf{x}) = \sum_i p(\mathbf{x}|\omega_i)\,P(\omega_i)$$


Conclusion about Bayesian Decision

Bayesian decision theory provides a theoretical framework for statistical pattern recognition

This theory supposes the following probabilistic information to be known:
- the number of classes
- the a priori probability of each class
- the class-conditional feature distribution of each class

The remaining problem is how to estimate all of this information:
- feature distributions are hard to estimate
- priors are seldom known
- even the number of classes is not always given


Performance Evaluation

Performance evaluation is a very important issue in pattern recognition:
- it gives an objective measure of the performance
- it allows different methods to be compared

Performance evaluation requires correctly labeled test data:
- test data should be different from training data
- one strategy consists in cyclically using 80% of the data for training and the remaining 20% for evaluation (a sketch follows below)
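A minimal sketch of this cyclic 80/20 strategy (5-fold cross-validation); `train` and `evaluate` are hypothetical stand-ins for a trainer and a scorer.

```python
def cross_validate(samples, labels, train, evaluate, k=5):
    """Cyclically hold out 1/k of the data for testing and train on the rest."""
    n = len(samples)
    data = list(zip(samples, labels))
    scores = []
    for fold in range(k):
        lo, hi = fold * n // k, (fold + 1) * n // k
        test_set = data[lo:hi]                   # the held-out 20%
        train_set = data[:lo] + data[hi:]        # the remaining 80%
        model = train(train_set)
        scores.append(evaluate(model, test_set))
    return sum(scores) / k                       # average score over the k folds
```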


Performance Measures: Recognition / Error Rates

Performance evaluation uses several measures:
- recognition rate: number of correct answers / total number of answers
- error rate: number of incorrect answers / total number of answers
- rejection rate: number of rejections / total number of answers

These satisfy: recognition rate = 1 - (rejection rate + error rate).
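A tiny worked example with invented counts, checking the identity above:

```python
correct, errors, rejections = 940, 35, 25        # illustrative counts
total = correct + errors + rejections            # 1000 answers in total
recognition_rate = correct / total               # 0.940
error_rate = errors / total                      # 0.035
rejection_rate = rejections / total              # 0.025
assert abs(recognition_rate - (1 - (rejection_rate + error_rate))) < 1e-12
```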


Performance Measures: Recall & Precision

For binary decisions (a sample belongs to the class or not), two other measures are frequently used:
- recall: the ratio of correctly assigned samples to the size of the class
- precision: the ratio of correctly assigned samples to the number of assigned samples

Recall and precision change in opposite directions; the equal error rate is sometimes considered to be the best trade-off.

Additionally, the harmonic mean of precision and recall, called the F-measure, is frequently used:

$$F\text{-measure} = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$
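A small worked example with invented counts (tp = correctly assigned, fp = wrongly assigned, fn = missed members of the class):

```python
tp, fp, fn = 80, 20, 40
precision = tp / (tp + fp)                                  # 80/100 = 0.800
recall = tp / (tp + fn)                                     # 80/120 ≈ 0.667
f_measure = 2 * precision * recall / (precision + recall)   # ≈ 0.727
```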