Pattern Recognition: Linear Classifiers
Zaheer Ahmad, PhD Scholar
[email protected]
Department of Computer Science, University of Peshawar
Agenda
• Pattern Recognition
  – Features and Patterns
  – Classifiers
  – Approaches
  – Design Cycle
• Linear Classification
  – Linear Discriminant Functions
  – Linear Separability
  – Fisher Discriminant Functions
  – Support Vector Machines (SVMs)
What is pattern recognition?
• “The assignment of a physical object or event to one of several pre-specified categories” –Duda and Hart
• “The science that concerns the description or classification (recognition) of measurements” –Schalkoff
• “The process of giving names to observations x” –Schürmann
• Pattern Recognition is concerned with answering the question “What is this?” –Morse
Applications of PR
• Image processing
• Computer vision
• Speech recognition
• Data mining
• Automated target recognition
• Optical character recognition
• Seismic analysis
• Man and machine diagnostics
• Fingerprint identification
• Industrial inspection
• Financial forecasting
• Medical diagnosis
• ECG signal analysis
Terminology
• Recognition: during recognition (or classification), given objects are assigned to prescribed classes.
• Classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known.
• An algorithm that implements classification, especially in a concrete implementation, is known as a classifier.
• A classifier is a machine that performs classification.
Features
• A feature is any distinctive aspect, quality or characteristic of an object
• Features may be symbolic (e.g., color) or numeric (e.g., height)
• The combination of d features is a d-dimensional column vector called a feature vector
• The d-dimensional space defined by the feature vector is called the feature space
  – Objects are represented as points in feature space; the result is a scatter plot
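To make the feature-vector idea concrete, here is a minimal sketch (not from the slides), assuming NumPy and Matplotlib; the height/weight values are made up for illustration:

# Each row is one object; each column is one feature (d = 2 here).
import numpy as np
import matplotlib.pyplot as plt

class_a = np.array([[150.0, 50.0], [155.0, 55.0], [160.0, 54.0]])  # made-up (height, weight)
class_b = np.array([[175.0, 80.0], [180.0, 85.0], [178.0, 90.0]])

# Plot the objects as points in the 2-D feature space (a scatter plot).
plt.scatter(class_a[:, 0], class_a[:, 1], label="class A")
plt.scatter(class_b[:, 0], class_b[:, 1], label="class B")
plt.xlabel("feature 1 (height)")
plt.ylabel("feature 2 (weight)")
plt.legend()
plt.show()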
Features
• What is a “good” feature vector?
  – The quality of a feature vector is related to its ability to discriminate examples from different classes
  – Examples from the same class should have similar feature values
  – Examples from different classes should have different feature values
More feature properties
Pattern and Pattern Class
• A pattern is an object, process or event that can be given a name.
• A pattern is a composite of traits or features characteristic of an individual.
• In classification tasks, a pattern is a pair of variables {x, ω} where
  – x is a collection of observations or features (the feature vector)
  – ω is the concept behind the observation (the label or category)
• A pattern class (or category) is a set of patterns sharing common attributes, usually originating from the same source.
• A class (pattern class) is a set of objects having some important properties in common.
Decision Boundary/Surface
• A line or curve separating the classes is a decision boundary
• The equation g(x) = 0 defines the decision surface that separates points assigned to category ω1 from points assigned to category ω2
• When g(x) is linear, the decision surface is a hyperplane
• If x1 and x2 are both on the hyperplane, then g(x1) = g(x2) = 0; for linear g this means wᵀ(x1 − x2) = 0, so the weight vector w is normal to any vector lying in the hyperplane
Decision Boundary
Slope-intercept form of a line (straight line): the equation of a line with slope m can also be written as y = mx + b, where b is the y-intercept.
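A minimal sketch (not from the slides) connecting the two forms: in 2-D, the linear boundary w1·x1 + w2·x2 + w0 = 0 can be rewritten in slope-intercept form; the weights below are made up for illustration.

# Rewrite w1*x1 + w2*x2 + w0 = 0 as x2 = m*x1 + b,
# with m = -w1/w2 and b = -w0/w2 (assuming w2 != 0).
w1, w2, w0 = 1.0, -2.0, 4.0      # hypothetical boundary: x1 - 2*x2 + 4 = 0
m = -w1 / w2                     # slope: 0.5
b = -w0 / w2                     # intercept: 2.0
print(f"x2 = {m} * x1 + {b}")    # prints: x2 = 0.5 * x1 + 2.0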
Classifiers
• The task of a classifier is to partition feature space into class-labeled decision regions
• Borders between decision regions are called decision boundaries
• The classification of a feature vector x consists of determining which decision region it belongs to, and assigning x to the corresponding class
Pattern recognition approaches
Statistical
• Patterns are classified based on an underlying statistical model of the features
  – The statistical model is defined by a family of class-conditional probability density functions p(x|ω), the probability of feature vector x given class ω (a small example follows this slide)
Neural
• Classification is based on the response of a network of processing units (neurons) to an input stimulus (pattern)
  – “Knowledge” is stored in the connectivity and strength of the synaptic weights
  – Trainable, non-algorithmic, black-box strategy
• Very attractive since
  – it requires minimum a priori knowledge
  – with enough layers and neurons, ANNs can create any complex decision region
Syntactic
• Patterns are classified based on measures of structural similarity
• “Knowledge” is represented by means of formal grammars or relational descriptions (graphs)
• Used not only for classification, but also for description
• Typically, syntactic approaches formulate hierarchical descriptions of complex patterns built up from simpler subpatterns
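As an illustration of the statistical approach, a minimal sketch (not from the slides), assuming SciPy; the 1-D Gaussian class-conditional densities are made up:

# Hypothetical class-conditional densities p(x|ω1) and p(x|ω2).
from scipy.stats import norm

p_x_given_w1 = norm(loc=0.0, scale=1.0)
p_x_given_w2 = norm(loc=3.0, scale=1.0)

# With equal priors, assign x to the class with the larger density.
x = 1.2
label = "ω1" if p_x_given_w1.pdf(x) > p_x_given_w2.pdf(x) else "ω2"
print(label)   # ω1: x lies closer to the ω1 mean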
The pattern recognition design cycle
Data collection • Probably the most time-intensive component of a PR project • How many examples are enough? Feature choice • Critical to the success of the PR problem
– “Garbage in, garbage out”
• Requires basic prior knowledge Model choice • Statistical, neural and structural approaches • Parameter settings
Training • Given a feature set and a “blank” model,
adapt the model to explain the data • Supervised, unsupervised and reinforcement
learning Evaluation • How well does the trained model do? • Overfitting vs. generalization
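A minimal sketch (not from the slides) of the training and evaluation steps, assuming scikit-learn and a synthetic dataset:

# Fit a model on a training split, then check generalization
# on a held-out test split.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)      # synthetic labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = LogisticRegression().fit(X_train, y_train)

# A large gap between these two scores would suggest overfitting.
print("train accuracy:", model.score(X_train, y_train))
print("test accuracy: ", model.score(X_test, y_test))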
Linear Classification
• Classification in which the decision boundary in the feature (input) space is linear
• In linear classification the input space is split by hyperplanes into regions, each with an assigned class
Linearly Separable
• If a hyperplanar decision boundary exists that correctly classifies all the training samples for a c = 2 class problem, the samples are said to be linearly separable.
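A rough, heuristic check (not from the slides) that leans on the perceptron convergence theorem: if the samples are linearly separable, a perceptron driven to convergence reaches perfect training accuracy. Sketch assumes scikit-learn and made-up data:

import numpy as np
from sklearn.linear_model import Perceptron

X = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.5], [4.0, 4.0]])
y = np.array([0, 0, 1, 1])                 # two classes (c = 2)

clf = Perceptron(max_iter=10_000, tol=None).fit(X, y)
print("training accuracy:", clf.score(X, y))   # 1.0 here, so separable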
Linear Discriminant Function
• A discriminant function that is a linear combination of the components of x is called a linear discriminant function and can be written as
  g(x) = wᵀx + w0
  where w is the weight vector and w0 is the bias (or threshold weight).
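A minimal sketch (not from the slides) of classifying with such a function; the weight vector and bias below are made-up values:

import numpy as np

w = np.array([1.0, -1.0])   # hypothetical weight vector
w0 = 0.5                    # hypothetical bias (threshold weight)

def classify(x):
    """Assign class ω1 if g(x) = w.x + w0 > 0, else ω2."""
    g = np.dot(w, x) + w0
    return "ω1" if g > 0 else "ω2"

print(classify(np.array([2.0, 1.0])))   # g = 1.5 > 0, so ω1
print(classify(np.array([1.0, 3.0])))   # g = -1.5 < 0, so ω2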
Linear Classifiers
• A linear classifier is a mapping which partitions feature space using a linear function (a straight line, or a hyperplane)
• It is one of the simplest classifiers we can imagine
  – “separate the two classes using a straight line in feature space”
• In 2 dimensions the decision boundary is a straight line
2-Class Data with a Linear Decision Boundary
[Figure: two-class data in a two-dimensional feature space (Feature 1 vs. Feature 2), with a straight-line decision boundary separating Decision Region 1 from Decision Region 2.]
Data that is Not “Linearly Separable”
[Figure: two-class data in a two-dimensional feature space (Feature 1 vs. Feature 2) that no straight-line decision boundary can separate perfectly; Decision Regions 1 and 2 and a linear decision boundary are shown.]
Fisher’s linear discriminant
• A simple linear discriminant function is a projection of the data down to 1-D
  – So choose the projection that gives the best separation of the classes
• An obvious direction to choose is the direction of the line joining the class means
  – But if the main direction of variance in each class is not orthogonal to this line, this will not give good separation (see the next figure)
• Fisher’s method chooses the direction that maximizes the ratio of between-class variance to within-class variance
  – This is the direction in which the projected points contain the most information about class membership (under Gaussian assumptions)
• Classes well separated in D-dimensional space may strongly overlap in 1-D
  – Adjust the components of the weight vector w
  – Select the projection to maximize class separation
• Can be generalized to multiple classes
A picture showing the advantage of Fisher’s linear discriminant:
• When projected onto the line joining the class means, the classes are not well separated.
• Fisher chooses a direction that makes the projected classes much tighter, even though their projected means are less far apart.
Math of Fisher’s linear discriminants
• What linear transformation is best for discrimination? Project each sample onto a line:
  y = wᵀx
• The projection onto the vector separating the class means seems sensible:
  w ∝ (m2 − m1)
• But we also want small variance within each class (here m1, m2 and yn denote the projected means and samples):
  s1² = Σn∈C1 (yn − m1)²,  s2² = Σn∈C2 (yn − m2)²
• Fisher’s objective function is the between-class separation over the within-class scatter:
  J(w) = (m2 − m1)² / (s1² + s2²)
More math of Fisher’s linear discriminants
• In terms of the weight vector w and scatter matrices, the objective is
  J(w) = (wᵀ SB w) / (wᵀ SW w)
• Between-class scatter:
  SB = (m2 − m1)(m2 − m1)ᵀ
• Within-class scatter:
  SW = Σn∈C1 (xn − m1)(xn − m1)ᵀ + Σn∈C2 (xn − m2)(xn − m2)ᵀ
• Optimal solution:
  w ∝ SW⁻¹ (m2 − m1)
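A minimal NumPy sketch (not from the slides) of this result on made-up two-class data:

# Fisher's direction for two classes: w ∝ SW^{-1} (m2 - m1).
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal(loc=[0.0, 0.0], size=(50, 2))   # made-up class 1 samples
X2 = rng.normal(loc=[2.0, 1.0], size=(50, 2))   # made-up class 2 samples

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)

# Within-class scatter: sum of outer products of centered samples.
SW = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)

w = np.linalg.solve(SW, m2 - m1)     # direction maximizing J(w)
w /= np.linalg.norm(w)

# Project onto w: the 1-D projections y = w.x should separate the classes.
y1, y2 = X1 @ w, X2 @ w
print("projected class means:", y1.mean(), y2.mean())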
Support Vector Machines (SVMs)
• A support vector machine (SVM) is a concept in statistics and computer science for a set of related supervised learning methods that analyze data and recognize patterns, used for classification and regression analysis.
• An SVM constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification, regression, or other tasks.
• A good separation is achieved by the hyperplane that has the largest distance to the nearest training data point of any class.
• The larger the margin, the lower the generalization error of the classifier.
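A minimal sketch (not from the slides) of fitting a linear SVM with scikit-learn on toy data; a large C approximates the hard-margin (linearly separable) case:

import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0], [4.0, 4.0]])
y = np.array([-1, -1, 1, 1])                  # class labels yi = ±1

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C ≈ hard margin

print("w:", clf.coef_[0])                     # normal vector of the hyperplane
print("b:", clf.intercept_[0])
print("prediction:", clf.predict([[0.5, 0.5]]))   # -1 side of the hyperplane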
A Separating Hyperplane
• A hyperplane wᵀx + b = 0 separates the samples labeled yi = +1 from those labeled yi = −1.
[Figure: two classes of samples, yi = +1 and yi = −1, in the (x1, x2) feature plane, separated by the hyperplane wᵀx + b = 0.]
• But there are many possibilities for such hyperplanes!
Separating Hyperplanes
[Figure: the same two classes, yi = +1 and yi = −1, with several candidate separating hyperplanes drawn between them.]
• Yes, there are many possible separating hyperplanes: it could be this one, or this, or this, or maybe…!
• Which one should we choose?
Choosing a separating hyperplane:
[Figure: a candidate hyperplane with training samples xi and a new point x′ close to them.]
• The hyperplane should be as far as possible from any sample point.
• This way, new data that is close to the old samples will be classified correctly.
• Good generalization!
Choosing a separating hyperplane. The SVM approach: linearly separable case
• The SVM idea is to maximize the distance between the hyperplane and the closest sample point.
• For the optimal hyperplane:
  the distance to the closest negative point = the distance to the closest positive point.
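In the standard formulation (textbook SVM material, not spelled out on these slides), the samples are rescaled so that yi(wᵀxi + b) ≥ 1, with equality for the closest points; each closest point then lies at distance 1/‖w‖ from the hyperplane, so the margin is 2/‖w‖ and maximizing it amounts to:

  minimize (1/2)‖w‖²  subject to  yi(wᵀxi + b) ≥ 1 for all i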
Choosing a separating hyperplane. The SVM approach: linearly separable case
[Figure: the optimal hyperplane with equal distances d to the closest samples on either side, giving margin 2d; the samples xi closest to the hyperplane are the support vectors.]
• Support vectors are the samples closest to the separating hyperplane.
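Continuing the scikit-learn sketch from earlier (toy data, not from the slides), the fitted support vectors can be read off directly:

import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0], [4.0, 4.0]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)
print(clf.support_vectors_)   # the samples closest to the hyperplane
# Here only [1, 1] and [3, 3] constrain the margin, so they are the
# support vectors; the margin equals 2/||w||.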
Thank You