
Page 1: lecture5 prediction class - Purdue University

Data Mining

CS57300, Purdue University

Bruno Ribeiro

January 22, 2018

1

Page 2: lecture5 prediction class - Purdue University

• Classification
  • Observing feature x, predict a discrete-valued label y

• Prediction (point estimate)
  • Observing feature x, predict a real-valued label y
  • Forecasting: predictions + confidence intervals

• Ranking prediction (rank estimate)
  • Predict item ranking
  • E.g.: Google results, Netflix recommendations

2

Differences Between Classification & Prediction

Page 3: lecture5 prediction class - Purdue University

Predictive modeling

• Data representation:

• Training set: Paired attribute vectors and labels <y(i), x(i)> or

n×p tabular data with label (y) and p-1 attributes (x)

• Task: estimate a predictive function f(x;θ)=y

• Assume that there is a function y=f(x) that maps data instances (x) to labels (y)

• Construct a model that approximates the mapping

• Classification: if y is categorical

• Regression: if y is real-valued

3

Page 4: lecture5 prediction class - Purdue University

Modeling approaches

4

Page 5: lecture5 prediction class - Purdue University

Learning predictive models

• Choose a data representation

• Select a knowledge representation (a “model”)

• Defines a space of possible models M={M1, M2, ..., Mk}

• Use search to identify “best” model(s)

• Search the space of models (i.e., with alternative structures and/or parameters)

• Evaluate possible models with a scoring function to determine which model best fits the data

5

Page 6: lecture5 prediction class - Purdue University

Scoring functions

• Given a model M and dataset D, we would like to “score” model M with respect to D

• Goal is to rank the models in terms of their utility (for capturing D) and choose the “best” model

• Score function can be used to search over parameters and/or model structure

• Score functions can be different for:

• Models vs. patterns

• Predictive vs. descriptive functions

• Models with varying complexity (i.e., number of parameters)

6

Page 7: lecture5 prediction class - Purdue University

Predictive scoring functions

• Assess the quality of predictions for a set of instances

• Measures difference between the prediction M makes for an instance i and the true class label value of i

The score sums, over the examples, a distance between prediction and truth:

S(M, D) = \sum_i d\big(\hat{y}^{(i)},\, y^{(i)}\big)

where \hat{y}^{(i)} is the predicted class label for item i, y^{(i)} is the true class label for item i, and d(\cdot,\cdot) is the distance between predicted and true.

7

Page 8: lecture5 prediction class - Purdue University

Predictive scoring functions

• Common score functions:

• Zero-one loss

• Squared loss

• More later…

• Do we minimize or maximize these functions?

Careful with the definition of class labels! Why?

8
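As a concrete illustration (my own sketch, not from the slides), the two score functions fit in a few lines of numpy; the ±1 label encoding is an assumption, and it matters for the squared loss, which is the point of the warning above.

```python
import numpy as np

def zero_one_loss(y_true, y_pred):
    """Number of misclassified examples (to be minimized)."""
    return np.sum(y_true != y_pred)

def squared_loss(y_true, y_score):
    """Sum of squared differences between true labels and predicted scores (to be minimized)."""
    return np.sum((y_true - y_score) ** 2)

# Toy example with labels encoded as +1 / -1 (the label encoding matters for squared loss).
y_true  = np.array([+1, -1, +1, +1])
y_score = np.array([0.9, -0.8, -0.2, 0.7])   # real-valued predictions
y_pred  = np.sign(y_score)                   # thresholded class predictions

print(zero_one_loss(y_true, y_pred))   # 1 misclassified example
print(squared_loss(y_true, y_score))   # (0.1)^2 + (0.2)^2 + (1.2)^2 + (0.3)^2 = 1.58
```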

Page 9: lecture5 prediction class - Purdue University

Scoring functions

• Guide search inside of algorithms

• Select path in heuristic search

• Decide when to stop

• Identify whether to include model or pattern in output

• Evaluate results outside of algorithms

• Measure the absolute quality of model or pattern

• Compare the relative quality of different algorithms or outputs

9

Page 10: lecture5 prediction class - Purdue University

Where’s the search?

10

Page 11: lecture5 prediction class - Purdue University

[Figure: the model score (e.g., a likelihood function) plotted over the model space. Source: Alex Holehouse, Notes from Andrew Ng's Machine Learning Class, http://www.holehouse.org/mlclass/01_02_Introduction_regression_analysis_and_gr.html]

Find parameters that minimize the model score (error)

Usually we maximize (−score)

11

Page 12: lecture5 prediction class - Purdue University

Searching over models/patterns

• Consider a space of possible models M={M1, M2, ..., Mk} with parameters θ

• Search could be over model structures or parameters, e.g.:

• Parameters: In a linear regression model, find the regression coefficients (β) that minimize squared loss on the training data

• Model structure: In a decision tree, find the tree structure that maximizes accuracy on the training data

12

Page 13: lecture5 prediction class - Purdue University

Optimization over score functions

• Smooth functions:

• If a function is smooth (i.e., it is differentiable and its derivatives are continuous), then we can use gradient-based optimization

• If the function is also convex, we can solve the minimization problem in closed form by setting ∇S(θ) = 0 (convex optimization)

• If the function is smooth but non-convex, we can use iterative search over the surface of S to find a local minimum (e.g., hill climbing)

• Non-smooth functions:

• If the function is discrete, then traditional optimization methods that rely on smoothness are not applicable. Instead, we need to use combinatorial optimization

13
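To make the smooth case concrete, here is a minimal gradient-descent sketch (my own illustration, not part of the lecture); the quadratic score S(θ) = ‖θ − 3‖² and the step size are arbitrary choices.

```python
import numpy as np

def gradient_descent(grad_S, theta0, lr=0.1, n_iters=100):
    """Minimize a smooth score S(theta) by repeatedly stepping against its gradient."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iters):
        theta = theta - lr * grad_S(theta)
    return theta

# Illustrative smooth, convex score S(theta) = ||theta - 3||^2 with gradient 2 * (theta - 3).
grad_S = lambda theta: 2.0 * (theta - 3.0)
print(gradient_descent(grad_S, theta0=[0.0, 10.0]))   # approaches [3., 3.]
```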

Page 14: lecture5 prediction class - Purdue University

14

Linear Methods for Classification

Page 15: lecture5 prediction class - Purdue University

• Given features x of a car (length, width, mpg, maximum speed, …)
• Classify cars into categories based on x

15

Motivation 1/2

Page 16: lecture5 prediction class - Purdue University

• A person arrives at the emergency room with a set of symptoms that could possibly be attributed to one of three medical conditions. Which of the three conditions does the individual have?

• An online banking service must be able to determine whether or not a transaction being performed on the site is fraudulent, on the basis of the user’s IP address, past transaction history, and so forth.

• On the basis of DNA sequence data for a number of patients with and without a given disease, a biologist would like to figure out which DNA mutations are deleterious (disease-causing) and which are not.

16

Motivation 2/2

Page 17: lecture5 prediction class - Purdue University

• Two classes
  • x is a D-dimensional real-valued vector (set of features)
  • y is the car class

• Find linear discriminant weights w

• Score function is the squared error

• Search algorithm: least squares

17

Linear Discriminant Function (Two Classes)

[Figure 4.1 and excerpt, Bishop PRML §4.1.1 (Two classes): the simplest linear discriminant is y(x) = w^T x + w_0, where w is the weight vector and w_0 the bias. An input x is assigned to class C_1 if y(x) ≥ 0 and to class C_2 otherwise, so the decision boundary y(x) = 0 is a (D − 1)-dimensional hyperplane. The vector w is orthogonal to every vector lying in the decision surface and determines its orientation; the normal distance from the origin to the surface is −w_0/‖w‖, and y(x)/‖w‖ gives the signed perpendicular distance r of a point x from the surface (writing x = x_⊥ + r w/‖w‖ gives r = y(x)/‖w‖). With a dummy input x_0 = 1 and augmented vectors w̃ = (w_0, w) and x̃ = (x_0, x), this is compactly y(x) = w̃^T x̃, and the decision surfaces become D-dimensional hyperplanes through the origin of the (D + 1)-dimensional expanded input space. Building a K-class discriminant by combining several two-class discriminants (e.g., one-versus-the-rest) leads to difficulties, discussed on the following slides.]

y_c = \begin{cases} 1, & \text{if car } c \text{ is ``small''} \\ -1, & \text{if car } c \text{ is ``luxury''} \end{cases}

Figure: C. Bishop

Page 18: lecture5 prediction class - Purdue University

18

How to Deal with Multiple Classes?

Page 19: lecture5 prediction class - Purdue University

• How to classify objects into multiple types?

19

Naïve Approach: one vs. many Classification

y^{(1)}_c = \begin{cases} 1, & \text{if car } c \text{ is ``small''} \\ -1, & \text{if car } c \text{ is ``luxury''} \end{cases}

y^{(2)}_c = \begin{cases} 1, & \text{if car } c \text{ is ``small''} \\ -1, & \text{if car } c \text{ is ``medium''} \end{cases}

y^{(3)}_c = \begin{cases} 1, & \text{if car } c \text{ is ``medium''} \\ -1, & \text{if car } c \text{ is ``luxury''} \end{cases}

Might work OK in some scenarios… but not clear in this case

Page 20: lecture5 prediction class - Purdue University

20

Issue with using binary classifiers for K classes

[Figure 4.2 and excerpt, Bishop PRML §4.1.2: attempting to construct a K-class discriminant from a set of two-class discriminants leads to ambiguous regions, shown in green. Left: one-versus-the-rest, K − 1 classifiers each separating points in class C_k from points not in that class. Right: one-versus-one, K(K − 1)/2 binary discriminants, one for every pair of classes, combined by majority vote; this too leaves ambiguously classified regions. In the car example the regions correspond to "small", "medium", and "luxury", and the ambiguous area is the uncertain classification region.]

Figure: C. Bishop

Page 21: lecture5 prediction class - Purdue University

21

Using Least Squares for Multiple Classes (Solution)

Page 22: lecture5 prediction class - Purdue University

• We start by encoding the classes as a one-hot binary coding scheme
  • Class 3 is encoded as i_3, the 3rd column of the identity matrix I_K

22

Encoding the K Classes

The K \times K identity matrix:

I_K = \begin{pmatrix} 1 & 0 & 0 & \cdots & 0 \\ 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ & & & \ddots & \\ 0 & 0 & 0 & \cdots & 1 \end{pmatrix}, \qquad \text{with columns } i_1, i_2, \ldots, i_K \ (i_3 \text{ is the 3rd column})
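A tiny numpy sketch of this encoding (illustrative only): each integer class label is mapped to the corresponding column i_k of I_K.

```python
import numpy as np

def one_hot(labels, K):
    """Map integer class labels 1..K to the corresponding columns i_k of the identity I_K."""
    I_K = np.eye(K)
    return I_K[:, np.asarray(labels) - 1]   # one column per example

# Three car classes: small = 1, medium = 2, luxury = 3.
print(one_hot([3, 1, 2], K=3))   # columns are i_3, i_1, i_2
```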

Page 23: lecture5 prediction class - Purdue University

• Assign item with features x to class k if yk(x) > yj(x), ∀j ≠ k

• The decision boundary is a (D-1) dimensional hyperplane

23

Linear Function

[Figures 4.1 and 4.2, Bishop PRML (textbook pages repeated from the previous slides): these difficulties are avoided by a single K-class discriminant comprising K linear functions y_k(x) = w_k^T x + w_{k0}, assigning a point x to class C_k if y_k(x) > y_j(x) for all j ≠ k. The decision boundary between classes C_k and C_j is given by y_k(x) = y_j(x), i.e. the (D − 1)-dimensional hyperplane (w_k − w_j)^T x + (w_{k0} − w_{j0}) = 0, which has the same form as in the two-class case and hence analogous geometrical properties (illustrated for D = 2).]

Page 24: lecture5 prediction class - Purdue University

The model is

y(x) = \tilde{W}^T \tilde{x}

where \tilde{x} = (1, x^T)^T is the augmented input and \tilde{W} is the (D + 1) \times K matrix whose kth column is \tilde{w}_k = (w_{k0}, w_k^T)^T.

The least-squares solution is

\tilde{W} = (\tilde{X}^T \tilde{X})^{-1} \tilde{X}^T T = \tilde{X}^\dagger T

where \tilde{X}^\dagger is the pseudo-inverse of \tilde{X}, \tilde{X} is the matrix whose cth row is the augmented input \tilde{x}_c^T = (1, x_c^T), and T is the matrix whose cth row is i_{y_c}^T, the one-hot encoding of the class y_c.

24

Least Squares Solution in Matrix Form

[Figure 4.3 and excerpt, Bishop PRML §4.1.3 (Least squares for classification): the decision regions of such a discriminant are always singly connected and convex, since any point on the line connecting two points of region R_k also lies in R_k. Minimizing the sum-of-squares error E_D(\tilde{W}) = \tfrac{1}{2} \mathrm{Tr}\{(\tilde{X}\tilde{W} - T)^T (\tilde{X}\tilde{W} - T)\} gives \tilde{W} = (\tilde{X}^T\tilde{X})^{-1}\tilde{X}^T T = \tilde{X}^\dagger T and the discriminant y(x) = \tilde{W}^T \tilde{x} = T^T (\tilde{X}^\dagger)^T \tilde{x}. If every target vector satisfies a linear constraint a^T t_n + b = 0, the model predictions satisfy the same constraint, so with a 1-of-K coding the elements of y(x) sum to 1 for any x; they are not, however, constrained to lie in (0, 1), so they cannot be interpreted as probabilities. Least-squares solutions also lack robustness to outliers: additional points that lie a long way on the correct side of the decision boundary can shift the boundary significantly, because the sum-of-squares error penalizes predictions that are ``too correct''.]
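Putting the pieces together, here is a compact numpy sketch of the least-squares classifier described above (my own illustration following the equations on this slide; the toy data are made up): stack the augmented inputs as rows of X̃ and the one-hot targets as rows of T, solve W̃ = X̃†T with the pseudo-inverse, and classify a new input by the largest output y_k(x).

```python
import numpy as np

def fit_least_squares_classifier(X, y, K):
    """W_tilde = pinv(X_tilde) @ T with one-hot targets T and augmented inputs X_tilde."""
    N = X.shape[0]
    X_tilde = np.hstack([np.ones((N, 1)), X])   # c-th row is (1, x_c^T)
    T = np.eye(K)[y]                            # c-th row is the one-hot encoding of class y_c
    return np.linalg.pinv(X_tilde) @ T          # (D + 1) x K parameter matrix

def predict(W_tilde, X):
    """Assign each row of X to the class with the largest output y_k(x)."""
    N = X.shape[0]
    X_tilde = np.hstack([np.ones((N, 1)), X])
    return np.argmax(X_tilde @ W_tilde, axis=1)

# Toy data: three car classes in two features (hypothetical clusters).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.5, size=(20, 2)) for m in ([0, 0], [3, 0], [0, 3])])
y = np.repeat([0, 1, 2], 20)

W = fit_least_squares_classifier(X, y, K=3)
print(np.mean(predict(W, X) == y))   # training accuracy on well-separated clusters
```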

Page 25: lecture5 prediction class - Purdue University

25

Working with non-linearly separable data…

Page 26: lecture5 prediction class - Purdue University

26

Linear Methods → Linear Boundaries?

[Figure 4.1, Hastie, Tibshirani & Friedman, The Elements of Statistical Learning, §4.2: the left plot shows data from three classes, with linear decision boundaries found by linear discriminant analysis; the right plot shows quadratic decision boundaries, obtained by finding linear boundaries in the five-dimensional space X_1, X_2, X_1X_2, X_1^2, X_2^2. Linear inequalities in this space are quadratic inequalities in the original space.]

[Excerpt, ESL §4.2 (Linear Regression of an Indicator Matrix): each response category is coded via an indicator variable, so with K classes there are K indicators Y_k, collected into an N \times K indicator response matrix \mathbf{Y}. Fitting a linear regression model to each column simultaneously gives \hat{\mathbf{Y}} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}, with coefficient matrix \hat{\mathbf{B}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y} and model matrix \mathbf{X} having a leading column of 1's for the intercept. A new observation with input x is classified by computing \hat{f}(x)^T = (1, x^T)\hat{\mathbf{B}} and taking \hat{G}(x) = \arg\max_{k} \hat{f}_k(x).]

Data from three classes, with linear decision boundaries

Linear boundaries in the five-dimensional space \phi(x) = (X_1, X_2, X_1X_2, X_1^2, X_2^2)

Linear inequalities in the transformed space are quadratic inequalities in the original space.

\phi \equiv \phi(x)
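As a sketch of the idea in the figure (my own illustration, assuming the same five-dimensional expansion), the basis transformation ϕ(x) = (x1, x2, x1·x2, x1², x2²) can be computed up front and fed to any linear classifier; a linear boundary in ϕ-space is then a quadratic boundary in the original space.

```python
import numpy as np

def quadratic_basis(X):
    """phi(x) = (x1, x2, x1*x2, x1^2, x2^2) for 2-D inputs X, one row per example."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, x1 * x2, x1 ** 2, x2 ** 2])

# A linear classifier (e.g., the least-squares classifier sketched earlier) trained on
# quadratic_basis(X) has a linear boundary in the 5-D phi-space, which corresponds to a
# quadratic boundary in the original (x1, x2) space.
X = np.array([[1.0, 2.0], [0.5, -1.0]])
print(quadratic_basis(X))
```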

Page 27: lecture5 prediction class - Purdue University

• Helps if ϕ_n = ϕ(x_n) is binary, quantized, or normalized
  • Example: if the 1st feature x_n(1) is the age of user n, then
    • ϕ_n(1) could be an indicator of whether x_n(1) belongs to the 1st quantile
    • ϕ_n(2) an indicator of whether x_n(1) belongs to the 2nd quantile
    • …

• Useful to add interaction terms ϕ_h = ϕ(x_n) ○ ϕ(x_m), where "○" is the Hadamard product (element-wise product)

• The XOR operator can be better than the Hadamard product for binary variables.

27

Tips
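A rough numpy sketch of these tips (illustrative; the quartile binning of age and the example values are my own choices): build quantile indicator features for a raw attribute and form element-wise (Hadamard) interaction features.

```python
import numpy as np

def quantile_indicators(x, n_bins=4):
    """One indicator feature per quantile bin of a raw feature (e.g., the age of a user)."""
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])   # interior cut-points
    bins = np.digitize(x, edges)                                  # bin index 0 .. n_bins-1
    return np.eye(n_bins)[bins]                                   # indicator columns

def interaction(phi_a, phi_b):
    """Hadamard (element-wise) product of two feature vectors or matrices."""
    return phi_a * phi_b

age = np.array([18, 25, 34, 47, 52, 61, 70])
phi_age = quantile_indicators(age)                  # 7 x 4 matrix of quartile indicators
print(phi_age)
# For two indicator columns of the same binned feature the Hadamard product is all zeros
# (the bins are mutually exclusive), which is why XOR can be more informative for binaries.
print(interaction(phi_age[:, 0], phi_age[:, 1]))
```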

Page 28: lecture5 prediction class - Purdue University

[Figure: two panels of two-class data in the (x_1, x_2) plane showing the least-squares decision boundary; annotations mark the distances ε and h + ε of points from the boundary.]

Least Squares Solution:

\text{minimize} \sum_{c \in \mathrm{TrainingDataCars}} \big(y_c - y(x_c)\big)^2

28

Issues with Least Squares Classification

With the square loss (score), the optimization cares too much about reducing the distance to the boundary of well-separable items…

…the solution is to change the score function.

Page 29: lecture5 prediction class - Purdue University

• Back to two classes, C_1 and C_2

• Logistic regression is often used for two classes:

p(C_1 | \phi) = y(\phi) = \sigma(w^T \phi), \qquad p(C_2 | \phi) = 1 - p(C_1 | \phi)

where

\sigma(a) = \frac{\exp(a)}{1 + \exp(a)}

is the logistic function and \phi \equiv \phi(x) is a transformation of the feature vector (possibly non-linear)

• But it can be easily generalized to K classes

29

Logistic Regression (for Classification)

[Excerpt, Bishop PRML §4.3.2: under rather general assumptions, the posterior probability of class C_1 can be written as a logistic sigmoid acting on a linear function of the feature vector φ. For an M-dimensional feature space φ this model has M adjustable parameters, whereas fitting Gaussian class-conditional densities by maximum likelihood would require M(M + 5)/2 + 1 parameters; for large M there is a clear advantage in working with the logistic regression model directly. The derivative of the logistic sigmoid is dσ/da = σ(1 − σ).]

Page 30: lecture5 prediction class - Purdue University

• For a dataset \{\phi_c, t_c\}, where \phi_c is the transformed feature vector of car c and t_c \in \{0, 1\} is the class of car c \in \mathrm{TrainingDataCars}

• The likelihood function over the training data is

p(\mathbf{t} | w) = \prod_{c \in \mathrm{TrainingDataCars}} y(\phi_c)^{t_c} \big(1 - y(\phi_c)\big)^{1 - t_c}

where y(\phi_c) = \sigma(w^T \phi_c)

• The log of the likelihood function (log-likelihood) is

\log p(\mathbf{t} | w) = \sum_{c \in \mathrm{TrainingDataCars}} t_c \log y(\phi_c) + (1 - t_c) \log\big(1 - y(\phi_c)\big)

• To fit the model, we maximize the log-likelihood over the parameters w

• The negative of this quantity is also described as the cross-entropy loss, logistic loss, or log loss in TensorFlow, scikit-learn, and PyTorch

30

Finding Logistic Regression Parameters (binary classification)

[Excerpt, Bishop PRML §4.3.2, eqs. (4.89)–(4.91): the negative log-likelihood is the cross-entropy error E(w) = -\sum_n \{t_n \ln y_n + (1 - t_n) \ln(1 - y_n)\} with y_n = \sigma(w^T \phi_n), and its gradient is \nabla E(w) = \sum_n (y_n - t_n)\phi_n, the same form as the gradient of the sum-of-squares error for linear regression. Maximum likelihood can exhibit severe over-fitting for linearly separable data sets, since the magnitude of w goes to infinity; this can be avoided by including a prior (MAP solution) or, equivalently, adding a regularization term.]
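A minimal numpy sketch of this likelihood (my own illustration; the toy Φ, t, and w are made up): compute y_c = σ(wᵀϕ_c) and evaluate the log-likelihood, which is maximized over w (equivalently, the cross-entropy loss is minimized).

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def log_likelihood(w, Phi, t):
    """log p(t|w) = sum_c [ t_c log y_c + (1 - t_c) log(1 - y_c) ],  y_c = sigmoid(w^T phi_c)."""
    y = sigmoid(Phi @ w)
    return np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

# Toy training data: rows of Phi are transformed feature vectors phi_c, with t_c in {0, 1}.
Phi = np.array([[1.0, 0.2], [1.0, 1.5], [1.0, -0.7]])   # first column plays the role of a bias feature
t   = np.array([0, 1, 0])
w   = np.array([-0.1, 1.0])
print(log_likelihood(w, Phi, t))   # maximize over w; its negative is the cross-entropy loss
```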

Page 31: lecture5 prediction class - Purdue University

• The above equations give the following gradient:

\nabla_w \log p(\mathbf{t} | w) = \sum_{c \in \mathrm{TrainingDataCars}} \big(t_c - y(\phi_c)\big)\phi_c = \Phi^T(\mathbf{t} - \mathbf{y})

• The logistic function has derivative

\frac{d}{da}\sigma(a) = \sigma(a)\big(1 - \sigma(a)\big)

• The second derivative (Hessian) of the negative log-likelihood is then (verify!)

H = \Phi^T R \Phi, \qquad R_{cc} = y(\phi_c)\big(1 - y(\phi_c)\big)

where R is diagonal and H is a positive definite matrix. Because H is positive definite, the log-likelihood is concave in w and the optimization has a unique maximum.

Exercise: what are the dimensions of \Phi?

31

Solving Logistic Regression via Maximum Likelihood

Page 32: lecture5 prediction class - Purdue University

• The iterative parameter update is (Newton-Raphson):

w^{(\mathrm{new})} = w^{(\mathrm{old})} - (\Phi^T R \Phi)^{-1} \Phi^T (\mathbf{y} - \mathbf{t}) = (\Phi^T R \Phi)^{-1} \Phi^T R \mathbf{z}

where \mathbf{z} = \Phi w^{(\mathrm{old})} - R^{-1}(\mathbf{y} - \mathbf{t}).

32

Iterative Update

[Excerpt, Bishop PRML §4.3.3, eqs. (4.98)–(4.103): R is the N × N diagonal matrix with R_{nn} = y_n(1 − y_n). The Hessian depends on w through R, so the update takes the form of a set of normal equations for a weighted least-squares problem that must be applied iteratively, recomputing R from the new w each time; the algorithm is therefore known as iterative reweighted least squares (IRLS). The elements of R can be interpreted as variances, since E[t] = y and var[t] = y(1 − y), and z_n = \phi_n^T w^{(\mathrm{old})} - (y_n - t_n)/(y_n(1 - y_n)) is an effective target value obtained from a local linear approximation of the logistic sigmoid around w^{(\mathrm{old})}.]
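A short numpy sketch of this IRLS / Newton-Raphson update (my own illustration of the equations above; the toy data are made up and chosen to be non-separable, since, as the excerpt notes, maximum likelihood diverges on linearly separable data unless regularized).

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def irls(Phi, t, n_iters=10):
    """Iterative reweighted least squares (Newton-Raphson) for binary logistic regression."""
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iters):
        y = sigmoid(Phi @ w)                  # y_c = sigma(w^T phi_c)
        R = np.diag(y * (1 - y))              # diagonal weighting matrix R
        grad = Phi.T @ (y - t)                # gradient of the cross-entropy error
        H = Phi.T @ R @ Phi                   # Hessian, positive definite
        w = w - np.linalg.solve(H, grad)      # w_new = w_old - H^{-1} grad
    return w

# Toy, non-separable data (first column of Phi is a bias feature).
Phi = np.array([[1.0, -2.0], [1.0, -0.5], [1.0, 0.3], [1.0, 2.2]])
t   = np.array([0, 1, 0, 1])
print(irls(Phi, t))
```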

Page 33: lecture5 prediction class - Purdue University

• Let π_i be the probability of sampling example i into the training data

• We say a data sample is biased when π_i is not uniform

• Correcting for bias in the likelihood function:

\log p(\mathbf{t} | w) = \sum_{c \in \mathrm{TrainingDataCars}} \frac{1}{\pi_c}\Big(t_c \log y(\phi_c) + (1 - t_c)\log\big(1 - y(\phi_c)\big)\Big)

where C_c is the class of car c.

• Generally, it is a good idea to test the training data with different weights

  • The weights can be used to protect against sampling designs which could cause selection bias.

  • The weights can be used to protect against misspecification of the model.

• Unbalanced datasets:
  • Suppose there are more males than females in the dataset. What will happen to the decision boundary? How to fix it?

33

Training Logistic Regression with Data Selection Bias
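A small numpy sketch of this correction (illustrative; the sampling probabilities π_c are made up): each example's contribution to the log-likelihood is weighted by 1/π_c.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def weighted_log_likelihood(w, Phi, t, pi):
    """Each example c contributes with weight 1/pi_c, its inverse sampling probability."""
    y = sigmoid(Phi @ w)
    return np.sum((1.0 / pi) * (t * np.log(y) + (1 - t) * np.log(1 - y)))

Phi = np.array([[1.0, 0.2], [1.0, 1.5], [1.0, -0.7]])
t   = np.array([0, 1, 0])
pi  = np.array([0.5, 0.1, 0.4])   # hypothetical per-example sampling probabilities
w   = np.array([-0.1, 1.0])
print(weighted_log_likelihood(w, Phi, t, pi))
```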

Page 34: lecture5 prediction class - Purdue University

• Consider K classes and N observations
• Let C_i be the class of the i-th example, with feature vector ϕ_i

P(C_c = k | \phi_c) = \frac{\exp(w_k^T \phi_c)}{\sum_{h=1}^{K} \exp(w_h^T \phi_c)} \qquad \text{(a.k.a. softmax)}

• If we assume a one-hot encoding of the target variable t_c, the log-likelihood function is

\sum_{c \in \mathrm{TrainingDataCars}} \sum_{k=1}^{K} t_{c,k} \log \frac{\exp(w_k^T \phi_c)}{\sum_{h=1}^{K} \exp(w_h^T \phi_c)}
= \sum_{c \in \mathrm{TrainingDataCars}} \sum_{k=1}^{K} t_{c,k}\, w_k^T \phi_c \;-\; \sum_{c \in \mathrm{TrainingDataCars}} \log \sum_{h=1}^{K} \exp(w_h^T \phi_c)

34

Multiclass Logistic Regression (MLR)
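A numpy sketch of the softmax probabilities and this log-likelihood (my own illustration; the max-subtraction for numerical stability is a standard trick, not from the slide).

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax: P(C = k | phi) = exp(w_k^T phi) / sum_h exp(w_h^T phi)."""
    scores = scores - scores.max(axis=1, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)

def mlr_log_likelihood(W, Phi, T):
    """sum_c sum_k t_{c,k} w_k^T phi_c  -  sum_c log sum_h exp(w_h^T phi_c)."""
    scores = Phi @ W                              # N x K matrix of w_k^T phi_c
    return np.sum(T * np.log(softmax(scores)))    # one-hot T picks out log P(true class)

# Toy data: N = 3 examples, M = 2 features, K = 3 classes, one-hot targets T.
Phi = np.array([[1.0, 0.5], [1.0, -1.0], [1.0, 2.0]])
T   = np.eye(3)[[0, 2, 1]]            # classes of the three examples
W   = np.zeros((2, 3))                # M x K weight matrix whose columns are the w_k
print(mlr_log_likelihood(W, Phi, T))  # equals 3 * log(1/3) at W = 0
```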