

CS 273A: Machine Learning Fall 2021

Lecture 9: Logistic Regression

Roy Fox, Department of Computer Science, Bren School of Information and Computer Sciences, University of California, Irvine

All slides in this course adapted from Alex Ihler & Sameer Singh


Logistics

Assignments:

• Assignment 3 due next Tuesday, Oct 26

Midterm:

• Midterm exam on Nov 4, 11am–12:20 in SH 128

• If you're eligible to be remote — let us know by Oct 28

• If you're eligible for more time — let us know by Oct 28

• Review during lecture next Thursday


Separability

• Separable dataset = there's a model (in our class) with perfect prediction

• Separable problem = there's a model with 0 test loss: 𝔼_{x,y∼p}[ℓ(y, ŷ(x))] = 0

‣ Also called realizable

• Linearly separable = separable by a linear classifier (hyperplane)

[Figures: linearly separable data vs. linearly non-separable data in the (x₁, x₂) plane]


Adding features

• How to make the perceptron more expressive?

‣ Add features — recall linear → polynomial regression

• Linearly non-separable:

• Linearly separable in quadratic features:

• Visualized in original feature space:

‣ Decision boundary: ax² + bx + c = 0

[Figures: 1D data on the x₁ axis; the same data lifted to (x₁, x₂ = x₁²), where the classifier ŷ = T(ax² + bx + c) has a linear decision boundary in the features]
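As a concrete (made-up) illustration of this trick, the following numpy sketch takes 1D data that no single threshold on x can separate, adds the feature x², and applies a hand-picked linear rule ŷ = T(ax² + bx + c) in the new feature space; the data and coefficients are assumptions for illustration only.

import numpy as np

# 1D data: negative class in the middle, positive class on both sides --
# not separable by a single threshold on x alone (hypothetical example).
x = np.array([-3.0, -2.5, -0.5, 0.0, 0.5, 2.5, 3.0])
y = np.array([+1, +1, -1, -1, -1, +1, +1])

# Add the quadratic feature x2 = x^2; in (x, x^2) space the classes
# can be split by a line, e.g. a*x^2 + b*x + c with a=1, b=0, c=-2.
features = np.stack([x, x**2], axis=1)
a, b, c = 1.0, 0.0, -2.0
response = a * features[:, 1] + b * features[:, 0] + c
y_hat = np.sign(response)

print(np.all(y_hat == y))  # True: linearly separable in the quadratic features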


Adding features

• Which functions do we need to represent the decision boundary?

‣ When linear functions aren't sufficiently expressive

‣ Perhaps quadratic functions are

ax₁² + bx₁ + cx₂² + dx₂ + ex₁x₂ + f = 0

AND:
x₁ x₂ | y
0  0  | −1
0  1  | −1
1  0  | −1
1  1  | +1

XOR:
x₁ x₂ | y
0  0  | −1
0  1  | +1
1  0  | +1
1  1  | −1
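For the XOR case, a small numpy sketch (not from the slides; the weights are hand-picked and hypothetical) showing that no linear function of (x₁, x₂) is needed for AND, while XOR becomes linearly separable once the cross term x₁x₂ is added as a feature.

import numpy as np

# XOR with 0/1 inputs and labels in {-1, +1}, as in the table above.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, +1, +1, -1])

# No linear function of (x1, x2) separates XOR, but adding the cross
# term x1*x2 does: sign(x1 + x2 - 2*x1*x2 - 0.5) classifies all 4 points.
# (These particular weights are just one hand-picked solution.)
x1, x2 = X[:, 0], X[:, 1]
y_hat = np.sign(x1 + x2 - 2 * x1 * x2 - 0.5)
print(np.all(y_hat == y))  # True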


Separability in high dimension

• As we add more features → dimensionality of instance increases:

‣ Separability becomes easier: more parameters, more models that could separate

‣ Add enough (good) features, and even a linear classifier can separate

- Given a decision boundary f(x) = 0: add f(x) as a feature → linearly separable!

• However:

‣ Do these features explain test data or just training data?

‣ Increasing model complexity can lead to overfitting



Recap

• Perceptron = linear classifier

‣ Linear response → step decision function → discrete class prediction

‣ Linear decision boundary

• Separability = existence of a perfect model (in the class)

‣ Separable data: 0 loss on this data

‣ Separable problem: 0 loss on the data distribution

‣ Perceptron: linear separability

• Adding features:

‣ Complex features: complex decision boundary, easier separability

‣ Can lead to overfitting


Today's lecture

Learning perceptrons

Logistic regression

Multi-class classifiers

VC dimension


Learning a perceptron

• What do we need to learn the parameters θ of a perceptron?

‣ Training data 𝒟 = labeled instances

‣ Loss function ℒθ = error rate on labeled data

‣ Optimization algorithm = method for minimizing training loss


Error rate

• Error rate: ℒθ = (1/m) ∑ᵢ δ(y^(i) ≠ fθ(x^(i)))

‣ With the indicator δ(y ≠ ŷ) = 1 if y ≠ ŷ, 0 otherwise

[Figure: labeled points (x, y) with 2 of 9 misclassified ⇒ ℒθ = 2/9]
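In code, the error rate is just the average of this indicator over the training set; a minimal numpy sketch with made-up data (the helper name, the constant bias feature, and the toy weights are assumptions for illustration).

import numpy as np

def error_rate(theta, X, y):
    """Fraction of training points where sign(theta^T x) != y (0-1 loss)."""
    y_hat = np.sign(X @ theta)
    return np.mean(y_hat != y)

# Toy usage with made-up data: X includes a constant feature for the bias.
X = np.array([[1, 2.0], [1, -1.0], [1, 0.5], [1, -3.0]])
y = np.array([+1, -1, +1, -1])
theta = np.array([0.0, 1.0])
print(error_rate(theta, X, y))  # 0.0 on this toy set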


Use linear regression?

• Idea: find θ using linear regression

• Affected by large regression losses

‣ We only care about the classification loss


Perceptron: gradient-based learning

• Problem: the 0–1 loss ℒθ(x, y) = δ(y ≠ sign(θ⊺x)) is not differentiable

‣ Write it differently: ℒθ(x, y) = (1/4)(y − sign(θ⊺x))²

- but ∇θ sign(θ⊺x) = 0 almost everywhere

‣ We also don't want the MSE: ℒθ(x, y) = (1/2)(y − θ⊺x)²

‣ Compromise: ℒθ(x, y) = (y − sign(θ⊺x))(y − θ⊺x)

- ∇θℒθ = −(y − sign(θ⊺x))x = −(y − ŷ)x

[Figure: step decision function T(r) as a function of the response r]


Perceptron training algorithm

• Similar to linear regression with MSE loss

‣ Except that ŷ is the class prediction, not the linear response

‣ No update for correct predictions: ŷ^(j) = y^(j)

‣ For incorrect predictions: y^(j) − ŷ^(j) = ±2

- update θ towards x (for a false negative) or −x (for a false positive)

[Algorithm figure: for each point j, predict the output ŷ^(j), then take a gradient step on the "weird" perceptron loss]
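A minimal numpy sketch of this training loop (the learning rate, stopping rule, helper name, and toy data are assumptions for illustration, not the course's reference implementation).

import numpy as np

def perceptron_train(X, y, alpha=0.5, epochs=100):
    """Perceptron algorithm: gradient steps on the 'weird' loss
    (y - sign(theta^T x)) * (y - theta^T x).  Labels y in {-1, +1};
    X should already include a constant feature for the bias."""
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        updated = False
        for x_j, y_j in zip(X, y):
            y_hat = np.sign(theta @ x_j) or -1.0   # treat sign(0) as -1
            if y_hat != y_j:                       # incorrect: update toward +/- x
                theta = theta - alpha * (y_hat - y_j) * x_j
                updated = True
        if not updated:                            # converged: no more updates
            break
    return theta

# Toy usage on a linearly separable set (made up for illustration).
X = np.array([[1, 2.0, 1.0], [1, 1.0, 3.0], [1, -1.5, -1.0], [1, -2.0, -2.5]])
y = np.array([+1, +1, -1, -1])
theta = perceptron_train(X, y)
print(np.all(np.sign(X @ theta) == y))  # True once converged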


Perceptron training algorithm (illustration)

[Figures: running the algorithm — for each point j, predict the output and take a gradient step on the "weird" loss; incorrect prediction: update weights; correct prediction: no update; convergence: no more updates]


Today's lecture

Learning perceptrons

Logistic regression

Multi-class classifiers

VC dimension


Perceptron

• Perceptron = linear classifier

‣ Parameters = weights θ (also denoted w)

‣ Response = weighted sum of the features: r = θ⊺x

‣ Prediction = thresholded response: ŷ(x) = T(r) = T(θ⊺x)

‣ Decision function (for T(r) = sign(r)): ŷ(x) = +1 if θ⊺x > 0, −1 otherwise

• Update rule: θ ← θ − α(ŷ − y)x, where (ŷ − y) is the error

[Figure: step decision function T(r) vs. response r]


Widening the classification margin

• Which decision boundary is "better"?

‣ Both have 0 training loss

‣ But one seems more robust, expected to generalize better

• Benefit of smooth loss function: care about margin

‣ Encourage distancing the boundary from data points


Surrogate loss functions

• Alternative: use a differentiable loss function

‣ E.g., approximate the step function with a smooth sigmoid function ("sigmoid" = "looks like an s")

‣ Popular choice: the logistic / sigmoid function σ(x) = 1/(1 + exp(−x)) ∈ [0, 1]

‣ For this part, let's assume y ∈ {0, 1}

• MSE loss: ℒθ(x, y) = (y − σ(r(x)))²

‣ Far from the boundary: σ ≈ 0 or 1, and the loss approximates the 0–1 loss

‣ Near the boundary: σ ≈ 1/2, the loss is near 1/4, but there's a clear improvement direction

[Figures: step function T(r) vs. sigmoid σ(r), which crosses 0.5 at r = 0]


Learning smooth linear classifiers

• Use a gradient-based optimizer on the loss ℒθ(x, y) = (y − σ(θ⊺x))²

‣ −∇θℒθ(x, y) = 2(y − σ(θ⊺x)) σ′(θ⊺x) x

• Logistic function: σ(r) = 1/(1 + exp(−r))

• Its derivative: σ′(r) = σ(r)(1 − σ(r))

‣ Saturates for both r → ∞ and r → −∞

• Confidently correct prediction: σ(r) ≈ y ∈ {0, 1} ⟹ ∇θℒθ ≈ 0 (good)

• Confidently incorrect prediction: σ(r) ≈ 1 − y ⟹ ∇θℒθ ≈ 0 (bad!)

[Figure: σ(r) vs. r, with the error and the sensitivity σ′(r) annotated]
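A small numeric check (the point, weights, and helper name are made up) of the last bullet: for a point far from the boundary the sigmoid saturates, so the MSE gradient is near zero whether the prediction is confidently right or confidently wrong.

import numpy as np

def sigmoid(r):
    return 1.0 / (1.0 + np.exp(-r))

def mse_sigmoid_grad(theta, x, y):
    """Negative gradient of (y - sigma(theta^T x))^2 with respect to theta."""
    r = theta @ x
    s = sigmoid(r)
    return 2 * (y - s) * s * (1 - s) * x

x = np.array([1.0, 10.0])        # point far from the boundary (made up)
theta = np.array([0.0, 1.0])     # response r = 10, so sigma(r) is nearly 1
print(mse_sigmoid_grad(theta, x, y=1))  # ~0: confidently correct, fine
print(mse_sigmoid_grad(theta, x, y=0))  # ~0 too: confidently WRONG, but almost no gradient signal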


Learning smooth linear classifiers

• With a smooth loss function ℒθ(x, y) = (y − σ(r(x)))² we can use Stochastic Gradient Descent

[Figures: SGD iterations on a toy dataset; the training loss decreases ℒ = 1.9 → 0.4 → 0.1, converging to the minimum training MSE]
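A minimal SGD sketch for this smooth surrogate loss (the learning rate, epoch count, helper name, and toy data are assumptions for illustration).

import numpy as np

def sigmoid(r):
    return 1.0 / (1.0 + np.exp(-r))

def sgd_sigmoid_mse(X, y, alpha=0.5, epochs=200, seed=0):
    """Stochastic gradient descent on L(x, y) = (y - sigma(theta^T x))^2, y in {0, 1}."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for j in rng.permutation(len(X)):          # one point at a time
            s = sigmoid(theta @ X[j])
            theta += alpha * 2 * (y[j] - s) * s * (1 - s) * X[j]
    return theta

# Toy usage: constant feature for the bias, labels in {0, 1} (made up).
X = np.array([[1, 2.0], [1, 1.5], [1, -1.0], [1, -2.5]])
y = np.array([1, 1, 0, 0])
theta = sgd_sigmoid_mse(X, y)
print(np.round(sigmoid(X @ theta), 2))  # close to [1, 1, 0, 0]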


Maximum likelihood

• What if we had a probabilistic predictor pθ(y | x)?

• The better the parameter θ, the more likely the training data:

pθ(y^(1), …, y^(m) | x^(1), …, x^(m)) = ∏ⱼ pθ(y^(j) | x^(j))

• Bayesian interpretation?

‣ MAP: arg maxθ p(θ | 𝒟) = arg maxθ p(θ) p(𝒟x) pθ(𝒟y | 𝒟x) = arg maxθ pθ(𝒟y | 𝒟x)

- the last step assumes a uniform p(θ) — except, often there's no uniform distribution over parameter space

• Maximum log-likelihood: maxθ (1/m) ∑ⱼ log pθ(y^(j) | x^(j)) — an average over the training dataset


Logistic Regression

• Can we turn a linear response into a probability? Sigmoid! σ : ℝ → [0, 1]

• Think of σ(θ⊺x) = pθ(y = 1 | x)

• Negative Log-Likelihood (NLL) loss:

ℒθ(x, y) = −log pθ(y | x) = −y log σ(θ⊺x) − (1 − y) log(1 − σ(θ⊺x))

(the first term is active for y = 1, the second for y = 0)

[Figure: per-point NLL values such as −log 0.99, −log 0.7, −log 0.98 for correctly classified points and −log 0.1 for a misclassified one]
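A minimal numpy sketch of this NLL loss (the clipping constant, helper name, and toy data are assumptions; the clip simply guards against log(0) numerically).

import numpy as np

def sigmoid(r):
    return 1.0 / (1.0 + np.exp(-r))

def nll_loss(theta, X, y, eps=1e-12):
    """Average negative log-likelihood for logistic regression, y in {0, 1}."""
    p = np.clip(sigmoid(X @ theta), eps, 1 - eps)      # p_theta(y=1 | x)
    return np.mean(-y * np.log(p) - (1 - y) * np.log(1 - p))

# Toy usage (made-up data with a constant bias feature).
X = np.array([[1, 2.0], [1, 0.5], [1, -1.0], [1, -2.0]])
y = np.array([1, 1, 0, 0])
print(nll_loss(np.array([0.0, 1.0]), X, y))   # lower is better
print(nll_loss(np.array([0.0, -1.0]), X, y))  # worse: the boundary points the wrong way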


Logistic Regression: gradient

• Logistic NLL loss: ℒθ(x, y) = −y log σ(θ⊺x) − (1 − y) log(1 − σ(θ⊺x))

• Gradient:

−∇θℒθ(x, y) = ( y σ′(θ⊺x)/σ(θ⊺x) − (1 − y) σ′(θ⊺x)/(1 − σ(θ⊺x)) ) x
            = ( y(1 − σ(θ⊺x)) − (1 − y) σ(θ⊺x) ) x
            = ( y − pθ(y = 1 | x) ) x

(the first term is the error for y = 1; the second is the error for y = 0, where the update is toward −x)

• Compare:

‣ Perceptron: (y − ŷ)x — constant error (±2), insensitive to the margin

‣ Logistic MSE: −∇θℒθ(x, y) = 2(y − σ(θ⊺x)) σ′(θ⊺x) x — 0 gradient for bad (confidently wrong) mistakes

[Figures: step function T(r) and sigmoid σ(r) vs. the response r]


Recap

• Linear classifiers:

‣ Perceptron

‣ Logistic classifier

• Measuring decision quality:

‣ Error rate / 0–1 loss

‣ MSE loss

‣ Negative log-likelihood (Logistic Regression)

• Learning the weights

‣ Perceptron algorithm — not quite gradient-based (or gradient of weird loss)

‣ Gradient-based optimization of surrogate loss (MSE / NLL)


Today's lecture

Learning perceptrons

Logistic regression

Multi-class classifiers

VC dimension


Multi-class linear models

• How to predict multiple classes?

• Idea: have a linear response per class: r_c = θ_c⊺x

‣ Choose the class with the largest response: fθ(x) = arg max_c θ_c⊺x

• Linear boundary between classes c₁, c₂:

θ_c₁⊺x ≶ θ_c₂⊺x ⟺ (θ_c₁ − θ_c₂)⊺x ≶ 0


Multi-class linear models

• More generally: add features Φ(x, y) — they can even depend on y!

fθ(x) = arg max_y θ⊺Φ(x, y)

• Example: y = ±1, Φ(x, y) = xy

⟹ fθ(x) = arg max_y yθ⊺x = +1 if +θ⊺x > −θ⊺x, −1 if +θ⊺x < −θ⊺x

= sign(θ⊺x) — the perceptron!


Multi-class linear models

• More generally: add features Φ(x, y) — they can even depend on y!

fθ(x) = arg max_y θ⊺Φ(x, y)

• Example: y ∈ {1, 2, …, C}, Φ(x, y) = [0 0 ⋯ x ⋯ 0] = one-hot(y) ⊗ x, with θ = [θ₁ ⋯ θ_C]

⟹ fθ(x) = arg max_c θ_c⊺x — the largest linear response


Multi-class perceptron algorithm

• While not done:

‣ For each data point (x, y) ∈ 𝒟:

- Predict: ŷ = arg max_c θ_c⊺x

- Increase the response for the true class: θ_y ← θ_y + αx

- Decrease the response for the predicted class: θ_ŷ ← θ_ŷ − αx (see the code sketch below)

• More generally:

‣ Predict: ŷ = arg max_y θ⊺Φ(x, y)

‣ Update: θ ← θ + α(Φ(x, y) − Φ(x, ŷ))
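A minimal numpy sketch of the per-class update above (the learning rate, epoch count, helper name, and toy data are assumptions for illustration).

import numpy as np

def multiclass_perceptron(X, y, num_classes, alpha=1.0, epochs=50):
    """Multi-class perceptron: predict argmax_c theta_c^T x; on a mistake,
    increase the response of the true class and decrease the predicted one."""
    theta = np.zeros((num_classes, X.shape[1]))
    for _ in range(epochs):
        mistakes = 0
        for x_j, y_j in zip(X, y):
            y_hat = np.argmax(theta @ x_j)
            if y_hat != y_j:
                theta[y_j] += alpha * x_j     # push true class response up
                theta[y_hat] -= alpha * x_j   # push predicted class response down
                mistakes += 1
        if mistakes == 0:
            break
    return theta

# Toy usage: 3 classes, constant bias feature (made-up data).
X = np.array([[1, 0.0, 2.0], [1, 2.0, 0.0], [1, -2.0, -2.0], [1, 0.5, 2.5]])
y = np.array([0, 1, 2, 0])
theta = multiclass_perceptron(X, y, num_classes=3)
print(np.argmax(X @ theta.T, axis=1))  # matches y once converged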


Multilogit Regression

• Define multi-class probabilities:

pθ(y | x) = exp(θ_y⊺x) / ∑_c exp(θ_c⊺x) = softmax_c(θ_c⊺x)

‣ For binary y: pθ(y = 1 | x) = exp(θ₁⊺x) / (exp(θ₁⊺x) + exp(θ₂⊺x)) = 1 / (1 + exp((θ₂ − θ₁)⊺x)) = σ((θ₁ − θ₂)⊺x)

— that is, Logistic Regression with θ = θ₁ − θ₂

• Benefits:

‣ Probabilistic predictions: knows its confidence

‣ Linear decision boundary: arg max_y exp(θ_y⊺x) = arg max_y θ_y⊺x

‣ NLL is convex


Multilogit Regression: gradient

• NLL loss: ℒθ(x, y) = −log pθ(y | x) = −θ_y⊺x + log ∑_c exp(θ_c⊺x)

• Gradient (for each class weight vector θ_c):

−∇θ_c ℒθ(x, y) = δ(y = c) x − ( ∇θ_c ∑_c′ exp(θ_c′⊺x) ) / ∑_c′ exp(θ_c′⊺x)
              = ( δ(y = c) − exp(θ_c⊺x) / ∑_c′ exp(θ_c′⊺x) ) x
              = ( δ(y = c) − pθ(c | x) ) x

(make the true class more likely, make all other classes less likely)

• Compare to the multi-class perceptron: (δ(y = c) − δ(ŷ = c)) x
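A minimal sketch of the softmax probabilities and this NLL gradient, applied as one full-batch gradient step (the class count, step size, helper names, and data are made up for illustration).

import numpy as np

def softmax(R):
    """Row-wise softmax of a response matrix R (one row per example)."""
    R = R - R.max(axis=1, keepdims=True)       # for numerical stability
    E = np.exp(R)
    return E / E.sum(axis=1, keepdims=True)

def nll_and_grad(theta, X, y, num_classes):
    """Multilogit NLL and its gradient; theta has shape (num_classes, num_features)."""
    P = softmax(X @ theta.T)                            # p_theta(c | x) per example
    nll = -np.mean(np.log(P[np.arange(len(y)), y]))
    Y = np.eye(num_classes)[y]                          # one-hot true labels, delta(y = c)
    grad = -(Y - P).T @ X / len(y)                      # -(delta(y=c) - p_theta(c|x)) x, averaged
    return nll, grad

# Toy usage: one gradient-descent step (made-up data, 3 classes, bias feature).
X = np.array([[1, 0.0, 2.0], [1, 2.0, 0.0], [1, -2.0, -2.0]])
y = np.array([0, 1, 2])
theta = np.zeros((3, 3))
nll, grad = nll_and_grad(theta, X, y, num_classes=3)
theta -= 1.0 * grad
print(nll, nll_and_grad(theta, X, y, num_classes=3)[0])  # NLL should decrease after the step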


Today's lecture

Learning perceptrons

Logistic regression

Multi-class classifiers

VC dimension


Complexity measures

• What are we looking for in a measure of model class complexity?

‣ Tell us something about the generalization error ℒtest − ℒtraining (also called: risk − empirical risk)

‣ Tell us how it depends on the amount of data m

‣ Be easy to find for a given model class — haha jk, not gonna happen (more later)

• Ideally: a way to select model complexity (other than validation)

‣ Akaike Information Criterion (AIC) — roughly: loss + #parameters

‣ Bayesian Information Criterion (BIC) — roughly: loss + #parameters ⋅ log m

- But what's the #parameters, effectively? Should reparameterizing f_{θ₁,θ₂} = g_{θ=h(θ₁,θ₂)} change the complexity?


Model expressiveness

• Model complexity also measures expressiveness / representational power

• Tradeoff:

‣ More expressive can reduce error, but may also overfit to training data

‣ Less expressive may not be able to represent the true pattern / trend

• Examples: sign(θ₀ + θ₁x₁ + θ₂x₂) (a linear boundary) vs. sign(x₁² + x₂² − θ) (a circular boundary)

[Figures: the decision boundaries of the two model classes on the same 2D data, axes from −3 to 3]


Shattering

• Separability / realizability: there's a model that classifies all points correctly

• Shattering: the points are separable regardless of their labels

‣ Our model class can shatter points x^(1), …, x^(h) if for any labeling y^(1), …, y^(h) there exists a model that classifies all of them correctly

- The model class must have at least as many models as labelings (C^h for C classes)


Shattering: examples

• Example: can fθ(x) = sign(θ₀ + θ₁x₁ + θ₂x₂) shatter these points?

• Example: can fθ(x) = sign(x₁² + x₂² − θ) shatter these points?

[Figures: candidate point sets in the (x₁, x₂) plane for each model class]


Vapnik–Chervonenkis (VC) dimension

• VC dimension H: the maximum number of points that can be shattered by a model class

• A game:

‣ Fix a model class fθ : x → ŷ, θ ∈ Θ

‣ Player 1: choose h points x^(1), …, x^(h)

‣ Player 2: choose labels y^(1), …, y^(h)

‣ Player 1: choose a model θ

‣ Are all y^(j) = fθ(x^(j))? ⟹ Player 1 wins

• h ≤ H ⟹ Player 1 can win, otherwise cannot win:

∃x^(1), …, x^(h) : ∀y^(1), …, y^(h) : ∃θ : ∀j : y^(j) = fθ(x^(j))


VC dimension: example (1)

• VC dimension H: the maximum number of points that can be shattered by a class

• To find H, think like the winning player: Player 1 for h ≤ H, Player 2 for h > H

• Example: fθ(x) = sign(x₁² + x₂² − θ)

‣ We can place one point and "shatter" it

‣ We can prevent shattering any two points: make the distant one blue

⟹ H = 1


VC dimension: example (2)

• Example: fθ(x) = sign(θ₀ + θ₁x₁ + θ₂x₂)

‣ We can place 3 points and shatter them ⟹ H = 3

‣ We can prevent shattering any 4 points:

- If they form a convex shape, alternate the labels

- Otherwise, label the point inside the triangle differently

• Linear classifiers (perceptrons) over d features have VC-dim d + 1

‣ But VC-dim is generally not #parameters
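A brute-force sketch (not from the slides) of checking shattering: enumerate all 2^h labelings of a point set and search for a 2D linear classifier sign(θ₀ + θ₁x₁ + θ₂x₂) realizing each one. The random search over θ, the helper name, and the point sets are assumptions; the search is a crude stand-in for an exact separability test.

import itertools
import numpy as np

def can_shatter(points, num_trials=20000, seed=0):
    """Check (approximately) whether sign(theta0 + theta1*x1 + theta2*x2)
    can realize every labeling in {-1,+1}^h of the given 2D points."""
    rng = np.random.default_rng(seed)
    X = np.hstack([np.ones((len(points), 1)), np.asarray(points, dtype=float)])
    thetas = rng.normal(size=(num_trials, 3))          # random candidate classifiers
    predictions = np.sign(X @ thetas.T)                # h x num_trials prediction table
    for labels in itertools.product([-1, 1], repeat=len(points)):
        if not np.any(np.all(predictions == np.array(labels)[:, None], axis=0)):
            return False                               # some labeling was never realized
    return True

print(can_shatter([(0, 0), (1, 0), (0, 1)]))          # 3 points in general position: True
print(can_shatter([(0, 0), (1, 0), (0, 1), (1, 1)]))  # 4 points: False (e.g., the XOR labeling)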


VC generalization bound

• The VC-dim H of a model class can be used to bound the generalization loss:

‣ With probability at least 1 − η, we will get a "good" dataset, for which

test loss − training loss ≤ √( (H log(2m/H) + H − log(η/4)) / m )

(the left-hand side is the generalization loss)

• We need a larger training size m:

‣ The better generalization we need

‣ The more complex (higher VC-dim) our model class

‣ The more likely we want to be to get a good training sample
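A small numeric sketch of this bound (assuming the square-root form reconstructed above; the values of H, m, and η are made up), showing how it shrinks with more data and grows with the VC dimension.

import numpy as np

def vc_bound(H, m, eta=0.05):
    """Bound on (test loss - training loss) holding with probability at least 1 - eta."""
    return np.sqrt((H * np.log(2 * m / H) + H - np.log(eta / 4)) / m)

for m in [100, 1000, 10000]:
    print(m, round(vc_bound(H=3, m=m), 3), round(vc_bound(H=10, m=m), 3))
# the bound decreases with m and increases with the VC dimension H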


Model selection with VC-dim

• Using validation / cross-validation:

‣ Estimate the loss on a held-out set

‣ Use the validation loss to select the model

• Using the VC dimension:

‣ Use the generalization bound to select the model

‣ Structural Risk Minimization (SRM)

‣ The bound is not tight, and may be too conservative

[Figures: training loss and validation loss vs. model complexity; training loss, VC bound, and test-loss bound vs. model complexity]


Logistics

Assignments:

• Assignment 3 due next Tuesday, Oct 26

Midterm:

• Midterm exam on Nov 4, 11am–12:20 in SH 128

• If you're eligible to be remote — let us know by Oct 28

• If you're eligible for more time — let us know by Oct 28

• Review during lecture next Thursday