

Lecture 08 – Part 01: Model Complexity

[Figure: scatter plot of example data, y vs. x]

Empirical Risk Minimization
1. Pick a model.
   ▶ E.g., linear prediction rules.
2. Pick a loss.
   ▶ E.g., mean squared error.
3. Find a prediction rule minimizing the risk.

Big Decision
▶ Pick a model.
▶ Picking the wrong model causes problems.

Underfitting
▶ Fit H(x) = w_0 + w_1 x?
▶ We have underfit the data.

[Figure: linear fit H(x) = w_0 + w_1 x underfitting the data]

Overfitting
▶ Fit H(x) = w_0 + w_1 x + w_2 x^2 + … + w_10 x^10?
▶ We have overfit the data.

[Figure: degree-10 polynomial fit overfitting the data]

Model Complexity
▶ Difference? Complexity.
▶ Complex models are highly flexible.
   ▶ They tend to overfit.
▶ Simple models are stiff.
   ▶ They tend to underfit.

Degree 10 Polynomial
[Figure sequence: ten animation frames of a degree-10 polynomial fit to the data]

Degree 1 Polynomial
[Figure sequence: ten animation frames of a degree-1 (linear) fit to the data]

Example: kNN
▶ 1NN: complex model, likely to overfit.
▶ 20NN: less complex model, likely to underfit.
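As an illustration only (not part of the slides), here is how those two models might be instantiated, assuming scikit-learn is available:

```python
from sklearn.neighbors import KNeighborsClassifier

# 1NN: predictions follow the single nearest training point,
# so the model is very flexible and prone to overfitting.
one_nn = KNeighborsClassifier(n_neighbors=1)

# 20NN: predictions average over 20 neighbors,
# so the model is stiffer and prone to underfitting.
twenty_nn = KNeighborsClassifier(n_neighbors=20)

# Both are fit the same way: one_nn.fit(X_train, y_train), etc.
```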

Choosing Model Complexity
▶ How do we choose between two models?
   ▶ Between degree 10 and degree 1?
   ▶ Between 1NN and 20NN?
▶ Not always obvious.

Bad Idea: Use training MSE
▶ Which has smaller MSE on the training data?
▶ Problem: the best degree-10 polynomial will always have smaller MSE on the training data.


Good Idea: Use validation MSE
▶ We care about generalization.
▶ So keep a small amount of data hidden in a validation set.
▶ Fit the model on the training data, compute MSE on the validation set.
▶ Pick whichever model has smaller validation error.
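A minimal sketch of this model-selection procedure in NumPy; the synthetic data, split sizes, and candidate degrees below are made up for illustration and are not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data for illustration: a noisy nonlinear relationship.
x = rng.uniform(0, 1, size=60)
y = 5 * x + 10 * x**2 + rng.normal(scale=1.0, size=x.size)

# Hide a small validation set; fit only on the training portion.
idx = rng.permutation(x.size)
train, val = idx[:45], idx[45:]

for degree in (1, 10):
    w = np.polyfit(x[train], y[train], degree)                # fit on training data
    val_mse = np.mean((np.polyval(w, x[val]) - y[val]) ** 2)  # MSE on validation set
    print(f"degree {degree}: validation MSE = {val_mse:.3f}")
# Pick whichever degree gives the smaller validation MSE.
```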

What do you expect?
▶ You fit a complex model on the training data.
▶ Test it on a validation set.
▶ Likely: validation MSE > training MSE.

What do you expect?
▶ You fit a very simple model on the training data.
▶ Test it on a validation set.
▶ Likely: validation MSE ≈ training MSE.

Cross-Validation
▶ We want all the training data we can get.
▶ Reserving some of it is wasteful.
▶ Idea: split the data into pieces; each piece takes a turn as the validation set.

k-Fold Cross-Validation
1. Split the data set into k pieces, S_1, …, S_k.
2. Loop k times; on iteration i:
   ▶ Use S_i as the validation set; the rest as training.
   ▶ Compute the validation error ε_i.
3. Overall error: (1/k) Σ_i ε_i.
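The loop above translates almost directly into code. A sketch in plain NumPy; the `fit` and `error` callables are hypothetical hooks for whatever model and loss are being compared:

```python
import numpy as np

def k_fold_cv(x, y, k, fit, error, seed=0):
    """Split into k pieces S_1..S_k, let each take a turn as the
    validation set, and return the average validation error."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)       # S_1, ..., S_k
    errors = []
    for i in range(k):
        val = folds[i]                                        # S_i: validation set
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        h = fit(x[train], y[train])                           # train on the rest
        errors.append(error(h, x[val], y[val]))               # validation error eps_i
    return np.mean(errors)                                    # (1/k) * sum_i eps_i

# Example use (degree-1 polynomial, squared error):
# err = k_fold_cv(x, y, k=5,
#                 fit=lambda a, b: np.polyfit(a, b, 1),
#                 error=lambda w, a, b: np.mean((np.polyval(w, a) - b) ** 2))
```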

Leave-One-Out Cross-Validation
▶ Suppose we have n labeled data points.
▶ LOOCV: k-fold CV with k = n.

Another Approach
▶ We can control complexity by choosing the model.
▶ Also: via regularization.

Regularization
▶ Let's fit a complex model: w_0 + w_1 x + … + w_10 x^10.
▶ But impose a budget on the weights, w_0, …, w_d.

[Figure: scatter plot of the data, y vs. x]

Budgeting Weights
▶ One way to budget: ask that ‖w‖² be small.
▶ Before: minimize
      R_sq(w) = (1/n) Σ_{i=1}^{n} (w · x^(i) − y_i)²
▶ Now: minimize
      R̃_sq(w) = (1/n) Σ_{i=1}^{n} (w · x^(i) − y_i)² + λ‖w‖²

Solution
▶ The regularized normal equations:
      (X^T X + λI) w = X^T y
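A small NumPy sketch that solves the regularized normal equations directly for a degree-10 polynomial. It follows the slide's formula, which penalizes every weight including w_0; leaving the bias unpenalized is a common variation.

```python
import numpy as np

def ridge_fit(x, y, degree, lam):
    """Solve (X^T X + lam * I) w = X^T y for the weight vector w."""
    X = np.vander(x, degree + 1, increasing=True)  # columns: 1, x, x^2, ..., x^degree
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def ridge_predict(w, x):
    """Evaluate the fitted polynomial at new points."""
    return np.vander(x, len(w), increasing=True) @ w

# e.g., w = ridge_fit(x, y, degree=10, lam=1e-4)
```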

Example
λ = 0
[Figure: degree-10 polynomial fit with λ = 0]

Example
λ = 1 × 10^{-4}
[Figure: degree-10 polynomial fit with λ = 1 × 10^{-4}]

Example
λ = 1
[Figure: degree-10 polynomial fit with λ = 1]

Regularization
▶ As λ increases, simpler models are preferred.
▶ Pick λ using cross-validation.

Other Penalizations
▶ ‖w‖_2² is ℓ2 regularization (explicit solution).
   ▶ a.k.a. ridge regression.
▶ ‖w‖_1 is ℓ1 regularization (no explicit solution).
   ▶ a.k.a. the LASSO.
   ▶ Encourages sparse w.
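For reference, both penalties are available off the shelf; a hedged sketch assuming scikit-learn, whose `alpha` parameter plays the role of λ here:

```python
from sklearn.linear_model import Lasso, Ridge

# ell_2 penalty: shrinks all weights smoothly toward zero (ridge regression).
ridge = Ridge(alpha=1.0)

# ell_1 penalty: tends to drive some weights exactly to zero, giving a sparse w (LASSO).
lasso = Lasso(alpha=1.0)

# Both are fit with .fit(X, y); inspect .coef_ to compare the learned weights.
```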


Lecture 08 – Part 02: Logistic Regression

Note
▶ The midterm will cover everything up to right now.
▶ Regularization: yes.
▶ Logistic regression: no.

Predicting Heart Disease
▶ Classification problem: does a patient have heart disease?
▶ Features: blood pressure, cholesterol level, exercise amount, maximum heart rate, sex.

Better idea...
▶ Instead of predicting yes/no...
▶ Give a probability that they have heart disease.
   ▶ 1 = definitely yes
   ▶ 0 = definitely no
   ▶ 0.75 = probably yes
   ▶ …

Associations
▶ If cholesterol is high, increased likelihood.
   ▶ Positive association.
▶ If exercise is low, increased likelihood.
   ▶ Negative association.

The Model
▶ Measure cholesterol (x_1), exercise (x_2), etc.
▶ Idea: a weighted¹ “vote” for heart disease:
      w_1 x_1 + w_2 x_2 + … + w_d x_d
▶ Convention:
   ▶ A positive number = a vote for yes.
   ▶ A negative number = a vote for no.

¹ We’ll learn the weights later.

The Model
▶ Add a “bias” term:
      w_0 + w_1 x_1 + w_2 x_2 + … + w_d x_d = w · Aug(x)
▶ The more positive w · Aug(x), the more likely.
▶ The more negative w · Aug(x), the less likely.

Converting to a Probability
▶ Probabilities are between 0 and 1.
▶ Problem: w · Aug(x) can be anything in (−∞, ∞).
▶ We need to convert it to a probability.

The Logistic Function
      σ(t) = 1 / (1 + e^{−t})

[Figure: the logistic (sigmoid) curve, rising from 0 to 1 as its input goes from −4 to 4]

The Model
▶ Our simplified model for the probability of heart disease:
      H(x) = σ(w · Aug(x))
▶ What should w be?
▶ To find w, use the principle of maximum likelihood.
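A minimal NumPy sketch of the pieces defined so far; the weight vector itself would come from maximum likelihood, which is developed next:

```python
import numpy as np

def sigma(t):
    """The logistic function: maps any real number into (0, 1)."""
    return 1 / (1 + np.exp(-t))

def augment(x):
    """Aug(x): prepend a 1 so that w_0 acts as the bias term."""
    return np.concatenate(([1.0], x))

def predict_probability(w, x):
    """H(x) = sigma(w . Aug(x)): the modeled probability of heart disease."""
    return sigma(np.dot(w, augment(x)))
```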

Maximum Likelihood
▶ Suppose you have an unfair coin.
▶ The probability of heads is p, unknown.
▶ Flip it 8 times and see: H, H, T, H, H, H, H, T.
▶ Which is more likely: p = 0.5 or p = 0.75?

Maximum Likelihood
▶ Assume the coin flips are independent.
▶ The likelihood of H, H, T, H, H, H, H, T is:
      L(p) = p · p · (1 − p) · p · p · p · p · (1 − p) = p^6 (1 − p)^2
▶ Idea: find p maximizing L(p).
   ▶ Equivalently, find p maximizing log L(p).

Maximum Likelihood
Find p maximizing log L(p) = log [p^6 (1 − p)^2]:
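The slide leaves the maximization as an exercise; here is one way the calculation goes:

```latex
\log L(p) = \log\!\left[p^{6}(1-p)^{2}\right] = 6\log p + 2\log(1-p)

\frac{d}{dp}\log L(p) = \frac{6}{p} - \frac{2}{1-p} = 0
\quad\Longrightarrow\quad 6(1-p) = 2p
\quad\Longrightarrow\quad p = \frac{6}{8} = \frac{3}{4}
```

This matches the intuition that the maximum-likelihood estimate is the observed fraction of heads, 6/8.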

Maximum Likelihood
▶ In general, given n flips F_1, …, F_n with observed outcomes f_1, …, f_n (n_1 heads and n_2 tails), maximize:
      log [P(F_1 = f_1) · P(F_2 = f_2) ⋯ P(F_n = f_n)] = Σ_{i=1}^{n} log P(F_i = f_i)

Back to Logistic Regression
▶ The probability that person i has heart disease:
      H(x^(i)) = σ(w · Aug(x^(i)))
▶ Gather a data set (x^(1), y_1), …, (x^(n), y_n).
▶ What is the most likely w?

Maximum Likelihood
▶ Suppose 3 people, with outcomes (+, −, +).
▶ Likelihood:
      H(x^(1)) · (1 − H(x^(2))) · H(x^(3))
      = σ(w · Aug(x^(1))) · (1 − σ(w · Aug(x^(2)))) · σ(w · Aug(x^(3)))
      = [1 / (1 + e^{−w·Aug(x^(1))})] · [1 − 1 / (1 + e^{−w·Aug(x^(2))})] · [1 / (1 + e^{−w·Aug(x^(3))})]

Observation
▶ Note:
      1 − 1 / (1 + e^{−t}) = 1 / (1 + e^{t})

Maximum Likelihood
▶ The likelihood:
      [1 / (1 + e^{−w·Aug(x^(1))})] · [1 / (1 + e^{w·Aug(x^(2))})] · [1 / (1 + e^{−w·Aug(x^(3))})]
▶ Suppose y_i = 1 if positive, y_i = −1 if negative:
      [1 / (1 + e^{−y_1 w·Aug(x^(1))})] · [1 / (1 + e^{−y_2 w·Aug(x^(2))})] · [1 / (1 + e^{−y_3 w·Aug(x^(3))})]

Maximum Likelihood
▶ In general, the likelihood is:
      L(w) = Π_{i=1}^{n} 1 / (1 + e^{−y_i w·Aug(x^(i))})
▶ The log likelihood is:
      log L(w) = − Σ_{i=1}^{n} log [1 + e^{−y_i w·Aug(x^(i))}]
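A direct NumPy translation of the log likelihood above, assuming `X_aug` is the n × (d+1) matrix whose rows are Aug(x^(i)) and `y` holds the ±1 labels:

```python
import numpy as np

def log_likelihood(w, X_aug, y):
    """log L(w) = -sum_i log(1 + exp(-y_i * (w . Aug(x^(i)))))."""
    margins = y * (X_aug @ w)            # y_i * (w . Aug(x^(i))) for each person i
    # np.logaddexp(0, -m) computes log(1 + exp(-m)) without overflow.
    return -np.sum(np.logaddexp(0, -margins))
```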

Maximizing Likelihood
▶ Goal: find w maximizing log L.
▶ Take the gradient, set it to zero, solve?
▶ Problem: try it; you’ll get stuck.
▶ Unlike least-squares regression, there is no explicit solution.

Next Time
How to maximize the log likelihood (equivalently, minimize the log loss) with gradient descent.
