
Page 1: IML Tutorial 2 Regression

26.02.2020 Xianyao Zhang (CGL/DRS) [email protected]

Page 2: Correction of HW 1 on Moodle

• Question 14
• https://piazza.com/class/k6i4ygvjdai2re?cid=40

Page 3: Today’s Tutorial

• A recap of the recent lectures on regression
• More in-depth demos based on Prof. Krause’s demos
• Also a bit about Python usage
• Please only ask questions about this tutorial
  • Unless you think it is relevant enough and I can definitely answer it :)

Pages 4–8: Linear Regression

• Model: y = w1x1 + w2x2 + … + wd−1xd−1 + w0
• OR: y = wTx, with w = [w0, w1, w2, …, wd−1]T and x = [1, x1, x2, …, xd−1]T
• Data distribution: (xi, yi) ∼ P(x,y)
  • e.g. y = uTx + ϵ, ϵ ∼ N(0,1); e.g. y ∼ N(∥x∥2, σ2)
• True risk of model w: R(w) = 𝔼P(x,y)[(y − wTx)2]
• Data is usually infinite. How to estimate the true risk?
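The model above can be sketched in a few lines of NumPy; the weights `u` below are hypothetical, chosen only for illustration, with the bias w0 absorbed into the weight vector via the leading 1 in x.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 4  # dimension of the augmented input [1, x1, x2, x3]
u = np.array([0.3, 0.5, -1.0, 2.0])  # hypothetical "true" weights; u[0] plays the role of w0

def predict(w, x):
    """Linear model y = w^T x; x is already augmented with a leading 1."""
    return w @ x

x_raw = rng.normal(size=d - 1)
x = np.concatenate(([1.0], x_raw))  # x = [1, x1, ..., x_{d-1}]
y = predict(u, x) + rng.normal()    # y = u^T x + eps, with eps ~ N(0, 1)
```

Absorbing the bias this way is what lets all the later formulas treat the model as a single inner product wTx.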

Pages 9–11: Monte Carlo Estimation

• Given a function f( ⋅ ) and a distribution p( ⋅ ) in a domain Ω, estimate 𝔼X∼p[f(X)]
• 𝔼X∼p[f(X)] = ∫Ω f(x)p(x) dx ≈ (1/N) ∑i=1..N f(x(i))
  • x(i) are i.i.d. samples from distribution p
• This estimate is unbiased: 𝔼x(i)∼p[(1/N) ∑i=1..N f(x(i))] = ∫Ω f(x)p(x) dx
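A minimal sketch of the Monte Carlo estimate above, with the illustrative choices f(x) = x² and p = N(0, 1), so the true expectation is 𝔼[X²] = 1:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return x ** 2                    # example integrand: E_p[X^2] = 1 for p = N(0, 1)

N = 100_000
samples = rng.normal(size=N)         # i.i.d. samples x^(i) ~ p
mc_estimate = f(samples).mean()      # (1/N) * sum_i f(x^(i))
```

Averaging i.i.d. evaluations of f keeps the estimate unbiased, and its standard error shrinks like 1/√N.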

Pages 12–13: Monte Carlo Estimation

• In general, to estimate the integral ∫Ω f(x) dx
• Use samples x(i) ∼ q from a distribution q( ⋅ ) instead of p: ∫Ω f(x) dx ≈ (1/N) ∑i=1..N f(x(i))/q(x(i))
• Unbiased if q(x) is non-zero wherever f(x) is non-zero

Page 14: [Figure] Image credit: Wojciech Jarosz

Page 15: Monte Carlo Estimation

• 𝔼X∼p[f(X)] = ∫Ω f(x)p(x) dx ≈ (1/N) ∑i=1..N f(x(i))
• Can also use another distribution q( ⋅ ): ∫Ω f(x)p(x) dx = ∫Ω (f(x)p(x)/q(x)) q(x) dx ≈ (1/N) ∑i=1..N f(x(i))p(x(i))/q(x(i))
  • x(i) ∼ q instead of p
• Unbiased if q(x) is non-zero wherever f(x)p(x) is non-zero
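A sketch of this importance-sampling variant: the target is again 𝔼X∼p[X²] = 1 with p = N(0, 1), but the samples come from a wider proposal q = N(0, 2²) and are reweighted by p(x)/q(x). Both the integrand and the proposal are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def f(x):
    return x ** 2                          # E_p[f(X)] = 1 for p = N(0, 1)

N = 200_000
x = rng.normal(0.0, 2.0, size=N)           # x^(i) ~ q = N(0, 2^2), not p
weights = normal_pdf(x, 0.0, 1.0) / normal_pdf(x, 0.0, 2.0)   # p(x)/q(x)
is_estimate = (f(x) * weights).mean()      # (1/N) * sum_i f(x^(i)) p(x^(i)) / q(x^(i))
```

The Gaussian q is non-zero everywhere f·p is, so the reweighted estimate stays unbiased.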

Pages 16–18: Linear Regression

• Hence we can use a finite dataset to estimate the true risk R(w) = 𝔼P(x,y)[(y − wTx)2]
• Dataset: D = {(xi, yi)}i=1..N, D ∼ PD, with i.i.d. data examples (xi, yi) ∼ P(x, y)
• Empirical risk of model w on D: RD(w) = (1/N) ∑i=1..N (yi − wTxi)2
• Unbiased estimate of R(w) if we only use it to evaluate w
• But we want to find a good model w with D (training)!

Page 19: Closed-form Solution

• w = arg min_w ∑i=1..N (yi − wTxi)2 ⟹ w = (XTX)−1XTy
• X = [x1, x2, …, xN]T ∈ ℝN×d, y = [y1, y2, …, yN]T ∈ ℝN
• Reformulate: y = Xw, usually over-constrained (N ≫ d)
• Least squares!
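The closed form above can be sketched with NumPy on synthetic data; solving the normal equations with `np.linalg.solve` rather than forming the inverse explicitly is the usual numerically safer route.

```python
import numpy as np

rng = np.random.default_rng(0)

N, d = 200, 4                             # over-constrained: N >> d
X = rng.normal(size=(N, d))
u = np.array([1.0, -2.0, 0.5, 3.0])       # hypothetical true weights
y = X @ u + 0.01 * rng.normal(size=N)     # nearly consistent system y = Xw

# w = (X^T X)^{-1} X^T y, computed by solving (X^T X) w = X^T y
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
```

With small noise and N ≫ d, the least-squares solution recovers u almost exactly.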

Pages 20–22: Gradient Descent

• w0 ∈ ℝd: initialization
• wt = wt−1 − ηt ∇R(wt−1): update at step t = 1, 2, 3, …
• Convex function: convergence guaranteed for small ηt
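A gradient-descent sketch for the least-squares risk; the constant step size η = 0.1 and the noiseless synthetic data are illustrative choices that make the convergence easy to check.

```python
import numpy as np

rng = np.random.default_rng(0)

N, d = 100, 3
X = rng.normal(size=(N, d))
u = np.array([2.0, -1.0, 0.5])            # hypothetical true weights
y = X @ u                                 # noiseless, so w should recover u

w = np.zeros(d)                           # w_0: initialization
eta = 0.1                                 # step size eta_t (kept constant here)
for t in range(500):
    grad = (2.0 / N) * X.T @ (X @ w - y)  # gradient of R_D at the current iterate
    w = w - eta * grad                    # w_t = w_{t-1} - eta_t * grad
```

Because the least-squares risk is convex, this converges for any sufficiently small step size, as the slide states.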

Pages 23–26: Non-linear Features

• y = wTx → y = vTϕ(x), with w ∈ ℝm, v ∈ ℝn
• ϕ(x) is nonlinear: ℝm → ℝn
  • e.g. x = [x1, x2, x3]T, ϕ(x) = [1, x1, x1^2, x2^2 x3, ln 5x3, e^(x2−x1)]T
• Can lead to better models if a good ϕ( ⋅ ) is selected
• Worse models when picking a bad one :/
• Also, tricky to pick the “just right” ones
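The slide's example feature map, written out as a hypothetical Python function (the particular components have no special meaning beyond being nonlinear):

```python
import numpy as np

def phi(x):
    """Nonlinear feature map R^3 -> R^6 from the example above."""
    x1, x2, x3 = x
    return np.array([1.0, x1, x1 ** 2, x2 ** 2 * x3,
                     np.log(5 * x3), np.exp(x2 - x1)])

x = np.array([1.0, 2.0, 3.0])
v = np.ones(6)          # hypothetical weights for the model y = v^T phi(x)
y = v @ phi(x)
```

The model stays linear in the parameters v, so everything before (least squares, gradient descent) carries over unchanged with ϕ(x) in place of x.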

Pages 27–29: Which Models Are Better?

• The models w with lower true risk R(w)
• Under-fitting: not enough capacity
• Over-fitting: too much capacity ⟹ fitting the noise!
• Good model: neither under-fitting nor over-fitting

Pages 30–33: Training/Testing Split

• Empirical risk RD(wD) usually underestimates the true risk R(wD): 𝔼D[RD(wD)] ≤ 𝔼D[R(wD)]
• “Too optimistic” about the model
• OR: the model only performs well on training data
• Unbiased estimate of the true risk with a random test set: 𝔼Dtest[RDtest(wDtrain)] = R(wDtrain)
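A sketch of why the held-out set matters: fitting many features to pure-noise labels drives the training risk well below the best achievable true risk (here Var(y) = 1), while the test risk exposes the optimism. All sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

N, d = 200, 50                     # many features relative to data -> easy to overfit
X = rng.normal(size=(N, d))
y = rng.normal(size=N)             # pure noise: no model can beat true risk 1

idx = rng.permutation(N)           # random train/test split
train, test = idx[:100], idx[100:]

w = np.linalg.lstsq(X[train], y[train], rcond=None)[0]
train_risk = np.mean((y[train] - X[train] @ w) ** 2)
test_risk = np.mean((y[test] - X[test] @ w) ** 2)
```

The training risk is "too optimistic"; only the risk on held-out data is an unbiased estimate of the true risk.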

Pages 34–36: Validation and Testing Sets?

• If we use only the training/testing split, we can overfit the testing set: RDtest(wDtrain) ≠ R(wDtrain)
• Do not select the model based on the test set
• Validation set = reserve part of the training set for model selection
  • Actually the “test set” before is a validation set
• Cross-validation = avoid the bias of the validation-set selection

Page 37: Cross-validation

• Demo: k-fold CV for model selection

[Figure] Image credit: sklearn
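The demo itself is not part of this transcript; here is a minimal k-fold cross-validation sketch in plain NumPy (the sklearn demo presumably used utilities like `KFold`; the fold count and data below are made up).

```python
import numpy as np

rng = np.random.default_rng(0)

N, d, k = 100, 3, 5
X = rng.normal(size=(N, d))
u = np.array([1.0, 2.0, -1.0])                    # hypothetical true weights
y = X @ u + 0.1 * rng.normal(size=N)

folds = np.array_split(rng.permutation(N), k)     # k disjoint validation folds
val_risks = []
for i in range(k):
    val = folds[i]                                # fold i is held out
    trn = np.concatenate([folds[j] for j in range(k) if j != i])
    w = np.linalg.lstsq(X[trn], y[trn], rcond=None)[0]
    val_risks.append(np.mean((y[val] - X[val] @ w) ** 2))

cv_risk = float(np.mean(val_risks))               # average held-out risk
```

Averaging over all k held-out folds removes the arbitrariness of a single validation-set choice.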

Page 38: Regularization

• “Our models cannot be that complex; those large weights can only come from noise”
  ⟹ Penalize large weights in the loss function

Pages 39–41: Regularization

• min_w RD(w) + λC(w)
• Linear regression: RD(w) = (1/N) ∑i=1..N (yi − wTxi)2
• Ridge (L2): C(w) = ∥w∥2^2 = ∑k=1..d wk^2, has a closed-form solution
• Lasso (L1): C(w) = ∥w∥1 = ∑k=1..d |wk|, doesn’t have a closed-form solution
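A ridge sketch: for RD(w) + λ∥w∥2^2, setting the gradient to zero gives (XTX/N + λI)w = XTy/N, so the closed form is a single linear solve; λ and the data below are arbitrary illustrative values. The lasso has no such formula and needs an iterative solver.

```python
import numpy as np

rng = np.random.default_rng(0)

N, d, lam = 100, 5, 0.5
X = rng.normal(size=(N, d))
y = rng.normal(size=N)

# Closed-form ridge: minimize (1/N) * sum_i (y_i - w^T x_i)^2 + lam * ||w||_2^2
w_ridge = np.linalg.solve(X.T @ X / N + lam * np.eye(d), X.T @ y / N)
w_ols = np.linalg.lstsq(X, y, rcond=None)[0]        # unregularized fit, for comparison
```

The penalty shrinks every component of the solution, so the ridge weights have strictly smaller norm than the unregularized least-squares weights.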

Pages 43–46: Standardization

• The “small-weight” idea only applies when the data is standardized
• e.g. x1 is income (∼10^4), x2 is altitude (∼10^3), x3 is height (∼10^0)
  • Originally w1 = 0.1, w2 = 2, w3 = 2000
  • Penalize ∥w∥2^2 and get w1 = w2 = w3 = 1
  • x3 will be useless!
• Standardize when using regularization: xij ← (xij − μj)/σj
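The standardization formula as a NumPy sketch; the three column scales mimic the income/altitude/height example (all values are made up).

```python
import numpy as np

rng = np.random.default_rng(0)

# Columns on wildly different scales, as in the slide's example
X = np.column_stack([
    rng.normal(5e4, 1e4, size=100),   # x1: income  (~10^4)
    rng.normal(1e3, 3e2, size=100),   # x2: altitude (~10^3)
    rng.normal(1.7, 0.1, size=100),   # x3: height  (~10^0)
])

mu = X.mean(axis=0)                   # per-feature mean mu_j
sigma = X.std(axis=0)                 # per-feature std sigma_j
X_std = (X - mu) / sigma              # x_ij <- (x_ij - mu_j) / sigma_j
```

After this transform every feature has mean 0 and std 1, so a uniform weight penalty treats them fairly.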

Page 47: End of Presentation / Beginning of Q&A
