Learning From Data, Lecture 12: Regularization · work.caltech.edu/slides/slides12.pdf · 2014-12-17


Page 1: Review of Lecture 11

Review of Lecture 11

• Overfitting: fitting the data more than is warranted

[Figure: Data, Target, and an overfitting Fit]

VC allows it; doesn't predict it

• Fitting the noise: stochastic/deterministic

• Deterministic noise

[Figures: deterministic noise (best approximation $h^*$ vs. target $f$), and its level as a function of the number of data points $N$ and the target complexity $Q_f$]

Page 2: Lecture 12: Regularization

Learning From Data
Yaser S. Abu-Mostafa
California Institute of Technology
Lecture 12: Regularization

Sponsored by Caltech's Provost Office, E&AS Division, and IST • Thursday, May 10, 2012

Page 3: Outline

Outline

• Regularization - informal
• Regularization - formal
• Weight decay
• Choosing a regularizer


Page 4: Two approaches to regularization

Two approaches to regularization

Mathematical: ill-posed problems in function approximation

Heuristic: handicapping the minimization of $E_{\text{in}}$


Page 5: A familiar example

A familiar example

[Figures: fits to the data, without regularization (left, wild overfit curves) and with regularization (right, smooth curves)]


Page 6: and the winner is ...

and the winner is . . .

without regularization vs. with regularization

[Figures: average hypothesis $g(x)$ against the target $\sin(\pi x)$ for each case]

without regularization: bias = 0.21, var = 1.69
with regularization: bias = 0.23, var = 0.33
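For concreteness, here is a minimal Python sketch of this kind of experiment, assuming the setup from Lecture 11: two-point datasets drawn from the target $\sin(\pi x)$, a line fitted to each, and weight decay toggled through lam. The λ behind the slide's exact numbers isn't given, so the value below is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
xs = np.linspace(-1, 1, 201)
target = np.sin(np.pi * xs)

def fit_line(x, y, lam=0.0):
    # least squares for h(x) = w0 + w1*x, with optional weight decay lam
    Z = np.column_stack([np.ones_like(x), x])
    return np.linalg.solve(Z.T @ Z + lam * np.eye(2), Z.T @ y)

def bias_var(lam, trials=5000):
    preds = np.empty((trials, xs.size))
    for t in range(trials):
        x = rng.uniform(-1.0, 1.0, size=2)       # two-point dataset
        w = fit_line(x, np.sin(np.pi * x), lam)
        preds[t] = w[0] + w[1] * xs
    gbar = preds.mean(axis=0)                    # average hypothesis
    bias = np.mean((gbar - target) ** 2)
    var = np.mean(preds.var(axis=0))
    return bias, var

print(bias_var(0.0))   # no regularization: low bias, large variance
print(bias_var(0.1))   # weight decay: slightly more bias, far less variance
```

Without regularization, the fitted lines swing wildly from dataset to dataset (large var); a little weight decay tames them at a small cost in bias.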

Page 7: The polynomial model

The polynomial model

$\mathcal{H}_Q$: polynomials of order $Q$, i.e. linear regression in $\mathcal{Z}$ space

$$\mathbf{z} = \begin{bmatrix} 1 \\ L_1(x) \\ \vdots \\ L_Q(x) \end{bmatrix} \qquad \mathcal{H}_Q = \left\{ \sum_{q=0}^{Q} w_q\, L_q(x) \right\}$$

Legendre polynomials:

$L_1 = x,\quad L_2 = \tfrac{1}{2}(3x^2 - 1),\quad L_3 = \tfrac{1}{2}(5x^3 - 3x),\quad L_4 = \tfrac{1}{8}(35x^4 - 30x^2 + 3),\quad L_5 = \tfrac{1}{8}(63x^5 - \cdots)$
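As a sketch, the corresponding feature matrix $Z$ can be built with numpy's Legendre utilities (build_legendre_features is an illustrative helper name, not from the lecture):

```python
import numpy as np

def build_legendre_features(x, Q):
    """Map inputs x in [-1, 1] to z = (L_0(x), L_1(x), ..., L_Q(x)).
    Returns an N x (Q+1) matrix Z; column 0 is L_0 = 1."""
    return np.polynomial.legendre.legvander(np.asarray(x), Q)
```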


Page 8: Unconstrained solution

Unconstrained solution

Given $(x_1, y_1), \dots, (x_N, y_N) \longrightarrow (z_1, y_1), \dots, (z_N, y_N)$

Minimize $E_{\text{in}}(\mathbf{w}) = \dfrac{1}{N} \sum_{n=1}^{N} \left( \mathbf{w}^T \mathbf{z}_n - y_n \right)^2$

Minimize $\dfrac{1}{N} (Z\mathbf{w} - \mathbf{y})^T (Z\mathbf{w} - \mathbf{y})$

$\mathbf{w}_{\text{lin}} = (Z^T Z)^{-1} Z^T \mathbf{y}$

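A minimal numpy sketch of this unconstrained solution; lstsq (the pseudo-inverse route) is used rather than forming $(Z^TZ)^{-1}$ explicitly, and agrees with the formula whenever $Z^TZ$ is invertible:

```python
import numpy as np

def linear_regression(Z, y):
    """w_lin minimizing (1/N)||Zw - y||^2; lstsq also handles rank-deficient Z."""
    w, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return w
```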

Page 9: Constraining the weights

Constraining the weights

Hard constraint: $\mathcal{H}_2$ is a constrained version of $\mathcal{H}_{10}$, with $w_q = 0$ for $q > 2$

Softer version: $\displaystyle\sum_{q=0}^{Q} w_q^2 \le C$ (a soft-order constraint)

Minimize $\dfrac{1}{N} (Z\mathbf{w} - \mathbf{y})^T (Z\mathbf{w} - \mathbf{y})$ subject to $\mathbf{w}^T\mathbf{w} \le C$

Solution: $\mathbf{w}_{\text{reg}}$ instead of $\mathbf{w}_{\text{lin}}$


Page 10: Solving for w_reg

Solving for $\mathbf{w}_{\text{reg}}$

[Figure: contours of constant $E_{\text{in}}$ around $\mathbf{w}_{\text{lin}}$, with the constraint circle $\mathbf{w}^T\mathbf{w} = C$; at the constrained optimum, $\nabla E_{\text{in}}$ is normal to the constraint surface]

Minimize $E_{\text{in}}(\mathbf{w}) = \dfrac{1}{N} (Z\mathbf{w} - \mathbf{y})^T (Z\mathbf{w} - \mathbf{y})$ subject to $\mathbf{w}^T\mathbf{w} \le C$

At the solution, $\nabla E_{\text{in}}(\mathbf{w}_{\text{reg}}) \propto -\mathbf{w}_{\text{reg}}$; write it as $\nabla E_{\text{in}}(\mathbf{w}_{\text{reg}}) = -2\,\dfrac{\lambda}{N}\,\mathbf{w}_{\text{reg}}$, so that

$$\nabla E_{\text{in}}(\mathbf{w}_{\text{reg}}) + 2\,\frac{\lambda}{N}\,\mathbf{w}_{\text{reg}} = 0$$

Equivalently: minimize $E_{\text{in}}(\mathbf{w}) + \dfrac{\lambda}{N}\,\mathbf{w}^T\mathbf{w}$ unconstrained, where larger $C$ corresponds to smaller $\lambda$ ($C \uparrow\ \lambda \downarrow$)


Page 11: Augmented error

Augmented error

Minimizing $E_{\text{aug}}(\mathbf{w}) = E_{\text{in}}(\mathbf{w}) + \dfrac{\lambda}{N}\,\mathbf{w}^T\mathbf{w} = \dfrac{1}{N}(Z\mathbf{w} - \mathbf{y})^T(Z\mathbf{w} - \mathbf{y}) + \dfrac{\lambda}{N}\,\mathbf{w}^T\mathbf{w}$ unconditionally

solves

Minimizing $E_{\text{in}}(\mathbf{w}) = \dfrac{1}{N}(Z\mathbf{w} - \mathbf{y})^T(Z\mathbf{w} - \mathbf{y})$ subject to $\mathbf{w}^T\mathbf{w} \le C$ ← the VC formulation


Page 12: The solution

The solution

Minimize $E_{\text{aug}}(\mathbf{w}) = E_{\text{in}}(\mathbf{w}) + \dfrac{\lambda}{N}\,\mathbf{w}^T\mathbf{w} = \dfrac{1}{N}\left( (Z\mathbf{w} - \mathbf{y})^T(Z\mathbf{w} - \mathbf{y}) + \lambda\,\mathbf{w}^T\mathbf{w} \right)$

$$\nabla E_{\text{aug}}(\mathbf{w}) = 0 \;\Longrightarrow\; Z^T(Z\mathbf{w} - \mathbf{y}) + \lambda\,\mathbf{w} = 0$$

$\mathbf{w}_{\text{reg}} = (Z^T Z + \lambda I)^{-1} Z^T \mathbf{y}$ (with regularization)

as opposed to $\mathbf{w}_{\text{lin}} = (Z^T Z)^{-1} Z^T \mathbf{y}$ (without regularization)

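A minimal numpy sketch of this closed form (ridge_weights is an illustrative name; solving the linear system is preferable to explicitly inverting $Z^TZ + \lambda I$):

```python
import numpy as np

def ridge_weights(Z, y, lam):
    """w_reg = (Z^T Z + lam*I)^{-1} Z^T y; lam = 0 recovers w_lin."""
    d = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ y)
```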

Page 13: The result

The result

Minimizing $E_{\text{in}}(\mathbf{w}) + \dfrac{\lambda}{N}\,\mathbf{w}^T\mathbf{w}$ for different $\lambda$'s:

λ = 0, λ = 0.0001, λ = 0.01, λ = 1

[Figures: the four fits against Data and Target, from a wild curve at λ = 0 to a nearly flat one at λ = 1]

overfitting → → → → underfitting


Page 14: Weight 'decay'

Weight 'decay'

Minimizing $E_{\text{in}}(\mathbf{w}) + \dfrac{\lambda}{N}\,\mathbf{w}^T\mathbf{w}$ is called weight decay. Why?

Gradient descent:

$$\mathbf{w}(t+1) = \mathbf{w}(t) - \eta\,\nabla E_{\text{in}}\big(\mathbf{w}(t)\big) - 2\,\eta\,\frac{\lambda}{N}\,\mathbf{w}(t) = \mathbf{w}(t)\left(1 - \frac{2\,\eta\,\lambda}{N}\right) - \eta\,\nabla E_{\text{in}}\big(\mathbf{w}(t)\big)$$
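As a sketch, one such update in Python (grad_Ein is an assumed callback computing $\nabla E_{\text{in}}$ at the current weights):

```python
def weight_decay_step(w, grad_Ein, eta, lam, N):
    """w(t+1) = w(t)*(1 - 2*eta*lam/N) - eta*grad_Ein(w(t)):
    the weights shrink ('decay') before the usual gradient step."""
    return w * (1.0 - 2.0 * eta * lam / N) - eta * grad_Ein(w)
```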

The same applies in neural networks:

$$\mathbf{w}^T\mathbf{w} = \sum_{l=1}^{L}\; \sum_{i=0}^{d^{(l-1)}}\; \sum_{j=1}^{d^{(l)}} \left( w_{ij}^{(l)} \right)^2$$


Page 15: Variations of weight decay

Variations of weight decay

Emphasis of certain weights: $\displaystyle\sum_{q=0}^{Q} \gamma_q\, w_q^2$

Examples: $\gamma_q = 2^q \Rightarrow$ low-order fit; $\gamma_q = 2^{-q} \Rightarrow$ high-order fit

Neural networks: different layers get different γ's

Tikhonov regularizer: $\mathbf{w}^T \Gamma^T \Gamma\, \mathbf{w}$
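All of these share one closed form. A hedged numpy sketch (tikhonov_weights is an illustrative name, not from the lecture):

```python
import numpy as np

def tikhonov_weights(Z, y, lam, Gamma):
    """w = (Z^T Z + lam * Gamma^T Gamma)^{-1} Z^T y.
    Gamma = I gives plain weight decay; Gamma = diag(sqrt(gamma_q))
    gives the per-coefficient emphasis sum_q gamma_q * w_q^2."""
    A = Z.T @ Z + lam * (Gamma.T @ Gamma)
    return np.linalg.solve(A, Z.T @ y)
```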


Page 16: Even weight growth!

Even weight growth!

We could instead 'constrain' the weights to be large: bad!

Practical rule: stochastic noise is 'high-frequency', and deterministic noise is also non-smooth

⇒ constrain learning towards smoother hypotheses

[Figure: expected $E_{\text{out}}$ vs. regularization parameter λ; weight decay lowers $E_{\text{out}}$ while weight growth raises it]


Page 17: General form of augmented error

General form of augmented error

Calling the regularizer $\Omega = \Omega(h)$, we minimize

$$E_{\text{aug}}(h) = E_{\text{in}}(h) + \frac{\lambda}{N}\,\Omega(h)$$

Rings a bell? Compare the VC bound:

$$E_{\text{out}}(h) \le E_{\text{in}}(h) + \Omega(\mathcal{H})$$

$E_{\text{aug}}$ is better than $E_{\text{in}}$ as a proxy for $E_{\text{out}}$


Page 18: Outline

Outline

• Regularization - informal
• Regularization - formal
• Weight decay
• Choosing a regularizer


Page 19: The perfect regularizer Ω

The perfect regularizer Ω

Constraint in the 'direction' of the target function (going in circles)

Guiding principle: direction of smoother or simpler

Chose a bad Ω? We still have λ!

Regularization is a necessary evil


Page 20: Neural-network regularizers

Neural-network regularizers

Weight decay: from linear to logical

[Figure: linear, tanh, and hard-threshold (±1) activations; small weights keep tanh near-linear, large weights push it toward a hard threshold]

Weight elimination: fewer weights ⇒ smaller VC dimension

Soft weight elimination:

$$\Omega(\mathbf{w}) = \sum_{i,j,l} \frac{\left( w_{ij}^{(l)} \right)^2}{\beta^2 + \left( w_{ij}^{(l)} \right)^2}$$
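A minimal sketch of this penalty in numpy (representing the network as a list of per-layer weight matrices is an assumption for illustration):

```python
import numpy as np

def soft_weight_elimination(layer_weights, beta):
    """Omega(w) = sum over all weights of w^2 / (beta^2 + w^2):
    roughly quadratic (weight decay) for |w| << beta,
    roughly constant (weight counting) for |w| >> beta."""
    return sum(np.sum(W**2 / (beta**2 + W**2)) for W in layer_weights)
```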


Page 21: Early stopping as a regularizer

Early stopping as a regularizer

[Figure: error vs. epochs over 10,000 epochs; $E_{\text{in}}$ keeps decreasing while $E_{\text{out}}$ bottoms out partway through, marking the early-stopping point]

Regularization through the optimizer! When to stop? Validation.
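A hedged sketch of the idea (train_one_epoch and validation_error are hypothetical callbacks, not from the lecture):

```python
import copy

def train_with_early_stopping(model, train_one_epoch, validation_error,
                              max_epochs=10000):
    """Train while tracking validation error; return the snapshot
    taken where validation error was lowest."""
    best_err, best_model = float("inf"), copy.deepcopy(model)
    for _ in range(max_epochs):
        train_one_epoch(model)
        err = validation_error(model)
        if err < best_err:
            best_err, best_model = err, copy.deepcopy(model)
    return best_model
```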


Page 22: The optimal λ

The optimal λ

Stochastic noise vs. deterministic noise

[Figures: expected $E_{\text{out}}$ vs. regularization parameter λ. Left: stochastic noise levels σ² = 0, 0.25, 0.5. Right: deterministic noise via target complexity $Q_f$ = 15, 30, 100. In both cases, more noise moves the optimal λ higher.]
