
Machine Learning Basics III

Benjamin Roth, Nina Poerner

CIS LMU München


Outline

1 Deep Feedforward Networks
  - Motivation
  - Non-Linearities
  - Training

2 Regularization


Why Regression is not Enough

Let x1, x2 ∈ {0, 1}. We want the XOR function, s.t.

f(x1, x2) = 1 if x1 ≠ x2, 0 otherwise

Can we learn this function using only logistic regression?

[Figure: the four XOR points in the unit square over axes x1 and x2 — f(x) = 1 at (0, 1) and (1, 0); f(x) = 0 at (0, 0) and (1, 1)]


g(x1, x2) = σ(θ0 + θ1x1 + θ2x2)

f(x1, x2) = 1 if g(x1, x2) > 0.5, 0 otherwise

⟹ f(x1, x2) = 1 if θ0 + θ1x1 + θ2x2 > 0, 0 otherwise

This would require:
- θ0 ≤ 0
- θ0 + θ1 > 0
- θ0 + θ2 > 0 ⟹ θ2 > 0
- θ0 + θ1 + θ2 ≤ 0 ✗ (contradicts the above)

The classes are not linearly separable!
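To see this empirically, here is a minimal numpy sketch (ours, not from the slides): gradient descent on the log-loss of a logistic regression never separates the XOR points; it converges to θ ≈ 0 and predicts 0.5 for every input.

```python
import numpy as np

# The four XOR points and their labels.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0., 1., 1., 0.])
Xb = np.hstack([np.ones((4, 1)), X])        # prepend a bias column

sigmoid = lambda z: 1 / (1 + np.exp(-z))

theta = np.random.default_rng(0).normal(size=3)  # random start
for _ in range(20_000):
    p = sigmoid(Xb @ theta)                 # g(x1, x2)
    theta -= 0.1 * Xb.T @ (p - y)           # gradient step on the log-loss

print(np.round(theta, 3))                   # converges to ~[0, 0, 0]
print(sigmoid(Xb @ theta))                  # every prediction stuck near 0.5
```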


Why Regression is not Enough (contd.)

[Figure: the XOR points again, now with two linear separators — a blue line b and a red line r — whose left/right (L/R) sides jointly carve the plane into the regions f(x) = 0 and f(x) = 1]

b(x1, x2) = σ(θb0 + θb1 · x1 + θb2 · x2)

r(x1, x2) = σ(θr0 + θr1 · x1 + θr2 · x2)

g(x1, x2) = σ(θg0 + θg1 · b(x1, x2) + θg2 · r(x1, x2))

f(x1, x2) = I[g(x1, x2) > 0.5]
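To make the construction concrete, here is a minimal numpy sketch with hand-picked weights (our own illustrative values, not from the slides): b acts like OR, r acts like AND, and g fires when b is on but r is off.

```python
import numpy as np

sigmoid = lambda z: 1 / (1 + np.exp(-z))

def f(x1, x2):
    # Hand-picked weights (illustrative, not from the slides):
    # b(x) ~ OR(x1, x2), r(x) ~ AND(x1, x2), g ~ b AND NOT r.
    b = sigmoid(-10 + 20 * x1 + 20 * x2)
    r = sigmoid(-30 + 20 * x1 + 20 * x2)
    g = sigmoid(-10 + 20 * b - 20 * r)
    return int(g > 0.5)

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), f(x1, x2))   # prints 0, 1, 1, 0 -> XOR
```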


Deep Feedforward Networks

Network: f(x; θ) is a composition of two or more functions f^(n),

e.g., f(x) = f^(3)(f^(2)(f^(1)(x)))

Each f^(n) represents one layer in the network.

Input layer → hidden layer(s) → output layer

[Figure: input layer (x1, x2) → hidden layer (r, b) → output layer (g)]


“Neural” Networks

Inspired by biological neurons (nerve cells).

Neurons are connected to each other, and receive and send electrical pulses.

“If the [input] voltage changes by a large enough amount, an all-or-none electrochemical pulse called an action potential is generated, which travels rapidly along the cell’s axon, and activates synaptic connections with other cells when it arrives.” (Wikipedia)

all-or-none ≈ nonlinear


Why we need Non-Linearities

Fully linear multi-layer neural networks are not very expressive:

f(x1, x2) = θg1(θr1x1 + θr2x2) + θg2(θb1x1 + θb2x2)

⟺ f(x1, x2) = (θg1θr1 + θg2θb1)x1 + (θg1θr2 + θg2θb2)x2

Apply non-linear activation functions to neurons!
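A quick numpy check of this collapse (our sketch, with arbitrary random weights): two stacked linear layers compute exactly the same function as a single linear layer with the product weight matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))      # first linear layer
W2 = rng.normal(size=(3, 2))      # second linear layer
x = rng.normal(size=4)

h = W1.T @ x                      # hidden layer, no non-linearity
out = W2.T @ h                    # output layer

# The two layers collapse into one linear map with matrix W1 @ W2:
print(np.allclose(out, (W1 @ W2).T @ x))   # True
```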


Non-Linearities for Hidden Layers

Rectified Linear Unit (relu):
- relu(z) = max(0, z)
- relu has a consistent gradient of 1 when a neuron is active, but zero gradient otherwise

A two-layer FFN with relu can solve XOR:

f(x; W, b, v) = vᵀ relu(Wᵀx + b)

W = [1 1]   b = [ 0]   v = [ 1]
    [1 1]       [-1]       [-2]

Question: Would this FFN still solve XOR if we remove relu? Why not?
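The given parameters can be checked directly; a minimal numpy sketch:

```python
import numpy as np

# The parameters given on the slide.
W = np.array([[1., 1.],
              [1., 1.]])
b = np.array([0., -1.])
v = np.array([1., -2.])

relu = lambda z: np.maximum(0., z)

def f(x):
    return v @ relu(W.T @ x + b)    # f(x; W, b, v) = v^T relu(W^T x + b)

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), f(np.array([x1, x2], dtype=float)))  # 0, 1, 1, 0 -> XOR
```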


Non-Linearities for Hidden Layers (contd.)

σ(z)

tanh(z) = 2σ(2z) − 1

Sigmoidal functions have only a small “linear” region before they saturate (“flatten out”) in both directions.

This means that gradients become very small for big inputs

Practice shows that this is okay in conjunction with log-loss
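A small numpy illustration of the saturation effect (our sketch): the derivative of σ decays rapidly as |z| grows.

```python
import numpy as np

sigmoid = lambda z: 1 / (1 + np.exp(-z))
dsigmoid = lambda z: sigmoid(z) * (1 - sigmoid(z))   # derivative of sigma

for z in [0., 2., 5., 10.]:
    print(z, dsigmoid(z))
# 0.0   0.25
# 2.0   ~0.105
# 5.0   ~0.0066
# 10.0  ~4.5e-05  -> the gradient all but vanishes for large |z|
```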


Non-Linearities for Output Units

Depends on what you are trying to predict!

If you are predicting a real number (e.g., a house price), a linear activation might work...

For classification:
- To predict every class individually:
  - elementwise σ
  - → no constraints on how many classes can be true
  - n independent Bernoulli distributions
- To select one out of n classes:
  - softmax(z)_i = exp(z_i) / Σ_j exp(z_j)
  - → all probabilities sum to 1
  - Multinoulli (categorical) distribution
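A minimal numpy sketch contrasting the two choices (the max-shift inside the softmax is a standard numerical-stability trick, not from the slides):

```python
import numpy as np

z = np.array([2.0, -1.0, 0.5])            # output scores for 3 classes

# Predict every class individually: elementwise sigmoid.
probs_indep = 1 / (1 + np.exp(-z))
print(probs_indep, probs_indep.sum())      # sum is unconstrained

# Select one out of n classes: softmax (shift by max(z) for stability).
e = np.exp(z - z.max())
probs_softmax = e / e.sum()
print(probs_softmax, probs_softmax.sum())  # sums to 1
```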


Deep Feedforward Networks: Training

Loss function defined on the output layer, e.g. ||y − f(x; θ)||²

No loss function defined directly on the hidden layers.

Instead, the training algorithm must decide how to use the hidden layers most effectively to minimize the loss on the output layer.

Hidden layers can be viewed as providing a complex, more useful feature function φ(x) of the input (e.g., the blue and red separators).

Conceptually similar to hand-engineered input features for linear models, but fully data-driven.


Backpropagation

Forward propagation: input information x propagates through the network to produce the output y.

Calculate the cost J(θ), as you would with regression.

Compute gradients w.r.t. all model parameters θ...

... how?
- We know how to compute gradients w.r.t. parameters of the output layer (just like regression).
- How to calculate them w.r.t. parameters of the hidden layers?


Chain Rule of Calculus

Let x, y, z ∈ ℝ.
Let f, g : ℝ → ℝ.

y = g(x)
z = f(g(x))

Then

dz/dx = (dz/dy) · (dy/dx)


Chain Rule of Calculus: Vector-valued Functions

Let x ∈ ℝᵐ, y ∈ ℝⁿ, z ∈ ℝ.
Let f : ℝⁿ → ℝ, g : ℝᵐ → ℝⁿ.

y = g(x)
z = f(g(x)) = f(y)

Then

∂z/∂x_i = Σ_{j=1}^{n} (∂z/∂y_j) · (∂y_j/∂x_i)

In order to write this in vector notation, we need to define the Jacobian matrix.


Jacobian

The Jacobian is the matrix of all first-order partial derivatives of a vector-valued function:

∂y/∂x = [ ∂y1/∂x1  ···  ∂y1/∂xm ]
        [ ∂y2/∂x1  ···  ∂y2/∂xm ]
        [    ⋮       ⋱      ⋮    ]
        [ ∂yn/∂x1  ···  ∂yn/∂xm ]

How to write the chain rule in terms of gradients? We can write it as:

∇x z = (∂y/∂x)ᵀ ∇y z

where ∇x z is m×1, the Jacobian ∂y/∂x is n×m, and ∇y z is n×1.
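A small numpy check of this identity (our sketch, with an arbitrarily chosen linear g and quadratic f):

```python
import numpy as np

# g: R^3 -> R^2 (linear, so its Jacobian dy/dx is just A), f: R^2 -> R.
A = np.array([[1., 2., 0.],
              [0., 1., 3.]])
g = lambda x: A @ x
f = lambda y: (y ** 2).sum()        # gradient w.r.t. y is 2y

x = np.array([1., -2., 0.5])
grad_y = 2 * g(x)                   # nabla_y z
grad_x = A.T @ grad_y               # nabla_x z = (dy/dx)^T nabla_y z

# Compare with a numerical gradient of z = f(g(x)).
eps = 1e-6
num = np.array([(f(g(x + eps * e)) - f(g(x - eps * e))) / (2 * eps)
                for e in np.eye(3)])
print(np.allclose(grad_x, num))     # True
```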


Viewing the Network as a Graph

Nodes are function outputs (can be scalar or vector valued)

Arrows are functions

Example:

y = vᵀ relu(Wᵀx)

z = Wᵀx;  r = relu(z)

[Graph: x and W feed z = Wᵀx; z feeds r = max(0, z); r and v feed y = vᵀr; y feeds the cost J]


Forward pass

Green: Known or computed node

[Graph: starting from the known nodes x, W and v, the nodes z, r, y and J are computed in turn]
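As a minimal numpy sketch of this forward pass (the squared-error cost J is our assumption; the slide leaves it unspecified):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)              # input
W = rng.normal(size=(3, 4))         # hidden-layer weights
v = rng.normal(size=4)              # output weights
t = 1.0                             # target (assumption, not on the slide)

# Forward pass: compute each node of the graph in turn, caching the results.
z = W.T @ x                         # z = W^T x
r = np.maximum(0., z)               # r = max(0, z)
y = v @ r                           # y = v^T r
J = 0.5 * (y - t) ** 2              # cost J (squared error, our assumption)
```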


Backward pass

Red: Gradient of J w.r.t. node known or computed

[Graph: gradients flow back from J through y, then r, then z, to the parameters v and W]

∇y J = dJ/dy
∇v J = (∂y/∂v) dJ/dy
∇r J = (∂y/∂r) dJ/dy
∇z J = (∂r/∂z)ᵀ (∂y/∂r) dJ/dy
∇W J = (∂z/∂W)ᵀ (∂r/∂z)ᵀ (∂y/∂r) dJ/dy
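And the matching backward pass, continuing the forward sketch above (same squared-error assumption):

```python
# Backward pass: apply the chain rule node by node, reusing the cached
# forward values x, z, r, y from the sketch above.
dJ_dy = y - t                       # dJ/dy for the squared-error cost
grad_v = r * dJ_dy                  # (dy/dv) dJ/dy, since dy/dv = r
dJ_dr = v * dJ_dy                   # (dy/dr) dJ/dy, since dy/dr = v
dJ_dz = (z > 0) * dJ_dr             # relu passes gradient only where z > 0
grad_W = np.outer(x, dJ_dz)         # z_j = sum_i W_ij x_i => dJ/dW_ij = x_i dJ/dz_j
```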


Regularization

[Figure: three fits of prediction vs. feature on the same scattered data points, ranging from underfitting (too simple) to overfitting (fits every point)]

Overfitting vs. underfitting

Regularization: any modification to a learning algorithm for reducing its generalization error but not its training error

Build a “preference” into the model for some solutions in the hypothesis space

Unpreferred solutions are penalized: they are only chosen if they fit the training data much better than the preferred solutions


Regularization

Large parameters → overfitting

Prefer models with smaller weights

Popular regularizers:
- Penalize large L2 norm (= Euclidean norm) of weight vectors
- Penalize large L1 norm (= Manhattan norm) of weight vectors


L2-Regularization

Add a term that penalizes a large L2 norm of the weight vector θ

The amount of penalty is controlled by a parameter λ

J′(θ) = J(θ, x, y) + (λ/2) θᵀθ
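As a minimal numpy sketch (the squared-error base loss is our example choice; only the penalty term comes from the slide):

```python
import numpy as np

lam = 0.1                                # penalty strength lambda

def l2_objective(theta, X, y):
    # Squared-error loss (our example) plus (lambda/2) * theta^T theta.
    residual = X @ theta - y
    J = 0.5 * residual @ residual + 0.5 * lam * theta @ theta
    grad = X.T @ residual + lam * theta  # the penalty adds lam * theta
    return J, grad
```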


L2-Regularization

The surface of the objective function is now a combination of the original loss and the regularization penalty.


Summary

Feedforward networks: layers of (non-linear) function compositions

Non-Linearities for hidden layers: relu, tanh, ...

Non-Linearities for output units (classification): σ, softmax

Training via backpropagation: compute the gradient of the cost w.r.t. the parameters using the chain rule

Regularization: penalize large parameter values, e.g. by adding the L2-norm of the parameter vector to the loss


Outlook

“Manually” defining forward and backward passes in numpy is time-consuming.

Deep Learning frameworks let you define the forward pass as a “computation graph” made up of simple, differentiable operations (e.g., dot products).

They do the backward pass for you

tensorflow + keras, pytorch, theano, MXNet, CNTK, caffe, ...
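For instance, a minimal PyTorch sketch (illustrative, not from the slides) of the earlier XOR network, where the framework derives the backward pass automatically:

```python
import torch

# The relu XOR network from earlier, with autograd doing the backward pass.
x = torch.tensor([0., 1.])
W = torch.ones(2, 2, requires_grad=True)
b = torch.tensor([0., -1.], requires_grad=True)
v = torch.tensor([1., -2.], requires_grad=True)

y = v @ torch.relu(W.T @ x + b)     # forward pass builds the graph
y.backward()                        # framework computes all gradients

print(W.grad, b.grad, v.grad)       # no manual chain rule needed
```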
