Review Series of Recent Deep Learning Papers: Parameter Prediction

Paper: Learning to Learn by gradient descent by gradient descent
Marcin Andrychowicz, Misha Denil, Sergio Gómez Colmenarejo, Matthew W. Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, Nando de Freitas (NIPS 2016)

Reviewed by: Arshdeep Sekhon
Department of Computer Science, University of Virginia
https://qdata.github.io/deep2Read/
August 25, 2018




Optimizer

A standard machine learning algorithm

$\theta^* = \arg\min_{\theta} f(\theta)$ (1)

A standard optimizer

$\theta_{t+1} = \theta_t - \alpha_t \nabla f(\theta_t)$ (2)
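
As a concrete reference point, here is a minimal sketch of the hand-designed rule in Eq. (2) on a toy quadratic; the objective, the fixed step size, and the dimensions are illustrative assumptions, not from the paper.

```python
import torch

# Hand-designed gradient descent, Eq. (2):
# theta_{t+1} = theta_t - alpha_t * grad f(theta_t), with fixed alpha.
f = lambda theta: ((theta - 3.0) ** 2).sum()   # toy objective (assumption)
theta = torch.zeros(5, requires_grad=True)
alpha = 0.1                                    # fixed step size alpha_t
for t in range(50):
    grad, = torch.autograd.grad(f(theta), theta)
    with torch.no_grad():
        theta -= alpha * grad                  # the standard update rule
```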


Motivation

Optimization strategies tailored to different classes of tasks:

- Deep learning: high-dimensional, non-convex optimization problems (Adagrad, RMSprop, Rprop, etc.)
- Combinatorial optimization: relaxations

In general, these are all hand-designed update rules.


Learning to Learn

Meta-learning / Learning to Learn

Replace hand-designed update rules with a learned update rule:

$\theta_{t+1} = \theta_t + g_t(\nabla f(\theta_t), \phi)$ (3)

$g_t$ is the update produced by the optimizer, which has its own parameters.

The optimizer is a recurrent neural network, parameterized by $\phi$, that predicts the update at each timestep (a minimal sketch follows).
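
Below is a minimal sketch of applying Eq. (3), assuming an LSTM as the recurrent network; the architecture, the sizes, and the toy optimizee are illustrative assumptions, not the paper's exact setup.

```python
import torch
import torch.nn as nn

# Applying a learned update rule, Eq. (3): an RNN with parameters phi
# consumes the optimizee gradient and emits an additive update g_t.
n, hidden = 5, 16
rnn = nn.LSTMCell(n, hidden)             # phi lives in rnn and head
head = nn.Linear(hidden, n)

f = lambda theta: (theta ** 2).sum()     # toy optimizee (assumption)

theta = torch.randn(n, requires_grad=True)
h, c = torch.zeros(1, hidden), torch.zeros(1, hidden)
for t in range(10):
    grad, = torch.autograd.grad(f(theta), theta)
    h, c = rnn(grad.unsqueeze(0), (h, c))
    g_t = head(h).squeeze(0)             # the learned update
    with torch.no_grad():
        theta += g_t                     # theta_{t+1} = theta_t + g_t
```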


Generalization for optimizers

Goal

Find an optimizer with learned updates (instead of hand-designed updates) that performs well on a class of optimization problems.

Generalization in machine learning: the capacity to make predictions about the target at novel, unseen points.

Generalization in the 'Learning to Learn' context: the optimizer should perform well on unseen problems of the same 'type', i.e., transfer learning.


Learning to Learn with Recurrent Neural Networks

1. When can you say an optimizer is good?
2. $\theta^*$ (the optimal parameters) is a function of the class of functions $f$ being optimized and of the optimizer parameters:

   $\theta^*(f, \phi)$ (4)

3. Expected loss:

   $L(\phi) = \mathbb{E}_f\left[ f(\theta^*(f, \phi)) \right]$ (5)

4. Expected loss with an RNN optimizer:

   $L(\phi) = \mathbb{E}_f\left[ \sum_{t=1}^{T} w_t f(\theta_t) \right]$ (6)

   $\theta_{t+1} = \theta_t + g_t$ (7)

   $\begin{pmatrix} g_t \\ h_{t+1} \end{pmatrix} = m(\nabla_t, h_t, \phi)$

5. Generate updates at each timestep using a recurrent neural network, as sketched below.
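
A sketch of Eqs. (6)-(7): unroll a toy RNN optimizer $m$ for $T$ steps on one sampled $f$ and accumulate the weighted losses, giving one Monte Carlo sample of $L(\phi)$. The architecture, the toy $f$, and the choice $w_t = 1$ are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Unrolled meta-loss, Eqs. (6)-(7): L(phi) = sum_t w_t f(theta_t).
T, n, hidden = 20, 5, 16
m = nn.LSTMCell(n, hidden)               # the RNN optimizer (parameters phi)
head = nn.Linear(hidden, n)

f = lambda theta: (theta ** 2).sum()     # one sampled optimizee (assumption)

theta = torch.randn(n, requires_grad=True)
h, c = torch.zeros(1, hidden), torch.zeros(1, hidden)
meta_loss = 0.0
for t in range(T):
    loss_t = f(theta)
    meta_loss = meta_loss + loss_t       # w_t = 1 for all t (assumption)
    grad, = torch.autograd.grad(loss_t, theta, retain_graph=True)
    h, c = m(grad.unsqueeze(0), (h, c))  # (g_t, h_{t+1}) = m(grad_t, h_t, phi)
    theta = theta + head(h).squeeze(0)   # theta_{t+1} = theta_t + g_t
# meta_loss is differentiable w.r.t. phi and can now be minimized.
```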


Optimizing the optimizer

[Figure: the computational graph used for computing the gradient]

1. Minimize the loss $L(\phi)$ with respect to the optimizer parameters $\phi$.

2. Calculate $\partial L / \partial \phi$ and backpropagate through time, as in the sketch below.
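
Continuing in the same vein, here is a sketch of one meta-training step: build the unrolled loss $L(\phi)$, backpropagate through time to obtain $\partial L / \partial \phi$, and update $\phi$. Using Adam as the meta-optimizer and a 1-D toy optimizee are assumptions for illustration.

```python
import torch
import torch.nn as nn

# One meta-training step: minimize L(phi) w.r.t. phi via BPTT.
m = nn.LSTMCell(1, 8)
head = nn.Linear(8, 1)
phi = list(m.parameters()) + list(head.parameters())
meta_opt = torch.optim.Adam(phi, lr=1e-3)  # meta-optimizer (assumption)

f = lambda theta: (theta ** 2).sum()       # toy optimizee (assumption)

theta = torch.randn(1, requires_grad=True)
h, c = torch.zeros(1, 8), torch.zeros(1, 8)
meta_loss = 0.0
for t in range(20):
    loss_t = f(theta)
    meta_loss = meta_loss + loss_t
    grad, = torch.autograd.grad(loss_t, theta, retain_graph=True)
    h, c = m(grad.unsqueeze(0), (h, c))
    theta = theta + head(h).squeeze(0)

meta_opt.zero_grad()
meta_loss.backward()   # dL/dphi via backpropagation through time
meta_opt.step()        # gradient step on the optimizer's parameters phi
```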


Optimizing the optimizer (continued)

1. Backpropagate through the computational graph.

2. Assumption: the optimizee gradients do not depend on the optimizer parameters.

3. $\nabla_t = \nabla_\theta f(\theta_t)$; $\partial \nabla_t / \partial \phi = 0$: drop gradients along the dotted edges (shown explicitly below).
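
The sketches above already treat the optimizee gradient as a constant; made explicit, the assumption $\partial \nabla_t / \partial \phi = 0$ amounts to a detach on the gradient before it enters the update network. The tensors below are toy stand-ins, for illustration only.

```python
import torch

# Dropping the "dotted" edges: detach the optimizee gradient so no
# second derivatives flow through it to the optimizer parameters phi.
theta = torch.randn(3, requires_grad=True)
loss = (theta ** 2).sum()
grad_t = torch.autograd.grad(loss, theta, create_graph=True)[0]

full_input = grad_t            # would differentiate through grad_t (second order)
paper_input = grad_t.detach()  # treat grad_t as a constant: d(grad_t)/d(phi) = 0
```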


Coordinate-wise LSTM optimizer

1. Issue: the RNN would need a huge hidden state if we want an update for each parameter:

   $\theta_{t+1} = \theta_t + g_t$ (8)

   $\begin{pmatrix} g_t \\ h_{t+1} \end{pmatrix} = m(\nabla_t, h_t, \phi)$

2. Solution: use an RNN that produces updates coordinate-wise.
3. The coordinate-wise LSTM shares its weights across all parameters: separate hidden states, but shared parameters (see the sketch below).
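
A minimal sketch of the coordinate-wise idea: one small LSTM, with parameters $\phi$ shared across all coordinates, is applied to each gradient coordinate as a batch element, so every coordinate keeps its own hidden state. The sizes and the linear output head are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Coordinate-wise LSTM optimizer: shared weights, per-coordinate state.
class CoordinateWiseLSTM(nn.Module):
    def __init__(self, hidden_size=20):
        super().__init__()
        self.cell = nn.LSTMCell(1, hidden_size)  # shared across coordinates
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, grad, state):
        g = grad.unsqueeze(1)                    # (n_params, 1): coords as a batch
        h, c = self.cell(g, state)               # separate state per coordinate
        return self.head(h).squeeze(1), (h, c)   # update g_t and new states

n_params, hidden = 5, 20
opt = CoordinateWiseLSTM(hidden)
state = (torch.zeros(n_params, hidden), torch.zeros(n_params, hidden))
update, state = opt(torch.randn(n_params), state)  # one update per parameter
```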


Experiments and Results: Quadratic Functions

1. Quadratic functions:

   $f(\theta) = \lVert W\theta - y \rVert_2^2$ (9)

2. The optimizer is trained on random functions drawn from this family.
3. It is then tested on a new random function sampled from the same distribution.

4. Learning curves [figure]. A sketch of the setup follows.
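
A sketch of this setup under assumed dimensions: sample random quadratics from the family in Eq. (9) for meta-training, and hold out a fresh draw for testing.

```python
import torch

# Sampling the quadratic family of Eq. (9): f(theta) = ||W theta - y||_2^2.
def sample_quadratic(n=10):
    W, y = torch.randn(n, n), torch.randn(n)
    return lambda theta: ((W @ theta - y) ** 2).sum()

train_fns = [sample_quadratic() for _ in range(100)]  # meta-training draws
test_fn = sample_quadratic()                          # unseen function, same type

theta = torch.zeros(10)
print(float(test_fn(theta)))  # the loss an optimizer starts from on the test draw
```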
