Page 1:

Review Series of Recent Deep Learning Papers:

Parameter Prediction Paper: Learning to Learn by gradient descent by gradient descent
Marcin Andrychowicz, Misha Denil, Sergio Gómez Colmenarejo, Matthew W. Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, Nando de Freitas
NIPS 2016

Reviewed by: Arshdeep Sekhon
Department of Computer Science, University of Virginia
https://qdata.github.io/deep2Read/

August 25, 2018

Page 2:

Optimizer

A standard machine learning algorithm minimizes an objective:

θ* = arg min_θ f(θ)    (1)

A standard optimizer updates the parameters by gradient descent:

θ_{t+1} = θ_t − α_t ∇f(θ_t)    (2)
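For concreteness, a minimal sketch of the hand-designed update of Eq. (2) in Python/PyTorch; the quadratic objective and the fixed step size α = 0.1 are illustrative choices only:

```python
import torch

# Hand-designed update rule of Eq. (2): theta_{t+1} = theta_t - alpha_t * grad f(theta_t).
# The toy quadratic objective and the fixed step size are illustrative, not from the paper.
def f(theta):
    return 0.5 * (theta ** 2).sum()

theta = torch.randn(10, requires_grad=True)
alpha = 0.1                                   # hand-chosen, fixed learning rate

for t in range(100):
    loss = f(theta)
    grad, = torch.autograd.grad(loss, theta)  # gradient of f at theta_t
    with torch.no_grad():
        theta -= alpha * grad                 # theta_{t+1} = theta_t - alpha * grad
```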


Page 4:

Motivation

Optimization strategies tailored to different classes of tasks:

Deep Learning: high-dimensional, non-convex optimization problems (Adagrad, RMSprop, Rprop, etc.)
Combinatorial Optimization: relaxations

In general, these are hand-designed update rules.


Page 6:

Learning to Learn

Meta-learning / Learning to Learn

Replace hand-designed update rules with a learned update rule:

θ_{t+1} = θ_t + g_t(∇f(θ_t), φ)    (3)

g_t is the output of the learned optimizer, which has its own parameters φ

g_t is produced by a recurrent neural network, parameterized by φ, that predicts the update at each timestep (see the sketch below)
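A minimal sketch of Eq. (3), assuming PyTorch: a small LSTM with parameters φ maps the optimizee gradient to an update g_t. The network sizes and the toy gradient are illustrative, not the paper's configuration:

```python
import torch
import torch.nn as nn

# Sketch of Eq. (3): theta_{t+1} = theta_t + g_t, where the update g_t is
# predicted by a recurrent network with its own parameters phi.
class LearnedOptimizer(nn.Module):
    def __init__(self, num_params, hidden_size=20):
        super().__init__()
        self.rnn = nn.LSTMCell(num_params, hidden_size)
        self.out = nn.Linear(hidden_size, num_params)

    def forward(self, grad, state=None):
        h, c = self.rnn(grad, state)          # consume the optimizee gradient
        return self.out(h), (h, c)            # g_t and the carried hidden state

opt = LearnedOptimizer(num_params=10)
theta = torch.randn(1, 10)
grad = theta.clone()                          # gradient of the toy quadratic 0.5*||theta||^2
g_t, state = opt(grad)                        # learned update
theta = theta + g_t                           # Eq. (3), replacing -alpha * grad
```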


Page 8:

Generalization for optimizers

Goal

Find an optimizer with learned updates (instead of hand-designed updates) that performs well on a class of optimization problems.

Generalization in machine learning: the capacity to make predictions about the target at novel, unseen points

Generalization in the 'Learning to Learn' context: the optimizer should perform well on unseen problems of the same 'type':

transfer learning


Page 12:

Learning to Learn with Recurrent Neural Networks

1 When can you say an optimizer is good?

2 θ* (the optimal parameters) is a function of the class of functions f being optimized and of the optimizer parameters φ:

θ*(f, φ)    (4)

3 Expected loss:

L(φ) = E_f[ f(θ*(f, φ)) ]    (5)

4 Expected loss with an RNN optimizer:

L(φ) = E_f[ Σ_{t=1}^{T} w_t f(θ_t) ]    (6)

θ_{t+1} = θ_t + g_t    (7)
[g_t; h_{t+1}] = m(∇_t, h_t, φ)

5 Generate updates at each timestep using a recurrent neural network (a sketch of the unrolled meta-loss follows).
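A minimal sketch of Eqs. (6)-(8), assuming PyTorch and the LearnedOptimizer sketch above: the optimizer is unrolled for T steps on a sampled f, and the accumulated per-step losses (here equally weighted, w_t = 1) form one sample of L(φ):

```python
import torch

# Unroll the learned optimizer for T steps on a sampled function f and sum the
# per-step losses as the meta-objective L(phi). Assumes the LearnedOptimizer
# sketch above; w_t = 1 for all t is an illustrative choice.
def meta_loss(opt, f, theta0, T=20):
    # theta0 must have requires_grad=True so the optimizee gradient can be computed.
    theta, state, loss = theta0, None, 0.0
    for t in range(T):
        f_t = f(theta)
        loss = loss + f_t                                   # accumulate w_t * f(theta_t)
        grad, = torch.autograd.grad(f_t, theta, create_graph=True)
        g_t, state = opt(grad, state)                       # (g_t, h_{t+1}) = m(grad_t, h_t, phi)
        theta = theta + g_t                                 # Eq. (7): theta_{t+1} = theta_t + g_t
    return loss                                             # one sample of L(phi)
```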


Page 16:

Optimizing the optimizer

Figure: computational graph used for computing the gradient

1 Minimize the loss L(φ) w.r.t. the parameters φ of the optimizer

2 Calculate ∂L(φ)/∂φ and backpropagate through time (a sketch of this outer loop follows)
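The outer optimization might look like the following sketch, which backpropagates through the unrolled graph to obtain ∂L/∂φ. Here sample_problem is a hypothetical helper that draws a random f from the task family, and using Adam as the outer optimizer is an illustrative choice:

```python
import torch

# Outer loop: minimize L(phi) w.r.t. the optimizer parameters phi by backpropagating
# through the unrolled computational graph (BPTT). Uses the sketches above.
learned_opt = LearnedOptimizer(num_params=10)
outer = torch.optim.Adam(learned_opt.parameters(), lr=1e-3)

for step in range(1000):
    f = sample_problem()                               # hypothetical: draw a random f
    theta0 = torch.randn(1, 10, requires_grad=True)
    L = meta_loss(learned_opt, f, theta0, T=20)        # sum_t w_t f(theta_t), a function of phi
    outer.zero_grad()
    L.backward()                                       # dL/dphi via backprop through time
    outer.step()
```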


Page 17:

Optimizing the optimizer

1 Backpropagation through the computational graph

2 Assumption: optimizee gradients do not depend on the optimizer parameters

3 ∇_t = ∇_θ f(θ_t); ∂∇_t/∂φ = 0: drop gradients along the dotted edges (see the sketch below)
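In code, this assumption amounts to detaching the optimizee gradient before it is fed to the RNN. A sketch of the corresponding gradient step inside the unrolled loop (compare with the meta_loss sketch above, which used create_graph=True):

```python
import torch

# The assumption d(grad_t)/d(phi) = 0 corresponds to treating the optimizee gradient
# as a constant w.r.t. phi (the dotted edges of the computational graph).
def optimizee_gradient(f_t, theta):
    grad, = torch.autograd.grad(f_t, theta, retain_graph=True)  # keep graph for L.backward()
    return grad.detach()      # drop gradients along the dotted edges
```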


Page 20:

Coordinate-wise LSTM optimizer

1 Issue: a single RNN would need a huge hidden state if we want an update for each parameter

θ_{t+1} = θ_t + g_t    (8)
[g_t; h_{t+1}] = m(∇_t, h_t, φ)

2 Solution: use an RNN that produces the update coordinate-wise
3 A coordinate-wise LSTM shares its weights across all parameters

Separate hidden states, but shared parameters (see the sketch below)
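A minimal sketch of the coordinate-wise design, assuming PyTorch: the LSTM weights φ are shared across coordinates by treating each coordinate as a batch element, so the optimizer's parameter count does not grow with the size of the optimizee; the hidden size of 20 is illustrative:

```python
import torch
import torch.nn as nn

# Coordinate-wise sketch: one shared LSTMCell (parameters phi) is applied to every
# coordinate of theta, while each coordinate carries its own hidden state.
class CoordinateWiseOptimizer(nn.Module):
    def __init__(self, hidden_size=20):
        super().__init__()
        self.lstm = nn.LSTMCell(1, hidden_size)   # shared across all coordinates
        self.out = nn.Linear(hidden_size, 1)

    def forward(self, grad, state=None):
        g = grad.view(-1, 1)                      # (num_params, 1): coordinates as batch items
        h, c = self.lstm(g, state)                # state=None gives a zero state per coordinate
        return self.out(h).view_as(grad), (h, c)  # update g_t plus per-coordinate hidden state

opt = CoordinateWiseOptimizer()
theta = torch.randn(1000)                         # optimizer size does not grow with len(theta)
g_t, state = opt(theta.clone())                   # toy gradient: grad of 0.5*||theta||^2
theta = theta + g_t
```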


Page 21:

Experiments and Results: Quadratic Functions

1 Quadratic Functions:

f(θ) = ||Wθ − y||_2^2    (9)

2 The optimizer is trained on random functions from this family
3 It is tested on random functions sampled from the same distribution (see the sketch below)

4 Learning curves (figure)
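A minimal sketch of how this family might be sampled, assuming PyTorch; Gaussian entries for W and y and the 10-dimensional size are illustrative choices:

```python
import torch

# Quadratic family of Eq. (9): f(theta) = ||W theta - y||_2^2, with W and y drawn at random.
def sample_quadratic(dim=10):
    W = torch.randn(dim, dim)
    y = torch.randn(dim)
    return lambda theta: ((W @ theta.view(-1) - y) ** 2).sum()

f_train = sample_quadratic()   # the learned optimizer is trained on random draws like this
f_test = sample_quadratic()    # and evaluated on fresh draws from the same distribution
```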


