Review Series of Recent Deep Learning Papers
Parameter Prediction Paper: Learning to Learn by gradient descent by gradient descent
Marcin Andrychowicz, Misha Denil, Sergio Gómez Colmenarejo, Matthew W. Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, Nando de Freitas
NIPS 2016
Reviewed by: Arshdeep Sekhon
Department of Computer Science, University of Virginia
https://qdata.github.io/deep2Read/
August 25, 2018
Optimizer
A standard machine learning algorithm:
θ* = argmin_θ f(θ)    (1)
A standard optimizer:
θ_{t+1} = θ_t − α_t ∇f(θ_t)    (2)
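A minimal sketch of the hand-designed update in Eq. (2), assuming an illustrative quadratic objective and a hand-chosen, constant step size α; PyTorch is used here for continuity with the rest of this review, not because the paper prescribes it.

import torch

theta = torch.randn(10, requires_grad=True)

def f(theta):
    return (theta ** 2).sum()  # stand-in objective, not from the paper

alpha = 0.1  # hand-chosen step size alpha_t (held constant here)
for t in range(100):
    loss = f(theta)
    grad, = torch.autograd.grad(loss, theta)
    with torch.no_grad():
        theta -= alpha * grad  # Eq. (2): a fixed, hand-designed rule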
Motivation
Optimization strategies are tailored to different classes of tasks:
Deep learning: high-dimensional, non-convex optimization problems (Adagrad, RMSprop, Rprop, etc.)
Combinatorial optimization: relaxations
Generally, these are hand-designed update rules.
Learning to Learn
Meta-learning / Learning to Learn
Replace hand-designed update rules with a learned update rule:
θ_{t+1} = θ_t + g_t(∇f(θ_t), φ)    (3)
g_t is the optimizer's update, produced by a model with its own parameters φ
g_t comes from a recurrent neural network, parameterized by φ, that predicts the update at each timestep (see the sketch below)
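A minimal sketch of one step of the learned update in Eq. (3): a recurrent module m with parameters φ and hidden state h predicts the update from the gradient. The function interface is an assumption for illustration; the paper's concrete architecture appears later in these slides.

import torch

def learned_step(theta, f, m, h):
    # theta_{t+1} = theta_t + g_t, with (g_t, h_{t+1}) = m(grad_t, h_t)
    grad, = torch.autograd.grad(f(theta), theta)
    g, h = m(grad, h)  # the RNN (parameters phi) predicts the update
    return theta + g, h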
Generalization for optimizers
Goal
Find an optimizer with learned updates (instead of hand-designed updates) that performs well on a class of optimization problems.
Generalization in machine learning: the capacity to make predictions about the target at novel, unseen points
Generalization in the 'learning to learn' context: the optimizer should perform well on unseen problems of the same 'type':
transfer learning
Learning to Learn with Recurrent Neural Networks
1 When can you say an optimizer is good?
2 θ* (the optimal parameters) is a function of the class of functions f being optimized and of the optimizer parameters:
θ*(f, φ)    (4)
3 Expected loss:
L(φ) = E_f [ f(θ*(f, φ)) ]    (5)
4 Expected loss with an RNN optimizer:
L(φ) = E_f [ Σ_{t=1}^T w_t f(θ_t) ]    (6)
θ_{t+1} = θ_t + g_t    (7)
(g_t, h_{t+1}) = m(∇_t, h_t, φ)
5 Generate updates at each timestep using a recurrent neural network (see the sketch after this list).
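A minimal sketch of the unrolled meta-objective in Eq. (6), assuming w_t = 1 for all t and a helper sample_problem() that returns a random optimizee f with an initial θ; both are illustrative, not the paper's exact code.

import torch

def meta_loss(optimizer_net, sample_problem, T=20):
    f, theta = sample_problem()   # random optimizee and initial theta
    state = None                  # optimizer hidden state h_0
    total = 0.0
    for t in range(T):
        loss = f(theta)
        total = total + loss      # Eq. (6) with w_t = 1
        grad, = torch.autograd.grad(loss, theta, create_graph=True)
        g, state = optimizer_net(grad, state)  # (g_t, h_{t+1}) = m(grad_t, h_t, phi)
        theta = theta + g         # Eq. (7)
    return total

Minimizing meta_loss with respect to optimizer_net's parameters (e.g. with a standard optimizer such as Adam) is what trains the optimizer by gradient descent, hence the paper's title.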
Optimizing the optimizer
[Figure: computational graph used for computing the gradient]
1 Minimize the loss L(φ) w.r.t. the parameters of the optimizer, φ
2 Calculate ∂L/∂φ and backpropagate through time
Optimizing the optimizer
1 Backpropagation through the computational graph
2 Assumption: optimizee gradients do not depend on the optimizer parameters
3 ∇_t = ∇_θ f(θ_t); ∂∇_t/∂φ = 0: drop the gradients along the dotted edges (see the sketch below)
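A minimal sketch of how this assumption can be realized in one unrolled step, assuming PyTorch autograd: detaching the optimizee gradient before it enters the optimizer enforces ∂∇_t/∂φ = 0 (cutting the dotted edges), while φ still receives gradient through g_t. The interface mirrors the earlier meta_loss sketch and is an illustrative assumption.

import torch

def unroll_step(f, theta, optimizer_net, state):
    loss = f(theta)
    grad, = torch.autograd.grad(loss, theta, retain_graph=True)
    # Treat grad_t as a constant w.r.t. phi: drop the dotted edges.
    g, state = optimizer_net(grad.detach(), state)
    theta = theta + g  # phi still gets gradient through g_t
    return loss, theta, state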
Coordinate-wise LSTM optimizer
1 Issue: a single RNN producing updates for all parameters jointly would need a huge hidden state
θ_{t+1} = θ_t + g_t    (8)
(g_t, h_{t+1}) = m(∇_t, h_t, φ)
2 Solution: use an RNN to produce the update coordinate-wise
3 The coordinate-wise LSTM shares its weights across all parameters:
separate hidden states, but shared parameters (see the sketch below)
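A minimal sketch of the coordinate-wise optimizer: one small LSTM with shared weights φ is applied to every parameter coordinate, with coordinates treated as the batch dimension so that each carries its own hidden state. The hidden size and the linear output head are illustrative assumptions.

import torch
import torch.nn as nn

class CoordinateWiseLSTM(nn.Module):
    def __init__(self, hidden_size=20):
        super().__init__()
        self.cell = nn.LSTMCell(1, hidden_size)  # weights shared by all coordinates
        self.out = nn.Linear(hidden_size, 1)

    def forward(self, grad, state):
        n = grad.numel()
        x = grad.view(n, 1)  # coordinates along the batch dimension
        if state is None:    # fresh per-coordinate hidden states
            zeros = x.new_zeros(n, self.cell.hidden_size)
            state = (zeros, zeros.clone())
        h, c = self.cell(x, state)
        g = self.out(h).view_as(grad)  # one update per coordinate
        return g, (h, c)

This is the optimizer_net shape assumed by the earlier meta_loss sketch: the hidden state grows with the number of coordinates, but the learned parameters φ do not.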
Experiments and Results: Quadratic Functions
1 Quadratic functions:
f(θ) = ||Wθ − y||_2^2    (9)
2 The optimizer is trained on random functions from this family
3 It is tested on a random function sampled from the same distribution (see the sketch below)
[Figure: learning curves]
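A minimal sketch of sampling from the quadratic family in Eq. (9); the Gaussian sampling and the dimension n = 10 are illustrative assumptions.

import torch

def sample_quadratic(n=10):
    W = torch.randn(n, n)
    y = torch.randn(n)
    theta = torch.randn(n, requires_grad=True)  # random starting point

    def f(theta):
        return ((W @ theta - y) ** 2).sum()  # Eq. (9)

    return f, theta

Passing sample_quadratic as the sample_problem argument of the earlier meta_loss sketch trains the optimizer on this family; evaluation then draws a fresh, unseen function from the same distribution.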