Natural Gradient Works Efficiently in Learning
S. Amari
11.03.18 (Fri), Computational Modeling of Intelligence
Summarized by Joon Shik Kim
Abstract
• The ordinary gradient of a function does not represent its steepest direction, but the natural gradient does.
• The dynamical behavior of natural gradient online learning is analyzed, and the method is proved to be Fisher efficient.
• The plateau phenomenon, which appears in the backpropagation learning algorithm for multilayer perceptrons, may disappear or become much less severe when the natural gradient is used.
Introduction (1/2)
• The stochastic gradient method is a popular learning method in the general nonlinear optimization framework.
• In many cases, the parameter space is not Euclidean but has a Riemannian metric structure.
• In these cases, the ordinary gradient does not give the steepest direction of the target function.
Introduction (2/2)
• Barkai, Seung, and Sompolinsky (1995) proposed an adaptive method of adjusting the learning rate. We generalize their idea and evaluate its performance based on the Riemannian metric of errors.
Natural Gradient (1/5)
• The squared length of a small incremental vector dw in a Euclidean space is $|dw|^{2} = \sum_{i} (dw_{i})^{2}$.
• When the coordinate system is nonorthogonal, the squared length is given by the quadratic form $|dw|^{2} = \sum_{i,j} g_{ij}(w)\, dw_{i}\, dw_{j}$, where $G = (g_{ij})$ is the Riemannian metric tensor.
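As a minimal numerical illustration of these two length formulas (the 2-D vector and the metric tensor G are my own examples, not from the paper):

```python
import numpy as np

# Squared length of a small increment dw under the Euclidean metric
# versus a general (Riemannian) metric G.
dw = np.array([0.1, -0.2])

# Euclidean case: |dw|^2 = sum_i (dw_i)^2
euclidean_sq = dw @ dw

# Nonorthogonal case: |dw|^2 = sum_{i,j} g_ij dw_i dw_j,
# with G symmetric positive definite (illustrative values).
G = np.array([[2.0, 0.5],
              [0.5, 1.0]])
riemannian_sq = dw @ G @ dw

print(euclidean_sq, riemannian_sq)  # 0.05 vs 0.04: the metric changes lengths
```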
Natural Gradient (2/5)
• The steepest descent direction of a function L(w) at w is defined by the vector dw that minimizes L(w + dw) under the constraint that |dw| has a fixed length, that is, $|dw|^{2} = \varepsilon^{2}$ for a small constant $\varepsilon$.
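The step from this constrained minimization to the formula on the next slide follows from a Lagrange-multiplier argument; a brief reconstruction (not spelled out on the slide):

```latex
% Minimize the first-order change L(w+dw) - L(w) \approx \nabla L(w)^\top dw
% subject to the Riemannian constraint dw^\top G(w)\, dw = \varepsilon^2.
\begin{aligned}
  &\frac{\partial}{\partial\, dw}\Bigl[\nabla L^{\top} dw
     + \lambda\bigl(dw^{\top} G\, dw - \varepsilon^{2}\bigr)\Bigr] = 0
  \;\Longrightarrow\; \nabla L + 2\lambda\, G\, dw = 0, \\[4pt]
  &\text{so } dw \propto -\,G^{-1}\nabla L, \qquad
  dw = -\,\varepsilon\,\frac{G^{-1}\nabla L}
        {\sqrt{\nabla L^{\top} G^{-1}\nabla L}}.
\end{aligned}
```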
Natural Gradient (3/5)
• The steepest descent direction of L(w) in a Riemannian space is given by $-\tilde{\nabla}L(w) = -G^{-1}(w)\,\nabla L(w)$, where $\tilde{\nabla}L = G^{-1}\nabla L$ is called the natural gradient of L.
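As a concrete sketch of why multiplying by $G^{-1}$ helps (the quadratic toy loss and all names here are illustrative, not from the paper):

```python
import numpy as np

def natural_gradient_step(w, grad_L, G, eta=0.1):
    """One step of w <- w - eta * G^{-1} grad_L; solving the linear
    system avoids forming the inverse explicitly."""
    return w - eta * np.linalg.solve(G, grad_L)

# Toy loss L(w) = 0.5 * w^T A w with an ill-conditioned A.
# Taking G = A as the metric makes the step point straight at the minimum.
A = np.array([[10.0, 0.0],
              [0.0,  0.1]])
w = np.array([1.0, 1.0])
grad_L = A @ w                                # ordinary gradient, badly scaled
print(w - 0.1 * grad_L)                       # [0.0, 0.99]: very uneven progress
print(natural_gradient_step(w, grad_L, G=A))  # [0.9, 0.9]: uniform progress
```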
Natural Gradient (4/5)
Natural Gradient (5/5)
Natural Gradient Learning
• Risk function or average loss: $L(w) = E[\,l(x, w)\,]$, the expectation of the loss over examples x.
• Learning is a procedure to search for the optimal w* that minimizes L(w).
• Stochastic gradient descent learning: $w_{t+1} = w_{t} - \eta_{t}\,\nabla l(x_{t}, w_{t})$, where $\eta_{t}$ is the learning rate; natural gradient learning replaces the ordinary gradient by $\tilde{\nabla} l = G^{-1}\nabla l$.
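A minimal sketch of this online rule, assuming the simplest possible model (a Gaussian with unknown mean, my choice for illustration), where the Fisher metric is $G = 1/\sigma^{2}$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Model: z ~ N(w, sigma^2) with known sigma^2; loss l(z, w) = -log p(z, w),
# so grad_w l = -(z - w) / sigma^2 and the Fisher information is G = 1/sigma^2.
sigma2, w_true, w = 4.0, 3.0, 0.0
for t in range(1, 10_001):
    z = rng.normal(w_true, np.sqrt(sigma2))
    grad_l = -(z - w) / sigma2        # ordinary stochastic gradient
    eta_t = 1.0 / t                   # 1/t annealing of the learning rate
    w -= eta_t * sigma2 * grad_l      # natural gradient: multiply by G^{-1} = sigma^2
print(w)                              # close to w_true = 3.0
```

With $\eta_{t} = 1/t$ this update is exactly $w_{t+1} = w_{t} + \frac{1}{t}(z_{t} - w_{t})$, the recursive sample mean, which foreshadows the Fisher efficiency result below.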
Statistical Estimation of Probability Density Function (1/2)
• In the case of statistical estimation, we assume a statistical model {p(z, w)}, and the problem is to obtain the distribution $p(z, \hat{w})$ that best approximates the unknown density function q(z).
• The loss function is the negative log-likelihood, $l(z, w) = -\log p(z, w)$.
Statistical Estimation of Probability Density Function (2/2)
• The expected loss is then given by $L(w) = E_{q}[\,l(z, w)\,] = H_{z} + D\bigl(q(z)\,\|\,p(z, w)\bigr)$, where $H_{z}$ is the entropy of q(z) and does not depend on w.
• The Riemannian metric is the Fisher information,
$$g_{ij}(w) = E\!\left[\frac{\partial \log p(z, w)}{\partial w_{i}}\,\frac{\partial \log p(z, w)}{\partial w_{j}}\right].$$
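When this expectation has no closed form, it can be estimated by Monte Carlo; a minimal sketch, where `score` and `sampler` are hypothetical helpers supplied by the user:

```python
import numpy as np

rng = np.random.default_rng(1)

def fisher_information(score, sampler, w, n=100_000):
    """Monte Carlo estimate of G(w) = E[ s(z, w) s(z, w)^T ], where
    s(z, w) = d/dw log p(z, w) is the score and z ~ p(z, w)."""
    z = sampler(w, n)
    s = score(z, w)          # shape (n, dim)
    return s.T @ s / n

# Check on z ~ N(w, 1): the score is (z - w), so G should be close to 1.
G = fisher_information(
    score=lambda z, w: (z - w[0])[:, None],
    sampler=lambda w, n: rng.normal(w[0], 1.0, size=n),
    w=np.array([0.0]),
)
print(G)  # ~ [[1.0]]
```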
Fisher Information as the Metric of Kullback-Leibler Divergence (1/2)
• Set $p = q(\theta + h)$ and expand the Kullback-Leibler divergence to second order, using $\ln(1 + u) \approx u - \frac{1}{2}u^{2}$ with $u = (p - q)/q$:

$$D(q\,\|\,p) = \int q \ln\frac{q}{p}\,dz = -\int q \ln\frac{p}{q}\,dz \approx -\int q\left[\frac{p - q}{q} - \frac{1}{2}\left(\frac{p - q}{q}\right)^{2}\right]dz = \frac{1}{2}\int \frac{(p - q)^{2}}{q}\,dz,$$

since $\int (p - q)\,dz = 0$. Hence

$$\lim_{h \to 0}\frac{D\bigl(q(\theta)\,\|\,q(\theta + h)\bigr)}{h^{2}} = \lim_{h \to 0}\frac{1}{2}\int \frac{1}{q}\left(\frac{q(\theta + h) - q(\theta)}{h}\right)^{2}dz.$$
Fisher Information as the Metric of Kullback-Leibler Divergence (2/2)
$$\lim_{h \to 0}\frac{D\bigl(q(\theta)\,\|\,q(\theta + h)\bigr)}{h^{2}} = \lim_{h \to 0}\frac{1}{2}\int \frac{1}{q}\left(\frac{q(\theta + h) - q(\theta)}{h}\right)^{2}dz = \frac{1}{2}\int \frac{1}{q}\left(\frac{\partial q}{\partial \theta}\right)^{2}dz = \frac{1}{2}\int q\left(\frac{\partial \ln q}{\partial \theta}\right)^{2}dz = \frac{1}{2}I,$$

where I is the Fisher information. The KL divergence therefore behaves locally as $\frac{1}{2}I h^{2}$, which is why the Fisher information serves as the Riemannian metric.
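This local quadratic behavior can be checked numerically; a sketch using a Gaussian family (my choice of example; here $I = 1/\sigma^{2}$ and the KL divergence is $h^{2}/2\sigma^{2}$ in closed form):

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

theta, sigma, h = 0.0, 2.0, 1e-3

def kl(h):
    """D(q(theta) || q(theta + h)) for q = N(theta, sigma^2), by quadrature."""
    q = lambda z: norm.pdf(z, theta, sigma)
    p = lambda z: norm.pdf(z, theta + h, sigma)
    return quad(lambda z: q(z) * np.log(q(z) / p(z)), -30, 30)[0]

print(kl(h) / h**2)    # ~ 0.125
print(0.5 / sigma**2)  # (1/2) * I = 0.125
```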
Multilayer Neural Network (1/2)
Multilayer Neural Network (2/2)
• The stochastic network is modeled as $y = f(x, w) + \varepsilon$ with Gaussian noise $\varepsilon$, so that $p(y \mid x;\, w) = c\,\exp\!\left(-\tfrac{1}{2}\{y - f(x, w)\}^{2}\right)$, where c is a normalizing constant.
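A minimal sketch of this stochastic network model; the tanh activation, the weight shapes, and the unit noise variance are my own illustrative choices:

```python
import numpy as np

def mlp_output(x, W, v):
    """f(x, w) = sum_i v_i * phi(W_i . x) with phi = tanh."""
    return v @ np.tanh(W @ x)

def log_likelihood(y, x, W, v):
    """log p(y | x; w) = -(1/2) * (y - f(x, w))^2 + log c for unit Gaussian
    output noise; log c is constant in the parameters and can be dropped."""
    return -0.5 * (y - mlp_output(x, W, v)) ** 2

rng = np.random.default_rng(2)
W = rng.normal(size=(3, 4))               # hidden weights: 3 units, 4 inputs
v = rng.normal(size=3)                    # output weights
x = rng.normal(size=4)
y = mlp_output(x, W, v) + rng.normal()    # teacher output plus noise
print(log_likelihood(y, x, W, v))
```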
Natural Gradient Gives Fisher-Efficient Online Learning Algorithms (1/4)
• $D_{T} = \{(x_{1}, y_{1}), \ldots, (x_{T}, y_{T})\}$ is a set of T independent input-output examples generated by the teacher network having parameter w*.
• Minimizing the log loss over the training data $D_{T}$ amounts to obtaining the estimator $\hat{w}_{T}$ that minimizes the training error $E_{\mathrm{train}}(w) = \frac{1}{T}\sum_{t=1}^{T} l(x_{t}, y_{t};\, w)$.
Natural Gradient Gives Fisher-Efficient Online Learning Algorithms (2/4)
• The Cramér-Rao theorem states that the expected squared error of an unbiased estimator satisfies
$$E\!\left[(\hat{w}_{T} - w^{*})(\hat{w}_{T} - w^{*})^{\top}\right] \ge \frac{1}{T}\,I^{-1}.$$
• An estimator is said to be efficient, or Fisher efficient, when it attains this bound asymptotically.
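Sticking with the Gaussian-mean toy model from the earlier sketch (where $I = 1/\sigma^{2}$), one can check empirically that the natural gradient online estimator attains the bound: its mean squared error after T steps should approach $\frac{1}{T}I^{-1} = \sigma^{2}/T$.

```python
import numpy as np

rng = np.random.default_rng(3)

sigma2, w_true, T, trials = 4.0, 3.0, 2000, 500
errors = []
for _ in range(trials):
    w = 0.0
    for t in range(1, T + 1):
        z = rng.normal(w_true, np.sqrt(sigma2))
        w += (1.0 / t) * (z - w)     # natural gradient step with eta_t = 1/t
    errors.append((w - w_true) ** 2)

print(np.mean(errors))   # empirical MSE, ~ 0.002
print(sigma2 / T)        # Cramér-Rao bound sigma^2 / T = 0.002
```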
Natural Gradient Gives Fisher-Efficient Online Learning Algorithms (3/4)
• Theorem 2. The natural gradient online estimator is Fisher efficient.
• Proof idea: with learning rate $\eta_{t} = 1/t$, the natural gradient update recursively solves the maximum likelihood equation, so the error covariance of $\hat{w}_{t}$ asymptotically attains the Cramér-Rao bound $\frac{1}{t}I^{-1}$.
Natural Gradient Gives Fisher-Efficient Online Learning Algorithms (4/4)