Neural Networks and Singular Learning Theory
Sumio Watanabe, Tokyo Institute of Technology
Contents
1 Statistics and Learning
2 Regular Theory
3 Neural Networks are Singular
4 Algebraic Geometry
5 General Theory
6 Conclusion
1 Statistics and Learning
Unknown true and Learning Machine
[Figure: the unknown true distribution q(x,y) = q(x) q(y|x) generates (X, Y); the learning machine p(y|x,w) estimates q(y|x). Learning = statistical estimation of the unknown true.]
Notations
Datum : (x, y) in R^M × R^N
Unknown True : q(x, y)
I.I.D. Sample : Dn = (X^n, Y^n) = {(X1, Y1), …, (Xn, Yn)}
Parameter : w in W, contained in R^d
Learning Machine : p(y|x,w)   Prior : φ(w)
[Figure: learning process. The model p(y|x,w) and the prior, combined with the sample from the true q(x) q(y|x), yield the posterior and the predictive p*(y|x), whose generalization error is evaluated; MLE and MAP are shown as alternative point estimates.]
ML, MAP, and Bayes
Log Loss : Ln(w) = -(1/n) Σ_{i=1}^n log p(Yi|Xi,w)
Posterior p(w|Dn) = (1/Zn) ϕ(w) exp(-nLn (w))
Maximum Likelihood: p(y|x,w*) where w* minimizes Ln (w).
MAP: p(y|x,w+) where w+ maximizes p(w|Dn).
Bayes: p(y|x,Dn) = Ew [p (y|x,w)]
How to determine the predictive : p*(y|x)
Ew[ ], Vw[ ] : posterior mean and variance
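As a concrete illustration of the three predictives (not part of the slides), the sketch below uses a hypothetical one-parameter model p(y|x,w) = N(y; wx, 1) with a standard normal prior and a simple grid over w; the data-generating choices are assumptions made only for this example.

```python
import numpy as np

# Hypothetical toy model: p(y|x,w) = N(y; w*x, 1), prior phi(w) = N(0,1).
rng = np.random.default_rng(0)
n = 50
X = rng.uniform(-2.0, 2.0, n)
Y = 0.3 * X + rng.normal(0.0, 1.0, n)      # data from an assumed true distribution

w = np.linspace(-3.0, 3.0, 2001)           # grid over the one-dimensional parameter
dw = w[1] - w[0]

# Sum_i log p(Y_i|X_i,w) for every grid point
loglik = -0.5 * np.sum((Y[None, :] - w[:, None] * X[None, :]) ** 2, axis=1) \
         - 0.5 * n * np.log(2.0 * np.pi)
log_prior = -0.5 * w ** 2 - 0.5 * np.log(2.0 * np.pi)

log_post = loglik + log_prior               # unnormalized log posterior
post = np.exp(log_post - log_post.max())
post /= post.sum() * dw                     # normalized posterior density p(w|Dn)

w_mle = w[np.argmax(loglik)]                # minimizes Ln(w)
w_map = w[np.argmax(log_post)]              # maximizes p(w|Dn)

def bayes_predictive(y, x):
    """p(y|x,Dn) = Ew[ p(y|x,w) ], the posterior average of the model."""
    lik = np.exp(-0.5 * (y - w * x) ** 2) / np.sqrt(2.0 * np.pi)
    return np.sum(lik * post) * dw

print("MLE:", w_mle, "MAP:", w_map,
      "Bayes predictive density at (x=1, y=0):", bayes_predictive(0.0, 1.0))
```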
The purpose of Statistical Learning Theory
Training Loss: T = -(1/n) Σ_{i=1}^n log p*(Yi|Xi)
Generalization Loss: G = - Exy[ log p*(Y|X) ]
Free Energy : F = -log ∫ φ(w) exp(-n Ln(w)) dw
Main Purpose of Statistical Learning Theory
In ML, MAP, and Bayes,
clarify the distributions of T, G, and F.
Let p*(y|x) be the predictive distribution obtained by some method.
Summary 1
The purpose of statistical learning theory is
to clarify the distributions of the training loss,
generalization loss, and free energy.
2 Regular Theory
Regular Case
L(w) = - Exy[ log p(Y|X,w) ]
w0 is the parameter that minimizes L(w).
If L(w) can be approximated by a positive definite quadratic form in a neighborhood of w0, then the statistical estimation is called regular.
In regular cases, statistical theory has already been established.
Definition. The positive definite matrices I and J are defined by
I = Exy[ ∇ log p(Y|X,w0) ( ∇ log p(Y|X,w0) )^T ],
J = - Exy[ ∇² log p(Y|X,w0) ].
Remark. If q(y|x)=p(y|x,w0), then I=J.
In regular cases, the distributions of the MLE, the MAP estimator, and the posterior concentrate around w0 as the sample size n tends to infinity; on this basis, regular statistical theory was established by around 1970.
Regular Theory : Training and Generalization
Theorem. In a regular case, the following holds, where d is the dimension of the parameter w.
Bayes:   E[ G ] = L(w0) + d/(2n) + o(1/n),
         E[ T ] = L(w0) + { d - 2 tr(I J^{-1}) } / (2n) + o(1/n).
ML, MAP: E[ G ] = L(w0) + tr(I J^{-1}) / (2n) + o(1/n),
         E[ T ] = L(w0) - tr(I J^{-1}) / (2n) + o(1/n).
Regular Case: Free Energy
Theorem. In a regular case, the following holds.
F = nLn(w0) + (d/2) log n +Op(1).
Regular Case : Information Criteria
Theorem. In a regular case, the following hold.
(1) (Akaike 1974, Takeuchi 1976) In ML, MAP, and Bayes, define
AIC = 2n T + 2 tr(I J^{-1}). Then E[G] = E[AIC]/(2n) + o(1/n).
(2) (Schwarz, 1978) Define BIC = 2n Ln(w*) + d log n. Then
F = BIC/2 + Op(1).
(3) (Stone, 1977) AIC is asymptotically equivalent to
leave-one-out cross validation (LOOCV).
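The minimal sketch below (an assumption-laden toy, not from the slides) instantiates items (1) and (2) for a regular one-parameter Gaussian location model p(y|w) = N(y; w, 1), where the MLE is the sample mean and I, J are estimated empirically.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
Y = rng.normal(0.5, 1.2, n)                 # data; the model is misspecified in scale

# Regular model p(y|w) = N(y; w, 1); the MLE w* is the sample mean.
w_star = Y.mean()
Ln = 0.5 * np.mean((Y - w_star) ** 2) + 0.5 * np.log(2.0 * np.pi)  # Ln(w*)
T = Ln                                       # training loss of the ML plug-in predictive

# Empirical I and J at w* (both scalars here): score = (y - w), Hessian = -1.
I_hat = np.mean((Y - w_star) ** 2)
J_hat = 1.0
d = 1

AIC = 2 * n * T + 2 * (I_hat / J_hat)        # Takeuchi's form; equals 2nT + 2d when I = J
BIC = 2 * n * Ln + d * np.log(n)             # Schwarz; the free energy F is about BIC/2
print("AIC:", AIC, "E[G] estimate:", AIC / (2 * n), "BIC:", BIC)
```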
Summary 2
In regular cases, statistical learning theory
was established in 1970.
3 Neural Networks are Singular
Singular Case
L(w) = - Exy[ log p(Y|X,w) ]
If L(w) cannot be approximated by any positive definite quadratic form in a neighborhood of the minimum of L(w), then the statistical estimation is called singular.
Since the 1990s, it has been well known that neural networks are singular learning machines.
(1) K. Hagiwara et al. Nonuniqueness of Connecting Weights and AIC in Multi-Layered Neural Networks. IEICE Transactions D-II, 76(9), pp.2058-2065, 1993.
(2) S. Watanabe. A generalized Bayesian framework for neural networks with singular Fisher information matrices. Proc. of NOLTA, pp.207-210, 1995.
Neural networks have many singularities in parameter space.
[Figure: parameter space of a neural network; its hierarchical structure generates singularities.]
Hidden Markov models
Normal mixtures
Probabilistic grammars
Bayesian networks
Neural networks
Matrix factorization
Almost all learning machines are singular.
Neural networks are singular:
(1) The map from a parameter to a probability distribution is not one-to-one.
(2) The likelihood function cannot be approximated by any quadratic form.
(3) Neither the MLE nor the MAP estimator has asymptotic normality.
(4) The Bayes posterior distribution cannot be approximated by any normal distribution.
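As a concrete instance of (1) and (2), consider the one-hidden-unit model used later in the experiment, p(y|x,a,b) with mean a tanh(bx) and true parameter (0,0). Every parameter with a = 0 or b = 0 gives the same (zero) regression function, so the map from parameters to distributions is not one-to-one. Moreover, since a tanh(bx) = abx + (higher order terms), the averaged log loss satisfies
K(a,b) = L(a,b) - L(0,0) ≈ c a^2 b^2   (c > 0 a constant depending on q(x)),
which is not a positive definite quadratic form, and the Fisher information matrix at (0,0) is identically zero, hence degenerate.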
Summary 3
Neural Networks are Singular.
4 Algebraic Geometry
The Road from Learning Theory to Algebraic Geometry.
K(w) = L(w) - L(w0), where w0 minimizes L(w).
The set {w ; K(w) = 0 } contains many singularities.
… There was no statistical theory.
… There was no probability theory.
… … … We need algebraic geometry.
[Figure: the bridge from neural networks to algebraic geometry: hyperfunctions, D-modules, the resolution theorem, zeta functions, empirical processes, and birational invariants.]
Resolution of Singularities (Hironaka's Theorem)
For any K(w) >= 0 on the parameter set, there exist a manifold M and a map w = g(u) from M to R^d such that, in each local coordinate,
K(g(u)) = u1^{2k1} u2^{2k2} ··· ud^{2kd}.
The singular likelihood is put into a standard form.
By using K(g(u)) = u^{2k} = u1^{2k1} u2^{2k2} ··· ud^{2kd},
the log loss Ln(w) = -(1/n) Σ_{i=1}^n log p(Yi|Xi,w) is transformed into
n Ln(g(u)) - n Ln(w0) = n u^{2k} - n^{1/2} u^k ξn(u),
where ξn(u) converges in distribution to a Gaussian process.
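A minimal worked illustration (not from the slides, using a made-up K): for K(a,b) = a^2 b^2, blowing up the origin gives, in one local coordinate (a,b) = g(u) = (u1, u1 u2),
K(g(u)) = u1^4 u2^2 = u1^{2·2} u2^{2·1},
and in the other coordinate (a,b) = (u1 u2, u2),
K(g(u)) = u1^2 u2^4,
so the pulled-back K takes exactly the normal crossing form u1^{2k1} u2^{2k2} in each chart.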
Summary 4
By using the resolution theorem,
any singular log likelihood function can be put
into a common standard form on a manifold.
5 General Theory
General theory contains regular theory as a special case.
General Theory : MLE and MAP
u* = arg max_u { max(0, ξn(u))^2 / 4 + log ϕ(g(u)) }, where max_u denotes the maximum over {u ; K(g(u)) = 0}. For MLE, ϕ(w) is set to be a constant.
Theorem. Assume that the parameter set is compact. Then, in ML and MAP,
E[G] = L(w0) + µ / n + o(1/n),
E[T] = L(w0) - µ / n + o(1/n),
where µ = Eξ[ max(0, ξ(u*))^2 ].
Proof. See Main Theorem 6.4 in Sumio Watanabe, Algebraic Geometry and Statistical Learning Theory, Cambridge University Press, 2009, p.211.
General Theory : Bayes
Theorem. In Bayesian estimation,
E[ G ] = L(w0) + λ / n + o(1/n),
E[ T ] = L(w0) + { λ - 2ν } / n + o(1/n).
Definition. The constants λ, m, ν are defined as follows, where, in each local coordinate of the resolution, K(g(u)) = u1^{2k1} ··· ud^{2kd} and the prior times the Jacobian has the form |u1^{h1} ··· ud^{hd}| b(u) with b(u) > 0.
λ = min over local coordinates of min_{j=1,…,d} (hj+1)/(2kj)   (the real log canonical threshold, RLCT),
m = the number of j that attain the value λ   (the multiplicity),
ν = Eξ[ <t^{1/2} ξ(u)> ] / 2   (the singular fluctuation).
Proof. See Main Theorem 6.3 in Sumio Watanabe, Algebraic Geometry and Statistical Learning Theory, Cambridge University Press, 2009, p.179.
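Continuing the illustrative example K(a,b) = a^2 b^2 from the resolution section, with a prior that is positive and smooth at the origin: in the chart (a,b) = (u1, u1 u2) the Jacobian determinant is u1, so (k1, k2) = (2, 1) and (h1, h2) = (1, 0), giving
(h1+1)/(2k1) = 2/4 = 1/2 and (h2+1)/(2k2) = 1/2,
and the other chart is symmetric; hence λ = 1/2 with multiplicity m = 2, strictly smaller than the regular value d/2 = 1.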
General Theory : Free Energy
Theorem. The following holds.
F = n Ln(w0) + λ log n +(m-1) log log n +Op(1).
Proof. See Main Theorem 6.2 in Sumio Watanabe, Algebraic Geometry and Statistical Learning Theory, Cambridge University Press, 2009, p.174.
See also: Sumio Watanabe. "Algebraic analysis for nonidentifiable learning machines", Neural Computation, Vol.13, No.4, pp.899-933, 2001.
General Theory : Information Criteria
Theorem. Even in singular cases, the following hold.
(1) By the definition WAIC = T + (1/n) Σ_{i=1}^n Vw[ log p(Yi|Xi,w) ],
it follows that E[G] = E[WAIC] + o(1/n).
(2) By the definition WBIC = Ew^{β}[ n Ln(w) ], the posterior mean of n Ln(w) taken at
inverse temperature β = 1/log n, it follows that F = WBIC + op(log n).
(3) (Drton et al., 2017) By estimating λ,
sBIC = n Ln(w*) + λ log n. Then F = sBIC + op(log n).
(4) WAIC is asymptotically equivalent to LOOCV.
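A schematic computation of WAIC and WBIC from Monte Carlo output, assuming an array loglik[s, i] = log p(Y_i|X_i, w_s) is available for draws w_s from the ordinary posterior (for WAIC) or from the tempered posterior with inverse temperature 1/log n (for WBIC); the array names and shapes are conventions of this sketch, not of any particular library.

```python
import numpy as np
from scipy.special import logsumexp

def waic(loglik):
    """loglik: (S, n) array, log p(Y_i|X_i, w_s) for ordinary posterior draws w_s."""
    S, n = loglik.shape
    # Training loss of the Bayes predictive: T = -(1/n) sum_i log Ew[p(Y_i|X_i,w)]
    T = -np.mean(logsumexp(loglik, axis=0) - np.log(S))
    # Functional variance term: (1/n) sum_i Vw[ log p(Y_i|X_i,w) ]
    V = np.mean(np.var(loglik, axis=0, ddof=1))
    return T + V                             # E[WAIC] estimates E[G]

def wbic(loglik_tempered):
    """loglik_tempered: (S, n) array of log p(Y_i|X_i, w_s) for draws w_s from
    the posterior at inverse temperature beta = 1/log(n)."""
    nLn = -loglik_tempered.sum(axis=1)       # n * Ln(w_s) for each draw
    return nLn.mean()                        # WBIC estimates F up to op(log n)
```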
Information Criteria for Singular Models
(1) Sumio Watanabe. Equations of states in singular statistical estimation. Neural Networks, vol.23, no.1, pp.20-34, 2010.
(2) Sumio Watanabe. A widely applicable Bayesian information criterion. JMLR, pp.867-897, 2013.
(3) Mathias Drton, Martyn Plummer. A Bayesian information criterion for singular models. J. R. Statist. Soc. B, Part 2, pp.1-38, 2017.
(4) Sumio Watanabe. Asymptotic equivalence of Bayes cross validation and widely applicable information criterion. JMLR, pp.3571-3594, 2010.
An Experiment.
Model: p(y|x,a,b) = (1/(2π))^{1/2} exp( -(1/2)(y - a tanh(bx))^2 ).
Prior: ϕ(a,b) ∝ 1.
True: q(y|x) = p(y|x,0,0); q(x) is the uniform distribution on [-2,2].
[Figure: F - nSn, WBIC - nSn, BIC - nSn, and the theoretical value plotted against n.]
In this case, λ = 1/2 and m = 2.
For n = 20, …, 450, the values of BIC, WBIC, F,
and the theoretical prediction were compared.
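Because this model has only two parameters, F and WBIC can be computed by brute-force quadrature over an (a, b) grid. The rough sketch below does this; the grid range, the prior normalization, and the interpretation of nSn (taken as n Ln at the true parameter) are assumptions of the sketch.

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(2)
n = 200
X = rng.uniform(-2.0, 2.0, n)                # q(x): uniform on [-2, 2]
Y = rng.normal(0.0, 1.0, n)                  # true q(y|x) = p(y|x, 0, 0)

# Grid over (a, b); the range is an assumption, the prior is uniform on it.
a = np.linspace(-5.0, 5.0, 201)
b = np.linspace(-5.0, 5.0, 201)
A, B = np.meshgrid(a, b, indexing="ij")
da, db = a[1] - a[0], b[1] - b[0]

# n * Ln(a,b) = sum_i (1/2)(Y_i - a tanh(b X_i))^2 + (n/2) log(2*pi)
mean = A[..., None] * np.tanh(B[..., None] * X)            # shape (201, 201, n)
nLn = 0.5 * ((Y - mean) ** 2).sum(axis=-1) + 0.5 * n * np.log(2 * np.pi)

nSn = 0.5 * (Y ** 2).sum() + 0.5 * n * np.log(2 * np.pi)   # n Ln at the true (0, 0)
log_prior = -np.log((a[-1] - a[0]) * (b[-1] - b[0]))        # uniform prior density

# Free energy by quadrature: F = -log sum phi(a,b) exp(-n Ln) da db
F = -(logsumexp(-nLn + log_prior) + np.log(da * db))

# WBIC: posterior mean of n Ln(w) at inverse temperature beta = 1/log n
beta = 1.0 / np.log(n)
logw = -beta * nLn + log_prior
logw -= logsumexp(logw)
WBIC = np.sum(np.exp(logw) * nLn)

# BIC on the free-energy scale (i.e., BIC/2 of the earlier definition), d = 2
BIC_half = nLn.min() + 0.5 * 2 * np.log(n)
print("F - nSn:", F - nSn, "WBIC - nSn:", WBIC - nSn,
      "BIC/2 - nSn:", BIC_half - nSn, "lambda*log n:", 0.5 * np.log(n))
```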
Neural Networks.
True network (input, hidden, output units): 10, 5, 10.
Candidate networks: 10, (1, 3, 5, 7, 9), 10.
n = 200, n_test = 1000.
The posterior was made by the Langevin equation.
Experimental results for 10 trials:
[Figure: generalization loss, WAIC, LOOCV, and AIC plotted against the candidate numbers of hidden units.]
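The slides do not spell out the sampler; below is a generic unadjusted Langevin scheme of the kind that could produce such posterior draws, with the gradient of the log posterior, the step size, and the chain lengths all left as placeholders to be supplied for the network at hand.

```python
import numpy as np

def langevin_step(w, grad_log_post, eps, rng):
    """One unadjusted Langevin update:
    w <- w + (eps/2) * grad log p(w|Dn) + sqrt(eps) * N(0, I)."""
    return w + 0.5 * eps * grad_log_post(w) + np.sqrt(eps) * rng.normal(size=w.shape)

def sample_posterior(grad_log_post, w0, n_steps=20000, burn_in=10000, thin=10,
                     eps=1e-4, rng=None):
    """Collect posterior draws by iterating the Langevin equation from w0."""
    rng = rng or np.random.default_rng(0)
    w, draws = np.array(w0, dtype=float), []
    for t in range(n_steps):
        w = langevin_step(w, grad_log_post, eps, rng)
        if t >= burn_in and (t - burn_in) % thin == 0:
            draws.append(w.copy())
    return np.array(draws)
```

Each retained draw can then be used to evaluate the per-example log likelihoods needed by the WAIC routine sketched earlier.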
Recent Advances of Singular Learning Theory
(1) Keisuke Yamazaki. Asymptotic accuracy of Bayes estimation for latent variables with redundancy. Machine Learning, 102, pp.1-28, 2016.
(2) Keisuke Yamazaki, Daisuke Kaji. Comparing two Bayes methods based on the free energy functions in Bernoulli mixtures. Neural Networks, 44, pp.36-43, 2013.
(3) Miki Aoyagi. Learning coefficient in Bayesian estimation of restricted Boltzmann machine. Journal of Algebraic Statistics, vol.4, no.1, pp.30-57, 2013.
(4) Miki Aoyagi, Kenji Nagata. Learning coefficient of generalization error in Bayesian estimation and Vandermonde matrix type singularity. Neural Computation, vol.24, no.6, pp.1569-1610, 2012.
(5) Miki Aoyagi, Sumio Watanabe. Stochastic Complexities of Reduced Rank Regression in Bayesian Estimation. Neural Networks, 18, pp.924-933, 2005.
(6) Kazuho Watanabe. An alternative view of variational Bayes and asymptotic approximations of free energy. Machine Learning, 86(2), pp.273-293, 2012.
(7) Shinichi Nakajima, Masashi Sugiyama. Theoretical analysis of Bayesian matrix factorization. Journal of Machine Learning Research, 12, pp.2583-2648, 2011.
(8) Naoki Hayashi, Sumio Watanabe. Upper Bound of Bayesian Generalization Error in Non-Negative Matrix Factorization. Neurocomputing, vol.266C, pp.21-28, 2017.
Summary 5
General theory, which contains both singular and
regular cases, was established by using algebraic
geometry.
Singularities make the generalization error very
small. Neural Networks utilize singularities.
6 Conclusion
The generalization problem of neural networks was clarified by algebraic geometry.