Neural Networks and Singular Learning Theory
Sumio Watanabe, Tokyo Institute of Technology
Contents
1 Statistics and Learning
2 Regular Theory
3 Neural Networks are Singular
4 Algebraic Geometry
5 General Theory
6 Conclusion
1 Statistics and Learning
Unknown true and Learning Machine
[Figure: the unknown true distribution q(x,y) = q(x) q(y|x) generates (X, Y); the learning machine p(y|x,w) estimates q(y|x). Learning = statistical estimation of the unknown true.]
Notations
Datum : (x, y) in R^M × R^N
Unknown True : q(x, y)
I.I.D. Sample : Dn = (X^n, Y^n) = {(X1, Y1), …, (Xn, Yn)}
Parameter : w in W, contained in R^d
Learning Machine : p(y|x,w)   Prior : φ(w)
[Figure: learning process. The model p(y|x,w) and the prior, combined with the sample from the true q(x) q(y|x), yield the posterior and the predictive p*(y|x), whose generalization error is evaluated; MLE and MAP are shown as alternative point estimates.]
ML, MAP, and Bayes
Log Loss : Ln(w) = -(1/n) Σ_{i=1}^n log p(Yi|Xi,w)
Posterior p(w|Dn) = (1/Zn) ϕ(w) exp(-nLn (w))
Maximum Likelihood: p(y|x,w*) where w* minimizes Ln (w).
MAP: p(y|x,w+) where w+ maximizes p(w|Dn).
Bayes: p(y|x,Dn) = Ew [p (y|x,w)]
How to determine the predictive : p*(y|x)
Ew[ ], Vw[ ] : posterior mean and variance
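As a concrete illustration of the three predictives (not part of the slides), the sketch below uses a hypothetical one-parameter model p(y|x,w) = N(y; wx, 1) with a standard normal prior and a simple grid over w; the data-generating choices are assumptions made only for this example.

```python
import numpy as np

# Hypothetical toy model: p(y|x,w) = N(y; w*x, 1), prior phi(w) = N(0,1).
rng = np.random.default_rng(0)
n = 50
X = rng.uniform(-2.0, 2.0, n)
Y = 0.3 * X + rng.normal(0.0, 1.0, n)      # data from an assumed true distribution

w = np.linspace(-3.0, 3.0, 2001)           # grid over the one-dimensional parameter
dw = w[1] - w[0]

# Sum_i log p(Y_i|X_i,w) for every grid point
loglik = -0.5 * np.sum((Y[None, :] - w[:, None] * X[None, :]) ** 2, axis=1) \
         - 0.5 * n * np.log(2.0 * np.pi)
log_prior = -0.5 * w ** 2 - 0.5 * np.log(2.0 * np.pi)

log_post = loglik + log_prior               # unnormalized log posterior
post = np.exp(log_post - log_post.max())
post /= post.sum() * dw                     # normalized posterior density p(w|Dn)

w_mle = w[np.argmax(loglik)]                # minimizes Ln(w)
w_map = w[np.argmax(log_post)]              # maximizes p(w|Dn)

def bayes_predictive(y, x):
    """p(y|x,Dn) = Ew[ p(y|x,w) ], the posterior average of the model."""
    lik = np.exp(-0.5 * (y - w * x) ** 2) / np.sqrt(2.0 * np.pi)
    return np.sum(lik * post) * dw

print("MLE:", w_mle, "MAP:", w_map,
      "Bayes predictive density at (x=1, y=0):", bayes_predictive(0.0, 1.0))
```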
The purpose of Statistical Learning Theory
Training Loss: T = -(1/n) Σ_{i=1}^n log p*(Yi|Xi)
Generalization Loss: G = - Exy[ log p*(Y|X) ]
Free Energy : F = -log ∫ φ(w) exp(-n Ln(w)) dw
Main Purpose of Statistical Learning Theory
In ML, MAP, and Bayes,
clarify the distributions of T, G, and F.
Let p*(y|x) be the predictive distribution obtained by some method.
Summary 1
The purpose of statistical learning theory is
to clarify the distributions of the training loss,
generalization loss, and free energy.
2 Regular Theory
Regular Case
L(w) = - Exy[ log p(Y|X,w) ]
w0 is the parameter that minimizes L(w).
If L(w) can be approximated by a positive definite quadratic form in a neighborhood of w0, then the statistical estimation is called regular.
In regular cases, statistical theory has already been established.
Definition. The positive definite matrices I and J are defined by
I = Exy[ ∇ log p(Y|X,w0) ( ∇ log p(Y|X,w0) )^T ],
J = - Exy[ ∇² log p(Y|X,w0) ].
Remark. If q(y|x)=p(y|x,w0), then I=J.
In regular cases, the distributions of the MLE, the MAP estimator, and the posterior concentrate around w0 as the sample size n tends to infinity; on this basis, regular statistical theory was established by around 1970.
Regular Theory : Training and Generalization
Theorem. In a regular case, the following holds, where d is the dimension of the parameter w.
Bayes:   E[ G ] = L(w0) + d/(2n) + o(1/n),
         E[ T ] = L(w0) + { d - 2 tr(I J^{-1}) } / (2n) + o(1/n).
ML, MAP: E[ G ] = L(w0) + tr(I J^{-1}) / (2n) + o(1/n),
         E[ T ] = L(w0) - tr(I J^{-1}) / (2n) + o(1/n).
Regular Case: Free Energy
Theorem. In a regular case, the following holds.
F = nLn(w0) + (d/2) log n +Op(1).
Regular Case : Information Criteria
Theorem. In a regular case, the following hold.
(1) (Akaike 1974, Takeuchi 1976) In ML, MAP, and Bayes, define
AIC = 2n T + 2 tr(I J^{-1}). Then E[G] = E[AIC]/(2n) + o(1/n).
(2) (Schwarz, 1978) Define BIC = 2n Ln(w*) + d log n. Then
F = BIC/2 + Op(1).
(3) (Stone, 1977) AIC is asymptotically equivalent to
leave-one-out cross validation (LOOCV).
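The minimal sketch below (an assumption-laden toy, not from the slides) instantiates items (1) and (2) for a regular one-parameter Gaussian location model p(y|w) = N(y; w, 1), where the MLE is the sample mean and I, J are estimated empirically.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
Y = rng.normal(0.5, 1.2, n)                 # data; the model is misspecified in scale

# Regular model p(y|w) = N(y; w, 1); the MLE w* is the sample mean.
w_star = Y.mean()
Ln = 0.5 * np.mean((Y - w_star) ** 2) + 0.5 * np.log(2.0 * np.pi)  # Ln(w*)
T = Ln                                       # training loss of the ML plug-in predictive

# Empirical I and J at w* (both scalars here): score = (y - w), Hessian = -1.
I_hat = np.mean((Y - w_star) ** 2)
J_hat = 1.0
d = 1

AIC = 2 * n * T + 2 * (I_hat / J_hat)        # Takeuchi's form; equals 2nT + 2d when I = J
BIC = 2 * n * Ln + d * np.log(n)             # Schwarz; the free energy F is about BIC/2
print("AIC:", AIC, "E[G] estimate:", AIC / (2 * n), "BIC:", BIC)
```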
Summary 2
In regular cases, statistical learning theory
was established in 1970.
3 Neural Networks are Singular
Singular Case
L(w) = - Exy[ log p(Y|X,w) ]
If L(w) cannot be approximated by any positive definite quadratic form in a neighborhood of the minimum of L(w), then the statistical estimation is called singular.
Since the 1990s, it has been well known that neural networks are singular learning machines.
(1) K. Hagiwara et al. Nonuniqueness of Connecting Weights and AIC in Multi-Layered Neural Networks. IEICE Transactions D-II, 76(9), pp.2058-2065, 1993.
(2) S. Watanabe. A generalized Bayesian framework for neural networks with singular Fisher information matrices. Proc. of NOLTA, pp.207-210, 1995.
Neural networks have many singularities in parameter space.
[Figure: parameter space of a neural network; its hierarchical structure generates singularities.]
Hidden Markov models
Normal mixtures
Probabilistic grammars
Bayesian networks
Neural networks
Matrix factorization
Almost all learning machines are singular.
Neural networks are singular:
(1) The map from a parameter to a probability distribution is not one-to-one.
(2) The likelihood function cannot be approximated by any quadratic form.
(3) Neither the MLE nor the MAP estimator has asymptotic normality.
(4) The Bayes posterior distribution cannot be approximated by any normal distribution.
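As a concrete instance of (1) and (2), consider the one-hidden-unit model used later in the experiment, p(y|x,a,b) with mean a tanh(bx) and true parameter (0,0). Every parameter with a = 0 or b = 0 gives the same (zero) regression function, so the map from parameters to distributions is not one-to-one. Moreover, since a tanh(bx) = abx + (higher order terms), the averaged log loss satisfies
K(a,b) = L(a,b) - L(0,0) ≈ c a^2 b^2   (c > 0 a constant depending on q(x)),
which is not a positive definite quadratic form, and the Fisher information matrix at (0,0) is identically zero, hence degenerate.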
Summary 3
Neural Networks are Singular.
4 Algebraic Geometry
The Road from Learning Theory to Algebraic Geometry.
K(w) = L(w) - L(w0), where w0 minimizes L(w).
The set {w ; K(w) = 0 } contains many singularities.
… There was no statistical theory.
… There was no probability theory.
… … … We need algebraic geometry.
[Figure: the bridge from neural networks to algebraic geometry: hyperfunctions, D-modules, the resolution theorem, zeta functions, empirical processes, and birational invariants.]
Resolution of Singularities (Hironaka's Theorem)
For any K(w) >= 0 on the parameter set, there exist a manifold M and a map w = g(u) from M to R^d such that, in each local coordinate,
K(g(u)) = u1^{2k1} u2^{2k2} ··· ud^{2kd}.
The singular likelihood is put into a standard form.
By using K(g(u)) = u^{2k} = u1^{2k1} u2^{2k2} ··· ud^{2kd},
the log loss Ln(w) = -(1/n) Σ_{i=1}^n log p(Yi|Xi,w) is transformed into
n Ln(g(u)) - n Ln(w0) = n u^{2k} - n^{1/2} u^k ξn(u),
where ξn(u) converges in distribution to a Gaussian process.
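A minimal worked illustration (not from the slides, using a made-up K): for K(a,b) = a^2 b^2, blowing up the origin gives, in one local coordinate (a,b) = g(u) = (u1, u1 u2),
K(g(u)) = u1^4 u2^2 = u1^{2·2} u2^{2·1},
and in the other coordinate (a,b) = (u1 u2, u2),
K(g(u)) = u1^2 u2^4,
so the pulled-back K takes exactly the normal crossing form u1^{2k1} u2^{2k2} in each chart.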
Summary 4
By using the resolution theorem,
any singular log likelihood function can be put
into a common standard form on a manifold.
5 General Theory
General theory contains regular theory as a special case.
General Theory : MLE and MAP
u* = arg max_u { max(0, ξn(u))^2 / 4 + log ϕ(g(u)) }, where max_u denotes the maximum over {u ; K(g(u)) = 0}. For MLE, ϕ(w) is set to be a constant.
Theorem. Assume that the parameter set is compact. Then, in ML and MAP,
E[G] = L(w0) + µ / n + o(1/n),
E[T] = L(w0) - µ / n + o(1/n),
where µ = Eξ[ max(0, ξ(u*))^2 ].
Proof. See Main Theorem 6.4 in Sumio Watanabe, Algebraic Geometry and Statistical Learning Theory, Cambridge University Press, 2009, p.211.
General Theory : Bayes
Theorem. In Bayesian estimation,
E[ G ] = L(w0) + λ / n + o(1/n),
E[ T ] = L(w0) + { λ - 2ν } / n + o(1/n).
Definition. The constants λ, m, ν are defined as follows, where, in each local coordinate of the resolution, K(g(u)) = u1^{2k1} ··· ud^{2kd} and the prior times the Jacobian has the form |u1^{h1} ··· ud^{hd}| b(u) with b(u) > 0.
λ = min over local coordinates of min_{j=1,…,d} (hj+1)/(2kj)   (the real log canonical threshold, RLCT),
m = the number of j that attain the value λ   (the multiplicity),
ν = Eξ[ <t^{1/2} ξ(u)> ] / 2   (the singular fluctuation).
Proof. See Main Theorem 6.3 in Sumio Watanabe, Algebraic Geometry and Statistical Learning Theory, Cambridge University Press, 2009, p.179.
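Continuing the illustrative example K(a,b) = a^2 b^2 from the resolution section, with a prior that is positive and smooth at the origin: in the chart (a,b) = (u1, u1 u2) the Jacobian determinant is u1, so (k1, k2) = (2, 1) and (h1, h2) = (1, 0), giving
(h1+1)/(2k1) = 2/4 = 1/2 and (h2+1)/(2k2) = 1/2,
and the other chart is symmetric; hence λ = 1/2 with multiplicity m = 2, strictly smaller than the regular value d/2 = 1.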
General Theory : Free Energy
Theorem. The following holds.
F = n Ln(w0) + λ log n +(m-1) log log n +Op(1).
Proof. See Main Theorem 6.2 in Sumio Watanabe, Algebraic Geometry and Statistical Learning Theory, Cambridge University Press, 2009, p.174.
See also: Sumio Watanabe. "Algebraic analysis for nonidentifiable learning machines", Neural Computation, Vol.13, No.4, pp.899-933, 2001.
General Theory : Information Criteria
Theorem. Even in singular cases, the following hold.
(1) By the definition WAIC = T + (1/n) Σ_{i=1}^n Vw[ log p(Yi|Xi,w) ],
it follows that E[G] = E[WAIC] + o(1/n).
(2) By the definition WBIC = Ew^{β}[ n Ln(w) ], the posterior mean of n Ln(w) taken at
inverse temperature β = 1/log n, it follows that F = WBIC + op(log n).
(3) (Drton et al., 2017) By estimating λ,
sBIC = n Ln(w*) + λ log n. Then F = sBIC + op(log n).
(4) WAIC is asymptotically equivalent to LOOCV.
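A schematic computation of WAIC and WBIC from Monte Carlo output, assuming an array loglik[s, i] = log p(Y_i|X_i, w_s) is available for draws w_s from the ordinary posterior (for WAIC) or from the tempered posterior with inverse temperature 1/log n (for WBIC); the array names and shapes are conventions of this sketch, not of any particular library.

```python
import numpy as np
from scipy.special import logsumexp

def waic(loglik):
    """loglik: (S, n) array, log p(Y_i|X_i, w_s) for ordinary posterior draws w_s."""
    S, n = loglik.shape
    # Training loss of the Bayes predictive: T = -(1/n) sum_i log Ew[p(Y_i|X_i,w)]
    T = -np.mean(logsumexp(loglik, axis=0) - np.log(S))
    # Functional variance term: (1/n) sum_i Vw[ log p(Y_i|X_i,w) ]
    V = np.mean(np.var(loglik, axis=0, ddof=1))
    return T + V                             # E[WAIC] estimates E[G]

def wbic(loglik_tempered):
    """loglik_tempered: (S, n) array of log p(Y_i|X_i, w_s) for draws w_s from
    the posterior at inverse temperature beta = 1/log(n)."""
    nLn = -loglik_tempered.sum(axis=1)       # n * Ln(w_s) for each draw
    return nLn.mean()                        # WBIC estimates F up to op(log n)
```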
Information Criteria for Singular Models
(1) Sumio Watanabe. Equations of states in singular statistical estimation. Neural Networks, vol.23, no.1, pp.20-34, 2010.
(2) Sumio Watanabe. A widely applicable Bayesian information criterion. JMLR, pp.867-897, 2013.
(3) Mathias Drton, Martyn Plummer. A Bayesian information criterion for singular models. J. R. Statist. Soc. B, Part 2, pp.1-38, 2017.
(4) Sumio Watanabe. Asymptotic equivalence of Bayes cross validation and widely applicable information criterion. JMLR, pp.3571-3594, 2010.
An Experiment.
Model: p(y|x,a,b) = (1/(2π))^{1/2} exp( -(1/2)(y - a tanh(bx))^2 ).
Prior: ϕ(a,b) ∝ 1.
True: q(y|x) = p(y|x,0,0); q(x) is the uniform distribution on [-2,2].
[Figure: F - nSn, WBIC - nSn, BIC - nSn, and the theoretical value plotted against n.]
In this case, λ = 1/2 and m = 2.
For n = 20, …, 450, the values of BIC, WBIC, F,
and the theoretical prediction were compared.
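Because this model has only two parameters, F and WBIC can be computed by brute-force quadrature over an (a, b) grid. The rough sketch below does this; the grid range, the prior normalization, and the interpretation of nSn (taken as n Ln at the true parameter) are assumptions of the sketch.

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(2)
n = 200
X = rng.uniform(-2.0, 2.0, n)                # q(x): uniform on [-2, 2]
Y = rng.normal(0.0, 1.0, n)                  # true q(y|x) = p(y|x, 0, 0)

# Grid over (a, b); the range is an assumption, the prior is uniform on it.
a = np.linspace(-5.0, 5.0, 201)
b = np.linspace(-5.0, 5.0, 201)
A, B = np.meshgrid(a, b, indexing="ij")
da, db = a[1] - a[0], b[1] - b[0]

# n * Ln(a,b) = sum_i (1/2)(Y_i - a tanh(b X_i))^2 + (n/2) log(2*pi)
mean = A[..., None] * np.tanh(B[..., None] * X)            # shape (201, 201, n)
nLn = 0.5 * ((Y - mean) ** 2).sum(axis=-1) + 0.5 * n * np.log(2 * np.pi)

nSn = 0.5 * (Y ** 2).sum() + 0.5 * n * np.log(2 * np.pi)   # n Ln at the true (0, 0)
log_prior = -np.log((a[-1] - a[0]) * (b[-1] - b[0]))        # uniform prior density

# Free energy by quadrature: F = -log sum phi(a,b) exp(-n Ln) da db
F = -(logsumexp(-nLn + log_prior) + np.log(da * db))

# WBIC: posterior mean of n Ln(w) at inverse temperature beta = 1/log n
beta = 1.0 / np.log(n)
logw = -beta * nLn + log_prior
logw -= logsumexp(logw)
WBIC = np.sum(np.exp(logw) * nLn)

# BIC on the free-energy scale (i.e., BIC/2 of the earlier definition), d = 2
BIC_half = nLn.min() + 0.5 * 2 * np.log(n)
print("F - nSn:", F - nSn, "WBIC - nSn:", WBIC - nSn,
      "BIC/2 - nSn:", BIC_half - nSn, "lambda*log n:", 0.5 * np.log(n))
```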
Neural Networks.
True network (input, hidden, output units): 10, 5, 10.
Candidate networks: 10, (1, 3, 5, 7, 9), 10.
n = 200, n_test = 1000.
The posterior was made by the Langevin equation.
Experimental results for 10 trials:
[Figure: generalization loss, WAIC, LOOCV, and AIC plotted against the candidate numbers of hidden units.]
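The slides do not spell out the sampler; below is a generic unadjusted Langevin scheme of the kind that could produce such posterior draws, with the gradient of the log posterior, the step size, and the chain lengths all left as placeholders to be supplied for the network at hand.

```python
import numpy as np

def langevin_step(w, grad_log_post, eps, rng):
    """One unadjusted Langevin update:
    w <- w + (eps/2) * grad log p(w|Dn) + sqrt(eps) * N(0, I)."""
    return w + 0.5 * eps * grad_log_post(w) + np.sqrt(eps) * rng.normal(size=w.shape)

def sample_posterior(grad_log_post, w0, n_steps=20000, burn_in=10000, thin=10,
                     eps=1e-4, rng=None):
    """Collect posterior draws by iterating the Langevin equation from w0."""
    rng = rng or np.random.default_rng(0)
    w, draws = np.array(w0, dtype=float), []
    for t in range(n_steps):
        w = langevin_step(w, grad_log_post, eps, rng)
        if t >= burn_in and (t - burn_in) % thin == 0:
            draws.append(w.copy())
    return np.array(draws)
```

Each retained draw can then be used to evaluate the per-example log likelihoods needed by the WAIC routine sketched earlier.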
Recent Advances of Singular Learning Theory
(1) Keisuke Yamazaki. Asymptotic accuracy of Bayes estimation for latent variables with redundancy. Machine Learning, 102, pp.1-28, 2016.
(2) Keisuke Yamazaki, Daisuke Kaji. Comparing two Bayes methods based on the free energy functions in Bernoulli mixtures. Neural Networks, 44, pp.36-43, 2013.
(3) Miki Aoyagi. Learning coefficient in Bayesian estimation of restricted Boltzmann machine. Journal of Algebraic Statistics, vol.4, no.1, pp.30-57, 2013.
(4) Miki Aoyagi, Kenji Nagata. Learning coefficient of generalization error in Bayesian estimation and Vandermonde matrix type singularity. Neural Computation, vol.24, no.6, pp.1569-1610, 2012.
(5) Miki Aoyagi, Sumio Watanabe. Stochastic Complexities of Reduced Rank Regression in Bayesian Estimation. Neural Networks, 18, pp.924-933, 2005.
(6) Kazuho Watanabe. An alternative view of variational Bayes and asymptotic approximations of free energy. Machine Learning, 86(2), pp.273-293, 2012.
(7) Shinichi Nakajima, Masashi Sugiyama. Theoretical analysis of Bayesian matrix factorization. Journal of Machine Learning Research, 12, pp.2583-2648, 2011.
(8) Naoki Hayashi, Sumio Watanabe. Upper Bound of Bayesian Generalization Error in Non-Negative Matrix Factorization. Neurocomputing, vol.266C, pp.21-28, 2017.
Summary 5
General theory, which contains both singular and
regular cases, was established by using algebraic
geometry.
Singularities make the generalization error very
small. Neural Networks utilize singularities.
6 Conclusion
The generalization problem of neural networks was clarified by algebraic geometry.