Neural Networks and Singular Learning Theory Sumio Watanabe Tokyo Institute of Technology


Page 1: Neural Networks and Singular Learning Theory

Neural Networks and Singular Learning Theory

Sumio Watanabe
Tokyo Institute of Technology

Page 2

Contents

1 Statistics and Learning

2 Regular Theory

3 Neural Networks are Singular

4 Algebraic Geometry

5 General Theory

6 Conclusion

Page 3

1 Statistics and Learning

Page 4

Unknown True Distribution and Learning Machine

Unknown true distribution: q(x,y) = q(x) q(y|x).

Learning machine: p(y|x,w).

Learning = statistical estimation of the unknown true distribution.

Page 5

Notations

Datum: (x, y) in R^M × R^N.

Unknown true distribution: q(x,y).

I.i.d. sample: Dn = (X^n, Y^n) = {(X1, Y1), …, (Xn, Yn)}.

Parameter: w in W, contained in R^d.

Learning machine: p(y|x,w). Prior: φ(w).

Page 6

[Figure: the true distribution q(x) q(y|x) generates the sample; from the model and prior, the learning process (MLE, MAP, or the Bayes posterior) produces a predictive distribution p*(y|x), whose discrepancy from the truth is the generalization error.]

Page 7

ML, MAP, and Bayes

Log loss: Ln(w) = -(1/n) Σ_{i=1}^n log p(Yi|Xi,w).

Posterior: p(w|Dn) = (1/Zn) φ(w) exp(-n Ln(w)).

Ew[ ], Vw[ ]: posterior mean and variance.

How to determine the predictive p*(y|x):

Maximum likelihood: p*(y|x) = p(y|x,w*), where w* minimizes Ln(w).

MAP: p*(y|x) = p(y|x,w+), where w+ maximizes p(w|Dn).

Bayes: p*(y|x) = p(y|x,Dn) = Ew[ p(y|x,w) ].
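As a concrete sketch (not part of the slides), the three predictives can be computed on a grid for a toy one-dimensional model p(y|w) = N(y; w, 1) with prior φ(w) = N(0, 1); the data, grid, and prior below are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
Y = rng.normal(0.3, 1.0, size=n)            # data from N(0.3, 1)

w = np.linspace(-3, 3, 2001)                # parameter grid
dw = w[1] - w[0]
# log loss Ln(w) = -(1/n) sum_i log p(Yi|w) for p(y|w) = N(y; w, 1)
Ln = 0.5 * np.log(2 * np.pi) + 0.5 * np.mean((Y[:, None] - w[None, :]) ** 2, axis=0)

phi = np.exp(-0.5 * w ** 2)                 # prior N(0, 1), unnormalized
post = phi * np.exp(-n * (Ln - Ln.min()))   # posterior ∝ φ(w) exp(-n Ln(w))
post /= post.sum() * dw

w_mle = w[np.argmin(Ln)]                    # maximum likelihood
w_map = w[np.argmax(post)]                  # maximum a posteriori
# Bayes predictive density at a point y0: p*(y0) = Ew[ p(y0|w) ]
y0 = 0.0
pred = np.sum(post * np.exp(-0.5 * (y0 - w) ** 2) / np.sqrt(2 * np.pi)) * dw
print(w_mle, w_map, pred)
```

For this conjugate pair the results can be checked against the closed-form posterior N(n·ȳ/(n+1), 1/(n+1)): the MLE is the sample mean and the MAP estimator is shrunk toward the prior mean.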

Page 8

The Purpose of Statistical Learning Theory

Let p*(y|x) be the predictive distribution given by some method.

Training loss: T = -(1/n) Σ_{i=1}^n log p*(Yi|Xi).

Generalization loss: G = - Exy[ log p*(Y|X) ].

Free energy: F = -log ∫ φ(w) exp(-n Ln(w)) dw.

Main purpose of statistical learning theory: in ML, MAP, and Bayes, clarify the distributions of T, G, and F.
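A numerical sketch of T, G, and F for the same kind of toy Gaussian model (the model, prior, and grids are assumptions for illustration, not from the slides); F is evaluated by shifting out the minimum of n·Ln for numerical stability.

```python
import numpy as np

rng = np.random.default_rng(1)
n, w0 = 100, 0.5
Y = rng.normal(w0, 1.0, size=n)             # true q(y) = N(y; 0.5, 1)

w = np.linspace(-4, 4, 4001)
dw = w[1] - w[0]
Ln = 0.5 * np.log(2 * np.pi) + 0.5 * np.mean((Y[:, None] - w) ** 2, axis=0)
phi = np.exp(-0.5 * w ** 2) / np.sqrt(2 * np.pi)    # prior N(0, 1)

# free energy F = -log ∫ φ(w) exp(-n Ln(w)) dw  (shift by the min for stability)
c = n * Ln.min()
F = c - np.log(np.sum(phi * np.exp(-(n * Ln - c))) * dw)

post = phi * np.exp(-(n * Ln - c))
post /= post.sum() * dw

def predictive(y):                          # Bayes predictive p*(y) = Ew[ p(y|w) ]
    return np.sum(post * np.exp(-0.5 * (y - w) ** 2) / np.sqrt(2 * np.pi)) * dw

T = -np.mean([np.log(predictive(yi)) for yi in Y])          # training loss
yg = np.linspace(-6, 7, 1301)                               # generalization loss
dy = yg[1] - yg[0]
qy = np.exp(-0.5 * (yg - w0) ** 2) / np.sqrt(2 * np.pi)
G = -np.sum(qy * np.log([predictive(yi) for yi in yg])) * dy
print(T, G, F)
```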

Page 9

Summary 1

The purpose of statistical learning theory is to clarify the distributions of the training loss, the generalization loss, and the free energy.

Page 10

2 Regular Theory

Page 11

Regular Case

L(w) = - Exy[ log p(Y|X,w) ]

w0 is the parameter that minimizes L(w).

If L(w) can be approximated by a positive definite quadratic form in a neighborhood of w0, then the statistical estimation problem is called regular.

Page 12

In regular cases, statistical theory has long been established.

Definition. The positive definite matrices I and J are defined by

I = Exy[ ∇w log p(Y|X,w0) (∇w log p(Y|X,w0))^T ],

J = - Exy[ ∇w² log p(Y|X,w0) ].

Remark. If q(y|x) = p(y|x,w0), then I = J.

In regular cases, the distributions of the MLE, the MAP estimator, and the posterior concentrate at w0 as the sample size n tends to infinity; on this basis, regular statistical theory was established in the 1970s.
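The definitions of I and J can be checked by Monte Carlo for a toy regular model, here assumed to be p(y|x,w) = N(y; wx, 1) with a single parameter; in the well-specified case the two estimates should agree, as in the remark above.

```python
import numpy as np

rng = np.random.default_rng(2)
w0, N = 1.5, 200000
X = rng.normal(0, 1, N)
Y = rng.normal(w0 * X, 1.0)                 # well specified: q(y|x) = p(y|x, w0)

# for log p(y|x,w) = log N(y; wx, 1): ∂w log p = (y - wx) x, ∂w² log p = -x²
score = (Y - w0 * X) * X
I = np.mean(score ** 2)                     # I = E[(∂w log p)²]  (1×1 here)
J = np.mean(X ** 2)                         # J = -E[∂w² log p] = E[x²]
print(I, J)                                 # well specified ⇒ I ≈ J
```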

Page 13

Regular Theory : Training and Generalization

Theorem. In a regular case, the following hold, where d is the dimension of the parameter w.

Bayes:
E[G] = L(w0) + d/(2n) + o(1/n),
E[T] = L(w0) + { d - 2 tr(I J^{-1}) } / (2n) + o(1/n).

ML, MAP:
E[G] = L(w0) + tr(I J^{-1}) / (2n) + o(1/n),
E[T] = L(w0) - tr(I J^{-1}) / (2n) + o(1/n).

Page 14

Regular Case: Free Energy

Theorem. In a regular case, the following holds.

F = nLn(w0) + (d/2) log n +Op(1).
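The (d/2) log n growth can be observed in a model where F has a closed form. The conjugate Gaussian model below (N(y; w, 1) with a standard normal prior) is an assumption chosen for tractability; the gap F - n·Ln(ŵ) should grow by (1/2) log n per decade of n, since d = 1.

```python
import numpy as np

rng = np.random.default_rng(3)
w0 = 0.5

def free_energy_gap(n):
    """F - n·Ln(ŵ) for the model N(y; w, 1) with prior w ~ N(0, 1) (exact)."""
    Y = rng.normal(w0, 1.0, size=n)
    ybar = Y.mean()
    # F = -log ∫ φ(w) Π N(Yi; w, 1) dw has a closed form for this conjugate pair
    F = (0.5 * n * np.log(2 * np.pi) + 0.5 * np.sum(Y ** 2)
         + 0.5 * np.log(n + 1) - (n * ybar) ** 2 / (2 * (n + 1)))
    nLn_min = 0.5 * n * np.log(2 * np.pi) + 0.5 * np.sum((Y - ybar) ** 2)
    return F - nLn_min

gaps = {n: free_energy_gap(n) for n in (100, 1000, 10000)}
print(gaps)    # grows like (d/2) log n with d = 1
```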

Page 15

Regular Case : Information Criteria

Theorem. In a regular case, the following hold.

(1) (Akaike 1974, Takeuchi 1976) AIC = 2n Tn + 2 tr(I J^{-1}). Then, in ML, MAP, and Bayes, E[G] = E[AIC]/(2n) + o(1/n).

(2) (Schwarz, 1978) BIC = 2n Ln(w*) + d log n. Then F = BIC/2 + Op(1).

(3) (Stone, 1977) AIC is asymptotically equivalent to leave-one-out cross validation (LOOCV).
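A minimal sketch of computing AIC and BIC for a fitted Gaussian; the data and model are assumptions, and tr(I J^{-1}) = d is used, which holds in the well-specified case.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
Y = rng.normal(1.0, 2.0, size=n)

# fit N(mu, s2) by maximum likelihood; d = 2 parameters
mu, s2 = Y.mean(), Y.var()
nLn = 0.5 * n * (np.log(2 * np.pi * s2) + 1)   # n·Ln(w*) at the MLE
d = 2
AIC = 2 * nLn + 2 * d                # tr(I J^{-1}) = d in the well-specified case
BIC = 2 * nLn + d * np.log(n)
print(AIC, BIC)
```

Since log n > 2 for n = 200, the BIC penalty is heavier than the AIC penalty, as expected.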

Page 16

Summary 2

In regular cases, statistical learning theory was established in the 1970s.

Page 17

3 Neural Networks are Singular

Page 18

Singular Case

L(w) = - Exy[ log p(Y|X,w) ]

If L(w) cannot be approximated by any positive definite quadratic form in a neighborhood of the minimum of L(w), then the statistical estimation problem is called singular.

Page 19

By the 1990s, it was well known that neural networks are singular learning machines.

(1) K. Hagiwara, et al. Nonuniqueness of Connecting Weights and AIC in Multi-Layered Neural Networks. IEICE Transactions, D-II, 76(9), pp.2058-2065, 1993.

(2) S. Watanabe. A generalized Bayesian framework for neural networks with singular Fisher information matrices. Proc. of NOLTA, pp.207-210, 1995.

Page 20

Neural Networks have many singularities in parameter space.

Parameter space of a neural network.

Page 21

Hierarchical structure generates singularities.

Parameter space of a neural network.

Page 22

Hidden Markov Models

Normal mixtures

Probabilistic grammar

Bayesian Networks

Neural Networks


Matrix factorization

Almost all learning machines are singular.

Page 23

Neural networks are singular:

(1) The map from a parameter to a probability distribution is not one-to-one.

(2) The likelihood function cannot be approximated by any quadratic form.

(3) Neither the MLE nor the MAP estimator has asymptotic normality.

(4) The Bayes posterior distribution cannot be approximated by any normal distribution.
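Points (1) and (2) can be seen numerically for the one-hidden-unit model p(y|x,a,b) = N(y; a tanh(bx), 1) that appears later in the slides; the sample sizes and test parameters below are assumptions. At the true parameter (0, 0) the score vanishes identically, so the Fisher matrix is singular, and (a, b) ↦ (-a, -b) leaves the distribution unchanged.

```python
import numpy as np

rng = np.random.default_rng(5)
N = 100000
X = rng.uniform(-2, 2, N)
Y = rng.normal(0.0, 1.0, N)          # true: q(y|x) = p(y|x, a=0, b=0)

def score(a, b):
    """Gradient of log N(y; a·tanh(bx), 1) with respect to (a, b)."""
    r = Y - a * np.tanh(b * X)
    da = r * np.tanh(b * X)
    db = r * a * X / np.cosh(b * X) ** 2
    return np.stack([da, db])

# Fisher information I(w) = E[score scoreᵀ] at the true parameter (0, 0)
S = score(0.0, 0.0)
I = S @ S.T / N
print(np.linalg.det(I))              # 0.0 — the Fisher matrix is singular

# non-identifiability: (a, b) and (-a, -b) give the same regression function
f1 = 0.7 * np.tanh(1.3 * X)
f2 = -0.7 * np.tanh(-1.3 * X)
print(np.max(np.abs(f1 - f2)))       # 0 up to floating point
```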

Page 24

Summary 3

Neural Networks are Singular.

Page 25

4 Algebraic Geometry

Page 26

The Road from Learning Theory to Algebraic Geometry

Define K(w) = L(w) - L(w0), where w0 minimizes L(w).

The set {w ; K(w) = 0} contains many singularities.

There was no statistical theory for such sets. There was no probability theory. We need algebraic geometry.

Page 27

Neural Networks

Algebraic Geometry

Page 28

Neural Networks

Hyperfunction

D - module

Resolution Theorem

Zeta function

Empirical Process

Birational Invariants

Page 29

Resolution of Singularities (Hironaka's Theorem)

For any analytic K(w) ≥ 0 on the parameter set in R^d, there exist a manifold M and a map g : M → R^d, w = g(u), such that in each local coordinate

K(g(u)) = u1^{2k1} u2^{2k2} ··· ud^{2kd}.
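A toy blow-up illustrating the theorem (this is a single chart of one simple example, K(a,b) = a²b², chosen as an assumption; it is not Hironaka's general construction):

```python
import numpy as np

rng = np.random.default_rng(6)

def K(a, b):
    return a ** 2 * b ** 2           # toy singular K(w); its zero set is ab = 0

def g(u, v):
    return u, u * v                  # one chart of the blow-up at the origin

u = rng.uniform(-1, 1, 1000)
v = rng.uniform(-1, 1, 1000)
a, b = g(u, v)
lhs = K(a, b)                        # K(g(u, v))
rhs = u ** 4 * v ** 2                # normal crossing form u1^{2k1} u2^{2k2}
print(np.allclose(lhs, rhs))         # True
```

In this chart k1 = 2 and k2 = 1: the composed function is a monomial, which is the "common standard form" the next slide exploits.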

Page 30

The singular likelihood is put into a standard form.

By using K(g(u)) = u^{2k} = u1^{2k1} u2^{2k2} ··· ud^{2kd}, the log loss Ln(w) = -(1/n) Σ_{i=1}^n log p(Yi|Xi,w) is transformed into

n Ln(g(u)) - n Ln(w0) = n u^{2k} - n^{1/2} u^k ξn(u),

where ξn(u) converges in distribution to a Gaussian process.

Page 31

Summary 4

By the resolution theorem, any singular log likelihood function can be put into a common standard form on a manifold.

Page 32

5 General Theory

General theory contains regular theory as a special case.

Page 33

General Theory : MLE and MAP

Theorem. Assume that the parameter set is compact. In ML and MAP,

E[G] = L(w0) + µ / n + o(1/n),
E[T] = L(w0) - µ / n + o(1/n),

where

µ = Eξ[ max(0, ξ(u*))² ],
u* = argmax_u { max(0, ξn(u))² / 4 + log φ(g(u)) },

and max_u denotes the maximum over {u ; K(g(u)) = 0}. For the MLE, φ(w) is set to be a constant.

Proof. See Main Theorem 6.4 in Sumio Watanabe, Algebraic Geometry and Statistical Learning Theory, Cambridge University Press, 2009, p.211.

Page 34

General Theory : Bayes

Definition. The constants λ, m, and ν are defined as follows.

λ = min over local coordinates of min_{j=1,…,d} (hj + 1)/(2kj) : the real log canonical threshold (RLCT), where the hj are the exponents in the Jacobian |g'(u)| = b(u) |u1^{h1} ··· ud^{hd}|.

m = the number of j that attain the minimum : the multiplicity.

ν = Eξ[ ⟨ t^{1/2} ξ(u) ⟩ ] / 2 : the singular fluctuation.

Theorem. In Bayesian estimation,

E[G] = L(w0) + λ / n + o(1/n),
E[T] = L(w0) + (λ - ν) / n + o(1/n).

Proof. See Main Theorem 6.3 in Sumio Watanabe, Algebraic Geometry and Statistical Learning Theory, Cambridge University Press, 2009, p.179.
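λ and m can be estimated numerically from the volume scaling Vol{w : K(w) < t} ≈ c · t^λ (-log t)^{m-1} as t → 0. The sketch below assumes the toy example K(a,b) = a²b² on [-1,1]² with a uniform prior, whose RLCT is λ = 1/2 with multiplicity m = 2; dividing out an approximate logarithmic factor before the log-log fit recovers the power λ.

```python
import numpy as np

rng = np.random.default_rng(7)
N = 2000000
a = rng.uniform(-1, 1, N)
b = rng.uniform(-1, 1, N)
K = a ** 2 * b ** 2

# V(t) = Vol{w : K(w) < t} behaves like c · t^λ (-log t)^{m-1} as t → 0;
# for K = a²b², λ = 1/2 and m = 2.  Estimate λ by a log-log slope after
# dividing out the (approximate) logarithmic factor (-log t)^{m-1}.
ts = np.array([1e-4, 1e-3, 1e-2])
V = np.array([(K < t).mean() for t in ts])
slope = np.polyfit(np.log(ts), np.log(V) - np.log(-np.log(ts)), 1)[0]
print(slope)                          # ≈ 0.5
```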

Page 35

General Theory : Free Energy

Theorem. The following holds.

F = n Ln(w0) + λ log n - (m - 1) log log n + Op(1).

Proof. See Main Theorem 6.2 in Sumio Watanabe, Algebraic Geometry and Statistical Learning Theory, Cambridge University Press, 2009, p.174.

See also Sumio Watanabe, "Algebraic analysis for nonidentifiable learning machines", Neural Computation, Vol.13, No.4, pp.899-933, 2001.

Page 36

General Theory : Information Criteria

Theorem. Even in singular cases, the following hold.

(1) By the definition WAIC = T + (1/n) Σ_{i=1}^n Vw[ log p(Yi|Xi,w) ], it follows that E[G] = E[WAIC] + o(1/n).

(2) By the definition WBIC = Ew^{(1/log n)}[ n Ln(w) ], the posterior mean taken at inverse temperature 1/log n, it follows that F = WBIC + op(log n).

(3) (Drton et al., 2017) By estimating λ, sBIC = n Ln(w*) + λ log n. Then F = sBIC + op(log n).

(4) WAIC is asymptotically equivalent to LOOCV.
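WAIC as in definition (1) can be computed from any posterior representation. The sketch below assumes a grid posterior for a toy model N(y; w, 1) with a flat prior (data, grid, and prior are illustrative assumptions); for this regular model the WAIC penalty (1/n) Σ Vw[log p] is close to d/n = 1/n.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 100
Y = rng.normal(0.0, 1.0, size=n)

w = np.linspace(-3, 3, 2001)
dw = w[1] - w[0]
loglik = -0.5 * np.log(2 * np.pi) - 0.5 * (Y[:, None] - w[None, :]) ** 2
lse = loglik.sum(axis=0)
post = np.exp(lse - lse.max())               # flat prior on the grid
post /= post.sum() * dw

# p*(Yi) = Ew[p(Yi|w)] and posterior variance of log p(Yi|w), per data point
p_pred = (post * np.exp(loglik)).sum(axis=1) * dw
Ew_log = (post * loglik).sum(axis=1) * dw
Vw_log = (post * loglik ** 2).sum(axis=1) * dw - Ew_log ** 2

T = -np.mean(np.log(p_pred))                 # Bayes training loss
WAIC = T + np.mean(Vw_log)                   # T + (1/n) Σ Vw[log p(Yi|w)]
print(T, WAIC)
```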

Page 37

Information Criteria for Singular Models

(1) Sumio Watanabe. Equations of states in singular statistical estimation. Neural Networks, Vol.23, No.1, pp.20-34, 2010.

(2) Sumio Watanabe. A widely applicable Bayesian information criterion. JMLR, pp.867-897, 2013.

(3) Mathias Drton, Martyn Plummer. A Bayesian information criterion for singular models. J. R. Statist. Soc. B, Part 2, pp.1-38, 2017.

(4) Sumio Watanabe. Asymptotic equivalence of Bayes cross validation and widely applicable information criterion. JMLR, pp.3571-3594, 2010.

Page 38

An Experiment

Model: p(y|x,a,b) = (2π)^{-1/2} exp( -(1/2)(y - a tanh(bx))² ).
Prior: φ(a,b) ∝ 1.
True: q(y|x) = p(y|x,0,0), and q(x) is the uniform distribution on [-2, 2].

In this case, λ = 1, m = 2.

For n = 20, …, 450, the quantities BIC - nSn, WBIC - nSn, and F - nSn were compared with the theoretical value (Sn: empirical entropy).

[Figure: the four curves plotted as functions of n.]

Page 39

Neural Networks

A three-layer network with inputs x1, …, x10 and outputs y1, …, y10.

True network sizes (input, hidden, output): 10, 5, 10.
Candidate models: 10, (1, 3, 5, 7, 9), 10.

n = 200, n_test = 1000. The posterior was generated by a Langevin equation.
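A sketch of posterior sampling by an (unadjusted) Langevin equation, shown for the small tanh model rather than the full network of the slide; the step size, iteration counts, prior, and burn-in below are assumptions, and no Metropolis correction is applied.

```python
import numpy as np

rng = np.random.default_rng(9)
n = 200
X = rng.uniform(-2, 2, n)
Y = rng.normal(0.0, 1.0, n)          # true distribution: a = b = 0

def grad_U(a, b):
    """∇ of U(w) = n·Ln(w) - log φ(w), p(y|x) = N(y; a·tanh(bx), 1), φ = N(0, I)."""
    r = Y - a * np.tanh(b * X)
    ga = -np.sum(r * np.tanh(b * X)) + a
    gb = -np.sum(r * a * X / np.cosh(b * X) ** 2) + b
    return np.array([ga, gb])

# unadjusted Langevin dynamics: w ← w - (ε/2) ∇U(w) + √ε ξ,  ξ ~ N(0, I)
w = np.array([0.5, 0.5])
eps, samples = 1e-3, []
for t in range(20000):
    w = w - 0.5 * eps * grad_U(*w) + np.sqrt(eps) * rng.normal(size=2)
    if t > 5000:                     # discard burn-in
        samples.append(w.copy())
samples = np.asarray(samples)
print(samples.mean(axis=0), samples.std(axis=0))
```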

Page 40

Experimental results for 10 trials

[Figure: four panels, the generalization loss, WAIC, LOOCV, and AIC, each plotted against the candidate models.]

Page 41

Recent Advances of Singular Learning Theory

(1) Keisuke Yamazaki. Asymptotic accuracy of Bayes estimation for latent variables with redundancy. Machine Learning, 102, pp.1-28, 2016. DOI 10.1007/s10994-015-5482-3.

(2) Keisuke Yamazaki, Daisuke Kaji. Comparing two Bayes methods based on the free energy functions in Bernoulli mixtures. Neural Networks, 44, pp.36-43, 2013.

(3) Miki Aoyagi. Learning coefficient in Bayesian estimation of restricted Boltzmann machine. Journal of Algebraic Statistics, Vol.4, No.1, pp.30-57, 2013.

(4) Miki Aoyagi, Kenji Nagata. Learning coefficient of generalization error in Bayesian estimation and Vandermonde matrix type singularity. Neural Computation, Vol.24, No.6, pp.1569-1610, 2012.

(5) Miki Aoyagi, Sumio Watanabe. Stochastic complexities of reduced rank regression in Bayesian estimation. Neural Networks, No.18, pp.924-933, 2005.

(6) Kazuho Watanabe. An alternative view of variational Bayes and asymptotic approximations of free energy. Machine Learning, 86(2), pp.273-293, 2012.

(7) Shinichi Nakajima, Masashi Sugiyama. Theoretical analysis of Bayesian matrix factorization. Journal of Machine Learning Research, 12, pp.2583-2648, 2011.

(8) Naoki Hayashi, Sumio Watanabe. Upper bound of Bayesian generalization error in non-negative matrix factorization. Neurocomputing, Vol.266C, pp.21-28, 2017.

Page 42

Summary 5

A general theory that contains both singular and regular cases was established by using algebraic geometry.

Singularities make the generalization error very small. Neural networks utilize singularities.

Page 43

6 Conclusion

The generalization problem of neural networks was clarified by algebraic geometry.