
Communicated by Haim Sompolinsky

Four Types of Learning Curves

Shun-ichi Amari Naotake Fujita Department of Mathematical Engineering and Information Physics, University of Tokyo, Tokyo 113, Japan

Shigeru Shinomoto Department of Physics, Kyoto University, Kyoto 606, Japan

If machines are learning to make decisions given a number of examples, the generalization error ε(t) is defined as the average probability that an incorrect decision is made for a new example by a machine when trained with t examples. The generalization error decreases as t increases, and the curve ε(t) is called a learning curve. The present paper uses the Bayesian approach to show that, given the annealed approximation, learning curves can be classified into four asymptotic types. If the machine is deterministic with noiseless teacher signals, then (1) ε ~ a t^{-1} when the correct machine parameter is unique, and (2) ε ~ a t^{-2} when the set of the correct parameters has a finite measure. If the teacher signals are noisy, then (3) ε ~ a t^{-1/2} for a deterministic machine, and (4) ε ~ c + a t^{-1} for a stochastic machine.

1 Introduction

A number of approaches have been proposed for machine learning. A classical example is the perceptron algorithm proposed by Rosenblatt (1961), for which a convergence theorem was given. A general theory of parametric learning was proposed by Amari (1967), Rumelhart et al. (1986), White (1989), and others, based on the stochastic gradient descent algorithm. See, for example, Amari (1990) for a review of the mathematical theory of neurocomputing.

A new framework of PAC learning was proposed by Valiant (1984), in which both the computational complexity and stochastic evaluation of performance are taken into account. The theory was successfully applied to neural networks by Baum and Haussler (1989), where the VC dimension of a dichotomy class plays an important role. However, the framework is too restrictive, and Haussler et al. (1988) studied the general convergence rate of a learning curve by removing the algorithmic complexity constraint, while Baum (1990) has attempted to remove the worst case constraint on the probability distribution.

Neural Computation 4, 605-618 (1992) © 1992 Massachusetts Institute of Technology


A different approach is taken by Levin et al. (1990), in which the statistical mechanical approach is coupled with the Bayesian approach. See also Schwartz et al. (1990). A generalization error is defined by the probability that a machine that has been trained with t examples misclassifies a novel example. The statistical average of the generalization error over randomly generated examples is formulated using the Bayes formula. This theory can also be viewed as a straightforward application of the predictive minimum description length method proposed by Rissanen (1986). However, it is in general difficult to calculate the generalization error, so the "annealed approximation" is suggested (Levin et al. 1990).

The same problem has been treated by physicists (e.g., Hansel and Sompolinsky 1990; Sompolinsky et al. 1990; Györgyi and Tishby 1990; Seung et al. 1992). They use the techniques of statistical mechanics such as the thermodynamic limit, replica method, and annealed approximation to evaluate the average generalization error ε(t). They have succeeded in obtaining an asymptotic form of ε(t) and its phase transition for some specific models to which their methods are applicable.

In the present paper, we also discuss the average generalization error under the annealed approximation in the Bayesian framework. It is not necessary, however, to use a statistical-mechanical framework or to assume a Gibbs-type probability distribution. Our theory is statistical and is applicable to more general models beyond the limits of applicability of physical methods such as the thermodynamical limit and the replica method. We obtain four types of asymptotic behaviors in the way ε(t) decreases with t when t is large. The results are in agreement with those obtained by other methods for specific models. The asymptotic behavior does not depend on a specific structure of the target function, or on a specific architecture of the machine; it is universal in this sense. The asymptotic behavior of a learning curve depends only on whether the teacher signals are noisy or not, whether the machine is deterministic or stochastic, and whether there is a unique correct machine.

The main concern of the present paper is the deterministic case. An exact analysis of stochastic cases will be given in a forthcoming paper (Amari and Murata 1992) without using the annealed approximation.

2 Main Results

The problem is stated as follows. Let us consider a dichotomy of an n-dimensional Euclidean space R^n,

R^n = D^+ ∪ D^-,  D^+ ∩ D^- = ∅

where x ∈ D^+ is called a positive example and x ∈ D^- a negative example. A target signal y accompanies each x, where y = 1 for a positive example and y = -1 for a negative example. Given t randomly chosen examples x_1, x_2, ..., x_t independently drawn from a probability distribution p(x), together with corresponding target signals y_1, ..., y_t, a learning


machine is required to estimate the underlying dichotomy. The machine is evaluated by its generalization error ε(t), that is, the probability that the next example x_{t+1} produced by the same probability distribution is misclassified by the machine. We evaluate the average generalization error under the so-called annealed approximation and give universal theorems on the convergence rate of ε(t) as t tends to infinity.

A machine considered here is specified by a set of continuous parameters w = (w_1, ..., w_m) ∈ R^m, and it calculates a function f(x, w). When the output of a machine is uniquely determined by the signum of f(x, w), the machine is said to be deterministic. In the deterministic case, the function f(x, w) specifies a dichotomy by

D^+ = {x | f(x, w) > 0} and D^- = {x | f(x, w) ≤ 0}

If the output is not deterministic but is given by a probability that is specified as a function of f(x, w), then it is said to be stochastic. A deterministic or stochastic neural network with modifiable synaptic weights gives a typical example of such a machine. For example, a layered feedforward neural network calculates a dichotomy function f(x, w), where w is a vector summarizing all the modifiable synaptic connection weights.
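The deterministic/stochastic distinction can be made concrete with a small sketch (hypothetical Python; a linear f(x, w) = w·x and a logistic output probability are assumed purely for illustration):

```python
import math
import random

def f(x, w):
    # Illustrative dichotomy function: f(x, w) = w . x.
    return sum(wi * xi for wi, xi in zip(w, x))

def k_logistic(u, beta):
    # A monotone k with k(0) = 1/2 (numerically stable logistic).
    z = beta * u
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    return math.exp(z) / (1.0 + math.exp(z))

def deterministic_output(x, w):
    # Deterministic machine: the output is the signum of f(x, w).
    return 1 if f(x, w) > 0 else -1

def stochastic_output(x, w, beta=2.0, rng=random):
    # Stochastic machine: P(y = 1 | x, w) = k[f(x, w)].
    return 1 if rng.random() < k_logistic(f(x, w), beta) else -1
```

As β grows, the stochastic machine approaches the deterministic one; at f(x, w) = 0 it outputs ±1 with equal probability.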

The main subject of the present paper is the asymptotic learning behavior of deterministic machines. The smoothness or differentiability of the dichotomy functions f(x, w) is not required in the deterministic case, so the theory is applicable to multilayer deterministic threshold-element networks as well as to analog-element neural networks. We also discuss stochastic cases to compare the difference in their asymptotic behaviors. Since we use the regular statistical estimation technique, the smoothness of f(x, w) is required in the stochastic case to guarantee the existence of the Fisher information matrix. This type of network is a generalization of the fully smooth network introduced by Sompolinsky et al. (1990) to the case where the network functions are smooth except for a threshold operation at the output.

Suppose that there exists a parameter w_0 such that the true machine calculates f(x, w_0) and generates the teacher signal y based on it. In some deterministic cases, there exists a set of parameters all of which give the correct classification behavior. A typical case is that there is a neutral zone between the set of positive examples and the set of negative examples where no input signals are generated. We treat both the case that a unique correct classifier exists and the case that a set of correct classifiers exists.

The teacher signal y is said to be noiseless if y is given by the sign of f(x, w_0), and noisy if y is stochastically produced depending on the value f(x, w_0), irrespective of whether the machine itself is deterministic or stochastic.


The following are the main results on the asymptotic behaviors of learning curves under the Bayesian framework and the annealed approximation.

Case 1. The average generalization error behaves asymptotically as

ε(t) ~ m/t

when a machine is deterministic, the teacher signal is noiseless, and the machine giving correct classification is uniquely specified by the m-dimensional parameter w_0.

Case 2. The average generalization error behaves asymptotically as

ε(t) ~ C/t^2

when a machine is deterministic, the teacher signal is noiseless, and the set of correct classifiers has finite measure in the parameter space.

Case 3. The average generalization error behaves asymptotically as

ε(t) ~ D/√t

when a machine is deterministic with a unique correct machine, but the teacher signal is noisy.

Case 4. The average generalization error behaves asymptotically as

ε(t) ~ ε_0 + a/t

when a machine is stochastic.

3 The Average Generalization Error

We review here the Bayesian framework of learning along the lines of Levin et al. (1990). However, it is not necessary to use the statistical-mechanical framework or to assume a Gibbs-type distribution.

Let p(y | x, w) be the probability that a machine specified by w generates output y when x is input. In the stochastic case it is given by a monotone function k(f) of f,

P(y = 1 | x, w) = k[f(x, w)],  0 ≤ k(f) ≤ 1,  k(0) = 1/2

In the deterministic case,

p(y | x, w) = Θ[y f(x, w)]

where Θ(z) = 1 when z > 0 and 0 otherwise; that is, p(y | x, w) is equal to 1 when y f(x, w) > 0 and is otherwise 0. Let q(w) be a prior distribution


of parameter w. Then, the joint probability density that the parameter w is chosen and t examples of input-output pairs

ξ^(t) = [(x_1, y_1), (x_2, y_2), ..., (x_t, y_t)]

are generated by the machine is

p(w, ξ^(t)) = q(w) ∏_{i=1}^{t} p(y_i | x_i, w) p(x_i)

By using the Bayes formula, the posterior probability density of w is given by

p(w | ξ^(t)) = q(w) ∏_{i=1}^{t} p(y_i | x_i, w) / Z(ξ^(t))

where

Z(ξ^(t)) = ∫ q(w) ∏_{i=1}^{t} p(y_i | x_i, w) dw

is the probability measure of w's generating (y_1, ..., y_t) when inputs (x_1, ..., x_t) are chosen.

In the deterministic case, the probability

Z_t = Z(ξ^(t)) = ∫ q(w) ∏_{i=1}^{t} Θ[y_i f(x_i, w)] dw

is the measure of such w that are compatible with t examples ξ^(t), that is, those w satisfying y_i f(x_i, w) > 0 for all i = 1, ..., t. Therefore, the smaller this is, the easier it is to estimate the w that resolves the dichotomy. In the stochastic case, the probability Z(ξ^(t)) can also be used as a measure of identifiability of the true w. The quantity Z_t is related to the partition function of the Gibbs distribution in the special but important case studied by Levin et al. (1990), or in the physicist approach with the thermodynamical limit of the dimensionality of w tending to infinity (Seung et al. 1992). The quantity defined here is more general, although we use the same notation Z_t.
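To make Z_t tangible, it can be estimated by Monte Carlo as the prior measure of compatible parameters. A minimal sketch for a hypothetical one-dimensional threshold machine f(x, w) = x - w with a uniform prior on [0, 1] (this toy model is our own illustration, not one analyzed in the paper):

```python
import random

def z_t(examples, n_samples=20000, seed=0):
    # Z_t = prior measure (uniform on [0, 1]) of parameters w that are
    # compatible with all t examples: y_i * f(x_i, w) > 0, f(x, w) = x - w.
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        w = rng.random()
        if all(y * (x - w) > 0 for x, y in examples):
            hits += 1
    return hits / n_samples

# Noiseless teacher with true threshold w0 = 0.5.
rng = random.Random(1)
w0 = 0.5
examples = [(x, 1 if x > w0 else -1) for x in (rng.random() for _ in range(20))]

# For this machine, Z_t is exactly the length of the gap between the
# largest negative example and the smallest positive example.
lo = max((x for x, y in examples if y == -1), default=0.0)
hi = min((x for x, y in examples if y == 1), default=1.0)
exact = hi - lo
estimate = z_t(examples)
```

As t grows, the compatible interval, and hence Z_t, shrinks, so the posterior concentrates around w_0.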

The generalization error ε_t* based on t examples ξ^(t) is defined, in the deterministic case, as the probability that a machine that classifies t examples ξ^(t) correctly fails to correctly classify a new example x_{t+1}. This is given by

ε_t* = Prob{y_{t+1} f(x_{t+1}, w) < 0 | y_i f(x_i, w) > 0, i = 1, ..., t} = 1 - Z_{t+1}/Z_t

because

Prob{y_{t+1} f(x_{t+1}, w) > 0 | y_i f(x_i, w) > 0, i = 1, ..., t}
  = Prob{y_i f(x_i, w) > 0, i = 1, ..., t, t+1} / Prob{y_i f(x_i, w) > 0, i = 1, ..., t}


This quantity can also be considered as the generalization error in the stochastic case, because

Z_{t+1}/Z_t = Prob{y_{t+1} | ξ^(t), x_{t+1}}

is the probability that the machine will correctly output y_{t+1} given x_{t+1} under the condition that t examples ξ^(t) have been observed.

under the condition that t examples (('1 have been observed. The generalization error .cf* is a random variable depending on the

randomly generated examples The average generalization error E( t) is the average of E ~ * over all the possible examples $f) and a new pair ( Y t + l l X t + I )

E ( t ) = ( E ' * ) = 1 - (Zl+l /ZJ ( ) denoting the expectation with respect to [ ( t -t 1) = [ [ ( t ) , (y,+l,xt+l)]. This quantity is closely related to the stochastic complexity E: introduced by Rissanen (1986),

E: = -(ln(l - E p ) ) = (lnz,) - (lnZ,+1)

The actual evaluation of quantities such as ⟨Z_{t+1}/Z_t⟩ and ⟨ln Z_t⟩ is generally a very hard problem and has been carried out only for a few model systems (see, for example, Hansel and Sompolinsky 1990; Sompolinsky et al. 1990; Györgyi and Tishby 1990). We will show an exact example later. To obtain a rough estimate of ε(t) or ε_t^c, we introduce the approximations

⟨Z_{t+1}/Z_t⟩ ≈ ⟨Z_{t+1}⟩/⟨Z_t⟩ and ⟨ln Z_t⟩ ≈ ln⟨Z_t⟩

called the "annealed average" (Levin et al. 1990; see also Schwartz et al. 1990). The approximations are valid if Z_t does not depend sensitively on the most probable (x_1, ..., x_t). For this reason, we may call it the random phase approximation. The validity of the approximation is still open, and we will return to this point in the final section (see also Seung et al. 1992). It is easy to show under the approximation that the average generalization error ε(t) and the stochastic complexity ε_t^c are closely related in the asymptotic limit t → ∞ in that

ε(t) ≃ ε_t^c

provided ε(t) → 0. [It is proved in Amari and Murata (1992) that the annealed approximation for ε_t^c gives a correct result in the stochastic case. See also Amari (1992) for the deterministic case.] Thus the remaining work is based on the evaluation of the average phase volume ⟨Z_t⟩.

4 Case 1: A Unique Correct Deterministic Machine with a Noiseless Teacher

The expectation ⟨Z_t⟩ is calculated for a deterministic machine as follows. Let s(w) be the probability that a machine specified by w classifies a randomly chosen x correctly, that is, as the true classifier specified by w_0 does, and hence,

s(w) = Prob{f(x, w) f(x, w_0) > 0}

Since y_i is the signum of f(x_i, w_0),

⟨Z_t⟩ = Prob{y_i f(x_i, w) > 0, i = 1, ..., t}
      = ∫ q(w) Prob{y_i f(x_i, w) > 0, i = 1, ..., t | w} dw
      = ∫ q(w) {s(w)}^t dw

where the last equality follows because y_i = sgn f(x_i, w_0) and because the events f(x_i, w) f(x_i, w_0) > 0, i = 1, ..., t, are conditionally independent when w is fixed.

When w is slightly deviated from the true w_0 in a unit direction e, |e| = 1,

w = w_0 + r e

the regions D^+(w) and D^-(w) are slightly deviated from the true D^+(w_0) and D^-(w_0). The classifier with w misclassifies those examples that belong to ΔD, the symmetric difference between D^+(w) and D^+(w_0). Therefore, we have

s(w) = 1 - ∫_{ΔD} p(x) dx

We assume that the directional derivative

a(e) = lim_{r→0} (1/r) ∫_{ΔD} p(x) dx

exists and is strictly positive for any direction e. This holds when the probability of x belonging to ΔD caused by a small change Δw in w is in proportion to |Δw|, irrespective of the differentiability of f(x, w). Note that s(w) is usually not differentiable at w = w_0 in the deterministic case.

We use a method similar to the saddle point approximation to calculate ⟨Z_t⟩, namely,

⟨Z_t⟩ = ∫ exp{t[log s(w) + (1/t) log q(w)]} dw

and by expanding

log s(w) = -a(e) r + o(r)

and neglecting smaller order terms when q(w) is regular, then for large t,

⟨Z_t⟩ ≈ ∫ exp{-t a(e) r} dw

Since the volume element dw can be written

dw = r^{m-1} dr dΩ


where dΩ is an angular volume element, then

⟨Z_t⟩ = ∫ exp{-t a(e) r} r^{m-1} dr dΩ = c t^{-m}

where

c = Γ(m) q(w_0) ∫ a(e)^{-m} dΩ

is a constant. From this, we have

ε(t) = 1 - ⟨Z_{t+1}⟩/⟨Z_t⟩ = 1 - {t/(t+1)}^m ≈ m/t

proving the following theorem.

Theorem 1. Given the annealed approximation, a noiseless teacher, and a unique w_0, the average generalization error of a deterministic machine decreases according to the universal formula

ε(t) ≃ m/t

where m is the dimension of w.

Remark. We have assumed, as regularity conditions in deriving the above result, the existence of a nonzero directional derivative a(e) and a regular prior distribution q(w). These conditions hold in usual situations; however, it is possible to extend our result to more general cases.

When the set of correct classifiers forms a k-dimensional submanifold, we have

⟨Z_t⟩ ∝ t^{-(m-k)}

so that

ε(t) ~ (m - k)/t

In the case where the probability distribution p(x) is extremely densely concentrated on, or sparsely distributed in, the neighborhood of the boundary of D^+ and D^-, we have the expansion

s(w) ≈ 1 - a(e) r^α,  α > 0

The result in this case is

ε(t) ~ m/(α t)

so that the 1/t law still holds, in agreement with results obtained by other methods for many models (Haussler et al. 1988; Sompolinsky et al. 1990).
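The m/t law can be illustrated numerically. The sketch below uses a hypothetical one-dimensional threshold machine f(x, w) = x - w with x uniform on [0, 1] (so m = 1) and a Gibbs-style learner that draws w uniformly from the set of parameters consistent with the t examples; the setup is our own toy illustration, not a model from the paper:

```python
import random

def avg_gen_error(t, trials=4000, w0=0.5, seed=42):
    # Average generalization error for f(x, w) = x - w with a noiseless
    # teacher at w0 and x uniform on [0, 1].  The posterior is uniform on
    # the gap between the extreme examples on either side of w0, and the
    # error of a sampled w is |w - w0| (the measure of the misclassified
    # interval between the two thresholds).
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        xs = [rng.random() for _ in range(t)]
        lo = max((x for x in xs if x <= w0), default=0.0)
        hi = min((x for x in xs if x > w0), default=1.0)
        w = lo + (hi - lo) * rng.random()  # draw from the posterior gap
        total += abs(w - w0)
    return total / trials

e10, e40 = avg_gen_error(10), avg_gen_error(40)
```

Quadrupling t cuts the averaged error by roughly a factor of four, consistent with the 1/t law.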


5 Case 2: Deterministic Case with a Noiseless Teacher, Where a Finite Measure of Correct Classifiers Exists

In this case, s(w) = 1 for w ∈ S_0, where S_0 is the set of correct classifiers. We assume as a regularity condition that S_0 is a connected region having a piecewise smooth boundary. Moreover, we assume that if

w = w_s + r e_s

where w_s is the value of w at position s on the boundary ∂S_0 and e_s is the unit outward normal vector at w_s, then s(w) can be expanded as

s(w) = 1 - a(s) r + o(r)

The calculation of ⟨Z_t⟩ proceeds in this case as

⟨Z_t⟩ = ∫_{S_0} q(w) dw + ∫∫ q(w_s) exp{-t a(s) r} dr ds = P_0 + C'/t

where P_0 is the measure of S_0 and

C' = ∫ q(w_s) a(s)^{-1} ds

From this it follows that

ε(t) = 1 - {P_0 + C'/(t+1)}/{P_0 + C'/t} ≈ B/t^2

where

B = C'/P_0

Hence the following theorem.

Theorem 2. If S_0 has a finite measure P_0 > 0, the convergence rate of ε(t) for a deterministic machine is

ε(t) ~ B/t^2

where B is a constant depending on P_0 and the function f(x, w).

Note that when S_0 tends to a point w_0, P_0 tends to 0. This implies that B tends to infinity, and the asymptotic behavior changes to that of Theorem 1, where a phase transition takes place.

Remark. The above result is obtained from the annealed approximation of ⟨Z_{t+1}/Z_t⟩. The above error probability ε(t) is, roughly speaking, based on the learning scheme where at each time one chooses a machine randomly that correctly classifies the t examples ξ^(t). However, the behavior is exponential,

ε(t) ~ exp{-c t}

if the learning scheme is to choose a machine randomly such that it correctly classifies ξ^(t), keep it if it correctly classifies the (t + 1)st example, and, if it does not, choose another machine randomly that does correctly classify the (t + 1) examples ξ^(t+1). This is known as perfect generalization (Seung et al. 1992).

6 Case 3: A Deterministic Machine with a Noisy Teacher

This section treats the case where the true classifier is unique and is a deterministic machine with parameter w_0, but the teacher signals include stochastic error. The following is a typical example: The correct answer is 1 when f(x, w_0) > 0 and -1 when f(x, w_0) < 0, but the teacher signal y is 1 with probability k[f(x, w_0)] and is -1 with probability 1 - k(f). A typical function k is given by

k(u) = 1/(1 + exp{-βu})

where 1/β is the so-called "temperature." In this case, we cannot usually find any w consistent with t examples ξ^(t) when t is large. We use instead a statistical estimator ŵ_t from t examples.

From the statistical point of view, the problem is to estimate an unknown parameter vector w from t independent observations (x_i, y_i), i = 1, ..., t, drawn from the probability distribution specified by w,

r(x, y; w) = p(y | x, w) p(x)

The Fisher information matrix G is defined by

G(w) = E[ (∂ log r(x, y; w)/∂w) (∂ log r(x, y; w)/∂w)^T ]

where E is the expectation with respect to the distribution r(x, y; w), ∂/∂w denotes the gradient column vector, and the superscript T denotes transposition. When the Fisher information exists, the estimation problem is regular. We assume that the problem is regular, which requires the differentiability of f(x, w). Let ŵ_t be the maximum likelihood estimator from t examples. It is well known that the covariance matrix of the maximum likelihood estimator ŵ_t is asymptotically given by

Cov(ŵ_t) ≈ (1/t) G^{-1}


where G is the Fisher information matrix. The Fisher information matrix is explicitly given by

G = β^2 ∫ k(1 - k) (∂f/∂w)(∂f/∂w)^T p(x) dx

where k = k(f) and f = f(x, w) (see Amari 1991; Amari and Murata 1992). The expectation of the generalization error is then given by

ε(t) = 1 - ⟨s(ŵ_t)⟩ ≈ D/√t

where

D = ∫ a(e) (e G^{-1} e^T)^{1/2} dΩ

Theorem 3. If the teacher signals include errors, then the average generalization error ε(t) is asymptotically given by

ε(t) ~ D/√t

This convergence rate coincides with the one obtained by Hansel and Sompolinsky (1990). Here, the error probability is evaluated for a deterministic machine. When the temperature 1/β tends to 0, the teacher becomes noiseless. It should be noted that the Fisher information G then tends to infinity in proportion to β^2, and hence D tends to 0 in this limit. The asymptotic behavior then changes to that of Theorem 1, a phase transition taking place.
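The Fisher information formula above can be checked numerically. A sketch for a hypothetical one-parameter machine f(x, w) = w x with x ~ N(0, 1) and the logistic k(u) of this section; it verifies the information identity, i.e. that the score has mean zero and that E[score^2] agrees with the β^2 ∫ k(1 - k)(∂f/∂w)^2 p(x) dx expression:

```python
import math
import random

def k(u, beta):
    # Logistic k(u) = 1 / (1 + exp(-beta * u)), with k(0) = 1/2.
    return 1.0 / (1.0 + math.exp(-beta * u))

def fisher_check(w0=0.7, beta=2.0, n=200000, seed=0):
    # For p(y | x, w) = k(y * f(x, w)) with f(x, w) = w * x, the score
    # at w = w0 is  d/dw log p = beta * y * x * k(-y * f).
    rng = random.Random(seed)
    score_sum = score_sq = integrand = 0.0
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        fx = w0 * x
        y = 1 if rng.random() < k(fx, beta) else -1
        s = beta * y * x * k(-y * fx, beta)
        score_sum += s
        score_sq += s * s
        # Monte Carlo version of beta^2 * k * (1 - k) * (df/dw)^2:
        integrand += beta**2 * k(fx, beta) * (1.0 - k(fx, beta)) * x * x
    return score_sum / n, score_sq / n, integrand / n

mean_score, g_empirical, g_formula = fisher_check()
```

Here g_empirical and g_formula estimate the same Fisher information G in two ways; their agreement is the identity E[(∂ log r/∂w)^2] = G used in the text.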

7 Case 4: A Stochastic Machine

In the case of a stochastic machine, the teacher signals are also stochastic. The error probability ε(t) never tends to 0 in this case, but instead converges to some ε_0 > 0.

We have

⟨Z_t⟩ = ∫ exp{t log s(w)} dw

where

s(w) = ∫ p(y | x, w) p(y | x, w_0) p(x) dx dy

Since s(w) is smooth in this case, we have the following expansion at its maximum w'_0,

s(w) ≈ c - (w - w'_0) K (w - w'_0)^T

with a constant c and a positive definite matrix K. Hence,

⟨Z_t⟩ ~ c^t t^{-m/2}


so that

ε(t) = 1 - ⟨Z_{t+1}⟩/⟨Z_t⟩ ≈ (1 - c) + c m/(2t)

in agreement with Sompolinsky et al. (1990) and others.

Theorem 4. For a stochastic machine, the generalization error behaves as

ε(t) ~ ε_0 + a/t

8 Discussions

We have thus obtained four typical asymptotic laws of the generalization error ε(t) under the annealed approximation. However, the validity of the annealed approximation is questionable, as is discussed in Seung et al. (1992). Györgyi and Tishby (1990) give a different result for a simple perceptron model based on the replica method, whose validity is not guaranteed. In order to see the validity of the approximation, we calculate the exact ε(t) for the following simple example: Consider predicting a half space of R^2, where signals x = (x_1, x_2) are normally distributed with mean 0 and the identity covariance matrix, w is a scalar having a uniform prior q(w), and

f(x, w) = x_1 cos w + x_2 sin w

In this special case, the probability density function of Z_t is given by

p(Z_t) = 4t^2 Z_t exp{-2t Z_t}

By calculating Z_{t+1} and averaging it over x_{t+1}, the random variable ε_t* is found to be

ε_t* = (u^2 + v^2)/[2(u + v)]

where u and v are independent random variables subject to the same density function

p(u) = t exp{-tu}

From this, we have the asymptotically exact result

ε(t) = 2/(3t)

while the annealed approximation gives ε(t) ~ 1/t. On the other hand, we have

⟨log Z_t⟩ = c - log t

so that

ε_t^c = ⟨log Z_t⟩ - ⟨log Z_{t+1}⟩ ≈ 1/t

where the annealed approximation holds.
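The exact coefficient 2/3 in ε(t) = 2/(3t) can be confirmed by direct Monte Carlo over the representation ε_t* = (u^2 + v^2)/[2(u + v)], with u and v independent with density t exp(-tu) (a numerical sketch):

```python
import random

def mc_eps(t, n=200000, seed=0):
    # Estimate E[(u^2 + v^2) / (2 (u + v))] with u, v ~ Exp(rate t).
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        u = rng.expovariate(t)
        v = rng.expovariate(t)
        total += (u * u + v * v) / (2.0 * (u + v))
    return total / n

# The averages approach 2 / (3 t), while the annealed value is 1 / t.
estimates = {t: mc_eps(t) for t in (5, 20)}
```

The sampled averages land close to 2/(3t) and well below the annealed estimate 1/t.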


This shows that the approximation gives the same order t^{-1} but a different factor. It is interesting to see how the difference depends on the number m of parameters in w.

Looking from the point of view of statistical inference, the deterministic case and the stochastic case are quite different. The estimator ŵ_t from t examples is usually subject to a normal distribution with a covariance matrix of order 1/t in the stochastic case. However, in the deterministic case, ŵ_t is usually not subject to a normal distribution, and the squared error usually shows a stronger convergence. This is because the manifold of probability distributions has a Riemannian structure in the stochastic case (Amari 1985), while it has a Finslerian structure in the deterministic case (Amari 1987).

This suggests a difference in the validity of the annealed approximation in the two cases. We will discuss this point in more detail in forthcoming papers (Amari and Murata 1992; Amari 1992).

Acknowledgments

The authors would like to thank Dr. Kevin Judd for valuable comments on the manuscript. This work was supported in part by Grant-in-Aid for Scientific Research on Priority Areas on “Higher-Order Brain Functions,” the Ministry of Education, Science and Culture of Japan.

References

Amari, S. 1967. Theory of adaptive pattern classifiers. IEEE Trans. EC-16(3), 299-307.

Amari, S. 1985. Differential-Geometrical Methods in Statistics. Springer Lecture Notes in Statistics 28. Springer, New York.

Amari, S. 1987. Dual connections on the Hilbert bundles of statistical models. In Geometrization of Statistical Theory, C. T. J. Dodson, ed., pp. 123-152. ULDM, Lancaster, UK.

Amari, S. 1990. Mathematical foundations of neurocomputing. Proc. IEEE 78, 1443-1463.

Amari, S. 1991. Dualistic geometry of the manifold of higher-order neurons. Neural Networks 4, 443-451.

Amari, S. 1992. A universal theorem on learning curves. To appear.

Amari, S., and Murata, N. 1992. Predictive entropies and learning curves. To appear.

Baum, E. B. 1990. The perceptron algorithm is fast for nonmalicious distributions. Neural Comp. 2, 248-260.

Baum, E. B., and Haussler, D. 1989. What size net gives valid generalization? Neural Comp. 1, 151-160.


Györgyi, G., and Tishby, N. 1990. Statistical theory of learning a rule. In Neural Networks and Spin Glasses, W. K. Theumann and R. Koberle, eds., pp. 3-36. World Scientific, Singapore.

Haussler, D., Littlestone, N., and Warmuth, M. K. 1988. Predicting {0,1} functions on randomly drawn points. Proc. COLT '88, pp. 280-295. Morgan Kaufmann, San Mateo, CA.

Hansel, D., and Sompolinsky, H. 1990. Learning from examples in a single-layer neural network. Europhys. Lett. 11, 687-692.

Levin, E., Tishby, N., and Solla, S. A. 1990. A statistical approach to learning and generalization in layered neural networks. Proc. IEEE 78, 1568-1574.

Rissanen, J. 1986. Stochastic complexity and modeling. Ann. Statist. 14, 1080-1100.

Rosenblatt, F. 1961. Principles of Neurodynamics. Spartan, Washington, D.C.

Rumelhart, D., Hinton, G. E., and Williams, R. J. 1986. Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, I: Foundations. MIT Press, Cambridge, MA.

Schwartz, D. B., Samalam, V. K., Solla, S. A., and Denker, J. S. 1990. Exhaustive learning. Neural Comp. 2, 374-385.

Seung, H. S., Sompolinsky, H., and Tishby, N. 1992. Statistical mechanics of learning from examples. To appear.

Sompolinsky, H., Seung, S., and Tishby, N. 1990. Learning from examples in large neural networks. Phys. Rev. Lett. 64, 1683-1686.

Valiant, L. G. 1984. A theory of the learnable. Comm. ACM 27, 1134-1142.

White, H. 1989. Learning in artificial neural networks: A statistical perspective. Neural Comp. 1, 425-464.

Received 24 May 1991; accepted 15 November 1991