
Self-Organizing Maps: Stationary States, Metastability and Convergence Rate

E. Erwin, K. Obermayer and K. Schulten
Beckman Institute and Department of Physics
University of Illinois at Urbana-Champaign
Urbana, IL 61801, U.S.A.
Tel: (217)-244-1613
Fax: (217)-244-8371
E-mail: [email protected]

Abstract

We investigate the effect of various types of neighborhood functions on the convergence rates and on the presence or absence of metastable stationary states of Kohonen's self-organizing feature map algorithm in one dimension. We demonstrate that the time necessary to form a topographic representation of the unit interval [0, 1] may vary over several orders of magnitude depending on the range and also on the shape of the neighborhood function by which the weight changes of the neurons in the neighborhood of the winning neuron are scaled. We prove that for neighborhood functions which are convex on an interval given by the length of the Kohonen chain there exist no metastable states. For all other neighborhood functions, metastable states are present and may trap the algorithm during the learning process. For the widely used Gaussian function there exists a threshold for the width above which metastable states cannot exist. Due to the presence or absence of metastable states, convergence time is very sensitive to slight changes in the shape of the neighborhood function. Fastest convergence is achieved using neighborhood functions which are "convex" over a large range around the winner neuron and yet have large differences in value at neighboring neurons.

1 Introduction

The self-organizing feature map (SOFM) algorithm (Kohonen 1982a, 1982b, 1988) is a biologically inspired method for constructing a structured representation of data from an input space by prototypes, called weight vectors. The weight vectors are associated with selected elements, the neurons, of an image space where metric relationships are defined between the elements. For any given data set, the SOFM algorithm selects weight vectors and assigns them to neurons in the network. The weight vectors as a function of neuron coordinates are called the feature map. Topologically ordered feature maps are characterized by the fact that prototypes which are neighbors in the input space are mapped onto neighboring neurons. The SOFM algorithm can successfully form feature maps which are topologically ordered, or nearly so, in a variety of applications, where the input space and image space have the same or different dimensionality. However, no complete theory of feature map formation has yet appeared. Such a theory would be useful for optimizing the algorithm in technical applications and for developing new algorithms for cases where the original SOFM algorithm is not adequate.

In a companion article (Erwin et al. 1992) we began to provide a framework for answering some questions about the SOFM algorithm. We have proven that the one-dimensional SOFM algorithm will converge to a topographic representation of the unit interval [0, 1] by a linear chain of neurons, given only that the neighborhood function, by which the weight changes in neurons in the neighborhood of the winning neuron are scaled, be monotonically decreasing. We also demonstrated that the dynamics of the one- or multi-dimensional algorithm may be described using a set of energy functions, such that the weight values associated with each neuron tend to move so as to decrease their energy.

In this paper we consider the convergence process in more detail, in particular the rate of convergence and the presence and absence of metastable states, which correspond to non-global minima of the energy functions. In section 2 we briefly describe the one-dimensional SOFM algorithm. In section 3 we discuss the issue of metastable states, stationary states of the algorithm which do not correspond to the topologically ordered, optimal mappings. We prove that for a certain class of neighborhood functions there exist no metastable states, while for other types of neighborhood functions metastable states are present regardless of the parameters of the algorithm. We also show that for the case of the widely used Gaussian functions, there exists a threshold value for their width, above which the topologically ordered state is the only stationary state of the algorithm. However, for more narrow Gaussian functions, or for the simple step function, metastable states exist and the mapping algorithm may become "stuck" temporarily in these non-ordered configurations.

In section 4 we provide numerical and analytical results on the speed of ordering as a function of the algorithm's parameters. Due to the presence or absence of metastable states, convergence time is very sensitive to slight changes in the shape of the neighborhood function, so that neighborhood functions which are close in some function-space metric may yet cause the SOFM algorithm to require very different amounts of time to converge. Fastest convergence is achieved using neighborhood functions which are "convex" over a large range around the winner neuron in the network, and yet have large differences in value at neighboring neurons. For the typically chosen Gaussian function, these competing interests balance to give the shortest convergence time when the width of the Gaussian is of the order of the number of neurons in the chain. The results of this section offer suggestions for avoiding metastable states in practical applications.

2 The One-Dimensional SOFM Algorithm

The one-dimensional SOFM algorithm employs a set of neurons which are arranged in a linear chain. The location of any neuron in the network is specified by the scalar index s, which we will allow to take integer values between 1 and N, the number of neurons. A weight vector w_s, in this case a scalar from the interval [0, 1], is assigned to each neuron s to form the feature map.

Given an initial set of weight vectors, feature map formation follows an iterative procedure. At each time step t, a pattern v, an element of the data manifold in input space, is chosen at random. The neuron r whose weight value w_r is metrically closest to the pattern is selected,

    |w_r - v| = min_s |w_s - v|,                                          (1)

and the weight values of all neurons are then changed according to the feature map update rule,

    w_s(t+1) = w_s(t) + ε h(r,s) [v(t) - w_s(t)],                         (2)

where ε, the learning step width, is some small constant (0 < ε ≪ 1). The function h(r,s) is called the neighborhood function. For most applications it has a maximum value of one for s = r and decreases with increasing distance between s and r.

For convenience we define a state of the network as a particular set of weight values {w_s | s = 1, 2, ..., N; w_s ∈ [0,1]}, and a configuration as the set of states which are characterized by the same order relations among the scalar weights.

The original form of the SOFM algorithm employed a step function as the neighborhood function h(r,s),

    h(r,s) ≡ H(|r-s|) = { 1,  if |r-s| ≤ 1;
                          0,  otherwise.                                  (3)

Later it was discovered that the algorithm could be made to converge more quickly in practical applications if a gradually decreasing function were used (e.g. Ritter 1989; Lo and Bavarian 1991). In a companion paper we proved that for the one-dimensional case, the algorithm can be guaranteed to converge for any monotonically decreasing neighborhood function. However, we also observed that other properties of the neighborhood function seemed to affect the efficiency of the algorithm. In particular we will show here that the algorithm is more efficient when a so-called convex neighborhood function is used.

We define a neighborhood function to be convex on a certain interval of integer values I ⊆ {0, 1, 2, ..., N} if

    |s-q| > |s-r|, |r-q|   ⟹   [h(s,s) + h(s,q)] < [h(s,r) + h(r,q)]      (4)

holds for all |s-q|, |s-r|, |r-q| within the interval I, and to be concave otherwise. If the indices r, s and q are allowed to be real numbers and the interval I is taken to be the set of all real numbers, this definition fits the usual notion of convexity. However, due to edge effects, the definition (4) allows some additional neighborhood functions to be classified as convex even though their second derivative is positive over a part of the range of allowed arguments.
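To make the procedure concrete, here is a minimal Python sketch of one iteration of rules (1)-(2); the code and its names are ours, not from the original paper, and we assume patterns drawn uniformly from [0, 1]:

    import numpy as np

    def sofm_step(w, eps, H, rng):
        """One iteration of the 1-D SOFM: draw a pattern, select the winner
        by rule (1), then move every weight toward the pattern by rule (2),
        scaled by the neighborhood function H(|r - s|)."""
        v = rng.uniform(0.0, 1.0)               # random pattern from [0, 1]
        r = np.argmin(np.abs(w - v))            # winner neuron, rule (1)
        d = np.abs(r - np.arange(len(w)))       # network distances |r - s|
        w += eps * H(d) * (v - w)               # update rule (2)
        return w

    H_step = lambda d: (d <= 1).astype(float)   # step function, eq. (3)

    rng = np.random.default_rng(0)
    w = rng.uniform(0.0, 1.0, size=10)          # random initial state, N = 10
    for t in range(10000):
        w = sofm_step(w, eps=0.01, H=H_step, rng=rng)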

To illustrate the consequences of neighborhood function shape, we choose specific neighborhood functions which do or do not meet condition (4). We have selected the step function (3) above and the "Gaussian", "concave exponential", "compressed Gaussian", and "compressed concave exponential" neighborhood functions (Fig. 1), given by

    h(r,s) ≡ H(|r-s|) =  exp(-(r-s)²/σ²),                    (5)
                         exp(-√|r-s| / σ),                   (6)
                         1 - δ(1 - exp(-(r-s)²/σ²)),  and    (7)
                         1 - δ(1 - exp(-√|r-s| / σ)),        (8)

respectively. The constant δ in the compressed functions must be in the range 0 < δ ≤ 1, and will usually be small. For δ = 1, equations (7) and (8) reduce to the Gaussian (5) and concave exponential (6) functions, respectively. Note that the Gaussian functions (5) and (7) are convex within an interval around their maximum; for large enough values of σ, the Gaussian functions will be convex over the full range of their possible arguments, which are integer values less than N. An approximate lower bound on the value of σ is given by σ > N/√2, although slightly lower values of σ may still result in neighborhood functions satisfying (4). The concave exponential functions (6) and (8) are concave for all values of the parameter σ.
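Condition (4) can be tested mechanically: on a chain, a triple with |s-q| > |s-r|, |r-q| always has r between s and q, so |s-r| + |r-q| = |s-q|, and it suffices to scan all splittings of each distance. A sketch with our own helper names (here N = 10):

    import numpy as np

    def H_gauss(d, sigma):                    # eq. (5)
        return np.exp(-d**2 / sigma**2)

    def H_concave_exp(d, sigma):              # eq. (6)
        return np.exp(-np.sqrt(d) / sigma)

    def compress(H, delta):                   # eqs. (7) and (8)
        return lambda d, sigma: 1.0 - delta * (1.0 - H(d, sigma))

    def is_convex(H, N):
        """Test condition (4): H(0) + H(a+b) < H(a) + H(b) for all
        integer distances a, b >= 1 with a + b <= N - 1."""
        return all(H(0) + H(a + b) < H(a) + H(b)
                   for a in range(1, N) for b in range(1, N - a))

    N = 10
    for sigma in (3.5, 7.0, 1000.0):
        print(sigma, is_convex(lambda d: H_gauss(float(d), sigma), N))
    # Prints False for sigma = 3.5 and True for the two broad Gaussians;
    # the concave exponential (6) fails the test for every sigma.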

3 Metastable States

It is known that the ordered state of the one-dimensional SOFM algorithm is stationary and stable (Kohonen 1982a; Ritter and Schulten 1986). However, little attention has been given to the existence of other stationary states, which correspond to stable "topological defects" of the feature map. Here we will show that the existence of metastable states is linked to the shape of the neighborhood function: metastable states with topological defects exist for any neighborhood function which is not "convex" in the sense of (4).

For the derivation of these results it will be convenient to relabel the weight values so that their indices are arranged in ascending order in the input space. To avoid confusion we introduce new symbols u_x to refer to weight values labeled such that x < y ⟹ u_x < u_y (x, y ∈ {1, 2, ..., N}), and use the "old" symbols w_s to label weight values by the position of the corresponding neurons in the network. "New" indices can be converted to "old" indices by a permutation function s = P(x), which, however, is different for each configuration of the network. Thus we may write

    u_x ≡ w_P(x) ≡ w_s.                                                   (9)

Note that the arguments of the neighborhood function are always the indices s of w_s, since the neighborhood function is defined by neighborhood relationships in the image space (network), not in the input space. We will use the abbreviation h(x,y) for h(P(x), P(y)).

Let us denote the probability density of choosing a pattern v by P(v). Then the average change V_x[u] ≡ ⟨u_x(t+1) - u_x(t)⟩ of the weight value u_x in one iteration, with the average taken over all possible patterns v, is given by

    V_x[u] = ε ∫₀¹ h(x,y) (v - u_x) P(v) dv,                              (10)

where y is the label of the winner neuron. The quantity V_x[u] may be interpreted loosely as the average force acting to either increase or decrease the value of the weight u_x at the next iteration. Expanding (10) into a sum of integrals over all possible winner neurons y yields

    V_x[u] = ε Σ_{y=1..N} h(x,y) ∫_{v ∈ Ω(y)} (v - u_x) P(v) dv,          (11)

where each integral is evaluated only over the Voronoi tessellation cell Ω(y) of the neuron y, i.e. the area of the input space mapped onto neuron y.

The Voronoi tessellation cell may be expressed as

    Ω(1) = { v | 0 < v < (u_1 + u_2)/2 },
    Ω(y) = { v | (u_{y-1} + u_y)/2 < v < (u_y + u_{y+1})/2 },  for 1 < y < N,    (12)
    Ω(N) = { v | (u_{N-1} + u_N)/2 < v < 1 }.

We define a stationary state to be a set of weight values {u_x} or {w_s} which is characterized by vanishing forces: V_x = 0 for all x. The stationary states could correspond to local minima or maxima of the energy functions (given in Erwin et al. 1992), but we will show that all stationary states correspond to local minima of the energy functions. Later we will differentiate between stable stationary states, which belong to the absorbing, ordered configurations, and metastable states, which belong to configurations with topological defects.

Let us consider the simplest case, where P(v) is constant. The condition V_x[u] ≡ 0 for the existence of a stationary state then simplifies to

    0 = Σ_{y=1..N} h(x,y) ∫_{v ∈ Ω(y)} (v - u_x) dv.                      (13)

If the neighborhood function h(r,s) is also constant, there exists only one stationary state, with weight values given by u_x = 1/2 for all x.

Let us next consider neighborhood functions which can be written as a perturbation

    h(x,y) ≡ 1 - δ g(x,y)                                                 (14)

on this trivial case. Note that the "compressed" neighborhood functions (7) and (8) are of this form. Then we may derive the following theorem:

Theorem 1. Given a constant probability distribution and a neighborhood function h(x,y) ≡ 1 - δ g(x,y), such that g(x,y) ≡ G(|x-y|) is positive and of order O(1), and 0 ≤ δ ≪ 1, then:

1. Any permutation function P(x) which, when inserted into the right hand side of

    u_x = 1/2 + (δ/8)(g(1,x) - g(N,x)) + (δ²/16)(g(1,x)² - g(N,x)²) + O(δ³),    (15)

leads to weights {u_x} in ascending order, with the differences between adjacent weight values being greater than O(δ³), describes a configuration which contains one stationary state.

2. The stationary state {u⁰_x} is given by the r.h.s. of (15).

The derivation of (15) in Theorem 1 is given in Appendix A. Note that the "compressed" neighborhood functions (7) and (8) meet the requirements of Theorem 1 when the factor δ is sufficiently small.

For an ensemble of maps with weight values located at one of the stationary states, application of the update rule (2) to each map, with patterns v independently chosen from the input space with the probability density P(v), leads to no change in the average of each weight value across the ensemble. The weight values in any individual map must, however, change after each application of the update rule. The average effect of the update rule on maps whose weight values are near a stationary state is considered in the following theorem:

Theorem 2. Given a constant probability distribution, then for every neighborhood function

    ∂V_x[u]/∂u_x < 0,  for all x ∈ {1, 2, ..., N},                        (16)

for all sets of weight values {u_x} such that the derivative exists, i.e. for all maps such that u_x ≠ u_y for all x ≠ y.

This theorem may be proven by performing the indicated derivative and using the property u_x > u_y for all x > y of the stationary state (15).

Theorem 2 states that the weight values of maps in a configuration containing a stationary state are more likely to move toward that stationary state than away from it, and thus they fluctuate around the values given by Theorem 1. However, unless the weights are in an ordered configuration, there is always a finite probability that a series of input patterns will cause an individual map to "escape", i.e. to change its configuration of weights. After a change in configuration, the weights will again be attracted to a stationary state, but this may not be the one of the previous configuration. For these reasons, we call stationary states stable only if they belong to the absorbing, ordered configuration, and metastable otherwise.
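Theorem 1 can be checked numerically: evaluate the forces (13) at the weights predicted by (15) and confirm that they nearly vanish. A sketch under our assumptions (uniform input density, compressed Gaussian (7), our own function names):

    import numpy as np

    def forces(u, P, H):
        """Average force V_x of eq. (13) (uniform input density): sum over
        the Voronoi cells (12) of h(x,y) times the integral of (v - u_x)."""
        N = len(u)
        b = np.concatenate(([0.0], (u[:-1] + u[1:]) / 2.0, [1.0]))  # cell borders
        V = np.zeros(N)
        for x in range(N):
            for y in range(N):
                h = H(abs(P[x] - P[y]))
                # closed form of the integral of (v - u_x) over [b_y, b_{y+1}]
                V[x] += h * ((b[y+1]**2 - b[y]**2) / 2.0 - u[x] * (b[y+1] - b[y]))
        return V

    N, sigma, delta = 10, 3.5, 1e-3
    H = lambda d: 1.0 - delta * (1.0 - np.exp(-d**2 / sigma**2))   # eq. (7)
    g = lambda d: 1.0 - np.exp(-d**2 / sigma**2)                   # its perturbation

    P = np.array([1, 2, 3, 4, 5, 6, 7, 8, 10, 9])   # class B of Table 1

    # Weights predicted by eq. (15), to second order in delta:
    g1, gN = g(np.abs(P - P[0])), g(np.abs(P - P[-1]))
    u = 0.5 + delta / 8 * (g1 - gN) + delta**2 / 16 * (g1**2 - gN**2)

    print(np.all(np.diff(u) > 0))            # True: a valid configuration
    print(np.max(np.abs(forces(u, P, H))))   # residual force, of order delta**3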

We may make several extensions of Theorem 1 concerning the existence of stationary states. Our first corollary states the conditions under which no metastable states exist.

Corollary 1. If the conditions of Theorem 1 hold and h(x,y) is convex in the sense of (4), then, and only then, all stationary states belong to ordered configurations.

The proof of Corollary 1 is given in Appendix B. Although the expansion (15) is no longer valid for neighborhood functions with δ ≈ 1, we found empirical evidence from numerical simulations that, even when the expansion (15) is not valid, metastable states exist for concave, but not for convex, neighborhood functions.

When the conditions of Theorem 1 are met, it follows from u_{x+1} > u_x and (15) that

    g(P(1), P(x+1)) - g(P(N), P(x+1)) > g(P(1), P(x)) - g(P(N), P(x))     (17)

for any two weights u_x and u_{x+1}. After P(1) and P(N) have been specified, there is at most one possible set of values for the other P(x)'s which will fulfill (17). Thus:

Corollary 2. If the conditions of Theorem 1 are met and h(x,y) is concave, then non-ordered stationary states, i.e. metastable states, exist. If the number of neurons is N, then (15) predicts at most a total of N(N-1) stable and metastable stationary states.

When the conditions of Theorem 1 do not hold, it is no longer correct to neglect the terms of order O(δ³) in (15). Corollary 2 then no longer holds, and more than one stationary state may exist for each pair of P(1) and P(N) values, so the number of stationary states is no longer limited to N(N-1). Indeed, for σ → 0 in any of the four neighborhood functions, N! metastable states exist, one for each configuration of the weights.

The conditions of Theorem 1 are met by the compressed neighborhood functions (7) and (8) for small δ and all but the smallest values of σ. For very small σ, g(1,x) - g(N,x) < O(δ³), and the expansion to order δ³ is not sufficient. The conditions of Theorem 1 are also met by the Gaussian (5) and concave exponential (6) functions, if their width σ is sufficiently large. In this case expansion of (5) and (6) with respect to 1/σ leads to the perturbations δG(|x|) ≈ x²/σ² and δG(|x|) ≈ √|x|/σ, respectively. Since Gaussian neighborhood functions with broad enough σ are convex over the full range of their possible arguments, application of Corollaries 1 and 2 to Gaussian functions yields that there exists a threshold value for σ above which no metastable states exist. An equation for determining this threshold value will be given in section 4.3.

Figure 2 illustrates these results for a 10-neuron Kohonen chain and a compressed Gaussian neighborhood function (7) with N = 10 and δ = 10⁻⁸. For a 10-neuron chain a maximum of 90 metastable configurations can be found from (15). These configurations can be grouped into the classes listed in Table 1, along with the range of σ over which they are stable. All members of a class are related to each other by two elementary symmetry operations, i.e. by replacing all indices s by (N - s + 1), or by replacing all weight values w_s by (1 - w_s).

For large σ, the perturbative part of (7) is convex, and the only stationary states are the two ordered stationary states; but as σ is lowered this term becomes concave and metastable configurations also begin to appear. As σ is reduced, the first set of metastable states to appear are the ones in the class labeled B in Table 1. In this configuration all of the weights are ordered except for the two corresponding to the neurons at the end of the chain, i.e. (w_1 < w_2 < w_3 < ... < w_10 < w_9). Next, three symmetrical metastable states appear at the same value of σ, namely with (w_1 > w_2 > ... > w_10 > w_9), (w_2 < w_1 < w_3 < ... < w_9 < w_10), and (w_2 > w_1 > w_3 > ... > w_9 > w_10). As σ is lowered further, more metastable states appear, with greater disorder. Some metastable states, such as B, remain metastable as σ is lowered; others are only stable over a limited range of σ. As σ goes to zero for any of the neighborhood functions considered, the conditions of Theorem 1 do not hold, and the expansion (15) is no longer sufficient. Note that for the concave exponential function (8), a certain set of metastable states exists for all values of σ; some of their configurations are indicated in Table 1.
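The construction behind Corollary 2 suggests a direct way to enumerate the candidate stationary states: for each of the N(N-1) choices of P(1) and P(N), the ordering (17) determines the remaining labels, and a permutation is kept only if it is self-consistent. A sketch under the same assumptions as above (uniform input density, compressed Gaussian, our own function names); it recovers the class-B state for narrow widths:

    import numpy as np

    def candidate_states(G, N):
        """Enumerate stationary-state candidates per Corollary 2: choosing
        P(1) = a and P(N) = b fixes the order of the other labels, because
        (17) forces g(1,x) - g(N,x) to increase with x."""
        states = []
        for a in range(1, N + 1):
            for b in range(1, N + 1):
                if a == b:
                    continue
                key = lambda s: G(abs(s - a)) - G(abs(s - b))
                P = sorted(range(1, N + 1), key=key)
                k = [key(s) for s in P]
                # Self-consistency: endpoints must come out as chosen, and
                # the sort keys must be strictly increasing (eq. 17).
                if P[0] == a and P[-1] == b and np.all(np.diff(k) > 0):
                    states.append(tuple(P))
        return states

    N, sigma = 10, 3.5
    G = lambda d: 1.0 - np.exp(-d**2 / sigma**2)    # perturbation of eq. (7)
    S = candidate_states(G, N)
    print(len(S))                                   # candidates, at most N*(N-1)
    print((1, 2, 3, 4, 5, 6, 7, 8, 10, 9) in S)     # class B of Table 1: True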

4 Rate of Convergence

4.1 Ordering Time

Let us define the ordering time as the number of time steps required for a given map to reach an ordered configuration. In the case of the neighborhood functions (5)-(8), for σ → 0 or σ → ∞ the relative order of the weight values is not affected by the update rule, and the ordering time becomes infinite. For all other values of σ, the neighborhood function is a monotonically decreasing function of the distance |r - s|, and the ordering time must be finite. Thus there must exist an intermediate value σ_min which corresponds to a minimum in ordering time. The minimal value and its location differ between the neighborhood functions (5)-(8) and, in fact, crucially depend on the shape of the neighborhood function.

Figure 3a shows the average ordering time as a function of σ for the Gaussian neighborhood function (5) and an ensemble of 10-neuron Kohonen chains. Note that time is scaled by ε, since ordering time is proportional to 1/ε for ε ≪ 1. Ordering time is minimal for σ ≈ 9. For σ → ∞ the ordering time rises slowly and approaches a logarithmic dependence. The standard deviation of the number of time steps required (indicated by the error bars) is small and approximately constant. For small σ the average ordering time rises very rapidly and the standard deviation is large. The large average ordering time with a large standard deviation is due to the presence of metastable states, which appear as the neighborhood function becomes progressively more concave. The constant standard deviation, and the logarithmic increase in ordering time with σ, are due to an initial "contraction" phase of the network. Ordering in the separate regimes of large and small σ will be discussed in detail below.

Similar graphs result for chains with a larger number of neurons N, but ordering time scales differently with N in the regimes of large and small σ. For large N the ordering time becomes a function of σ/N for large σ. However, this is not true for small σ, where the presence of metastable states causes the ordering time to increase faster than a linear function of N. Figure 3b shows that the compressed Gaussian function gives results similar to the Gaussian function, except that the time of ordering rises more rapidly as σ approaches zero.
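Measuring the ordering time is straightforward in simulation: iterate the update rule until the weights are monotone. A minimal sketch, reusing the conventions of the code in section 2 (uniform inputs; names are ours):

    import numpy as np

    def ordering_time(H, N=10, eps=0.01, max_steps=10**7, seed=0):
        """Iterate rules (1)-(2) until the weights are monotone, i.e. until
        one of the two ordered (absorbing) configurations is reached."""
        rng = np.random.default_rng(seed)
        w = rng.uniform(0.0, 1.0, size=N)
        s = np.arange(N)
        for t in range(max_steps):
            d = np.diff(w)
            if np.all(d > 0) or np.all(d < 0):
                return t
            v = rng.uniform(0.0, 1.0)
            r = np.argmin(np.abs(w - v))
            w += eps * H(np.abs(r - s)) * (v - w)
        return max_steps

    sigma = 9.0                                 # near the optimum for N = 10
    H = lambda d: np.exp(-d**2 / sigma**2)      # Gaussian, eq. (5)
    times = [ordering_time(H, seed=k) for k in range(100)]
    print(np.mean(times), np.std(times))        # compare Fig. 3a (time scaled by eps)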

For the concave exponential (6) and compressed concave exponential (8) neighborhood functions, however, the presence of metastable states for all values of σ gives rise to much longer ordering times; so much longer, in fact, that it was infeasible to construct graphs similar to Figs. 3a,b. Ordering times for independent simulations with identical parameters spanned several orders of magnitude. Note that although the compressed Gaussian and compressed concave exponential functions differ only slightly (in some function-space metric), they give rise to ordering times which differ by several orders of magnitude. Hence the shape of the neighborhood function, e.g. its concavity or convexity, crucially influences the learning dynamics. In the following sections we will discuss the different convergence phenomena in the regimes where metastable states are, or are not, present.

4.2 Contraction of Weights

For convex functions, e.g. Gaussian functions with large σ, metastable states do not exist; the increase in ordering time as σ increases is completely due to the increase in the amount of time spent in the initial (random) configuration.

For σ → ∞, the change in weight differences, Δw_rs = (w_r - w_s)_{t+1} - (w_r - w_s)_t, per iteration approaches zero as 1/σ². In the initial random configuration the weights cover the unit interval completely, weight differences are much larger than 1/σ², and no rearrangement of weights can take place. Therefore map formation must proceed in two steps: first the range of weight values must shrink until Δw_rs = O(1/σ²) while maintaining the initial configuration, and only then can the weights rearrange to form an ordered mapping.

Figure 4 shows the average ordering time t, the time t_c spent in the initial configuration, and the rearrangement time, t - t_c, as a function of σ for the Gaussian neighborhood function (5). The increase in ordering time above σ_min is completely due to an increase in t_c, which empirically fits well with the logarithmic function

    t_c ≈ (1/ε) ln(l / l_c[h(r,s)]),                                      (18)

where l denotes the length of the interval spanned by the initial weight values and l_c denotes the length of the interval covered by the weights (critical length) after which the first rearrangements occur (see Figure 5). The distance l_c is a functional of h(r,s). Its dependence on σ determines the shape of the ordering-time curves for large σ.

Based on the mistaken assumption that for a Gaussian neighborhood function the number of "kinks" in the map, N(t), could not increase in any time step, Geszti et al. (1990) inferred that the rate-limiting step in the ordering process was the elimination of the last "kink" in the map. They modelled this process as a random walk of the location of this kink in the chain of neurons, but failed to correctly predict ordering time as a function of σ. We can now see why this approach fails. For large σ, the rate-limiting step is the initial contraction of the range of the weights; after this step the ordering time is independent of σ. For small σ, the effect of metastable states on the ordering time must be considered.
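The two-phase picture is easy to reproduce numerically: with a very broad Gaussian, the configuration (the rank order of the weights) stays frozen while the spanned interval shrinks roughly exponentially at rate ε, in line with (18). A sketch under the simulation conventions assumed above:

    import numpy as np

    N, eps, sigma = 10, 0.01, 1000.0
    rng = np.random.default_rng(1)
    w = rng.uniform(0.0, 1.0, size=N)
    s = np.arange(N)
    rank0 = tuple(np.argsort(w))          # initial configuration

    for t in range(1500):
        v = rng.uniform(0.0, 1.0)
        r = np.argmin(np.abs(w - v))
        w += eps * np.exp(-np.abs(r - s)**2 / sigma**2) * (v - w)
        if tuple(np.argsort(w)) != rank0:
            print("first rearrangement at t =", t,
                  "spanned length =", w.max() - w.min())
            break
        if t % 500 == 0:
            # The spanned length shrinks by about a factor exp(-eps * t)
            # while the configuration is preserved.
            print(t, w.max() - w.min())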

4.3 Effect of Metastable States

For the concave exponential functions (6) and (8), and for the Gaussian functions (5) and (7) with small σ, the long ordering times combined with a large standard deviation may be explained by the presence of metastable states. In some simulations the map will get "stuck" in metastable configurations for a large number of time steps, whereas in other simulations the algorithm will, by chance, avoid the metastable states.

For the compressed neighborhood functions (7) and (8), Theorem 1 may be used to predict when metastable stationary states should be present. Their effect on the ordering process may be observed in Figure 6, which shows the percentage of maps in a particular configuration as a function of time for the compressed neighborhood functions (7) and (8). In each plot the heavy curve represents the percentage of maps, out of an ensemble of 10,000 initially random maps, which have reached one of the two ordered configurations, class A in Table 1, at a given time. The thin solid curves represent the percentage of maps in the configurations of classes B-G defined in Table 1, and the dashed curve, which starts near 100 percent, represents the total percentage of maps in all other configurations.

Figures 6a,b show the population diagram for a compressed Gaussian neighborhood function. With large σ (σ = 1000) the only stationary states of the mapping algorithm should be the ordered configurations, and indeed we see in Figure 6b that the mapping remains in the initial random configurations (dashed curve) until t ≈ 190 while undergoing the contraction of weights discussed in the previous section. After t ≈ 190, a rapid rearrangement of the weights takes place until t ≈ 220, when all maps in the ensemble have reached the ordered configurations (heavy curve). At no time do the populations of any of the non-ordered configurations become significant.

The situation is quite different for the compressed Gaussian (7) with small σ (σ = 3.5). Figure 6a shows that for this small value of σ, the period of contraction of weights is not necessary, and rearrangement starts almost immediately. After a very short time most of the maps are trapped in one of the metastable configurations, which for this value of σ are of the form of classes B, C, and E-L of Table 1.

Figures 6c,d show similar diagrams for the compressed concave exponential (8) with σ = 3.5 and σ = 1000, respectively. For both of these neighborhood functions metastable states exist in configurations B-D and F, and in a few other configurations which are not plotted (see Table 1). As expected, we do see a large percentage of the maps trapped in these metastable configurations, particularly configurations of the types B and D, at intermediate times. For both of these neighborhood functions, the percentage of maps in the ordered configurations approaches 100 percent as a logarithmic function of time, but for each the average ordering time is far greater than for the corresponding convex Gaussians in Figs. 6a,b.

Plots of the compressed Gaussian (7) and compressed concave exponential (8) functions against distance for σ = 1000 appear quite similar at short distances from the winner neuron. The concave exponential may at first appear to be an even better choice for the neighborhood function, since it decreases more rapidly at small distances from the winner neuron. However, a one-dimensional array of 10 neurons can self-organize into a map of the unit interval in fewer than 140 iterations with this Gaussian neighborhood function, while the ordering time with the concave exponential can be many orders of magnitude longer.

Although Theorem 1 and its corollaries hold only for the compressed functions, or for the Gaussian with large σ, we have empirically observed that Corollaries 1 and 2 correctly predict the existence or non-existence of metastable states for all monotonically decreasing neighborhood functions. Metastable states appear to exist for any concave neighborhood function, and not to exist for convex neighborhood functions. Figures 7a and 7b show population diagrams for the (non-compressed) Gaussian (5) function with σ = 2 and σ = 1000, respectively, in the same format as Fig. 6. (Further examples can be seen in Erwin et al. 1991.) By using a convex function, such as a broad Gaussian, we can optimize the ordering time and avoid having the algorithm get "stuck" in a metastable state.

Since the first metastable state to appear as σ is lowered has the weights w_N and w_{N-1} in reverse order in the input space, i.e. configuration B in Table 1, we may find the threshold value of σ above which no metastable state can exist by finding the lowest value of σ for which the neighborhood function between the three neurons w_1, w_{N-1} and w_N is convex in the sense of (4). This threshold value can be found from the relation

    h(N,N) + h(1,N) = h(N-1,N) + h(1,N-1),                                (19)

or, in terms of H(|r-s|),

    1 + H(N-1) = H(1) + H(N-2),                                           (20)

by numerical methods.
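A bisection on (20) carries out this numerical step; for instance, a sketch of the calculation (our own code, not from the paper) reproduces σ ≈ 5.02 for N = 10 and σ ≈ 43.2 for N = 100 with the Gaussian (5):

    import numpy as np

    def sigma_threshold(N, lo=1.0, hi=1000.0, tol=1e-10):
        """Solve 1 + H(N-1) = H(1) + H(N-2), eq. (20), for the width of the
        Gaussian neighborhood function (5) by bisection."""
        H = lambda d, s: np.exp(-d**2 / s**2)
        # f < 0 for broad (convex) Gaussians, f > 0 for narrow ones.
        f = lambda s: 1.0 + H(N - 1, s) - H(1, s) - H(N - 2, s)
        while hi - lo > tol:
            mid = 0.5 * (lo + hi)
            lo, hi = (mid, hi) if f(mid) > 0 else (lo, mid)
        return 0.5 * (lo + hi)

    print(sigma_threshold(10))     # about 5.0245, cf. class B in Table 1
    print(sigma_threshold(100))    # about 43.214, cf. Fig. 8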

Further evidence that metastable states arise for concave functions even when the expansion (15) does not hold can be seen in Fig. 8. In this figure the standard deviation of the ordering time is plotted against σ for a 100-neuron Kohonen chain. For intermediate values of σ, the standard deviation of the ordering time is inversely proportional to σ; however, for small values of σ, the ordering time and its standard deviation both increase rapidly. The value of σ which marks the transition between these two behaviors appears to correspond to σ ≈ 43.214 (shown as a dotted line in Fig. 8), below which value the neighborhood function becomes concave. The same behavior is observed in maps with a different number of neurons.

The earliest form of the SOFM algorithm employed a step function (3) for the neighborhood function (Kohonen 1982a, 1982b, 1988). Later it was discovered that gradually decreasing functions, such as Gaussians, give rise to faster convergence of the algorithm in practical applications (e.g. Ritter 1989; Lo and Bavarian 1991). Since it resembles the shape of a narrow Gaussian function, we might expect the step neighborhood function to lead to metastable stationary states which result in longer ordering times. However, Theorem 1 can be used neither to confirm nor to deny the existence of metastable stationary states in this case: since the low-order terms in the expansion (15) evaluate to zero for many values of the parameter x, the higher-order terms in the expansion cannot be neglected. Evidence for the existence of metastable states may instead be obtained by making plots such as those shown in Figures 6 and 7. Such graphs reveal that the ordering process follows a time course similar to that followed when metastable states are present. After a short period of contraction, the maps quickly rearrange so that the majority are in one of a few configurations, at least some of which probably contain metastable states. The configuration which holds the greatest number of non-ordered maps while the majority of maps reach the ordered configuration is configuration E of Table 1.


5 Summary and DiscussionThe one-dimensional SOFM algorithm itself is of little practical importance. However, athorough understanding of its properties is a necessary �rst step towards understanding thehigher-dimensional versions of the algorithm which are used in applications. In this paperand its companion (Erwin et al. 1992) we have focused on the simplest form of the SOFMalgorithm. This enabled us to provide exact results, such as a proof of ordering for generalmonotonically decreasing neighborhood functions, and a method of describing the behaviorof the algorithm in terms of a set of energy functions. However, our results also revealhow complicated a full description of the SOFM algorithm must be. We proved that thealgorithm, even in the one-dimensional form, cannot be described as following a gradientdescent on any energy function. The dynamics of the algorithm must instead be describedusing a set of energy functions, which can be very complicated for the multi-dimensionalcase.In this paper we have proven analytically that, for the one-dimensional SOFM algo-rithm, the \shape" of the neighborhood function, in particular its \convexity" determinesthe presence or absence of non-ordered stationary states. Ordering time may vary by or-ders of magnitude depending on the type of neighborhood function chosen. Broad, convexneighborhood functions, such as broad Gaussians, lead to the shortest ordering times. Nar-row Gaussian functions, step functions, and other concave functions all allow non-orderedstationary states which delay the convergence of the algorithm on average. Other, moresubtle properties of the neighborhood function which we have not studied may also turn outto be useful for e�cient multi-dimensional map formation. For example, Geszti (1990) hassuggested that anisotropic neighborhood functions should reduce the ordering time of one-and multi-dimensional feature mappings.In this paper we have shown that the Gaussian functions, which are the most commonlyused neighborhood functions, give best results if the the width of the Gaussian function islarge enough to avoid the presence of metastable states. For the one-dimensional case thevalue of � below which the �rst metastable state appears may be found from the relation (20).An optimal value of � exists, which is slightly higher than this threshold value. It is di�cult17

Page 18: Erwin. e. obermayer k._schulten. k. _1992: self-organising maps_stationary states, metastability and convergence rate

to give an equation for the optimal value of � for a Gaussian function, since the two com-peting e�ects, due to the presence of metastable states and the contraction of weights, scaledi�erently with the number of neurons. As a quick approximation, the value of � whichgives the optimal ordering time is of the order of the number of neurons. It is probablywise to employ similar guidelines in choosing neighborhood functions in multi-dimensionalapplications.From our discussion of ordering time for the one-dimensional algorithm, we can un-derstand the empirically observed fact that in practical applications with high-dimensionalforms of the algorithm, ordered maps may often be formed more quickly by employing aneighborhood function which begins with a large width which is slowly reduced. Startingwith a broad neighborhood function allows rapid formation of an ordered map, due to theabsence of metastable stationary states. After the ordered map has formed, the width of theneighborhood function may be reduced until the map expands to �ll the input space morefully.AcknowledgementsThis research has been supported by the National Science Foundation (grant number 9017051)and by the National Institute of Health (grant number P41RR05969). Financial support toE. E. by the Beckman Institute and the University of Illinois, and of K. O. by the Boehringer-Ingelheim Fonds is gratefully acknowledged. The authors would like to thank H. Ritter forstimulating discussions.ReferencesErwin E, Obermayer K, Schulten K (1991) Convergence properties of self-organizing maps.In: Kohonen T, et al. (eds) Arti�cial Neural Networks, vol I. North Holland, pp 409{414.Erwin E, Obermayer K, Schulten K (1992) Self-organizing maps: Ordering, convergenceproperties and energy functions. Biol Cybern .18

Geszti T (1990) Physical Models of Neural Networks. World Scientific, Singapore.

Geszti T, Csabai I, Czakó F, Szakács T, Serneels R, Vattay G (1990) Dynamics of the Kohonen map. In: Statistical Mechanics of Neural Networks: Proceedings, Sitges, Barcelona, Spain. Springer-Verlag, pp 341-349.

Kohonen T (1982a) Analysis of a simple self-organizing process. Biol Cybern 44:135-140.

Kohonen T (1982b) Self-organized formation of topologically correct feature maps. Biol Cybern 43:59-69.

Kohonen T (1988) Self-Organization and Associative Memory. Springer-Verlag, New York.

Lo ZP, Bavarian B (1991) On the rate of convergence in topology preserving neural networks. Biol Cybern 65:55-63.

Ritter H, Schulten K (1986) On the stationary state of Kohonen's self-organizing sensory mapping. Biol Cybern 54:99-106.

Ritter H, Martinetz T, Schulten K (1989) Topology conserving maps for learning visuomotor-coordination. Neural Networks 2:159-168.

Appendix A: Derivation of the Stationary State Equation

To derive the stationary state equation (15), we set V_x = 0 using the definition of V_x in (13). Inserting

    h(x,y) ≡ 1 - δ g(x,y)                                                 (21)

and

    u_x = 1/2 + δ α_x + δ² β_x                                            (22)

into (13) and keeping only terms up to order δ², we find

    0 = δ [ (g(1,x) - g(N,x))/4 - 2α_x ]
        + δ² ( Σ_{y=2..N-1} (α_{y+1} - α_{y-1})(α_{y-1} + 2α_y + α_{y+1} - 4α_x)/4
             + [ (α_1 + α_2 - 4α_x)(α_1 + α_2 - g(1,x)) + g(1,x)(α_1 + α_2) - 4β_x ]/4
             + [ g(N,x) α_x - (α_{N-1} + α_N)²/4 + α_x(α_N + α_{N-1}) - β_x ] )
        + O(δ³)
      = δ [ (g(1,x) - g(N,x))/4 - 2α_x ]
        + δ² [ α_x (g(1,x) + g(N,x)) - 2β_x ] + O(δ³).

So from the first-order term we get

    α_x = (g(1,x) - g(N,x))/8,                                            (23)

and from the second-order term we get

    β_x = α_x (g(1,x) + g(N,x))/2 = (g(1,x)² - g(N,x)²)/16.               (24)

Inserting this into (22) gives the result (15). Note that the third-order terms (omitted here) cannot be written in such a simple form.

Appendix B: Proof That All Stationary States are Ordered for Convex g(r,s)

From u_1 < u_2 < ... < u_N and (15) we can conclude that the following relationship must hold for the values of g(x,y):

    g(1,1) - g(N,1) < g(1,2) - g(N,2) < ... < g(1,N) - g(N,N).            (25)

Now assume that g(x,y) has the following properties: g(x,y) = g(y,x) = G(|P(x) - P(y)|), and G(|x|) is monotonically increasing from zero with |x|. Also assume that

    g(z,x) > g(z,y) + g(y,x)                                              (26)

will hold if and only if neuron w_P(y) is located between neurons w_P(x) and w_P(z) in the neural chain, i.e. if

    |P(z) - P(x)| > |P(z) - P(y)|, |P(y) - P(x)|.                         (27)

Condition (26) may also be written as (1 + h(z,x)) < (h(z,y) + h(y,x)), which together with (27) is our definition of convexity (4).

From (25) we know that for any x, (g(1,1) - g(N,1)) < (g(1,x) - g(N,x)) < (g(1,N) - g(N,N)), but since g(1,1) = g(N,N) = 0, we may write

    g(x,N) < g(N,1) + g(1,x), and                                         (28)
    g(1,x) < g(1,N) + g(N,x).                                             (29)

From (28) and (26) we know that P(1) is not located between P(N) and P(x), and from (29) and (26) we know that P(N) is not located between P(1) and P(x). Therefore we may conclude that

    P(x) is located between P(1) and P(N), for all 1 < x < N.             (30)

From (25) we know that for all x' > x,

    g(1,x) - g(N,x) < g(1,x') - g(N,x'), or rather
    g(1,x) + g(N,x') < g(1,x') + g(N,x).                                  (31)

From (30) we know that both P(x) and P(x') are located between P(1) and P(N). Now suppose that P(x') were between P(1) and P(x). Then it follows from the assumption (26) that

    g(1,x) > g(1,x') + g(x,x')  ⟹  g(1,x) > g(1,x'), and
    g(x',N) > g(x,x') + g(x,N)  ⟹  g(x',N) > g(x,N), thus
    g(1,x) + g(x',N) > g(1,x') + g(x,N).                                  (32)

But this contradicts (31). Therefore our supposition that P(x') is located between P(1) and P(x) is inconsistent with the assumption that the neighborhood function be convex. We conclude that for a convex neighborhood function

    P(x') is located between P(x) and P(N), for all x < x' < N.           (33)

Conditions (30) and (33) together imply that either P(1) < P(2) < ... < P(N) or P(1) > P(2) > ... > P(N). Therefore the only stationary states for a convex neighborhood function are the two states where the weights are ordered in either ascending or descending order.
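Consistent with this proof, the enumeration sketched in section 3 returns only the two ordered permutations once the perturbation is convex; a quick self-contained check under the same assumptions (broad Gaussian, our own code):

    import numpy as np

    G = lambda d: 1.0 - np.exp(-d**2 / 1000.0**2)   # broad Gaussian: convex regime
    N = 10
    found = []
    for a in range(1, N + 1):
        for b in range(1, N + 1):
            if a == b:
                continue
            key = lambda s: G(abs(s - a)) - G(abs(s - b))
            P = sorted(range(1, N + 1), key=key)
            k = [key(s) for s in P]
            if P[0] == a and P[-1] == b and np.all(np.diff(k) > 0):
                found.append(tuple(P))
    print(found)   # only the ascending and descending orderings survive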

Figure Captions

Table 1. Table of metastable states for a compressed Gaussian neighborhood function (7) (δ = 10⁻⁸). The range of values of σ over which a metastable state exists in each given configuration was calculated from (15). Since this equation is no longer valid for small σ, we only indicate states which are metastable for values of σ greater than 1.2. Configurations marked by a star (*) also contain metastable states for the concave exponential neighborhood function (8).

Figure 1. Examples of the neighborhood functions used: Gaussian (open symbols), concave exponential (filled symbols); σ = 6 in both cases. The Gaussian function here is convex (4) over the interval -9 < x < 9.

Figure 2. Schematic diagram demonstrating how the number of metastable states increases for a 10-neuron chain as the width σ of the Gaussian neighborhood function is decreased. The horizontal axis is a somewhat abstract "configuration" axis. Only half of the configuration space is shown; each vertical line thus represents two configurations in one of the classes of metastable states listed in Table 1. The central line represents the class of ordered configurations. Class B of metastable states branches off from the central line at σ ≈ 5.02; classes C and D become metastable at σ ≈ 4.53.

Figure 3. Ordering time as a function of the width σ of the neighborhood function, for a Gaussian (a) and a "compressed" Gaussian (b) function (δ = 0.1). Ordering time is averaged over 1000 independent simulations for each data point. Error bars represent one standard deviation. Metastable states exist below σ = 5.0245 (dotted line).

Figure 4. Ordering time (filled squares), time spent in the initial configuration (open squares), and rearrangement time (open circles) as a function of σ for a ten-neuron Kohonen chain with a Gaussian neighborhood function.

Figure 5. The number of time steps spent in the initial configuration, t_c, is plotted against the length of the unit interval spanned by the weight values in the initial map, l = max(|w_i - w_j|), for a 10-neuron Kohonen chain and a Gaussian neighborhood function at several values of σ. The curves can be fit to the function t_c ≈ (1/ε) ln(l / l_c[h(r,s)]) for large l, where l_c is an empirical constant which is a functional of h(r,s).

Figure 6. Percentage of ordered maps and of maps in disordered configurations, out of an ensemble of 10,000 independent simulations, shown as a function of time for a "compressed" Gaussian neighborhood function (δ = 0.1, ε = 10⁻⁴) with (a) σ = 3.5 and (b) σ = 1000, and for a compressed concave exponential neighborhood function with (c) σ = 3.5 and (d) σ = 1000. The curves for the disordered configurations in classes B, C, D, E, F, and G from Table 1 are denoted by the symbols in the legend. The dashed curve represents the total percentage of maps in all other configurations. Plot symbols are omitted on some curves to avoid clutter.

Figure 7. Percentage of ordered maps and of maps in disordered configurations as a function of time for 10,000 independent simulations with a Gaussian neighborhood function with (a) σ = 2 and (b) σ = 1000. Symbols as in Figure 6.

Figure 8. The standard deviation of the ordering time plotted against σ for a 100-neuron Kohonen chain. For large σ the standard deviation of the ordering time approaches a constant value. At intermediate values, the standard deviation of the ordering time is inversely proportional to σ. For small values of σ the ordering time and its standard deviation increase rapidly. The dotted line shows the calculated value σ ≈ 43.214, below which metastable states should exist.

[Figure 1: plot of H(|x|) against x for the Gaussian and concave exponential neighborhood functions, -10 ≤ x ≤ 10, 0 ≤ H ≤ 1; see the caption list above.]

Table 1

    Class   Range of σ         Members   Prototype P(1) ... P(10)
    A*      ∞ - 1.2000         2         1 2 3 4 5 6 7 8 9 10
    B*      5.0245 - 1.2000    4         1 2 3 4 5 6 7 8 10 9
    C*      4.5289 - 1.2000    2         2 1 3 4 5 6 7 8 10 9
    D*      4.5289 - 4.3733    4         1 2 3 4 5 6 7 10 9 8
    E       4.3733 - 1.2000    4         1 2 3 4 5 6 10 7 9 8
    F*      4.0258 - 3.8473    4         2 1 3 4 5 6 7 10 9 8
    G       4.0258 - 3.8473    4         1 2 3 4 5 10 6 9 8 7
    H       3.8473 - 1.2000    4         2 1 3 4 5 6 10 7 9 8
    I       3.8473 - 3.3954    4         1 2 3 4 5 10 9 6 8 7
    J*      3.5132 - 3.3024    2         3 2 1 4 5 6 7 10 9 8
    K       3.5132 - 3.3024    4         2 1 3 4 5 10 6 9 8 7
    L       3.5132 - 3.3954    4         1 2 3 4 10 9 5 8 7 6
    M       3.3954 - 1.2000    4         1 2 3 4 10 5 9 6 8 7
    N       3.3954 - 3.3024    4         1 2 3 10 4 9 5 8 7 6
    O       3.3024 - 1.2000    2         3 2 4 1 5 6 10 7 9 8
    P       3.3024 - 2.6862    4         2 1 3 4 5 10 9 6 8 7
    Q       3.3024 - 2.6862    4         1 2 3 10 4 9 8 5 7 6
    R       2.9889 - 2.7269    4         3 2 1 4 5 10 6 9 8 7
    S       2.9889 - 2.7269    4         2 1 3 4 10 9 5 8 7 6
    T       2.9889 - 2.7269    4         1 2 3 10 9 8 4 7 6 5
    U       2.7269 - 1.2000    4         3 2 4 1 5 10 9 6 8 7
    V       2.7269 - 1.2000    4         2 1 3 4 10 9 8 5 7 6
    W       2.7269 - 1.2000    4         1 2 3 10 9 8 7 4 6 5
    X       2.6862 - 1.2000    4         2 1 3 4 10 5 9 6 8 7
    Y       2.6862 - 1.2000    4         1 2 3 10 9 4 8 5 7 6
    Z       2.4489 - 2.0864    2         4 3 2 5 1 10 6 9 8 7
    AA      2.4489 - 2.0864    4         3 2 1 4 10 9 5 8 7 6
    AB      2.4489 - 2.0864    4         2 1 3 10 9 8 4 7 6 5
    AC      2.4489 - 2.0864    4         1 2 10 9 8 7 3 6 5 4
    AD      2.0864 - 1.2000    2         4 3 5 2 1 10 9 6 8 7
    AE      2.0864 - 1.2000    4         3 2 4 1 10 9 8 5 7 6
    AF      2.0864 - 1.2000    4         2 1 3 10 9 8 7 4 6 5
    AG      2.0864 - 1.2000    4         1 2 10 9 8 7 6 3 5 4
    AH*     1.8857 - 1.2000    4         4 3 2 1 5 10 9 8 7 6
    AI*     1.8857 - 1.2000    4         3 2 1 4 10 9 8 7 6 5
    AJ*     1.8857 - 1.2000    4         2 1 3 10 9 8 7 6 5 4
    AK      1.8857 - 1.2000    4         1 2 10 9 8 7 6 5 4 3
    AL*     1.2810 - 1.2000    2         5 4 3 2 1 10 9 8 7 6
    AM*     1.2810 - 1.2000    4         4 3 2 1 10 9 8 7 6 5
    AN*     1.2810 - 1.2000    4         3 2 1 10 9 8 7 6 5 4
    AO*     1.2810 - 1.2000    4         2 1 10 9 8 7 6 5 4 3
    AP*     1.2810 - 1.2000    4         1 10 9 8 7 6 5 4 3 2

[Figure 2: schematic "configuration" axis against the width of the neighborhood function (σ axis from 5 down to 2); see the caption list above.]

[Figure 3: scaled ordering time against the width of the neighborhood function, panels (a) and (b), logarithmic σ axis from 1 to 10000; see the caption list above.]

[Figure 4: scaled time against neighborhood function width, σ from 1 to 10000; see the caption list above.]

[Figure 5: time in initial configuration against initial map width, for σ = 100, 500, 1,000 and 10,000; see the caption list above.]

[Figure 6: populations of selected configurations (Ordered, classes B-G, Other) against number of time steps, panels (a)-(d); see the caption list above.]

[Figure 7: populations against number of time steps, panels (a) and (b); see the caption list above.]

[Figure 8: standard deviation of ordering time against the width of the neighborhood function, log-log axes; see the caption list above.]