
Computational Explorations in Cognitive Neuroscience Chapter 4: Hebbian Model Learning


  • 4.1 Overview Learning is a general phenomenon that allows a complex system to replicate some of the structure in its environment. Structure refers to any regularity or consistent pattern in the environment. Donald Hebb was instrumental in suggesting a way that learning could take place in the nervous system by changing synaptic weights. For Hebb, joint activation of neurons in a cell assembly strengthens their connections. We now know that there are synaptic processes that can strengthen (and weaken) synapses by following a (modified) Hebbian rule, i.e. they require joint pre- and post-synaptic depolarization to take effect.

  • Hebbian learning is a form of model learning: the result of learning is the construction of a model in the system that captures some of the structure of the environment. Because it does not require explicit feedback from the environment, it is called self-organizing.

  • 4.3 Computational Objectives of Learning The problem of constructing an internal model of the external world is a difficult one. It is inherently under-determined, or ill-posed, i.e., there is not enough information available to do a complete job. In other words, our sensory systems have an impoverished view of the outside world. They have to figure out what is going on based only on small glimpses. At the same time, another problem is that there is an overwhelming quantity of potential information available to the senses. The model construction job is made even more difficult by having to sift through all the activity going on in the sensory sheets to find information that is relevant.

  • In summary: “our senses deliver a large quantity of low-quality information that must be highly processed to produce the apparently transparent access to the world that we experience.” The strategy taken by the nervous system (and the one that we must follow in designing neural networks) is to use biases to organize and select the incoming sensory data. One useful bias is parsimony – choosing the simplest possible explanation from all the possible options.

  • Consider the case of vision. The visual system must construct models of the visual world based on a series of limited “snapshots” – two-dimensional projections on the retina from a high-dimensional space. That is, many dimensions of variation are collapsed and intertwined in the photoreceptor activity. The problem of recovering all these dimensions is under-determined because the mapping of the environment to the activity patterns is many-to-one. That is, many different environmental arrangements could be responsible for any given pattern of retinal activity. Therefore, many different possible internal models could fit equally well with the activity. Another way of expressing this idea: the interpretation of the visual world by the visual system is not sufficiently constrained by the visual input. Because many different dimensions may be collapsed in the projection onto the retina, it is difficult for the system to determine the “real” external causes in visual perception.

  • One process that helps overcome this problem is to use the constraint of temporal consistency. This constraint can be implemented by temporal integration – averaging over a sequence of individual snapshots. Temporal integration is important for the implementation of learning in neural network models. The process of slowly adding up small weight changes results in the weights that represent the aggregate statistics of a large sample of experiences. If there are stable patterns in the input space, these will prevail in the final weights through this averaging process. The result is that we can train a network to represent stable sources of input over a wide range of experiences with the world.

  • Averaging, however, is not enough. Averaging all the snapshots of the world that we are exposed to would result in a uniformly gray image. There needs to be some filtering, i.e. some selectivity as to what is taken in. This is where biases, or prior expectations, are critical. Biases allow the system to determine what kinds of input patterns are more informative than others, and how to organize the transformations of those input patterns (remember the clustering diagrams).

  • For biases to be useful, they must provide a fit to the properties of the real world. How can this be done? One type of bias that is NOT very useful is the implementation of specific knowledge. This amounts to hard-wiring a system with connection patterns that represent detailed aspects of the environment. This only works if the environment is guaranteed to contain those aspects. It hinders the system’s ability to flexibly adjust to different environments. Also, it is not neurobiologically realistic. On the other hand, two types of bias that are both useful and biologically plausible are:

    a) architectural: built-in connectivity preferences; e.g. area a is connected preferentially to area b, and not to area c.

    b) parametric: built-in differences in values such as quantity, ratio, or rate; e.g. area a has a greater proportion of inhibitory cells or a faster learning rate than area b.

  • There must be a balance between too much and too little use of biases in learning. The bias-variance dilemma: an over-reliance on biases results in not enough learning about the environment, whereas an over-reliance on experiences results in learning that is idiosyncratic and variable. Both extremes result in the construction of poor models: the first because the model may be too rigid by not accounting for important dimensions of variability in the world; the second because the model may be too prone to domination by specific instances of input and fail to account for important consistent features in the world. So it is important that biases and experiences be balanced. However, there is no general procedure that will guarantee this balance.

  • 4.3.1 Simple Exploration of Correlational Model Learning One powerful way for a model to find consistency in the patterns it receives from the environment is to detect correlations. In visual inputs, for example, correlations are indicative of stable features in the visual world. The simulation in this section is a very simple example to show that the weights of a network (using a Hebbian learning mechanism) can be shaped to reflect the correlation structure of the environment. The weights are initially uniform, but they change to conform to a consistent feature of the input space, i.e. a single line.

  • 4.4 Principal Components Analysis Understanding how Hebbian learning causes units in a network to represent correlations in the environment can be aided by the concept of principal components analysis (PCA). PCA is a technique that rotates a variable space in order to maximize the variance along one (principal) axis. It is an iterative process that orders the major directions of variability in the variable space according to the amount of variability that they account for. The first principal component is the one that accounts for the greatest amount of variability in the variable space. We will see that a simple Hebbian learning mechanism operates in such a way as to represent the first principal component of the variability of input space.

  • 4.4.1 Simple Hebbian PCA in One Linear Unit Consider a receiving unit (Figs 4.5, 4.6) with the activation function:

    $y_j = \sum_k x_k w_{kj}$ (4.1)

    Keep in mind that each variable is a function of time, although that is not written explicitly. At each time step, the values of these variables may change. Also, a different pattern of activity may be presented over the input units at each time step. Specifically, the Hebbian learning rule is implemented by updating the weights into the receiving unit at each time step.

  • For the rule to be Hebbian, the weight change (dwt in the simulator) depends on both pre- and post-synaptic units’ activity as:

    $\Delta_t w_{ij} = \epsilon \, x_i y_j$ (4.2)

    The symbol ε is called the learning rate parameter (lrate in the simulator). When ε is large, the weights undergo big changes at each step; when it is small, they undergo small changes. Explicitly, the weight change enters into the weight update equation as:

    $w_{ij}(t+1) = w_{ij}(t) + \Delta_t w_{ij}$ (4.3)
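  • A minimal Python/NumPy sketch of this per-step update, applying (4.1)–(4.3) to a single linear receiving unit (illustrative only, not the simulator's code; the 0.2 input activity level and other parameter values are assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    n_inputs = 25                 # e.g. a 5x5 input grid
    eps = 0.005                   # learning rate (lrate)
    w = np.full(n_inputs, 0.5)    # weights into the single receiving unit

    for t in range(1000):
        x = (rng.random(n_inputs) < 0.2).astype(float)  # input pattern at time t
        y = x @ w                                       # eq 4.1: y_j = sum_k x_k w_kj
        w += eps * x * y                                # eqs 4.2-4.3: w_ij(t+1) = w_ij(t) + eps*x_i*y_j

    Nothing here bounds the weights: run long enough, they grow without limit, which is the problem Oja's rule (Section 4.4.2) addresses.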

  • Now, to observe the overall effect of learning, we need to determine how the weights change over a whole sequence of input patterns. (Remember that a different input pattern is considered to be presented at each time step.) The weight change over all input pattern presentations is:

    $\Delta w_{ij} = \epsilon \sum_t x_i y_j$ (4.4)

    Notice that if we set ε to equal 1/N (where N is the total number of input patterns presented), then the right-hand side of this equation is just the temporal mean (expected value). Thus:

    $\Delta w_{ij} = \langle x_i y_j \rangle_t$ (4.5)

  • Now, we can substitute the expression for y from (4.1):

    $\Delta w_{ij} = \langle x_i \sum_k x_k w_{kj} \rangle_t = \sum_k \langle x_i x_k \rangle_t \, w_{kj} = \sum_k C_{ik} w_{kj}$ (4.6)

    where $C_{ik}$ is an element of the correlation matrix, representing the correlation between two input units i and k. The correlation between units i and k is defined as the expected value of the product of their activity values over time. [Refer to Fig 4.6]

  • The last equation says: the overall change in the weight from input unit i to receiving unit j is a weighted average, over all the different input units (indexed by k), of the correlations between these other input units and the particular input unit i. Each correlation is weighted by the average weight from the other input unit to the receiving unit. Training a network with this learning rule results in a set of weights that are dominated by the pattern of correlation that best accounts for the variability of the input space, i.e. the network learns to detect the most prevalent “feature” of the input space. This (strongest) pattern of correlation can be thought of as the first principal component of the input data.
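  • As a numerical check of (4.5)–(4.6) (a sketch under assumed random binary inputs), the average Hebbian update computed directly matches the one computed through the correlation matrix:

    import numpy as np

    rng = np.random.default_rng(1)
    n, N = 5, 10000
    X = (rng.random((N, n)) < 0.3).astype(float)   # N input patterns over n input units
    w = rng.random(n)

    y = X @ w                                      # eq 4.1 for every pattern
    dw_avg = (X * y[:, None]).mean(axis=0)         # eq 4.5: <x_i y_j>_t

    C = (X.T @ X) / N                              # C_ik = <x_i x_k>_t
    dw_corr = C @ w                                # eq 4.6: sum_k C_ik w_kj

    print(np.allclose(dw_avg, dw_corr))            # True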

  • 4.4.2 Oja’s Normalized Hebbian PCA A general problem with the simple Hebbian learning rule is that there is no bound on the weights as learning continues. It is neither computationally feasible nor biologically realistic to let the weights grow without bounds. A variety of different methods have been proposed for restraining weight growth. An early version was that of Oja (1982):

    $\Delta w_{ij} = \epsilon \, (x_i y_j - y_j^2 w_{ij})$ (4.9)

  • This modified Hebbian learning rule subtracts away a portion of the weight value to keep it from growing infinitely large. At equilibrium (when the weight value is no longer changing), the weight from a given input unit represents the fraction of that input’s activation relative to the total weighted activation over all the other inputs.

    Setting $\Delta w_{ij} = \epsilon \, \langle x_i y_j - y_j^2 w_{ij} \rangle = 0$ and solving for the weight gives:

    $w_{ij} = \frac{\langle x_i y_j \rangle}{\langle y_j^2 \rangle} = \frac{\langle x_i y_j \rangle}{\langle y_j \sum_k x_k w_{kj} \rangle}$ (4.10)
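  • A sketch of Oja's rule on synthetic two-dimensional data (the data distribution is an assumption for illustration): the weight vector converges to the first principal component, with length approximately 1:

    import numpy as np

    rng = np.random.default_rng(2)
    z = rng.normal(size=5000)
    X = np.c_[z, z] + 0.3 * rng.normal(size=(5000, 2))  # first PC lies along (1, 1)

    w = rng.random(2)
    eps = 0.01
    for x in X:
        y = x @ w
        w += eps * (x * y - y**2 * w)   # Oja's rule, eq 4.9

    print(w, np.linalg.norm(w))         # approx (0.707, 0.707), norm approx 1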

  • 4.5 Conditional Principal Components Analysis How can this simple form of learning be applied to a layer of receiving units, instead of a single one? One possibility would be to have a different receiving unit for each principal component in the input space. Called sequential principal components analysis (SPCA), this approach would train the first receiving unit on the first principal component, the second on the second, etc. Each successive principal component would represent the direction of greatest variability in the data after the previous ones had been removed.

  • In theory, SPCA provides a complete representation of the input space. However, it is not practical because it is based on the assumption that the input space has a hierarchical structure. The structure of the real world is more likely to be heterarchical – consisting of separate but equal categories. Interestingly, when models that are constructed to produce heterarchical representations are trained with inputs from natural scenes, they produce weight patterns that resemble receptive field properties of some neurons in the primary visual cortex, i.e. preferred tuning for bars of light of a particular thickness in a particular orientation. Although the SPCA approach is a general solution, it is not biologically realistic. Also, because of its generality it computes correlations over the entire space of input patterns. This ignores the type of clustering that exists in real world inputs, where meaningful correlations only exist in particular sub-regions of input space, not over the entire space.

  • For example, of all the patterns of light that cross your retinas, only a small subset are relevant for behavior. These patterns are the ones for which the visual pattern recognition system must represent correlations. Thus, realism would seem to require application of a type of conditionality to restrict the PCA computation to only certain input patterns. This argument motivates a version of Hebbian learning called conditional PCA (CPCA). Conditionality is imposed by determining when individual units will participate in learning about different aspects of the input space.

  • A conditionalizing function is used to specify the conditions under which a given unit should perform its PCA function. This function determines when a receiving unit is active, which is when PCA learning can occur. It would be desirable for the conditionalizing function to be self-organizing, where the units evolve their own conditionalizing function as a result of interactions during learning. To begin understanding CPCA, however, we assume the existence of a conditionalizing function that turns on receiving units for some inputs and not for others. Thus, the receiving unit’s activation serves as a conditionalizing factor.

  • 4.5.1 The CPCA Learning Rule The design objective of CPCA is to have each weight represent the probability that the input unit is active conditional on the receiving unit also being active. Thus:

    $w_{ij} = P(x_i \mid y_j)$ (4.11)

    In CPCA, we assume that the receiving unit represents a subset of input patterns in the environment. (4.11) states the learning objective: we want the weights to reflect the probability that a given input unit is active across the subset of input patterns represented by the receiving unit, i.e. those for which it is active.

  • A conditional probability of 0.5 is equivalent to zero correlation, i.e. the input unit is equally likely to be on and off when the receiving unit is active. When the conditional probability is greater than 0.5, it means that a positive correlation exists between the input unit being on and the receiving unit being on. When the conditional probability is less than 0.5, it means that a negative correlation exists – the input unit is more likely to be off when the receiving unit is on.

  • Note that the activation of a receiving unit in CPCA depends on more than one input unit. This makes the weight for any given input unit dependent on its correlation with the other input units. This is the most basic property desired from a Hebbian learning mechanism. The following weight update rule has been shown (Rumelhart & Zipser 1986) to achieve CPCA:

    $\Delta_t w_{ij} = \epsilon \, [y_j x_i - y_j w_{ij}] = \epsilon \, y_j (x_i - w_{ij})$

    The corresponding effect of weight changes over all inputs is:

    $\Delta w_{ij} = \epsilon \sum_t y_j (x_i - w_{ij})$ (4.12)

    This rule has the effect of normalizing the weights so that they do not become infinitely large as learning progresses.

  • Note that the weights are adjusted according to two factors:

    a) how different the input unit’s activation level is from the weight value.
    b) the activation of the receiving unit.

    The activity of the receiving unit controls the weight adjustment:

    a) If the receiving unit is not active, no weight adjustment will occur.
    b) If the receiving unit is active, the weight adjustment depends on its level of activity (reflecting how much it “cares” about the input).

    In the second case, the rule tries to set the weight to match the input unit’s activation. That is, if the weight is equal to the input activation, no change will occur, and the farther the weight is from the input activation level, the greater the weight change will be. Thus, when the receiving unit “cares” about the input, it tries to match the weights from its input units to the activation levels of those units.
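  • A minimal sketch of a single CPCA update (4.12), illustrating both factors (all values assumed for illustration):

    import numpy as np

    def cpca_step(w, x, y, eps=0.005):
        # eq 4.12, single step: dw_ij = eps * y_j * (x_i - w_ij)
        return w + eps * y * (x - w)

    w = np.full(3, 0.5)
    x = np.array([1.0, 0.0, 1.0])
    print(cpca_step(w, x, y=0.0))   # receiving unit inactive: weights unchanged
    print(cpca_step(w, x, y=1.0))   # active: each weight moves toward x_i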

  • 4.5.2 Derivation of CPCA Learning Rule At this point, we want to show that the weight update rule of (4.12) achieves the conditional probability objective of (4.11). To do this, we assume that the activation represents the hypothesis about the input and the input pattern represents the data. This assumption allows us to replace each activation in (4.12) by the joint probability of its activation and the occurrence of a given input pattern. The joint probability is expressed in terms of the conditional probability as described in Chapter 2. The joint probability of the hypothesis (h) and the data (d) is given by P(h,d). It is equivalent to the intersection of the hypothesis and data. Conditional probability:

    $P(h \mid d) = \frac{P(h, d)}{P(d)}$ (2.23)

  • Interpretation: when we receive some particular input data, this equation gives the probability that the hypothesis is true. This gives us an expression for the joint probability in terms of the conditional probability:

    $P(h, d) = P(h \mid d) \, P(d)$ (2.23b)

    Following (2.23b), for activations $x_i$ and $y_j$, and input pattern t, the joint probabilities are:

    $P(y_j, t) = P(y_j \mid t) \, P(t)$
    $P(x_i, t) = P(x_i \mid t) \, P(t)$

  • Substituting joint probabilities for activations, the weight update rule (4.12) becomes:

    $\Delta w_{ij} = \epsilon \sum_t \left[ P(y_j \mid t) P(x_i \mid t) P(t) - P(y_j \mid t) \, w_{ij} \, P(t) \right]$ (4.13)

    Now, in order to observe the final outcome of updating weights based on this rule, consider the asymptotic equilibrium state, where the weights no longer change with repeated exposure to input patterns.

  • At equilibrium, the network has already learned the structure of the input space, and no further weight changes are necessary. Thus:

    $\Delta w_{ij} = 0 = \epsilon \sum_t P(y_j \mid t) P(x_i \mid t) P(t) - \epsilon \sum_t P(y_j \mid t) P(t) \, w_{ij}$ (4.14a)

    Rearranging, we get an expression for the weight value at equilibrium:

    $w_{ij} = \frac{\sum_t P(y_j \mid t) P(x_i \mid t) P(t)}{\sum_t P(y_j \mid t) P(t)}$ (4.14b)

    The numerator of this equation is the definition of the joint probability of the input (x) and receiving (y) units both being active together across all the inputs. This is $P(y_j, x_i)$. Likewise, the denominator is the probability of the receiving unit being active across the whole input space. This is $P(y_j)$.

  • Now, from the joint probability definition, the final weight is equal to the probability of x conditional on y:

    $w_{ij} = \frac{P(y_j, x_i)}{P(y_j)} = P(x_i \mid y_j)$ (4.15)

    This means that by using the weight update rule in (4.12), learning of the input space will result in each final weight representing the probability that the input unit is active conditional on the receiving unit also being active. This shows that the update rule achieves the desired CPCA design objective (4.11).
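  • This equilibrium can be checked numerically. In the sketch below (a toy setup with one input unit, one receiving unit, and binary activations; all probabilities are assumptions for illustration), the weight trained by (4.12) settles near the empirical conditional probability P(x|y):

    import numpy as np

    rng = np.random.default_rng(3)
    eps, N = 0.01, 20000
    w = 0.5
    samples = np.zeros((N, 2))

    for t in range(N):
        y = float(rng.random() < 0.4)                   # conditionalizing function
        x = float(rng.random() < (0.8 if y else 0.3))   # x correlated with y
        w += eps * y * (x - w)                          # CPCA update, eq 4.12
        samples[t] = x, y

    on = samples[:, 1] == 1
    print(w, samples[on, 0].mean())   # both near P(x|y) = 0.8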

  • 4.5.3 Biological Implementation of CPCA Hebbian Learning The CPCA learning rule (4.12) has some basic features that are similar to what would be expected from NMDA-mediated LTP/LTD. Thus:

    a) when the input and receiving units are both strongly active, the weight value (synaptic strength) tends to increase. This is similar to what occurs in LTP.

    b) when the receiving unit is active but the input unit is not, the weight value decreases. This is similar to what occurs in LTD, assuming that the NMDA channels are open and a small amount of calcium influx occurs.

    c) when the receiving unit is not active, no weight change occurs. This is similar to the effect of blocking of the NMDA channels by magnesium ions.

  • Note also that the weights saturate at both extremes:

    a) When the weight is large (near 1), further increases in the weight are suppressed, because it becomes less likely that the input activation level will exceed the weight value, and the increase will be smaller when that value is exceeded. Further decreases become more likely, because the input activation level will more likely be less than the weight value.
    b) When the weight is small (near 0), further increases in the weight become more likely and larger; likewise, further decreases become less likely and smaller.

    This pattern is also consistent with experimental studies of LTP/LTD. This type of saturation, where the magnitude of the weight change shrinks exponentially as the bounds are approached, is called soft weight bounding.

  • 4.6 Exploration of Hebbian Model Learning We see in this simulation how a single receiving unit can learn different patterns of correlation across the input units depending on their different probabilities of being co-active with the receiving unit. The receiving unit will always be active for present purposes, so the conditional probabilities of the different correlation structures (i.e. right- and left-slanting lines) will only depend on their relative frequencies of occurrence. For example, with p_right set to 0.7 and p_left to 0.3, the weights to the receiving unit from input units that only send the right-slanted pattern will go to 0.7 and those from input units that only send the left-slanted pattern will go to 0.3. (The weight of the one common unit goes to 1 because it is always activated.) The parameter lrate in Leabra corresponds to the learning rate parameter in (4.12).
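  • A standalone sketch of this exercise under simplifying assumptions (three representative input units – right-only, left-only, shared – and the receiving unit clamped on):

    import numpy as np

    rng = np.random.default_rng(4)
    p_right, eps = 0.7, 0.005
    w = np.full(3, 0.5)    # [right-only unit, left-only unit, shared unit]

    for _ in range(20000):
        if rng.random() < p_right:
            x = np.array([1.0, 0.0, 1.0])   # right-slanted line
        else:
            x = np.array([0.0, 1.0, 1.0])   # left-slanted line
        w += eps * 1.0 * (x - w)            # CPCA, eq 4.12, with y_j = 1

    print(w)   # approx [0.7, 0.3, 1.0]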

  • Question 4.1: (a) Changing lrate from 0.005 to 0.1 causes the weights to fluctuate with much greater variance. (b) The weight updates are too large, and so they overcompensate for small changes in the probability structure as the weight structure evolves. (c) If the learning rate is too high, the system runs the risk of learning erroneous relations. This risk is greatest when the number of learning events is small. The problem may be mitigated by integrating over a larger number of events with a lower learning rate.

    In the unconditional PCA, each receiving unit is exposed to the structure of the entire input space. The relative probabilities of occurrence are disregarded. As discussed, this will tend to blur out distinctions between different correlation patterns (features) in the environment. In the next exercise, we explore the unconditional PCA. The unconditional property is simulated by removing the distinction between the probabilities of the two input patterns (i.e. setting them both to 0.5).

  • Question 4.2: (a) Setting p_right to 0.5 causes the weights all to go to 0.5 (except the common one which goes to 1 as before). (b) This weight pattern suggests the existence of a single “x” feature existing in the environment, not two separate diagonal line features. There is no way to distinguish line features since the weights for this “x” pattern are all the same. (c) This is similar to the “blob” solution for natural scene images because the different input patterns (right- and left-slanting lines) are blurred together into a common pattern.

  • Question 4.3: (a) Setting p_right to 1 simulates the situation in which the hidden unit is only active when there is a right-leaning line in the input, never when there is a left-leaning one. (b) The weights follow the same pattern – they go to 1 for the right-leaning input units and zero for the others. (c) This case might be more informative because the unit acts as a feature detector, i.e. it is conditionalized to be active for only one type of input pattern. (d) The architecture and training of the network could be extended by adding an additional feature detector for left-leaning lines in the input, and having each feature detector activated only when its preferred type of input was present. This arrangement would lead to each receiving unit (feature detector) having an input weight pattern corresponding to the feature that it was designed to detect. A “readout” unit in a higher layer could then determine which environmental feature was present by which feature detector was activated.

  • Next consider what happens when each “feature category” is represented by three subtypes rather than one. Set env_type to THREE_LINES. With p_right set to 0.7, the units that exclusively carry right-slanting inputs converge to 0.2333 (.7/3) and those carrying left-slanting inputs converge to 0.1 (.3/3), reflecting the relative probabilities of the different subtypes as before.

    Note, however, that the weights have been diluted as compared to the previous example. Because the difference in weights between the two categories is so much smaller, the ability to distinguish between the categories is much more susceptible to noise fluctuations.

  • This may be a particularly serious problem during the early phase of learning when the units are generally not very selective (i.e. they are more vulnerable to fluctuations). It would be desirable for the learning algorithm to do a better job at emphasizing the categorical differences between receiving units. It would also be desirable to have the weights have a dynamic range that was consistent with their full range of possible excursion, i.e. from 0 to 1, rather than being squeezed into a small range (in this case from .1 to .2333).

  • 4.7 Renormalization and Contrast Enhancement There are two basic problems that can occur with the CPCA algorithm as it has been developed so far:

    1) insufficient dynamic range
    2) insufficient selectivity

    We will now look at correction factors to remedy these problems:

    1) renormalization corrects the weights by accounting for the expected activity over the input layer: if this is sparse, as is typical, renormalization will work to boost the weights

    2) contrast enhancement increases the separation between strong and weak weights by use of a sigmoid nonlinear function

    These corrections are necessitated by computational concerns, i.e. they help the algorithm perform more efficient learning. As such, they represent quantitative adjustments that do not affect the basic qualitative nature of the learning rule.

  • 4.7.1 Renormalization Remember our intuitive argument that a conditional probability (of x given y) of 0.5 should correspond to a situation in which the input and receiving units behave in an uncorrelated manner. Note that this argument depends on the assumption that the input unit has a 0.5 probability of being active, i.e. it is equally likely to be on or off. If we consider that the input patterns are typically sparse, i.e. have low activity levels, then this assumption is violated because any given input unit will be infrequently active. If α represents the probability that an input unit is active, then the probability of x conditional on y cannot be greater than α (given that x and y are uncorrelated). For example, if on average xi is only active 20% of the time, then P(x) = α = 0.2, and P(x|y) cannot be larger than 0.2 if x and y are uncorrelated.

  • But this violates the intuition that P(x|y) should be 0.5 if x and y are uncorrelated. Also, it makes the ranges for positive & negative correlation unequal. Renormalization is meant to restore the uncorrelated probability to 0.5, making the ranges for positive & negative correlation the same size.

    $\Delta_t w_{ij} = \epsilon \, [y_j x_i - y_j w_{ij}] = \epsilon \, y_j \left[ x_i (1 - w_{ij}) + (1 - x_i)(0 - w_{ij}) \right]$ (4.17)

  • Expressed in this way, we see that the first term in the brackets acts to bring small weights up since (1-wij) is large and positive when wij is small. The second term in the brackets acts to bring large weights down since (0-wij) is large and negative when wij is large. Note that since (1-wij) goes to zero as wij goes to one, the first term in brackets disappears at wij = 1. We can allow this term to be larger by replacing 1 with a number m, where m > 1. That is:

    $\Delta_t w_{ij} = \epsilon \, y_j \left[ x_i (m - w_{ij}) + (1 - x_i)(0 - w_{ij}) \right]$ (4.18)

    When m > 1, small weights are increased more than when m = 1. If we know α, the probability that the input unit is active, then we can set m = 0.5/α. This will produce larger weight increases when the probability of the input unit being active is smaller than 0.5, and smaller increases when it is larger. At α = 0.5, the increases will be unaffected.

  • In simulations, we can also control the amount of renormalization to use by using parameter qm (called savg_cor in the simulator):

    $\alpha_m = 0.5 - q_m (0.5 - \alpha)$ (4.20)

    and then set m = 0.5/αm. As an example, consider what happens if α = 0.1:

    qm    αm    m
    0     .5    1     (no renormalization)
    .5    .3    1.67  (some renormalization)
    1     .1    5     (maximum renormalization)
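  • A small helper (a sketch of (4.20)) reproduces the table above:

    def renorm_factor(alpha, q_m):
        # eq 4.20: alpha_m = 0.5 - q_m * (0.5 - alpha); then m = 0.5 / alpha_m
        alpha_m = 0.5 - q_m * (0.5 - alpha)
        return alpha_m, 0.5 / alpha_m

    for q_m in (0.0, 0.5, 1.0):
        print(q_m, renorm_factor(0.1, q_m))
    # 0.0 -> (0.5, 1.0); 0.5 -> (0.3, 1.67); 1.0 -> (0.1, 5.0)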

  • 4.7.2 Contrast Enhancement The goal of this correction is to make the weights more reflective of strong correlations in the input patterns, and less reflective of weak correlations. This motivation can be justified as a “parsimony bias”. Contrast enhancement is implemented by a sigmoid nonlinear function.

  • This function converts the linear weight into an effective weight:

    $\hat{w}_{ij} = \frac{1}{1 + \left( \frac{w_{ij}}{1 - w_{ij}} \right)^{-\gamma}}$ (4.21)

    The effect of this nonlinear function is to elevate weights above the threshold, and suppress weights below it. The threshold is the midpoint (0.5). The term γ is the weight gain parameter. The nonlinearity collapses to the linear case when γ is 1. As γ increases above 1, the slope of the sigmoid function becomes more steep, giving more contrast between large and small weights.

  • The offset parameter θ (wt_off in the simulator) is added in order to change the threshold:

    $\hat{w}_{ij} = \frac{1}{1 + \left( \frac{w_{ij}}{\theta (1 - w_{ij})} \right)^{-\gamma}}$ (4.23)

    When θ > 1, the threshold (midpoint) value is shifted to a greater value (as in Fig 4.12, where θ = 1.25). It is important to distinguish the effect of contrast enhancement of the weights from the effect of the gain parameter on the activation values.

  • The activation gain changes a receiving unit’s sensitivity to its total net input. It can make the unit more or less sensitive to all inputs around its threshold value, based on the total level of activation produced by an input, not the pattern of activation. By contrast, weight contrast enhancement makes the units more or less selective to particular patterns of input. That is, weights get pushed up or down according to whether they are above or below the threshold. Increasing the weight contrast enhancement of a unit makes it a more sensitive filter for detecting input patterns.

  • 4.7.3 Exploration of Renormalization and Contrast Enhancement in CPCA

    Renormalization: The input space consists of 5 horizontal lines, each presented with equal probability. Each line co-occurs with the receiving unit with the same probability as the expected activity level over the input layer (.2). This situation represents the case where input & receiving units are uncorrelated because it gives the same level of co-occurrence that would result from activating input units at random, with the overall activity level on the input being 0.2.

  • We see that the weights all settle in near 0.2.

    As we discussed, we would like the weights for the uncorrelated case to go to 0.5 rather than 0.2. This is why we need renormalization. To get full renormalization, we set qm (savg_cor) to 1. This causes the weights to settle close to 0.5.

  • In some applications, there may be expectations about the ratio of input features/hidden unit. In those cases, the value of qm can be set lower than 1 to prevent this ratio from being larger than expected.

  • Contrast enhancement sigmoid function: We now train the network to distinguish between left-slanting and right-slanting line categories, where each category has 3 subtypes. With wt_gain = 1 (linear case), there is mild separation between the weights of the two categories (0.58 vs 0.25). When we introduce the sigmoid nonlinearity (wt_gain = 6), the separation increases (0.85 vs 0). That is, only the right lines are represented, and they have strong weights.

    [Figure: weight patterns with wt_gain = 1 vs. wt_gain = 6]

  • The value of increasing the weight contrast enhancement is that we can train hidden units to represent just one feature type, even when they may be partially selective for other feature types. In conclusion, contrast enhancement and renormalization work together to determine what a unit will tend to detect and what it will ignore. They are essentially correction factors that adjust the CPCA algorithm to compensate for its limitations of dynamic range and selectivity. They can increase the effectiveness of the CPCA algorithm.

  • Question 4.4: (a) There are 2 different types of weights. The first type is at the central input unit and its 4 horizontal and vertical neighbors. These have high weights because they are at the intersection of right- and left-slanted input patterns. The second type is at the 8 other input units receiving input from right-slanted inputs. These are lower because they only get input from the right-slanted inputs. With [env_type=three_lines; p_right=0.7; lrate=0.005; savg_cor=1; wt_gain=6; wt_off=1], the first type has a value of 1.0, and the second type has a value from 0.85 to 0.90. With [env_type=three_lines; p_right=0.7; lrate=0.005; savg_cor=1; wt_gain=6; wt_off=1.25], the first type is unchanged whereas the second type is reduced to a value from 0.63 to 0.67.

  • (b) The stronger weights (type 1) stay the same. The weaker weights (type 2) are decreased. Setting wt_off to 1.25 shifts the sigmoid curve to the right. For type 1, the effective weight stays the same because the linear weight remains on the upper saturation part of the sigmoid curve. For type 2, the effective weight is reduced because the linear weight is in the linear part of the sigmoid curve. A linear weight value of 0.58 is transformed to an effective weight value of 0.87 when wt_off is 1.0. When wt_off is 1.25, the same linear weight value is transformed to 0.65. This is contrast enhancement: high weights remain high while lower weights get lower. (c) A wt_off value of around 2.1 causes the non-central units of the right lines to have weights around 0.1 or less. (d) No, weights of 0.1 do not reflect the correlations in any single input pattern, which are all high. This high threshold value suppresses the representation of correlated inputs so that only the highest are represented. (e) This representation might be useful for excluding unwanted input correlations, even when they are fairly strong, in favor of a single desired very strong feature.

  • Question 4.5: (a) With wt_off set to 1 and savg_cor set to 0.7, the non-central units of the right lines go down as compared to a savg_cor value of 1. (b) They go to the same low value (around .1 or less) that occurred with wt_off equal to 2.1 in the previous question. Why does this happen? In the previous question, we were using a wt_gain value of 6 to achieve contrast enhancement. This magnified the differences in weights around 0.5 (wt_off was set to 1). By now lowering savg_cor to 0.7, the effective activity level (αm) is larger, making m smaller, meaning that we are lessening the degree of renormalization since the smaller weights remain smaller.

  • 4.8 Self-Organizing Model Learning Now that we have learned about CPCA, we are ready (finally!) to consider Hebbian learning in a network having multiple receiving units. The learning is now self-organizing, i.e. the receiving units compete with each other via a kWTA inhibitory function. In terms of CPCA, the conditionalizing function comes from the competition of receiving units – competition imposes conditions on when a unit is active, and is thus allowed to learn. (1) The probability of a receiving unit winning the competition to have its weights strengthened by any given input pattern depends on the strength of the input it receives from that pattern as compared to that of the other receiving units. (2) For a receiving unit to win and have its weights strengthened, it must be more well-tuned to (selective for) that input pattern than the other receiving units.

  • (3) To become more well-tuned, however, the receiving unit must win the competition, i.e. according to CPCA it must be active for its weights to be strengthened. (4) Therefore, weights will be further strengthened for the most-tuned receiving units and not for less-tuned ones. This is an example of positive feedback: tuning → strengthening → more tuning → more strengthening, etc. (5) This means that any initial small tunings of the receiving units will tend to be enhanced. (What happens if the initial weight settings are randomized?)

  • A system with positive feedback always runs the risk of an exaggerated response (as we saw with runaway excitation in the case of bidirectional excitatory connectivity). In self-organizing learning, the risk is that some units may become overloaded in representing input features at the expense of others. This tendency for “hogging” is countered by the growing selectivity of the receiving units. As their tuning increases with learning, they not only become more responsive to certain input features, but they also become less responsive to others. This tends to cause the selectivity to different features to be spread among the receiving units (as long as there is variation across their initial tunings).

  • 4.8.1 Exploration of Self-Organizing Learning The network in self_org.proj has a receiving layer with 20 units. Each receiving unit receives projections from all input units. The receiving layer uses the average-based kWTA inhibition function with k = 2. Initially the weights are random. [Note: to reinitialize the weights, select the View/PROCESS_CTRL option from the control panel; then click the NewInit button.]

  • The input space consists of all 45 pairwise combinations of vertical and horizontal lines in the 5x5 input grid (10 vert-vert, 10 horiz-horiz, 25 vert-horiz). A training session consists of presentation of 30 passes through the 45 events.

  • With learning, individual receiving units develop selective representations of the line features, while ignoring the random context of the other lines. That is, they become line feature detectors.

    [Figure: results of 3 different training sessions]
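  • A compact standalone sketch of this self-organizing setup (assumptions: hard kWTA with k = 2 instead of Leabra's average-based version, no contrast enhancement, and illustrative parameter values):

    import numpy as np

    rng = np.random.default_rng(5)
    n_in, n_hid, k, eps = 25, 20, 2, 0.05

    def line(i, horiz):
        g = np.zeros((5, 5))
        if horiz:
            g[i, :] = 1.0
        else:
            g[:, i] = 1.0
        return g.ravel()

    feats = [line(i, h) for h in (True, False) for i in range(5)]  # 10 line features
    events = [np.clip(feats[a] + feats[b], 0, 1)                   # 45 pairwise combos
              for a in range(10) for b in range(a + 1, 10)]

    W = rng.uniform(0.1, 0.4, size=(n_hid, n_in))   # random initial weights

    for epoch in range(30):                         # 30 passes through the 45 events
        for i in rng.permutation(len(events)):
            x = events[i]
            net = W @ x
            y = np.zeros(n_hid)
            y[np.argsort(net)[-k:]] = 1.0           # hard kWTA: top k units active
            W += eps * y[:, None] * (x - W)         # CPCA update for the winners

    After training, most rows of W approximate a single line feature, i.e. those units have become line feature detectors.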

  • After training, you should see that:

    1) each feature (vert or horiz line) gets represented by at least 1 receiving unit.
    2) some features get represented by more than 1 receiving unit.
    3) which unit represents any given feature changes randomly from 1 training session to the next.
    4) for any given training session, some units do not become selective for any feature. These are “loser” or “dead” units. This reflects the excess capacity that is required to adequately reflect the structure of the input space. Biologically, this excess capacity should be large to allow a reserve pool of neurons for learning new features.

  • The ability of the network to develop this selectivity is made possible by the interaction of CPCA learning and inhibitory competition:

    1) receiving units that initially have larger random weights for an input pattern win the competition.
    2) the weights of these units are then tuned to be more selective for this pattern.
    3) these units will then be more likely to respond to that pattern in the future, as well as to other similar patterns (i.e. ones that share one of the line features).
    4) small initial differences in the weights for the two lines in the input pattern will cause receiving units to be more likely to respond to one of the line features in the pattern.
    5) the small initial weight differences are enhanced with learning, so that receiving units usually become more selective for just one feature. The separation between a unit’s stronger and weaker correlations is increased by contrast enhancement.
    6) overall, the strongest correlations in the environment (i.e. line features) will tend to become represented by this process. This is a combinatorial distributed representation – each input pattern is represented by a combination of receiving units.

  • Unique Pattern Statistic Each time Run is clicked, a training session of 30 passes through the 45 events is initiated. TRAIN_GRAPH_LOG shows the unique pattern statistic for each of the 30 passes. This statistic records the number of unique hidden unit activity patterns that were produced as a result of probing the network with all 10 different features (i.e. vert or horiz lines) presented individually. For perfect performance, the unique pattern statistic is 10, meaning that all 10 features were uniquely represented. Each time Batch is clicked, a batch of 8 training sessions is initiated. BATCH_LOG shows 4 summary statistics after the 8 training sessions:

    (a) average unique pattern statistic over the 8 sessions
    (b) maximum unique pattern statistic over the 8 sessions
    (c) minimum unique pattern statistic over the 8 sessions
    (d) count of the number of times that perfect performance (i.e. a perfect 10) occurred in the 8 sessions
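  • A sketch of how this statistic could be computed for the network above (a hypothetical helper; the simulator's implementation may differ): probe with each of the 10 single-line features and count the distinct winner sets:

    import numpy as np

    def unique_pattern_stat(W, features, k=2):
        # Present each feature alone and record the set of kWTA winners it evokes.
        patterns = set()
        for x in features:
            winners = tuple(sorted(np.argsort(W @ x)[-k:]))
            patterns.add(winners)
        return len(patterns)   # 10 = perfect: every feature uniquely represented

    # e.g. unique_pattern_stat(W, feats) using W and feats from the earlier sketch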

  • Parameter Manipulations

    Weight gain parameter (γ): Question 4.6: (a) Training the network gives:

    wt_gain   avg_unq_pats   min_unq_pats   max_unq_pats   cnt_unq_pats
    6         10             10             10             8
    1         9              8              10             2

    (b) Over the 8 training runs with wt_gain = 1, the minimum number of unique patterns that occurred was 8 out of 10, rather than 10 out of 10. The average number was 9 out of 10, instead of 10 out of 10. The number of training runs on which all 10 features were uniquely represented was 2.

  • (c) With wt_gain = 1, the sigmoid function is “squashed” and there is weaker separation of large and small weights. This causes some hidden units to represent more than one line feature, and some features not to be uniquely represented. With wt_gain = 6, the sigmoid function is in effect, and there is strong separation of large and small weights. The sigmoid function allows contrast enhancement, whereby the strongest weights are selected and the weaker weights are de-selected. This selectivity is important for self-organizing learning by allowing detection of distinct features.

  • Weight offset parameter (θ): Question 4.7: (a)

    wt_off           avg_unq_pats   min_unq_pats   max_unq_pats   cnt_unq_pats
    1.25 (default)   10             10             10             8
    1.0              9.75           9              10             6
    .75              8.125          6              10             1

    (b) There was a noticeable change in the weight patterns compared to the default case: the number of runs having all unique representations drops from 8 to 6 to 1, indicating that training is producing more units with non-unique representations. (c) wt_off is the offset of the sigmoid function. It acts like a threshold for weight enhancement. As it is lowered, more and more weights in the mid-range are enhanced. This means that weaker weights that were being suppressed now get enhanced. This makes it more likely that hidden units will come to represent multiple features, and less likely that they will represent only one feature (since they are less “exclusive”).

  • (d) This threshold is important for self-organizing learning because it helps determine how selective hidden units will be. Remember that selectivity is important for establishing a combinatorial distributed representation, one in which the separate features of the input space are uniquely represented by hidden units.

  • Renormalization parameter (qm): Changing savg_cor from .5 to 1 results in:

    savg_cor   avg_unq_pats   min_unq_pats   max_unq_pats   cnt_unq_pats
    1          9              8              10             4

    Increasing the amount of renormalization makes the weights increase more rapidly. This causes the units to develop less selective representations of lines. A lower level of correlation is needed to produce strong weights, and units have a greater tendency to represent multiple features.

  • Initial mean random weight parameter: Now, setting wt_mean to 0.5 sets the initial random weight values to be 0.5 rather than 0.25. We now see:

    wt_mean   avg_unq_pats   min_unq_pats   max_unq_pats   cnt_unq_pats
    .5        10             10             10             8

    The tendency for units to form unique representations is increased so that, most of the time, all units form unique representations. This effectively eliminates “loser” units, i.e. every unit now codes for a line. How can we explain this result? Starting off with larger weight values means that we will tend to get larger decreases than increases. Hidden units that were active for a given pattern will then receive less net input for a similar pattern. This is because the weights will have decreased for those input units that were off initially but are now on.

  • The result is that units that are initially successful at representing input patterns will not have as much of an advantage over those that were not successful. This gives the latter units a chance to catch up. So, all units end up representing unique features. This tendency can be used to counterbalance the “hogging” tendency, where a few units tend to represent all the features at the expense of the other units.

  • Learning rate parameter (ε): Question 4.8: (a) Apparently, this tenfold increase in the learning rate has NO noticeable effect on the network.

    lrate   avg_unq_pats   min_unq_pats   max_unq_pats   cnt_unq_pats
    .1      10             10             10             8

    (b) This same value of lrate in Question 4.1 produced fluctuations with great variance which interfered with the network’s learning ability. Here, there is no comparable effect because the learning rate effect is compensated by other mechanisms, e.g. kWTA, renormalization, and contrast enhancement.

  • 4.8.2 Summary and Discussion Hebbian learning by CPCA + kWTA has been shown to be effective in more complex environments with real-world input spaces. One drawback: the complex interdependencies of the hidden layer units make rigorous mathematical analysis difficult.

  • 4.9 Other Approaches to Model Learning How does CPCA + kWTA compare to other types of Hebbian learning?

    a) Does it have the same level of performance?
    b) Can it accomplish the same functions?

    4.9.1 Algorithms That Use CPCA-Style Hebbian Learning There are several different Hebbian learning algorithms that use a similar learning rule, e.g. the Kohonen network (Kohonen 1984). The CPCA + kWTA approach differs primarily in the kWTA activation dynamics, not the learning rule. The production of sparse distributed representations by CPCA + kWTA gives it a combinatorial flexibility that is lacking in the Kohonen network.

  • 4.9.2 Clustering The competitive learning algorithm of Rumelhart & Zipser (1986) produces a localist representation, i.e. only one unit is active at a time. Competitive learning causes each hidden unit to represent a different cluster (natural grouping) of similar patterns in the input space. Strongly correlated input patterns tend to form such clusters. The clustering metaphor makes sense for representation by single units, i.e. localist representation. It makes less sense for kWTA. However, the k active units in kWTA inhibition may be thought of as representing multiple active clusters simultaneously.

  • 4.9.3 Topography The Kohonen network is useful for formation of topographic maps because it is concerned about the neighborhood of activity around the single winner. This causes hidden units to represent similar things as their neighbors. This property may be useful for understanding the formation of topographic maps in the nervous system. A CPCA + kWTA model with lateral excitation can also produce topographic maps (see Chapter 8).

  • 4.9.4 Information Maximization and MDL Another proposed constraint on Hebbian learning is information maximization (Linsker 1988), in which models are trained so as to maximize the amount of information extracted from the input patterns. Information maximization is only one of many constraints that should be considered in Hebbian learning. If this constraint is unchecked, it could lead to over-extraction of information, i.e. the production of unparsimonious representations that capture all of the details of the input space.

  • It is usually more useful to strike a balance between information maximization and parsimony. CPCA + kWTA accomplishes this balance.

    a) By having each receiving unit extract the first principal component of the correlation matrix representing its subset of the input space, CPCA maximizes the information received by that unit because its weights are tuned to the direction of maximum input variation.

    b) The kWTA inhibitory competition lowers the overall information capacity of the hidden layer, providing a counterbalance to the information maximization objective. Also, by excluding all other principal components of the correlation matrix, parsimony is enforced. Parsimony is further enforced by the weight contrast enhancement function.

  • 4.9.5 Learning Based Primarily on Hidden Layer Constraints The Bienenstock, Cooper, Munro (BCM) algorithm uses different constraints:

    1. each hidden unit must be active for the same percentage of time as every other one.

    2. the number of hidden units must equal the number of sources (features) in the environment.

    The BCM algorithm works well when the feature categories to be learned are uniformly distributed in the environment and when the numbers of hidden units and feature categories are evenly matched. However, these assumptions do not seem very realistic: it is unlikely that the number of “hidden” units in sensory cortical areas in any way matches the number of sensory feature categories that must be learned, and it does not seem realistic that all feature categories in the sensory environment occur with the same frequency.

  • Independent components analysis (ICA) is designed to solve the blind source separation problem. It works well when the basic conditions of that problem are satisfied (e.g. the cocktail party situation).

    a) Like BCM, ICA requires that the number of hidden units be equal to the number of feature categories (sources) in the environment.

    b) ICA also requires that the number of hidden units be equal to the number of input units.

    c) ICA learning is based on making hidden units maximally independent of each other, so that what one unit learns is highly dependent on what other units have learned.

    CPCA+kWTA attempts to maintain a balance between specialization of individual units and competition between units. The result is that, unlike BCM or ICA, it is much less dependent on the number of hidden units.

  • 4.9.6 Generative Models Another class of models are called generative (Dayan et al 1995; Carpenter & Grossberg 1987) because they are based on the generation of an internal model of the world to accomplish pattern recognition. Representations are learned by an iterative interaction between model synthesis and input analysis. Learning is based on the difference between what is generated and what appears in the input. This is sometimes called recognition by synthesis.

    a) Generative models have the advantage of fitting nicely into the Bayesian statistical framework. The generative mode is similar to computing likelihood: it expresses the probability that the internal model (hypothesis) would have produced the input pattern (data).

    b) The disadvantage of a generative model is that it establishes a rigid hierarchy: one layer must be considered as an internal model to the one below it & each layer uses a different kind of processing. This may limit its usefulness and biological plausibility as compared to bidirectional constraint satisfaction processing (as in Chapter 3).
