
Page 1: Feedback Networks and Hopfield Networks - KTH


Feedback Networks and Hopfield Networks

Erik Fransén, Daniel Gillblad

CB, KTH

Page 2: Feedback Networks and Hopfield Networks - KTH

Outline

1 Reducing the Boltzmann machine

2 Hopfield Networks

3 Hebbian Learning

4 Beyond the Boltzmann machine

Some examples and images from MacKay (2003): Information Theory, Inference, and Learning Algorithms, a good introduction to machine learning and information theory.

Page 3: Feedback Networks and Hopfield Networks - KTH

Reducing the Boltzmann machine

Problem of the Boltzmann machine: the number of computations grows exponentially with problem size.

Reducing the topology: the restricted Boltzmann machine (no connections between hidden units).
Replacing the stochastic variables with their mean: the Hopfield network.

Page 4: Feedback Networks and Hopfield Networks - KTH

Replacing stochastic variables

The activation (input to a neuron) from other neurons is stochastic. Replace $a_i = \sum_j w_{ij} x_j$ with $\langle a_i \rangle$.
The state of a neuron is stochastic. Replace $x_j$ with $\langle x_j \rangle$.
To do this, mean field theory is used. Simply stated: the average of the function value is approximated by the function value of the average.
+ much faster
- only first-order (and with some tricks also second-order) correlations can be handled
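A minimal numerical illustration of the mean-field idea, assuming a single tanh unit with Gaussian-distributed input (all values here are illustrative): compare the sampled average of the function value with the function value of the average.

```python
import numpy as np

rng = np.random.default_rng(0)

mean_a, std_a = 0.8, 0.5                     # assumed input statistics (illustrative)
a_samples = rng.normal(mean_a, std_a, size=100_000)

# Average of the function value, estimated by sampling
avg_of_f = np.tanh(a_samples).mean()

# Mean-field approximation: function value of the average
f_of_avg = np.tanh(mean_a)

print(f"<tanh(a)>  ~ {avg_of_f:.3f}")
print(f"tanh(<a>)  = {f_of_avg:.3f}")
```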

Page 5: Feedback Networks and Hopfield Networks - KTH

Hopfield Networks

A fully connected feedback network.
The weights are constrained to be symmetric.
Can be used:

as nonlinear associative memories or content-addressable memories;
to solve optimization problems.

Page 6: Feedback Networks and Hopfield Networks - KTH

Discrete Hopfield Networks, Notation

We will denote the weight from neuron i to neuron j as $w_{ij}$.
The network consists of $I$ fully connected neurons through symmetric connections, i.e. $w_{ij} = w_{ji}$.
There are no self-connections, thus $w_{ii} = 0$.
Biases $w_{i0}$ may be included.
The activity of a neuron is written as $x_i$.

Page 7: Feedback Networks and Hopfield Networks - KTH

Discrete Hopfield Networks, Activities

A Hopfield network’s activity rule is for each neuron to update its state in accordance with a thresholding activation function,

$$x(a) = \Theta(a) \equiv \begin{cases} 1 & a \geq 0 \\ -1 & a < 0 \end{cases}$$

As there is feedback, we need to define an order for the updates:

Synchronous updates. All neurons compute their activations $a_i = \sum_j w_{ij} x_j$, then update their states simultaneously using $x_i = \Theta(a_i)$.
Asynchronous updates. One neuron at a time updates its activation and state. The sequence can be fixed or random.

The properties of the network may be sensitive to the update strategy.
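A minimal sketch of the two update strategies for a discrete Hopfield network (NumPy; W and x are assumed to be a symmetric weight matrix and a ±1 state vector):

```python
import numpy as np

def theta(a):
    """Thresholding activation: +1 if a >= 0, else -1."""
    return np.where(a >= 0, 1, -1)

def update_synchronous(W, x):
    """All neurons compute a_i = sum_j w_ij x_j, then update simultaneously."""
    return theta(W @ x)

def update_asynchronous(W, x, order=None, rng=None):
    """One neuron at a time; the sequence may be fixed or random."""
    x = x.copy()
    if order is None:
        order = (rng or np.random.default_rng()).permutation(len(x))
    for i in order:
        x[i] = theta(W[i] @ x)
    return x
```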

Page 8: Feedback Networks and Hopfield Networks - KTH

Discrete Hopfield Networks, Convergence

Let us update one neuron at a time according to the update rule.
The network state will converge within a finite number of steps if

$$w_{ij} = w_{ji}, \qquad w_{ii} = 0$$

Define an energy measure,

$$E = -\frac{1}{2} \sum_{i,j} w_{ij} x_i x_j$$

The change in energy for each update is

$$\Delta E_{x_k \to x_k^*} = -\Big(\sum_i w_{ki} x_i x_k^* - \sum_i w_{ki} x_i x_k\Big) = -(x_k^* - x_k) \sum_i w_{ki} x_i \leq 0$$
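A small numerical check of the convergence argument, assuming symmetric weights with zero diagonal: the energy is non-increasing under asynchronous updates (the random weights and sizes are illustrative).

```python
import numpy as np

def energy(W, x):
    """E = -1/2 * sum_ij w_ij x_i x_j."""
    return -0.5 * x @ W @ x

rng = np.random.default_rng(1)
I = 20
W = rng.normal(size=(I, I))
W = (W + W.T) / 2            # enforce w_ij = w_ji
np.fill_diagonal(W, 0.0)     # enforce w_ii = 0

x = rng.choice([-1, 1], size=I)
for _ in range(5 * I):                        # asynchronous updates
    i = rng.integers(I)
    e_before = energy(W, x)
    x[i] = 1 if W[i] @ x >= 0 else -1
    assert energy(W, x) <= e_before + 1e-12   # energy never increases
```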

Page 9: Feedback Networks and Hopfield Networks - KTH

Hebbian Learning

Hebb’s postulate of learning is the oldest and most famous of all learning rules (Hebb, 1949):
When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic changes take place in one or both cells such that A’s efficiency, as one of the cells firing B, is increased.

We can reformulate this into:
1 If two neurons on either side of a connection are activated simultaneously, the strength of the connection is increased.
2 If two neurons on either side of a connection are activated asynchronously, the connection is weakened or eliminated.

Page 10: Feedback Networks and Hopfield Networks - KTH

Hebbian Connections

Four key properties:
1 Time-dependent mechanism. Modification depends on the time of occurrence.
2 Local mechanism. Only uses locally available information.
3 Interactive mechanism. A change depends on activity levels on both sides of the connection.
4 Correlational mechanism. Connection change depends on the correlation between activities.

Page 11: Feedback Networks and Hopfield Networks - KTH

Hebbian Learning and Correlation

Hebbian learning can be described in terms of correlation. Positively correlated activities are increased:

$$\frac{dw_{ij}}{dt} \sim \mathrm{Correlation}(x_i, x_j)$$

For example,

$$\frac{dw_{ij}}{dt} = \eta\, \mathrm{cov}(x_i, x_j) = \eta\, E[(x_i - \bar{x}_i)(x_j - \bar{x}_j)]$$

If two stimuli co-occur, the Hebbian learning rule will increase the weights.
Unsupervised learning.
Can be used to provide pattern completion.
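A sketch of a covariance-based Hebbian update estimated from a batch of activity patterns (the learning rate and the batch interface are illustrative assumptions):

```python
import numpy as np

def hebbian_covariance_step(W, X, eta=0.01):
    """One Hebbian step: dW ~ eta * cov(x_i, x_j), estimated from data.

    X has shape (n_samples, n_units); W has shape (n_units, n_units).
    """
    centered = X - X.mean(axis=0)             # (x - x_bar)
    cov = centered.T @ centered / len(X)      # E[(x_i - x_bar_i)(x_j - x_bar_j)]
    W_new = W + eta * cov
    np.fill_diagonal(W_new, 0.0)              # keep w_ii = 0, as in the Hopfield setting
    return W_new
```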

Page 12: Feedback Networks and Hopfield Networks - KTH

Associative Networks

Heteroassociation: mapping from one pattern to another, e.g.
$y = \mathrm{sign}(Wx)$ (thresholding)
$y = Wx$ (linear mapping)
Autoassociation: mapping to the same pattern, e.g.
$x = \mathrm{sign}(Wx)$
$x = Wx$
Can be performed with a recurrent network.
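A minimal sketch of both thresholded mappings (NumPy; helper names are illustrative):

```python
import numpy as np

def heteroassociate(W, x):
    """Map an input pattern x to an associated output pattern y = sign(Wx)."""
    return np.sign(W @ x)

def autoassociate(W, x, n_iter=10):
    """Repeatedly apply x = sign(Wx): a recurrent network settling on a pattern."""
    for _ in range(n_iter):
        x = np.sign(W @ x)
    return x
```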

Page 13: Feedback Networks and Hopfield Networks - KTH

Discrete Hopfield Networks, Learning

The learning rule is intended to make a set of desired memories $\{x^{(n)}\}$ be stable states of the activity rule.
Each memory is a binary pattern, $x_i \in \{-1, 1\}$.
The weights are set using the sum of outer products (Hebbian learning),

$$w_{ij} = \eta \sum_n x_i^{(n)} x_j^{(n)}$$

where $\eta$ is an unimportant constant. To prevent the weights from growing with the number of patterns, $\eta$ is often set to $\frac{1}{N}$.
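A sketch of this learning rule with η = 1/N, where N is the number of patterns: the weight matrix is the scaled sum of outer products, with the diagonal zeroed.

```python
import numpy as np

def hopfield_weights(patterns):
    """Hebbian outer-product rule: w_ij = (1/N) * sum_n x_i^(n) x_j^(n), w_ii = 0.

    patterns: array of shape (N, I) with entries in {-1, +1}.
    """
    N, I = patterns.shape
    W = patterns.T @ patterns / N   # sum of outer products, scaled by eta = 1/N
    np.fill_diagonal(W, 0.0)        # no self-connections
    return W
```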

Page 14: Feedback Networks and Hopfield Networks - KTH

Continuous Hopfield Networks

Using the same architecture and learning rule as in the discrete Hopfield network, we can define a continuous Hopfield network.
Activities are real numbers between -1 and 1.
Update activities as single neurons with sigmoid activation functions:

Synchronous or asynchronous updates. Activations are again calculated as $a_i = \sum_j w_{ij} x_j$, but neurons use the activation function $x_i = \tanh(a_i)$.

Although the learning rule is the same as before, the value of $\eta$ now becomes important.
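A sketch of the continuous variant: the structure matches the discrete updates above, with tanh as the activation function (function names are illustrative):

```python
import numpy as np

def continuous_update_synchronous(W, x):
    """Continuous Hopfield step: a_i = sum_j w_ij x_j, x_i = tanh(a_i)."""
    return np.tanh(W @ x)

def continuous_update_asynchronous(W, x, rng=None):
    """One randomly chosen neuron at a time updates with tanh."""
    x = np.asarray(x, dtype=float).copy()
    rng = rng or np.random.default_rng()
    for i in rng.permutation(len(x)):
        x[i] = np.tanh(W[i] @ x)
    return x
```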

Page 15: Feedback Networks and Hopfield Networks - KTH

Attractors

Use local minima to store patterns.
The input of the network is the initial state.
The output of the network is the closest local minimum.
There is an area of attraction around each local minimum.

Page 16: Feedback Networks and Hopfield Networks - KTH

Attractors, Example 1

Page 17: Feedback Networks and Hopfield Networks - KTH

Attractors, Example 2

Page 18: Feedback Networks and Hopfield Networks - KTH

Attractor Types

“Normal” attractors:
Point attractors.
Limit cycles.

Strange attractors (the system itself is chaotic):
Sensitive dependence on initial conditions.
Deterministic, but can exhibit a behaviour so complicated that it looks random.

Page 19: Feedback Networks and Hopfield Networks - KTH

Associative Memories

For simple Hopfield networks, it often takes just one iteration to converge from a randomly perturbed stored pattern.
The network often has more stable states in addition to the desired memories:
The inverse of a stable state.
Mixtures of the memories.

Introducing “brain damage” by setting a subset of the learned weights to zero often still allows the network to complete the patterns.
Patterns can usually be added up to a certain point, at which the network fails catastrophically.
Network properties are not robust when changing from asynchronous to synchronous updates.

Page 20: Feedback Networks and Hopfield Networks - KTH

Examples, auto associative memory


Page 21: Feedback Networks and Hopfield Networks - KTH

Examples, hetero associative memory

moscow------russia

lima----------peru

london-----england

tokyo--------japan

edinburgh-scotland

ottawa------canada

oslo--------norway

stockholm---sweden

paris-------france

moscow---????????? ⇒ moscow------russia

?????????---canada ⇒ ottawa------canada

otowaa------canada ⇒ ottawa------canada

egindurrh-scotland ⇒ edinburgh-scotland

Page 22: Feedback Networks and Hopfield Networks - KTH

Hopfield Networks and Optimization

Since a Hopfield network minimizes an energy function, we can map some optimization problems onto Hopfield networks.
Travelling Salesman:

[Figure: candidate tours over cities A, B, C, D represented on a city-by-position (1-4) grid of neuron activities.]

Page 23: Feedback Networks and Hopfield Networks - KTH

Capacity of Hopfield Networks

The standard learning method gives us local minima in the correct places.
However, retrieving patterns can fail in numerous ways:

1 Individual bits in some memories might be corrupted. A stable state of the network is displaced a little from the desired memory.
2 Entire memories might be absent from the set of attractors.
3 Spurious additional memories unrelated to the desired memories might be present.
4 Spurious additional memories derived from the desired memories through operations such as mixing and inversion may be present.

These are in general all undesirable, although failure type 4 may be regarded as generalization.

Page 24: Feedback Networks and Hopfield Networks - KTH

Examples on capacity

moscow------russia

lima----------peru

london-----england

tokyo--------japan

edinburgh-scotland

ottawa------canada

oslo--------norway

stockholm---sweden

paris-------france

→W→

moscow------russia

lima----------peru

londog-----englard

tonco--------japan

edinburgh-scotland

lostoslo--------norway

stockholm---sweden

paris-------france

wrpkmh---xqpqwqxpq

paris-------sweden

ecnarf-------sirap

Page 25: Feedback Networks and Hopfield Networks - KTH

More Examples on capacity, 1


Page 26: Feedback Networks and Hopfield Networks - KTH

More examples on capacity, 2


Page 27: Feedback Networks and Hopfield Networks - KTH

Consequences of Capacity Calculations, 1

If we try to store $N \simeq 0.18\, I$ patterns, then about 1% of the bits will be unstable after the first iteration, starting from a stored pattern.
When N/I is large, unstable bits may cause an avalanche effect of bits becoming unstable during iteration.
There is a sharp discontinuity at

$$N_{\mathrm{critical}} = 0.138\, I$$

When N/I exceeds 0.138, the system only has spurious states.

Page 28: Feedback Networks and Hopfield Networks - KTH

Consequences of Capacity Calculations, 2


Page 29: Feedback Networks and Hopfield Networks - KTH

Transition Properties

For all N/I, stable states uncorrelated with the desired memories exist.

For $N/I \in (0, 0.138)$ there are stable states close to the desired memories.

For $N/I \in (0, 0.05)$ the desired memories have lower energy than the uncorrelated states.

For $N/I \in (0.05, 0.138)$ the uncorrelated states dominate.

For $N/I \in (0, 0.03)$ there are additional mixture states.

Page 30: Feedback Networks and Hopfield Networks - KTH

Capacity of Random Patterns, 1

Assume that the patterns we want to store are random binary patterns.
Let us study the stability of a single bit, assuming that the state of the network is set to the desired pattern $x^{(n)}$.
The activation of a particular neuron is

$$a_i = \sum_j w_{ij} x_j^{(n)}$$

and the weights are (for $i \neq j$)

$$w_{ij} = x_i^{(n)} x_j^{(n)} + \sum_{m \neq n} x_i^{(m)} x_j^{(m)}$$

Page 31: Feedback Networks and Hopfield Networks - KTH

Capacity of Random Patterns, 2

We split W into two terms, one representing a “signal” reinforcing the desired memory, and the second a “noise” term. The activation becomes

$$a_i = \sum_{j \neq i} x_i^{(n)} x_j^{(n)} x_j^{(n)} + \sum_{j \neq i} \sum_{m \neq n} x_i^{(m)} x_j^{(m)} x_j^{(n)} = (I-1)\, x_i^{(n)} + \sum_{j \neq i} \sum_{m \neq n} x_i^{(m)} x_j^{(m)} x_j^{(n)}$$

The first term is $(I-1)$ times the desired state $x_i^{(n)}$.

The second term is a sum of $(I-1)(N-1)$ random quantities $x_i^{(m)} x_j^{(m)} x_j^{(n)}$. These are independent random variables with mean 0 and variance 1.

Page 32: Feedback Networks and Hopfield Networks - KTH

Capacity of Random Patterns, 3

We can conclude that $a_i$ has mean $(I-1)\, x_i^{(n)}$ and variance $(I-1)(N-1)$.
Assume that I and N are large enough that the distinction between $I$ and $I-1$ (and between $N$ and $N-1$) is negligible.
This means that $a_i$ is approximately Gaussian distributed with mean $I x_i^{(n)}$ and variance $IN$.
The probability that bit i will flip is

$$P(i \text{ flips}) = \Phi\!\left(-\frac{I}{\sqrt{IN}}\right) = \Phi\!\left(-\frac{1}{\sqrt{N/I}}\right)$$

where

$$\Phi(z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}\, dz$$
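A quick numerical check of this formula, using SciPy's normal CDF for Φ; N/I = 0.18 reproduces the roughly 1% flip probability mentioned earlier in the deck.

```python
import numpy as np
from scipy.stats import norm

def p_flip(n_over_i):
    """P(bit flips) = Phi(-1 / sqrt(N/I)) for random +/-1 patterns."""
    return norm.cdf(-1.0 / np.sqrt(n_over_i))

for ratio in [0.05, 0.138, 0.18]:
    print(f"N/I = {ratio:5.3f}:  P(flip) = {p_flip(ratio):.4f}")
```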

Page 33: Feedback Networks and Hopfield Networks - KTH

Increasing Capacity in Hopfield Networks

We can increase the capacity of the network if we abandon the Hebbian learning rule and instead use an objective function that measures how well the patterns are stored, and minimize it.
For all patterns $x^{(n)}$, if all other neurons are set correctly, the activation of neuron i should be such that $x_i = x_i^{(n)}$:

$$\mathrm{Cost}(W) = -\sum_i \sum_n \Big[ t_i^{(n)} \ln\big(y_i^{(n)}\big) + \big(1 - t_i^{(n)}\big) \ln\big(1 - y_i^{(n)}\big) \Big]$$

$$t_i^{(n)} = \begin{cases} 1 & x_i^{(n)} = 1 \\ 0 & x_i^{(n)} = -1 \end{cases} \qquad y_i^{(n)} = \frac{1}{1 + e^{-a_i^{(n)}}}$$

Parameters can be found using gradient descent.
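A sketch of one gradient-descent step on this cost, treating each neuron as a logistic regression on the other bits of each pattern and re-imposing symmetry and the zero diagonal afterwards (learning rate and interface are illustrative):

```python
import numpy as np

def capacity_training_step(W, patterns, lr=0.05):
    """One gradient-descent step on Cost(W), using all patterns at once.

    patterns: shape (N, I), entries in {-1, +1}.
    """
    T = (patterns + 1) / 2              # t = 1 if x = +1, else 0
    A = patterns @ W.T                  # a_i^(n) with the other bits set to x^(n)
    Y = 1.0 / (1.0 + np.exp(-A))        # y_i^(n) = sigmoid(a_i^(n))
    grad = (Y - T).T @ patterns         # dCost/dW for the cross-entropy cost
    W = W - lr * grad
    W = (W + W.T) / 2                   # keep weights symmetric
    np.fill_diagonal(W, 0.0)            # keep w_ii = 0
    return W
```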

Page 34: Feedback Networks and Hopfield Networks - KTH

Beyond the Boltzmann machine

The Boltzmann machine is a special case of an undirected graphical model; it belongs to a class called Markov Random Fields.
More about (statistical) belief networks in a later lecture.
