Additional NN Models
(I) Reinforcement Learning (RL)

• Basic ideas:
– Supervised learning (delta rule, BP): samples (x, f(x)) are given to learn f(.); a precise error can be determined and is used to drive the learning.
– Unsupervised learning (competitive, BM): no target/desired output is provided to help learning; learning is self-organized (clustering).
– Reinforcement learning: in between the two. No target output is given for the input vectors in the training samples; instead, a judge/critic evaluates the output:
  good: reward signal (+1); bad: penalty signal (-1)

• RL exists in many places
– It originated from psychology (animal training).
– In the machine learning community there are different theories and algorithms. A major difficulty is credit/blame distribution, e.g.,
  chess playing: win/lose known only at the end (multi-step)
  soccer playing: win/lose shared by the team (multi-player)
– In many applications it is much easier to determine good/bad, right/wrong, acceptable/unacceptable than to provide the precise correct answer/error.
– It is up to the learning process to improve the system's performance based on the critic's signal.

• Principle of RL
– Let r = +1 denote reward (good output) and r = -1 denote penalty (bad output).
– If r = +1, the system is encouraged to continue what it is doing; if r = -1, the system is encouraged not to do what it is doing.
– The system needs to search for a better output, because r = -1 does not indicate what the good output should be; a common method is "random search".
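The reward/penalty principle above can be sketched with a single stochastic unit. This is a minimal illustration, not the ARP algorithm itself: the names (critic, TARGET, ETA) and the update rule w += eta·r·z are made up for the sketch, and the critic only ever says +1 or -1, never what the target is.

```python
import math
import random

random.seed(0)
ETA = 0.5
TARGET = 1  # hidden from the learner; only the critic knows it


def critic(output):
    """Judge the output: +1 (reward) if good, -1 (penalty) otherwise."""
    return 1 if output == TARGET else -1


w = 0.0  # bias-like weight; P(output = +1) grows with w
for step in range(200):
    p = 1.0 / (1.0 + math.exp(-w))        # logistic probability of emitting +1
    z = 1 if random.random() < p else -1  # random search: stochastic output
    r = critic(z)
    # if rewarded, reinforce what was just done; if penalized, move away from it
    w += ETA * r * z

print(w > 0)  # True: the unit now strongly prefers the rewarded output
```

Note that the learner never sees TARGET directly; it recovers the good output purely from the +1/-1 signal, which is the defining feature of RL.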
• ARP: the associative reward-and-penalty algorithm for NNs (Barto and Anandan, 1985)
– Architecture
  [Figure: the input x(k) feeds a layer of stochastic units z(k), which produce the output y(k); a critic evaluates the output and returns r(k).]
  input: x(k); output: y(k); stochastic units: z(k), for random search
– Random search by stochastic units z_i:
    p(z_i = 1) = 1 / (1 + e^(-2·net_i/T)),   p(z_i = -1) = 1 / (1 + e^(2·net_i/T)),   where net_i = Σ_j w_ij x_j
  Alternatively, let z_i obey a continuous probability distribution function, or let z_i contain a random noise term that obeys a certain distribution.
  Key: z is not a deterministic function of x; this gives z a chance to be a good output.
– Prepare a (temporary) desired output:
    d(k) = y(k)   if r(k) = +1
    d(k) = -y(k)  if r(k) = -1
– Compute the errors at the z layer:
    e(k) = d(k) - E(z(k))
  where E(z(k)) is the expected value of z(k), because z is a random variable. How to compute E(z(k)):
  • take the average of z over a period of time;
  • compute it from the distribution, if possible;
  • if the logistic sigmoid above is used,
    E(z_i) = (+1)·p(z_i = 1) + (-1)·p(z_i = -1) = tanh(net_i / T)
– Training: BP or another method to minimize the error.
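With the logistic form given above (including the factor 2 in the exponent), the expectation of the ±1-valued stochastic unit collapses to tanh(net_i/T). A quick numeric check (the values of net and T are arbitrary):

```python
import math

def p_plus(net, T):
    """p(z = +1) for the ARP stochastic unit."""
    return 1.0 / (1.0 + math.exp(-2.0 * net / T))

def expected_z(net, T):
    """E[z] over the two outcomes {+1, -1}."""
    p1 = p_plus(net, T)
    return (+1) * p1 + (-1) * (1.0 - p1)

net, T = 0.7, 1.5
print(abs(expected_z(net, T) - math.tanh(net / T)) < 1e-12)  # True
```

The identity holds because 2·p(z = 1) - 1 = (1 - e^(-2·net/T)) / (1 + e^(-2·net/T)) = tanh(net/T).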
(II) Probabilistic Neural Networks
1. Purpose: classify a given input pattern x into one of the pre-defined classes by the Bayesian decision rule. Suppose there are k predefined classes s_1, ..., s_k.
  P(s_i): prior probability of class s_i
  P(x|s_i): conditional probability of x, given class s_i
  P(x): probability of x
  P(s_i|x): posterior probability of s_i, given x
Example: S = s_1 ∪ s_2 ∪ ... ∪ s_k, the set of all patients;
  s_i: the set of all patients having disease i;
  x: a description of a patient (manifestation);
  P(x|s_i): probability that one with disease i will have description x;
  P(s_i|x): probability that one with description x has disease i.
By Bayes' theorem:
  P(s_i|x) = P(x|s_i)·P(s_i) / P(x)
Since P(x) is the same constant for all classes, classify x into class s_i iff
  P(s_i|x) = max_j P(s_j|x), i.e., iff P(x|s_i)·P(s_i) = max_j P(x|s_j)·P(s_j)
In PNN, the P(x|s_i) are learned from examples.
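The decision rule above can be sketched directly: pick the class maximizing P(x|s_i)·P(s_i), with P(x) dropped as a common constant. The two one-dimensional Gaussian class densities and the priors below are made-up illustrations, not part of any PNN training procedure.

```python
import math

def gauss(x, mu, sigma):
    """One-dimensional Gaussian density."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

priors = {"s1": 0.5, "s2": 0.5}                       # P(s_i), assumed known
likelihood = {"s1": lambda x: gauss(x, 0.0, 1.0),     # P(x|s1), illustrative
              "s2": lambda x: gauss(x, 3.0, 1.0)}     # P(x|s2), illustrative

def classify(x):
    # P(x) is identical for every class, so comparing P(x|s_i)*P(s_i) suffices
    return max(priors, key=lambda s: likelihood[s](x) * priors[s])

print(classify(0.2), classify(2.9))  # s1 s2
```

With equal priors the rule reduces to maximum likelihood; unequal priors would shift the decision boundary toward the less probable class.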
2. PNN architecture: feedforward with 2 hidden layers. Learning is used not to minimize an error but to obtain P(x|s_i).
3. Learning assumptions: the P(s_i) are known, and P(x|s_i) obeys a Gaussian distribution.
Estimate P(x|s_i) from the m_i training samples x_i1, ..., x_im_i of class s_i. Recall the one-dimensional Gaussian density
  f(x) = (1 / (√(2π)·σ)) · exp(-(x - μ)² / (2σ²))
For n-dimensional x, PNN averages one Gaussian per training sample of the class:
  P(x|s_i) = (1 / ((2π)^(n/2) · σ^n · m_i)) · Σ_{j=1}^{m_i} exp(-‖x - x_ij‖² / (2σ²))
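The estimate above is an equally weighted sum of spherical Gaussians, one centered on each stored training sample of class s_i. A minimal sketch, with SIGMA and the sample set made up:

```python
import math

SIGMA = 1.0  # illustrative smoothing width

def p_x_given_class(x, samples):
    """Average-of-Gaussians estimate of P(x|s_i) from the class's samples."""
    n = len(x)
    norm = (2 * math.pi) ** (n / 2) * SIGMA ** n * len(samples)
    total = 0.0
    for s in samples:
        d2 = sum((a - b) ** 2 for a, b in zip(x, s))  # squared distance to sample
        total += math.exp(-d2 / (2 * SIGMA ** 2))
    return total / norm

class_samples = [(0.0, 0.0), (0.2, -0.1), (-0.1, 0.3)]  # made-up training data
near = p_x_given_class((0.0, 0.1), class_samples)
far = p_x_given_class((5.0, 5.0), class_samples)
print(near > far)  # True: density is higher near the stored samples
```

In the PNN architecture each stored sample becomes one pattern-layer node, which is why classification trades nodes for time.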
4. Comments:
(1) Bayesian classification by the decision rule above.
(2) Fast classification (especially if implemented on a parallel machine).
(3) Fast learning.
(4) Trades nodes for time (not good with large training samples/clusters).
(III) Recurrent BP
1. Recurrent networks: networks with feedback links.
  - The state (output) of the network evolves along the time.
  - May or may not have hidden nodes.
  - May or may not stabilize when t → ∞.
  - The learning problem: how to learn W so that an initial state (input) will lead to a stable state with the desired output.
2. Unfolding: for any recurrent network with a finite evolution time, there is an equivalent feedforward network.
  Problems: too many repetitions (too many layers) when the network needs a long time to reach a stable state, and standard BP needs to be revised to handle the duplicated (shared) weights.
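The equivalence behind unfolding can be sketched as follows: running a recurrent net for T steps computes exactly what a T-layer feedforward net with the same (duplicated) weights in every layer would compute. The tiny 2-unit net and T are made up.

```python
import math

W = [[0.0, 0.5], [-0.5, 0.0]]  # illustrative recurrent weights
x = [1.0, 0.0]                  # external input
T = 4                           # finite evolution time

def step(v):
    """One recurrent update: v <- g(W v + x), with g = tanh."""
    return [math.tanh(sum(W[i][j] * v[j] for j in range(2)) + x[i])
            for i in range(2)]

# recurrent view: iterate the same update T times
v = [0.0, 0.0]
for _ in range(T):
    v = step(v)

# unfolded view: T distinct "layers", each reusing the same W
u = [0.0, 0.0]
for _layer in range(T):
    u = step(u)

print(v == u)  # True: the two views compute the same states
```

The catch, as noted above, is that the unfolded net's layers share one weight matrix, so the standard per-layer BP update must be modified to keep the copies identical.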
3. Recurrent BP (1987)
System dynamics:
  dv_i/dt = -v_i + g(h_i),  where h_i = Σ_j w_ij v_j + x_i
Assume at least one fixed point exists for the system with the given initial state. When a fixed point is reached,
  v_i = g(Σ_j w_ij v_j + x_i)
can be obtained. Error (summed over the output units k, with targets t_k):
  E = (1/2) Σ_k e_k²,  where e_k = t_k - v_k
Take the gradient descent approach to minimize E by updating W. Direct derivation gives
  ΔW_st = -η · ∂E/∂W_st = η Σ_k e_k · ∂v_k/∂W_st = η Σ_k e_k · q_ks · g'(h_s) · v_t
where q_ks is an element of the matrix Q = P^(-1), and P has elements
  p_ij = δ_ij - g'(h_i)·w_ij,  with δ_ij = 1 if i = j and 0 if i ≠ j
Computing P^(-1) is very time consuming. Pineda's and Almeida's proposal: the quantity
  z_s = Σ_k e_k · q_ks
can be computed by another recurrent net with a structure identical to the original RN but with the direction of each arc reversed (the transposed network): if the weight from node j to node i is w_ij in the original network, the weight from node j to node i in the transposed network is w_ji. The transposed network evolves by
  dz_i/dt = -z_i + Σ_j g'(h_j)·w_ji·z_j + e_i
Its fixed point satisfies z_i = Σ_j g'(h_j)·w_ji·z_j + e_i, which is exactly Σ_j z_j·p_ji = e_i, i.e., z_s = Σ_k e_k·q_ks. The weight update then becomes
  ΔW_st = η · z_s · g'(h_s) · v_t
Weight-update procedure for RBP, with a given input and its desired output (y(0), d):
1. Relax the original network to a fixed point.
2. Compute the error e.
3. Relax the transposed network to a fixed point.
4. Update the weights of the original network.

The complete learning algorithm (incremental/sequential): W is updated on the presentation of each learning pair using the weight-update procedure. To ensure the learned network is stable, the learning rate must be small (much smaller than the rate for standard BP learning). It is time consuming: two relaxation processes are involved for each step of weight update. It shows better performance than BP in some applications.
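The four-step procedure above can be sketched on a tiny fully recurrent net. This is an illustration under made-up settings (network size, rates, target), not a reference implementation: relaxation is done by Euler integration of the two dynamics, and the update is ΔW_st = η·z_s·g'(h_s)·v_t.

```python
import math
import random

random.seed(1)
N = 3                               # 3 units; unit 2 is the output unit
ETA, STEPS, DT = 0.1, 200, 0.1
W = [[random.uniform(-0.5, 0.5) for _ in range(N)] for _ in range(N)]
x = [0.5, -0.3, 0.0]                # external input
t = {2: 0.8}                        # desired output for unit 2

g = math.tanh
def gprime(h):
    return 1.0 - math.tanh(h) ** 2  # derivative of tanh

def relax_forward():
    """Step 1: relax dv_i/dt = -v_i + g(sum_j w_ij v_j + x_i) to a fixed point."""
    v = [0.0] * N
    for _ in range(STEPS):
        h = [sum(W[i][j] * v[j] for j in range(N)) + x[i] for i in range(N)]
        v = [v[i] + DT * (-v[i] + g(h[i])) for i in range(N)]
    h = [sum(W[i][j] * v[j] for j in range(N)) + x[i] for i in range(N)]
    return v, h

def relax_transposed(h, e):
    """Step 3: relax dz_i/dt = -z_i + sum_j g'(h_j) w_ji z_j + e_i."""
    z = [0.0] * N
    for _ in range(STEPS):
        z = [z[i] + DT * (-z[i]
             + sum(gprime(h[j]) * W[j][i] * z[j] for j in range(N))
             + e[i]) for i in range(N)]
    return z

def error_sq():
    v, _ = relax_forward()
    return sum((t[k] - v[k]) ** 2 for k in t)

before = error_sq()
for _ in range(30):                 # a few presentations of the learning pair
    v, h = relax_forward()                                    # step 1
    e = [(t[i] - v[i]) if i in t else 0.0 for i in range(N)]  # step 2
    z = relax_transposed(h, e)                                # step 3
    for s in range(N):                                        # step 4
        for u in range(N):
            W[s][u] += ETA * z[s] * gprime(h[s]) * v[u]
print(error_sq() < before)  # error at the fixed point decreased
```

Note the cost remark above in action: every weight update requires two full relaxations, one of the original net and one of the transposed net.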
(IV) Networks of Radial Basis Functions
1. Motivation: better function approximation.
  – BP networks (hidden units are sigmoid): training time is very long; generalization (with non-training input) is not always good.
  – Counter-Propagation (hidden units are WTA): poor approximation, especially with interpolation; any input is forced to be classified into one class and in turn produces that class's output as its function value.
2. Architecture: input → hidden → output (similar to BP and CPN); operation/learning is similar to CPN:
  input → hidden: competitive learning to determine the class of the input
  hidden → output: delta rule (LMS error) for the mapping
  difference: the hidden units obey a radial basis function
3. Hidden unit: Gaussian function. Suppose unit i represents a class of inputs with centroid C_i:
  φ_i(x) = exp(-‖x - C_i‖² / (2σ²)),  where ‖x - C_i‖² = Σ_j (x_j - c_ij)²
This is called a radial basis function: input vectors with equal distance to C_i produce the same output,
  if ‖x1 - C_i‖ = ‖x2 - C_i‖ then φ_i(x1) = φ_i(x2)
Each hidden unit i has a receptive field with C_i as its center: if x = C_i, unit i has the largest output, and the farther x is from C_i, the smaller the output. The size of the receptive field is determined by σ.
During computation the hidden units are not WTA (no lateral inhibition): with an input x, usually more than one hidden unit has a non-zero output. These outputs are combined at the output layer to produce a better approximation.
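The two properties just stated (maximal output at the centroid, equal output for equidistant inputs) follow directly from the Gaussian form. A minimal sketch with a made-up centroid and σ:

```python
import math

SIGMA = 1.0
C = (1.0, 0.0)  # centroid of hidden unit i (illustrative)

def phi(x):
    """Gaussian radial basis activation of unit i for input x."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, C))
    return math.exp(-d2 / (2 * SIGMA ** 2))

print(phi(C) == 1.0)                                     # largest output at x = C_i
print(abs(phi((2.0, 0.0)) - phi((0.0, 0.0))) < 1e-12)    # equidistant inputs agree
```

Both printed values are True: (2, 0) and (0, 0) lie at distance 1 from C on opposite sides, so their activations are identical.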
4. Learning
  input → hidden: the centroids C_i are learned by competitive learning, based on net_i; σ is set ad hoc (performance is not sensitive to σ)
  hidden → output: delta rule (LMS)
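The hidden → output stage above can be sketched as follows, with the competitive stage skipped: the centroids are simply assumed already found, and σ, the learning rate, and the sin target are made-up illustrations. Only the output weights are trained, by the delta rule (LMS).

```python
import math

SIGMA, ETA = 0.6, 0.2
CENTERS = [0.0, 0.5, 1.0]           # assume centroids already found (made up)

def hidden(x):
    """Gaussian RBF activations of the hidden layer for scalar input x."""
    return [math.exp(-(x - c) ** 2 / (2 * SIGMA ** 2)) for c in CENTERS]

w = [0.0] * len(CENTERS)             # output weights, to be learned
target = math.sin                    # function to approximate on [0, 1]
samples = [i / 20 for i in range(21)]

def predict(x):
    return sum(wi * hi for wi, hi in zip(w, hidden(x)))

for _ in range(500):                 # delta rule: w <- w + eta * error * h
    for x in samples:
        err = target(x) - predict(x)
        h = hidden(x)
        w = [wi + ETA * err * hi for wi, hi in zip(w, h)]

mse = sum((target(x) - predict(x)) ** 2 for x in samples) / len(samples)
print(mse < 1e-2)  # the 3-unit RBF net fits sin on [0, 1] closely
```

Note how several hidden units respond to each input (no WTA), and the output layer blends their overlapping receptive fields into a smooth interpolation.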
5. Comments
• Compared with BP:
  – can approximate any L2 function (same as BP)
  – may have better …
  – usually requires many more training samples and many more hidden units
  – only one hidden layer is needed
  – training is faster
• Compared with CPN:
  – much better function approximation
  – theoretical analysis is only preliminary