Flexible and Fast Convergent Learning Agent
Miguel A. Soto Santibanez, Michael M. Marefat
Department of Electrical and Computer Engineering, University of Arizona, Tucson, AZ
Background and Motivation
"A computer program is said to LEARN from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."
A robot driving learning problem:
Task T: driving on public four-lane highways using vision sensors
Performance measure P: average distance traveled before an error (as judged by a human overseer)
Training experiences E: a sequence of images and steering commands recorded while observing a human driver
Background and Motivation II
1) Artificial Neural Networks
Advantage: robust to errors in the training data
Shortcoming: dependency on the availability of good and extensive training examples
2) Instance-Based Learning
Advantage: able to model complex policies by making use of less complex local approximations
Shortcoming: dependency on the availability of good and extensive training examples
3) Reinforcement Learning
Advantage: independent of the availability of good and extensive training examples
Shortcoming: convergence to the optimal policy can be extremely slow
Background and Motivation III
Motivation:
Is it possible to get the best of both worlds?
Is it possible for a Learning Agent to be flexible and fast convergent at the same time?
The Problem
Formalization:
Given:
a) a set of actions A = {a1, a2, a3, ...},
b) a set of situations S = {s1, s2, s3, ...},
c) and a function TR(a, s) → tr, where tr is the total reward associated with applying action a while at situation s,
the LA needs to construct a set of rules P = {rule(s1, a1), rule(s2, a2), ...} such that for every rule(s, a) ∈ P, a = amax, where TR(amax, s) = max(TR(a1, s), TR(a2, s), ...).
Also:
1) Increase flexibility
2) Increase speed of convergence
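The following is a minimal Python sketch of this formalization; it is illustrative only, and the names Action, Situation, and build_policy are assumptions rather than part of the original formulation:

from typing import Callable, Dict, Hashable, Iterable

Action = Hashable
Situation = Hashable

def build_policy(A: Iterable[Action], S: Iterable[Situation],
                 TR: Callable[[Action, Situation], float]) -> Dict[Situation, Action]:
    """For every situation s, choose the action amax with
    TR(amax, s) = max(TR(a1, s), TR(a2, s), ...).  The returned dictionary
    plays the role of the rule set P = {rule(s1, a1), rule(s2, a2), ...}."""
    A = list(A)
    P = {}
    for s in S:
        P[s] = max(A, key=lambda a: TR(a, s))
    return P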
The Solution
The Q learning Algorithm:
1: ∀ rule(s, a) ∈ P, TR(a, s) ← 0
2: find out what is the current situation si
3: do forever:
4:   select an action ai ∈ A and execute it
5:   find out what is the immediate reward r
6:   find out what is the current situation si'
7:   TR(ai, si) ← r + Factor · maxa(TR(a, si'))
8:   si ← si'
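Below is a minimal Python sketch of the tabular algorithm above. The environment interface (get_situation() and step(a) returning the immediate reward and the new situation) and the random exploration strategy are assumptions, since the slide does not specify them:

import random
from collections import defaultdict

def q_learning(actions, get_situation, step, factor=0.9, steps=10000):
    """Tabular Q learning following steps 1-8 above.

    get_situation() is assumed to return the current situation; step(a) is
    assumed to execute action a and return (immediate_reward, new_situation)."""
    TR = defaultdict(float)            # 1: for all rule(s, a) in P, TR(a, s) <- 0
    s = get_situation()                # 2: find out the current situation
    for _ in range(steps):             # 3: do forever (bounded here for the sketch)
        a = random.choice(actions)     # 4: select an action (random exploration assumed)
        r, s_next = step(a)            # 5, 6: immediate reward and new situation
        # 7: TR(a, s) <- r + Factor * max over a' of TR(a', s_next)
        TR[(a, s)] = r + factor * max(TR[(a2, s_next)] for a2 in actions)
        s = s_next                     # 8: si <- si'
    return TR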
The Solution II
Advantages:
1) The LA does not depend on the availability of good and extensive training examples
Reason: a) this method learns from experimentation instead of given training examples
Shortcomings:
1) Convergence to the optimal policy can be very slow
Reasons: a) the Q learning Algorithm propagates "good findings" very slowly; b) the speed of convergence is tied to the number of situations that need to be handled
2) May not be able to use this method on high-dimensionality problems
Reason: a) the memory requirements grow exponentially as we add more dimensions to the problem
The Solution III
Speed of convergence is tied to the number of situations:
more situations ==> more P rules that need to be found
more P rules that need to be found ==> more experiments are needed
more experiments are needed ==> slower convergence
Example: a world with 120,000 situations vs. a world with 12 situations.
The Solution IV
Slow propagation of "good findings":
[Figure: a small world of possible situations A through L with a table of intrinsic rewards (a single situation worth 100, all others 0) and Factor = 0.9. After visiting the path A, B, ..., G (numbered 1 to 7) once, only the values 90 and 100 appear next to the rewarded situation; after visiting A, ..., G two times, the value 81 has propagated one step further back; only after visiting A, ..., G five times have the values 59, 66, 73, 81, 90, 100 propagated along the whole path.]
The Solution V
First Sub-problem: slow propagation of "good findings"
Solution: develop a method that propagates "good findings" beyond the previous state
[Figure: the same world of situations A through L, with intrinsic value of F = 100, intrinsic value of all other situations = 0, and Factor = 0.9. Without propagation, a single pass through A, ..., G leaves only the values 90 and 100 next to F; with propagation, the same single pass fills in the values 59, 66, 73, 81, 90, 100 along the whole path.]
The Solution VI
Solution to First Sub-problem:
a) Use a buffer, which we call "short term memory", to keep track of the last n situations
b) After each learning experience apply the following algorithm:
Begin
  t = currentTime - 1
  while the entry visited at time t is stored in the "short term memory":
    if the total reward (coming from the entry at time t + 1) is bigger than the official value:
      update P
    t = t - 1
End
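A minimal Python sketch of this backward pass, assuming the "short term memory" stores (situation, action, immediate reward) entries in time order and that TR is the table built by the Q learning Algorithm; these data-layout details are assumptions:

def propagate_good_findings(TR, short_term_memory, actions, factor=0.9):
    """Walk backwards through the "short term memory" and push improved total
    rewards toward earlier situations, as in the flow chart above.

    short_term_memory[t] is assumed to hold (situation, action, immediate_reward)
    for the situation visited at time t, oldest entry first."""
    for t in range(len(short_term_memory) - 2, -1, -1):    # t = currentTime - 1, then t - 1, ...
        s, a, r = short_term_memory[t]
        s_next, _, _ = short_term_memory[t + 1]
        # total reward coming from the entry at time t + 1
        candidate = r + factor * max(TR.get((a2, s_next), 0.0) for a2 in actions)
        if candidate > TR.get((a, s), 0.0):                 # bigger than the official value?
            TR[(a, s)] = candidate                          # update P
    return TR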
The Solution VII
The Second and Third Sub-problems: a) Memory requirements grow exponentially as we add more dimensions to the problem
b) Speed of convergence tied to number of situations that need to be handled.
Solution:
1) We keep just a few examples of the policy (also called prototypes)
2) We generate the policy for situations not described explicitly by these prototypes by "generalizing" from "nearby" prototypes
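A minimal sketch of such generalization in Python, using a nearest-prototype lookup; the Euclidean distance and the (situation_vector, action) layout of a prototype are assumptions, since the slides leave the notion of "nearby" open:

def generalize(prototypes, situation):
    """Return the action stored with the prototype "nearest" to the given situation.

    prototypes is assumed to be a list of (situation_vector, action) pairs and
    situations are assumed to be numeric tuples; Euclidean distance is just one
    possible way to define "nearby"."""
    def distance(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5

    _, action = min(prototypes, key=lambda proto: distance(proto[0], situation))
    return action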
The Solution VIII
[Figure: Kanerva Coding and Tile Coding compared with Moving Prototypes.]
The Solution IX
The Solution X
The Solution XI
The Solution XII
A sound tree:
a) all the "areas" are mutually exclusive
b) their merging is exhaustive
c) the merging of any two sibling "areas" is equal to their parent's "area"
[Figure: parent "areas" and their children "areas" in the tree.]
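As an illustration only, a small Python check of these three conditions for one parent and its two children, assuming one-dimensional half-open intervals as the "areas" (the actual "areas" in Moving Prototypes are multi-dimensional regions):

def siblings_are_sound(parent, left, right):
    """Check the "sound tree" conditions for one parent and its two children,
    with every "area" represented as a half-open interval (low, high)."""
    mutually_exclusive = left[1] <= right[0] or right[1] <= left[0]   # condition a)
    no_gap = left[1] == right[0] or right[1] == left[0]               # merging is exhaustive, b)
    covers_parent = (min(left[0], right[0]) == parent[0]              # merge equals parent's area, c)
                     and max(left[1], right[1]) == parent[1])
    return mutually_exclusive and no_gap and covers_parent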
The Solution XIII
Impossible Merge
The Solution XIV
“Smallest predecessor”
The Solution XV
The Solution XVI
Possible ways of breaking the existing nodes:
[Figure: the node being inserted, shown against the possible splits of the existing nodes.]
The Solution XVII
[Figure: List 1 and its sub-lists List 1.1 and List 1.2.]
The Solution XVIII
The Solution XIX
The Solution XX
Results
The performance of the algorithm "Propagation of Good Findings" is especially good when the world is large:
[Chart: experiences needed (0 to 1,000,000) vs. world size (0 to 35) for the "Look around", "Q Learning", and "Propagation" algorithms. Memory Size = 100, Seed = 9642, Factor = 0.99.]
The algorithm "Propagation of Good Findings" is more efficient when the size of its "Short Term Memory" is large:
[Chart: experiences needed (0 to 10,000) vs. memory size (0 to 7) for the "Look around", "Q Learning", and "Propagation" algorithms. Seed = 2129, World Size = 7x7, Factor = 0.9.]
Results II
The algorithm "Propagation of Good Findings" is more efficient when the value of the parameter "discount factor" is large:
[Chart: experiences needed (0 to 60,000) vs. discount factor (0 to 1.2) for the "Look around", "Q Learning", and "Propagation" algorithms. Memory Size = 100, World Size = 7x7, Seed = 2129.]
Results do not depend on the sequence of random numbers used.
Conclusions
The proposed Learning Agent combines the Q Learning Algorithm, Propagation of Good Findings, and Moving Prototypes:
Q Learning Algorithm ==> the LA becomes more flexible
Propagating concept ==> convergence is accelerated
Moving Prototypes concept ==> the LA becomes more flexible
Moving Prototypes concept ==> convergence is accelerated
Conclusions II
What is left to do:
Obtain results on the advantages of using regression trees and linear approximation over other similar methods (just as we have already done with the method “Propagation of Good Findings”).
Apply the proposed model to solving example applications, such as a self-optimizing middle-man between a high-level planner and the actuators in a robot.
Develop more precisely the limits on the use of this model.