8/3/2019 RL with LCS
Towards Reinforcement Learning with LCS
Having until now concentrated on how LCS can handle regression and classification tasks, this chapter returns to the prime motivator for LCS: sequential decision tasks.
Problem Definition
The sequential decision tasks that will be considered are those describable by a Markov Decision Process (MDP).
Some of the previously used symbols will be assigned a new meaning.
Let X be the set of states x ∈ X of the problem domain, which is assumed to be of finite size N, and is hence mapped into the natural numbers N.
In every state xi ∈ X, an action a out of a finite set A is performed and causes a state transition to xj.
The probability of getting to state xj after performing action a in state xi is given by the transition function p(xj | xi, a), which is a probability distribution over X, conditional on X × A.
The positive discount factor γ ∈ R with 0 < γ ≤ 1 determines the preference of immediate reward over future reward.
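As a concrete illustration, a finite MDP of this form can be represented directly by its transition probabilities and expected rewards. The following sketch is not from the text; the array layout and reward values are illustrative assumptions.

```python
import numpy as np

# A small finite MDP with N = 2 states and |A| = 2 actions.
# P[a][i][j] = p(xj | xi, a); R[a][i] = expected reward for action a in state xi.
N, A = 2, 2
P = np.array([
    [[0.9, 0.1],   # action 0 from state 0
     [0.0, 1.0]],  # action 0 from state 1
    [[0.2, 0.8],   # action 1 from state 0
     [0.5, 0.5]],  # action 1 from state 1
])
R = np.array([
    [1.0, 0.0],  # rewards for action 0 in states 0, 1
    [0.0, 2.0],  # rewards for action 1 in states 0, 1
])
gamma = 0.9  # discount factor, 0 < gamma <= 1

# Each row of P[a] must be a probability distribution over next states.
assert np.allclose(P.sum(axis=2), 1.0)
```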
The aim is for every state to choose the action that maximises the reward in the long run, where future rewards are possibly valued less than immediate rewards.

The Value Function, the Action-Value Function and Bellman's Equation

The approach taken by dynamic programming (DP) and reinforcement learning (RL) is to define a value function V : X → R that expresses for each state how much reward we can expect to receive in the long run.
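For reference, Bellman's equation for the optimal value function can be written with the transition function and discount factor defined earlier; the symbol r(xi, a) for the expected reward is assumed here rather than taken from the slides:

```latex
V^*(x_i) = \max_{a \in \mathcal{A}} \left( r(x_i, a) + \gamma \sum_{x_j \in \mathcal{X}} p(x_j \mid x_i, a) \, V^*(x_j) \right)
```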
Problem Types
The three basic classes of infinite horizon problems are stochastic shortest path problems, discounted problems, and average reward per step problems, all of which are well described by Bertsekas and Tsitsiklis [17].
Here, only discounted problems and stochastic shortest path problems are considered, where for the latter only proper policies that are guaranteed to reach the desired terminal state are assumed.
Dynamic Programming and Reinforcement Learning
In this section, some common RL methods are introduced that learn these functions while traversing the state space, without building a model of the transition and reward functions.
These methods are simulation-based approximations to DP methods, and their stability is determined by the stability of the corresponding DP method.
Dynamic Programming Operators
Bellman's Equation is a set of equations that cannot be solved analytically.
Fortunately, several methods have been developed that make finding its solution easier, all of which are based on the DP operators T and Tμ.
Value Iteration and Policy Iteration
The method of value iteration is a straightforward application of the contraction property of T, and is based on applying T repeatedly to an initially arbitrary value vector V until it converges to the optimal value vector V*.
Convergence can only be guaranteed after an infinite number of steps, but the value vector V is usually already close to V* after few iterations.
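Value iteration as just described can be sketched in a few lines. This is a minimal illustration, not code from the text; the transition tensor P[a][i][j] and reward array R[a][i] are assumed conventions.

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """Apply the DP operator T repeatedly to an initially arbitrary
    value vector V until it (approximately) stops changing."""
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)  # arbitrary initial value vector
    while True:
        # (TV)(xi) = max_a [ R[a][i] + gamma * sum_j P[a][i][j] * V[j] ]
        Q = R + gamma * P @ V          # shape (n_actions, n_states)
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```

On a two-state chain where the single action moves from state 0 to the absorbing state 1 with reward 1, this converges to V = [1, 0] for gamma = 0.9, since no further reward is collected after the transition.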
Various variants of these methods exist, such as asynchronous value iteration, which at each application of T only updates a single state of V.
Modified policy iteration performs the policy evaluation step by approximating V by T^n V for some small n.
Asynchronous policy iteration mixes asynchronous value iteration with policy iteration by, at each step, either
i) updating some states of V by asynchronous value iteration, or
ii) improving the policy of some set of states by policy improvement.
Convergence criteria for these variants are given by Bertsekas and Tsitsiklis [17].
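For contrast with value iteration, standard (exact) policy iteration alternates full policy evaluation with greedy policy improvement. A minimal sketch under the same illustrative P/R array conventions, valid for gamma < 1:

```python
import numpy as np

def policy_iteration(P, R, gamma):
    """Alternate exact policy evaluation and greedy policy improvement
    until the policy no longer changes."""
    n_actions, n_states, _ = P.shape
    policy = np.zeros(n_states, dtype=int)  # arbitrary initial policy
    while True:
        # Policy evaluation: solve V = R_pi + gamma * P_pi V exactly.
        P_pi = P[policy, np.arange(n_states)]   # (n_states, n_states)
        R_pi = R[policy, np.arange(n_states)]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Policy improvement: act greedily with respect to V.
        new_policy = (R + gamma * P @ V).argmax(axis=0)
        if np.array_equal(new_policy, policy):
            return policy, V
        policy = new_policy
```

Modified policy iteration would replace the exact linear solve by a few applications of the policy's DP operator; asynchronous variants would restrict each step to a subset of states.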
Approximate Dynamic Programming
If N is large, we prefer to approximate the value function rather than representing the value for each state explicitly.
Approximate value iteration is performed by approximating the value iteration update Vt+1 = TVt by Vt+1 = ΠTVt,
where Π is the approximation operator that, for the used function approximation technique, returns the value function estimate Vt+1 that is closest to TVt.
The only approximation that will be considered is the one most similar to approximate value iteration: the temporal-difference solution, which aims at finding the fixed point V = ΠTV by an incremental, sampling-based update.
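The two relations just described can be written compactly; this is a reconstruction in standard notation (the original slide equations did not survive extraction), with Π denoting the approximation operator:

```latex
\underbrace{V_{t+1} = \Pi T V_t}_{\text{approximate value iteration}}
\qquad\qquad
\underbrace{\tilde{V} = \Pi T \tilde{V}}_{\text{temporal-difference fixed point}}
```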
SARSA(λ)
Coming to the first reinforcement learning algorithm, SARSA stands for State-Action-Reward-State-Action, as SARSA(0) requires only information on the current and next state/action pair and the reward that was received for the transition.
It conceptually performs policy iteration and uses TD(λ) to update its action-value function Q. More specifically, it performs optimistic policy iteration, where, in contrast to standard policy iteration, the policy improvement step is based on an incompletely evaluated policy.
Q-Learning
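Q-Learning differs from SARSA(0) in that its update target bootstraps from the greedy action in the next state rather than the action actually taken, which makes it an off-policy method. A minimal tabular sketch; the `env` interface (`reset`, `step`, `states`, `actions`) is an illustrative assumption:

```python
import random

def q_learning(env, n_episodes, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-Learning: the target uses max_a' Q(x', a') regardless
    of which action the behaviour policy actually selects next."""
    rng = random.Random(seed)
    Q = {(x, a): 0.0 for x in env.states for a in env.actions}
    for _ in range(n_episodes):
        x, done = env.reset(), False
        while not done:
            # epsilon-greedy behaviour policy
            if rng.random() < epsilon:
                a = rng.choice(env.actions)
            else:
                a = max(env.actions, key=lambda b: Q[(x, b)])
            x2, r, done = env.step(x, a)
            # Off-policy target: bootstrap from the greedy next action.
            target = r if done else r + gamma * max(Q[(x2, b)] for b in env.actions)
            Q[(x, a)] += alpha * (target - Q[(x, a)])
            x = x2
    return Q
```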
9.5 Further Issues
Besides the stability concerns when using LCS to perform RL, there are still some further issues to consider, two of which will be discussed in this section:
The learning of long paths, and
How to best handle the explore/exploit dilemma.
9.5.1 Long Path Learning
The problem of long path learning is to find the optimal policy in sequential decision tasks when the solution requires learning of action sequences of substantial length.
While a solution was proposed to handle this problem [12], it was only designed to work for a particular problem class, as will be shown after discussing how XCS fails at long path learning. The classifier set optimality criterion from Chap. 7 might provide better results, but in general, long path learning remains an open problem.
Long path learning is not only an issue for LCS, but for approximate DP and RL in general.
XCS and Long Path Learning
Consider the problem shown in Fig. 9.2. The aim is to find the policy that reaches the terminal state x6 from the initial state x1a in the shortest number of steps.
In RL terms, this aim is described by giving a reward of 1 upon reaching the terminal state, and a reward of 0 for all other transitions. The optimal policy is to alternately choose actions 0 and 1, starting with action 1 in state x1a.
The optimal value function V over the number of steps to the terminal state is shown in Fig. 9.3(a) for a 15-step corridor finite state world. As can be seen, the difference of the values of V between two adjacent states decreases with the distance from the terminal state.
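This decay can be checked directly: with a single reward of 1 at the terminal state and a discount factor below 1, the optimal value n steps before the terminal state is proportional to gamma**n, so the gap between adjacent states shrinks geometrically (gamma = 0.9 here is an illustrative choice, not a value from the text):

```python
gamma = 0.9  # illustrative discount factor
# Optimal value n steps before the terminal state: V(n) = gamma**n
values = [gamma**n for n in range(16)]
gaps = [values[n] - values[n + 1] for n in range(15)]
# The gap between adjacent states, gamma**n * (1 - gamma),
# decreases with the distance n from the terminal state.
assert all(gaps[n] > gaps[n + 1] for n in range(14))
```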
Using the Relative Error
Barry proposed two preliminary approaches to handle the problem of long path learning in XCS, both based on making the error calculation of a classifier relative to its prediction of the value function [12].
The first approach is to estimate the distance of the matched states to the terminal state and scale the error accordingly, but this approach suffers from the inaccuracy of predicting this distance.
A second, more promising alternative proposed in his study is to scale the measured prediction error by the inverse absolute magnitude of the prediction. The underlying assumption is that the difference in optimal values between two successive states is proportional to the absolute magnitude of these values.
A Possible Alternative?
It was shown in Sect. 8.3.4 that the optimality criterion that was introduced in Chap. 7 is able to handle problems where the noise differs in different areas of the input space. Given that it is possible to use this criterion in an incremental implementation, will such an implementation be able to perform long path learning?
Let us assume that the optimality criterion causes the size of the area of the input space that is matched by a classifier to be proportional to the level of noise in the data, such that the model is refined in areas where the observations are known to accurately represent the data-generating process.

Considering only measurement noise, when applied to value function approximation this would lead to having more specific classifiers in states where the difference in magnitude of the value function for successive states is low, as in such areas this noise is deemed to be low.

Therefore, the optimality criterion should provide an adequate approximation of the optimal value function, even in cases where long action sequences need to be represented.
9.5.2 Exploration and Exploitation
Maintaining the balance between exploiting current knowledge to guide action selection and exploring the state space to gain new knowledge is an essential problem for reinforcement learning.
Too much exploration implies the frequent selection of sub-optimal actions and causes the accumulated reward to decrease.
Too much emphasis on exploitation of current knowledge, on the other hand, might cause the agent to settle on a sub-optimal policy due to insufficient knowledge of the reward distribution [228, 209]. Keeping a good balance is important, as it has a significant impact on the performance of RL methods.
There are several approaches to handling exploration and exploitation: one can choose a sub-optimal action every now and then, independent of the certainty of the available knowledge,
Or one can take this certainty into account to choose actions that increase it. A variant of the latter is to use Bayesian statistics to model this uncertainty, which seems the most elegant solution but is unfortunately also the least tractable.
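The first of these approaches, occasionally choosing a possibly sub-optimal action regardless of certainty, is commonly implemented as an ε-greedy policy. A generic sketch, not specific to any LCS implementation:

```python
import random

def epsilon_greedy(Q, state, actions, epsilon, rng=random):
    """With probability epsilon, explore by picking an action uniformly
    at random; otherwise exploit the action with the highest current
    action-value estimate Q[(state, action)]."""
    if rng.random() < epsilon:
        return rng.choice(list(actions))
    return max(actions, key=lambda a: Q[(state, a)])
```

Certainty-aware schemes would instead bias selection toward actions whose value estimates are uncertain, e.g. by maintaining a posterior over Q-values in the Bayesian variant mentioned above.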
9.6 Summary
Despite sequential decision tasks being the prime motivator for LCS, they are still the ones which LCS handle least successfully. This chapter provides a primer on how to use dynamic programming and reinforcement learning to handle such tasks, and on how LCS can be combined with either approach from first principles.
An essential part of the LCS type discussed in this book is that classifiers are trained independently. This is not completely true when using LCS with reinforcement learning, as the target values that the classifiers are trained on are based on the global prediction, which is formed by all matching classifiers in combination. In that sense, classifiers interact when forming their action-value function estimates. Still, besides combining classifier predictions to form the target values, independent classifier training still forms the basis of this model type, even when used in combination with RL.
Overall, using LCS to approximate the value or action-value function in RL is appealing, as LCS dynamically adjust to the form of this function and thus might provide a better approximation than standard function approximation techniques. It should be noted, however, that the field of RL is moving quickly, and that Q-Learning is by far not the best method currently available. Hence, in order for LCS to be a competitive approach to sequential decision tasks, they also need to keep pace with new developments in RL, some of which were discussed when detailing the exploration/exploitation dilemma that is an essential component of RL.
In summary, it is obvious that there is still plenty of work to be done until LCS can provide the same formal development as RL currently does. Nonetheless, the initial formal basis is provided in this chapter, upon which other research can build further analysis and improvements to how LCS handle sequential decision tasks effectively, competitively, and with high reliability.