Corrections to Reinforcement Learning and Dynamic Programming Using Function Approximators
Lucian Busoniu, Robert Babuska, Bart De Schutter, and Damien Ernst
Taylor & Francis CRC Press 2010
Updated July 8, 2011
Figure 1.3 (page 3)
The figure omits the goal and obstacle mentioned in the text (these elements are
shown later, in Figure 1.5). The correct figure is given here:
[Corrected Figure 1.3: the navigation example, showing the state (position) x_k, the action (step) u_k, the next state x_{k+1}, the reward r_{k+1}, and the goal.]
Example 2.3 (page 26) and Example 2.4 (page 34)
The results in Table 2.2 are not obtained with the version of Q-iteration from Algorithm 2.1, as stated in the text, but with an asynchronous version that employs the most recently updated Q-values at each step of the computation. This version replaces lines 3–5 of Algorithm 2.1 with the procedure:
Q ← Qℓ
for every (x, u) do
  Q(x, u) ← ρ(x, u) + γ max_{u'} Q(f(x, u), u')
end for
Qℓ+1 ← Q
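For concreteness, a short Python sketch of this asynchronous sweep is given below, next to the synchronous sweep of Algorithm 2.1, assuming a finite, deterministic MDP with Q stored as a dictionary indexed by (state, action) pairs; the names states, actions, f, rho, and gamma are placeholders for the problem data, not the book's implementation.

def asynchronous_sweep(Q, states, actions, f, rho, gamma):
    # Update Q in place: later updates in the same sweep already see
    # the most recently updated Q-values.
    for x in states:
        for u in actions:
            x_next = f(x, u)
            Q[(x, u)] = rho(x, u) + gamma * max(Q[(x_next, u2)] for u2 in actions)
    return Q

def synchronous_sweep(Q, states, actions, f, rho, gamma):
    # The sweep of Algorithm 2.1 as printed: every update uses the frozen
    # previous iterate Q_ell.
    Q_prev = dict(Q)
    for x in states:
        for u in actions:
            x_next = f(x, u)
            Q[(x, u)] = rho(x, u) + gamma * max(Q_prev[(x_next, u2)] for u2 in actions)
    return Q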
Similarly, Tables 2.3, 2.5, and 2.6 are produced using asynchronous versions of, respectively, Algorithms 2.2, 2.5, and 2.6. These versions are easily obtained by modifications similar to the one above, and are not given here. The computational cost considerations and comparisons in the examples remain valid, but apply to the asynchronous algorithm variants. (Since the synchronous variants given in the book are less efficient, they would require a larger number of iterations.)
Equation 3.50 (page 89): Qhℓ should be Qhℓ
Section 4.5.4 (page 160): Car on the hill example
The terminal states were incorrectly handled in this example. In particular, the usual
fuzzy Q-iteration updates:
\[
\theta_{\ell+1,[i,j]} \leftarrow \rho(x_i, u_j) + \gamma \max_{j'} \sum_{i'=1}^{N} \phi_{i'}\bigl(f(x_i, u_j)\bigr)\, \theta_{\ell,[i',j']}
\]
were applied even when the next state f(x_i, u_j) was terminal, i.e., outside the domain [−1, 1] × [−3, 3]. This, however, corresponds to incorrectly assigning non-zero rewards to the terminal states. These rewards then propagate through the updates and lead to overly large Q-values. The correct way to perform the updates is to explicitly enforce a zero Q-value in any terminal state:
\[
\theta_{\ell+1,[i,j]} \leftarrow \rho(x_i, u_j) +
\begin{cases}
\gamma \max_{j'} \sum_{i'=1}^{N} \phi_{i'}\bigl(f(x_i, u_j)\bigr)\, \theta_{\ell,[i',j']} & \text{if } f(x_i, u_j) \text{ is non-terminal} \\
0 & \text{if } f(x_i, u_j) \text{ is terminal}
\end{cases}
\]
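As an illustration only, a minimal Python sketch of this corrected sweep is given below; the membership function vector phi, the dynamics f, the reward rho, the sample sets, and the is_terminal test (true outside [−1, 1] × [−3, 3]) are assumed placeholders rather than the book's code.

import numpy as np

def fuzzy_q_iteration_sweep(theta, x_cores, u_samples, f, rho, phi, is_terminal, gamma):
    # theta: (N, M) array, one parameter per (membership function, discrete action) pair.
    # phi(x): length-N vector of membership degrees of state x.
    # x_cores: the N states x_i at the cores of the membership functions.
    theta_next = np.empty_like(theta)
    for i, x in enumerate(x_cores):
        for j, u in enumerate(u_samples):
            x_next = f(x, u)
            if is_terminal(x_next):
                # Zero Q-value enforced in the terminal state: only the reward remains.
                theta_next[i, j] = rho(x, u)
            else:
                # Approximate Q(x_next, u') for every discrete action u',
                # then take the greedy maximum.
                q_next = phi(x_next) @ theta  # shape (M,)
                theta_next[i, j] = rho(x, u) + gamma * np.max(q_next)
    return theta_next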
The following results change due to this modification. Figure 4.14(b) changes to:
[Corrected Figure 4.14(b): the slice Q(p, p', −4) of the Q-function, plotted over the position p and the speed p'.]
Figure 4.15 changes to:
[Corrected Figure 4.15: (a) Performance: the score as a function of N', showing the mean score and 95% confidence bounds for optimized MFs, the score for equidistant MFs, and the optimal score. (b) Execution time: the execution time [s] as a function of N', showing the mean execution time and 95% confidence bounds for optimized MFs, and the execution time for equidistant MFs.]
Figure 4.16 changes to:
[Corrected Figure 4.16: the membership functions φ(p, p'), plotted over the position p and the speed p'.]
The original discussion remains entirely valid, and these results are qualitatively similar to the original ones. This is because the policies computed by fuzzy Q-iteration remain almost unaffected by the change.
Page 196, line 5: γ = 0.95 should be γ = 0.98
Note that the policy in Figure 5.13(b) is near-optimal for γ = 0.95. Nevertheless, the
corresponding near-optimal policy for γ = 0.98:
[Figure: the near-optimal policy h(α, α') [V] for γ = 0.98, plotted over the angle α [rad] and the angular velocity α' [rad/s].]
has the same structure, and the considerations after Figure 5.13 remain valid.