![Page 1: PowerPoint 簡報speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2018/Lecture/QLearning (v2).pdf · Title PowerPoint 簡報 Author Hung-yi Lee Created Date 6/13/2018 8:45:50 PM](https://reader036.vdocuments.us/reader036/viewer/2022071103/5fdce055b0887d526c2a6213/html5/thumbnails/1.jpg)
Q-Learning
Hung-yi Lee
![Page 2: PowerPoint 簡報speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2018/Lecture/QLearning (v2).pdf · Title PowerPoint 簡報 Author Hung-yi Lee Created Date 6/13/2018 8:45:50 PM](https://reader036.vdocuments.us/reader036/viewer/2022071103/5fdce055b0887d526c2a6213/html5/thumbnails/2.jpg)
Outline
Introduction of Q-Learning
Tips of Q-Learning
Q-Learning for Continuous Actions
![Page 3: PowerPoint 簡報speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2018/Lecture/QLearning (v2).pdf · Title PowerPoint 簡報 Author Hung-yi Lee Created Date 6/13/2018 8:45:50 PM](https://reader036.vdocuments.us/reader036/viewer/2022071103/5fdce055b0887d526c2a6213/html5/thumbnails/3.jpg)
Critic
• A critic does not directly determine the action.
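• Given an actor π, it evaluates how good the actor is.
• State value function $V^\pi(s)$
• When using actor π, $V^\pi(s)$ is the cumulative reward expected to be obtained after visiting state s.
[Figure: the critic $V^\pi$ takes a state s as input and outputs the scalar $V^\pi(s)$. Two example states are shown: one where $V^\pi(s)$ is large and one where $V^\pi(s)$ is smaller.]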
The output values of a critic depend on the actor evaluated.
![Page 4: PowerPoint 簡報speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2018/Lecture/QLearning (v2).pdf · Title PowerPoint 簡報 Author Hung-yi Lee Created Date 6/13/2018 8:45:50 PM](https://reader036.vdocuments.us/reader036/viewer/2022071103/5fdce055b0887d526c2a6213/html5/thumbnails/4.jpg)
Critic
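$V^{\text{Hikaru in the past}}(\text{large knight's move}) = \text{bad}$
$V^{\text{Hikaru after getting stronger}}(\text{large knight's move}) = \text{good}$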
![Page 5: PowerPoint 簡報speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2018/Lecture/QLearning (v2).pdf · Title PowerPoint 簡報 Author Hung-yi Lee Created Date 6/13/2018 8:45:50 PM](https://reader036.vdocuments.us/reader036/viewer/2022071103/5fdce055b0887d526c2a6213/html5/thumbnails/5.jpg)
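How to estimate $V^\pi(s)$
• Monte-Carlo (MC) based approach
• The critic watches π playing the game.
After seeing $s_a$, the cumulated reward until the end of the episode is $G_a$.
After seeing $s_b$, the cumulated reward until the end of the episode is $G_b$.
[Figure: the network $V^\pi$ takes $s_a$ as input and its output $V^\pi(s_a)$ is trained toward $G_a$; likewise $V^\pi(s_b)$ is trained toward $G_b$.]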
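The Monte-Carlo idea can be sketched with a lookup table in place of the network. This is only an illustration: the function name `mc_value_estimate`, the `(state, reward)` episode format, and the use of plain averaging (the least-squares fit for a table) are assumptions, not part of the slides.

```python
from collections import defaultdict

def mc_value_estimate(episodes):
    """Every-visit Monte-Carlo estimate of V(s) with a table (assumed setup).

    Each episode is a list of (state, reward) pairs, where `reward` is
    received after leaving `state`. No discounting, as on the slide.
    """
    returns = defaultdict(list)
    for episode in episodes:
        g = 0.0
        # Walk backwards so g is the cumulated reward until the end of the episode.
        for state, reward in reversed(episode):
            g += reward
            returns[state].append(g)
    # Averaging the observed returns G is the regression target for each state.
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}

# Hypothetical data: two short episodes.
episodes = [[("s_a", 1.0), ("s_b", 2.0)], [("s_b", 0.0)]]
values = mc_value_estimate(episodes)
```

Here `values["s_a"]` averages the single return seen after `s_a`, and `values["s_b"]` averages the two returns seen after `s_b`.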
![Page 6: PowerPoint 簡報speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2018/Lecture/QLearning (v2).pdf · Title PowerPoint 簡報 Author Hung-yi Lee Created Date 6/13/2018 8:45:50 PM](https://reader036.vdocuments.us/reader036/viewer/2022071103/5fdce055b0887d526c2a6213/html5/thumbnails/6.jpg)
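How to estimate $V^\pi(s)$
• Temporal-difference (TD) approach
$\cdots s_t, a_t, r_t, s_{t+1} \cdots$
$V^\pi(s_t) = V^\pi(s_{t+1}) + r_t$
[Figure: the same network $V^\pi$ is applied to $s_t$ and $s_{t+1}$; the difference $V^\pi(s_t) - V^\pi(s_{t+1})$ is trained toward $r_t$.]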
Some applications have very long episodes, so that delaying all learning until an episode's end is too slow.
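A minimal tabular sketch of one TD update, assuming a dictionary `V` in place of the network and a step size `alpha` (both are illustrative choices, not from the slides). Each transition $(s_t, r_t, s_{t+1})$ can be used immediately, without waiting for the episode to end.

```python
def td0_update(V, s, r, s_next, alpha=0.1, terminal=False):
    """Move V(s) toward the one-step target r + V(s_next) (no discounting)."""
    target = r + (0.0 if terminal else V.get(s_next, 0.0))
    old = V.get(s, 0.0)
    V[s] = old + alpha * (target - old)
    return V

# Hypothetical values for two states.
V = {"s1": 0.0, "s2": 1.0}
td0_update(V, "s1", r=0.5, s_next="s2", alpha=0.5)
```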
![Page 7: PowerPoint 簡報speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2018/Lecture/QLearning (v2).pdf · Title PowerPoint 簡報 Author Hung-yi Lee Created Date 6/13/2018 8:45:50 PM](https://reader036.vdocuments.us/reader036/viewer/2022071103/5fdce055b0887d526c2a6213/html5/thumbnails/7.jpg)
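MC vs. TD
MC: $V^\pi(s_a) \leftrightarrow G_a$. Larger variance, because $G_a$ is the summation of many steps.
TD: $V^\pi(s_t) \leftrightarrow r + V^\pi(s_{t+1})$. Smaller variance, but $V^\pi(s_{t+1})$ may be inaccurate.
$\mathrm{Var}[kX] = k^2\,\mathrm{Var}[X]$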
![Page 8: PowerPoint 簡報speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2018/Lecture/QLearning (v2).pdf · Title PowerPoint 簡報 Author Hung-yi Lee Created Date 6/13/2018 8:45:50 PM](https://reader036.vdocuments.us/reader036/viewer/2022071103/5fdce055b0887d526c2a6213/html5/thumbnails/8.jpg)
MC vs. TD
• The critic has the following 8 episodes:
• $s_a, r = 0, s_b, r = 0$, END
• $s_b, r = 1$, END
• $s_b, r = 1$, END
• $s_b, r = 1$, END
• $s_b, r = 1$, END
• $s_b, r = 1$, END
• $s_b, r = 1$, END
• $s_b, r = 0$, END
[Sutton, v2, Example 6.4]
(The actions are ignored here.)
$V^\pi(s_b) = 3/4$
$V^\pi(s_a) = ?$ 0? 3/4?
Monte-Carlo: $V^\pi(s_a) = 0$
Temporal-difference: $V^\pi(s_a) = V^\pi(s_b) + r = 3/4 + 0 = 3/4$
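The two answers can be checked numerically. The sketch below encodes the 8 episodes as lists of `(state, reward)` pairs (an assumed encoding) and uses exact fractions.

```python
from fractions import Fraction

# The 8 episodes from the slide: one (s_a, 0), (s_b, 0); six (s_b, 1); one (s_b, 0).
episodes = [[("s_a", 0), ("s_b", 0)]] + [[("s_b", 1)]] * 6 + [[("s_b", 0)]]

def mc_estimate(state):
    """Average of the returns observed after each visit to `state`."""
    returns = []
    for ep in episodes:
        for i, (s, _) in enumerate(ep):
            if s == state:
                returns.append(sum(r for _, r in ep[i:]))
    return Fraction(sum(returns), len(returns))

v_b = mc_estimate("s_b")     # 6 of the 8 visits to s_b are followed by reward 1
v_a_mc = mc_estimate("s_a")  # the only episode through s_a has total reward 0
v_a_td = 0 + v_b             # TD: V(s_a) = r + V(s_b), with r = 0 after s_a
```

Both estimates are consistent with the data; they differ because TD assumes the value of $s_b$ does not depend on whether $s_a$ came first.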
![Page 9: PowerPoint 簡報speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2018/Lecture/QLearning (v2).pdf · Title PowerPoint 簡報 Author Hung-yi Lee Created Date 6/13/2018 8:45:50 PM](https://reader036.vdocuments.us/reader036/viewer/2022071103/5fdce055b0887d526c2a6213/html5/thumbnails/9.jpg)
Another Critic
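• State-action value function $Q^\pi(s, a)$
• When using actor π, $Q^\pi(s, a)$ is the cumulative reward expected to be obtained after taking action a at state s.
[Figure: two network designs. In one, $Q^\pi$ takes s and a as input and outputs the scalar $Q^\pi(s, a)$. In the other, $Q^\pi$ takes only s as input and outputs $Q^\pi(s, a = \text{left})$, $Q^\pi(s, a = \text{right})$ and $Q^\pi(s, a = \text{fire})$; this design is for discrete actions only.]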
![Page 10: PowerPoint 簡報speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2018/Lecture/QLearning (v2).pdf · Title PowerPoint 簡報 Author Hung-yi Lee Created Date 6/13/2018 8:45:50 PM](https://reader036.vdocuments.us/reader036/viewer/2022071103/5fdce055b0887d526c2a6213/html5/thumbnails/10.jpg)
State-action value function
https://web.stanford.edu/class/psych209/Readings/MnihEtAlHassibis15NatureControlDeepRL.pdf
![Page 11: PowerPoint 簡報speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2018/Lecture/QLearning (v2).pdf · Title PowerPoint 簡報 Author Hung-yi Lee Created Date 6/13/2018 8:45:50 PM](https://reader036.vdocuments.us/reader036/viewer/2022071103/5fdce055b0887d526c2a6213/html5/thumbnails/11.jpg)
Another Way to use Critic: Q-Learning
[Figure: a loop. π interacts with the environment → learning $Q^\pi(s, a)$ (TD or MC) → find a new actor π′ "better" than π (how?) → set π = π′ and repeat.]
![Page 12: PowerPoint 簡報speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2018/Lecture/QLearning (v2).pdf · Title PowerPoint 簡報 Author Hung-yi Lee Created Date 6/13/2018 8:45:50 PM](https://reader036.vdocuments.us/reader036/viewer/2022071103/5fdce055b0887d526c2a6213/html5/thumbnails/12.jpg)
Q-Learning
• Given $Q^\pi(s, a)$, find a new actor π′ "better" than π
• "Better": $V^{\pi'}(s) \ge V^\pi(s)$ for all states s
$\pi'(s) = \arg\max_a Q^\pi(s, a)$
➢𝜋′ does not have extra parameters. It depends on Q
➢Not suitable for continuous action a (solve it later)
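As a small sketch of why π′ needs no parameters of its own: given any Q (here a hypothetical dictionary keyed by `(state, action)`), the new actor is just an argmax over a finite action list.

```python
def greedy_policy(Q, actions):
    """Return pi'(s) = argmax_a Q(s, a); pi' is fully determined by Q."""
    def pi_prime(s):
        return max(actions, key=lambda a: Q[(s, a)])
    return pi_prime

# Hypothetical Q values for one state and three discrete actions.
Q = {("s", "left"): 0.2, ("s", "right"): 0.7, ("s", "fire"): 0.5}
pi_prime = greedy_policy(Q, ["left", "right", "fire"])
```

The explicit loop over `actions` is also why this form does not carry over directly to continuous actions.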
![Page 13: PowerPoint 簡報speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2018/Lecture/QLearning (v2).pdf · Title PowerPoint 簡報 Author Hung-yi Lee Created Date 6/13/2018 8:45:50 PM](https://reader036.vdocuments.us/reader036/viewer/2022071103/5fdce055b0887d526c2a6213/html5/thumbnails/13.jpg)
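Q-Learning
$\pi'(s) = \arg\max_a Q^\pi(s, a)$
Claim: $V^{\pi'}(s) \ge V^\pi(s)$ for all states s.
$V^\pi(s) = Q^\pi(s, \pi(s)) \le \max_a Q^\pi(s, a) = Q^\pi(s, \pi'(s))$
$V^\pi(s) \le Q^\pi(s, \pi'(s))$
$= E[r_{t+1} + V^\pi(s_{t+1}) \mid s_t = s, a_t = \pi'(s_t)]$
$\le E[r_{t+1} + Q^\pi(s_{t+1}, \pi'(s_{t+1})) \mid s_t = s, a_t = \pi'(s_t)]$
$= E[r_{t+1} + r_{t+2} + V^\pi(s_{t+2}) \mid \dots]$
$\le E[r_{t+1} + r_{t+2} + Q^\pi(s_{t+2}, \pi'(s_{t+2})) \mid \dots]$
$\le \dots \le V^{\pi'}(s)$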
![Page 14: PowerPoint 簡報speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2018/Lecture/QLearning (v2).pdf · Title PowerPoint 簡報 Author Hung-yi Lee Created Date 6/13/2018 8:45:50 PM](https://reader036.vdocuments.us/reader036/viewer/2022071103/5fdce055b0887d526c2a6213/html5/thumbnails/14.jpg)
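Target Network
$\cdots s_t, a_t, r_t, s_{t+1} \cdots$
$Q^\pi(s_t, a_t) = r_t + Q^\pi(s_{t+1}, \pi(s_{t+1}))$
[Figure: two copies of the network $Q^\pi$. One takes $(s_t, a_t)$ and outputs $Q^\pi(s_t, a_t)$; its parameters are updated by regression. The other, the target network, takes $(s_{t+1}, \pi(s_{t+1}))$ and is kept fixed, so the regression target $r_t + Q^\pi(s_{t+1}, \pi(s_{t+1}))$ is a fixed value. After updating N times, the target network is replaced by the updated network.]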
![Page 15: PowerPoint 簡報speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2018/Lecture/QLearning (v2).pdf · Title PowerPoint 簡報 Author Hung-yi Lee Created Date 6/13/2018 8:45:50 PM](https://reader036.vdocuments.us/reader036/viewer/2022071103/5fdce055b0887d526c2a6213/html5/thumbnails/15.jpg)
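Exploration
• The policy is based on the Q-function:
$a = \arg\max_a Q(s, a)$
This is not a good way for data collection.
[Figure: a state s with three actions $a_1, a_2, a_3$. Once one action has $Q(s, a) = 1$ it is always sampled; the other two actions keep $Q(s, a) = 0$ and are never explored.]
Epsilon Greedy
$a = \begin{cases} \arg\max_a Q(s, a), & \text{with probability } 1 - \varepsilon \\ \text{random}, & \text{otherwise} \end{cases}$
ε would decay during learning.
Boltzmann Exploration
$P(a \mid s) = \dfrac{\exp(Q(s, a))}{\sum_a \exp(Q(s, a))}$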
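Both exploration rules are short to write down. The sketch below assumes the Q values for the current state are given as a list indexed by action; the function names are illustrative. Subtracting the maximum before exponentiating is a standard numerical-stability step and does not change the probabilities.

```python
import math
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability 1 - epsilon take argmax_a Q(s, a), otherwise a random action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def boltzmann_probs(q_values):
    """P(a|s) = exp(Q(s,a)) / sum_a exp(Q(s,a))."""
    m = max(q_values)
    exps = [math.exp(q - m) for q in q_values]
    z = sum(exps)
    return [e / z for e in exps]

def boltzmann_sample(q_values, rng=random):
    """Sample an action index according to the Boltzmann probabilities."""
    weights = boltzmann_probs(q_values)
    return rng.choices(range(len(q_values)), weights=weights)[0]

probs = boltzmann_probs([0.0, 1.0, 0.0])
```

With `q_values = [0, 1, 0]`, pure greedy selection always returns action 1, while both rules above still give the other two actions a chance of being tried.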
![Page 16: PowerPoint 簡報speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2018/Lecture/QLearning (v2).pdf · Title PowerPoint 簡報 Author Hung-yi Lee Created Date 6/13/2018 8:45:50 PM](https://reader036.vdocuments.us/reader036/viewer/2022071103/5fdce055b0887d526c2a6213/html5/thumbnails/16.jpg)
Replay Buffer
[Figure: the loop "π interacts with the environment → learning $Q^\pi(s, a)$ → find a new actor π′ "better" than π → π = π′", with a buffer that stores the experiences (exp) collected during interaction.]
Each experience is a tuple $(s_t, a_t, r_t, s_{t+1})$.
Put the experience into buffer.
The experience in the buffer comes from different policies.
Drop the old experience if the buffer is full.
![Page 17: PowerPoint 簡報speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2018/Lecture/QLearning (v2).pdf · Title PowerPoint 簡報 Author Hung-yi Lee Created Date 6/13/2018 8:45:50 PM](https://reader036.vdocuments.us/reader036/viewer/2022071103/5fdce055b0887d526c2a6213/html5/thumbnails/17.jpg)
Replay Buffer
[Figure: the same loop as on the previous slide, with the buffer of experiences (exp).]
Put the experience $(s_t, a_t, r_t, s_{t+1})$ into buffer.
In each iteration:
1. Sample a batch
2. Update Q-function
Off-policy
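A replay buffer of this kind can be sketched with a bounded deque. The class name `ReplayBuffer` and its method names are illustrative assumptions; `deque(maxlen=...)` drops the oldest entry automatically once the buffer is full.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of (s_t, a_t, r_t, s_{t+1}) experiences."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # old experience is dropped when full

    def put(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size, rng=random):
        # Uniform sampling; the batch may mix experience from different policies.
        k = min(batch_size, len(self.buffer))
        return rng.sample(list(self.buffer), k)

    def __len__(self):
        return len(self.buffer)

# Usage with hypothetical integer states.
buf = ReplayBuffer(capacity=3)
for t in range(5):
    buf.put(t, 0, 0.0, t + 1)
batch = buf.sample(2)
```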
![Page 18: PowerPoint 簡報speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2018/Lecture/QLearning (v2).pdf · Title PowerPoint 簡報 Author Hung-yi Lee Created Date 6/13/2018 8:45:50 PM](https://reader036.vdocuments.us/reader036/viewer/2022071103/5fdce055b0887d526c2a6213/html5/thumbnails/18.jpg)
Typical Q-Learning Algorithm
• Initialize Q-function $Q$, target Q-function $\hat{Q} = Q$
• In each episode
• For each time step t
• Given state $s_t$, take action $a_t$ based on $Q$ (epsilon greedy)
• Obtain reward $r_t$, and reach new state $s_{t+1}$
• Store $(s_t, a_t, r_t, s_{t+1})$ into buffer
• Sample $(s_i, a_i, r_i, s_{i+1})$ from buffer (usually a batch)
• Target $y = r_i + \max_a \hat{Q}(s_{i+1}, a)$
• Update the parameters of $Q$ to make $Q(s_i, a_i)$ close to $y$ (regression)
• Every C steps reset $\hat{Q} = Q$
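The loop above can be sketched end to end in a tabular setting. Everything environment-specific here is an assumption made for illustration: a hypothetical 4-state chain where every move costs -1 and reaching the last state ends the episode, a dictionary in place of the Q network, a step size `alpha` in place of gradient-based regression, and a cap of 50 steps per episode.

```python
import random
from collections import deque

N_STATES, GOAL, ACTIONS = 4, 3, (0, 1)  # hypothetical chain: 0 = left, 1 = right

def step(s, a):
    """Toy environment (assumption): each move costs -1; reaching GOAL ends the episode."""
    s_next = max(s - 1, 0) if a == 0 else s + 1
    return -1.0, s_next, s_next == GOAL

def train(episodes=500, epsilon=0.2, alpha=0.5, sync_every=10, batch_size=8, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    Q_hat = dict(Q)                  # target Q-function, initialised as a copy of Q
    buffer = deque(maxlen=1000)      # old experience is dropped when full
    steps = 0
    for _ in range(episodes):
        s = 0
        for _ in range(50):          # cap on episode length (assumption)
            # Take an action based on Q (epsilon greedy).
            if rng.random() < epsilon:
                a = rng.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda act: Q[(s, act)])
            r, s_next, done = step(s, a)
            buffer.append((s, a, r, s_next, done))
            # Sample a batch and move Q(s_i, a_i) toward y = r_i + max_a Q_hat(s_{i+1}, a).
            batch = rng.sample(list(buffer), min(batch_size, len(buffer)))
            for si, ai, ri, sn, d in batch:
                y = ri if d else ri + max(Q_hat[(sn, act)] for act in ACTIONS)
                Q[(si, ai)] += alpha * (y - Q[(si, ai)])
            steps += 1
            if steps % sync_every == 0:
                Q_hat = dict(Q)      # every C steps reset Q_hat = Q
            s = s_next
            if done:
                break
    return Q

Q = train()
greedy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(GOAL)}
```

In this toy problem the learned greedy policy should move right from every non-goal state, since that is the shortest path to the goal.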
![Page 19: PowerPoint 簡報speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2018/Lecture/QLearning (v2).pdf · Title PowerPoint 簡報 Author Hung-yi Lee Created Date 6/13/2018 8:45:50 PM](https://reader036.vdocuments.us/reader036/viewer/2022071103/5fdce055b0887d526c2a6213/html5/thumbnails/19.jpg)
Outline
Introduction of Q-Learning
Tips of Q-Learning
Q-Learning for Continuous Actions
![Page 20: PowerPoint 簡報speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2018/Lecture/QLearning (v2).pdf · Title PowerPoint 簡報 Author Hung-yi Lee Created Date 6/13/2018 8:45:50 PM](https://reader036.vdocuments.us/reader036/viewer/2022071103/5fdce055b0887d526c2a6213/html5/thumbnails/20.jpg)
Double DQN
• Q value is usually over-estimated
![Page 21: PowerPoint 簡報speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2018/Lecture/QLearning (v2).pdf · Title PowerPoint 簡報 Author Hung-yi Lee Created Date 6/13/2018 8:45:50 PM](https://reader036.vdocuments.us/reader036/viewer/2022071103/5fdce055b0887d526c2a6213/html5/thumbnails/21.jpg)
Double DQN
• Q value is usually over-estimated
$Q(s_t, a_t) \leftrightarrow r_t + \max_a Q(s_{t+1}, a)$
The max tends to select the action that is over-estimated.
[Figure: the values $Q(s_{t+1}, a)$ for each action.]
![Page 22: PowerPoint 簡報speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2018/Lecture/QLearning (v2).pdf · Title PowerPoint 簡報 Author Hung-yi Lee Created Date 6/13/2018 8:45:50 PM](https://reader036.vdocuments.us/reader036/viewer/2022071103/5fdce055b0887d526c2a6213/html5/thumbnails/22.jpg)
Double DQN
• Q value is usually over-estimated
• Double DQN: two functions Q and Q′
DQN: $Q(s_t, a_t) \leftrightarrow r_t + \max_a Q(s_{t+1}, a)$
Double DQN: $Q(s_t, a_t) \leftrightarrow r_t + Q'(s_{t+1}, \arg\max_a Q(s_{t+1}, a))$
If Q over-estimates an action a, that action is selected, but Q′ would give it a proper value.
What if Q′ over-estimates an action? That action will not be selected by Q.
Q′ is the target network.
Hado V. Hasselt, "Double Q-learning", NIPS 2010
Hado van Hasselt, Arthur Guez, David Silver, "Deep Reinforcement Learning with Double Q-learning", AAAI 2016
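The difference between the two targets is one line. In this sketch `q_next` and `q_prime_next` are hypothetical lists holding $Q(s_{t+1}, a)$ and $Q'(s_{t+1}, a)$ for each discrete action.

```python
def dqn_target(r, q_next):
    """Ordinary target: r_t + max_a Q(s_{t+1}, a)."""
    return r + max(q_next)

def double_dqn_target(r, q_next, q_prime_next):
    """Double DQN target: Q selects the action, Q' evaluates it."""
    a_star = max(range(len(q_next)), key=lambda a: q_next[a])
    return r + q_prime_next[a_star]

# Hypothetical values in which Q over-estimates action 1.
q_next = [1.0, 5.0, 2.0]
q_prime_next = [1.2, 2.0, 2.1]
y_dqn = dqn_target(0.0, q_next)
y_double = double_dqn_target(0.0, q_next, q_prime_next)
```

Here the ordinary target inherits the over-estimate, while the Double DQN target uses the more moderate value that Q′ assigns to the selected action.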
![Page 23: PowerPoint 簡報speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2018/Lecture/QLearning (v2).pdf · Title PowerPoint 簡報 Author Hung-yi Lee Created Date 6/13/2018 8:45:50 PM](https://reader036.vdocuments.us/reader036/viewer/2022071103/5fdce055b0887d526c2a6213/html5/thumbnails/23.jpg)
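Dueling DQN
Ziyu Wang, Tom Schaul, Matteo Hessel, Hado van Hasselt, Marc Lanctot, Nando de Freitas, "Dueling Network Architectures for Deep Reinforcement Learning", arXiv preprint, 2015
[Figure: two networks that take states as input. The ordinary DQN outputs Q(s,a) directly. The dueling network outputs V(s) and A(s,a) separately and combines them as Q(s,a) = A(s,a) + V(s).]
Only change the network structure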
![Page 24: PowerPoint 簡報speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2018/Lecture/QLearning (v2).pdf · Title PowerPoint 簡報 Author Hung-yi Lee Created Date 6/13/2018 8:45:50 PM](https://reader036.vdocuments.us/reader036/viewer/2022071103/5fdce055b0887d526c2a6213/html5/thumbnails/24.jpg)
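Dueling DQN
Q(s,a) = V(s) + A(s,a), with one column per state and one row per action:

Q(s,a):
 3  3  3  1
 1 -1  6  1
 2 -2  3  1

V(s) (average of each column of Q):
 2  0  4  1

A(s,a) (sum of each column = 0):
 1  3 -1  0
-1 -1  2  0
 0 -2 -1  0

[Overlay: 0, 1, 4, -1. These match the second column after V(s) changes from 0 to 1: Q(s,a) becomes 4, 0, -1.]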
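The numbers on this slide can be checked directly. The sketch below stores the tables as nested lists (rows are actions, columns are states) and recombines them; `combine` is an illustrative helper name.

```python
V = [2, 0, 4, 1]                       # one value per state
A = [[1, 3, -1, 0],
     [-1, -1, 2, 0],
     [0, -2, -1, 0]]                   # rows: actions, columns: states

def combine(V, A):
    """Q(s, a) = V(s) + A(s, a), applied column by column."""
    return [[a + v for a, v in zip(row, V)] for row in A]

Q = combine(V, A)
column_sums_of_A = [sum(col) for col in zip(*A)]
column_means_of_Q = [sum(col) / len(col) for col in zip(*Q)]

# Changing only V(s) for the second state shifts every action's Q value there.
Q_shifted = combine([2, 1, 4, 1], A)
second_column = [row[1] for row in Q_shifted]
```

Because each column of A sums to zero, a change to a single entry of V(s) moves all actions of that state together, including ones that were not sampled.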
![Page 25: PowerPoint 簡報speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2018/Lecture/QLearning (v2).pdf · Title PowerPoint 簡報 Author Hung-yi Lee Created Date 6/13/2018 8:45:50 PM](https://reader036.vdocuments.us/reader036/viewer/2022071103/5fdce055b0887d526c2a6213/html5/thumbnails/25.jpg)
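Dueling DQN
Normalize A(s,a) before adding with V(s)
[Figure: example with V(s) = 1.0. The raw advantage outputs 7, 3, 2 are normalized to 3, -1, -2 (their mean, 4, is subtracted) before being added to V(s).]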
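A sketch of the normalization step, using the numbers from the slide (V(s) = 1.0 and raw advantages 7, 3, 2). Mean subtraction is assumed as the normalization; the function name `dueling_q` is illustrative.

```python
def dueling_q(v, raw_advantages):
    """Subtract the mean from A(s, .) so it sums to zero, then add V(s)."""
    mean = sum(raw_advantages) / len(raw_advantages)
    normalized = [a - mean for a in raw_advantages]
    return normalized, [v + a for a in normalized]

normalized_A, q_values = dueling_q(1.0, [7, 3, 2])
```

Forcing A(s, ·) to sum to zero prevents the network from ignoring V(s) and putting everything into the advantage output.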
![Page 26: PowerPoint 簡報speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2018/Lecture/QLearning (v2).pdf · Title PowerPoint 簡報 Author Hung-yi Lee Created Date 6/13/2018 8:45:50 PM](https://reader036.vdocuments.us/reader036/viewer/2022071103/5fdce055b0887d526c2a6213/html5/thumbnails/26.jpg)
Dueling DQN - Visualization
(from the link of the original paper)
![Page 27: PowerPoint 簡報speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2018/Lecture/QLearning (v2).pdf · Title PowerPoint 簡報 Author Hung-yi Lee Created Date 6/13/2018 8:45:50 PM](https://reader036.vdocuments.us/reader036/viewer/2022071103/5fdce055b0887d526c2a6213/html5/thumbnails/27.jpg)
Dueling DQN - Visualization
(from the link of the original paper)
![Page 28: PowerPoint 簡報speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2018/Lecture/QLearning (v2).pdf · Title PowerPoint 簡報 Author Hung-yi Lee Created Date 6/13/2018 8:45:50 PM](https://reader036.vdocuments.us/reader036/viewer/2022071103/5fdce055b0887d526c2a6213/html5/thumbnails/28.jpg)
Prioritized Replay
https://arxiv.org/abs/1511.05952?context=cs
Experience buffer: $(s_t, a_t, r_t, s_{t+1})$
[Figure: $Q(s_t, a_t) \leftrightarrow r_t + Q(s_{t+1}, a_{t+1})$, where $a_{t+1} = \arg\max_a Q(s_{t+1}, a)$. The gap between the two sides is the TD error.]
The data with larger TD error in previous training has higher probability to be sampled.
Parameter update procedure is also modified.
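A simplified sketch of the sampling side only: experiences are drawn with probability proportional to the magnitude of their last TD error. The small constant `eps` (so that zero-error items can still be drawn) and the function name are assumptions; the modified parameter update mentioned on the slide is not shown here.

```python
import random

def sample_prioritized(td_errors, batch_size, rng=random, eps=1e-3):
    """Return buffer indices, sampled in proportion to |TD error| + eps."""
    priorities = [abs(e) + eps for e in td_errors]
    return rng.choices(range(len(td_errors)), weights=priorities, k=batch_size)

# Hypothetical buffer of three experiences; the third had a large TD error.
indices = sample_prioritized([0.0, 0.0, 10.0], 1000, rng=random.Random(0))
```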
![Page 29: PowerPoint 簡報speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2018/Lecture/QLearning (v2).pdf · Title PowerPoint 簡報 Author Hung-yi Lee Created Date 6/13/2018 8:45:50 PM](https://reader036.vdocuments.us/reader036/viewer/2022071103/5fdce055b0887d526c2a6213/html5/thumbnails/29.jpg)
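Multi-step
Balance between MC and TD
One-step experience in the buffer: $(s_t, a_t, r_t, s_{t+1})$
Multi-step experience in the buffer: $(s_t, a_t, r_t, \cdots, s_{t+N}, a_{t+N}, r_{t+N}, s_{t+N+1})$
$Q(s_t, a_t) \leftrightarrow \sum_{t'=t}^{t+N} r_{t'} + Q(s_{t+N+1}, a_{t+N+1})$
$a_{t+N+1} = \arg\max_a Q(s_{t+N+1}, a)$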
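The multi-step target sums the observed rewards $r_t, \dots, r_{t+N}$ and then bootstraps from the Q value at $s_{t+N+1}$. In this sketch `rewards` and `q_next` are hypothetical lists, and no discounting is applied, as on the slide.

```python
def multi_step_target(rewards, q_next):
    """Sum of r_t..r_{t+N} plus max_a Q(s_{t+N+1}, a)."""
    return sum(rewards) + max(q_next)

# Hypothetical rewards for three steps and Q values at the state reached afterwards.
y = multi_step_target([1.0, 0.0, 2.0], [0.5, 1.5])
```

With a single reward this reduces to the one-step TD target; with rewards running to the end of the episode it approaches the MC return.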
![Page 30: PowerPoint 簡報speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2018/Lecture/QLearning (v2).pdf · Title PowerPoint 簡報 Author Hung-yi Lee Created Date 6/13/2018 8:45:50 PM](https://reader036.vdocuments.us/reader036/viewer/2022071103/5fdce055b0887d526c2a6213/html5/thumbnails/30.jpg)
Noisy Net
• Noise on Action (Epsilon Greedy)
$a = \begin{cases} \arg\max_a Q(s, a), & \text{with probability } 1 - \varepsilon \\ \text{random}, & \text{otherwise} \end{cases}$
• Noise on Parameters
$a = \arg\max_a \tilde{Q}(s, a)$
Inject noise into the parameters of Q-function at the beginning of each episode: $Q(s, a) \to \tilde{Q}(s, a)$ (add noise).
The noise would NOT change in an episode.
https://arxiv.org/abs/1706.01905
https://arxiv.org/abs/1706.10295
![Page 31: PowerPoint 簡報speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2018/Lecture/QLearning (v2).pdf · Title PowerPoint 簡報 Author Hung-yi Lee Created Date 6/13/2018 8:45:50 PM](https://reader036.vdocuments.us/reader036/viewer/2022071103/5fdce055b0887d526c2a6213/html5/thumbnails/31.jpg)
Noisy Net
• Noise on Action
• Given the same state, the agent may take different actions.
• No real policy works in this way
• Noise on Parameters
• Given the same (similar) state, the agent takes the same action.
• ⟶ State-dependent Exploration
• Explore in a consistent way
Trying at random
Trying systematically
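A minimal sketch of parameter noise, assuming a linear Q-function $Q(s, a) = w_a \cdot s$ and Gaussian noise with standard deviation `sigma` (both are illustrative assumptions, not what the cited papers necessarily use). The noise is drawn once, when `make_noisy_q` is called at the start of an episode, and then stays fixed.

```python
import random

def make_noisy_q(weights, sigma, rng):
    """Build Q-tilde by adding noise to every parameter once (per episode)."""
    noisy = [[w + rng.gauss(0.0, sigma) for w in row] for row in weights]

    def q_tilde(state):
        # One row of weights per action; Q-tilde(s, a) is a dot product.
        return [sum(w * x for w, x in zip(row, state)) for row in noisy]

    return q_tilde

def act(q_fn, state):
    """a = argmax_a Q-tilde(s, a)."""
    q = q_fn(state)
    return max(range(len(q)), key=lambda a: q[a])

rng = random.Random(0)
q_tilde = make_noisy_q([[1.0, 0.0], [0.0, 1.0]], sigma=0.5, rng=rng)
actions = [act(q_tilde, [1.0, 0.2]) for _ in range(5)]
```

Within the episode, the same state always yields the same action, unlike epsilon greedy; a different noise draw in the next episode explores a different but still consistent policy.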
![Page 32: PowerPoint 簡報speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2018/Lecture/QLearning (v2).pdf · Title PowerPoint 簡報 Author Hung-yi Lee Created Date 6/13/2018 8:45:50 PM](https://reader036.vdocuments.us/reader036/viewer/2022071103/5fdce055b0887d526c2a6213/html5/thumbnails/32.jpg)
Demo
https://blog.openai.com/better-exploration-with-parameter-noise/
![Page 33: PowerPoint 簡報speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2018/Lecture/QLearning (v2).pdf · Title PowerPoint 簡報 Author Hung-yi Lee Created Date 6/13/2018 8:45:50 PM](https://reader036.vdocuments.us/reader036/viewer/2022071103/5fdce055b0887d526c2a6213/html5/thumbnails/33.jpg)
Distributional Q-function
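• State-action value function $Q^\pi(s, a)$
• When using actor π, $Q^\pi(s, a)$ is the cumulative reward expected to be obtained after seeing observation s and taking action a.
Different distributions can have the same expected value.
[Figure: two reward distributions over the range -10 to 10 that share the same expected value $Q^\pi(s, a)$.]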
![Page 34: PowerPoint 簡報speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2018/Lecture/QLearning (v2).pdf · Title PowerPoint 簡報 Author Hung-yi Lee Created Date 6/13/2018 8:45:50 PM](https://reader036.vdocuments.us/reader036/viewer/2022071103/5fdce055b0887d526c2a6213/html5/thumbnails/34.jpg)
Distributional Q-function
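[Figure: two networks that take s as input. The ordinary $Q^\pi$ is a network with 3 outputs: $Q^\pi(s, a_1)$, $Q^\pi(s, a_2)$, $Q^\pi(s, a_3)$. The distributional $Q^\pi$ is a network with 15 outputs (each action has 5 bins).]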
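To make the "same value, different distribution" point concrete, the sketch below uses 5 bins per action. The bin centres and the probabilities are hypothetical; only the count of 5 bins comes from the slide.

```python
def expected_value(probs, bin_values):
    """Q(s, a) recovered as the mean of the predicted reward distribution."""
    return sum(p * v for p, v in zip(probs, bin_values))

bins = [-10, -5, 0, 5, 10]            # assumed bin centres
narrow = [0.0, 0.0, 1.0, 0.0, 0.0]    # all mass at 0
wide = [0.5, 0.0, 0.0, 0.0, 0.5]      # mass split between -10 and 10

q_narrow = expected_value(narrow, bins)
q_wide = expected_value(wide, bins)
```

An ordinary Q network would output the same number for both actions, while the distributional output also shows that the second one is much riskier.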
![Page 35: PowerPoint 簡報speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2018/Lecture/QLearning (v2).pdf · Title PowerPoint 簡報 Author Hung-yi Lee Created Date 6/13/2018 8:45:50 PM](https://reader036.vdocuments.us/reader036/viewer/2022071103/5fdce055b0887d526c2a6213/html5/thumbnails/35.jpg)
Demo
https://youtu.be/yFBwyPuO2Vg
![Page 36: PowerPoint 簡報speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2018/Lecture/QLearning (v2).pdf · Title PowerPoint 簡報 Author Hung-yi Lee Created Date 6/13/2018 8:45:50 PM](https://reader036.vdocuments.us/reader036/viewer/2022071103/5fdce055b0887d526c2a6213/html5/thumbnails/36.jpg)
Rainbow
https://arxiv.org/abs/1710.02298
![Page 37: PowerPoint 簡報speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2018/Lecture/QLearning (v2).pdf · Title PowerPoint 簡報 Author Hung-yi Lee Created Date 6/13/2018 8:45:50 PM](https://reader036.vdocuments.us/reader036/viewer/2022071103/5fdce055b0887d526c2a6213/html5/thumbnails/37.jpg)
Rainbow
https://arxiv.org/abs/1710.02298
![Page 38: PowerPoint 簡報speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2018/Lecture/QLearning (v2).pdf · Title PowerPoint 簡報 Author Hung-yi Lee Created Date 6/13/2018 8:45:50 PM](https://reader036.vdocuments.us/reader036/viewer/2022071103/5fdce055b0887d526c2a6213/html5/thumbnails/38.jpg)
Outline
Introduction of Q-Learning
Tips of Q-Learning
Q-Learning for Continuous Actions
![Page 39: PowerPoint 簡報speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2018/Lecture/QLearning (v2).pdf · Title PowerPoint 簡報 Author Hung-yi Lee Created Date 6/13/2018 8:45:50 PM](https://reader036.vdocuments.us/reader036/viewer/2022071103/5fdce055b0887d526c2a6213/html5/thumbnails/39.jpg)
Continuous Actions
• Action $a$ is a continuous vector
$a = \arg\max_a Q(s, a)$
Solution 1
Sample a set of actions: $\{a_1, a_2, \cdots, a_N\}$
See which action can obtain the largest Q value
Solution 2
Using gradient ascent to solve the optimization problem.
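Both ideas can be sketched for a scalar action. The toy function `q` below, the sampling range, the learning rate, and the finite-difference gradient are all assumptions for illustration; a real Q network would take a vector action and supply its own gradient.

```python
import random

def best_sampled_action(q, s, n_samples, low, high, rng):
    """Sample N candidate actions and keep the one with the largest Q value."""
    candidates = [rng.uniform(low, high) for _ in range(n_samples)]
    return max(candidates, key=lambda a: q(s, a))

def gradient_ascent_action(q, s, a0, lr=0.1, steps=200, h=1e-5):
    """Maximise Q(s, a) over a by gradient ascent with a numerical gradient."""
    a = a0
    for _ in range(steps):
        grad = (q(s, a + h) - q(s, a - h)) / (2 * h)
        a += lr * grad
    return a

def q(s, a):
    """Hypothetical Q-function with its maximum at a = 0.3 (state is ignored)."""
    return -(a - 0.3) ** 2

a_sampled = best_sampled_action(q, None, 1000, -1.0, 1.0, random.Random(0))
a_ascent = gradient_ascent_action(q, None, a0=-1.0)
```

Sampling is only as precise as the number of candidates allows, and gradient ascent adds an inner optimization loop to every action choice, which can be slow and may stop at a local maximum for a non-concave Q.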
![Page 40: PowerPoint 簡報speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2018/Lecture/QLearning (v2).pdf · Title PowerPoint 簡報 Author Hung-yi Lee Created Date 6/13/2018 8:45:50 PM](https://reader036.vdocuments.us/reader036/viewer/2022071103/5fdce055b0887d526c2a6213/html5/thumbnails/40.jpg)
Continuous Actions
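Solution 3: Design a network to make the optimization easy.
[Figure: the network $Q^\pi$ takes s as input and outputs $\mu(s)$ (vector), $\Sigma(s)$ (matrix) and $V(s)$ (scalar).]
$Q(s, a) = -(a - \mu(s))^T \Sigma(s) (a - \mu(s)) + V(s)$
$\mu(s) = \arg\max_a Q(s, a)$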
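The quadratic form makes the argmax available in closed form. The sketch below evaluates $Q(s, a)$ for given $\mu(s)$, $\Sigma(s)$ and $V(s)$, with hypothetical numbers and the assumption that $\Sigma(s)$ is positive definite, so the quadratic term is never negative.

```python
def naf_q(a, mu, sigma, v):
    """Q(s, a) = -(a - mu)^T Sigma (a - mu) + V for one fixed state s."""
    d = [ai - mi for ai, mi in zip(a, mu)]
    n = len(d)
    quad = sum(d[i] * sigma[i][j] * d[j] for i in range(n) for j in range(n))
    return -quad + v

# Hypothetical network outputs for one state.
mu = [0.5, -1.0]
sigma = [[2.0, 0.5], [0.5, 1.0]]   # positive definite (assumption)
v = 3.0

q_at_mu = naf_q(mu, mu, sigma, v)
q_elsewhere = naf_q([0.0, 0.0], mu, sigma, v)
```

Since the quadratic term is zero only at $a = \mu(s)$, the best action is read directly from the network output and its value is $V(s)$.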
![Page 41: PowerPoint 簡報speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2018/Lecture/QLearning (v2).pdf · Title PowerPoint 簡報 Author Hung-yi Lee Created Date 6/13/2018 8:45:50 PM](https://reader036.vdocuments.us/reader036/viewer/2022071103/5fdce055b0887d526c2a6213/html5/thumbnails/41.jpg)
https://www.youtube.com/watch?v=ZhsEKTo7V04
![Page 42: PowerPoint 簡報speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2018/Lecture/QLearning (v2).pdf · Title PowerPoint 簡報 Author Hung-yi Lee Created Date 6/13/2018 8:45:50 PM](https://reader036.vdocuments.us/reader036/viewer/2022071103/5fdce055b0887d526c2a6213/html5/thumbnails/42.jpg)
Continuous Actions
Solution 4: Don't use Q-learning
[Figure: policy-based methods (learning an actor), value-based methods (learning a critic), and Actor + Critic, which combines both (next lecture).]
![Page 43: PowerPoint 簡報speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2018/Lecture/QLearning (v2).pdf · Title PowerPoint 簡報 Author Hung-yi Lee Created Date 6/13/2018 8:45:50 PM](https://reader036.vdocuments.us/reader036/viewer/2022071103/5fdce055b0887d526c2a6213/html5/thumbnails/43.jpg)
Acknowledgement
• Thanks to classmate 林雨新 for spotting the typos in the slides.