Concurrent Hierarchical Reinforcement Learning

DESCRIPTION
Slides on concurrent hierarchical reinforcement learning. Example problem: Stratagus. Challenges: large state space, large action space, complex policies. Writing agents for Stratagus: a game company spends N person-years to write the "AI" for a game.

TRANSCRIPT
Concurrent Hierarchical Reinforcement Learning
Bhaskara Marthi, Stuart Russell, David Latham (UC Berkeley)
Carlos Guestrin (CMU)
Example problem: Stratagus
Challenges:
- Large state space
- Large action space
- Complex policies
Writing agents for Stratagus
- Game company: N person-years to write the "AI" for a game
- Standard RL: learn tabula rasa
- Hierarchical RL: concurrent partial program + learning of its optimal completion
Outline
- Running example
- Background
- Our contributions:
  - Concurrent ALisp language for partial programs
  - Basic learning method
  - … with threadwise decomposition
  - … and temporal decomposition
- Results
Running example
- Peasants can move, pick up, and drop off resources
- Penalty for collision
- Cost of living each step
- Reward for dropping off resources
- Goal: gather 10 gold + 10 wood
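The reward structure above can be sketched as a simple function. This is our illustration only; the numeric constants are assumptions, not values from the slides:

```python
# Illustrative sketch of the running example's reward structure.
# The constants are assumptions, not the values used in the paper.
def reward(collided, dropped_off, step_cost=-0.1,
           collision_penalty=-5.0, dropoff_reward=1.0):
    r = step_cost                # cost of living each step
    if collided:
        r += collision_penalty   # penalty for collision
    if dropped_off:
        r += dropoff_reward      # reward for dropping off resources
    return r
```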
Outline
- Running example
- Background
- Our contributions:
  - Concurrent ALisp language for partial programs
  - Basic learning method
  - … with threadwise decomposition
  - … and temporal decomposition
- Results
Reinforcement Learning
[Diagram: the learning algorithm produces a policy; the policy sends action a to the environment and receives state s and reward r.]
Q-functions
Represent policies implicitly using a Q-function: Qπ(s,a) = "expected total reward if I do action a in environment state s and follow policy π thereafter".

Fragment of example Q-function:

| s | a | Q(s,a) |
| --- | --- | --- |
| … | … | … |
| Peas1Loc=(2,3), Peas2Loc=(4,2), Gold=8, Wood=3 | (East, Pickup) | 7.5 |
| Peas1Loc=(4,4), Peas2Loc=(6,3), Gold=4, Wood=7 | (West, North) | -12 |
| Peas1Loc=(5,1), Peas2Loc=(1,2), Gold=8, Wood=3 | (South, West) | 1.9 |
| … | … | … |
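In a tabular representation, such a Q-function is simply a lookup table over (state, action) pairs. A minimal sketch (ours, not the authors' code), with hypothetical entries mirroring the fragment above:

```python
# Minimal tabular Q-function sketch; the entries are hypothetical.
from collections import defaultdict

Q = defaultdict(float)  # unseen (s, a) pairs default to 0

def greedy(state, actions):
    """Pick the action maximizing Q(state, a)."""
    return max(actions, key=lambda a: Q[(state, a)])

s = ("Peas1Loc=(2,3)", "Peas2Loc=(4,2)", "Gold=8", "Wood=3")
Q[(s, ("East", "Pickup"))] = 7.5
Q[(s, ("West", "North"))] = -12.0
```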
Hierarchical RL and ALisp
[Diagram: a partial program and a learning algorithm together produce a completion; the completion sends action a to the environment and receives state s and reward r.]
(defun top ()
  (loop do (choose 'top-choice (gather-wood) (gather-gold))))

(defun gather-wood ()
  (with-choice 'forest-choice (dest *forest-list*)
    (nav dest) (action 'get-wood)
    (nav *base-loc*) (action 'dropoff)))

(defun gather-gold ()
  (with-choice 'mine-choice (dest *goldmine-list*)
    (nav dest) (action 'get-gold)
    (nav *base-loc*) (action 'dropoff)))

(defun nav (dest)
  (until (= (pos (get-state)) dest)
    (with-choice 'nav-choice (move '(N S E W NOOP))
      (action move))))
Single-threaded ALisp program. The program state includes:
- Program counter
- Call stack
- Global variables
Q-functions
Represent completions using a Q-function. Joint state ω = environment state + program state; MDP + partial program = SMDP over {ω}. Qπ(ω,u) = "expected total reward if I make choice u in ω and follow completion π thereafter".

Example Q-function:

| ω | u | Q(ω,u) |
| --- | --- | --- |
| … | … | … |
| At nav-choice, Pos=(2,3), Dest=(6,5) | North | 15 |
| At resource-choice, Gold=7, Wood=3 | Gather-wood | -42 |
| … | … | … |
Temporal Q-decomposition

[Diagram: a call stack Top → GatherWood → Nav(Forest1), with the reward sequence split into Qr, Qc, and Qe segments that sum to Q.]

Temporal decomposition Q = Qr + Qc + Qe, where
- Qr(ω,u) = reward while doing u
- Qc(ω,u) = reward in the current subroutine after doing u
- Qe(ω,u) = reward after the current subroutine
Q-learning for ALisp
- Temporal decomposition ⇒ state abstraction. E.g., while navigating, Qr doesn't depend on gold reserves.
- State abstraction ⇒ fewer parameters, faster learning, better generalization.
- An algorithm for learning the temporally decomposed Q-function (Andre & Russell, 2002) yields the optimal completion of the partial program.
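The state abstraction can be made concrete: each component of the decomposed Q-function conditions only on the variables it needs, so Qr while navigating ignores the gold and wood reserves. A sketch under our own hypothetical state encoding (the keying of each component is our illustrative choice, not the paper's):

```python
# Sketch: Q = Qr + Qc + Qe with per-component state abstraction.
# The keying of each component is an illustrative assumption.
from collections import defaultdict

Qr = defaultdict(float)  # while navigating: depends only on (pos, dest, u)
Qc = defaultdict(float)  # same abstraction as Qr in this sketch
Qe = defaultdict(float)  # after the subroutine: depends on (task, gold, wood, u)

def q_value(omega, u):
    pos, dest, task, gold, wood = omega
    return Qr[(pos, dest, u)] + Qc[(pos, dest, u)] + Qe[(task, gold, wood, u)]
```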
Outline
- Running example
- Background
- Our contributions:
  - Concurrent ALisp language for partial programs
  - Basic learning method
  - … with threadwise decomposition
  - … and temporal decomposition
- Results
Multieffector domains
- Domains where the action is a vector
- Each component of the action vector corresponds to an effector
- Stratagus: each unit/building is an effector
- The number of actions is exponential in the number of effectors
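To make the blow-up concrete: with 15 peasants each choosing among 5 primitive moves (counts chosen to match the running example, but illustrative), the joint action space already exceeds 30 billion:

```python
# The joint action count grows exponentially in the number of effectors.
num_effectors = 15        # e.g., 15 peasants
moves_per_effector = 5    # e.g., N, S, E, W, NOOP
num_joint_actions = moves_per_effector ** num_effectors
print(num_joint_actions)  # 30517578125
```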
HRL in multieffector domains
- Single ALisp program: the partial program must essentially implement multiple control stacks; independencies between tasks are not used; temporal decomposition is lost
- Separate ALisp program for each effector: hard to achieve coordination among effectors
- Our approach: a single multithreaded partial program to control all effectors
Threads and effectors
- Threads = tasks
- Each effector is assigned to a thread
- Threads can be created/destroyed
- Effectors can be moved between threads
- The set of effectors can change over time

[Diagram: units grouped into threads labelled Get-gold, Get-wood, Defend-Base, and a second Get-wood.]
Concurrent ALisp syntax

Extension of ALisp. New/modified operations:
- (action (e1 a1) ... (en an)): each effector ei does ai
- (spawn thread fn effectors): start a new thread that calls function fn, and reassign effectors to it
- (reassign effectors dest-thread): reassign effectors to thread dest-thread
- (my-effectors): return the list of effectors assigned to the calling thread

The program should also specify how new effectors are assigned to threads on each step.
An example single-threaded ALisp program:

(defun top ()
  (loop do (choose 'top-choice (gather-gold) (gather-wood))))

(Its gather-wood, gather-gold, and nav are as shown earlier.)

An example Concurrent ALisp program:

(defun top ()
  (loop do
    (until (my-effectors) (choose 'dummy))
    (setf peas (first (my-effectors)))
    (choose 'top-choice
            (spawn gather-wood peas)
            (spawn gather-gold peas))))

(defun nav (dest)
  (until (= (my-pos) dest)
    (with-choice 'nav-choice (move '(N S E W NOOP))
      (action move))))

(gather-wood and gather-gold are unchanged from the single-threaded version.)
Two peasant threads executing concurrently (thread 1 runs gather-wood for Peasant 1; thread 2 runs gather-gold for Peasant 2):

(defun gather-wood ()
  (setf dest (choose *forests*))
  (nav dest)
  (action 'get-wood)
  (nav *home-base-loc*)
  (action 'put-wood))

(defun gather-gold ()
  (setf dest (choose *goldmines*))
  (nav dest)
  (action 'get-gold)
  (nav *home-base-loc*)
  (action 'put-gold))

(defun nav (dest)  ; thread 1 calls (nav Wood1), thread 2 calls (nav Gold2)
  (loop until (at-pos dest)
        do (setf d (choose '(N S E W R)))
           (action d)))
[Animation: across environment timesteps 26 and 27, Peasant 1's and Peasant 2's threads each cycle through Running, Paused (making a joint choice), and Waiting for the joint action.]

Concurrent ALisp semantics
- Threads execute independently until they hit a choice or action
- Wait until all threads are at a choice or action:
  - If all effectors have been assigned an action, do that joint action in the environment
  - Otherwise, make a joint choice
- Can be converted to a semi-Markov decision process, assuming the partial program is deadlock-free and has no race conditions
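The synchronization rule can be sketched as follows (our illustration, not the authors' implementation): once every thread has paused at a choice or an action, the scheduler either executes the joint action or resolves the joint choice.

```python
# Sketch of the joint-step rule. A thread paused at an action carries
# a dict of {effector: primitive}; a thread paused at a choice does not.
class PausedThread:
    def __init__(self, name, at_choice=False, actions=None):
        self.name = name
        self.at_choice = at_choice
        self.actions = actions or {}

def joint_step(threads, effectors):
    pending = {e: a for t in threads for e, a in t.actions.items()}
    if all(e in pending for e in effectors):
        return ("act", pending)      # every effector has an action: execute it
    return ("choose", [t.name for t in threads if t.at_choice])
```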
Q-functions
- To complete the partial program, at each choice state ω we need to specify choices for all choosing threads
- So Q(ω,u) is as before, except u is now a joint choice

Example Q-function:

| ω | u | Q(ω,u) |
| --- | --- | --- |
| … | … | … |
| Peas1 at NavChoice, Peas2 at DropoffGold, Peas3 at ForestChoice, Pos1=(2,3), Pos3=(7,4), Gold=12, Wood=14 | (Peas1: East, Peas3: Forest2) | 15.7 |
| … | … | … |
Outline
- Running example
- Background
- Our contributions:
  - Concurrent ALisp language for partial programs
  - Basic learning method
  - … with threadwise decomposition
  - … and temporal decomposition
- Results
Making joint choices at runtime
- At state ω, we want to choose argmax_u Q(ω,u)
- Large state space; large choice space: the number of joint choices is exponential in the number of choosing threads
- Use relational linear value function approximation with coordination graphs (Guestrin et al., 2003) to represent and maximize Q efficiently
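The factored maximization can be sketched with variable elimination: when Q decomposes into local terms each touching only a few threads, argmax over joint choices avoids enumerating all combinations. The two local terms below are hypothetical, chosen only to illustrate the mechanics:

```python
# Sketch: maximize a factored Q by eliminating one variable at a time.
# q12 is a hypothetical term coupling threads 1 and 2 (a collision
# penalty when both pick the same move); q2 is local to thread 2.
choices = ["N", "S", "E", "W"]

def q12(u1, u2):
    return 1.0 if u1 != u2 else -5.0

def q2(u2):
    return {"N": 2.0, "S": 0.0, "E": 1.0, "W": 0.0}[u2]

# Eliminate u1: precompute its best response to each u2.
best_u1 = {u2: max(choices, key=lambda u1: q12(u1, u2)) for u2 in choices}
msg = {u2: q12(best_u1[u2], u2) for u2 in choices}
u2_star = max(choices, key=lambda u2: q2(u2) + msg[u2])
u1_star = best_u1[u2_star]
```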
Basic learning procedure
- Apply Q-learning on the equivalent SMDP
- Boltzmann exploration; the maximization on each step is done efficiently using the coordination graph
- Theorem: in the tabular case, this converges to the optimal completion of the partial program
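The SMDP Q-learning update can be sketched as follows (our notation and tabular simplification, not the authors' code): a choice u at joint state ω may span τ environment steps, so the observed reward is accumulated and the next-state value is discounted by γ^τ.

```python
# Sketch of a tabular SMDP Q-learning update.
from collections import defaultdict

Q = defaultdict(float)
alpha, gamma = 0.1, 0.99  # illustrative learning rate and discount

def smdp_q_update(omega, u, reward_sum, tau, omega_next, next_choices):
    # reward_sum: total reward accumulated over the tau elapsed steps.
    best_next = max((Q[(omega_next, v)] for v in next_choices), default=0.0)
    target = reward_sum + (gamma ** tau) * best_next
    Q[(omega, u)] += alpha * (target - Q[(omega, u)])
```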
Outline
- Running example
- Background
- Our contributions:
  - Concurrent ALisp language for partial programs
  - Basic learning method
  - … with threadwise decomposition
  - … and temporal decomposition
- Results
Problems with basic learning method
- The temporal decomposition of the Q-function is lost
- There is no credit assignment among threads. Suppose peasant 1 drops off some gold at the base while peasant 2 wanders aimlessly: peasant 2 thinks he's done very well!
- This significantly slows learning as the number of peasants increases
Threadwise decomposition
- Idea: decompose the reward among threads (Russell & Zimdars, 2003)
- E.g., rewards for thread j only when peasant j drops off resources or collides with other peasants
- Qj(ω,u) = "expected total reward received by thread j if we make joint choice u and make globally optimal choices thereafter"
- Threadwise Q-decomposition: Q = Q1 + … + Qn
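The learning rule this implies can be sketched as follows (our simplified, tabular illustration in the style of Russell & Zimdars, 2003): each thread j keeps its own Qj, updated from its own reward rj, but the joint choice is greedy with respect to the sum of the components.

```python
# Sketch of threadwise-decomposed Q-learning for two threads.
from collections import defaultdict

Qs = [defaultdict(float), defaultdict(float)]  # one Qj per thread
alpha, gamma = 0.5, 1.0  # illustrative constants

def joint_greedy(omega, joint_choices):
    # The joint choice maximizes the SUM of the threads' components.
    return max(joint_choices,
               key=lambda u: sum(Qj[(omega, u)] for Qj in Qs))

def threadwise_update(omega, u, rewards, omega_next, joint_choices):
    u_next = joint_greedy(omega_next, joint_choices)  # shared greedy choice
    for Qj, rj in zip(Qs, rewards):
        target = rj + gamma * Qj[(omega_next, u_next)]
        Qj[(omega, u)] += alpha * (target - Qj[(omega, u)])
```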
Learning threadwise decomposition

[Diagram: the Top thread and Peasant 1-3 threads jointly emit action a at each action state; the environment reward r is decomposed (decomp) into per-thread rewards r1, r2, r3.]
Learning threadwise decomposition (continued)

[Diagram: at choice state ω, the Peasant 1-3 threads compute Q1(ω,·), Q2(ω,·), Q3(ω,·); these are summed to Q(ω,·), whose argmax gives the joint choice u; each thread then performs its own Q-update.]
Outline
- Running example
- Background
- Our contributions:
  - Concurrent ALisp language for partial programs
  - Basic learning method
  - … with threadwise decomposition
  - … and temporal decomposition
- Results
Threadwise and temporal decomposition

Qj = Qj,r + Qj,c + Qj,e, where
- Qj,r(ω,u) = expected reward gained by thread j while doing u
- Qj,c(ω,u) = expected reward gained by thread j after u but before leaving the current subroutine
- Qj,e(ω,u) = expected reward gained by thread j after the current subroutine

[Diagram: each of the Top and Peasant 1-3 threads carries its own Qj,r, Qj,c, Qj,e components.]
Outline
- Running example
- Background
- Our contributions:
  - Concurrent ALisp language for partial programs
  - Basic learning method
  - … with threadwise decomposition
  - … and temporal decomposition
- Results
Resource gathering with 15 peasants

[Learning curves, built up over four slides: x-axis = number of learning steps (×1000); y-axis = reward of the learnt policy. Curves shown for Flat, Undecomposed, Threadwise, and Threadwise + Temporal learners.]
Conclusion
Contributions:
- Concurrent ALisp language for partial programs
- Algorithms for choice selection and learning
- Threadwise and temporal decomposition of the Q-function

These let one encode prior knowledge and exploit structure in the Q-function for fast learning in very complex multieffector domains.

Questions?