TRANSCRIPT
Effective Reinforcement Learning for Mobile Robots
Smart, W.D. and Kaelbling, L.P.
Contents
– Background
– Review of Q-learning
– Reinforcement learning on mobile robots
– Learning framework
– Experimental results
– Conclusion
– Discussion
Background
Hard to code behaviour efficiently and correctly by hand
Reinforcement learning: tell the robot what to do, not how to do it
How well suited is reinforcement learning for mobile robots?
Review of Q-learning
Discrete states s and actions a
Learn the value function by observing rewards
– Actual function: Q*(s,a) = E[R(s,a) + γ max_a′ Q*(s′,a′)]
– Learning update: Q(s_t,a_t) ← (1−α) Q(s_t,a_t) + α (r_{t+1} + γ max_a′ Q(s_{t+1},a′))
Sample distribution has no effect on the learned policy π*(s) = argmax_a Q*(s,a)
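As a concrete illustration of the update rule above, here is a minimal tabular Q-learning step in Python; the table indices and default parameter values are placeholders for illustration, not the paper's setup.

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.2, gamma=0.99):
    """One Q-learning step:
    Q(s,a) <- (1-alpha)*Q(s,a) + alpha*(r + gamma*max_a' Q(s',a'))."""
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * Q[s_next].max())

def greedy_policy(Q, s):
    """pi*(s) = argmax_a Q*(s,a): the policy read off a learned Q-table."""
    return int(np.argmax(Q[s]))
```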
Reinforcement learning on mobile robots
Sparse reward function
– Reward R(s,a) is almost always zero
– Non-zero reward only on success or failure
Continuous environment
– HEDGER is used as the function approximator
– Function approximation can be used safely when it never extrapolates from the data
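HEDGER itself is defined in the paper; the snippet below is only a rough sketch of the key safety idea, substituting a simple distance test for HEDGER's actual hull-based check: refuse to predict, and fall back to a default value, whenever the query is not supported by nearby training data.

```python
import numpy as np

def lwr_predict(X, y, query, bandwidth=0.5, max_dist=1.0, default=0.0):
    """Locally weighted average in the spirit of HEDGER: only predict
    when the query is close to stored data; otherwise return a safe
    default instead of extrapolating.  (Illustrative only: the real
    HEDGER uses an independent-variable-hull test, not this distance
    threshold, and the parameter values here are assumptions.)"""
    dists = np.linalg.norm(X - query, axis=1)
    if dists.min() > max_dist:              # query lies outside the data
        return default                      # refuse to extrapolate
    w = np.exp(-(dists / bandwidth) ** 2)   # Gaussian kernel weights
    return float(np.dot(w, y) / w.sum())
```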
Reinforcement learning on mobile robots
Q-learning can only succeed once a state with positive reward has been found
A sparse reward function and a continuous environment make reward states hard to find by trial and error
Solution: show the robot how to find the reward states
Learning framework
Split learning into two phases (see the sketch below):
– Phase one: actions are controlled by an outside controller; the learning algorithm only passively observes
– Phase two: the learning algorithm is in control and learns the optimal policy
By ‘showing’ the robot where the interesting states are, learning should be quicker
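A schematic of the two-phase framework in Python; `env`, `teacher`, and `learner` are hypothetical interfaces standing in for the robot, the outside controller, and the Q-learning/HEDGER system, and are not part of the paper.

```python
def run_two_phase_learning(env, learner, teacher, n_phase1, n_phase2):
    """Phase 1: a teacher drives; the learner only observes transitions.
       Phase 2: the learner chooses actions and keeps updating."""
    for _ in range(n_phase1):                 # phase 1: passive observation
        s, done = env.reset(), False
        while not done:
            a = teacher(s)                    # teacher supplies the action
            s_next, r, done = env.step(a)
            learner.update(s, a, r, s_next)   # learn from the example
            s = s_next
    for _ in range(n_phase2):                 # phase 2: learner in control
        s, done = env.reset(), False
        while not done:
            a = learner.act(s)                # learner picks its own action
            s_next, r, done = env.step(a)
            learner.update(s, a, r, s_next)
            s = s_next
```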
Experimental setup
Two experiments on a B21r mobile robot
– Movement speed is fixed by an outside controller
– Rotation speed has to be learned
– Settings: α = 0.2, γ = 0.99 or 0.90
Performance is measured after every 5 runs
– The robot does not learn from these test runs
– Starting position and orientation are similar, not identical
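The test protocol can be sketched with the same hypothetical interfaces as above: evaluation runs act greedily and skip the learning update, so measuring performance does not alter the value function.

```python
def evaluate(env, learner, n_eval=1):
    """Test runs interleaved with training: act greedily, no updates."""
    returns = []
    for _ in range(n_eval):
        s, done, total = env.reset(), False, 0.0
        while not done:
            a = learner.act(s)        # greedy action, no exploration
            s, r, done = env.step(a)
            total += r                # no learner.update() here
        returns.append(total)
    return returns
```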
Experimental Results: Corridor Following Task
State space:
– distance to the end of the corridor
– distance to the left wall, as a fraction of corridor width
– angle to the target point
Experimental Results: Corridor Following Task
Computer-controlled teacher
– Rotation speed is a fraction of the angle to the target point
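Such a teacher is essentially a proportional steering controller; a minimal sketch, where the gain and rotation limit are assumed values rather than the paper's.

```python
def computer_teacher(angle_to_target, k=0.3, max_rot=1.0):
    """Hand-coded teacher: rotation speed is a fraction k of the angle
    to the target point, clamped to the robot's rotation limits."""
    return max(-max_rot, min(max_rot, k * angle_to_target))
```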
Experimental Results: Corridor Following Task
Human-controlled teacher
– Uses a different corridor than the computer-controlled teacher
Experimental Results: Corridor Following Task Results
Decrease in performance just after training
– Phase 2 supplies more novel experiences
The sloppy human controller causes faster convergence than the rigid computer controller
– Fewer phase 1 and phase 2 runs are needed
– The human controller supplies more varied data
Experimental Results: Corridor Following Task Results
Simulated performance without the advantage of teacher examples
Experimental Results: Obstacle Avoidance Task
State space:
– direction and distance to obstacles
– direction and distance to the target
Experimental Results: Obstacle Avoidance Task Results
Human-controlled teacher
– The robot starts 3 m from the target with a random orientation
Experimental Results: Obstacle Avoidance Task Results
Simulation without teacher examples
– No obstacles present; the robot only has to reach the goal
– The simulated robot starts in the right orientation, 3 meters from the target
– Only 18.7% of runs reached the target within one week of simulated time, taking 6.54 hours on average
Conclusion
Passive observation of appropriate state-action behaviour can speed up Q-learning
The teacher needs no knowledge of the robot or of the learning algorithm
Any demonstrated solution will do; providing a good solution is not necessary
Discussion