TRANSCRIPT
Effective Reinforcement Learning for Mobile Robots
Smart, W.D. and Kaelbling, L.P.
Contents
– Background
– Review of Q-learning
– Reinforcement learning on mobile robots
– Learning framework
– Experimental results
– Conclusion
– Discussion
Background
Hard to code behaviour efficiently and correctly by hand
Reinforcement learning: tell the robot what to do, not how to do it
How well suited is reinforcement learning for mobile robots?
Review of Q-learning
Discrete states s and actions a
Learn the value function by observing rewards
– Actual function: Q*(s,a) = E[R(s,a) + γ max_a′ Q*(s′,a′)]
– Learning update: Q(s_t,a_t) ← (1−α) Q(s_t,a_t) + α (r_{t+1} + γ max_a′ Q(s_{t+1},a′))
Sample distribution has no effect on the learned policy π*(s) = argmax_a Q*(s,a)
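As a concrete illustration of the update rule above, here is a minimal tabular Q-learning step in Python; the table indices and default parameter values are placeholders for illustration, not the paper's setup.

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.2, gamma=0.99):
    """One Q-learning step:
    Q(s,a) <- (1-alpha)*Q(s,a) + alpha*(r + gamma*max_a' Q(s',a'))."""
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * Q[s_next].max())

def greedy_policy(Q, s):
    """pi*(s) = argmax_a Q*(s,a): the policy read off a learned Q-table."""
    return int(np.argmax(Q[s]))
```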
Reinforcement learning on mobile robots
Sparse reward function
– Reward R(s,a) is almost always zero
– Non-zero reward only on success or failure
Continuous environment
– HEDGER is used as the function approximator
– Function approximation can be used safely when it never extrapolates from the data
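HEDGER itself is defined in the paper; the snippet below is only a rough sketch of the key safety idea, substituting a simple distance test for HEDGER's actual hull-based check: refuse to predict, and fall back to a default value, whenever the query is not supported by nearby training data.

```python
import numpy as np

def lwr_predict(X, y, query, bandwidth=0.5, max_dist=1.0, default=0.0):
    """Locally weighted average in the spirit of HEDGER: only predict
    when the query is close to stored data; otherwise return a safe
    default instead of extrapolating.  (Illustrative only: the real
    HEDGER uses an independent-variable-hull test, not this distance
    threshold, and the parameter values here are assumptions.)"""
    dists = np.linalg.norm(X - query, axis=1)
    if dists.min() > max_dist:              # query lies outside the data
        return default                      # refuse to extrapolate
    w = np.exp(-(dists / bandwidth) ** 2)   # Gaussian kernel weights
    return float(np.dot(w, y) / w.sum())
```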
Reinforcement learning on mobile robots
Q-learning can only succeed once a state with positive reward has been found
A sparse reward function and a continuous environment make reward states hard to find by trial and error
Solution: show the robot how to find the reward states
Learning framework
Split learning into two phases (see the sketch below):
– Phase one: actions are controlled by an outside controller; the learning algorithm only passively observes
– Phase two: the learning algorithm is in control and learns the optimal policy
By ‘showing’ the robot where the interesting states are, learning should be quicker
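A schematic of the two-phase framework in Python; `env`, `teacher`, and `learner` are hypothetical interfaces standing in for the robot, the outside controller, and the Q-learning/HEDGER system, and are not part of the paper.

```python
def run_two_phase_learning(env, learner, teacher, n_phase1, n_phase2):
    """Phase 1: a teacher drives; the learner only observes transitions.
       Phase 2: the learner chooses actions and keeps updating."""
    for _ in range(n_phase1):                 # phase 1: passive observation
        s, done = env.reset(), False
        while not done:
            a = teacher(s)                    # teacher supplies the action
            s_next, r, done = env.step(a)
            learner.update(s, a, r, s_next)   # learn from the example
            s = s_next
    for _ in range(n_phase2):                 # phase 2: learner in control
        s, done = env.reset(), False
        while not done:
            a = learner.act(s)                # learner picks its own action
            s_next, r, done = env.step(a)
            learner.update(s, a, r, s_next)
            s = s_next
```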
Experimental setup
Two experiments on a B21r mobile robot
– Movement speed is fixed by an outside controller
– Rotation speed has to be learned
– Settings: α = 0.2, γ = 0.99 or 0.90
Performance is measured after every 5 runs
– The robot does not learn from these test runs
– Starting position and orientation are similar, not identical
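The test protocol can be sketched with the same hypothetical interfaces as above: evaluation runs act greedily and skip the learning update, so measuring performance does not alter the value function.

```python
def evaluate(env, learner, n_eval=1):
    """Test runs interleaved with training: act greedily, no updates."""
    returns = []
    for _ in range(n_eval):
        s, done, total = env.reset(), False, 0.0
        while not done:
            a = learner.act(s)        # greedy action, no exploration
            s, r, done = env.step(a)
            total += r                # no learner.update() here
        returns.append(total)
    return returns
```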
Experimental Results: Corridor Following Task
State space:
– distance to the end of the corridor
– distance to the left wall, as a fraction of corridor width
– angle to the target point
Experimental Results: Corridor Following Task
Computer-controlled teacher
– Rotation speed is a fraction of the angle to the target point
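Such a teacher is essentially a proportional steering controller; a minimal sketch, where the gain and rotation limit are assumed values rather than the paper's.

```python
def computer_teacher(angle_to_target, k=0.3, max_rot=1.0):
    """Hand-coded teacher: rotation speed is a fraction k of the angle
    to the target point, clamped to the robot's rotation limits."""
    return max(-max_rot, min(max_rot, k * angle_to_target))
```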
Experimental Results: Corridor Following Task
Human-controlled teacher
– Uses a different corridor than the computer-controlled teacher
Experimental Results: Corridor Following Task Results
Decrease in performance just after training
– Phase 2 supplies more novel experiences
The sloppy human controller causes faster convergence than the rigid computer controller
– Fewer phase 1 and phase 2 runs are needed
– The human controller supplies more varied data
Experimental Results: Corridor Following Task Results
Simulated performance without the advantage of teacher examples
Experimental Results: Obstacle Avoidance Task
State space:
– direction and distance to obstacles
– direction and distance to the target
Experimental Results: Obstacle Avoidance Task Results
Human-controlled teacher
– The robot starts 3 m from the target with a random orientation
Experimental Results: Obstacle Avoidance Task Results
Simulation without teacher examples
– No obstacles present; the robot only has to reach the goal
– The simulated robot starts in the right orientation, 3 meters from the target
– Only 18.7% of runs reached the target within one week of simulated time, taking 6.54 hours on average
Conclusion
Passive observation of appropriate state-action behaviour can speed up Q-learning
The teacher needs no knowledge of the robot or of the learning algorithm
Any demonstrated solution will do; providing a good solution is not necessary
Discussion