TRANSCRIPT
MACHINE LEARNING – 2012
MACHINE LEARNING TECHNIQUES
AND APPLICATIONS
Reinforcement Learning
for Continuous State and Action Spaces
Gradient Methods
Reinforcement Learning (RL)
Supervised Learning
Unsupervised Learning
Reinforcement learning is in-between: the learner is not told the correct output (as in
supervised learning), but receives an evaluative reward signal (unlike unsupervised learning).
Drawbacks of classical RL
Curse of dimensionality:
computational costs increase dramatically with the number of states.
Markov world:
cannot handle continuous state and action spaces.
Model-based vs. model-free:
a model of the world is needed (it can be estimated through exploration).
Gradient methods are introduced to handle continuous state and action spaces.
RL in continuous state and action spaces
States and actions $s_t \in \mathbb{R}^N$ and $a_t \in \mathbb{R}^P$, $t = 1,\dots,T$, are continuous.
One can no longer sweep through all states and actions to determine the optimal policy.
Instead, one can either:
1) use function approximation to estimate the value function $V(s)$;
2) use function approximation to estimate the state-action value function $Q(s,a)$;
3) optimize a parameterized policy $\pi(a|s)$ (policy search).
Policy Gradients
Parametrize the value function: $V(s;\theta)$, with $\theta$ the open parameters.
Assume an initial estimate $V(s;\theta^k)$.
Run an episode using a greedy policy $\pi(a|s;\theta)$.
Compute the error on the estimate: $e = V(s;\theta) - E\big[\sum_t r_t\big]$.
Find the optimal parameters through gradient descent on the state value function:
$\Delta\theta \propto -\,e\,\dfrac{\partial V(s;\theta)}{\partial\theta}$
TD learning for continuous state-action spaces
Doya, NIPS 1996
Parametrize the value function such that
$V(s;\theta) = \theta^T \phi(s) = \sum_{j=1}^{K} \theta_j\,\phi_j(s).$
$\phi(s)$ is a set of $K$ basis functions (e.g. RBF functions).
These are set by the user and are also called the features.
$\theta_j$, $j = 1,\dots,K$, are the weights associated to each feature. These are
the unknown parameters.
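As a concrete (illustrative) choice of features, not specified on the slide, one could take
Gaussian RBFs centred at points $c_j$ with widths $\sigma_j$:
$$\phi_j(s) = \exp\!\left(-\frac{\|s - c_j\|^2}{2\sigma_j^2}\right), \qquad V(s;\theta) = \sum_{j=1}^{K}\theta_j\,\phi_j(s).$$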
TD learning for continuous state-action spaces
Doya, NIPS 1996
Pick a set of parameters $\theta$ and compute $V(s;\theta) = \sum_{j=1}^{K}\theta_j\,\phi_j(s)$.
Do some roll-outs of fixed episode length $T$ and gather the rewards $r_t(s_t)$, $t = 1,\dots,T$.
Estimate the new $V(s;\theta)$ through TD learning.
Gradient descent on the squared TD error gives, for $j = 1,\dots,K$:
$\theta_j \leftarrow \theta_j + \epsilon\,\big[r_t(s_t) + \gamma\,V(s_{t+1};\theta) - V(s_t;\theta)\big]\,\phi_j(s_t).$
Other techniques for the estimation of non-linear regression functions have been proposed
elsewhere to get a better estimate of the parameters than simple gradient descent.
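A minimal sketch of this TD(0) procedure in Python, assuming Gaussian RBF features over a
one-dimensional state; `env_step` and `policy` are hypothetical placeholders, and the learning
rate, discount factor and feature centres are illustrative choices, not values from Doya (1996).

```python
import numpy as np

# Illustrative Gaussian RBF features over a 1-D state (centres/width are arbitrary choices).
centres = np.linspace(-1.0, 1.0, 10)
sigma = 0.2

def features(s):
    # phi_j(s) = exp(-(s - c_j)^2 / (2 sigma^2))
    return np.exp(-(s - centres) ** 2 / (2 * sigma ** 2))

def V(s, theta):
    # V(s; theta) = theta^T phi(s)
    return theta @ features(s)

def td0_update(theta, s, r, s_next, alpha=0.05, gamma=0.95):
    # TD(0) error: delta = r + gamma * V(s') - V(s)
    delta = r + gamma * V(s_next, theta) - V(s, theta)
    # Gradient of V w.r.t. theta is phi(s), so the update is
    # theta_j <- theta_j + alpha * delta * phi_j(s)
    return theta + alpha * delta * features(s)

# Usage sketch with a stand-in environment (env_step and policy are placeholders):
# theta = np.zeros_like(centres)
# s = 0.0
# for t in range(T):
#     a = policy(s)                     # e.g. greedy w.r.t. a model, or a fixed controller
#     s_next, r = env_step(s, a)
#     theta = td0_update(theta, s, r, s_next)
#     s = s_next
```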
Robotics Applications of continuous RL
Teaching a two-joint, three-link robot leg to stand up
Robot configuration. θ0: pitch angle, θ1: hip joint angle, θ2: knee joint angle, θm: the angle of the
line from the center of mass to the center of the foot.
Morimoto and Doya, Robotics and Autonomous Systems , 2001
Robotics Applications of continuous RL
Morimoto and Doya, Robotics and Autonomous Systems , 2001
• The final goal is the upright stand-up posture.
• Reaching the final goal is a necessary but not sufficient condition for a successful
stand-up, because the robot may fall down after passing through the final goal;
hence the need to define subgoals.
• The stand-up task is accomplished when the robot stands up and stays upright
for more than 2(T + 1) seconds.
Robotics Applications of continuous RL
Morimoto and Doya, Robotics and Autonomous Systems , 2001
Hierarchical reinforcement learning:
Upper layer: discretize state-action space
and discover which subgoal should be
reached using Q-learning.
State: pitch and joint angles
Actions: joint displacement (not the torque).
Lower layer: continuous state-action space
The robot learns to apply appropriate torque
to achieve each subgoal.
State: pitch and joint angles, and their
velocities
Actions: torque at the two joints
Robotics Applications of continuous RL
Morimoto and Doya, Robotics and Autonomous Systems, 2001
Reward: decompose the task into an upper-level and a lower-level set of goals.
Y: height of the head of the robot at a sub-goal posture; L: total length of the robot.
When the robot achieves a sub-goal, the learner gets a reward < 0.5.
The full reward is obtained when all subgoals and the main goal are achieved.
Robotics Applications of continuous RL
Morimoto and Doya, Robotics and Autonomous Systems , 2001
State and action are continuous in time: $s(t)$, $u(t)$.
Model the value function: $V(s;\theta) = \theta^T\phi(s)$.
Model the policy: $u(s) = f\big(w^T\phi(s) + \epsilon\big)$,
with $f$ a sigmoid function and $\epsilon$ exploration noise.
Learn $\theta$ through gradient descent on the squared TD error.
Backpropagate the TD error $e(t)$ to estimate the policy weights $w_j$, $j = 1,\dots,K$
(the update is proportional to the TD error $e(t)$ and to the exploration noise $\epsilon(t)$).
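A rough, discretized actor-critic sketch consistent with the slide (sigmoid actor over linear
features, critic trained on the TD error). It is an illustration, not Morimoto and Doya's exact
continuous-time formulation; the learning rates and noise scale are arbitrary assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def select_action(w_pi, phi_s, noise_std=0.1, rng=np.random):
    """Actor: sigmoid of a linear combination of features, plus exploration noise."""
    eps = rng.randn() * noise_std
    u = sigmoid(w_pi @ phi_s + eps)
    return u, eps

def actor_critic_update(theta_v, w_pi, phi_s, phi_s_next, r, eps,
                        alpha_v=0.05, alpha_pi=0.01, gamma=0.95):
    """One discretized critic/actor update with linear features phi(s)."""
    # Critic: TD error from the observed reward and the value estimates.
    delta = r + gamma * (theta_v @ phi_s_next) - (theta_v @ phi_s)
    # Critic update: gradient descent on the squared TD error.
    theta_v = theta_v + alpha_v * delta * phi_s
    # Actor update: reinforce the exploration noise that preceded a positive TD error.
    w_pi = w_pi + alpha_pi * delta * eps * phi_s
    return theta_v, w_pi
```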
Robotics Applications of continuous RL
• 750 trials in simulation + 170 on real robot
• Goal: to stand up
Morimoto and Doya, Robotics and Autonomous Systems , 2001
Robotics Applications of continuous RL
Learning how to swing up a pendulum
Atkeson & Schaal, Intern. Conf. on Machine Learning, ICML 1997; Schaal, NIPS, 1997
Separate the problem into two parts:
a) learning to swing the pole up;
b) learning to balance the pole upright.
Part a is learned through TD-learning.
Part b is learned using optimal control.
Additionally, a model of the dynamics of the inverted pendulum is estimated through
locally weighted regression (estimating the parameters of a known model).
Learning how to swing up a pendulum
Atkeson & Schaal, Intern. Conf. on Machine Learning, ICML 1997; Schaal, NIPS, 1997
Collect a sequence $\{s_t^*, u_t^*\}_{t=1}^{T}$ from a single human demonstration.
State of the system: $s_t = (x_t, \dot{x}_t, \theta_t, \dot{\theta}_t)$
(human hand trajectory and velocity, angular trajectory and velocity of the pendulum).
Actions: $u_t$, the hand command (which can be converted into a torque command to the robot).
Reward: minimizes the angular and hand displacements as well as the velocities.
Robotics Applications of continuous RL
Learn first a model of the task through RL, using continuous TD learning to estimate $V(s;\theta)$:
take a model $V(s;\theta) = \sum_{j=1}^{K}\theta_j\,\phi_j(s)$.
The parameters are estimated incrementally through non-linear function approximation
(locally weighted regression).
Learn then the reward:
$r(x_t, u_t, t) = (x_t - x_t^*)^T Q\,(x_t - x_t^*) + u_t^T R\,u_t.$
$Q$ and $R$ are estimated so as to minimize discrepancies between human and robot trajectories.
Use the reward to generate optimal trajectories from optimal control.
Atkeson & Schaal, Intern. Conf. on Machine Learning, ICML 1997; Schaal, NIPS, 1997
Robotics Applications of continuous RL
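For the optimal-control part, one standard way to act on such a quadratic reward is a
discrete-time LQR controller. This is a sketch under the assumption of locally linear
dynamics $x_{t+1} \approx A x_t + B u_t$ (matrices A, B are assumed inputs), not necessarily
the exact formulation used by Atkeson & Schaal.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def lqr_gain(A, B, Q, R):
    """Discrete-time LQR: minimizes sum_t x^T Q x + u^T R u for x_{t+1} = A x + B u."""
    # Solve the discrete algebraic Riccati equation for the cost-to-go matrix P.
    P = solve_discrete_are(A, B, Q, R)
    # Optimal feedback gain K, used as u_t = -K (x_t - x_star_t).
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    return K

# Usage sketch: track the demonstrated trajectory x_star with u_t = -K (x_t - x_star_t).
```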
Learning how to swing up a pendulum
Atkeson & Schaal, Intern. Conf. on Machine Learning, ICML 1997; Schaal, NIPS, 1997
Robotics Applications of continuous RL
RL in continuous state and action spaces
States and actions $s_t \in \mathbb{R}^N$ and $a_t \in \mathbb{R}^P$, $t = 1,\dots,T$, are continuous.
One can no longer sweep through all states and actions to determine the optimal policy.
Instead, one can either:
1) use function approximation to estimate the value function $V(s)$;
2) use function approximation to estimate the state-action value function $Q(s,a)$;
3) optimize a parameterized policy $\pi(a|s)$ (policy search).
Least-Squares Policy Iteration
Least-Squares Policy Iteration
Approximate the state-action value function:
$Q(s,a;\alpha) = \alpha^T\phi(s,a) = \sum_{j=1}^{K}\alpha_j\,\phi_j(s,a).$
$\phi_j(s,a)$, $j = 1,\dots,K$, is a set of basis functions $\phi_j: \mathbb{R}^N \times \mathbb{R}^P \to \mathbb{R}$.
These are set by the user and are also called the features.
$\alpha_j$ are the weights associated to each feature. These are
the unknown parameters.
Least-Squares Policy Iteration
Perform a roll-out for $T$ time steps using the policy $\pi(a|s;\alpha)$
(greedy search on $\arg\max_a Q(s,a;\alpha)$ using the current estimate of $Q(s,a;\alpha)$).
Collect samples $s_{1:T}$, $a_{1:T-1}$, $r_{2:T}$.
Determine how good an initial choice of the parameters $\alpha$ is by comparing the predicted
reward $Q(s_t,a_t;\alpha)$ with the actual reward $R_t$, $t = 1,\dots,T$:
$R_t = \sum_{t'=t}^{T}\gamma^{t'-t}\,r_{t'}, \qquad
J(\alpha) = \sum_{t=1}^{T}\big(Q(s_t,a_t;\alpha) - R_t\big)^2 \quad$ (Bellman residual error),
with $s' \sim T(s,a,s')$ the transition model and $\gamma \in [0,1]$ the discount factor.
See Lagoudakis & Parr, Journal of Machine Learning Research, 2003 for various numerical solutions to
determine the optimal alpha parameter.
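A compact sketch of the least-squares fit at the heart of LSPI (often called LSTDQ), solving
for $\alpha$ from collected samples. The sample format, feature function and ridge
regularization are illustrative assumptions rather than Lagoudakis & Parr's exact implementation.

```python
import numpy as np

def lstdq(samples, phi, policy, gamma=0.95, ridge=1e-6):
    """Least-squares fit of alpha in Q(s, a; alpha) = alpha^T phi(s, a).

    samples : list of (s, a, r, s_next) transitions
    phi     : feature function phi(s, a) -> np.ndarray of length K
    policy  : current policy, policy(s) -> action (greedy w.r.t. the previous Q)
    """
    K = len(phi(*samples[0][:2]))
    A = ridge * np.eye(K)      # regularized normal-equation matrix
    b = np.zeros(K)
    for (s, a, r, s_next) in samples:
        f = phi(s, a)
        f_next = phi(s_next, policy(s_next))
        # Accumulate A = sum phi (phi - gamma phi')^T and b = sum phi * r
        A += np.outer(f, f - gamma * f_next)
        b += f * r
    return np.linalg.solve(A, b)

def greedy_policy(alpha, phi, actions):
    """Greedy action selection: argmax_a alpha^T phi(s, a) over a discrete action set."""
    def policy(s):
        return max(actions, key=lambda a: alpha @ phi(s, a))
    return policy

# Policy iteration sketch: alternate lstdq() and greedy_policy() until alpha converges.
```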
LSPI: Example learning to ride a bike
Lagoudakis & Parr, Journal of Machine Learning Research, 2003
Goal: learn to balance and ride a bicycle to a target position located 1 km away from the
starting location.
Continuous state: six-dimensional real-valued vector, including
- the angle of the handlebar,
- the vertical angle of the bicycle,
- the angle of the bicycle to the goal.
5 discrete actions: torque applied to the handlebar (discretized to {−2, 0, +2}) and
displacement of the rider (discretized to {−0.02, 0, +0.02}).
Model of the world: uniform noise on each action; model of the bike's dynamics.
The state-action value function Q(s, a) for a fixed action a is approximated by a
linear combination of 20 basis functions (linear and quadratic combinations of the
state and its 1st and 2nd order derivatives).
Collect training samples using a random policy; each episode lasts 20 steps.
The learned policy is evaluated 100 times to estimate the probability of success (reaching the
goal).
The shaping reward was 1% of the change (in meters) in the distance to the goal and of the
change in the square of the vertical angle.
Lagoudakis & Parr, Journal of Machine Learning Research, 2003
LSPI: Example learning to ride a bike
The policy after the first iteration balances
the bicycle, but fails to ride to the goal.
The policy after the second iteration is
heading towards the goal, but fails to
balance.
All policies thereafter balance and ride the
bicycle to the goal.
Crashes still happen because of noise in the model.
Successful policies are found after a few thousand training episodes.
With 5000 training episodes (60000 samples) the probability of success is about
95% and the expected number of balancing steps is about 70000 steps.
Lagoudakis & Parr, Journal of Machine Learning Research, 2003
LSPI: Example learning to ride a bike
RL in continuous state and action spaces
States and actions $s_t \in \mathbb{R}^N$ and $a_t \in \mathbb{R}^P$, $t = 1,\dots,T$, are continuous.
One can no longer sweep through all states and actions to determine the optimal policy.
Instead, one can either:
1) use function approximation to estimate the value function $V(s)$;
2) use function approximation to estimate the state-action value function $Q(s,a)$;
3) optimize a parameterized policy $\pi(a|s)$ (policy search).
Policy Gradients
An alternative is to parametrize the stochastic policy:
$\pi(a|s)$ is approximated by drawing from the distribution $p(a|s;\theta)$,
where $\theta$ are the parameters of the policy.
The actor, in state $s_t$, chooses an action $a_t$ following the stochastic policy
$\pi(a_t|s_t)$, i.e. by drawing from the distribution $p(a_t|s_t;\theta)$.
Policy learning by Weighted Exploration with the Returns (PoWER)
Kober and Peters, Machine Learning, 2011
Starting from a deterministic estimate of the policy,
$a = \pi(s;\theta) = \theta^T\phi(s) = \sum_{j=1}^{K}\theta_j\,\phi_j(s),$
where $\phi_j(s)$ are again a set of $K$ known basis functions and $\theta_j$ the weights for
each basis function.
Adding some noise and making the policy explicitly time-dependent,
$a = \pi(s,t;\theta) = (\theta + \epsilon_t)^T\phi(s,t), \qquad \epsilon_t \sim N(0,\hat\Sigma),$
leads to the stochastic estimate (structured state-dependent exploration):
$a \sim \pi(a\,|\,s,t;\theta) = N\big(a\;\big|\;\theta^T\phi(s,t),\;\phi(s,t)^T\hat\Sigma\,\phi(s,t)\big).$
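A small sketch of this structured exploration in Python; the feature vector `phi_st` and
covariance `Sigma_hat` are assumed inputs, and the scalar-action case is shown for simplicity.

```python
import numpy as np

def sample_action(theta, phi_st, Sigma_hat, rng=np.random):
    """Structured state-dependent exploration:
    a ~ N(theta^T phi(s,t), phi(s,t)^T Sigma_hat phi(s,t))."""
    mean = theta @ phi_st
    var = phi_st @ Sigma_hat @ phi_st
    return rng.normal(mean, np.sqrt(var))

def sample_action_param_noise(theta, phi_st, Sigma_hat, rng=np.random):
    """Equivalent view: perturb the parameters once, a = (theta + eps)^T phi(s,t),
    with eps ~ N(0, Sigma_hat)."""
    eps = rng.multivariate_normal(np.zeros(len(theta)), Sigma_hat)
    return (theta + eps) @ phi_st
```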
Policy learning by Weighted Exploration with the Returns (PoWER)
Finite Horizon T; episodic restarts
At each roll-out (episode), the agent gathers a tuple of rewards and associated states and
actions: $\tau = (s_{1:T+1}, a_{1:T}, r_{1:T}).$
Each roll-out is generated by using a policy with parameters $\theta$:
$\tau \sim p_\theta(\tau), \qquad p_\theta(\tau) = p(s_1)\prod_{t=1}^{T} p(s_{t+1}\,|\,s_t,a_t)\,\pi(a_t\,|\,s_t,t;\theta).$
The cumulative reward for roll-out $\tau$ is:
$R(\tau) = \frac{1}{T}\sum_{t=1}^{T} r(s_t, s_{t+1}, a_t, t).$
Kober and Peters, Machine Learning, 2011
Policy learning by Weighted Exploration with the Returns (PoWER)
Kober and Peters, Machine Learning, 2011
At each new trial, the parameters are computed as the weighted average of all previous trials
plus some Gaussian noise:
$\theta^{i+1} \sim N\Big(\sum_{i} w^{i}\,\theta^{i},\;\Sigma\Big), \qquad i = 1,\dots,\text{number of trials},$
where the weights $w^{i}$ are proportional to the returns of the corresponding roll-outs.
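An illustrative sketch of this reward-weighted update. It is a simplified, PoWER-style rule
that assumes non-negative returns; it is not Kober and Peters' exact update, whose weighting
and exploration schedule differ in detail.

```python
import numpy as np

def next_parameters(thetas, returns, Sigma, rng=np.random):
    """Draw the next trial's parameters around the return-weighted average
    of all previous parameter vectors (simplified PoWER-style update).

    thetas  : array of shape (n_trials, K), parameters used in past trials
    returns : array of shape (n_trials,), cumulative reward of each trial (assumed >= 0)
    Sigma   : (K, K) exploration covariance
    """
    w = np.asarray(returns, dtype=float)
    w = w / w.sum()                       # normalize returns into weights
    mean = w @ np.asarray(thetas)         # return-weighted average of past parameters
    return rng.multivariate_normal(mean, Sigma)
```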
Policy learning by Weighted Exploration with the Returns (PoWER)
Teaching a robot the ball-in-a-cup task.
State: joint angles and velocities of the robot + Cartesian coordinates of the ball.
Action: joint-space accelerations.
$\pi(a\,|\,x,\dot{x},t) \sim N\big(\theta^T\phi(x,\dot{x},t),\;\Sigma(x,\dot{x},t)\big)$, with 31 basis functions per DOF.
Kober and Peters, Machine Learning, 2011
Policy learning by Weighted Exploration with the Returns (PoWER)
Kober and Peters, Machine Learning, 2011
Policy learning by Weighted Exploration with the Returns (PoWER)
Inverse Reinforcement Learning
A major difficulty in RL is to determine the reward.
The reward encapsulates key information about the task.
A poor choice may lead to suboptimal solutions.
IRL was proposed as a framework to estimate what the reward could be.
The reward is chosen such that it best explains the observed behavior of an
"expert agent". The expert is an agent that knows how to perform
the task (in an optimal manner).
Estimation of the policy and the reward is then done simultaneously.
Inverse Reinforcement Learning
Consider first a discrete state-action space.
Assume a parametrization of the reward function:
$r(s) = \theta^T\phi(s) = \sum_{j=1}^{K}\theta_j\,\phi_j(s),$
where $\phi_j: S \to [0,1]$ are bounded basis functions
and $\theta \in [0,1]^K$ is constrained so that the reward is also bounded, $r(s) \in [0,1]$.
Abbeel & Ng, Int. Conf. on Machine Learning, 2004
Inverse Reinforcement Learning
Given $r(s) = \theta^T\phi(s)$,
the value of a given policy $\pi$ over one roll-out of $T$ time steps is given by
$E\big[V^{\pi}(s)\big] = E\Big[\sum_{t=1}^{T}\gamma^{t}\,r(s_t)\Big]
= E\Big[\sum_{t=1}^{T}\gamma^{t}\,\theta^{T}\phi(s_t)\Big]
= \theta^{T}\,E\Big[\sum_{t=1}^{T}\gamma^{t}\,\phi(s_t)\Big]
= \theta^{T}\mu(\pi),$
where $\mu(\pi)$ denotes the expected features given policy $\pi$.
A particular policy is thus evaluated as a linear combination of features.
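A minimal Monte Carlo estimate of these feature expectations in Python; the roll-out format
and the feature map `phi` are assumed placeholders.

```python
import numpy as np

def feature_expectations(rollouts, phi, gamma=0.95):
    """Estimate mu(pi) = E[sum_t gamma^t phi(s_t)] from sampled roll-outs.

    rollouts : list of trajectories, each a list of states s_1, ..., s_T
    phi      : feature map, phi(s) -> np.ndarray of length K
    """
    mu = np.zeros(len(phi(rollouts[0][0])))
    for traj in rollouts:
        for t, s in enumerate(traj):
            mu += (gamma ** t) * phi(s)
    return mu / len(rollouts)   # average over roll-outs
```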
Inverse Reinforcement Learning
Step 1:
i) Record a set of $M$ demonstrated paths (a series of roll-outs), yielding a set of state
transitions $\{s^{i}_{1:T+1}\}_{i=1,\dots,M}$.
ii) Build an estimate of the optimal policy $\pi^{*}(s,a)$ of the demonstrator by computing the
expected features for policy $\pi^{*}(s,a)$:
$\mu(\pi^{*}) = \frac{1}{M}\sum_{i=1}^{M}\sum_{t=1}^{T}\gamma^{t}\,\phi(s^{i}_{t}).$
Inverse Reinforcement Learning
Step 2:
Initialize the agent's policy $\pi^{0}(s,a)$, e.g. by picking a random policy,
and then do, for $i = 0,\dots,n$ iterations:
i) Approximate (e.g. via Monte Carlo) the expected features $\mu(\pi^{i})$ for the policy $\pi^{i}$.
ii) Find the optimal weights $\theta^{i}$, solution of
$\max_{\theta}\;\min_{j=0,\dots,i-1}\;\theta^{T}\big(\mu(\pi^{*}) - \mu(\pi^{j})\big).$
iii) Set the current estimate of the reward to be $r^{i}(s) = (\theta^{i})^{T}\phi(s)$
and use RL to find the optimal associated policy $\pi^{i+1}$.
Terminate if $c^{i}$, the value of the max-min above, is inferior to a threshold.
Inverse Reinforcement Learning
The algorithm is trying to find a reward function $r^{i}(s) = (\theta^{i})^{T}\phi(s)$ such that
$V^{\pi^{*}} \geq V^{\pi^{j}} + c,$
i.e. a reward on which the expert does better, by a margin of $c$, than any of the $i-1$
policies found previously.
The optimization problem is equivalent to
$\max_{\theta, c}\; c$
subject to $\theta^{T}\mu(\pi^{*}) \geq \theta^{T}\mu(\pi^{j}) + c, \quad j = 1,\dots,i-1,$
and $\|\theta\| \leq 1.$
This is akin to the maximum-margin optimization in SVMs.
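A small sketch of this max-margin step using scipy's general-purpose optimizer. This is only
one possible solver choice; Abbeel & Ng also describe a simpler projection method, and a
dedicated QP/SOCP solver would be more robust for the non-smooth max-min objective.

```python
import numpy as np
from scipy.optimize import minimize

def max_margin_theta(mu_expert, mu_policies):
    """Solve  max_{theta, c} c  s.t.  theta^T mu_expert >= theta^T mu_j + c  for all j,
    ||theta||_2 <= 1.  Returns (theta, c)."""
    mu_policies = np.atleast_2d(mu_policies)

    def neg_margin(theta):
        # For a fixed theta, the best achievable c is the smallest gap over all policies.
        gaps = (mu_expert - mu_policies) @ theta
        return -np.min(gaps)

    k = len(mu_expert)
    cons = [{"type": "ineq", "fun": lambda th: 1.0 - np.linalg.norm(th)}]  # ||theta|| <= 1
    res = minimize(neg_margin, x0=np.ones(k) / np.sqrt(k), constraints=cons)
    theta = res.x
    return theta, -neg_margin(theta)
```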
Learning an optimal policy to drive a car, or learning driving styles (nice, nasty, right-lane
nice, right-lane nasty, middle lane). Abbeel & Ng, Int. Conf. on Machine Learning, 2004
IRL Example: Driving a car
Agent car is faster than any
other car and needs to learn
when and how to pass other
cars.
5 actions (steer onto any of
the 3 lanes or outside the
road)
5 features indicate where the
car is
10 features for discretized
distances to closest car
30 iteration steps for learning each policy through IRL: (from top to bottom, left to
right: nice, nasty, right lane nice, right lane nasty, middle lane)
Learning based on 2 minutes of expert demonstration of each driving style.
IRL Example: Driving a car
Learning without reward: Donut
Flipping a block to stand on its end; launching a ball into a basket.
The robot is provided solely with failed examples.
It has no information about the task – no reward, no indication of what
was incorrect.
Grollman and Billard, ICRA 2012, Best Paper Award Cognitive Robotics
Build a distribution (DONUT) that moves away from the bad demonstrations
Explore around the demonstrations and use the covariance to guide the
exploration.
Move away from things that have been visited a lot during unsuccessful
demonstrations but remain within the vicinity of the demonstrations.
2D DONUT distribution; high likelihood of
drawing datapoints in the vicinity but away from
the demonstrated datapoints.
Original distribution of points
from failed demonstrations
Learning without reward: Donut
Exploration parameter: determines how far away the peaks are from the base distribution.
Base distribution: learned from the training examples (failed attempts) using a GMM.
Donut distribution.
Exploratory policy.
Learning without reward: Donut
Continuous state: joint angle $s$; continuous action: joint velocity $a$.
Draw $a$ from the Donut distribution $D(a|s)$, constructed from the base Gaussian
$N(a\,|\,s;\mu,\Sigma)$ and a narrowed Gaussian $N(a\,|\,s;\mu,\Sigma/f)$, so as to place low
probability on the demonstrated (failed) actions ($f$: exploration parameter).
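An illustrative way to sample from such a "donut-like" exploration distribution: a generic
difference-of-Gaussians construction via rejection sampling. This is not necessarily Grollman
and Billard's exact formulation; `mu` and `Sigma` are assumed to be the mean vector and
covariance of the base Gaussian, and `f` the exploration parameter.

```python
import numpy as np
from scipy.stats import multivariate_normal

def sample_donut(mu, Sigma, f=4.0, n_max=1000, rng=np.random):
    """Sample an action near mu but away from it: propose from the wide base Gaussian
    N(mu, Sigma) and reject proportionally to the narrowed Gaussian N(mu, Sigma/f)."""
    wide = multivariate_normal(mu, Sigma)
    narrow = multivariate_normal(mu, Sigma / f)
    for _ in range(n_max):
        a = wide.rvs()
        # Acceptance probability is low where the narrow Gaussian (the "hole") is high.
        p_reject = narrow.pdf(a) / narrow.pdf(mu)
        if rng.rand() > p_reject:
            return a
    return wide.rvs()   # fallback if everything was rejected
```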
High coherence across trials → high confidence.
Little coherence across trials → low confidence.
(Figure: velocity vs. angular position (radian) of the robot's wrist.)
• Search around the demonstrations
• Reproduce only parts where all demonstrators agreed
• Avoid regions with high uncertainty
Learning without reward: Donut
The approach finds a solution in a few trials.
It is comparable in efficiency to classical reinforcement learning approaches,
but does not need a reward!
Learning without reward: Donut
Learning without reward: Donut
Multidimensional DONUT applied to learning how to play golf.
Region in parameter space that leads to a successful hit.
State: position of ball with respect to the hole
Action: speed and orientation of end-effector
when hitting the ball
Learning without reward: Donut
Multidimensional DONUT: create holes in the distribution located at each unsuccessful attempt.
Lightness corresponds to the likelihood of generating arbitrary
2D parameters. Red dots are human attempts, blue are system-
generated trials.
Learning without reward: Donut
Multidimensional DONUT applied to learning how to play golf
Grollman and Billard, IJSR, 2012
Summary
RL is a good alternative to supervised learning when you do not know what the optimal
solution would be.
The drawbacks of classical discrete state-action RL can be overcome by considering
continuous value functions.
The reward function can be estimated through IRL.
Learning without rewards and from purely unsuccessful examples is as yet a little-explored
problem.