TRANSCRIPT
MACHINE LEARNING – 2012
MACHINE LEARNING TECHNIQUES
AND APPLICATIONS
Reinforcement Learning
for Continuous State and Action Spaces
Gradient Methods
Reinforcement Learning (RL)
Supervised Learning
Unsupervised Learning
Reinforcement learning is in-between: the learner is not told the correct output (as in
supervised learning), but receives an evaluative reward signal (unlike unsupervised learning).
Drawbacks of classical RL
Curse of dimensionality:
computational costs increase dramatically with the number of states.
Markov world:
cannot handle continuous state and action spaces.
Model-based vs. model-free:
a model of the world is needed (it can be estimated through exploration).
Gradient methods are introduced to handle continuous state and action spaces.
RL in continuous state and action spaces
States and actions $s_t \in \mathbb{R}^N$ and $a_t \in \mathbb{R}^P$, $t = 1,\dots,T$, are continuous.
One can no longer sweep through all states and actions to determine the optimal policy.
Instead, one can either:
1) use function approximation to estimate the value function $V(s)$;
2) use function approximation to estimate the state-action value function $Q(s,a)$;
3) optimize a parameterized policy $\pi(a|s)$ (policy search).
Policy Gradients
Parametrize the value function: $V(s;\theta)$, with $\theta$ the open parameters.
Assume an initial estimate $V(s;\theta^k)$.
Run an episode using a greedy policy $\pi(a|s;\theta)$.
Compute the error on the estimate: $e = V(s;\theta) - E\big[\sum_t r_t\big]$.
Find the optimal parameters through gradient descent on the state value function:
$\Delta\theta \propto -\,e\,\dfrac{\partial V(s;\theta)}{\partial\theta}$
TD learning for continuous state-action spaces
Doya, NIPS 1996
Parametrize the value function such that
$V(s;\theta) = \theta^T \phi(s) = \sum_{j=1}^{K} \theta_j\,\phi_j(s).$
$\phi(s)$ is a set of $K$ basis functions (e.g. RBF functions).
These are set by the user and are also called the features.
$\theta_j$, $j = 1,\dots,K$, are the weights associated to each feature. These are
the unknown parameters.
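As a concrete (illustrative) choice of features, not specified on the slide, one could take
Gaussian RBFs centred at points $c_j$ with widths $\sigma_j$:
$$\phi_j(s) = \exp\!\left(-\frac{\|s - c_j\|^2}{2\sigma_j^2}\right), \qquad V(s;\theta) = \sum_{j=1}^{K}\theta_j\,\phi_j(s).$$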
TD learning for continuous state-action spaces
Doya, NIPS 1996
Pick a set of parameters $\theta$ and compute $V(s;\theta) = \sum_{j=1}^{K}\theta_j\,\phi_j(s)$.
Do some roll-outs of fixed episode length $T$ and gather the rewards $r_t(s_t)$, $t = 1,\dots,T$.
Estimate the new $V(s;\theta)$ through TD learning.
Gradient descent on the squared TD error gives, for $j = 1,\dots,K$:
$\theta_j \leftarrow \theta_j + \epsilon\,\big[r_t(s_t) + \gamma\,V(s_{t+1};\theta) - V(s_t;\theta)\big]\,\phi_j(s_t).$
Other techniques for the estimation of non-linear regression functions have been proposed
elsewhere to get a better estimate of the parameters than simple gradient descent.
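A minimal sketch of this TD(0) procedure in Python, assuming Gaussian RBF features over a
one-dimensional state; `env_step` and `policy` are hypothetical placeholders, and the learning
rate, discount factor and feature centres are illustrative choices, not values from Doya (1996).

```python
import numpy as np

# Illustrative Gaussian RBF features over a 1-D state (centres/width are arbitrary choices).
centres = np.linspace(-1.0, 1.0, 10)
sigma = 0.2

def features(s):
    # phi_j(s) = exp(-(s - c_j)^2 / (2 sigma^2))
    return np.exp(-(s - centres) ** 2 / (2 * sigma ** 2))

def V(s, theta):
    # V(s; theta) = theta^T phi(s)
    return theta @ features(s)

def td0_update(theta, s, r, s_next, alpha=0.05, gamma=0.95):
    # TD(0) error: delta = r + gamma * V(s') - V(s)
    delta = r + gamma * V(s_next, theta) - V(s, theta)
    # Gradient of V w.r.t. theta is phi(s), so the update is
    # theta_j <- theta_j + alpha * delta * phi_j(s)
    return theta + alpha * delta * features(s)

# Usage sketch with a stand-in environment (env_step and policy are placeholders):
# theta = np.zeros_like(centres)
# s = 0.0
# for t in range(T):
#     a = policy(s)                     # e.g. greedy w.r.t. a model, or a fixed controller
#     s_next, r = env_step(s, a)
#     theta = td0_update(theta, s, r, s_next)
#     s = s_next
```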
Robotics Applications of continuous RL
Teaching a two-joint, three-link robot leg to stand up
Robot configuration. θ0: pitch angle, θ1: hip joint angle, θ2: knee joint angle, θm: the angle of the
line from the center of mass to the center of the foot.
Morimoto and Doya, Robotics and Autonomous Systems , 2001
Robotics Applications of continuous RL
Morimoto and Doya, Robotics and Autonomous Systems , 2001
• The final goal is the upright stand-up posture.
• Reaching the final goal is a necessary but not sufficient condition for a successful
stand-up, because the robot may fall down after passing through the final goal;
hence the need to define subgoals.
• The stand-up task is accomplished when the robot stands up and stays upright
for more than 2(T + 1) seconds.
Robotics Applications of continuous RL
Morimoto and Doya, Robotics and Autonomous Systems , 2001
Hierarchical reinforcement learning:
Upper layer: discretize state-action space
and discover which subgoal should be
reached using Q-learning.
State: pitch and joint angles
Actions: joint displacement (not the torque).
Lower layer: continuous state-action space
The robot learns to apply appropriate torque
to achieve each subgoal.
State: pitch and joint angles, and their
velocities
Actions: torque at the two joints
Robotics Applications of continuous RL
Morimoto and Doya, Robotics and Autonomous Systems, 2001
Reward: decompose the task into an upper-level and a lower-level set of goals.
Y: height of the head of the robot at a sub-goal posture; L: total length of the robot.
When the robot achieves a sub-goal, the learner gets a reward < 0.5.
The full reward is obtained when all subgoals and the main goal are achieved.
Robotics Applications of continuous RL
Morimoto and Doya, Robotics and Autonomous Systems , 2001
State and action are continuous in time: $s(t)$, $u(t)$.
Model the value function: $V(s;\theta) = \theta^T\phi(s)$.
Model the policy: $u(s) = f\big(w^T\phi(s) + \epsilon\big)$,
with $f$ a sigmoid function and $\epsilon$ exploration noise.
Learn $\theta$ through gradient descent on the squared TD error.
Backpropagate the TD error $e(t)$ to estimate the policy weights $w_j$, $j = 1,\dots,K$
(the update is proportional to the TD error $e(t)$ and to the exploration noise $\epsilon(t)$).
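A rough, discretized actor-critic sketch consistent with the slide (sigmoid actor over linear
features, critic trained on the TD error). It is an illustration, not Morimoto and Doya's exact
continuous-time formulation; the learning rates and noise scale are arbitrary assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def select_action(w_pi, phi_s, noise_std=0.1, rng=np.random):
    """Actor: sigmoid of a linear combination of features, plus exploration noise."""
    eps = rng.randn() * noise_std
    u = sigmoid(w_pi @ phi_s + eps)
    return u, eps

def actor_critic_update(theta_v, w_pi, phi_s, phi_s_next, r, eps,
                        alpha_v=0.05, alpha_pi=0.01, gamma=0.95):
    """One discretized critic/actor update with linear features phi(s)."""
    # Critic: TD error from the observed reward and the value estimates.
    delta = r + gamma * (theta_v @ phi_s_next) - (theta_v @ phi_s)
    # Critic update: gradient descent on the squared TD error.
    theta_v = theta_v + alpha_v * delta * phi_s
    # Actor update: reinforce the exploration noise that preceded a positive TD error.
    w_pi = w_pi + alpha_pi * delta * eps * phi_s
    return theta_v, w_pi
```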
Robotics Applications of continuous RL
• 750 trials in simulation + 170 on real robot
• Goal: to stand up
Morimoto and Doya, Robotics and Autonomous Systems , 2001
Robotics Applications of continuous RL
Learning how to swing up a pendulum
Atkeson & Schaal, Intern. Conf. on Machine Learning, ICML 1997; Schaal, NIPS, 1997
Separate the problem into two parts:
a) learning to swing the pole up;
b) learning to balance the pole upright.
Part a is learned through TD-learning.
Part b is learned using optimal control.
Additionally, a model of the dynamics of the inverted pendulum is estimated through
locally weighted regression (estimating the parameters of a known model).
Learning how to swing up a pendulum
Atkeson & Schaal, Intern. Conf. on Machine Learning, ICML 1997; Schaal, NIPS, 1997
Collect a sequence $\{s_t^*, u_t^*\}_{t=1}^{T}$ from a single human demonstration.
State of the system: $s_t = (x_t, \dot{x}_t, \theta_t, \dot{\theta}_t)$
(human hand trajectory and velocity, angular trajectory and velocity of the pendulum).
Actions: $u_t$, the hand command (which can be converted into a torque command to the robot).
Reward: minimizes the angular and hand displacements as well as the velocities.
Robotics Applications of continuous RL
Learn first a model of the task through RL, using continuous TD learning to estimate $V(s;\theta)$:
take a model $V(s;\theta) = \sum_{j=1}^{K}\theta_j\,\phi_j(s)$.
The parameters are estimated incrementally through non-linear function approximation
(locally weighted regression).
Learn then the reward:
$r(x_t, u_t, t) = (x_t - x_t^*)^T Q\,(x_t - x_t^*) + u_t^T R\,u_t.$
$Q$ and $R$ are estimated so as to minimize discrepancies between human and robot trajectories.
Use the reward to generate optimal trajectories from optimal control.
Atkeson & Schaal, Intern. Conf. on Machine Learning, ICML 1997; Schaal, NIPS, 1997
Robotics Applications of continuous RL
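For the optimal-control part, one standard way to act on such a quadratic reward is a
discrete-time LQR controller. This is a sketch under the assumption of locally linear
dynamics $x_{t+1} \approx A x_t + B u_t$ (matrices A, B are assumed inputs), not necessarily
the exact formulation used by Atkeson & Schaal.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def lqr_gain(A, B, Q, R):
    """Discrete-time LQR: minimizes sum_t x^T Q x + u^T R u for x_{t+1} = A x + B u."""
    # Solve the discrete algebraic Riccati equation for the cost-to-go matrix P.
    P = solve_discrete_are(A, B, Q, R)
    # Optimal feedback gain K, used as u_t = -K (x_t - x_star_t).
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    return K

# Usage sketch: track the demonstrated trajectory x_star with u_t = -K (x_t - x_star_t).
```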
Learning how to swing up a pendulum
Atkeson & Schaal, Intern. Conf. on Machine Learning, ICML 1997; Schaal, NIPS, 1997
Robotics Applications of continuous RL
RL in continuous state and action spaces
States and actions $s_t \in \mathbb{R}^N$ and $a_t \in \mathbb{R}^P$, $t = 1,\dots,T$, are continuous.
One can no longer sweep through all states and actions to determine the optimal policy.
Instead, one can either:
1) use function approximation to estimate the value function $V(s)$;
2) use function approximation to estimate the state-action value function $Q(s,a)$;
3) optimize a parameterized policy $\pi(a|s)$ (policy search).
Least-Squares Policy Iteration
Least-Squares Policy Iteration
Approximate the state-action value function:
$Q(s,a;\alpha) = \alpha^T\phi(s,a) = \sum_{j=1}^{K}\alpha_j\,\phi_j(s,a).$
$\phi_j(s,a)$, $j = 1,\dots,K$, is a set of basis functions $\phi_j: \mathbb{R}^N \times \mathbb{R}^P \to \mathbb{R}$.
These are set by the user and are also called the features.
$\alpha_j$ are the weights associated to each feature. These are
the unknown parameters.
Least-Squares Policy Iteration
Perform a roll-out for $T$ time steps using the policy $\pi(a|s;\alpha)$
(greedy search on $\arg\max_a Q(s,a;\alpha)$ using the current estimate of $Q(s,a;\alpha)$).
Collect samples $s_{1:T}$, $a_{1:T-1}$, $r_{2:T}$.
Determine how good an initial choice of the parameters $\alpha$ is by comparing the predicted
reward $Q(s_t,a_t;\alpha)$ with the actual reward $R_t$, $t = 1,\dots,T$:
$R_t = \sum_{t'=t}^{T}\gamma^{t'-t}\,r_{t'}, \qquad
J(\alpha) = \sum_{t=1}^{T}\big(Q(s_t,a_t;\alpha) - R_t\big)^2 \quad$ (Bellman residual error),
with $s' \sim T(s,a,s')$ the transition model and $\gamma \in [0,1]$ the discount factor.
See Lagoudakis & Parr, Journal of Machine Learning Research, 2003 for various numerical solutions to
determine the optimal alpha parameter.
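A compact sketch of the least-squares fit at the heart of LSPI (often called LSTDQ), solving
for $\alpha$ from collected samples. The sample format, feature function and ridge
regularization are illustrative assumptions rather than Lagoudakis & Parr's exact implementation.

```python
import numpy as np

def lstdq(samples, phi, policy, gamma=0.95, ridge=1e-6):
    """Least-squares fit of alpha in Q(s, a; alpha) = alpha^T phi(s, a).

    samples : list of (s, a, r, s_next) transitions
    phi     : feature function phi(s, a) -> np.ndarray of length K
    policy  : current policy, policy(s) -> action (greedy w.r.t. the previous Q)
    """
    K = len(phi(*samples[0][:2]))
    A = ridge * np.eye(K)      # regularized normal-equation matrix
    b = np.zeros(K)
    for (s, a, r, s_next) in samples:
        f = phi(s, a)
        f_next = phi(s_next, policy(s_next))
        # Accumulate A = sum phi (phi - gamma phi')^T and b = sum phi * r
        A += np.outer(f, f - gamma * f_next)
        b += f * r
    return np.linalg.solve(A, b)

def greedy_policy(alpha, phi, actions):
    """Greedy action selection: argmax_a alpha^T phi(s, a) over a discrete action set."""
    def policy(s):
        return max(actions, key=lambda a: alpha @ phi(s, a))
    return policy

# Policy iteration sketch: alternate lstdq() and greedy_policy() until alpha converges.
```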
LSPI: Example learning to ride a bike
Lagoudakis & Parr, Journal of Machine Learning Research, 2003
Goal: learn to balance and ride a bicycle to a target position located 1 km away from the
starting location.
Continuous state: six-dimensional real-valued vector, including
- the angle of the handlebar,
- the vertical angle of the bicycle,
- the angle of the bicycle to the goal.
5 discrete actions: torque applied to the handlebar (discretized to {−2, 0, +2}) and
displacement of the rider (discretized to {−0.02, 0, +0.02}).
Model of the world: uniform noise on each action; model of the bike's dynamics.
The state-action value function Q(s, a) for a fixed action a is approximated by a
linear combination of 20 basis functions (linear and quadratic combinations of the
state and its 1st and 2nd order derivatives).
Collect training samples using a random policy; each episode lasts 20 steps.
The learned policy is evaluated 100 times to estimate the probability of success (reaching the
goal).
The shaping reward was 1% of the change (in meters) in the distance to the goal and of the
change in the square of the vertical angle.
Lagoudakis & Parr, Journal of Machine Learning Research, 2003
LSPI: Example learning to ride a bike
The policy after the first iteration balances
the bicycle, but fails to ride to the goal.
The policy after the second iteration is
heading towards the goal, but fails to
balance.
All policies thereafter balance and ride the
bicycle to the goal.
Crashes still happen because of noise in the model.
Successful policies are found after a few thousand training episodes.
With 5000 training episodes (60000 samples) the probability of success is about
95% and the expected number of balancing steps is about 70000 steps.
Lagoudakis & Parr, Journal of Machine Learning Research, 2003
LSPI: Example learning to ride a bike
RL in continuous state and action spaces
States and actions $s_t \in \mathbb{R}^N$ and $a_t \in \mathbb{R}^P$, $t = 1,\dots,T$, are continuous.
One can no longer sweep through all states and actions to determine the optimal policy.
Instead, one can either:
1) use function approximation to estimate the value function $V(s)$;
2) use function approximation to estimate the state-action value function $Q(s,a)$;
3) optimize a parameterized policy $\pi(a|s)$ (policy search).
Policy Gradients
An alternative is to parametrize the stochastic policy:
$\pi(a|s)$ is approximated by drawing from the distribution $p(a|s;\theta)$,
where $\theta$ are the parameters of the policy.
The actor, in state $s_t$, chooses an action $a_t$ following the stochastic policy
$\pi(a_t|s_t)$, i.e. by drawing from the distribution $p(a_t|s_t;\theta)$.
Policy learning by Weighted Exploration with the Returns (PoWER)
Kober and Peters, Machine Learning, 2011
Starting from a deterministic estimate of the policy,
$a = \pi(s;\theta) = \theta^T\phi(s) = \sum_{j=1}^{K}\theta_j\,\phi_j(s),$
where $\phi_j(s)$ are again a set of $K$ known basis functions and $\theta_j$ the weights for
each basis function.
Adding some noise and making the policy explicitly time-dependent,
$a = \pi(s,t;\theta) = (\theta + \epsilon_t)^T\phi(s,t), \qquad \epsilon_t \sim N(0,\hat\Sigma),$
leads to the stochastic estimate (structured state-dependent exploration):
$a \sim \pi(a\,|\,s,t;\theta) = N\big(a\;\big|\;\theta^T\phi(s,t),\;\phi(s,t)^T\hat\Sigma\,\phi(s,t)\big).$
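A small sketch of this structured exploration in Python; the feature vector `phi_st` and
covariance `Sigma_hat` are assumed inputs, and the scalar-action case is shown for simplicity.

```python
import numpy as np

def sample_action(theta, phi_st, Sigma_hat, rng=np.random):
    """Structured state-dependent exploration:
    a ~ N(theta^T phi(s,t), phi(s,t)^T Sigma_hat phi(s,t))."""
    mean = theta @ phi_st
    var = phi_st @ Sigma_hat @ phi_st
    return rng.normal(mean, np.sqrt(var))

def sample_action_param_noise(theta, phi_st, Sigma_hat, rng=np.random):
    """Equivalent view: perturb the parameters once, a = (theta + eps)^T phi(s,t),
    with eps ~ N(0, Sigma_hat)."""
    eps = rng.multivariate_normal(np.zeros(len(theta)), Sigma_hat)
    return (theta + eps) @ phi_st
```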
Policy learning by Weighted Exploration with the Returns (PoWER)
Finite Horizon T; episodic restarts
At each roll-out (episode), the agent gathers a tuple of rewards and associated states and
actions: $\tau = (s_{1:T+1}, a_{1:T}, r_{1:T}).$
Each roll-out is generated by using a policy with parameters $\theta$:
$\tau \sim p_\theta(\tau), \qquad p_\theta(\tau) = p(s_1)\prod_{t=1}^{T} p(s_{t+1}\,|\,s_t,a_t)\,\pi(a_t\,|\,s_t,t;\theta).$
The cumulative reward for roll-out $\tau$ is:
$R(\tau) = \frac{1}{T}\sum_{t=1}^{T} r(s_t, s_{t+1}, a_t, t).$
Kober and Peters, Machine Learning, 2011
Policy learning by Weighted Exploration with the Returns (PoWER)
Kober and Peters, Machine Learning, 2011
At each new trial, the parameters are computed as the weighted average of all previous trials
plus some Gaussian noise:
$\theta^{i+1} \sim N\Big(\sum_{i} w^{i}\,\theta^{i},\;\Sigma\Big), \qquad i = 1,\dots,\text{number of trials},$
where the weights $w^{i}$ are proportional to the returns of the corresponding roll-outs.
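An illustrative sketch of this reward-weighted update. It is a simplified, PoWER-style rule
that assumes non-negative returns; it is not Kober and Peters' exact update, whose weighting
and exploration schedule differ in detail.

```python
import numpy as np

def next_parameters(thetas, returns, Sigma, rng=np.random):
    """Draw the next trial's parameters around the return-weighted average
    of all previous parameter vectors (simplified PoWER-style update).

    thetas  : array of shape (n_trials, K), parameters used in past trials
    returns : array of shape (n_trials,), cumulative reward of each trial (assumed >= 0)
    Sigma   : (K, K) exploration covariance
    """
    w = np.asarray(returns, dtype=float)
    w = w / w.sum()                       # normalize returns into weights
    mean = w @ np.asarray(thetas)         # return-weighted average of past parameters
    return rng.multivariate_normal(mean, Sigma)
```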
Policy learning by Weighted Exploration with the Returns (PoWER)
Teaching a robot the ball-in-a-cup task.
State: joint angles and velocities of the robot + Cartesian coordinates of the ball.
Action: joint-space accelerations.
$\pi(a\,|\,x,\dot{x},t) \sim N\big(\theta^T\phi(x,\dot{x},t),\;\Sigma(x,\dot{x},t)\big)$, with 31 basis functions per DOF.
Kober and Peters, Machine Learning, 2011
Policy learning by Weighted Exploration with the Returns (PoWER)
Kober and Peters, Machine Learning, 2011
Policy learning by Weighted Exploration with the Returns (PoWER)
Inverse Reinforcement Learning
A major difficulty in RL is to determine the reward.
The reward encapsulates key information about the task.
A poor choice may lead to suboptimal solutions.
IRL was proposed as a framework to estimate what the reward could be.
The reward is chosen such that it best explains the observed behavior of an
"expert agent". The expert is an agent that knows how to perform
the task (in an optimal manner).
Estimation of the policy and the reward is then done simultaneously.
Inverse Reinforcement Learning
Consider first a discrete state-action space.
Assume a parametrization of the reward function:
$r(s) = \theta^T\phi(s) = \sum_{j=1}^{K}\theta_j\,\phi_j(s),$
where $\phi_j: S \to [0,1]$ are bounded basis functions
and $\theta \in [0,1]^K$ is constrained so that the reward is also bounded, $r(s) \in [0,1]$.
Abbeel & Ng, Int. Conf. on Machine Learning, 2004
Inverse Reinforcement Learning
Given $r(s) = \theta^T\phi(s)$,
the value of a given policy $\pi$ over one roll-out of $T$ time steps is given by
$E\big[V^{\pi}(s)\big] = E\Big[\sum_{t=1}^{T}\gamma^{t}\,r(s_t)\Big]
= E\Big[\sum_{t=1}^{T}\gamma^{t}\,\theta^{T}\phi(s_t)\Big]
= \theta^{T}\,E\Big[\sum_{t=1}^{T}\gamma^{t}\,\phi(s_t)\Big]
= \theta^{T}\mu(\pi),$
where $\mu(\pi)$ denotes the expected features given policy $\pi$.
A particular policy is thus evaluated as a linear combination of features.
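A minimal Monte Carlo estimate of these feature expectations in Python; the roll-out format
and the feature map `phi` are assumed placeholders.

```python
import numpy as np

def feature_expectations(rollouts, phi, gamma=0.95):
    """Estimate mu(pi) = E[sum_t gamma^t phi(s_t)] from sampled roll-outs.

    rollouts : list of trajectories, each a list of states s_1, ..., s_T
    phi      : feature map, phi(s) -> np.ndarray of length K
    """
    mu = np.zeros(len(phi(rollouts[0][0])))
    for traj in rollouts:
        for t, s in enumerate(traj):
            mu += (gamma ** t) * phi(s)
    return mu / len(rollouts)   # average over roll-outs
```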
Inverse Reinforcement Learning
Step 1:
i) Record a set of $M$ demonstrated paths (a series of roll-outs), yielding a set of state
transitions $\{s^{i}_{1:T+1}\}_{i=1,\dots,M}$.
ii) Build an estimate of the optimal policy $\pi^{*}(s,a)$ of the demonstrator by computing the
expected features for policy $\pi^{*}(s,a)$:
$\mu(\pi^{*}) = \frac{1}{M}\sum_{i=1}^{M}\sum_{t=1}^{T}\gamma^{t}\,\phi(s^{i}_{t}).$
Inverse Reinforcement Learning
Step 2:
Initialize the agent's policy $\pi^{0}(s,a)$, e.g. by picking a random policy,
and then do, for $i = 0,\dots,n$ iterations:
i) Approximate (e.g. via Monte Carlo) the expected features $\mu(\pi^{i})$ for the policy $\pi^{i}$.
ii) Find the optimal weights $\theta^{i}$, solution of
$\max_{\theta}\;\min_{j=0,\dots,i-1}\;\theta^{T}\big(\mu(\pi^{*}) - \mu(\pi^{j})\big).$
iii) Set the current estimate of the reward to be $r^{i}(s) = (\theta^{i})^{T}\phi(s)$
and use RL to find the optimal associated policy $\pi^{i+1}$.
Terminate if $c^{i}$, the value of the max-min above, is inferior to a threshold.
Inverse Reinforcement Learning
The algorithm is trying to find a reward function $r^{i}(s) = (\theta^{i})^{T}\phi(s)$ such that
$V^{\pi^{*}} \geq V^{\pi^{j}} + c,$
i.e. a reward on which the expert does better, by a margin of $c$, than any of the $i-1$
policies found previously.
The optimization problem is equivalent to
$\max_{\theta, c}\; c$
subject to $\theta^{T}\mu(\pi^{*}) \geq \theta^{T}\mu(\pi^{j}) + c, \quad j = 1,\dots,i-1,$
and $\|\theta\| \leq 1.$
This is akin to the maximum-margin optimization in SVMs.
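A small sketch of this max-margin step using scipy's general-purpose optimizer. This is only
one possible solver choice; Abbeel & Ng also describe a simpler projection method, and a
dedicated QP/SOCP solver would be more robust for the non-smooth max-min objective.

```python
import numpy as np
from scipy.optimize import minimize

def max_margin_theta(mu_expert, mu_policies):
    """Solve  max_{theta, c} c  s.t.  theta^T mu_expert >= theta^T mu_j + c  for all j,
    ||theta||_2 <= 1.  Returns (theta, c)."""
    mu_policies = np.atleast_2d(mu_policies)

    def neg_margin(theta):
        # For a fixed theta, the best achievable c is the smallest gap over all policies.
        gaps = (mu_expert - mu_policies) @ theta
        return -np.min(gaps)

    k = len(mu_expert)
    cons = [{"type": "ineq", "fun": lambda th: 1.0 - np.linalg.norm(th)}]  # ||theta|| <= 1
    res = minimize(neg_margin, x0=np.ones(k) / np.sqrt(k), constraints=cons)
    theta = res.x
    return theta, -neg_margin(theta)
```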
Learning an optimal policy to drive a car, or learning driving styles (nice, nasty, right-lane
nice, right-lane nasty, middle lane). Abbeel & Ng, Int. Conf. on Machine Learning, 2004
IRL Example: Driving a car
Agent car is faster than any
other car and needs to learn
when and how to pass other
cars.
5 actions (steer onto any of
the 3 lanes or outside the
road)
5 features indicate where the
car is
10 features for discretized
distances to closest car
30 iteration steps for learning each policy through IRL: (from top to bottom, left to
right: nice, nasty, right lane nice, right lane nasty, middle lane)
Learning based on 2 minutes of expert demonstration of each driving style.
IRL Example: Driving a car
Learning without reward: Donut
Flipping a block to stand on its end; launching a ball into a basket.
The robot is provided solely with failed examples.
It has no information about the task – no reward, no indication of what
was incorrect.
Grollman and Billard, ICRA 2012, Best Paper Award Cognitive Robotics
Build a distribution (DONUT) that moves away from the bad demonstrations
Explore around the demonstrations and use the covariance to guide the
exploration.
Move away from things that have been visited a lot during unsuccessful
demonstrations but remain within the vicinity of the demonstrations.
2D DONUT distribution; high likelihood of
drawing datapoints in the vicinity but away from
the demonstrated datapoints.
Original distribution of points
from failed demonstrations
Learning without reward: Donut
Exploration parameter: determines how far away the peaks are from the base distribution.
Base distribution: learned from the training examples (failed attempts) using a GMM.
Donut distribution.
Exploratory policy.
Learning without reward: Donut
Continuous state: joint angle $s$; continuous action: joint velocity $a$.
Draw $a$ from the Donut distribution $D(a|s)$, constructed from the base Gaussian
$N(a\,|\,s;\mu,\Sigma)$ and a narrowed Gaussian $N(a\,|\,s;\mu,\Sigma/f)$, so as to place low
probability on the demonstrated (failed) actions ($f$: exploration parameter).
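An illustrative way to sample from such a "donut-like" exploration distribution: a generic
difference-of-Gaussians construction via rejection sampling. This is not necessarily Grollman
and Billard's exact formulation; `mu` and `Sigma` are assumed to be the mean vector and
covariance of the base Gaussian, and `f` the exploration parameter.

```python
import numpy as np
from scipy.stats import multivariate_normal

def sample_donut(mu, Sigma, f=4.0, n_max=1000, rng=np.random):
    """Sample an action near mu but away from it: propose from the wide base Gaussian
    N(mu, Sigma) and reject proportionally to the narrowed Gaussian N(mu, Sigma/f)."""
    wide = multivariate_normal(mu, Sigma)
    narrow = multivariate_normal(mu, Sigma / f)
    for _ in range(n_max):
        a = wide.rvs()
        # Acceptance probability is low where the narrow Gaussian (the "hole") is high.
        p_reject = narrow.pdf(a) / narrow.pdf(mu)
        if rng.rand() > p_reject:
            return a
    return wide.rvs()   # fallback if everything was rejected
```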
High coherence across trials → high confidence.
Little coherence across trials → low confidence.
(Figure: velocity vs. angular position (radian) of the robot's wrist.)
• Search around the demonstrations
• Reproduce only parts where all demonstrators agreed
• Avoid regions with high uncertainty
Learning without reward: Donut
The approach finds a solution in a few trials.
It is comparable in efficiency to classical reinforcement learning approaches,
but does not need a reward!
Learning without reward: Donut
Learning without reward: Donut
Multidimensional DONUT applied to learning how to play golf.
Region in parameter space that leads to a successful hit.
State: position of ball with respect to the hole
Action: speed and orientation of end-effector
when hitting the ball
Learning without reward: Donut
Multidimensional DONUT: create holes in the distribution located at each unsuccessful attempt.
Lightness corresponds to the likelihood of generating arbitrary
2D parameters. Red dots are human attempts, blue are system-
generated trials.
Learning without reward: Donut
Multidimensional DONUT applied to learning how to play golf
Grollman and Billard, IJSR, 2012
Summary
RL is a good alternative to supervised learning when you do not know what the optimal
solution would be.
The drawbacks of classical discrete state-action RL can be overcome by considering
continuous value functions.
The reward function can be estimated through IRL.
Learning without rewards and from purely unsuccessful examples is as yet a little-explored
problem.