
Page 1: Reinforcement Learning (Unsupervised learning)

Page 2: So far …

• Supervised machine learning: given a set of annotated instances and a set of categories, find a model to automatically categorize unseen instances
• Unsupervised learning: no examples are available
• Previous lesson, clustering: given a set of instances described by feature vectors, group them into clusters such that intra-cluster similarity is maximized and inter-cluster dissimilarity is maximized

Page 3: Reinforcement learning

• Reinforcement learning:
  – The agent receives no examples and starts with no model of the environment.
  – The agent gets feedback through rewards, or reinforcement.

Page 4: Learning by reinforcement

• Examples:
  – Learning to play Backgammon
  – A robot learning to move in an environment
• Characteristics:
  – No direct training examples – (delayed) rewards instead
  – Need for exploration of the environment & exploitation
  – The environment might be stochastic and/or unknown
  – The actions of the learner affect future rewards

Page 5: Example: robot moving in an environment

Page 6: Example: playing a game (e.g. backgammon, chess)

Page 7: Reinforcement Learning vs. Supervised Learning

• Supervised learning: the input is an instance, the output is a classification of the instance
• Reinforcement learning: the input is some "goal", the output is a sequence of actions to meet the goal

Page 8: Elements of RL

• Transition model δ: how actions A influence states S
• Reward R: the immediate value of a state-action transition
• Policy π: maps states to actions

[Diagram: the agent-environment loop: the agent's policy maps the observed state to an action; the environment returns a reward and the next state]

In RL, the "model" to be learned is a policy that meets "at best" some given goal (e.g. win a game)
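As a concrete illustration of this loop, here is a minimal Python sketch; the function and variable names are placeholders I have introduced, not part of the slides:

```python
# Minimal sketch of the agent-environment interaction loop shown above.
# `policy`, `transition`, and `reward` stand for the three elements of RL.
def run_episode(start_state, policy, transition, reward, is_goal, max_steps=100):
    state, total = start_state, 0.0
    for _ in range(max_steps):
        action = policy(state)             # policy: maps states to actions
        total += reward(state, action)     # reward: immediate value of the transition
        state = transition(state, action)  # transition model: how actions change states
        if is_goal(state):                 # stop when the goal is reached
            break
    return total
```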

Page 9: Markov Decision Process (MDP)

• An MDP is a formal model of the RL problem
• At each discrete time point:
  – The agent observes state s_t and chooses action a_t (according to some probability distribution)
  – It receives reward r_t from the environment, and the state changes to s_{t+1}
• Markov assumption: r_t = r(s_t, a_t) and s_{t+1} = δ(s_t, a_t), i.e. r_t and s_{t+1} depend only on the current state and action
  – In general, the functions r and δ may not be deterministic (i.e. they are stochastic) and are not necessarily known to the agent
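To make the notation concrete, a deterministic MDP can be stored as two lookup tables for δ and r; the toy states and values below are purely illustrative, not the grid example used later:

```python
# Toy deterministic MDP: delta (transition) and r (reward) keyed by (state, action).
delta = {("A", "go"): "B", ("B", "go"): "C", ("A", "stay"): "A"}
r     = {("A", "go"): 0,   ("B", "go"): 1,   ("A", "stay"): 0}

# Markov assumption: r_t and s_{t+1} depend only on the current state and action.
s_t, a_t = "A", "go"
r_t, s_next = r[(s_t, a_t)], delta[(s_t, a_t)]   # r_t = 0, s_next = "B"
```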

Page 10: Agent's Learning Task

Execute actions in the environment, observe the results, and:

• Learn an action policy π : S → A that maximises the expected cumulative reward

    E[ r_t + γ r_{t+1} + γ^2 r_{t+2} + … ]

  from any starting state in S. Here γ (0 ≤ γ < 1) is the discount factor for future rewards.

• Note:
  – The target function is π : S → A
  – There are no training examples of the form (s, a), but only of the form ((s, a), r)
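A one-line way to see what the discounted sum rewards, assuming a finite reward sequence (the numbers below are illustrative):

```python
# Discounted cumulative reward: r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ...
def discounted_return(rewards, gamma=0.9):
    return sum(gamma ** i * r for i, r in enumerate(rewards))

print(round(discounted_return([0, 0, 100]), 2))   # 0 + 0.9*0 + 0.81*100 = 81.0
```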

Page 11: Example: TD-Gammon

• Immediate reward:
  – +100 if win
  – -100 if lose
  – 0 for all other states
• Trained by playing 1.5 million games against itself
• Now approximately equal to the best human player

Page 12: Example: Mountain-Car

• States: position and velocity
• Actions: accelerate forward, accelerate backward, coast
• Rewards (either formulation):
  – Reward = -1 for every step, until the car reaches the top
  – Reward = 1 at the top, 0 otherwise, with γ < 1
• The eventual reward is maximised by minimising the number of steps to the top of the hill

Page 13: Value function

We will consider deterministic worlds first: the rewards and the state transitions are known.

• Given a policy π (adopted by the agent), define an evaluation function over states:

    V^π(s_t) = r_t + γ r_{t+1} + γ^2 r_{t+2} + … = Σ_{i≥0} γ^i r_{t+i}

  which can be written recursively as

    V^π(s_t) = r_t + γ V^π(s_{t+1})
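Under these deterministic assumptions, the recursion can be turned directly into an iterative evaluation procedure. The sketch below assumes π, δ and r are given as dictionaries and that the goal state is absorbing with reward 0 (my assumption, not stated on the slide):

```python
# Iterative evaluation of V^pi using V(s) = r(s, pi(s)) + gamma * V(delta(s, pi(s))).
def evaluate_policy(states, pi, delta, r, gamma=0.9, sweeps=100):
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):          # repeat the backup until the values settle
        for s in states:
            a = pi[s]
            V[s] = r[(s, a)] + gamma * V[delta[(s, a)]]
    return V
```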

Page 14: Example: robot in a grid environment

r(state, action): immediate reward values

[Figure: grid world with six states; arrows show the possible actions from each state; transitions entering the goal state G have immediate reward 100, all other transitions have reward 0]

Example: r((2,3), UP) = 100

Grid world environment: six possible states, arrows represent the possible actions, G is the goal state

Page 15: Example: robot in a grid environment

• Value function: maps states to state values, according to some policy π

    V^π(s_t) = r_t + γ r_{t+1} + γ^2 r_{t+2} + … = Σ_{i≥0} γ^i r_{t+i}

[Figure: the grid shown twice: once with the immediate reward values r(state, action) (100 for entering G, 0 elsewhere), and once with the V*(state) values: 90, 100 and 0 (G) in the top row; 81, 90, 100 in the bottom row]

The "value" of being in (1,2) according to some known π is 100

Page 16: Computing V(s) (if the policy is given)

Suppose the following is an optimal policy, denoted π*:

[Figure: the grid world with one action marked per state, i.e. the policy π*]

π* tells us what is the best thing to do in each state.
According to π*, we can compute the values of the states for this policy, denoted V^{π*} or simply V*.

Page 17:

r(s,a) (immediate reward) values and the resulting V*(s) values, with γ = 0.9.

For one optimal policy:

    V*(s6) = 100 + 0.9 * 0   = 100
    V*(s5) = 0   + 0.9 * 100 = 90
    V*(s4) = 0   + 0.9 * 90  = 81

using the recursion V*(s_t) = r_t + γ V*(s_{t+1}), where π* = argmax_π V^π(s) for all s.

However, π* is usually unknown, and the task is to learn the optimal policy.

Page 18: Learning π*

• The target function is π : state → action
• However…
  – We have no training examples of the form <state, action>
  – We only know examples of the form <<state, action>, reward>
• The reward might not be known for all states!

Page 19: Utility-based agents

• Try to learn V^{π*} (abbreviated V*)
• Perform look-ahead search to choose the best action from any state s:

    π*(s) = argmax_a [ r(s, a) + γ V*(δ(s, a)) ]

• This works well if the agent knows
  – δ : state × action → state
  – r : state × action → R
• When the agent doesn't know δ and r, it cannot choose actions this way

Page 20: Q-learning: a utility-based learner for deterministic environments

• Q-values
  – Define a new function Q, very similar to V*:

      Q(s, a) = r(s, a) + γ V*(δ(s, a))

  – If the agent learns Q, it can choose the optimal action even without knowing δ or r
• Using Q:

    π*(s) = argmax_a Q(s, a)

  instead of

    π*(s) = argmax_a [ r(s, a) + γ V*(δ(s, a)) ]
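The practical point is that once Q is available (say as a dictionary keyed by (state, action), a representation I am assuming here), the agent can act optimally with a plain argmax and no model of δ or r:

```python
# Greedy action selection from a learned Q-table: pi*(s) = argmax_a Q(s, a).
def best_action(Q, s, actions):
    return max(actions, key=lambda a: Q[(s, a)])
```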

Page 21: Learning the Q-value

• Note: Q and V* are closely related:

    V*(s) = max_{a'} Q(s, a')

• This allows us to write Q recursively as:

    Q(s_t, a_t) = r(s_t, a_t) + γ V*(δ(s_t, a_t)) = r(s_t, a_t) + γ max_{a'} Q(s_{t+1}, a')

• This kind of update rule is also known as Temporal Difference learning

Page 22: Learning the Q-value

• FOR each <s, a> DO
  – Initialize the table entry: Q̂(s, a) ← 0
• Observe the current state s
• WHILE (true) DO
  – Select an action a and execute it
  – Receive the immediate reward r
  – Observe the new state s'
  – Update the table entry for Q̂(s, a) as follows:

      Q̂(s, a) ← r + γ max_{a'} Q̂(s', a')

  – Move: record the transition from s to s' (s ← s')
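Putting the algorithm together on the six-state grid used in the following example slides; the transition table below is my encoding of the slides' figure (N/S/W/E moves, reward 1 for entering the goal s3), so treat the exact layout as an assumption:

```python
import random

# Grid from the example slides:   s1 s2 s3
#                                 s4 s5 s6   (s3 is the goal)
delta = {("s1", "E"): "s2", ("s1", "S"): "s4",
         ("s2", "W"): "s1", ("s2", "E"): "s3", ("s2", "S"): "s5",
         ("s4", "N"): "s1", ("s4", "E"): "s5",
         ("s5", "N"): "s2", ("s5", "W"): "s4", ("s5", "E"): "s6",
         ("s6", "N"): "s3", ("s6", "W"): "s5"}

def r(s, a):                        # reward 1 for entering the goal, 0 otherwise
    return 1 if delta[(s, a)] == "s3" else 0

gamma, states, actions = 0.9, ["s1", "s2", "s3", "s4", "s5", "s6"], ["N", "S", "W", "E"]
Q = {(s, a): 0.0 for s in states for a in actions}        # initialize every entry to 0

for episode in range(500):
    s = random.choice([x for x in states if x != "s3"])   # random non-goal start
    while s != "s3":                                      # an episode ends at the goal
        a = random.choice([a for a in actions if (s, a) in delta])  # explore randomly
        nxt = delta[(s, a)]
        Q[(s, a)] = r(s, a) + gamma * max(Q[(nxt, b)] for b in actions)  # table update
        s = nxt                                           # move to the new state

print(round(Q[("s6", "N")], 2), round(Q[("s5", "N")], 2), round(Q[("s4", "E")], 2))
# -> 1.0 0.9 0.81, matching the converged table shown later in the slides
```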

Page 23: Q-Learning: Example (γ = 0.9)

State layout:   s1 s2 s3
                s4 s5 s6

[Figure: the six-state grid; entering the goal state s3 yields reward 1, every other transition yields reward 0]

At each step:
• Select a possible action a in s and execute it
• Obtain the reward r
• Observe the new state s' reached when executing a from s
• Update the estimate: Q̂(s, a) ← r + γ max_{a'} Q̂(s', a')
• s ← s'

Step: s = s4, s' = s5 (action E, reward 0)

The "max Q̂" term looks for the "most promising action" from s'. But in s5 all entries are still 0, therefore Q̂(s4, E) = 0 + 0.9 * 0 = 0.

Q-values table:
        N     S     W     E
  s1    0     0     0     0
  s2    0     0     0     0
  s3    0     0     0     0
  s4    0     0     0     0
  s5    0     0     0     0
  s6    0     0     0     0
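The single table update on this slide, written out in code (the values come straight from the table above):

```python
gamma = 0.9
Q_s5 = {"N": 0.0, "S": 0.0, "W": 0.0, "E": 0.0}   # all of s5's entries are still 0
Q_s4_E = 0 + gamma * max(Q_s5.values())           # r = 0, so the update gives 0
print(Q_s4_E)                                     # 0.0
```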

Page 24: Q-Learning: Example (γ = 0.9)

Step: s = s5, s' = s6 (action E, reward 0)

Q̂(s5, E) ← 0 + 0.9 * max_{a'} Q̂(s6, a') = 0 + 0.9*0 = 0

        N     S     W     E
  s1    0     0     0     0
  s2    0     0     0     0
  s3    0     0     0     0
  s4    0     0     0     0
  s5    0     0     0     0
  s6    0     0     0     0

Page 25: Q-Learning: Example (γ = 0.9)

Step: s = s6, s' = s3 (action N, reward 1)

In s3 r = 1, so the entry is updated to 1:

Q̂(s6, N) ← 1 + 0.9 * max_{a'} Q̂(s3, a') = 1 + 0.9*0 = 1

        N     S     W     E
  s1    0     0     0     0
  s2    0     0     0     0
  s3    0     0     0     0
  s4    0     0     0     0
  s5    0     0     0     0
  s6    1     0     0     0

• GOAL REACHED: END OF FIRST EPISODE

Page 26: Q-Learning: Example (γ = 0.9)

Step: s = s4, s' = s1 (action N, reward 0)

Q̂(s4, N) ← 0 + 0.9 * max_{a'} Q̂(s1, a') = 0 + 0.9*0 = 0

        N     S     W     E
  s1    0     0     0     0
  s2    0     0     0     0
  s3    0     0     0     0
  s4    0     0     0     0
  s5    0     0     0     0
  s6    1     0     0     0

Page 27: Q-Learning: Example (γ = 0.9)

Step: s = s1, s' = s2 (action E, reward 0)

Q̂(s1, E) ← 0 + 0.9 * max_{a'} Q̂(s2, a') = 0 + 0.9*0 = 0

        N     S     W     E
  s1    0     0     0     0
  s2    0     0     0     0
  s3    0     0     0     0
  s4    0     0     0     0
  s5    0     0     0     0
  s6    1     0     0     0

Page 28: Q-Learning: Example (γ = 0.9)

Step: s = s2, s' = s3 (action E, reward 1)

Q̂(s2, E) ← 1 + 0.9 * max_{a'} Q̂(s3, a') = 1 + 0.9*0 = 1

        N     S     W     E
  s1    0     0     0     0
  s2    0     0     0     1
  s3    0     0     0     0
  s4    0     0     0     0
  s5    0     0     0     0
  s6    1     0     0     0

• GOAL REACHED: END OF 2nd EPISODE

Page 29: Q-Learning: Example (γ = 0.9)

Step: s = s4, s' = s5 (action E, reward 0)

Q̂(s4, E) ← 0 + 0.9 * max_{a'} Q̂(s5, a') = 0 + 0 = 0

        N     S     W     E
  s1    0     0     0     0
  s2    0     0     0     1
  s3    0     0     0     0
  s4    0     0     0     0
  s5    0     0     0     0
  s6    1     0     0     0

Page 30: Q-Learning: Example (γ = 0.9)

Step: s = s5, s' = s2 (action N, reward 0)

Q̂(s5, N) ← 0 + 0.9 * max_{a'} Q̂(s2, a') = 0 + 0.9*1 = 0.9

        N     S     W     E
  s1    0     0     0     0
  s2    0     0     0     1
  s3    0     0     0     0
  s4    0     0     0     0
  s5    0.9   0     0     0
  s6    1     0     0     0

Page 31: Q-Learning: Example (γ = 0.9)

Step: s = s2, s' = s3 (action E, reward 1)

Q̂(s2, E) ← 1 + 0.9 * max_{a'} Q̂(s3, a') = 1 + 0 = 1

        N     S     W     E
  s1    0     0     0     0
  s2    0     0     0     1
  s3    0     0     0     0
  s4    0     0     0     0
  s5    0.9   0     0     0
  s6    1     0     0     0

• GOAL REACHED: END OF 3rd EPISODE

Page 32: Q-Learning: Example (γ = 0.9)

Step: s = s4, s' = s5 (action E, reward 0)

Q̂(s4, E) ← 0 + 0.9 * max_{a'} Q̂(s5, a') = 0 + 0.9*0.9 = 0.81

        N     S     W     E
  s1    0     0     0     0
  s2    0     0     0     1
  s3    0     0     0     0
  s4    0     0     0     0.81
  s5    0.9   0     0     0
  s6    1     0     0     0

Page 33: Q-Learning: Example (γ = 0.9)

Step: s = s5, s' = s2 (action N, reward 0)

Q̂(s5, N) ← 0 + 0.9 * max_{a'} Q̂(s2, a') = 0 + 0.9*1 = 0.9

        N     S     W     E
  s1    0     0     0     0
  s2    0     0     0     1
  s3    0     0     0     0
  s4    0     0     0     0.81
  s5    0.9   0     0     0
  s6    1     0     0     0

Page 34: Q-Learning: Example (γ = 0.9)

Step: s = s2, s' = s3 (action E, reward 1)

Q̂(s2, E) ← 1 + 0.9 * max_{a'} Q̂(s3, a') = 1 + 0 = 1

        N     S     W     E
  s1    0     0     0     0
  s2    0     0     0     1
  s3    0     0     0     0
  s4    0     0     0     0.81
  s5    0.9   0     0     0
  s6    1     0     0     0

• GOAL REACHED: END OF 4th EPISODE

Page 35: Q-Learning: Example (γ = 0.9)

Step: s = s4, s' = s1 (action N, reward 0)

Q̂(s4, N) ← 0 + 0.9 * max_{a'} Q̂(s1, a') = 0 + 0 = 0

        N     S     W     E
  s1    0     0     0     0
  s2    0     0     0     1
  s3    0     0     0     0
  s4    0     0     0     0.81
  s5    0.9   0     0     0
  s6    1     0     0     0

Page 36: Q-Learning: Example (γ = 0.9)

Step: s = s1, s' = s2 (action E, reward 0)

Q̂(s1, E) ← 0 + 0.9 * max_{a'} Q̂(s2, a') = 0 + 0.9*1 = 0.9

        N     S     W     E
  s1    0     0     0     0.9
  s2    0     0     0     1
  s3    0     0     0     0
  s4    0     0     0     0.81
  s5    0.9   0     0     0
  s6    1     0     0     0

Page 37: Q-Learning: Example (γ = 0.9)

Step: s = s2, s' = s3 (action E, reward 1)

Q̂(s2, E) ← 1 + 0.9 * max_{a'} Q̂(s3, a') = 1 + 0 = 1

        N     S     W     E
  s1    0     0     0     0.9
  s2    0     0     0     1
  s3    0     0     0     0
  s4    0     0     0     0.81
  s5    0.9   0     0     0
  s6    1     0     0     0

• GOAL REACHED: END OF 5th EPISODE

Page 38: Q-Learning: Example (γ = 0.9)

After several iterations, the algorithm converges to the following table:

        N     S     W     E
  s1    0     0.72  0     0.9
  s2    0     0.81  0.81  1
  s3    0     0     0     0
  s4    0.81  0     0     0.81
  s5    0.9   0     0.72  0.9
  s6    1     0     0.81  0

[Figure: the same values drawn on the grid: Q̂ is 1 on the transitions entering the goal s3, 0.9 one step away, 0.81 two steps away, and 0.72 three steps away]
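These converged entries are exactly the fixed points of the recursion Q̂(s, a) = r(s, a) + γ max_{a'} Q̂(δ(s, a), a'); for instance, with γ = 0.9:

    Q̂(s2, E) = 1 + 0.9 * max_{a'} Q̂(s3, a') = 1 + 0.9*0  = 1
    Q̂(s5, N) = 0 + 0.9 * max_{a'} Q̂(s2, a') = 0.9 * 1    = 0.9
    Q̂(s4, E) = 0 + 0.9 * max_{a'} Q̂(s5, a') = 0.9 * 0.9  = 0.81
    Q̂(s1, S) = 0 + 0.9 * max_{a'} Q̂(s4, a') = 0.9 * 0.81 ≈ 0.72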

Page 39: A crawling robot using Q-Learning

An example of Q-learning using 16 states (4 positions for each of the two servos) and 4 possible actions (moving one position up or down for one of the servos). The robot is built from Lego components, two servos, a wheel encoder and a Dwengo board with a microcontroller. The robot learns using the Q-learning algorithm, and its reward is related to the distance moved.

Page 40: The non-deterministic case

• What if the reward and the state transition are not deterministic? For example, in Backgammon learning and playing depend on the rolls of the dice!
• Then V and Q need to be redefined by taking expected values:

    V^π(s_t) = E[ r_t + γ r_{t+1} + γ^2 r_{t+2} + … ] = E[ Σ_{i≥0} γ^i r_{t+i} ]

    Q(s, a) = E[ r(s, a) + γ V*(δ(s, a)) ]

• Similar reasoning applies, and a convergent update iteration exists
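A common way to obtain a convergent update in the non-deterministic case is to move each table entry only a fraction α toward the sampled target instead of overwriting it. The sketch below shows that standard variant; the learning-rate value is illustrative and its decay schedule is not specified in the slides:

```python
# Q-learning update for non-deterministic rewards/transitions:
# average toward the sampled target instead of replacing the entry outright.
def q_update(Q, s, a, reward, s_next, actions, gamma=0.9, alpha=0.1):
    target = reward + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
```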

Page 41: Summary

• Reinforcement learning is suitable for learning in uncertain environments where rewards may be delayed and subject to chance
• The goal of a reinforcement learning program is to maximise the eventual reward
• Q-learning is a form of reinforcement learning that doesn't require the learner to have prior knowledge of how its actions affect the environment