ONLINE Q-LEARNER USING MOVING PROTOTYPES

by Miguel Ángel Soto Santibáñez



Page 1: ONLINE Q-LEARNER USING MOVING PROTOTYPES

ONLINE Q-LEARNER USING MOVING PROTOTYPES

by

Miguel Ángel Soto Santibáñez

Page 2: ONLINE Q-LEARNER USING MOVING PROTOTYPES

Reinforcement Learning

What does it do?

Tackles the problem of learning control strategies for autonomous agents.

What is the goal?

The goal of the agent is to learn an action policy that maximizes the total reward it will receive from any starting state.
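Written out (a standard formulation of that objective, added here for reference rather than taken verbatim from the slides), with r_{t+i} the reward received i steps after time t and 0 ≤ γ < 1 the discount factor used later in these slides:

V^π(s_t) = r_t + γ·r_{t+1} + γ²·r_{t+2} + … = Σ_{i≥0} γ^i · r_{t+i}

The learned action policy π should maximize V^π(s) for every starting state s.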

Page 3: ONLINE Q-LEARNER USING MOVING PROTOTYPES

Reinforcement Learning

What does it need?

This method assumes that training information is available in the form of a real-valued reward signal given for each state-action transition,

i.e. (s, a, r).

What problems?

Very often, reinforcement learning fits a problem setting known as a Markov decision process (MDP).

Page 4: ONLINE Q-LEARNER USING MOVING PROTOTYPES

Reinforcement Learning vs. Dynamic programming

reward function: r(s, a) → r

state transition function: δ(s, a) → s'

(The contrast: dynamic programming assumes both functions are known in advance, while reinforcement learning must learn from observed transitions without them.)

Page 5: ONLINE Q-LEARNER USING MOVING PROTOTYPES

Q-learning

An off-policy control algorithm.

Advantage:

Converges to an optimal policy in both deterministic and nondeterministic MDPs.

Disadvantage:

Only practical on a small number of problems.

Page 6: ONLINE Q-LEARNER USING MOVING PROTOTYPES

Q-learning Algorithm

Initialize Q(s, a) arbitrarily
Repeat (for each episode):
    Initialize s
    Repeat (for each step of the episode):
        Choose a from s using an exploratory policy
        Take action a, observe r, s'
        Q(s, a) ← Q(s, a) + α[r + γ max_a' Q(s', a') − Q(s, a)]
        s ← s'

Page 7: ONLINE Q-LEARNER USING MOVING PROTOTYPES

Introduction to Q-learning Algorithm

• An episode: { (s1, a1, r1), (s2, a2, r2), … , (sn, an, rn) }

• s': the next state, δ(s, a) → s'

• Q(s, a): the learned estimate of the value of taking action a in state s

• γ, α: the discount factor and the learning rate used in the update rule

Page 8: ONLINE Q-LEARNER USING MOVING PROTOTYPES

A Sample Problem

[Figure: a grid world with two special cells, A and B; transitions yield rewards r = 8, r = 0, or r = −8.]

Page 9: ONLINE Q-LEARNER USING MOVING PROTOTYPES

States and actions

states:
 1   2   3   4   5
 6   7   8   9  10
11  12  13  14  15
16  17  18  19  20

actions: N, S, E, W

Page 10: ONLINE Q-LEARNER USING MOVING PROTOTYPES

The Q(s, a) function

[Table: one row per action (N, S, W, E) and one column per state (1–20); each cell holds Q(s, a).]

Page 11: ONLINE Q-LEARNER USING MOVING PROTOTYPES

Q-learning Algorithm

Initialize Q(s, a) arbitrarily
Repeat (for each episode):
    Initialize s
    Repeat (for each step of the episode):
        Choose a from s using an exploratory policy
        Take action a, observe r, s'
        Q(s, a) ← Q(s, a) + α[r + γ max_a' Q(s', a') − Q(s, a)]
        s ← s'

Page 12: ONLINE Q-LEARNER USING MOVING PROTOTYPES

Initializing the Q(s, a) function

states:   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20
N         0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
S         0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
W         0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
E         0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
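This all-zeros table is trivial to set up in code; a minimal sketch (NumPy and this particular array layout are my own choices, not part of the slides):

import numpy as np

actions = ["N", "S", "W", "E"]
n_states = 20                               # grid cells numbered 1..20
Q = np.zeros((len(actions), n_states))      # one row per action, one column per state, all Q(s, a) = 0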

Page 13: ONLINE Q-LEARNER USING MOVING PROTOTYPES

Q-learning Algorithm

Initialize Q(s, a) arbitrarily
Repeat (for each episode):
    Initialize s
    Repeat (for each step of the episode):
        Choose a from s using an exploratory policy
        Take action a, observe r, s'
        Q(s, a) ← Q(s, a) + α[r + γ max_a' Q(s', a') − Q(s, a)]
        s ← s'

Page 14: ONLINE Q-LEARNER USING MOVING PROTOTYPES

An episode

[Figure: the 4 × 5 grid of states 1–20, with the first episode's path drawn on it.]

Page 15: ONLINE Q-LEARNER USING MOVING PROTOTYPES

Q-learning Algorithm

Initialize Q(s, a) arbitrarily
Repeat (for each episode):
    Initialize s
    Repeat (for each step of the episode):
        Choose a from s using an exploratory policy
        Take action a, observe r, s'
        Q(s, a) ← Q(s, a) + α[r + γ max_a' Q(s', a') − Q(s, a)]
        s ← s'

Page 16: ONLINE Q-LEARNER USING MOVING PROTOTYPES

Calculating new Q(s, a) values

Update rule: Q(s, a) ← Q(s, a) + α[r + γ max_a' Q(s', a') − Q(s, a)], here with α = 1 and γ = 0.5.

1st step: Q(s12, E) ← 0 + 1·[0 + 0.5·(0) − 0] = 0

2nd step: Q(s13, N) ← 0 + 1·[0 + 0.5·(0) − 0] = 0

3rd step: Q(s8, W) ← 0 + 1·[0 + 0.5·(0) − 0] = 0

4th step: Q(s7, N) ← 0 + 1·[−8 + 0.5·(0) − 0] = −8
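These four updates can be checked mechanically. A small self-contained sketch follows (α = 1 and γ = 0.5 as in the slides; the episode's transitions 12 → 13 → 8 → 7 → 2 are inferred from the grid layout, since the slide only shows the updated (s, a) pairs):

import numpy as np

actions = ["N", "S", "W", "E"]
alpha, gamma = 1.0, 0.5
Q = np.zeros((4, 21))                       # states indexed 1..20, all zeros initially

# (state, action, reward, next_state) for the four steps of the first episode
episode = [(12, "E", 0, 13), (13, "N", 0, 8), (8, "W", 0, 7), (7, "N", -8, 2)]
for s, a, r, s_next in episode:
    i = actions.index(a)
    Q[i, s] += alpha * (r + gamma * Q[:, s_next].max() - Q[i, s])

print(Q[actions.index("N"), 7])             # -8.0, matching the table on the next slide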

Page 17: ONLINE Q-LEARNER USING MOVING PROTOTYPES

The Q(s, a) function after the first episode

states:   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20
N         0   0   0   0   0   0  -8   0   0   0   0   0   0   0   0   0   0   0   0   0
S         0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
W         0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
E         0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0

Page 18: ONLINE Q-LEARNER USING MOVING PROTOTYPES

A second episode

[Figure: the 4 × 5 grid of states 1–20, with the second episode's path drawn on it.]

Page 19: ONLINE Q-LEARNER USING MOVING PROTOTYPES

Calculating new Q(s, a) values

Q(s, a) ← Q(s, a) + α[r + γ max_a' Q(s', a') − Q(s, a)]

1st step: Q(s12, N) ← 0 + 1·[0 + 0.5·max{−8, 0, 0, 0} − 0] = 0

2nd step: Q(s7, E) ← 0 + 1·[0 + 0.5·(0) − 0] = 0

3rd step: Q(s8, E) ← 0 + 1·[0 + 0.5·(0) − 0] = 0

4th step: Q(s9, E) ← 0 + 1·[8 + 0.5·(0) − 0] = 8
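The same check for the second episode; note how the −8 stored at Q(s7, N) enters only through the max in the first step, where the zeros win (transitions again inferred from the grid, so treat them as an illustration):

import numpy as np

actions = ["N", "S", "W", "E"]
alpha, gamma = 1.0, 0.5
Q = np.zeros((4, 21))                       # states indexed 1..20
Q[actions.index("N"), 7] = -8               # carried over from the first episode

# (state, action, reward, next_state) for the four steps of the second episode
episode2 = [(12, "N", 0, 7), (7, "E", 0, 8), (8, "E", 0, 9), (9, "E", 8, 10)]
for s, a, r, s_next in episode2:
    i = actions.index(a)
    Q[i, s] += alpha * (r + gamma * Q[:, s_next].max() - Q[i, s])

print(Q[actions.index("N"), 12])            # 0.0: the -8 does not propagate through the max
print(Q[actions.index("E"), 9])             # 8.0, matching the table on the next slide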

Page 20: ONLINE Q-LEARNER USING MOVING PROTOTYPES

The Q(s, a) function after the second episode

states:   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20
N         0   0   0   0   0   0  -8   0   0   0   0   0   0   0   0   0   0   0   0   0
S         0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
W         0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
E         0   0   0   0   0   0   0   0   8   0   0   0   0   0   0   0   0   0   0   0

Page 21: ONLINE Q-LEARNER USING MOVING PROTOTYPES

The Q(s, a) function after a few episodes

states:   1   2   3   4   5   6    7    8    9  10  11   12   13   14  15  16  17  18  19  20
N         0   0   0   0   0   0   -8   -8   -8   0   0    1    2    4   0   0   0   0   0   0
S         0   0   0   0   0   0  0.5    1    2   0   0   -8   -8   -8   0   0   0   0   0   0
W         0   0   0   0   0   0   -8    1    2   0   0   -8  0.5    1   0   0   0   0   0   0
E         0   0   0   0   0   0    2    4    8   0   0    1    2   -8   0   0   0   0   0   0

Page 22: ONLINE Q-LEARNER USING MOVING PROTOTYPES

One of the optimal policies

[Same Q(s, a) table as on the previous slide; the action chosen in each state under this policy is highlighted on the original slide.]

Page 23: ONLINE Q-LEARNER USING MOVING PROTOTYPES

An optimal policy graphically

[Figure: the 4 × 5 grid of states 1–20, with an arrow in each state showing the action this policy chooses.]

Page 24: ONLINE Q-LEARNER USING MOVING PROTOTYPES

Another of the optimal policies

[Same Q(s, a) table again; a different set of optimal actions is highlighted on the original slide.]

Page 25: ONLINE Q-LEARNER USING MOVING PROTOTYPES

Another optimal policy graphically

[Figure: the 4 × 5 grid of states 1–20, with arrows showing this alternative optimal policy.]

Page 26: ONLINE Q-LEARNER USING MOVING PROTOTYPES

The problem with tabular Q-learning

What is the problem?

Only practical in a small number of problems because:

a) Q-learning can require many thousands of training iterations to converge in even modest-sized problems.

b) Very often, the memory resources required by this method become too large.

Page 27: ONLINE Q-LEARNER USING MOVING PROTOTYPES

Solution

What can we do about it?

Use generalization.

What are some examples?

Tile coding, Radial Basis Functions, Fuzzy function approximation, Hashing, Artificial Neural Networks, LSPI, Regression Trees, Kanerva coding, etc.

Page 28: ONLINE Q-LEARNER USING MOVING PROTOTYPES

Shortcomings

• Tile coding: Curse of Dimensionality.

• Kanerva coding: Static prototypes.

• LSPI: Requires a priori knowledge of the Q-function.

• ANN: Requires a large number of learning experiences.

• Batch + Regression trees: Slow and requires lots of memory.

Page 29: ONLINE Q-LEARNER USING MOVING PROTOTYPES

Needed properties

1) Memory requirements should not explode exponentially with the dimensionality of the problem.

2) It should tackle the pitfalls caused by the usage of "static prototypes".

3) It should try to reduce the number of learning experiences required to generate an acceptable policy.

NOTE: All this without requiring a priori knowledge of the Q-function.

Page 30: ONLINE Q-LEARNER USING MOVING PROTOTYPES

Overview of the proposed method

1) The proposed method limits the number of prototypes available to describe the Q-function (as in Kanerva coding).

2) The Q-function is modeled using a regression tree (as in the batch method proposed by Sridharan and Tesauro).

3) But, unlike in Kanerva coding, the prototypes are not static but dynamic.

4) The proposed method has the capacity to update the Q-function once for every available learning experience (it can be an online learner).
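The mechanics are easiest to see in a deliberately simplified sketch. The code below is not the thesis's regression-tree algorithm: it flattens the tree into adjacent 1-D interval prototypes (think of a flattened state-action key), and every threshold and name in it is an illustrative assumption. What it shares with the method described above are the three ingredients: a fixed prototype budget, one online value update per learning experience, and prototypes that move via rupture (splitting a cell where the prediction error is high) and merging (collapsing neighbouring cells whose values agree).

import bisect

class MovingPrototypes1D:
    """Fixed budget of interval 'prototypes' approximating Q over a 1-D input."""
    def __init__(self, low, high, budget=32, split_error=1.0, lr=0.5):
        self.bounds = [low, high]   # sorted boundaries; cell i spans bounds[i]..bounds[i+1]
        self.values = [0.0]         # one Q estimate per cell
        self.errors = [0.0]         # running absolute error per cell
        self.budget = budget
        self.split_error = split_error
        self.lr = lr

    def _cell(self, x):
        i = bisect.bisect_right(self.bounds, x) - 1
        return min(max(i, 0), len(self.values) - 1)

    def predict(self, x):
        return self.values[self._cell(x)]

    def update(self, x, target):
        i = self._cell(x)
        err = target - self.values[i]
        self.values[i] += self.lr * err                   # move the prototype's value
        self.errors[i] = 0.9 * self.errors[i] + 0.1 * abs(err)
        if self.errors[i] > self.split_error:
            if len(self.values) >= self.budget:
                self._merge_closest()                     # free capacity first
            self._rupture(self._cell(x), x)               # then split around the surprise

    def _rupture(self, i, x):
        # Split cell i at x so the region around the surprising sample gets its own prototype.
        lo, hi = self.bounds[i], self.bounds[i + 1]
        if lo < x < hi and len(self.values) < self.budget:
            self.bounds.insert(i + 1, x)
            self.values.insert(i + 1, self.values[i])
            self.errors[i] = 0.0
            self.errors.insert(i + 1, 0.0)

    def _merge_closest(self):
        # Merge the adjacent pair of cells whose values agree the most.
        if len(self.values) < 2:
            return
        j = min(range(len(self.values) - 1),
                key=lambda k: abs(self.values[k] - self.values[k + 1]))
        self.values[j] = 0.5 * (self.values[j] + self.values[j + 1])
        del self.values[j + 1]
        del self.errors[j + 1]
        del self.bounds[j + 1]

In the actual method the prototypes live in a regression tree over the whole state-action space, so lookup, rupture and merging are tree operations rather than list operations, as sketched on the following slides.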

Page 31: ONLINE Q-LEARNER USING MOVING PROTOTYPES

Changes on the normal regression tree

Page 32: ONLINE Q-LEARNER USING MOVING PROTOTYPES

Basic operations in the regression tree

Rupture

Merging

Page 33: ONLINE Q-LEARNER USING MOVING PROTOTYPES

Impossible Merging

Page 34: ONLINE Q-LEARNER USING MOVING PROTOTYPES

Rules for a sound tree

[Figure: parent and children nodes illustrating which parent–child configurations keep the tree sound.]

Page 35: ONLINE Q-LEARNER USING MOVING PROTOTYPES

Impossible Merging

Page 36: ONLINE Q-LEARNER USING MOVING PROTOTYPES

Sample Merging

The “smallest predecessor”

Page 37: ONLINE Q-LEARNER USING MOVING PROTOTYPES

Sample Merging

List 1

Page 38: ONLINE Q-LEARNER USING MOVING PROTOTYPES

Sample Merging

The node to be inserted

Page 39: ONLINE Q-LEARNER USING MOVING PROTOTYPES

Sample Merging

List 1

List 1.1 List 1.2

Page 40: ONLINE Q-LEARNER USING MOVING PROTOTYPES

Sample Merging

Page 41: ONLINE Q-LEARNER USING MOVING PROTOTYPES

Sample Merging

Page 42: ONLINE Q-LEARNER USING MOVING PROTOTYPES

Sample Merging

Page 43: ONLINE Q-LEARNER USING MOVING PROTOTYPES

The agent

[Figure: the agent receives the detectors' signals and a reward, and emits the actuators' signals.]

Page 44: ONLINE Q-LEARNER USING MOVING PROTOTYPES

Applications

BOOK STORE

Page 45: ONLINE Q-LEARNER USING MOVING PROTOTYPES

Results first application

                              Tabular Q-learning    Moving Prototypes      Batch Method
Policy Quality                Best                  Best                   Worst
Computational Complexity      O(n)                  O(n log(n)) to O(n²)   O(n³)
Memory Usage                  Bad                   Best                   Worst

Page 46: ONLINE Q-LEARNER USING MOVING PROTOTYPES

Results first application (details)

                    Tabular Q-learning    Moving Prototypes    Batch Method
Policy Quality      $2,423,355            $2,423,355           $2,297,100
Memory Usage        10,202 prototypes     413 prototypes       11,975 prototypes

Page 47: ONLINE Q-LEARNER USING MOVING PROTOTYPES

Results second application

                              Moving Prototypes      LSPI (least-squares policy iteration)
Policy Quality                Best                   Worst
Computational Complexity      O(n log(n)) to O(n²)   O(n)
Memory Usage                  Worst                  Best

Page 48: ONLINE Q-LEARNER USING MOVING PROTOTYPES

Results second application (details)

                                 Moving Prototypes        LSPI (least-squares policy iteration)
Policy Quality                   forever (succeeded)      26 time steps (failed)
                                 forever (succeeded)      170 time steps (failed)
                                 forever (succeeded)      forever (succeeded)
Required Learning Experiences    216                      1,902,621
                                 324                      183,618
                                 216                      648
Memory Usage                     about 170 prototypes     2 weight parameters

Page 49: ONLINE Q-LEARNER USING MOVING PROTOTYPES

Results third application

Reason for this experiment:

Evaluate the performance of the proposed method in a scenario that we consider ideal for this method, namely one for which there is no application-specific knowledge available.

What it took to learn a good policy:

• Less than 2 minutes of CPU time.
• Less than 25,000 learning experiences.
• Less than 900 state-action-value tuples.

Page 50: ONLINE Q-LEARNER USING MOVING PROTOTYPES

Swimmer first movie

Page 51: ONLINE Q-LEARNER USING MOVING PROTOTYPES

Swimmer second movie

Page 52: ONLINE Q-LEARNER USING MOVING PROTOTYPES

Swimmer third movie

Page 53: ONLINE Q-LEARNER USING MOVING PROTOTYPES

Future Work

• Different types of splits.

• Continue characterization of the Moving Prototypes method.

• Moving prototypes + LSPI.

• Moving prototypes + Eligibility traces.