CS 1675 Intro to Machine Learning – Recitation, Week 14
Jeongmin Lee, Computer Science Department, University of Pittsburgh
Source: people.cs.pitt.edu/~jlee/teaching/cs1675/cs1675_week14.pdf


Page 1

Week 14
Jeongmin Lee
Computer Science Department
University of Pittsburgh
CS 1675 Intro to Machine Learning – Recitation

Page 2

• Recap on Reinforcement Learning

Page 3

(C) Hauskrecht
CS 1675 Introduction to Machine Learning, Lecture 26
Milos Hauskrecht, milos@cs.pitt.edu, 5329 Sennott Square

Reinforcement learning II

Reinforcement learning

Basics:
• Learner interacts with the environment
  – Receives input with information about the environment (e.g., from sensors)
  – Makes actions that (may) affect the environment
  – Receives a reinforcement signal that provides feedback on how well it performed

[Diagram: input x → Learner → output a (action); a Critic returns reinforcement r]

Page 4

Reinforcement learning

Objective: Learn how to act in the environment in order to maximize the reinforcement signal
• The selection of actions should depend on the input
• A policy $\pi: X \to A$ maps inputs to actions
• Goal: find the optimal policy that gives the best expected reinforcements

Example: learn how to play games (AlphaGo)

[Diagram: input x → Learner → output a; Critic returns reinforcement r]

Gambling example
• Game: 3 biased coins
  – The coin to be tossed is selected randomly from the three coin options. The agent always sees which coin is going to be played next. The agent bets $1 on either a head or a tail. If the outcome of the coin toss agrees with the bet, the agent wins $1; otherwise it loses $1.
• RL model:
  – Input: X – the coin chosen for the next toss
  – Action: A – the choice of head or tail the agent bets on
  – Reinforcements: {1, -1}
• A policy $\pi: X \to A$. Example:
  π(Coin1) = head, π(Coin2) = tail, π(Coin3) = head
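As a concrete illustration (a sketch of my own, not from the slides), the example policy and one betting round can be written in Python as below; the per-coin head probability `bias` is an assumed parameter, since the slides only say the coins are biased.

```python
# Hypothetical sketch of the gambling example: a policy pi: X -> A as a dict,
# plus one betting round that returns a reinforcement in {1, -1}.
import random

policy = {"Coin1": "head", "Coin2": "tail", "Coin3": "head"}  # the example policy from the slide

def play_one_round(coin, bias):
    """bias = P(head) for this coin (an assumed value; the true biases are not given)."""
    bet = policy[coin]
    outcome = "head" if random.random() < bias else "tail"
    return 1 if outcome == bet else -1

print(play_one_round("Coin2", bias=0.3))  # usually +1, since the tail bet tends to agree with the toss
```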

Page 5

Gambling example
RL model:
• Input: X – the coin chosen for the next toss
• Action: A – the choice of head or tail the agent bets on
• Reinforcements: {1, -1}
• A policy π: Coin1 → head, Coin2 → tail, Coin3 → head

State, action, reward trajectories:
  Step 0: state Coin2, action Tail, reward -1
  Step 1: state Coin1, action Head, reward 1
  Step 2: state Coin2, action Tail, reward 1
  ...
  Step k: state Coin1, action Head, reward 1

RL learning: objective functions

• Objective: Find a policy $\pi^{*}: X \to A$ that maximizes some combination of future reinforcements (rewards) received over time

• Valuation models (quantify how good the mapping is):
  – Finite horizon model: $E\left(\sum_{t=0}^{T} r_t\right)$, with time horizon $T > 0$
  – Infinite horizon discounted model: $E\left(\sum_{t=0}^{\infty} \gamma^{t} r_t\right)$, with discount factor $0 \le \gamma < 1$
  – Average reward: $\lim_{T \to \infty} \frac{1}{T} E\left(\sum_{t=0}^{T} r_t\right)$

• For the gambling example, the optimal policy is still to be found:
  π*: Coin1 → ?, Coin2 → ?, Coin3 → ?
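A small sketch (mine, not from the slides) of the three valuation models as empirical quantities computed from one observed reward sequence; the expectations on the slide would be estimated by averaging these over many trajectories.

```python
# Empirical versions of the three objective functions for one reward sequence r_0, r_1, ...
def finite_horizon_return(rewards, T):
    """sum_{t=0}^{T} r_t"""
    return sum(rewards[: T + 1])

def discounted_return(rewards, gamma):
    """sum_t gamma^t r_t, truncated to the rewards actually observed (0 <= gamma < 1)."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def average_reward(rewards):
    """(1/T) sum_{t=0}^{T} r_t over the observed horizon."""
    return sum(rewards) / len(rewards)
```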

Page 6

(Repeats the gambling example and the objective-functions slide from Page 5, and starts building a trajectory step by step.)

Step 0: state Coin2, action ?, reward ?

The agent always sees which coin is going to be played next and bets $1 on either a head or a tail. If the outcome of the coin toss agrees with the bet, the agent wins $1; otherwise it loses $1.

Page 7

(Repeats the slides from Page 5; the first step of the trajectory is filled in.)

Step 0: state Coin2, action Tail, reward -1

Page 8

(Repeats the slides; the trajectory continues.)

Step 0: state Coin2, action Tail, reward -1
Step 1: state Coin1, action Head, reward ?

Page 9

(Repeats the slides.)

Step 0: state Coin2, action Tail, reward -1
Step 1: state Coin1, action Head, reward 1

Page 10

(Repeats the slides; the full example trajectory:)

Step 0: state Coin2, action Tail, reward -1
Step 1: state Coin1, action Head, reward 1
Step 2: state Coin3, action Tail, reward 1
Step 3: state Coin1, action Head, reward -1
Step 4: state Coin2, action Tail, reward 1
Step 5: state Coin1, action Head, reward 1

Page 11

In a sequence of these steps, our goal is to maximize the total reward.

In other words, it is about finding an optimal policy.

Page 12

(Repeat of the gambling example and objective-functions slides.)

Page 13

How can we quantify the goodness of the actions?

Page 14

How can we quantify the goodness of the actions?: Valuation models

Page 15

(Repeats the objective-functions slide, focusing on the discounted model.)

T: the number of future steps
→ the farther in the future a reward is received, the more heavily it is discounted (by the factor γ^t)

Page 16

(Repeats the slides and the full trajectory above.)

Finite-horizon example: let's say T = 4.

Page 17

(Repeats the slides and the trajectory.)

Finite-horizon return with T = 4 (summing the rewards of steps 0 through 4):
  = -1 + 1 + 1 + (-1) + 1 = -2 + 3 = 1

Page 18

(Repeats the slides and the trajectory.)

Discounted return with T = 4 and γ = 0.5:
  = -1·(0.5)^0 + 1·(0.5)^1 + 1·(0.5)^2 + (-1)·(0.5)^3 + 1·(0.5)^4
  = -0.3125
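A quick check of the two worked examples above (my own sketch), using the step 0–4 rewards from the trajectory:

```python
# Rewards for steps 0..4 of the example trajectory.
rewards = [-1, 1, 1, -1, 1]

finite_horizon = sum(rewards)                                     # T = 4 -> 1
discounted = sum((0.5 ** t) * r for t, r in enumerate(rewards))   # gamma = 0.5 -> -0.3125

print(finite_horizon, discounted)  # 1 -0.3125
```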

Page 19

Two ways of getting rewards:
1. RL with immediate rewards (this recitation)
2. RL with delayed rewards

Page 20

RL with immediate rewards

• Expected reward: $E\left(\sum_{t=0}^{\infty} \gamma^{t} r_t\right)$, with discount factor $0 \le \gamma < 1$
• Immediate reward case:
  – The reward depends only on x and the action choice
  – The action does not affect the environment, and hence does not affect future inputs (states) and future rewards
  – $R(\mathbf{x}, a)$: the expected one-step reward for input x (the coin to play next) and the choice a

• Expected reward: $E\left(\sum_{t=0}^{\infty} \gamma^{t} r_t\right) = E(r_0) + \gamma E(r_1) + \gamma^{2} E(r_2) + \ldots$
• Optimal strategy: $\pi^{*}(\mathbf{x}) = \arg\max_{a} R(\mathbf{x}, a)$

(The example trajectory from Step 0 to Step 5 is shown again, with Step 0 marked as the current step.)

Page 21

Here, what matters is knowing the reward of action a at input x, that is, R(x, a).
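To make the argmax concrete, here is a small sketch (mine, not the slides'): if the expected one-step rewards were known, the optimal policy would simply pick the best bet for each coin. The head probabilities for coins 1 and 2 are the example values used later in the deck; the value for Coin3 is an assumed placeholder.

```python
# If R(x, a) were known, pi*(x) = argmax_a R(x, a) for each coin x.
p_head = {"Coin1": 0.9, "Coin2": 0.3, "Coin3": 0.5}  # Coin3's bias is a made-up placeholder

def expected_reward(coin, bet):
    """R(x, a) for rewards in {+1, -1}: P(bet wins) * 1 + P(bet loses) * (-1)."""
    p_win = p_head[coin] if bet == "head" else 1 - p_head[coin]
    return p_win - (1 - p_win)

optimal_policy = {coin: max(("head", "tail"), key=lambda a: expected_reward(coin, a))
                  for coin in p_head}
print(optimal_policy)  # e.g. {'Coin1': 'head', 'Coin2': 'tail', 'Coin3': 'head'}
```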

Page 22

RL with immediate rewards

The optimal choice assumes we know the expected reward $R(\mathbf{x}, a)$
• Then: $\pi^{*}(\mathbf{x}) = \arg\max_{a} R(\mathbf{x}, a)$

Caveats
• We do not know the expected reward $R(\mathbf{x}, a)$
  – We need to estimate it, using $\tilde{R}(\mathbf{x}, a)$ obtained from interaction
• We cannot determine the optimal policy if the estimate $\tilde{R}(\mathbf{x}, a)$ of the expected reward is not good
  – We also need to try actions that look suboptimal with respect to the current estimates $\tilde{R}(\mathbf{x}, a)$

Estimating R(x, a)
• Solution 1:
  – For each input x, try different actions a
  – Estimate $\tilde{R}(\mathbf{x}, a)$ as the average of the observed rewards:
    $\tilde{R}(\mathbf{x}, a) = \frac{1}{N_{x,a}} \sum_{i=1}^{N_{x,a}} r^{i}_{x,a}$
• Solution 2: online approximation
  – Update the estimate after performing action a in x and observing the reward $r_{x,a}$:
    $\tilde{R}^{(i)}(\mathbf{x}, a) \leftarrow (1 - \alpha(i))\,\tilde{R}^{(i-1)}(\mathbf{x}, a) + \alpha(i)\, r_{x,a}$, where $\alpha(i)$ is a learning rate
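A minimal sketch of the two estimation strategies (the function names are mine):

```python
def average_estimate(observed_rewards):
    """Solution 1: R~(x, a) as the sample average of the rewards observed for one (x, a) pair."""
    return sum(observed_rewards) / len(observed_rewards)

def online_update(r_estimate, reward, alpha):
    """Solution 2: R~(x, a) <- (1 - alpha) * R~(x, a) + alpha * r_{x,a}."""
    return (1 - alpha) * r_estimate + alpha * reward
```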

Page 23

(Repeats the slide above.)

Knowing the expected rewards means we know the outcome probabilities that determine R(x, a) for all possible pairs (x, a), e.g.,
  P(coin2, head) = 0.3, P(coin2, tail) = 0.7,
  P(coin1, head) = 0.9, P(coin1, tail) = 0.1, …
Then, given the current input (coin 1, 2, or 3), maximizing the reward would be much easier!

Page 24

(Repeats the slide and the example probabilities above.)

Page 25

So we need to estimate R(x, a); the estimate is denoted $\tilde{R}(\mathbf{x}, a)$.

(The slide and the example probabilities are repeated.)

Page 26

(Repeats the estimation slide.)

$N_{x,a}$: the number of samples for the pair (x, a). We are sampling actions and their rewards and taking the average of them.

Page 27

(Repeats the estimation slide.)

At the current step (Step 0, state Coin2), the bet can be Head or Tail; try this $N_{x,a}$ times to build the average for each pair.

Page 28

(Repeat of the estimation slide.)

Page 29

(Repeats the estimation slide.)

With this approach, we keep a "reward-memory" of all possible pairs of input x and action a, and update it each time we observe a pair.
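A sketch of such a reward-memory in Python (mine, not the slides'), using the online update with α = 0.5, the value used in the worked examples on the following pages:

```python
# One estimate per (input, action) pair; only the observed pair is updated.
coins, actions = ["Coin1", "Coin2", "Coin3"], ["head", "tail"]
R_tilde = {(x, a): 0.0 for x in coins for a in actions}  # all estimates start at 0

def observe(x, a, reward, alpha=0.5):
    """Online update: R~(x, a) <- (1 - alpha) * R~(x, a) + alpha * reward."""
    R_tilde[(x, a)] = (1 - alpha) * R_tilde[(x, a)] + alpha * reward

observe("Coin2", "tail", -1)        # Step 0 of the example trajectory
print(R_tilde[("Coin2", "tail")])   # -0.5
```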

Page 30

(Repeats the estimation slide; the trajectory restarts at Step 0: state Coin2, action ?, reward ?)

RL with immediate rewards
• At any step i during the experiment we have estimates of the expected rewards for each (coin, action) pair:
  $\tilde{R}^{(i)}(\text{coin1}, \text{head})$, $\tilde{R}^{(i)}(\text{coin1}, \text{tail})$, $\tilde{R}^{(i)}(\text{coin2}, \text{head})$, $\tilde{R}^{(i)}(\text{coin2}, \text{tail})$, $\tilde{R}^{(i)}(\text{coin3}, \text{head})$, $\tilde{R}^{(i)}(\text{coin3}, \text{tail})$
• Assume the next coin to play in step (i+1) is coin 2 and we pick head as our bet. Then we update $\tilde{R}^{(i+1)}(\text{coin2}, \text{head})$ using the observed reward and one of the update strategies above, and keep the reward estimates for the remaining (coin, action) pairs unchanged, e.g., $\tilde{R}^{(i+1)}(\text{coin2}, \text{tail}) = \tilde{R}^{(i)}(\text{coin2}, \text{tail})$

Exploration vs. Exploitation
• Uniform exploration:
  – Uses an exploration parameter $\varepsilon$, $0 \le \varepsilon \le 1$
  – Chooses the "current" best action $\hat{\pi}(\mathbf{x}) = \arg\max_{a \in A} \tilde{R}(\mathbf{x}, a)$ with probability $1 - \varepsilon$
  – All other choices are selected with a uniform probability $\varepsilon / (|A| - 1)$
Advantages:
• Simple, easy to implement
Disadvantages:
• Exploration is more appropriate at the beginning, when we do not have good estimates of $\tilde{R}(\mathbf{x}, a)$
• Exploitation is more appropriate later, when we have good estimates

(The reward-memory starts at 0 0 0 0 0 0; the current step is Step 0.)
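A minimal sketch of this uniform (ε-greedy) exploration rule, assuming the dict-based reward-memory from the earlier sketch; the value ε = 0.2 is just an illustrative choice.

```python
import random

def select_action(x, R_tilde, actions=("head", "tail"), epsilon=0.2):
    """Pick the current best bet with probability 1 - epsilon; otherwise pick
    uniformly among the remaining actions (each with prob. epsilon / (|A| - 1))."""
    best = max(actions, key=lambda a: R_tilde[(x, a)])
    if random.random() < 1 - epsilon:
        return best
    return random.choice([a for a in actions if a != best])
```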

Page 31

(Repeats the slides.)

Current step: Step 0, state Coin2, action Tail, reward ? Reward-memory: 0 0 0 0 0 0.

Page 32

(Repeats the slides.)

Current step: Step 0, state Coin2, action Tail, reward -1. Reward-memory (before the update): 0 0 0 0 0 0.

Page 33

(Repeats the slides.)

Online update for (Coin2, Tail) after observing reward -1, with α = 0.5:
  = (1 - 0.5)·0 + 0.5·(-1) = -0.5

Page 34

(Repeats the slides.)

  = (1 - 0.5)·0 + 0.5·(-1) = -0.5
Memory update! The entry for (Coin2, Tail) becomes -0.5; the reward-memory is now 0, 0, 0, -0.5, 0, 0.

Page 35

(Repeats the slides.)

Current step: Step 1, state Coin1, action Head, reward 1. Reward-memory: 0, 0, 0, -0.5, 0, 0.
What is the updated estimate for (Coin1, Head)? = ?

Page 36

(Repeats the slides.)

Online update for (Coin1, Head) after observing reward 1, with α = 0.5:
  = (1 - 0.5)·0 + 0.5·1 = 0.5


Step 2 plays coin 3 and we bet tail; the observed reward is +1, so the memory update is

R~(coin3, tail) = (1-0.5)*0 + (0.5)*1 = 0.5

and the memory becomes: coin 1: head 0.5, tail 0; coin 2: head 0, tail -0.5; coin 3: head 0, tail 0.5.


Step 3 plays coin 1 again and the bet is head; the observed reward is +1, so the memory update is

R~(coin1, head) = (1-0.5)*(0.5) + (0.5)*1 = 0.75

and the memory becomes: coin 1: head 0.75, tail 0; coin 2: head 0, tail -0.5; coin 3: head 0, tail 0.5.
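The whole memory walk-through can be reproduced in a few lines of Python; this is a sketch under the same assumptions as above (all estimates start at 0, online update with a fixed α = 0.5), with illustrative variable names.

```python
# Reproduce the memory of R~(coin, action) along the observed trajectory,
# using the online update with a fixed learning rate alpha = 0.5.

alpha = 0.5
r_hat = {(c, a): 0.0 for c in ("coin1", "coin2", "coin3") for a in ("head", "tail")}

trajectory = [            # (state, action, reward) as in the walk-through
    ("coin2", "tail", -1),
    ("coin1", "head", +1),
    ("coin3", "tail", +1),
    ("coin1", "head", +1),
]

for step, (coin, action, reward) in enumerate(trajectory):
    r_hat[(coin, action)] = (1 - alpha) * r_hat[(coin, action)] + alpha * reward
    print(f"after step {step}: {r_hat}")

# Final estimates: coin1/head = 0.75, coin2/tail = -0.5, coin3/tail = 0.5, rest 0.
```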


Exploration vs. Exploitation in RL

The learner actively interacts with the environment via actions:
• At the beginning the learner does not know anything about the environment
• It gradually gains experience and learns how to react to the environment
Dilemma (exploration-exploitation):
• After some number of steps, should I select the best current choice (exploitation) or try to learn more about the environment (exploration)?
• Exploitation may involve the selection of a sub-optimal action and prevent the learning of the optimal choice
• Exploration may spend too much time trying bad, currently suboptimal actions

Exploration vs. Exploitation
• In the RL framework:
– the learner actively interacts with the environment and chooses the action to play for the current input x
– at any point in time it also has an estimate R~(x,a) for any (input, action) pair
• Dilemma for choosing the action to play for x:
– Should the learner choose the current best action π̂(x) = argmax_{a∈A} R~(x,a) (exploitation),
– or choose some other action a which may help to improve its estimate R~(x,a) (exploration)?
This is called the exploration/exploitation dilemma; different exploration/exploitation strategies exist.


Exploration vs. Exploitation
• Boltzmann exploration
– The action is chosen randomly, but proportionally to its current expected reward estimate
– Can be tuned with a temperature parameter T to promote exploration or exploitation
• Probability of choosing action a:

p(a | x) = exp[ R~(x,a) / T ] / Σ_{a'∈A} exp[ R~(x,a') / T ]

• Effect of T:
– For high values of T, p(a | x) is approximately uniform over all actions
– For low values of T, p(a | x) of the action with the highest value of R~(x,a) approaches 1

Agent navigation example
• Agent navigation in a maze:
– 4 moves in compass directions
– Effects of moves are stochastic: we may wind up in a location other than the intended one with non-zero probability
– Objective: learn how to reach the goal state G in the shortest expected time
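A minimal sketch of Boltzmann (softmax) action selection with a temperature parameter, again assuming a dictionary of estimates keyed by (input, action); boltzmann_probs and boltzmann_select are illustrative names, not functions from the lecture.

```python
import math
import random

def boltzmann_probs(x, actions, r_hat, T):
    """p(a | x) proportional to exp(R~(x,a) / T)."""
    weights = [math.exp(r_hat.get((x, a), 0.0) / T) for a in actions]
    total = sum(weights)
    return [w / total for w in weights]

def boltzmann_select(x, actions, r_hat, T):
    """Sample one action according to the Boltzmann probabilities."""
    probs = boltzmann_probs(x, actions, r_hat, T)
    return random.choices(actions, weights=probs, k=1)[0]
```

Raising T flattens the distribution toward uniform, while lowering it concentrates probability on the action with the largest estimate, matching the effect of T described above.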


Example of the temperature T

Let's say the current input is x = coin 1, and we use the reward estimates obtained from the online updates:

          head    tail
coin 1    2.75    1.25
coin 2    6.5     -3.5
coin 3    2.1     5.8


Then the probability of betting head on coin 1 is

P(action = head | coin 1) = exp(2.75/T) / (exp(2.75/T) + exp(1.25/T))


If T is high, e.g. T = 1000:

P(action = head | coin 1) = exp(2.75/1000) / (exp(2.75/1000) + exp(1.25/1000)) ≈ 0.5004

i.e. the two bets are chosen almost uniformly.


If T is smaller, e.g. T = 1:

P(action = head | coin 1) = exp(2.75/1) / (exp(2.75/1) + exp(1.25/1)) ≈ 0.8176

i.e. the bet with the higher estimate is chosen most of the time.
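For reference, these two probabilities can be checked directly in Python (a throwaway snippet that just evaluates the formula above):

```python
import math

def p_head(T, r_head=2.75, r_tail=1.25):
    # Boltzmann probability of betting head on coin 1 at temperature T
    return math.exp(r_head / T) / (math.exp(r_head / T) + math.exp(r_tail / T))

print(p_head(1000))  # ~0.5004  (high T: close to uniform)
print(p_head(1))     # ~0.8176  (low T: the better-looking action dominates)
```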


Thanks!

Questions?