computer science education lab, umass, amherst ...smaji/cmpsci689/slides/lec23_rl...markov decision...

Reinforcement Learning
Kevin Spiteri, April 21, 2015


Page 1:

Reinforcement Learning

Kevin Spiteri, April 21, 2015

Page 2:

n-armed bandit

Page 3:

n-armed bandit

[Figure: three arms with payoff probabilities 0.9, 0.5, 0.1]

Page 4:

n-armed bandit

[Figure: arms with payoff probabilities 0.9, 0.5, 0.1; initial estimates 0.0, 0.0, 0.0]

Page 5:

n-armed bandit

Arm payoff probabilities: 0.9, 0.5, 0.1

Estimate: 0.0, 0.0, 0.0
Attempts: 0, 0, 0 (total 0)
Payoff: 0, 0, 0 (total 0)

Page 6:

n-armed bandit

[Figure: estimate, attempts, and payoff table after 1 pull; total attempts 1, total payoff 1, average payoff 1.0]

Page 7:

n-armed bandit

[Figure: estimate, attempts, and payoff table after 2 pulls; total attempts 2, total payoff 1, average payoff 0.5]

Page 8:

Exploration

[Figure: estimate, attempts, and payoff table after 3 pulls; total attempts 3, total payoff 2, average payoff 0.67]
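The exploration step this slide illustrates is usually implemented as ε-greedy selection. A minimal sketch, not from the slides themselves; the ε = 0.1 default is an assumption:

```python
import random

def epsilon_greedy(estimates, epsilon=0.1):
    """Pick the arm with the highest estimate, but explore a
    uniformly random arm with probability epsilon."""
    if random.random() < epsilon:
        return random.randrange(len(estimates))
    return max(range(len(estimates)), key=lambda i: estimates[i])

# With epsilon=0 the greedy arm is always chosen.
print(epsilon_greedy([0.0, 1.0, 0.5], epsilon=0.0))  # → 1
```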

Page 9:

Going on …

True probabilities: 0.9, 0.5, 0.1
Estimates: 0.9, 0.5, 0.1
Attempts: 280, 10, 10 (total 300)
Payoff: 252, 5, 1 (total 258; average 0.86)

Page 10:

Changing environment

True probabilities: 0.7, 0.8, 0.1 (changed)
Estimates: 0.9, 0.5, 0.1
Attempts: 280, 10, 10 (total 300)
Payoff: 252, 5, 1 (total 258; average 0.86)

Page 11:

Changing environment

True probabilities: 0.7, 0.8, 0.1
Estimates: 0.8, 0.65, 0.1
Attempts: 560, 20, 20 (total 600)
Payoff: 448, 13, 2 (total 463; average 0.77)

Page 12:

Changing environment

True probabilities: 0.7, 0.8, 0.1
Estimates: 0.74, 0.74, 0.1
Attempts: 1400, 50, 50 (total 1500)
Payoff: 1036, 37, 5 (total 1078; average 0.72)

Page 13:

n-armed bandit

● Optimal payoff (0.82): 0.9 x 300 + 0.8 x 1200 = 1230

● Actual payoff (0.72): 0.9 x 280 + 0.5 x 10 + 0.1 x 10 + 0.7 x 1120 + 0.8 x 40 + 0.1 x 40 = 1078
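A run like the one tabulated on Pages 5 through 12 can be simulated. This is a sketch, not the slides' actual experiment: only the arm probabilities 0.9, 0.5, 0.1 come from the slides; the ε-greedy rule and seed are assumptions:

```python
import random

def run_bandit(probs, pulls, epsilon=0.1, seed=0):
    """Track per-arm attempts and payoff, as in the slides' tables,
    while pulling epsilon-greedily on the running estimates."""
    rng = random.Random(seed)
    attempts = [0] * len(probs)
    payoff = [0] * len(probs)
    for _ in range(pulls):
        estimates = [payoff[i] / attempts[i] if attempts[i] else 0.0
                     for i in range(len(probs))]
        if rng.random() < epsilon:
            arm = rng.randrange(len(probs))
        else:
            arm = max(range(len(probs)), key=lambda i: estimates[i])
        attempts[arm] += 1
        if rng.random() < probs[arm]:
            payoff[arm] += 1
    return attempts, payoff

attempts, payoff = run_bandit([0.9, 0.5, 0.1], 300)
print(sum(attempts), sum(payoff) / sum(attempts))
```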

Page 14:

n-armed bandit

● Evaluation vs. instruction.

● Discounting.

● Initial estimates.

● There is no single best or standard way.
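Discounting old data and choosing initial estimates both enter through the standard incremental update rule. A sketch; the step size 0.1 and the optimistic initial value are assumptions:

```python
def update(estimate, reward, step_size=0.1):
    """Constant step size: recent rewards count more, so the estimate
    can track a changing environment (unlike a plain sample average)."""
    return estimate + step_size * (reward - estimate)

q = 1.0  # optimistic initial estimate encourages early exploration
for r in [1, 0, 0, 1, 0]:
    q = update(q, r)
print(round(q, 3))
```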

Page 15:

Markov Decision Process (MDP)

Page 16:

Markov Decision Process (MDP)
● States

Page 17:

Markov Decision Process (MDP)
● States

Page 18:

Markov Decision Process (MDP)
● States
● Actions

[Figure: a state with outgoing actions a, b, c]

Page 19:

Markov Decision Process (MDP)
● States
● Actions
● Model

[Figure: action a branches with probabilities 0.25 and 0.75; actions b and c also shown]

Page 20:

Markov Decision Process (MDP)
● States
● Actions
● Model

[Figure: action a branches with probabilities 0.25 and 0.75; actions b and c also shown]

Page 21:

Markov Decision Process (MDP)
● States
● Actions
● Model
● Reward

[Figure: action a branches with probabilities 0.25 and 0.75; rewards 5, -1, 0]

Page 22:

Markov Decision Process (MDP)
● States
● Actions
● Model
● Reward
● Policy

[Figure: action a branches with probabilities 0.25 and 0.75; rewards 5, -1, 0]
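The five ingredients listed above can be written down directly. A sketch of a toy MDP using the slide's numbers; the state names s0, s1, s2 are assumptions:

```python
# states, actions, model (transition probabilities), rewards, policy
states = ["s0", "s1", "s2"]
actions = ["a", "b", "c"]

# model[state][action] -> list of (probability, next_state)
model = {"s0": {"a": [(0.25, "s1"), (0.75, "s2")]}}

# rewards as in the figure: 5, -1, 0
reward = {"s1": 5, "s2": -1, "s0": 0}

# a policy maps each state to an action
policy = {"s0": "a"}

# expected one-step reward of following the policy from s0
expected = sum(p * reward[s] for p, s in model["s0"][policy["s0"]])
print(expected)  # 0.25 * 5 + 0.75 * (-1) = 0.5
```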

Page 23:

Markov Decision Process (MDP)
● States: ball on table, in hand, in basket, on floor

[Figure: state diagram with nodes t (table), h (hand), b (basket), f (floor)]

Page 24:

Markov Decision Process (MDP)
● States: ball on table, in hand, in basket, on floor

[Figure: state diagram with nodes t (table), h (hand), b (basket), f (floor)]

Page 25:

Markov Decision Process (MDP)
● States: ball on table, in hand, in basket, on floor
● Actions: a) attempt, b) drop, c) wait

[Figure: state diagram over t, h, b, f with actions a, b, c]

Page 26:

Markov Decision Process (MDP)
● States: ball on table, in hand, in basket, on floor
● Actions: a) attempt, b) drop, c) wait

[Figure: an attempt (a) reaches the basket with probability 0.25 and the floor with probability 0.75]

Page 27:

Markov Decision Process (MDP)
● States: ball on table, in hand, in basket, on floor
● Actions: a) attempt, b) drop, c) wait

[Figure: an attempt (a) reaches the basket with probability 0.25 and the floor with probability 0.75]

Page 28:

Markov Decision Process (MDP)
● States: ball on table, in hand, in basket, on floor
● Actions: a) attempt, b) drop, c) wait

[Figure: an attempt reaches the basket (reward 5) with probability 0.25 and the floor (reward -1) with probability 0.75; other transitions give reward 0]

Page 29:

Markov Decision Process (MDP)
● States: ball on table, in hand, in basket, on floor
● Actions: a) attempt, b) drop, c) wait

[Figure: an attempt reaches the basket (reward 5) with probability 0.25 and the floor (reward -1) with probability 0.75]

Expected reward per round: 0.25 x 5 + 0.75 x (-1) = 0.5

Page 30:

Markov Decision Process (MDP)
● States: ball on table, in hand, in basket, on floor
● Actions: a) attempt, b) drop, c) wait

[Figure: rewards 5, -1, 0 as before, with an additional -1 reward on one more transition]

Page 31:

Markov Decision Process (MDP)
● States: ball on table, in hand, in basket, on floor
● Actions: a) attempt, b) drop, c) wait

[Figure: rewards 5, -1, 0 as before, with an additional -1 reward on one more transition]

Page 32:

Reinforcement Learning Tools

● Dynamic Programming

● Monte Carlo Methods

● Temporal Difference Learning

Page 33:

Grid World

Reward:

Normal move: -1
Over obstacle: -10
Best reward: -15

Page 34:

Optimal Policy

Page 35:

Value Function

[4 x 4 grid of state values; goal (value 0) at top right:]

-15  -8  -7   0
-14  -9  -6  -1
-13 -10  -5  -2
-12 -11  -4  -3

Page 36:

Initial Policy

Page 37:

Policy Iteration

[State values under the initial policy:]

-21 -11 -10   0
-22 -12 -11  -1
-23 -13 -12  -2
-24 -14 -13  -3

Page 38:

Policy Iteration

[State values under the initial policy:]

-21 -11 -10   0
-22 -12 -11  -1
-23 -13 -12  -2
-24 -14 -13  -3

Page 39:

Policy Iteration

[State values under the initial policy:]

-21 -11 -10   0
-22 -12 -11  -1
-23 -13 -12  -2
-24 -14 -13  -3

Page 40:

Policy Iteration

Page 41:

Policy Iteration

[State values after a policy improvement step:]

-21 -11 -10   0
-22 -12 -11  -1
-23 -13 -12  -2
-15 -14  -4  -3

Page 42:

Policy Iteration

[State values after a policy improvement step:]

-21 -11 -10   0
-22 -12 -11  -1
-23 -13 -12  -2
-15 -14  -4  -3

Page 43:

Policy Iteration

[State values after a policy improvement step:]

-21 -11 -10   0
-22 -12 -11  -1
-23 -13 -12  -2
-15 -14  -4  -3

Page 44:

Policy Iteration

Page 45:

Policy Iteration

[Converged values, matching the optimal value function:]

-15  -8  -7   0
-14  -9  -6  -1
-13 -10  -5  -2
-12 -11  -4  -3
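Policy iteration alternates policy evaluation and greedy improvement until the policy stops changing. A minimal sketch on a 1-D corridor, a simplification of the grid world (the corridor, the action names, and the sweep count are assumptions, not the slides' 4 x 4 grid):

```python
def policy_iteration(n=4, step_reward=-1, sweeps=100):
    """States 0..n-1 on a corridor; the goal n-1 has value 0.
    Alternate iterative evaluation and greedy improvement."""
    moves = {"left": -1, "right": +1}

    def next_state(s, a):
        return min(max(s + moves[a], 0), n - 1)

    def evaluate(policy):
        v = [0.0] * n
        for _ in range(sweeps):          # iterative policy evaluation
            for s in range(n - 1):
                v[s] = step_reward + v[next_state(s, policy[s])]
        return v

    policy = {s: "left" for s in range(n - 1)}   # deliberately bad start
    while True:
        v = evaluate(policy)
        improved = {s: max(moves, key=lambda a, s=s:
                           step_reward + v[next_state(s, a)])
                    for s in range(n - 1)}
        if improved == policy:           # stable: optimal policy found
            return policy, v
        policy = improved

policy, v = policy_iteration()
print(policy, [round(x) for x in v])
```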

Page 46:

Value Iteration

[Values initialized to 0:]

0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0

Page 47:

Value Iteration

[After one sweep:]

-1 -1 -1  0
-1 -1 -1 -1
-1 -1 -1 -1
-1 -1 -1 -1

Page 48:

Value Iteration

[After two sweeps:]

-2 -2 -2  0
-2 -2 -2 -1
-2 -2 -2 -2
-2 -2 -2 -2

Page 49:

Value Iteration

[After three sweeps:]

-3 -3 -3  0
-3 -3 -3 -1
-3 -3 -3 -2
-3 -3 -3 -3

Page 50:

Value Iteration

[Converged values:]

-15  -8  -7   0
-14  -9  -6  -1
-13 -10  -5  -2
-12 -11  -4  -3
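The sweeps on Pages 46 through 50 repeat the Bellman optimality backup until the values stop changing. A minimal sketch on a 1-D corridor rather than the slides' 4 x 4 grid (the corridor is an assumption); the intermediate sweeps show the same -1, -2, -3 pattern:

```python
def value_iteration(n=4, step_reward=-1):
    """States 0..n-1 on a corridor; the goal state n-1 keeps value 0.
    Each sweep backs up max over the two neighbor moves."""
    v = [0.0] * n
    while True:
        new = v[:]
        for s in range(n - 1):
            neighbors = [max(s - 1, 0), min(s + 1, n - 1)]
            new[s] = max(step_reward + v[t] for t in neighbors)
        if new == v:      # converged
            return v
        v = new

print(value_iteration())
```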

Page 51:

Stochastic Model

[Figure: each move goes in the intended direction with probability 0.95 and slips to either side with probability 0.025 each]

Page 52:

Value Iteration

[Converged values under the stochastic model (0.95 intended, 0.025 to each side):]

-19.2 -10.4  -9.3    0
-18.1 -12.1  -8.2  -1.5
-17.0 -13.6  -6.7  -2.9
-15.7 -14.7  -5.1  -4.0

Page 53:

Value Iteration

[Same values as Page 52, with one backup worked out:]

-19.2 -10.4  -9.3    0
-18.1 -12.1  -8.2  -1.5
-17.0 -13.6  -6.7  -2.9
-15.7 -14.7  -5.1  -4.0

E.g. 13.6:

13.6 = 0.950 x 13.1 + 0.025 x 27.0 + 0.025 x 16.7

16.6 = 0.950 x 16.7 + 0.025 x 13.1 + 0.025 x 15.7
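The backup above can be checked numerically. Reading the figure, 13.1 appears to be the intended move (cost 1 plus the neighbor's 12.1), while 27.0 and 16.7 are slips over obstacles (cost 10 plus 17.0 and 6.7); treat that decomposition as an interpretation, not something the slide states:

```python
# expected cost of the chosen action (intended move succeeds with p = 0.95)
up = 0.95 * 13.1 + 0.025 * 27.0 + 0.025 * 16.7

# expected cost of the alternative action (over the obstacle): worse
right = 0.95 * 16.7 + 0.025 * 13.1 + 0.025 * 15.7

print(round(up, 2), round(right, 2))
```

The small gap from the slide's rounded 13.6 comes from the grid values themselves being rounded to one decimal.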

Page 54:

Richard Bellman

Page 55:

Bellman Equation
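The equation itself did not survive the transcript; the standard Bellman optimality equation this slide refers to is:

```latex
V^*(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\,\bigl[ R(s, a, s') + \gamma V^*(s') \bigr]
```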

Page 56:

Reinforcement Learning Tools

● Dynamic Programming

● Monte Carlo Methods

● Temporal Difference Learning

Page 57:

Monte Carlo Methods

[Figure: stochastic grid world; moves succeed with probability 0.95, slipping 0.025 to each side]

Page 58:

Monte Carlo Methods

[Figure: a sampled episode through the stochastic grid world]

Page 59:

Monte Carlo Methods

[Figure: returns observed along one episode: -32, -22, -21, -11, -10, 0]

Page 60:

Monte Carlo Methods

[Figure: another sampled episode through the stochastic grid world]

Page 61:

Monte Carlo Methods

[Figure: returns observed along one episode: -21, -11, -10, 0]

Page 62:

Monte Carlo Methods

[Figure: another sampled episode through the stochastic grid world]

Page 63:

Monte Carlo Methods

[Figure: returns observed along one episode: -32, -31, -21, -11, -10, 0]
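Monte Carlo evaluation averages the returns observed after visiting each state. A first-visit sketch; the two tiny episodes are made-up stand-ins for the sampled episodes in the figures:

```python
from collections import defaultdict

def mc_evaluate(episodes):
    """First-visit Monte Carlo: average the return following the
    first visit to each state over many episodes."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for episode in episodes:            # episode: [(state, reward), ...]
        g = 0.0
        returns = {}
        for state, reward in reversed(episode):
            g += reward
            returns[state] = g          # last write = earliest (first) visit
        for state, g in returns.items():
            totals[state] += g
            counts[state] += 1
    return {s: totals[s] / counts[s] for s in totals}

# two hypothetical episodes ending at the goal
episodes = [[("A", -1), ("B", -1), ("goal", 0)],
            [("B", -1), ("goal", 0)]]
print(mc_evaluate(episodes))
```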

Page 64:

Q-Value

[Figure: Q-values for the four actions in one state: -15, -10, -8, -20]

Page 65:

Bellman Equation

[Figure: the same Q-values (-15, -10, -8, -20) related through the Bellman equation]

Page 66:

Learning Rate

● We do not replace an old Q-value with a new one.

● We update at a chosen learning rate.

● Learning rate too small: slow to converge.

● Learning rate too large: unstable.

● Will Dabney's PhD thesis: Adaptive Step-Sizes for Reinforcement Learning.
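The update these bullets describe is the standard step-size rule: move the stored value a fraction of the way toward the new target. A sketch; the α = 0.1 default and the example numbers are assumptions:

```python
def q_update(q_old, target, alpha=0.1):
    """Move the stored Q-value a fraction alpha toward the new target
    instead of replacing it outright."""
    return q_old + alpha * (target - q_old)

# a tenth of the way from -15 toward -10
print(q_update(-15.0, -10.0))
```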

Page 67:

Reinforcement Learning Tools

● Dynamic Programming

● Monte Carlo Methods

● Temporal Difference Learning

Page 68:

Richard Sutton

Page 69:

Temporal Difference Learning

● Dynamic Programming: learn a guess from other guesses (bootstrapping).

● Monte Carlo Methods: learn without knowing the model.

Page 70:

Temporal Difference Learning

Temporal Difference:

● Learn a guess from other guesses (bootstrapping).

● Learn without knowing the model.

● Works with longer episodes than Monte Carlo methods.

Page 71:

Temporal Difference Learning

Monte Carlo Methods:
● First run through the whole episode.
● Update states at the end.

Temporal Difference Learning:
● Update each state at each step using earlier guesses.
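The per-step update can be sketched as TD(0); the states, rewards, and α = 0.5 here are assumptions, not the slides' example:

```python
def td0_episode(v, episode, alpha=0.5, gamma=1.0):
    """TD(0): after each step, nudge V(s) toward r + gamma * V(s'),
    using the current guess for the next state (bootstrapping)."""
    for s, r, s_next in episode:
        target = r + gamma * v[s_next]
        v[s] = v[s] + alpha * (target - v[s])
    return v

v = {"A": 0.0, "B": 0.0, "goal": 0.0}
v = td0_episode(v, [("A", -1, "B"), ("B", -1, "goal")])
print(v)
```

Unlike Monte Carlo, the values change during the episode, one transition at a time.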

Page 72:

Monte Carlo Methods

[Figure: returns observed along one episode: -32, -31, -21, -11, -10, 0]

Page 73:

Monte Carlo Methods

[Figure: returns observed along one episode: -32, -31, -21, -11, -10, 0]

Page 74:

Temporal Difference

[Figure: value estimates updated step by step along an episode: -22, -19, -18, -12, -10, 0]

Page 75:

Temporal Difference

[Figure: state values -28, -23, -21, -11, -10, 0 after TD updates, with intermediate estimates -22, -19, -18, -12, -10]

Page 76:

Temporal Difference

[Figure: state values -28, -23, -21, -11, -10, 0 after TD updates, with intermediate estimates -22, -19, -18, -12, -10]

23 = 1 + 22
28 = 10 + 18
21 = 10 + 11
11 = 1 + 10
10 = 10 + 0

(Each value is the cost of the move taken, 1 normally or 10 over an obstacle, plus the current estimate of the next state.)

Page 77:

Function Approximation

● Most problems have a large state space.

● We can generally design an approximation for the state space.

● Choosing the correct approximation has a large influence on system performance.

Page 78:

Mountain Car Problem

Page 79:

Mountain Car Problem

● The car cannot make it to the top directly.
● The car can swing back and forth to gain momentum.
● We know x and ẋ.
● x and ẋ give an infinite state space.
● Random policy: may get to the top in 1000 steps.
● Optimal policy: may get to the top in 102 steps.

Page 80:

Function Approximation

● We can partition the state space into a 200 x 200 grid.

● Coarse coding: different ways of partitioning the state space.

● We can approximate V = wᵀf, e.g. f = ( x ẋ height ẋ² )ᵀ.

● We can estimate w to solve the problem.
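The linear approximation V = wᵀf is typically fit by stochastic gradient steps on the squared error. A sketch using the slide's feature vector; the example state, target, and step size are assumptions:

```python
def features(x, x_dot, height):
    # f = (x, x_dot, height, x_dot**2), as on the slide
    return [x, x_dot, height, x_dot ** 2]

def v_hat(w, f):
    # V = w . f (linear value approximation)
    return sum(wi * fi for wi, fi in zip(w, f))

def sgd_step(w, f, target, alpha=0.01):
    """Gradient step on (target - w.f)^2: the gradient w.r.t. w is the
    feature vector, so each weight moves by alpha * error * f_i."""
    error = target - v_hat(w, f)
    return [wi + alpha * error * fi for wi, fi in zip(w, f)]

w = [0.0, 0.0, 0.0, 0.0]
f = features(-0.5, 0.04, 0.9)
w = sgd_step(w, f, target=-20.0)
print([round(wi, 5) for wi in w])
```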

Page 81:

Problems with Reinforcement Learning

The policy sometimes gets worse during learning:
● Safe Reinforcement Learning (Phil Thomas) guarantees an improved policy over the current policy.

Learning is very specific to the training task:
● Learning Parameterized Skills, Bruno Castro da Silva, PhD thesis.

Page 82:

Checkers

● Arthur Samuel (IBM) 1959

Page 83:

TD-Gammon

● Neural networks and temporal difference learning.
● Current programs play better than human experts.
● Expert work went into input selection.

Page 84:

Deep Learning: Atari

● Inputs: score and pixels.
● Deep learning is used to discover features.
● Some games are played at a super-human level.
● Some games are played at a mediocre level.