Reinforcement Learning 2

Pantelis P. Analytis

March 24, 2018




1 Introduction

2 Temporal difference learning

3 Q-learning

4 Applications

5 Midterm revision


Different types of learning


Characteristics of reinforcement learning

Evaluative feedback.

Sequentiality, delayed rewards.

Need for trial and error, to explore as well as to exploit.

Non-stationary world.
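The need to explore as well as to exploit can be made concrete with a small sketch. The example below is an illustration only, not taken from the slides: it assumes a hypothetical three-armed bandit with Gaussian rewards and an epsilon-greedy rule, and the function name, arm means, and parameter values are all made up for the purpose.

```python
import random

def epsilon_greedy_bandit(true_means, steps=5000, epsilon=0.1, seed=0):
    """Estimate arm values by trial and error with an epsilon-greedy rule."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    estimates = [0.0] * n_arms   # running average reward per arm
    counts = [0] * n_arms        # how often each arm was pulled
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)                           # explore
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])  # exploit
        # Evaluative feedback: the agent only sees the reward of the arm it chose.
        reward = rng.gauss(true_means[arm], 1.0)
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates, counts

# Assumed arm means; the third arm is the best one.
estimates, counts = epsilon_greedy_bandit([0.2, 0.5, 1.0])
```

With `epsilon=0` the agent would never try an arm it currently believes to be worse, so a poor early estimate could lock it onto a sub-optimal arm; the small exploration rate is what lets it discover the best one.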


Temporal difference learning

Broadly used to predict future rewards.

It appears to be how the brain's reward system works.

It is learning a prediction from another, later, learned prediction.

The TD error is the difference between two predictions, the temporal difference.


$$V(s) \leftarrow V(s) + \alpha \Big( \overbrace{r + \gamma V(s')}^{\text{the TD target}} - V(s) \Big)$$

$r + \gamma V(s')$ is known as the TD target.
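A minimal sketch of the update rule above, applied to a made-up example. The three-state chain, the state names, and the values of `alpha` and `gamma` are assumptions chosen for illustration; only the update itself comes from the slide.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=1.0):
    """Apply V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s)) in place."""
    td_target = r + gamma * V[s_next]
    td_error = td_target - V[s]
    V[s] += alpha * td_error
    return td_error

# Hypothetical deterministic chain: A -> B -> C -> end, reward 1 on the last step.
episode = [("A", 0.0, "B"), ("B", 0.0, "C"), ("C", 1.0, "end")]
V = {"A": 0.0, "B": 0.0, "C": 0.0, "end": 0.0}

for _ in range(200):                      # replay the same episode many times
    for s, r, s_next in episode:
        td0_update(V, s, r, s_next, alpha=0.1, gamma=1.0)
```

In this sketch only `V["C"]` changes during the first pass, `V["B"]` starts moving in the second, and `V["A"]` in the third: each state learns its prediction from the later prediction of its successor, and all three values approach 1.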


Temporal difference learning in the brain (Schultz, Dayan, Montague, 1997)


$$V(s) \leftarrow V(s) + \alpha \Big( \overbrace{r + \gamma V(s')}^{\text{the TD target}} - V(s) \Big)$$

$r + \gamma V(s')$ is known as the TD target.
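The term in parentheses, the TD target minus the current prediction, is the TD error, and it is this quantity that is compared with reward-system activity. The sketch below evaluates it for three textbook situations; the numbers are illustrative assumptions, not data from the paper.

```python
def td_error(r, v_s, v_next, gamma=1.0):
    """TD error: the TD target r + gamma * V(s') minus the current prediction V(s)."""
    return r + gamma * v_next - v_s

# A reward arrives although none was predicted: positive error.
unexpected_reward = td_error(r=1.0, v_s=0.0, v_next=0.0)

# A reward arrives exactly as predicted: no error.
predicted_reward = td_error(r=1.0, v_s=1.0, v_next=0.0)

# A predicted reward fails to arrive: negative error.
omitted_reward = td_error(r=0.0, v_s=1.0, v_next=0.0)
```

The three signs (positive, zero, negative) follow the pattern commonly described for reward prediction errors: a burst for a surprising reward, no change for a fully predicted one, and a dip when an expected reward is withheld.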


Temporal difference learning: example

Predicting the outcome of a game like chess or backgammon.

Long-term predictions by simulation are complex, and even small errors in one-step predictions might be amplified.



Q-learning


Q-learning converges to the optimal action values even if you are acting sub-optimally, as long as every state-action pair continues to be tried.
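This property can be illustrated with a small sketch. The corridor environment, the purely random behaviour policy, and all parameter values below are assumptions made for the example; the update is standard tabular Q-learning, whose target uses the best next action rather than the action actually taken.

```python
import random

def q_learning_random_behaviour(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    """Tabular Q-learning on a hypothetical 4-state corridor with random actions."""
    rng = random.Random(seed)
    n_states, goal = 4, 3              # states 0..3; state 3 is terminal
    LEFT, RIGHT = 0, 1
    Q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        s = 0
        while s != goal:
            a = rng.choice([LEFT, RIGHT])          # behaviour: purely random
            s_next = max(s - 1, 0) if a == LEFT else s + 1
            r = 1.0 if s_next == goal else 0.0     # reward only on reaching the goal
            # Target uses the best next action, not the one the agent takes next.
            target = r + gamma * max(Q[s_next])
            Q[s][a] += alpha * (target - Q[s][a])
            s = s_next
    return Q

Q = q_learning_random_behaviour()
# Greedy policy read off the learned values (0 = left, 1 = right).
greedy_policy = [max((0, 1), key=lambda a: Q[s][a]) for s in range(3)]
```

Although the agent moves left half of the time, the learned values favour moving right in every non-terminal state, which is the optimal policy in this corridor.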


Model-based and model-free learning

Many situations involve conflict between a model-free system like TD-learning and a model-based system that plans ahead.
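One way such a conflict can arise is sketched below under stated assumptions: a hypothetical one-step task in which an outcome suddenly loses its value. The task, the action and outcome names, and the learning rate are invented for illustration and do not come from the slides.

```python
# Hypothetical one-step task: each action leads deterministically to an outcome.
model = {"lever": "food", "chain": "water"}        # known transition model
outcome_value = {"food": 1.0, "water": 0.5}        # current worth of each outcome

def model_based_choice(model, outcome_value):
    """Plan ahead: look up each action's outcome and its current value."""
    return max(model, key=lambda a: outcome_value[model[a]])

def model_free_choice(cached_q):
    """Act on cached action values learned from past rewards."""
    return max(cached_q, key=cached_q.get)

# Model-free values learned by a TD-style running average of experienced rewards.
cached_q = {"lever": 0.0, "chain": 0.0}
alpha = 0.2
for _ in range(50):
    for action in cached_q:
        reward = outcome_value[model[action]]
        cached_q[action] += alpha * (reward - cached_q[action])

before = (model_based_choice(model, outcome_value), model_free_choice(cached_q))

# The food loses its value (e.g. the agent is sated); no new experience yet.
outcome_value["food"] = 0.0
after = (model_based_choice(model, outcome_value), model_free_choice(cached_q))
```

Before the change both systems pick the lever. Afterwards the planner switches to the chain immediately, while the cached values still favour the lever until they are relearned from new experience.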


Samuel’s checkers program

Inspired by Shannon’s paper on chess-playing computers.

It achieved a good, but not expert, level of play.

Used a learning process that was similar to TD-learning.


Tesauro’s TD-Gammon

Developed in 1992 by Gerald Tesauro. After playing 300,000 games against itself, it performed approximately at the level of human world-class players.


Atari breakthrough

Google DeepMind trained an agent that learned 49 Atari games, receiving the pixels of the screen as input and evaluating the rewards from different positions of the joystick. It reached human-level play on about half of them.


AlphaGo

AlphaGo searched and planned much deeper in the game tree.

It uses reinforcement learning to evaluate which paths are worth searching.


Attention allocation in online interfaces


Music lab experiment


Learning from others


Clinical vs. actuarial decision making


Exploration-exploitation dilemma


Iowa gambling task
