Reinforcement Learning - Redwood Center for Theoretical …

Reinforcement Learning VS265 - Neural Computation, 2018


Post on 19-May-2020


Reinforcement Learning

VS265 - Neural Computation, 2018


What we have covered

Passive Learning
Today: Active Learning (RL)


What is Reinforcement Learning?


How is it different than other models?

Passive Learning Active Learning (RL)


Why is this hard?

● Actions affect future data
● Rewards are sparse
● Feedback is delayed


Outline
● Markov Decision Processes (MDPs)
● How to maximize reward (Q-Learning)
● Connection to neurons in the Ventral Tegmental Area (VTA)
● How to learn in large, unstructured environments
● Open Questions


Markov Decision Process


Markov Decision Process (MDP)
An MDP fully describes an Environment:
○ S: State Space
○ A: Action Space
○ P: Transition Kernel - P(s' | s, a)
○ R: Reward Function - R(s, a)
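The four components of an MDP can be written down directly as data and functions. The sketch below is a hypothetical four-state chain: the state names A to D, the two actions, the 0.9 success probability, and the reward values are all assumptions chosen for illustration, not taken from the slides.

```python
import random

STATES = ["A", "B", "C", "D"]       # S: state space (D is absorbing)
ACTIONS = ["stay", "forward"]       # A: action space

def transition_probs(s, a):
    """P(s' | s, a): map each reachable next state to its probability."""
    if s == "D" or a == "stay":
        return {s: 1.0}
    s_forward = STATES[STATES.index(s) + 1]
    # Assumed noise: "forward" succeeds with probability 0.9.
    return {s_forward: 0.9, s: 0.1}

def reward(s, a):
    """R(s, a): +1 for attempting the final move out of C, 0 otherwise."""
    return 1.0 if (s, a) == ("C", "forward") else 0.0

def step(s, a, rng=random):
    """Sample s' ~ P(. | s, a) and return (s', reward)."""
    probs = transition_probs(s, a)
    s_next = rng.choices(list(probs), weights=list(probs.values()))[0]
    return s_next, reward(s, a)
```

Calling `step("A", "forward")` returns either `("B", 0.0)` or `("A", 0.0)`, depending on the sampled transition.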


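Markov Decision Process (MDP)
● Markov
  ○ The next state depends only on the current state and action: P(s_{t+1} | s_t, a_t) = P(s_{t+1} | s_1, a_1, ..., s_t, a_t)
● Decision
  ○ Decide on an action at each time point
● Process
  ○ States evolve over time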
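The three words map onto a simple interaction loop: the action is chosen from the current state alone, one decision is made per time point, and the state moves forward. A minimal sketch, assuming a hypothetical deterministic chain A, B, C, D and an always-forward policy (neither comes from the slides):

```python
def rollout(step, policy, s0, max_steps=10):
    """Run one episode and return the visited (state, action, reward) triples."""
    s, trajectory = s0, []
    for _ in range(max_steps):
        a = policy(s)                   # Decision: uses only the current state (Markov)
        s_next, r, done = step(s, a)    # Process: the environment moves to the next state
        trajectory.append((s, a, r))
        s = s_next
        if done:
            break
    return trajectory

def chain_step(s, a):
    """Hypothetical environment: 'forward' moves one state along A, B, C, D."""
    order = "ABCD"
    i = order.index(s)
    if a == "forward" and i < len(order) - 1:
        i += 1
    s_next = order[i]
    done = s_next == "D"
    return s_next, (1.0 if done else 0.0), done

traj = rollout(chain_step, lambda s: "forward", "A")
```

Here `traj` records three decisions, one each in A, B, and C, with the only nonzero reward on the last one.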


Markov Decision Process (MDP)

[Figure: example MDP with states A, B, C, and D]


Q-Learning - Algorithm


Q-Learning - Algorithm
● Find a good policy, π, that maximizes the expected sum of rewards over time
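The equation on this slide did not survive extraction. One common way to write this objective is shown below; the discount factor γ and the infinite horizon are assumptions, since the slide text mentions only an expected sum of rewards.

```latex
\pi^{*} \;=\; \arg\max_{\pi}\; \mathbb{E}\!\left[\, \sum_{t=0}^{\infty} \gamma^{t}\, r_{t} \;\middle|\; \pi \right],
\qquad 0 \le \gamma \le 1
```

With γ = 1 and a finite episode this reduces to the plain sum of rewards; γ < 1 keeps the sum finite when episodes do not end.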


Q-Learning - Algorithm
● Q(s,a) is the total expected reward starting from state s, taking action a, and then following the optimal policy
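When the environment model is known, this definition can be evaluated directly by repeatedly backing up "reward now plus the best Q afterwards". The sketch below does this for a hypothetical deterministic chain A, B, C, D; the chain, the discount factor of 0.9, and the function names are assumptions for illustration. It is not Q-learning itself, which estimates the same quantity from sampled experience.

```python
GAMMA = 0.9                      # assumed discount factor
STATES = "ABCD"
ACTIONS = ("stay", "forward")

def model(s, a):
    """Hypothetical deterministic model: returns (next_state, reward)."""
    if s == "D":
        return "D", 0.0          # D is absorbing and gives no further reward
    if a == "forward":
        s_next = STATES[STATES.index(s) + 1]
        return s_next, (1.0 if s_next == "D" else 0.0)
    return s, 0.0

def q_value_iteration(n_sweeps=50):
    """Compute Q(s, a) by repeated Bellman optimality backups."""
    q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    for _ in range(n_sweeps):
        # Each new value uses the previous sweep's table on the right-hand side.
        q = {
            (s, a): r + GAMMA * max(q[(s_next, b)] for b in ACTIONS)
            for (s, a) in q
            for (s_next, r) in [model(s, a)]
        }
    return q

q_star = q_value_iteration()
```

In this chain the value of moving forward shrinks by a factor of GAMMA for every extra step between the agent and the reward.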


Q-Learning - Update Rule


Q-Learning - Update Rule
State-Value Function:


Q-Learning - Update Rule
Action-Value Function:
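The formulas for these two functions are missing from the transcript. Their standard textbook definitions are given below; the notation (policy π, discount γ, reward r) is assumed and may differ from what the slides used.

```latex
\begin{aligned}
V^{\pi}(s)   &= \mathbb{E}_{\pi}\!\left[\, \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k} \;\middle|\; s_t = s \right] \\
Q^{\pi}(s,a) &= \mathbb{E}_{\pi}\!\left[\, \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k} \;\middle|\; s_t = s,\; a_t = a \right] \\
V^{*}(s)     &= \max_{a}\, Q^{*}(s,a)
\end{aligned}
```

The last line is what lets an agent act from Q alone: the best action in state s is the one with the largest Q(s,a).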


Q-Learning - Update Rule

Iterate: [update equation not preserved; its annotated terms are labeled Temporal Difference, Critic (New Belief), and Belief]
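The standard tabular Q-learning update is Q(s,a) ← Q(s,a) + α [r + γ max_a' Q(s',a') − Q(s,a)]. One plausible reading of the slide's labels, offered as an assumption because the equation itself is missing: "Belief" is the current Q(s,a), "Critic (New Belief)" is the target r + γ max_a' Q(s',a'), and "Temporal Difference" is the gap between them. A minimal sketch on a hypothetical deterministic chain A, B, C, D, with assumed hyperparameters and ε-greedy exploration:

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1   # assumed hyperparameters
ACTIONS = ("stay", "forward")

def env_step(s, a):
    """Hypothetical chain A, B, C, D; reaching D pays +1 and ends the episode."""
    order = "ABCD"
    s_next = order[order.index(s) + (1 if a == "forward" else 0)]
    done = s_next == "D"
    return s_next, (1.0 if done else 0.0), done

def q_learning(n_episodes=500, max_steps=50, seed=0):
    rng = random.Random(seed)
    q = defaultdict(float)                       # belief: Q(s, a), starts at 0
    for _ in range(n_episodes):
        s = "A"
        for _ in range(max_steps):
            if rng.random() < EPSILON:           # explore
                a = rng.choice(ACTIONS)
            else:                                # exploit, breaking ties at random
                best = max(q[(s, b)] for b in ACTIONS)
                a = rng.choice([b for b in ACTIONS if q[(s, b)] == best])
            s_next, r, done = env_step(s, a)
            # New estimate of Q(s, a); no bootstrapping past a terminal state.
            future = 0.0 if done else GAMMA * max(q[(s_next, b)] for b in ACTIONS)
            td_error = (r + future) - q[(s, a)]  # temporal difference
            q[(s, a)] += ALPHA * td_error        # move the belief toward the new estimate
            s = s_next
            if done:
                break
    return q

q = q_learning()
```

After training, the forward action has the highest value in A, B, and C, and the values fall off by a factor of GAMMA per step away from the reward.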


Q-Learning - Exercise


Q-Learning (Exercise)

Temporal Difference

[Figure: MDP with states A, B, C, and D]


Connection to VTA


Connection to VTA

Theoretical Neuroscience, ch. 9 (Dayan & Abbott). Adapted from Mirenowicz & Schultz, '94 and Schultz, '98.
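The figure on this slide is not preserved. The cited chapter relates dopamine activity to a temporal difference prediction error, and the sketch below simulates that kind of model: a stimulus is followed by a reward, and the prediction is a weighted sum over the stimulus history. The trial length, event times, and learning rate are assumptions chosen for illustration.

```python
import numpy as np

T, T_STIM, T_REWARD = 20, 5, 15    # hypothetical trial length and event times
EPS = 0.5                          # assumed learning rate

def run_trials(n_trials=200):
    """Return the TD error delta(t) for every trial, shape (n_trials, T)."""
    u = np.zeros(T)
    u[T_STIM] = 1.0                # stimulus: a single pulse
    r = np.zeros(T)
    r[T_REWARD] = 1.0              # reward: a single pulse after the stimulus
    w = np.zeros(T)                # one weight per delay since the stimulus
    deltas = []
    for _ in range(n_trials):
        v = np.convolve(w, u)[:T]          # v(t) = sum_tau w(tau) u(t - tau)
        v_next = np.append(v[1:], 0.0)     # v(t + 1), zero after the trial ends
        delta = r + v_next - v             # TD error: delta(t) = r(t) + v(t+1) - v(t)
        for tau in range(T):
            # w(tau) += eps * sum_t delta(t) u(t - tau)
            w[tau] += EPS * np.dot(delta[tau:], u[:T - tau])
        deltas.append(delta)
    return np.array(deltas)

deltas = run_trials()
```

On the first trial the error appears at the time of the reward. After learning it has moved to the step just before the stimulus, where the prediction first rises, and the fully predicted reward no longer produces an error.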


Q-Learning in large environments


Q-Learning in large environments

[Figure: MDP with states A, B, C, and D]


Q-Learning in large environments
● Deep Q-Networks (DQN): Estimate Q using a neural network
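Replacing the Q table with a network means the state goes in as a vector and one Q estimate per action comes out. A minimal sketch of such a forward pass in NumPy follows; the two-layer architecture, layer sizes, and function names are assumptions for illustration and are much smaller than a real DQN.

```python
import numpy as np

def init_q_network(state_dim, n_actions, hidden=32, seed=0):
    """Randomly initialise a small two-layer network (hypothetical sizes)."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(0.0, 0.1, (state_dim, hidden)),
        "b1": np.zeros(hidden),
        "W2": rng.normal(0.0, 0.1, (hidden, n_actions)),
        "b2": np.zeros(n_actions),
    }

def q_values(params, states):
    """Map states of shape (batch, state_dim) to Q estimates of shape (batch, n_actions)."""
    hidden = np.maximum(0.0, states @ params["W1"] + params["b1"])   # ReLU layer
    return hidden @ params["W2"] + params["b2"]

params = init_q_network(state_dim=4, n_actions=2)
q = q_values(params, np.ones((3, 4)))
greedy_actions = q.argmax(axis=1)       # act greedily with respect to the estimates
```

The network outputs all action values in one pass, so picking the greedy action is a single `argmax` per state.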


Q-Learning in large environments
● Objective Function: Use the temporal difference signal
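The exact objective on this slide is not preserved. A common choice is the mean squared temporal difference error over a batch of transitions, sketched below. The function name, the argument layout, and the treatment of the next-state values as fixed constants are assumptions.

```python
import numpy as np

def td_loss(q_sa, rewards, q_next, dones, gamma=0.99):
    """Mean squared TD error for a batch of transitions.

    q_sa:    Q(s, a) for the actions actually taken, shape (batch,)
    rewards: observed rewards, shape (batch,)
    q_next:  Q(s', .) for every action, shape (batch, n_actions), held fixed
    dones:   1.0 where s' is terminal, else 0.0, shape (batch,)
    """
    targets = rewards + gamma * (1.0 - dones) * q_next.max(axis=1)
    td_errors = targets - q_sa
    return np.mean(td_errors ** 2)

loss = td_loss(
    q_sa=np.array([0.0, 1.0]),
    rewards=np.array([1.0, 0.0]),
    q_next=np.array([[0.0, 2.0], [1.0, 0.0]]),
    dones=np.array([1.0, 0.0]),
    gamma=0.5,
)
```

In the example the first transition is terminal, so its target is just the reward; the two TD errors are 1.0 and -0.5.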


Deep Q-Network

● Use a Convolutional Neural Network (CNN) as the function approximator

● Experience Replay - Store experiences in a data-set and randomly sample them during learning
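Experience replay as described above can be sketched as a bounded buffer with uniform random sampling. The class name, method names, and capacity handling are assumptions for illustration, not the implementation from the cited paper.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of transitions, sampled uniformly at random."""

    def __init__(self, capacity, seed=None):
        self.buffer = deque(maxlen=capacity)   # oldest experiences are dropped first
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement from the stored experiences.
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

buffer = ReplayBuffer(capacity=3, seed=0)
for t in range(5):
    buffer.add(state=t, action=0, reward=0.0, next_state=t + 1, done=False)
batch = buffer.sample(2)
```

Sampling at random breaks up the strong correlation between consecutive transitions, which is the motivation usually given for this technique.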

Mnih, Volodymyr, et al. "Playing atari with deep reinforcement learning." arXiv preprint arXiv:1312.5602 (2013).


Open Questions


Open Questions
● Credit assignment in worlds with sparse rewards
● Exploration vs. Exploitation
● Generalization to the real world
● Continual Learning


Q-Learning in even more complex worlds

[Figure: MDP with states A, B, C, and D]


Resources
● David Silver's Lectures
  ○ http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html
● CS294 - Deep Reinforcement Learning
  ○ http://rll.berkeley.edu/deeprlcourse/


Q-Learning

Temporal Difference

[Figure: MDP with states A, B, C, and D]