introduction to deep reinforcement...

30
Introduction to Deep Reinforcement Learning Yen-Chen Wu 2015/12/11

Upload: duongdang

Post on 13-Mar-2018

230 views

Category:

Documents


7 download

TRANSCRIPT

Page 1: Introduction to Deep Reinforcement Learningspeech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture... · Outline •Reinforcement Learning •Markov Decision Process •How to Solve

Introduction to Deep Reinforcement Learning

Yen-Chen Wu

2015/12/11

Page 2: Introduction to Deep Reinforcement Learningspeech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture... · Outline •Reinforcement Learning •Markov Decision Process •How to Solve

Outline

• Reinforcement Learning

• Markov Decision Process

• How to Solve MDPs

– DP

– MC

– TD

– Q-learning (DQN)

• Paper Review

Page 3: Introduction to Deep Reinforcement Learningspeech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture... · Outline •Reinforcement Learning •Markov Decision Process •How to Solve

REINFORCEMENT LEARNING

Page 4: Introduction to Deep Reinforcement Learningspeech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture... · Outline •Reinforcement Learning •Markov Decision Process •How to Solve

Branches of Machine Learning

Page 5: Introduction to Deep Reinforcement Learningspeech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture... · Outline •Reinforcement Learning •Markov Decision Process •How to Solve

What makes different?

• There is no supervisor, only a reward signal

• Feedback is delayed, not instantaneous

• Time really matters (sequential, non i.i.d data)

• Agent’s actions affect the subsequent data it receives

Page 6: Introduction to Deep Reinforcement Learningspeech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture... · Outline •Reinforcement Learning •Markov Decision Process •How to Solve

Goal: Maximize Cumulative Reward

• Actions may have long term consequences

• Reward may be delayed

• It may be better to sacrifice immediate reward to gain more long-term reward

Page 7: Introduction to Deep Reinforcement Learningspeech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture... · Outline •Reinforcement Learning •Markov Decision Process •How to Solve

Agent & Enviroment

→←↑↓DefenseAttackJump

Page 8: Introduction to Deep Reinforcement Learningspeech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture... · Outline •Reinforcement Learning •Markov Decision Process •How to Solve

MARKOV DECISION PROCESS

Markov Processes

Markov Reward Processes

Markov Decision Processes

Page 9: Introduction to Deep Reinforcement Learningspeech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture... · Outline •Reinforcement Learning •Markov Decision Process •How to Solve

Markov Process

Page 10: Introduction to Deep Reinforcement Learningspeech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture... · Outline •Reinforcement Learning •Markov Decision Process •How to Solve

Markov Reward Processes

Page 11: Introduction to Deep Reinforcement Learningspeech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture... · Outline •Reinforcement Learning •Markov Decision Process •How to Solve

Markov Decision Process

Page 12: Introduction to Deep Reinforcement Learningspeech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture... · Outline •Reinforcement Learning •Markov Decision Process •How to Solve

Markov Decision Process(MDP)

• S : finite set of states (observations)

• A : finite set of actions

• P : transition probability

• R : immediate reward

• γ : discount factor

• Goal :– Choose policy π

– Maximize expected return :

Page 13: Introduction to Deep Reinforcement Learningspeech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture... · Outline •Reinforcement Learning •Markov Decision Process •How to Solve

HOW TO SOLVE MDP

Dynamic Programming

Monte-Carlo

Temporal-Difference

Q-Learning

Page 14: Introduction to Deep Reinforcement Learningspeech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture... · Outline •Reinforcement Learning •Markov Decision Process •How to Solve

Model-based

• Dynamic Programming

– Evaluate policy

– Update policy

Page 15: Introduction to Deep Reinforcement Learningspeech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture... · Outline •Reinforcement Learning •Markov Decision Process •How to Solve

Model Free

• Unknown Transition Probability & Reward

• MC vs TD

Page 16: Introduction to Deep Reinforcement Learningspeech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture... · Outline •Reinforcement Learning •Markov Decision Process •How to Solve

Model Free:Q-learning

• Instead of tabular

• optimal action-value function (Q-learning)

– =

• Bellman equation

Basic idea : iterative update (lack of generalization)

In practical : function approximator

Linear ?

Using DNN !

Page 17: Introduction to Deep Reinforcement Learningspeech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture... · Outline •Reinforcement Learning •Markov Decision Process •How to Solve

DEEP Q-NETWORK (DQN)

Page 18: Introduction to Deep Reinforcement Learningspeech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture... · Outline •Reinforcement Learning •Markov Decision Process •How to Solve

Video

• https://www.youtube.com/watch?v=LJ4oCb6u7kk

Page 19: Introduction to Deep Reinforcement Learningspeech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture... · Outline •Reinforcement Learning •Markov Decision Process •How to Solve

Deep Q-Network

• compute Q-values for all actions

Input : 84x84x4

Convolves 32 filters of 8x8 with stride 4Convolves 64 filters of 4x4 with stride 2Convolves 64 filters of 3x3 with stride 1

Full-connected 512 nodes

Output a node for each action

Page 20: Introduction to Deep Reinforcement Learningspeech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture... · Outline •Reinforcement Learning •Markov Decision Process •How to Solve

Update DQN

• Loss function

• Gradient

Page 21: Introduction to Deep Reinforcement Learningspeech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture... · Outline •Reinforcement Learning •Markov Decision Process •How to Solve

Two Technique

• Experience Replay– Experience– Pooled Memory

• Data efficiency (bootstrap)• Avoid correlation between samples (variance between

batches)• Off –policy is suitable for Q-learning

– Random sampled mini-batch– Prioritized sweeping (active learning)

• Separate Target Network– more stable than online learning

Example Learn the value of… Pros & Cons

On-policy SARSA policy being carried out by the agent

Fast but weak

Off-policy DQN optimal policy independently of the agent's actions

Slow but robust

Page 22: Introduction to Deep Reinforcement Learningspeech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture... · Outline •Reinforcement Learning •Markov Decision Process •How to Solve

DEMO

Page 23: Introduction to Deep Reinforcement Learningspeech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture... · Outline •Reinforcement Learning •Markov Decision Process •How to Solve

PAPER REVIEW

Page 24: Introduction to Deep Reinforcement Learningspeech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture... · Outline •Reinforcement Learning •Markov Decision Process •How to Solve

Paper list

• Massively Parallel Methods for Deep Reinforcement Learning

• Continuous control with deep reinforcement learning

• Deep Reinforcement Learning with Double Q-learning

• Policy Distillation

• Dueling Network Architectures for Deep Reinforcement Learning

• Multiagent Cooperation and Competition with Deep Reinforcement Learning

Page 25: Introduction to Deep Reinforcement Learningspeech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture... · Outline •Reinforcement Learning •Markov Decision Process •How to Solve

Massively Parallel Methods for Deep Reinforcement LearningArun NairarXiv:1507.04296

Page 26: Introduction to Deep Reinforcement Learningspeech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture... · Outline •Reinforcement Learning •Markov Decision Process •How to Solve

DDPG (Deterministic Policy Gradient)

• DDAC (Deep Deterministic Actor-Critic)

Continuous control with deep reinforcement learningTimothy P. LillicraparXiv:1509.02971https://goo.gl/J4PIAz

Page 27: Introduction to Deep Reinforcement Learningspeech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture... · Outline •Reinforcement Learning •Markov Decision Process •How to Solve

Double Q-learning

Page 28: Introduction to Deep Reinforcement Learningspeech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture... · Outline •Reinforcement Learning •Markov Decision Process •How to Solve

Policy Distillation

• Soft target