Reinforcement Learning: A Beginner's Tutorial
DESCRIPTION
This is a presentation of a Reinforcement Learning tutorial for beginners which I worked on.
TRANSCRIPT
REINFORCEMENT LEARNING
A Beginner's Tutorial
By: Omar Enayet
(Presentation Version)
The Problem
Agent-Environment Interface
Environment Model
Goals & Rewards
Returns
Credit-Assignment Problem
Markov Decision Process
An MDP is defined by ⟨S, A, p, r, γ⟩:
S – set of states of the environment
A(s) – set of actions possible in state s
p^a_{ss'} – probability of transition from s to s' when executing a
r^a_s – expected reward when executing a in s
γ – discount rate for expected reward
Assumption: discrete time t = 0, 1, 2, . . .
The interaction produces a trajectory: s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, r_{t+3}, s_{t+3}, a_{t+3}, . . .
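The agent-environment loop above can be sketched in code. The two-state MDP below is a hypothetical example (the transition table P, reward table R, and state/action names are illustrative, not from the slides); it shows sampling s' from p^a_{ss'} and accumulating the discounted return.

```python
import random

# Hypothetical two-state MDP: P[s][a] -> list of (next_state, probability),
# R[s][a] -> expected reward r^a_s. All numbers are illustrative.
P = {
    0: {0: [(0, 0.9), (1, 0.1)], 1: [(1, 1.0)]},
    1: {0: [(0, 1.0)],           1: [(1, 0.8), (0, 0.2)]},
}
R = {0: {0: 0.0, 1: 1.0}, 1: {0: 0.0, 1: 2.0}}
GAMMA = 0.9  # discount rate

def step(s, a):
    """Sample s' from p(s'|s,a) and return (s', reward)."""
    next_states, probs = zip(*P[s][a])
    s_next = random.choices(next_states, weights=probs)[0]
    return s_next, R[s][a]

# Discounted return G_t = r_{t+1} + gamma*r_{t+2} + ... over a short trajectory
s, G, discount = 0, 0.0, 1.0
for t in range(10):
    a = random.choice([0, 1])   # a_t (random policy for illustration)
    s, r = step(s, a)           # yields r_{t+1}, s_{t+1}
    G += discount * r
    discount *= GAMMA
```

The same P/R tables are reused by the later sketches in these notes.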
Value Functions
Optimal Value Functions
Exploration-Exploitation Problem
Policies
Elementary Solution Methods
Dynamic Programming
Perfect Model
Bootstrapping
Generalized Policy Iteration
Efficiency of DP
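Dynamic programming's reliance on a perfect model can be made concrete with a short value-iteration sketch. The two-state MDP tables are hypothetical (same illustrative shape as the earlier example); the loop bootstraps V(s) from the current estimates of V(s') until the updates fall below a threshold.

```python
# Hypothetical two-state MDP with a perfect model (illustrative numbers).
P = {
    0: {0: [(0, 0.9), (1, 0.1)], 1: [(1, 1.0)]},
    1: {0: [(0, 1.0)],           1: [(1, 0.8), (0, 0.2)]},
}
R = {0: {0: 0.0, 1: 1.0}, 1: {0: 0.0, 1: 2.0}}
GAMMA = 0.9

def value_iteration(theta=1e-8):
    """Sweep V(s) <- max_a [ r(s,a) + gamma * sum_s' p(s'|s,a) V(s') ]."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v = max(R[s][a] + GAMMA * sum(p * V[ns] for ns, p in P[s][a])
                    for a in P[s])
            delta = max(delta, abs(v - V[s]))
            V[s] = v          # bootstrapping: new estimate uses old estimates
        if delta < theta:
            return V
```

Each sweep touches every state, which is why DP becomes expensive for large state spaces.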
Monte-Carlo Methods
Episodic Return
Advantages over DP
• No model required
• Works from simulation, or from only part of a model
• Can focus on a small subset of states
• Less harmed by violations of the Markov property
First-Visit vs. Every-Visit
On-Policy vs. Off-Policy
Action-value instead of State-value
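A minimal first-visit Monte Carlo prediction sketch, using a hypothetical episodic chain (states 0 → 1 → terminal, reward 1 on the final step, γ = 1; all details illustrative): episodes are generated, episodic returns are computed backwards, and only the return following the first visit to each state is averaged.

```python
GAMMA = 1.0  # undiscounted episodic task

def generate_episode():
    """Hypothetical episode: list of (state, reward received after leaving it)."""
    return [(0, 0.0), (1, 1.0)]   # 0 -> 1 -> terminal, reward 1 at the end

def first_visit_mc(num_episodes=100):
    returns = {0: [], 1: []}
    for _ in range(num_episodes):
        episode = generate_episode()
        G = 0.0
        first_visit_G = {}
        # Walk backwards so G accumulates the return G_t at each step;
        # earlier visits overwrite later ones, keeping the first-visit return.
        for s, r in reversed(episode):
            G = r + GAMMA * G
            first_visit_G[s] = G
        for s, G_s in first_visit_G.items():
            returns[s].append(G_s)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```

Every-visit MC would instead append G_t for every occurrence of s in the episode, not just the first.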
Temporal-Difference Learning
Advantages of TD Learning
SARSA (On-Policy)
Q-Learning (Off-Policy)
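The SARSA and Q-Learning update rules can be sketched side by side (step size, discount, and the tabular Q layout are illustrative assumptions). The off-policy Q-Learning target uses max over next actions regardless of what the behavior policy actually does; on-policy SARSA uses the action actually taken next.

```python
ALPHA, GAMMA = 0.1, 0.9  # illustrative step size and discount rate

def q_learning_update(Q, s, a, r, s_next):
    """Off-policy: target bootstraps from the greedy next action."""
    target = r + GAMMA * max(Q[s_next].values())
    Q[s][a] += ALPHA * (target - Q[s][a])

def sarsa_update(Q, s, a, r, s_next, a_next):
    """On-policy: target bootstraps from the action a_{t+1} actually taken."""
    target = r + GAMMA * Q[s_next][a_next]
    Q[s][a] += ALPHA * (target - Q[s][a])
```

Both are temporal-difference methods: each update moves Q(s,a) toward a target built from the immediate reward plus a bootstrapped estimate, rather than waiting for the episodic return.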
Actor-Critic Methods (On-Policy)
R-Learning (Off-Policy): maximizes average expected reward per time-step
Eligibility Traces
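A one-step sketch of TD(λ) with accumulating eligibility traces, for a small discrete state space (ALPHA, GAMMA, LAM, and the dict-based tables are illustrative assumptions): after each transition, every state's value is nudged by the TD error in proportion to its trace, which then decays.

```python
ALPHA, GAMMA, LAM = 0.1, 0.9, 0.8  # illustrative step size, discount, trace decay

def td_lambda_step(V, e, s, r, s_next):
    """One TD(lambda) transition with accumulating traces."""
    delta = r + GAMMA * V[s_next] - V[s]   # TD error for this transition
    e[s] += 1.0                            # accumulating trace for visited state
    for x in V:
        V[x] += ALPHA * delta * e[x]       # credit propagates to traced states
        e[x] *= GAMMA * LAM                # traces decay every step
```

Traces bridge TD and Monte Carlo: λ = 0 recovers one-step TD, while λ = 1 spreads credit over the whole episode, which is one answer to the credit-assignment problem mentioned earlier.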
REFERENCES
Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press (A Bradford Book), 1998.
Richard Crouch, Peter Bennett, Stephen Bridges, Nick Piper and Robert Oates - Monte Carlo - 2003
SLIDES FOR READING WITH: Omar Enayet - Reinforcement Learning: A Beginner's Tutorial - 2009