
Page 1

Reinforcement Learning

Introduction

Subramanian Ramamoorthy
School of Informatics

17 January 2012

Page 2

Admin

Lecturer: Subramanian (Ram) Ramamoorthy
IPAB, School of Informatics
s.ramamoorthy@ed (preferred method of contact)
Informatics Forum 1.41, x505119

Main Tutor: Majd Hawasly, [email protected], IF 1.43

Class representative?

Mailing list: Are you on it? I will use it for announcements!

Page 3

Admin

Lectures:
• Tuesday and Friday 12:10 - 13:00 (FH 3.D02 and AT 2.14)

Assessment: Homework/Exam, 10% + 10% / 80%
• HW1: Out 7 Feb, Due 23 Feb
– Use MDP-based methods in a robot navigation problem
• HW2: Out 6 Mar, Due 29 Mar
– POMDP version of the previous exercise

Page 4

Admin

Tutorials (M. Hawasly, K. Etessami, M. van Rossum), tentatively:
T1 [Warm-up: Formulation, bandits] - week of 30th Jan
T2 [Dyn. Prog.] - week of 6th Feb
T3 [MC methods] - week of 13th Feb
T4 [TD methods] - week of 27th Feb
T5 [POMDP] - week of 12th Mar

- We’ll assign questions (a combination of pen & paper and computational exercises); you attempt them before the sessions.
- Tutor will discuss and clarify the concepts underlying the exercises.
- Tutorials are not assessed; gain feedback from participation.

Page 5

Admin

Webpage: www.informatics.ed.ac.uk/teaching/courses/rl
• Lecture slides will be uploaded as they become available

Readings:
• R. Sutton and A. Barto, Reinforcement Learning, MIT Press, 1998
• S. Thrun, W. Burgard, D. Fox, Probabilistic Robotics, MIT Press, 2005
• Other readings: uploaded to web page as needed

Background: Mathematics, Matlab, Exposure to machine learning?

Page 6

Problem of Learning from Interaction

with an environment to achieve some goal

• Baby playing. No teacher. Sensorimotor connection to environment.

– Cause → effect
– Action → consequences
– How to achieve goals

• Learning to drive a car, hold a conversation, etc.
• The environment’s response affects our subsequent actions
• We find out the effects of our actions later

Page 7

Rough History of RL Ideas

• Psychology – learning by trial and error… actions followed by good or bad outcomes have their tendency to be reselected altered accordingly
– Selectional: try alternatives and pick good ones
– Associative: associate alternatives with particular situations

• Computational studies (e.g., the credit assignment problem)
– Minsky’s SNARC, 1950
– Michie’s MENACE, BOXES, etc., 1960s

• Temporal Difference learning (Minsky, Samuel, Shannon, …)
– Driven by differences between successive estimates over time
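To make "driven by differences between successive estimates" concrete, here is a minimal TD(0) update sketch; the tabular value function V, step size, and discount below are illustrative assumptions, not from the slides:

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One TD(0) step: nudge V[s] toward the bootstrapped target r + gamma*V[s_next].

    The update is driven entirely by the difference between successive
    estimates of the same quantity, taken one time step apart.
    """
    td_error = r + gamma * V[s_next] - V[s]   # the temporal difference
    V[s] += alpha * td_error
    return V
```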

Page 8

Rough History of RL, contd.

• In the 1970s-80s, many researchers, e.g., Klopf, Sutton & Barto, …, looked seriously at issues of “getting results from the environment” as opposed to supervised learning (the distinction is subtle!)
– Although supervised learning methods such as backpropagation were sometimes used, the emphasis was different

• Stochastic optimal control (mathematics, operations research)
– Deep roots: Hamilton-Jacobi → Bellman/Howard
– By the 1980s, people began to realize the connection between MDPs and the RL problem as above…

Page 9

What is the Nature of the Problem?

• As you can tell from the history, there are many ways to understand the problem – you will see this as we proceed through the course

• One unifying perspective: Stochastic optimization over time

• Given: (a) an environment to interact with, (b) a goal
• Formulate a cost (or reward)
• Objective: Maximize rewards over time
• The catch: the reward signal alone may not be rich enough, since the optimization is over time – we are selecting entire paths (see the sketch below)
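To make the objective concrete, a minimal sketch of the interaction loop being optimized; the gym-style env/policy interface here is an assumption for illustration, not part of the course material:

```python
def run_episode(env, policy, horizon=100):
    """Roll out one episode and accumulate reward along the whole path.

    The optimization target is the *sum* of rewards over time, so the
    quality of an action may only become apparent much later.
    """
    state = env.reset()                          # gym-style interface (assumed)
    total_reward = 0.0
    for _ in range(horizon):
        action = policy(state)                   # decide now...
        state, reward, done = env.step(action)   # ...observe consequences later
        total_reward += reward
        if done:
            break
    return total_reward
```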

• Let us unpack this through a few application examples…

Page 10

Motivating RL Problem 1: Control

Page 11

The Notion of Feedback Control

Compute corrective actions so as to minimise a measured error

Design involves the following:
- What is a good policy for determining the corrections?
- What performance specifications are achievable by such systems?

Page 12

Feedback Control

• The Proportional-Integral-Derivative Controller Architecture

• ‘Model-free’ technique; works reasonably well in simple (typically first- and second-order) systems
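A minimal discrete-time PID sketch; the class interface, gains, and time step are illustrative assumptions, not from the lecture:

```python
class PID:
    """Discrete-time PID: u = Kp*e + Ki*integral(e) + Kd*(de/dt)."""

    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def correct(self, error):
        """Map the measured error to a corrective action."""
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return (self.kp * error
                + self.ki * self.integral
                + self.kd * derivative)
```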


• More general: consider the feedback architecture u = -Kx

• When applied to a linear system dx/dt = Ax + Bu, the closed-loop dynamics become dx/dt = (A - BK)x

• Using basic linear algebra, you can study dynamic properties

• e.g., choose K to place the eigenvalues and eigenvectors of the closed-loop system
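A small sketch of this eigenvalue view, with a made-up double-integrator system and a hypothetical gain K (the numbers are illustrative only):

```python
import numpy as np

# Illustrative double integrator: state x = [position, velocity]
A = np.array([[0.0, 1.0],
              [0.0, 0.0]])
B = np.array([[0.0],
              [1.0]])
K = np.array([[2.0, 3.0]])        # hypothetical feedback gain

A_cl = A - B @ K                  # closed-loop dynamics under u = -Kx
print(np.linalg.eigvals(A_cl))    # [-1, -2]: negative real parts => stable
```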

Page 13

The Optimal Feedback Controller

• Begin with the following:
– Dynamics: “Velocity” = f(State, Control)
– Cost: an integral involving State/Control squared (e.g., x'Qx)

• Basic idea: Optimal control actions correspond to a cost or value surface in an augmented state space

• Computation: What is the path equivalent of f’(x) = 0?

• In the special case of linear dynamics and quadratic cost, we can explicitly solve the resulting Riccati equation (written out below)
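Written out in the standard continuous-time form (not copied from the slides), the linear-quadratic problem and its Riccati equation are:

```latex
% Standard continuous-time LQR (illustrative write-out)
\dot{x} = Ax + Bu, \qquad
J = \int_0^\infty \left( x^\top Q x + u^\top R u \right) dt
% Optimal policy: linear state feedback
u^* = -Kx, \qquad K = R^{-1} B^\top P
% where P solves the algebraic Riccati equation
A^\top P + P A - P B R^{-1} B^\top P + Q = 0
```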

Page 14

The Linear Quadratic Regulator
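A minimal numerical sketch of an LQR solve, assuming SciPy is available; it uses the same illustrative double integrator as before (all matrices are made up):

```python
import numpy as np
from scipy.linalg import solve_continuous_are

A = np.array([[0.0, 1.0],
              [0.0, 0.0]])
B = np.array([[0.0],
              [1.0]])
Q = np.eye(2)        # state cost weight
R = np.eye(1)        # control cost weight

# Solve the algebraic Riccati equation, then form the optimal gain
P = solve_continuous_are(A, B, Q, R)
K = np.linalg.solve(R, B.T @ P)   # K = R^{-1} B^T P
print("optimal gain K =", K)
```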

Page 15

Idea in Pictures


Main point to take away: the notion of a value surface

Page 16

Connection between Reinforcement Learning and Control Problems

• RL has close connection to stochastic control (and OR)

• Main differences seem to arise from what is ‘given’

• Also, motivations such as adaptation

• In RL, we emphasize sample-based computation and stochastic approximation

Page 17

Example Application 2: Inventory Control

Objective: Minimize total inventory cost

Decisions:
– How much to order?
– When to order?

Page 18

Components of Total Cost

1. Cost of items
2. Cost of ordering
3. Cost of carrying or holding inventory
4. Cost of stockouts
5. Cost of safety stock (extra inventory held to help avoid stockouts)

Page 19

The Economic Order Quantity Model – How Much to Order?

1. Demand is known and constant
2. Lead time is known and constant
3. Receipt of inventory is instantaneous
4. Quantity discounts are not available
5. Variable costs are limited to: ordering cost and carrying (or holding) cost
6. If orders are placed at the right time, stockouts can be avoided

Page 20

Inventory Level Over Time Based on EOQ Assumptions

Page 21

EOQ Model Total Cost

At optimal order quantity (Q*): Carrying cost = Ordering cost

Setting carrying cost equal to ordering cost, (Q/2) Ch = (D/Q) Co, and solving for Q gives

Q* = sqrt(2 D Co / Ch)

where D is the demand, Co the ordering cost per order, and Ch the carrying (holding) cost per unit.
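A quick numerical check of the formula; all figures below are hypothetical:

```python
from math import sqrt

D = 10_000    # annual demand (units/year) - hypothetical
Co = 50.0     # ordering cost per order - hypothetical
Ch = 2.0      # carrying cost per unit per year - hypothetical

Q_star = sqrt(2 * D * Co / Ch)
print(round(Q_star))                          # ~707 units per order
# At Q*, carrying cost equals ordering cost:
print((Q_star / 2) * Ch, (D / Q_star) * Co)   # both ~707.1
```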

Page 22

Realistically, How Much to Order – If these Assumptions Didn’t Hold?

1. Demand is known and constant
2. Lead time is known and constant
3. Receipt of inventory is instantaneous
4. Quantity discounts are not available
5. Variable costs are limited to: ordering cost and carrying (or holding) cost
6. If orders are placed at the right time, stockouts can be avoided

The result may be the need for a more detailed stochastic optimization.

Page 23

Example Application 3: Dialogue Mgmt.


[S. Singh et al., JAIR 2002]

Page 24

Dialogue Management: What is Going On?

• The system interacts with the user by choosing things to say
• The space of possible policies is huge, e.g., 2^42 (about 4.4 × 10^12) in NJFun

Some questions:
- What is the model of dynamics?
- What is being optimized?
- How much experimentation is possible?

Page 25

The Dialogue Management Loop

Page 26

Common Themes in these Examples

• Stochastic optimization – making decisions over time; it may not be immediately obvious how well we’re doing

• Some notion of cost/reward is implicit in the problem – defining it, and the constraints on defining it, are key!

• Often, we may need to work with models that can only generate sample traces from experiments
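When only sample traces are available, expectations must be estimated by averaging rollouts. A minimal Monte Carlo sketch, reusing the hypothetical run_episode helper from the earlier interaction-loop sketch:

```python
def estimate_return(env, policy, n_episodes=1000):
    """Monte Carlo estimate of expected return from sampled traces only."""
    returns = [run_episode(env, policy) for _ in range(n_episodes)]
    return sum(returns) / len(returns)
```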
