reinforcement learning
TRANSCRIPT
![Page 1: REINFORCEMENT LEARNING](https://reader034.vdocuments.us/reader034/viewer/2022051400/55c3803abb61eba52d8b4646/html5/thumbnails/1.jpg)
WELCOME TO ALL
![Page 2: REINFORCEMENT LEARNING](https://reader034.vdocuments.us/reader034/viewer/2022051400/55c3803abb61eba52d8b4646/html5/thumbnails/2.jpg)
REINFORCEMENT LEARNING
NAME : S. THARABAI
COURSE: M.TECH(CSE) II YEAR PT
REG. No : 121322201011
REVIEW No: 2
![Page 3: REINFORCEMENT LEARNING](https://reader034.vdocuments.us/reader034/viewer/2022051400/55c3803abb61eba52d8b4646/html5/thumbnails/3.jpg)
![Page 4: REINFORCEMENT LEARNING](https://reader034.vdocuments.us/reader034/viewer/2022051400/55c3803abb61eba52d8b4646/html5/thumbnails/4.jpg)
![Page 5: REINFORCEMENT LEARNING](https://reader034.vdocuments.us/reader034/viewer/2022051400/55c3803abb61eba52d8b4646/html5/thumbnails/5.jpg)
WHAT IS REINFORCEMENT LEARNING?
![Page 6: REINFORCEMENT LEARNING](https://reader034.vdocuments.us/reader034/viewer/2022051400/55c3803abb61eba52d8b4646/html5/thumbnails/6.jpg)
REINFORCEMENT LEARNING IS DEFINED BY CHARACTERISING A LEARNING PROBLEM
![Page 7: REINFORCEMENT LEARNING](https://reader034.vdocuments.us/reader034/viewer/2022051400/55c3803abb61eba52d8b4646/html5/thumbnails/7.jpg)
REINFORCEMENT LEARNING WORKS ON PRIOR, ACCURATE MEASUREMENTS. IT WORKS ON
CALCULATIONS AND IN PLACES WHERE HUMANS CANNOT REACH. IT
IS INTANGIBLE
![Page 8: REINFORCEMENT LEARNING](https://reader034.vdocuments.us/reader034/viewer/2022051400/55c3803abb61eba52d8b4646/html5/thumbnails/8.jpg)
IT IS NOT ONLY A REAL-WORLD PROBLEM, BUT ALSO TRIES TO
REACH THINGS BEYOND THE WORLD!
![Page 9: REINFORCEMENT LEARNING](https://reader034.vdocuments.us/reader034/viewer/2022051400/55c3803abb61eba52d8b4646/html5/thumbnails/9.jpg)
Reinforcement learning is learning what to do, that is, how to map situations to actions, so as to maximize a numerical reward signal.
Trial-and-error search and delayed reward are the two most important distinguishing features of reinforcement learning.
![Page 10: REINFORCEMENT LEARNING](https://reader034.vdocuments.us/reader034/viewer/2022051400/55c3803abb61eba52d8b4646/html5/thumbnails/10.jpg)
EXAMPLES
➢ CHESS BOARD MOVEMENTS
➢ ROBOTICS MOVEMENTS
➢ DEPTH OF THE PETROLEUM REFINERY AND ADAPTOR
![Page 11: REINFORCEMENT LEARNING](https://reader034.vdocuments.us/reader034/viewer/2022051400/55c3803abb61eba52d8b4646/html5/thumbnails/11.jpg)
Comparison Between Machine Learning And Reinforcement Learning.
Machine learning works on classified and unclassified data sets; reinforcement learning requires prior, accurate measurements. Example: a robotic arm must be told how many steps to move, and in what direction, based on temperature adaptors.
Machine learning addresses real-world problems; reinforcement learning works on calculations and helps in places where man cannot reach. Example: petroleum refineries.
Machine learning is tangible; reinforcement learning is intangible. Example: the Chandrayaan-1 project.
Machine learning works on a sequence of steps; reinforcement learning works on episodes.
![Page 12: REINFORCEMENT LEARNING](https://reader034.vdocuments.us/reader034/viewer/2022051400/55c3803abb61eba52d8b4646/html5/thumbnails/12.jpg)
![Page 13: REINFORCEMENT LEARNING](https://reader034.vdocuments.us/reader034/viewer/2022051400/55c3803abb61eba52d8b4646/html5/thumbnails/13.jpg)
ELEMENTS OF REINFORCEMENT LEARNING
➢ A POLICY
➢ A REWARD FUNCTION
➢ A VALUE FUNCTION
➢ A MODEL OF THE ENVIRONMENT
![Page 14: REINFORCEMENT LEARNING](https://reader034.vdocuments.us/reader034/viewer/2022051400/55c3803abb61eba52d8b4646/html5/thumbnails/14.jpg)
POLICY, REWARD AND VALUE
➢ A POLICY DEFINES THE LEARNING AGENT'S WAY OF BEHAVING AT A GIVEN TIME.
➢ A REWARD FUNCTION DEFINES THE GOAL IN A REINFORCEMENT LEARNING PROBLEM.
➢ A VALUE FUNCTION SPECIFIES WHAT IS GOOD IN THE LONG RUN.
![Page 15: REINFORCEMENT LEARNING](https://reader034.vdocuments.us/reader034/viewer/2022051400/55c3803abb61eba52d8b4646/html5/thumbnails/15.jpg)
ENVIRONMENT
The fourth and final element of some reinforcement learning systems is a model of the environment. This is something that mimics the behavior of the environment. For example, given a state and action, the model might predict the resultant next state and next reward. Models are used for planning
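The idea of a model that mimics the environment can be sketched very simply. The snippet below is a minimal illustration (not from the slides; the class and method names are my own): a tabular, deterministic model that memorizes observed transitions and then "predicts" the next state and reward, which is all that planning needs.

```python
class TabularModel:
    """Deterministic tabular environment model: remembers each observed
    (state, action) -> (next_state, reward) transition."""

    def __init__(self):
        self.transitions = {}

    def update(self, state, action, next_state, reward):
        # Record one piece of real experience.
        self.transitions[(state, action)] = (next_state, reward)

    def predict(self, state, action):
        # Mimic the environment: return the remembered outcome.
        return self.transitions[(state, action)]

model = TabularModel()
model.update("s0", "right", "s1", 0.0)
print(model.predict("s0", "right"))
```

Given a state and action, `predict` returns the resultant next state and reward, exactly the query a planner makes.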
![Page 16: REINFORCEMENT LEARNING](https://reader034.vdocuments.us/reader034/viewer/2022051400/55c3803abb61eba52d8b4646/html5/thumbnails/16.jpg)
![Page 17: REINFORCEMENT LEARNING](https://reader034.vdocuments.us/reader034/viewer/2022051400/55c3803abb61eba52d8b4646/html5/thumbnails/17.jpg)
An n-Armed Bandit Problem
An analogy is that of a doctor choosing between experimental treatments for a series of seriously ill patients. Each play is a treatment selection, and each reward is the survival or well-being of the patient.
![Page 18: REINFORCEMENT LEARNING](https://reader034.vdocuments.us/reader034/viewer/2022051400/55c3803abb61eba52d8b4646/html5/thumbnails/18.jpg)
In our n-armed bandit problem,
Each action has an expected or mean reward given that the action is selected; let us call this the value of that action. If you knew the value of each action, then it would be trivial to solve the -armed bandit problem: you would always select the action with highest value. We assume that you do not know the action values with certainty, although you may have estimates.
![Page 19: REINFORCEMENT LEARNING](https://reader034.vdocuments.us/reader034/viewer/2022051400/55c3803abb61eba52d8b4646/html5/thumbnails/19.jpg)
![Page 20: REINFORCEMENT LEARNING](https://reader034.vdocuments.us/reader034/viewer/2022051400/55c3803abb61eba52d8b4646/html5/thumbnails/20.jpg)
Monte Carlo Methods
Monte Carlo methods are ways of solving the reinforcement learning problem based on averaging sample returns. To ensure that well-defined returns are available, we define Monte Carlo methods only for episodic tasks. That is, we assume experience is divided into episodes, and that all episodes eventually terminate no matter what actions are selected. It is only upon the completion of an episode that value estimates and policies are changed. Monte Carlo methods are thus incremental in an episode-by-episode sense, but not in a step-by-step sense. The term "Monte Carlo" is often used more broadly for any estimation method whose operation involves a significant random component. Here we use it specifically for methods based on averaging complete returns.
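The "averaging complete returns" idea can be shown in a few lines. Below is a rough sketch of first-visit Monte Carlo policy evaluation under assumptions of my own choosing: episodes are given as terminated lists of (state, reward) pairs, returns are undiscounted, and a state's return is the sum of rewards from its first visit onward.

```python
from collections import defaultdict

def first_visit_mc(episodes):
    """Estimate each state's value as the average of returns observed
    from its first visit in each completed episode."""
    returns = defaultdict(list)
    for episode in episodes:
        # G_after[i] = sum of rewards from step i to the end of the episode.
        G_after = [0.0] * (len(episode) + 1)
        for i in range(len(episode) - 1, -1, -1):
            G_after[i] = episode[i][1] + G_after[i + 1]
        # Index of the first visit to each state in this episode.
        firsts = {}
        for i, (s, _) in enumerate(episode):
            firsts.setdefault(s, i)
        for s, i in firsts.items():
            returns[s].append(G_after[i])
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```

Note that estimates change only once an episode has terminated, matching the episode-by-episode incrementality described above.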
![Page 21: REINFORCEMENT LEARNING](https://reader034.vdocuments.us/reader034/viewer/2022051400/55c3803abb61eba52d8b4646/html5/thumbnails/21.jpg)
The backup diagram for Monte Carlo estimation of
![Page 22: REINFORCEMENT LEARNING](https://reader034.vdocuments.us/reader034/viewer/2022051400/55c3803abb61eba52d8b4646/html5/thumbnails/22.jpg)
Approximate state-value functions for the blackjack policy that sticks only on 20 or 21, computed by Monte Carlo policy evaluation.
![Page 23: REINFORCEMENT LEARNING](https://reader034.vdocuments.us/reader034/viewer/2022051400/55c3803abb61eba52d8b4646/html5/thumbnails/23.jpg)
The spectrum ranging from the one-step backups of simple TD methods (step-by-step values) to the
up-until-termination backups of Monte Carlo methods (complete episode returns).
![Page 24: REINFORCEMENT LEARNING](https://reader034.vdocuments.us/reader034/viewer/2022051400/55c3803abb61eba52d8b4646/html5/thumbnails/24.jpg)
Performance of n-step TD methods. The performance measure shown is the root mean-squared (RMS) error between the true values of states and the values found by the learning methods, averaged over the 19 states, the first 10 trials, and 100 different
sequences of walks.
![Page 25: REINFORCEMENT LEARNING](https://reader034.vdocuments.us/reader034/viewer/2022051400/55c3803abb61eba52d8b4646/html5/thumbnails/25.jpg)
A simple maze (inset) and the average learning curves for Dyna-Q agents varying in their number of planning steps per real step. The task is to travel from S to G as
quickly as possible.
![Page 26: REINFORCEMENT LEARNING](https://reader034.vdocuments.us/reader034/viewer/2022051400/55c3803abb61eba52d8b4646/html5/thumbnails/26.jpg)
Action-Value Methods
We denote the true (actual) value of action a as Q*(a), and the estimated value at the t-th play as Qt(a). Recall that the true value of an action is the mean reward received when that action is selected. One natural way to estimate this is by averaging the rewards actually received when the action was selected. Let ka be the number of times action a has been selected prior to the t-th play.
![Page 27: REINFORCEMENT LEARNING](https://reader034.vdocuments.us/reader034/viewer/2022051400/55c3803abb61eba52d8b4646/html5/thumbnails/27.jpg)
![Page 28: REINFORCEMENT LEARNING](https://reader034.vdocuments.us/reader034/viewer/2022051400/55c3803abb61eba52d8b4646/html5/thumbnails/28.jpg)
If action a has been selected ka times, yielding rewards r1, r2, ..., rka,
then its value is estimated to be

Qt(a) = (r1 + r2 + ... + rka) / ka
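The sample average above does not require storing all past rewards; it can be computed incrementally. A small sketch (function name my own):

```python
def update_estimate(q, k, reward):
    """Incremental sample average: Q_{k+1} = Q_k + (r - Q_k)/(k+1),
    which equals the plain average of all k+1 rewards seen so far."""
    k += 1
    q += (reward - q) / k
    return q, k

q, k = 0.0, 0
for r in [1.0, 3.0, 5.0]:
    q, k = update_estimate(q, k, r)
# q now equals the average of the three rewards, 3.0
```

Only the current estimate and a count per action need to be kept, regardless of how many plays occur.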
![Page 29: REINFORCEMENT LEARNING](https://reader034.vdocuments.us/reader034/viewer/2022051400/55c3803abb61eba52d8b4646/html5/thumbnails/29.jpg)
![Page 30: REINFORCEMENT LEARNING](https://reader034.vdocuments.us/reader034/viewer/2022051400/55c3803abb61eba52d8b4646/html5/thumbnails/30.jpg)
DYNAMIC PROGRAMMING
Reinforcement learning is also useful for tracking non-stationary problems.
In such cases it makes sense to weight recent rewards more heavily than long-past ones.
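Weighting recent rewards more heavily is usually done by replacing the shrinking 1/k step size with a constant one, giving an exponential recency-weighted average. A one-line sketch (function name my own, not from the slides):

```python
def nonstationary_update(q, reward, alpha=0.1):
    """Constant step-size update Q <- Q + alpha * (r - Q): recent
    rewards dominate, old rewards decay geometrically, which suits
    non-stationary problems."""
    return q + alpha * (reward - q)

q = 0.0
q = nonstationary_update(q, 1.0, alpha=0.5)  # estimate moves halfway to 1.0
q = nonstationary_update(q, 1.0, alpha=0.5)  # and halfway again
```

The weight on a reward received i steps ago is alpha * (1 - alpha)**i, so the estimate tracks a drifting true value instead of averaging over its whole history.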
![Page 31: REINFORCEMENT LEARNING](https://reader034.vdocuments.us/reader034/viewer/2022051400/55c3803abb61eba52d8b4646/html5/thumbnails/31.jpg)
The Agent-Environment Interface
The reinforcement learning problem is meant to be a straightforward framing of the problem of learning from interaction to achieve a goal. The learner and decision-maker is called the agent. The thing it interacts with, comprising everything outside the agent, is called the environment.
![Page 32: REINFORCEMENT LEARNING](https://reader034.vdocuments.us/reader034/viewer/2022051400/55c3803abb61eba52d8b4646/html5/thumbnails/32.jpg)
The Markov Property
In the reinforcement learning framework, the agent makes its decisions as a function of a signal from the environment called the environment's state. In particular, we formally define a property of environments and their state signals that is of particular interest, called the Markov property.
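The Markov property can be stated formally. In the standard formulation (this equation is reconstructed from that formulation, not taken from the slides), a state signal is Markov if the environment's one-step dynamics depend only on the current state and action:

```latex
\Pr\{s_{t+1}=s',\, r_{t+1}=r \mid s_t, a_t\}
= \Pr\{s_{t+1}=s',\, r_{t+1}=r \mid s_t, a_t, r_t, s_{t-1}, a_{t-1}, \ldots, r_1, s_0, a_0\}
```

for all s', r, and histories. In words: the current state summarizes everything about the past that matters for predicting the future.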
![Page 33: REINFORCEMENT LEARNING](https://reader034.vdocuments.us/reader034/viewer/2022051400/55c3803abb61eba52d8b4646/html5/thumbnails/33.jpg)
Example 1: Bioreactor Suppose reinforcement learning is being applied to determine moment-by-moment temperatures and stirring rates for a bioreactor (a large vat of nutrients and bacteria used to produce useful chemicals)
![Page 34: REINFORCEMENT LEARNING](https://reader034.vdocuments.us/reader034/viewer/2022051400/55c3803abb61eba52d8b4646/html5/thumbnails/34.jpg)
Example 2: Pick-and-Place Robot Consider using reinforcement learning to control the motion of a robot arm in a repetitive pick-and-place task. If we want to learn movements that are fast and smooth, the learning agent will have to control the motors directly and have low-latency information about the current positions and velocities of the mechanical linkages.
![Page 35: REINFORCEMENT LEARNING](https://reader034.vdocuments.us/reader034/viewer/2022051400/55c3803abb61eba52d8b4646/html5/thumbnails/35.jpg)
Sample Remote RCXLisp Code Illustrating Sensor-Based Motor Control
LISP CODING
![Page 36: REINFORCEMENT LEARNING](https://reader034.vdocuments.us/reader034/viewer/2022051400/55c3803abb61eba52d8b4646/html5/thumbnails/36.jpg)
```lisp
(DEFUN full-speed-ahead (r s dir)
  "This will make the RCX in stream R go at speed S in direction DIR
   until the touch sensor on its '2' port returns 1."
  (LET ((result 0))
    ;; In case things are in an inconsistent state, turn everything off first.
    (set-effector-state '(:A :B :C) :power :off r)
    (set-effector-state '(:A :C) :speed s r)
    ;; DIR is EQ to :forward, :backward, or :toggle; no motion will occur
    ;; until the next call to set-effector-state.
    (set-effector-state '(:A :C) :direction dir r)
    (set-sensor-state 2 :type :touch :mode :boolean r)
    (set-effector-state '(:A :C) :power :on r)
    (LOOP ; this loop repeats until sensor 2 returns a 1
      (SETF result (sensor 2 r))
      ;; NUMBERP keeps = from signalling an error if the sensor
      ;; function returns NIL due to a timeout.
      (WHEN (AND (NUMBERP result) (= result 1))
        (RETURN)))
    (set-effector-state '(:A :C) :power :float r)))

(DEFUN testing ()
  (with-open-com-port (port :LEGO-USB-TOWER)
    (with-open-rcx-stream (rcx10 port :timeout-interval 80 :rcx-unit 10)
      ;; Increase/decrease the 80 ms serial timeout value depending on
      ;; environmental factors like ambient light.
      (WHEN (alivep rcx10)
        (full-speed-ahead rcx10 5 :forward)))))
```
![Page 37: REINFORCEMENT LEARNING](https://reader034.vdocuments.us/reader034/viewer/2022051400/55c3803abb61eba52d8b4646/html5/thumbnails/37.jpg)
Integrating Planning, Acting, and Learning
When planning is done on-line, while interacting with the environment, a number of interesting issues arise. New information gained from the interaction may change the model and thereby interact with planning. It may be desirable to customize the planning process in some way to the states or decisions currently under consideration, or expected in the near future.
![Page 38: REINFORCEMENT LEARNING](https://reader034.vdocuments.us/reader034/viewer/2022051400/55c3803abb61eba52d8b4646/html5/thumbnails/38.jpg)
Relationships among learning, planning, and acting.
![Page 39: REINFORCEMENT LEARNING](https://reader034.vdocuments.us/reader034/viewer/2022051400/55c3803abb61eba52d8b4646/html5/thumbnails/39.jpg)
Dyna-Q, a simple architecture integrating the major functions needed in an on-line planning agent. Each function appears in
Dyna-Q in a simple, almost trivial, form
![Page 40: REINFORCEMENT LEARNING](https://reader034.vdocuments.us/reader034/viewer/2022051400/55c3803abb61eba52d8b4646/html5/thumbnails/40.jpg)
The complete algorithm for Dyna-Q.
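The three functions Dyna-Q integrates, direct RL, model learning, and planning, can be sketched compactly. The following is a rough tabular Dyna-Q on a toy deterministic chain task of my own invention (not the maze from the slides; all names are hypothetical): after each real step it does one Q-learning update, records the transition in the model, and then performs several simulated Q-learning updates replayed from the model.

```python
import random

def dyna_q_chain(n_states=5, episodes=30, n_planning=10,
                 alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Dyna-Q on a chain: states 0..n_states-1, actions
    left/right, reward 1 for reaching the rightmost (goal) state."""
    rng = random.Random(seed)
    actions = ("left", "right")
    Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
    model = {}  # (s, a) -> (reward, next_state), learned from experience

    def step(s, a):  # deterministic toy environment
        s2 = min(s + 1, n_states - 1) if a == "right" else max(s - 1, 0)
        return (1.0 if s2 == n_states - 1 else 0.0), s2

    def choose(s):  # epsilon-greedy behavior policy
        if rng.random() < epsilon:
            return rng.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s = 0
        while s != n_states - 1:
            a = choose(s)
            r, s2 = step(s, a)
            # (a) Direct RL: one Q-learning update from real experience.
            Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in actions)
                                  - Q[(s, a)])
            # (b) Model learning: remember the observed transition.
            model[(s, a)] = (r, s2)
            # (c) Planning: n simulated updates from remembered experience.
            for _ in range(n_planning):
                (ps, pa), (pr, ps2) = rng.choice(list(model.items()))
                Q[(ps, pa)] += alpha * (pr + gamma * max(Q[(ps2, b)]
                                        for b in actions) - Q[(ps, pa)])
            s = s2
    return Q
```

With `n_planning = 0` this degenerates to plain one-step Q-learning; increasing it squeezes more value out of each real step, which is exactly what the learning-curve figure below illustrates on the maze.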
![Page 41: REINFORCEMENT LEARNING](https://reader034.vdocuments.us/reader034/viewer/2022051400/55c3803abb61eba52d8b4646/html5/thumbnails/41.jpg)
A simple maze (inset) and the average learning curves for Dyna-Q agents varying in their number of
planning steps per real step. The task is to travel from S to G as quickly as possible.
![Page 42: REINFORCEMENT LEARNING](https://reader034.vdocuments.us/reader034/viewer/2022051400/55c3803abb61eba52d8b4646/html5/thumbnails/42.jpg)
Policies found by planning and nonplanning Dyna-Q agents halfway through the second episode. The arrows
indicate the greedy action in each state; no arrow is shown for a state if all of its action values are equal. The
black square indicates the location of the agent.
![Page 43: REINFORCEMENT LEARNING](https://reader034.vdocuments.us/reader034/viewer/2022051400/55c3803abb61eba52d8b4646/html5/thumbnails/43.jpg)
Dimensions of Reinforcement Learning
The dimensions of reinforcement learning have to do with the kind of backup used to improve the value function. The vertical dimension is whether they are sample backups (based on a sample trajectory) or full backups (based on a distribution of possible trajectories). Full backups of course require a model, whereas sample backups can be done either with or without a model (another dimension of variation).
![Page 44: REINFORCEMENT LEARNING](https://reader034.vdocuments.us/reader034/viewer/2022051400/55c3803abb61eba52d8b4646/html5/thumbnails/44.jpg)
A slice of the space of reinforcement learning methods.
![Page 45: REINFORCEMENT LEARNING](https://reader034.vdocuments.us/reader034/viewer/2022051400/55c3803abb61eba52d8b4646/html5/thumbnails/45.jpg)
The horizontal dimension corresponds to the depth of backups, that is, to the degree of bootstrapping. At three of the four corners of the space are the three primary methods for estimating values: DP, TD, and Monte Carlo.
A third important dimension is that of function approximation. Function approximation can be viewed as an orthogonal spectrum of possibilities ranging from tabular methods at one extreme through state aggregation, a variety of linear methods, and then a diverse set of nonlinear methods.
![Page 46: REINFORCEMENT LEARNING](https://reader034.vdocuments.us/reader034/viewer/2022051400/55c3803abb61eba52d8b4646/html5/thumbnails/46.jpg)
In addition to the four dimensions just discussed, we have identified a number of others throughout the book:
Definition of return
Is the task episodic or continuing, discounted or undiscounted?
Action values vs. state values vs. afterstate values
What kind of values should be estimated? If only state values are estimated, then either a model or a separate policy (as in actor-critic methods) is required for action selection.
Action selection/exploration
How are actions selected to ensure a suitable trade-off between exploration and exploitation? We have considered only the simplest ways to do this: ε-greedy and softmax action selection, and optimistic initialization of values.
![Page 47: REINFORCEMENT LEARNING](https://reader034.vdocuments.us/reader034/viewer/2022051400/55c3803abb61eba52d8b4646/html5/thumbnails/47.jpg)
Synchronous vs. asynchronous
Are the backups for all states performed simultaneously or one by one in some order?
Replacing vs. accumulating traces
If eligibility traces are used, which kind is most appropriate?
Real vs. simulated
Should one backup real experience or simulated experience? If both, how much of each?
![Page 48: REINFORCEMENT LEARNING](https://reader034.vdocuments.us/reader034/viewer/2022051400/55c3803abb61eba52d8b4646/html5/thumbnails/48.jpg)
Location of backups
What states or state-action pairs should be backed up? Model-free methods can choose only among the states and state-action pairs actually encountered, but model-based methods can choose arbitrarily. There are many potent possibilities here.
Timing of backups
Should backups be done as part of selecting actions, or only afterward?
Memory for backups
How long should backed-up values be retained? Should they be retained permanently, or only while computing an action selection, as in heuristic search?
![Page 49: REINFORCEMENT LEARNING](https://reader034.vdocuments.us/reader034/viewer/2022051400/55c3803abb61eba52d8b4646/html5/thumbnails/49.jpg)
Elevator Dispatching
Waiting for an elevator is a familiar situation:
We press a button and then wait for an elevator to arrive traveling in the right direction.
We may have to wait a long time if there are too many passengers or not enough elevators.
Just how long we wait depends on the dispatching strategy the elevators use to decide where to go.
For example, if passengers on several floors have requested pickups, which should be served first? If there are no pickup requests, how should the elevators distribute themselves to await the next request?
Elevator dispatching is a good example of a stochastic optimal control problem of economic importance that is too large to solve by classical techniques such as dynamic programming.
![Page 50: REINFORCEMENT LEARNING](https://reader034.vdocuments.us/reader034/viewer/2022051400/55c3803abb61eba52d8b4646/html5/thumbnails/50.jpg)
Four elevators in a ten-story building
![Page 51: REINFORCEMENT LEARNING](https://reader034.vdocuments.us/reader034/viewer/2022051400/55c3803abb61eba52d8b4646/html5/thumbnails/51.jpg)
Job-Shop Scheduling
Many jobs in industry and elsewhere require completing a collection of tasks while satisfying temporal and resource constraints. Temporal constraints say that some tasks have to be finished before others can be started; resource constraints say that two tasks requiring the same resource cannot be done simultaneously (e.g., the same machine cannot do two tasks at once). The objective is to create a schedule specifying when each task is to begin and what resources it will use that satisfies all the constraints while taking as little overall time as possible. This is the job-shop scheduling problem.
![Page 52: REINFORCEMENT LEARNING](https://reader034.vdocuments.us/reader034/viewer/2022051400/55c3803abb61eba52d8b4646/html5/thumbnails/52.jpg)
JOB-SHOP SCHEDULING APPLICATION AT NASA
Consider the NASA space shuttle payload processing problem (SSPP), which requires scheduling the tasks required for installation and testing of shuttle cargo bay payloads. An SSPP typically requires scheduling for two to six shuttle missions, each requiring between 34 and 164 tasks. An example of a task is MISSION-SEQUENCE-TEST.
![Page 53: REINFORCEMENT LEARNING](https://reader034.vdocuments.us/reader034/viewer/2022051400/55c3803abb61eba52d8b4646/html5/thumbnails/53.jpg)
CONCLUSION
All of the reinforcement learning methods we have explored in this presentation have three key ideas in common.
First, the objective of all of them is the estimation of value functions.
Second, all operate by backing up values along actual or possible state trajectories.
Third, all follow the general strategy of generalized policy iteration (GPI), meaning that they maintain an approximate value function and an approximate policy, and they continually try to improve each on the basis of the other.
We suggest that value functions, backups, and GPI are powerful organizing principles potentially relevant to any model of intelligence.
![Page 54: REINFORCEMENT LEARNING](https://reader034.vdocuments.us/reader034/viewer/2022051400/55c3803abb61eba52d8b4646/html5/thumbnails/54.jpg)
THANK YOU!