ORF 544
Stochastic Optimization and Learning
Spring, 2019
Warren Powell
Princeton University
http://www.castlelab.princeton.edu
© 2018 W.B. Powell
Week 4
Chapter 7: Derivative-free stochastic search
© 2019 Warren Powell Slide 2
Derivative-free stochastic search
Notes:
» The material for this week and next will all be drawn from chapter 7.
» Chapter 7 is 60 pages, and desperately in need of a rewrite. I don’t have time to do this, but the lectures will follow a new (and improved) outline.
» Derivative-free stochastic optimization is an extremely rich problem class. We will use this to illustrate four fundamental classes of policies, which can be organized along two core strategies:
• Policy search – Here we will search within a class of policies to identify which work best over time/iterations.
• Lookahead policies – These are policies that are constructed to optimize the value of an experiment plus the value of the downstream state.
Derivative-free stochastic search
Week 4 (this week): We will cover:
» Introduction to derivative-free stochastic optimization
» Introduction to two strategies for developing policies:
• Policy search class
• Lookahead class
» Then we will focus on the “policy search” class, which can be divided into two classes:
• Policy function approximations (PFAs)
• Cost function approximations (CFAs)
» We will do the more difficult lookahead classes next week.
Introduction to derivative-free stochastic search
Sports
Finding the best player
» We have a set of players from which to choose a team
» The effectiveness of a certain team is best revealed by playing a game
» We maximize the total number of games won in the season
Optimal learning in diabetes
How do we find the best treatment for diabetes?
» The standard treatment is a medication called metformin, which works for about 70 percent of patients.
» What do we do when metformin does not work for a patient?
» There are about 20 other treatments, and it is a process of trial and error. Doctors need to get through this process as quickly as possible.
Drug discovery
Biomedical research
» How do we find the best drug to cure cancer?
» There are millions of combinations, with laboratory budgets that cannot test everything.
» We need a method for sequencing experiments.
Drug discovery
Designing molecules
» X and Y are sites where we can hang substituents to change the behavior of the molecule. We approximate the performance using a linear belief model:

    Y = \theta_0 + \sum_{\text{sites } i} \sum_{\text{substituents } j} \theta_{ij} X_{ij}

» How do we sequence experiments to learn the best molecule as quickly as possible?
Designing menus
What should you offer?
» No-one will like everything on a menu.
» The trick is to find choices that will satisfy people.
» You will probably need to observe the response to a particular menu over a period of time.
» So, observations are time consuming.
Derivative-free stochastic search
Settings
The multiarmed bandit problem
Multiarmed bandit problems
Bandit problems and objective functions
» The classical bandit problem is cumulative reward:

    \max_\pi \mathbb{E} \sum_{n=0}^{N-1} F(X^\pi(S^n), W^{n+1})

where
• S^n = the state of knowledge (what we know about each slot machine),
• x^n = X^\pi(S^n) = the choice of the next “arm” to play,
• W^{n+1} = the new information (the “winnings”).
» The essential characteristic of bandit problems is the exploration vs. exploitation tradeoff: do you try something that does not look as good, potentially incurring a lower reward, so that you can learn and make better decisions in the future?
» Then the bandit community discovered that the same tradeoff exists in final reward. These problems became known as “best arm” bandit problems in the bandit vocabulary.
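The cumulative-reward objective above can be simulated directly. Below is a minimal sketch (the policy, parameters, and reward distribution are illustrative, not from the slides) that accumulates the sum of F(X^π(S^n), W^{n+1}) for an epsilon-greedy policy on a Gaussian bandit:

```python
import random

def simulate(mu, N=1000, eps=0.1, seed=0):
    """Accumulate winnings for an epsilon-greedy policy X^pi(S^n)
    on a bandit whose arms pay Gaussian rewards with means mu."""
    rng = random.Random(seed)
    M = len(mu)
    mu_bar = [0.0] * M          # belief state S^n: estimated mean of each arm
    counts = [0] * M
    total = 0.0
    for n in range(N):
        if rng.random() < eps:  # explore: try an arm that may not look best
            x = rng.randrange(M)
        else:                   # exploit: play the arm that looks best now
            x = max(range(M), key=lambda i: mu_bar[i])
        W = rng.gauss(mu[x], 1.0)                 # new information ("winnings")
        counts[x] += 1
        mu_bar[x] += (W - mu_bar[x]) / counts[x]  # update the belief state
        total += W                                 # cumulative reward
    return total
```

Even this simple policy illustrates the tradeoff: with eps = 0 the policy can lock onto an arm that merely looked good early, while a little exploration lets it find the truly best arm.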
Multiarmed bandit problems
Dimensions of a “bandit” problem:
» The “arms” (decisions) may be:
• Binary (A/B testing, stopping problems)
• Discrete alternatives (drug, catalyst, …)
• Continuous choices (price)
• Vector-valued (basketball team, products, movies, …)
• Multiattribute (attributes of a movie, song, person)
• Static vs. dynamic choice sets
• Sequential vs. batch
» Information (what we observe):
• Success-failure/discrete outcome
• Exponential family (e.g. Gaussian, exponential, …)
• Heavy-tailed (e.g. Cauchy)
• Data-driven (distribution unknown)
• Stationary vs. nonstationary processes
• Lagged responses?
• Adversarial?
Multiarmed bandit problems
Dimensions of a “bandit” problem:
» Belief models
• Lookup tables (these are most common)
  – Independent or correlated beliefs
• Parametric models
  – Linear or nonlinear in the parameters
• Nonparametric models
  – Locally linear
  – Deep neural networks/SVM
• Bayesian vs. frequentist
» Objective function
• Expected performance (e.g. regret)
• Offline (final reward) vs. online (cumulative reward)
  – Just interested in final design?
  – Or optimizing while learning?
• Risk metrics
Multiarmed bandit problems
What is a “bandit problem”?
» The literature seems to characterize a “bandit problem” as any problem where a policy has to balance exploration vs. exploitation.
» But this means that a bandit “problem” is defined by how it is solved. E.g., if you use a pure exploration policy, is it a bandit problem?
My definition:
» Any sequential learning problem:
• Maximizing cumulative rewards (finite or infinite horizon)
• Maximizing the terminal reward with a finite budget
» Fundamental elements of a “bandit” problem:
• A belief model that describes what we know about the function
• The ability to learn sequentially about the function
» I prefer to distinguish three problem classes:
• Active learning – These problems arise when our decisions affect the information that arrives. This opens the door to making decisions that balance current performance against learning to improve future performance.
• Passive learning – Here we may learn as we go, but we do not directly influence the learning process through our decisions.
• No learning – This would be for problems where there is no element of learning.
An energy storage problem
Transition function
[Figure: an energy system with energy source E, grid G, battery B, and load L]

    E_{t+1} = E_t + \hat{E}_{t+1}
    p_{t+1} = \theta_0 p_t + \theta_1 p_{t-1} + \theta_2 p_{t-2} + \varepsilon^p_{t+1}
    D_{t+1} = f^D(D_t) + \varepsilon^D_{t+1}
    R^{battery}_{t+1} = R^{battery}_t + x_t
An energy storage problem
Types of learning:
» No learning (the coefficients \theta are known):

    p_{t+1} = \theta_0 p_t + \theta_1 p_{t-1} + \theta_2 p_{t-2} + \varepsilon^p_{t+1}

» Passive learning (learn the estimates \bar{\theta}_t from price data):

    p_{t+1} = \bar{\theta}_{t0} p_t + \bar{\theta}_{t1} p_{t-1} + \bar{\theta}_{t2} p_{t-2} + \varepsilon^p_{t+1}

We have no control over the evolution of prices.

» Active learning (“bandit problems”):

    p_{t+1} = \bar{\theta}_{t0} p_t + \bar{\theta}_{t1} p_{t-1} + \bar{\theta}_{t2} p_{t-2} + \bar{\theta}_{t3} x^{GB}_t + \varepsilon^p_{t+1}

If we have to learn the parameters, then we have to introduce this learning process into the dynamics.
Learning in stochastic optimization
Learning the pricing model:
» Let p_{t+1} be the new price and let \bar{p}_t = (p_t, p_{t-1}, p_{t-2}), so that

    F^{price}(\bar{p}_t|\theta) = \theta_0 p_t + \theta_1 p_{t-1} + \theta_2 p_{t-2} = \theta^T \bar{p}_t

» We update our estimate using our recursive least squares equations:

    \bar{\theta}_{t+1} = \bar{\theta}_t - \frac{1}{\gamma_{t+1}} B_t \bar{p}_t \varepsilon_{t+1}
    \varepsilon_{t+1} = F^{price}(\bar{p}_t|\bar{\theta}_t) - p_{t+1}
    B_{t+1} = B_t - \frac{1}{\gamma_{t+1}} \left( B_t \bar{p}_t \bar{p}_t^T B_t \right)
    \gamma_{t+1} = 1 + \bar{p}_t^T B_t \bar{p}_t
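The recursive least squares updates can be implemented in a few lines. This is a sketch, assuming the regressor ordering (p_t, p_{t-1}, p_{t-2}) and the initialization B_0 = I; the price coefficients and noise level are illustrative:

```python
import numpy as np

def rls_update(theta, B, p_bar, p_next):
    """One recursive least squares step for F(p_bar|theta) = theta^T p_bar."""
    eps = theta @ p_bar - p_next          # epsilon_{t+1}: prediction error
    gamma = 1.0 + p_bar @ B @ p_bar       # gamma_{t+1}: scalar normalizer
    Bp = B @ p_bar
    theta_new = theta - Bp * eps / gamma  # theta_{t+1}
    B_new = B - np.outer(Bp, Bp) / gamma  # B_{t+1}
    return theta_new, B_new

# Recover the coefficients of a simulated price process.
rng = np.random.default_rng(0)
true_theta = np.array([0.5, 0.3, 0.1])    # illustrative coefficients
theta, B = np.zeros(3), np.eye(3)         # theta_0 = 0, B_0 = I
p_hist = [1.0, 0.8, 0.6]
for t in range(500):
    p_bar = np.array([p_hist[-1], p_hist[-2], p_hist[-3]])
    p_next = true_theta @ p_bar + 0.1 * rng.standard_normal()
    theta, B = rls_update(theta, B, p_bar, p_next)
    p_hist.append(p_next)
```

No matrix inversion is needed: B_t plays the role of the inverse regressor matrix and is updated recursively, which is what makes this practical inside a sequential learning loop.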
An energy storage problem
Types of learning:
» No learning (the coefficients \theta are known):

    p_{t+1} = \theta_0 p_t + \theta_1 p_{t-1} + \theta_2 p_{t-2} + \varepsilon^p_{t+1}

» Passive learning (learn \bar{\theta}_t from price data):

    p_{t+1} = \bar{\theta}_{t0} p_t + \bar{\theta}_{t1} p_{t-1} + \bar{\theta}_{t2} p_{t-2} + \varepsilon^p_{t+1}

» Active learning (“bandit problems”), where x^{GB}_t is the buy/sell decision:

    p_{t+1} = \bar{\theta}_{t0} p_t + \bar{\theta}_{t1} p_{t-1} + \bar{\theta}_{t2} p_{t-2} + \bar{\theta}_{t3} x^{GB}_t + \varepsilon^p_{t+1}

Our decisions influence the prices we observe, which helps with learning.
Classes of problems
Major problem classes
Special structure
» There are special cases where we can solve

    \max_x \mathbb{E} F(x, W)

exactly (the expectation makes this deterministic!). But not very many.

Sampled problems (SAA, scenario trees)
» If the only problem is that we cannot compute the expectation, we might solve a sampled approximation (also deterministic!):

    \max_x \hat{\mathbb{E}} F(x, W) = \max_x \frac{1}{N} \sum_{n=1}^{N} F(x, W^n)

Adaptive learning algorithms
» This is what we have to turn to for most problems, and is the focus of this tutorial.
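The sampled approximation is easy to see in code. A sketch using a newsvendor-style F(x, W) (the profit function, distribution, and candidate set are illustrative): once the sample W^1, …, W^N is drawn, the maximization is a purely deterministic search:

```python
import random

def F(x, W):
    """Newsvendor-style profit: price * min(order x, demand W) - cost * x."""
    p, c = 2.0, 1.0
    return p * min(x, W) - c * x

def saa_solve(xs, samples):
    """Sampled problem: max_x (1/N) sum_n F(x, W^n).
    Deterministic once the sample W^1..W^N is fixed."""
    def avg(x):
        return sum(F(x, W) for W in samples) / len(samples)
    return max(xs, key=avg)

rng = random.Random(1)
samples = [rng.uniform(0.0, 10.0) for _ in range(2000)]  # W ~ Uniform(0, 10)
xs = [i * 0.5 for i in range(21)]                        # candidate decisions
best = saa_solve(xs, samples)
```

For these illustrative numbers the true optimum is x = 5 (critical ratio (p − c)/p = 0.5 on Uniform(0, 10)), and with a large enough sample the SAA solution lands nearby.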
Major problem classes
State independent problems
» The problem does not depend on the state of the system, e.g. the newsvendor problem

    \max_x \mathbb{E} F(x, W) = \mathbb{E} \left[ p \min(x, W) - cx \right]

» The only state variable is what we know (or believe) about the unknown function \mathbb{E} F(x, W), called the belief state B_t, so S_t = B_t.

State dependent problems
» Now the problem may depend on what we know at time t:

    \max_{0 \le x \le R_t} \mathbb{E} C(S_t, x, W) = \mathbb{E} \left[ p_t \min(x, W) - cx \right]

» Now the state is S_t = (R_t, p_t, B_t).
Major problem classes
Offline (final reward)
» We can iteratively search for the best solution, but only care about the final answer.
» Asymptotic formulation:

    \max_x \mathbb{E} F(x, W)

» Finite horizon formulation:

    \max_\pi \mathbb{E} F(x^{\pi,N}, W)

These offline formulations are known as “ranking and selection” or “stochastic search.”

Online (cumulative reward)
» We have to learn as we go:

    \max_\pi \mathbb{E} \sum_{n=0}^{N-1} F(X^\pi(S^n), W^{n+1})
Major problem classes
There are entire fields of stochastic optimization built around “final reward” and “cumulative reward” objectives:
» Final reward
• Stochastic search (e.g. derivative-based stochastic optimization)
• Ranking and selection (finding the best out of a set of choices)
» Cumulative reward
• Multiarmed bandit problems
• Classical dynamic programming, optimal control, stochastic programming, …
» Our presentation will not care whether we are optimizing final or cumulative reward – it is just an objective function.
Major problem classes
“Offline” vs. “online” learning» My view (or perhaps the view of stochastic optimization):
• “offline learning” is learning in the computer. We do not care about making mistakes as long as we get a good answer in the end.
• “online learning” would be learning in the field, where you have to live with your mistakes.
» The machine learning community uses these terms differently:
• “offline learning” means batch – you have a batch dataset from which you fit a model.
• “online learning” is sequential, as would happen if data is arriving in the field over time.
• The problem is that there are many sequential algorithms that are used in “offline” (e.g. laboratory) settings, and the machine learning community calls these “online.”
Major problem classes
Notes:
» For chapter 7, we will focus on the top row, “state independent problems.”
» These are pure learning problems, where we are trying to learn a deterministic decision or policy.
Modeling
Modeling
Any learning problem can be written

    (S^0, x^0, W^1, x^1, W^2, \ldots, x^{N-1}, W^N)

where:
» S^n = \left( (\bar{\mu}^n_x, \beta^n_x), x \in X = \{x_1, \ldots, x_M\} \right) = belief about the value of each \mu_x.
» S^0 = initial belief.
» x^n = decision made using information from the first n experiments.
» x^0 = initial decision based on the information in S^0.
» W^n = observation from the nth experiment, n = 1, 2, \ldots, N.

Decisions are made using a policy: x^n = X^\pi(S^n).
Modeling
All sequential decision problems can be described using:
» States
• Belief states – what do we know about \mathbb{E} F(x, W), or any other function we are learning (transition functions, value functions, policies)
• Physical and informational states (we will get to these later).
» Decisions – What are our choices?
• Pure information collection decisions (e.g. run an MRI, purchase a credit report)
• Implementation decisions, but which may also affect what we learn.
» Exogenous information
• Initial belief/prior
• What we learn from an experiment
» Transition function
• Updating beliefs (statistical estimation)
• Possibly updating physical state.
» Objective function
• Performance metrics
• Finding the best policy
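The five elements map directly onto a simulation loop. A minimal sketch (the names, prior, and Gaussian observation model are illustrative) with a pure-exploration policy:

```python
import random

def run_learning_loop(mu, policy, N=50, seed=0):
    """Skeleton of the five elements: state (beliefs), decision (policy),
    exogenous information (W), transition (belief update), objective."""
    rng = random.Random(seed)
    M = len(mu)
    # State S^0: initial belief (prior) about each alternative
    S = {"mu_bar": [0.0] * M, "counts": [0] * M}
    total = 0.0
    for n in range(N):
        x = policy(S, rng)                 # decision: x^n = X^pi(S^n)
        W = rng.gauss(mu[x], 1.0)          # exogenous information: W^{n+1}
        S["counts"][x] += 1                # transition function: update beliefs
        S["mu_bar"][x] += (W - S["mu_bar"][x]) / S["counts"][x]
        total += W                         # objective: cumulative reward
    return S, total

def explore_policy(S, rng):
    """Pure exploration: pick an alternative at random."""
    return rng.randrange(len(S["mu_bar"]))
```

Swapping in a different `policy` function changes the behavior of the whole system without touching the model, which is exactly the separation the five-element framing is after.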
Modeling
State variables
» For derivative-free, we often just have a belief state, capturing what we know about our function:
• Lookup tables
• Parametric models
• Nonparametric models
» Physical state
• What node we are at in a network
• How much inventory do we have
• We will get to this in the second half of the course
» Information state
• Weather forecast, current price
• The information state at time t may be independent of the state at t-1.
• Or the state at time t depends on the state at t-1:
  – Information evolves exogenously
  – Our decisions influence the state (e.g. selling stock)
Modeling
Decisions
» We have a finite set of choices:

    x \in X = \{x_1, x_2, \ldots, x_M\}

» Examples:
• Experimenting with different drugs to kill cancer in mice.
• Running simulations to plan operations (e.g. for Amazon)
• Evaluating different players for baseball during spring training.
• Testing different materials to maximize the energy density of a battery.
• Test marketing new products in specific regions.
» We only care about the performance at the end of a set of N experiments. We do not care about how well we do along the way.
Modeling
Types of decisions
» Binary: x \in X = \{0, 1\}
» Finite: x \in X = \{1, 2, \ldots, M\}
» Continuous scalar: x \in X = [a, b]
» Continuous vector: x = (x_1, \ldots, x_K), x_k \in \mathbb{R}
» Discrete vector: x = (x_1, \ldots, x_K), x_k \in \{0, 1, 2, \ldots\}
» Categorical: x = (a_1, \ldots, a_I), where a_i is a category (e.g. red/green/blue)

There are entire fields dedicated to particular classes of decisions.
Modeling
Exogenous information
» Simplest: let W^{n+1}_x be the observed performance of alternative x = x^n.
» If we are using a lookup table representation with a Bayesian belief model, we would assume:

    W^{n+1}_x = \mu_x + \varepsilon^{n+1}

» If we are using a parametric representation, we would write:

    W^{n+1} = F(x^n|\theta) + \varepsilon^{n+1}
Modeling
Transition function:
» Now use W^{n+1} to update our belief model:
• Frequentist:

    \bar{\mu}^{n+1}_x = \bar{\mu}^n_x + \frac{1}{N^n_x + 1} \left( W^{n+1}_x - \bar{\mu}^n_x \right) \text{ if } x^n = x; \quad \bar{\mu}^n_x \text{ otherwise}
    N^{n+1}_x = N^n_x + 1 \text{ if } x^n = x; \quad N^n_x \text{ otherwise}

• Bayesian:

    \bar{\mu}^{n+1}_x = \frac{\beta^n_x \bar{\mu}^n_x + \beta^W W^{n+1}_x}{\beta^n_x + \beta^W} \text{ if } x^n = x; \quad \bar{\mu}^n_x \text{ otherwise}
    \beta^{n+1}_x = \beta^n_x + \beta^W \text{ if } x^n = x; \quad \beta^n_x \text{ otherwise}

» Use any of the recursive learning methods from chapter 3.
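Both updates are one-liners for the alternative x = x^n that was just tested (untested alternatives keep their beliefs unchanged). A sketch:

```python
def frequentist_update(mu_bar, N_x, W):
    """Running average: mu^{n+1} = mu^n + (W - mu^n)/(N+1) for the tested x."""
    N_new = N_x + 1
    return mu_bar + (W - mu_bar) / N_new, N_new

def bayesian_update(mu_bar, beta, W, beta_W):
    """Precision-weighted Gaussian update:
    mu^{n+1} = (beta^n mu^n + beta^W W) / (beta^n + beta^W),
    beta^{n+1} = beta^n + beta^W."""
    mu_new = (beta * mu_bar + beta_W * W) / (beta + beta_W)
    return mu_new, beta + beta_W
```

Note the structural similarity: both are weighted averages of the old estimate and the new observation; the Bayesian version weights by precision (1/variance) instead of by the observation count.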
Modeling
Objective functions for:
» State independent and state dependent problems
» Final reward and cumulative reward
Modeling
Simulating objective functions
» It is important to know how to simulate objective functions. We will get to this in chapter 9.
Modeling
Learning policies:
» Approximate dynamic programming
» Q-learning
» SDDP
» …
Objective functions
We will focus on these in the second half of the course.
Modeling
“Online” (cumulative reward) dynamic programming is recognized as the “dynamic programming problem,” but the entire literature on solving dynamic programs describes class (4) problems. Class (3) appears to be an open problem class.
Objective functions
Belief models
Belief models
Notes:
» With derivative-based search, we had gradients.
» With derivative-free search, we have to form a belief model about \mathbb{E} F(x, W). Classes of belief models are:
• Lookup table
  – Independent beliefs
  – Correlated beliefs
• Parametric models
  – Linear
  – Nonlinear
    » Logistic regression
    » Step functions (sell if price is over some number)
    » Neural networks
Approximation strategies
» Lookup tables
• Independent beliefs
• Correlated beliefs
» Linear parametric models
• Linear models
• Sparse-linear
• Tree regression
» Nonlinear parametric models
• Logistic regression
• Neural networks
» Nonparametric models
• Gaussian process regression
• Kernel regression
• Support vector machines
• Deep neural networks
Approximating strategies
Lookup tables
» Independent beliefs:

    \bar{\mu}^n_x \approx \mathbb{E} F(x, W), \quad x \in \{x_1, \ldots, x_M\}

» Correlated beliefs
• A few dozen observations can teach us about thousands of points.
» Hierarchical models
• Create beliefs at different levels of aggregation and then use weighted combinations.
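Correlated beliefs are what let a few dozen observations teach us about thousands of points: one measurement updates every alternative through the prior covariance matrix. A sketch of the standard Gaussian updating equations (the numbers in the usage are illustrative):

```python
import numpy as np

def correlated_update(mu, Sigma, x, W, var_W):
    """Bayesian update of a correlated Gaussian belief (mu, Sigma) after
    observing W = mu_x + noise (noise variance var_W) at alternative x.
    A single observation shifts *every* alternative via the covariances."""
    e = np.zeros(len(mu))
    e[x] = 1.0
    denom = var_W + Sigma[x, x]
    mu_new = mu + (W - mu[x]) / denom * (Sigma @ e)          # shift all means
    Sigma_new = Sigma - np.outer(Sigma @ e, Sigma @ e) / denom  # shrink covariance
    return mu_new, Sigma_new
```

With a prior covariance of 0.9 between two alternatives, observing W = 1.0 at the first alternative pulls the second alternative's mean up almost as far, which is the mechanism behind learning thousands of points from a few experiments.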
Approximating strategies
Parametric models
» Linear models:

    \bar{F}(x|\theta) = \sum_{f \in F} \theta_f \phi_f(x)

• Might include sparse-additive, where many parameters are zero.
» Nonlinear models, e.g. logistic regression:

    \bar{F}(x|\theta) = \frac{e^{\theta_0 + \theta_1 x_1 + \cdots}}{1 + e^{\theta_0 + \theta_1 x_1 + \cdots}}

or a parametric policy such as a battery charge/discharge rule:

    X^\pi(S_t|\theta) = \begin{cases} +1 & \text{if } p_t \le \theta^{charge} \\ 0 & \text{if } \theta^{charge} < p_t < \theta^{discharge} \\ -1 & \text{if } p_t \ge \theta^{discharge} \end{cases}

• (Shallow) neural networks
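A charge/discharge rule of this kind is trivial to code, and tuning θ = (θ^charge, θ^discharge) by simulating profit is exactly what "policy search" means. A sketch (the sign convention, capacity, and price model are assumptions for illustration):

```python
def charge_policy(p_t, theta):
    """PFA-style rule: charge (+1) when the price is low, discharge (-1)
    when it is high, hold (0) in between. theta = (theta_charge, theta_discharge)."""
    lo, hi = theta
    if p_t <= lo:
        return 1
    if p_t >= hi:
        return -1
    return 0

def evaluate(theta, prices, capacity=10):
    """Simulated profit of the policy over a price path: pay the price to
    charge one unit, earn the price to discharge one unit."""
    R, profit = 0, 0.0
    for p in prices:
        x = charge_policy(p, theta)
        if x == 1 and R < capacity:
            R += 1
            profit -= p
        elif x == -1 and R > 0:
            R -= 1
            profit += p
    return profit
```

A grid or stochastic search over `evaluate(theta, prices)` then selects the thresholds, without ever needing a model of the price dynamics inside the policy.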
Approximating strategies
Nonparametric models
» Kernel regression
• Weighted average of neighboring points
• Limited to low-dimensional problems
» Locally linear methods
• Dirichlet process mixtures
• Radial basis functions
» Splines
» Support vector machines
» Deep neural networks
Belief models
Lookup table
» We can organize potential catalysts into groups.
» Scientists using domain knowledge can estimate correlations in experiments between similar catalysts.
© 2019 Warren Powell
![Page 53: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/53.jpg)
Belief models
We start with a belief about each material (lookup table)
© 2019 Warren Powell Slide 53
![Page 54: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/54.jpg)
Belief models
Testing one material teaches us about other materials
© 2019 Warren Powell Slide 54
![Page 55: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/55.jpg)
Belief models
We express our belief using a linear, additive QSAR model:
$$Y^m = \theta_0 + \sum_{\text{sites } i}\ \sum_{\text{substituents } j} \theta_{ij} X^m_{ij}$$
where $X^m_{ij}$ is an indicator variable for molecule $m$.
© 2019 Warren Powell
![Page 56: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/56.jpg)
Belief models
A sampled belief model
» We represent the response with a logistic curve
$$R(x|\theta_i) = \frac{e^{\theta_{i0} + \theta_{i1} x + \theta_{i2} x^2}}{1 + e^{\theta_{i0} + \theta_{i1} x + \theta_{i2} x^2}}$$
with sampled parameter vectors $\theta_i = (\theta_{i0}, \theta_{i1}, \theta_{i2})$, $i = 1, 2, 3, 4$.
[Figure: four candidate response curves (response vs. concentration/temperature), with curve thickness proportional to the probability $p^n_k$ that curve $k$ is correct.]
© 2019 Warren Powell
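The probabilities $p^n_k$ on the sampled curves can be updated after each experiment with a simple Bayes step. This is a minimal sketch, assuming Gaussian observation noise; the parameter vectors, noise level, and the observation are made-up illustrations:

```python
import math

# Candidate parameter vectors (theta_0, theta_1, theta_2) for the sampled
# belief model -- illustrative values, not from the slides.
thetas = [(-2.0, 1.5, -0.10), (-1.0, 0.8, -0.05),
          (-3.0, 2.0, -0.15), (-0.5, 0.5, -0.02)]
p = [0.25, 0.25, 0.25, 0.25]     # prior probability p_k on each curve
sigma_W = 0.1                    # assumed observation noise (std dev)

def response(x, theta):
    """Logistic response R(x|theta) with u = theta_0 + theta_1*x + theta_2*x^2."""
    u = theta[0] + theta[1] * x + theta[2] * x * x
    return math.exp(u) / (1.0 + math.exp(u))

def update(p, x, y_obs):
    """Bayes update: p_k proportional to p_k * N(y_obs; R(x|theta_k), sigma_W^2)."""
    like = [math.exp(-(y_obs - response(x, th)) ** 2 / (2 * sigma_W ** 2))
            for th in thetas]
    post = [pk * lk for pk, lk in zip(p, like)]
    z = sum(post)
    return [q / z for q in post]

# Run one experiment at x = 2.0 and observe y = 0.62 (a made-up observation).
p = update(p, 2.0, 0.62)
```

Curves whose predicted response at x is close to the observation gain probability mass; repeating this over experiments concentrates the belief on one curve.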
![Page 57: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/57.jpg)
Belief models

Parametric belief models
» The value of information (the KG) behaves in an unexpected way.
» The graph to the right shows the value of information while learning a quadratic approximation
$$Y = \theta_1 x + \theta_2 x^2$$
» We are using these insights to develop simple rules for where to run experiments that avoid the complexity of KG calculations.
[Figure: value of information as a function of the experiment x, from 0 to 120.]
© 2019 Warren Powell
![Page 58: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/58.jpg)
Belief models
Truckload brokerages:
» Now we have a logistic curve for each origin-destination pair (i,j), giving the probability that an offered price $p^a$ is accepted:
$$P^Y(p^a|\theta_{ij}) = \frac{e^{\theta_{ij0} + \theta_{ij1} p^a}}{1 + e^{\theta_{ij0} + \theta_{ij1} p^a}}$$
» The number of offers for each (i,j) pair is relatively small.
» We need to generalize the learning across "traffic lanes."
[Figure: a shipper offers a price to a carrier for a load.]
© 2019 Warren Powell
![Page 59: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/59.jpg)
Belief models
Hotel revenue management
» Locally linear demand curves
[Figure: reservations/day (0.5 to 3.0) as a function of price ($/night, 100 to 220), fit with locally linear curves.]
© 2019 Warren Powell
![Page 60: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/60.jpg)
Belief models
Hotel revenue management
» Guessing the right logistic curve
[Figure: candidate logistic demand curves, reservations/day (0.5 to 3.0) vs. price ($/night, 100 to 220).]
© 2019 Warren Powell
![Page 61: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/61.jpg)
Designing policies
© 2019 Warren Powell Slide 61
![Page 62: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/62.jpg)
Designing policies
We have to start by describing what we mean by a policy.
» Definition:
A policy is a mapping from a state to an action. … any mapping.
How do we search over an arbitrary space of policies?
![Page 63: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/63.jpg)
Designing policies
Two fundamental strategies:
1) Policy search – Search over a class of functions for making decisions to optimize some metric:
$$\max_{f \in F,\, \theta^f \in \Theta^f} E\left\{\sum_{t=0}^T C\big(S_t, X^f(S_t|\theta^f)\big) \,\Big|\, S_0\right\}$$

2) Lookahead approximations – Approximate the impact of a decision now on the future:
$$X^*_t(S_t) = \arg\max_{x_t}\left( C(S_t, x_t) + E\left\{ \max_{\pi} E\left\{ \sum_{t'=t+1}^{T} C\big(S_{t'}, X^\pi_{t'}(S_{t'})\big) \,\Big|\, S_{t+1} \right\} \,\Big|\, S_t, x_t \right\}\right)$$
![Page 64: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/64.jpg)
Designing policies

Policy search:
1a) Policy function approximations (PFAs), $x_t = X^{PFA}(S_t|\theta)$
• Lookup tables – "when in this state, take this action"
• Parametric functions
– Order-up-to policies: if inventory is less than s, order up to S.
– Affine policies: $X^{PFA}(S_t|\theta) = \sum_{f \in \mathcal{F}} \theta_f \phi_f(S_t)$
– Neural networks
• Locally/semi/nonparametric
– Requires optimizing over local regions
1b) Cost function approximations (CFAs)
• Optimizing a deterministic model modified to handle uncertainty (buffer stocks, schedule slack):
$$X^{CFA}(S_t|\theta) = \arg\max_{x_t \in \mathcal{X}^\pi(\theta)} \bar{C}(S_t, x_t|\theta)$$
![Page 65: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/65.jpg)
Designing policies

Lookahead approximations – Approximate the impact of a decision now on the future:
» An optimal policy (based on looking ahead):
$$X^*_t(S_t) = \arg\max_{x_t}\left( C(S_t, x_t) + E\left\{ \max_{\pi} E\left\{ \sum_{t'=t+1}^{T} C\big(S_{t'}, X^\pi_{t'}(S_{t'})\big) \,\Big|\, S_{t+1} \right\} \,\Big|\, S_t, x_t \right\}\right)$$
2a) Approximating the value of being in a downstream state using machine learning ("value function approximations"):
$$X^*_t(S_t) = \arg\max_{x_t} \Big( C(S_t, x_t) + E\big\{ V_{t+1}(S_{t+1}) \,\big|\, S_t, x_t \big\} \Big)$$
$$X^{VFA}_t(S_t) = \arg\max_{x_t} \Big( C(S_t, x_t) + E\big\{ \bar{V}_{t+1}(S_{t+1}) \,\big|\, S_t, x_t \big\} \Big) = \arg\max_{x_t} \Big( C(S_t, x_t) + \bar{V}^x_t(S^x_t) \Big)$$
![Page 66: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/66.jpg)
![Page 67: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/67.jpg)
![Page 68: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/68.jpg)
Designing policies

The ultimate lookahead policy is optimal:
$$X^*_t(S_t) = \arg\max_{x_t}\left( C(S_t, x_t) + E\left\{ \max_{\pi} E\left\{ \sum_{t'=t+1}^{T} C\big(S_{t'}, X^\pi_{t'}(S_{t'})\big) \,\Big|\, S_{t+1} \right\} \,\Big|\, S_t, x_t \right\}\right)$$
![Page 69: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/69.jpg)
Designing policies

The ultimate lookahead policy is optimal:
$$X^*_t(S_t) = \arg\max_{x_t}\left( C(S_t, x_t) + E\left\{ \max_{\pi} E\left\{ \sum_{t'=t+1}^{T} C\big(S_{t'}, X^\pi_{t'}(S_{t'})\big) \,\Big|\, S_{t+1} \right\} \,\Big|\, S_t, x_t \right\}\right)$$
» The expectations cannot be computed.
» The imbedded maximization over policies cannot be computed.
![Page 70: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/70.jpg)
Designing policies
The ultimate lookahead policy is optimal:
$$X^*_t(S_t) = \arg\max_{x_t}\left( C(S_t, x_t) + E\left\{ \max_{\pi} E\left\{ \sum_{t'=t+1}^{T} C\big(S_{t'}, X^\pi_{t'}(S_{t'})\big) \,\Big|\, S_{t+1} \right\} \,\Big|\, S_t, x_t \right\}\right)$$
» 2b) Instead, we have to solve an approximation called the lookahead model:
$$\tilde{X}^*_t(S_t) = \arg\max_{\tilde{x}_{tt}}\left( C(S_t, \tilde{x}_{tt}) + \tilde{E}\left\{ \max_{\tilde{\pi}} \tilde{E}\left\{ \sum_{t'=t+1}^{t+H} C\big(\tilde{S}_{tt'}, \tilde{X}^{\tilde{\pi}}_{tt'}(\tilde{S}_{tt'})\big) \,\Big|\, \tilde{S}_{t,t+1} \right\} \,\Big|\, S_t, \tilde{x}_{tt} \right\}\right)$$
» A lookahead policy works by approximating the lookahead model.
![Page 71: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/71.jpg)
Designing policies

Types of lookahead approximations
» One-step lookahead – Widely used in pure learning policies:
• Bayes greedy/naïve Bayes
• Expected improvement
• Value of information (knowledge gradient)
» Multi-step lookahead
• Deterministic lookahead, also known as model predictive control or a rolling horizon procedure
• Stochastic lookahead:
– Two-stage (widely used in stochastic linear programming)
– Multistage
» Monte Carlo tree search (MCTS) for discrete action spaces
» Multistage scenario trees (stochastic linear programming) – typically not tractable.
![Page 72: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/72.jpg)
Four (meta)classes of policies

Policy search:
1) Policy function approximations (PFAs)
» Lookup tables, rules, parametric/nonparametric functions
2) Cost function approximations (CFAs)
» $X^{CFA}(S_t|\theta) = \arg\max_{x_t \in \mathcal{X}^\pi(\theta)} \bar{C}(S_t, x_t|\theta)$

Lookahead approximations:
3) Policies based on value function approximations (VFAs)
» $X^{VFA}_t(S_t) = \arg\max_{x_t} \big( C(S_t, x_t) + \bar{V}^x_t(S^x_t(S_t, x_t)) \big)$
4) Direct lookahead policies (DLAs), optimizing over $\tilde{x}_{tt}, \tilde{x}_{t,t+1}, \ldots, \tilde{x}_{t,t+T}$
» Deterministic lookahead/rolling horizon proc./model predictive control:
$$X^{LA-D}_t(S_t) = \arg\max_{\tilde{x}_{tt}, \ldots, \tilde{x}_{t,t+H}} \Big( C(S_t, \tilde{x}_{tt}) + \sum_{t'=t+1}^{t+H} C(\tilde{S}_{tt'}, \tilde{x}_{tt'}) \Big)$$
» Chance-constrained programming: $P[A_t x_t \ge f(W)] \ge 1 - \epsilon$
» Stochastic lookahead/stochastic programming/Monte Carlo tree search:
$$X^{LA-S}_t(S_t) = \arg\max_{\tilde{x}_{tt}, \ldots, \tilde{x}_{t,t+H}} \Big( C(S_t, \tilde{x}_{tt}) + \sum_{\omega} p(\omega) \sum_{t'=t+1}^{t+H} C(\tilde{S}_{tt'}(\omega), \tilde{x}_{tt'}(\omega)) \Big)$$
» "Robust optimization":
$$X^{LA-RO}_t(S_t) = \arg\max_{\tilde{x}_{tt}, \ldots, \tilde{x}_{t,t+H}} \min_{w \in \mathcal{W}_t(\theta)} \Big( C(S_t, \tilde{x}_{tt}) + \sum_{t'=t+1}^{t+H} C(\tilde{S}_{tt'}(w), \tilde{x}_{tt'}(w)) \Big)$$
![Page 73: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/73.jpg)
Four (meta)classes of policies

(This slide repeats the four classes, with a sidebar labeled "Function approx." highlighting the classes that are built around approximating a function.)
![Page 74: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/74.jpg)
Four (meta)classes of policies

(This slide repeats the four classes, with a sidebar labeled "Imbedded optimization" highlighting the classes that contain an imbedded optimization problem, the arg max inside the policy.)
![Page 75: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/75.jpg)
Optimal policies
Online vs. offline learning
Slide 75
![Page 76: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/76.jpg)
Optimal policies
Earlier we expressed an optimal policy using Bellman's equation:
» Recall that we used a graph.
» Bellman equation (state = node)
![Page 77: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/77.jpg)
Optimal policies

Bellman's equation for a learning problem:
» A belief state
» Bellman equation (state = node)
» We can only solve this in very special cases (and not with normally distributed beliefs): with 5 alternatives, the belief state is a 10-dimensional continuous state (a mean and a variance for each alternative).
![Page 78: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/78.jpg)
Optimal policies
Online (cumulative reward) vs. offline (final reward)
» Bellman’s equation for online (cumulative reward) learning:
» Bellman’s equation for offline (final reward) learning:
» If we only care about the final reward, we do not care how well we perform while we are learning.
![Page 79: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/79.jpg)
Policy function approximation
© 2019 Warren Powell Slide 79
![Page 80: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/80.jpg)
Policy function approximation
Examples:
» Optimizing buying and selling of energy from the grid
» Finding the best selling price for my book on Amazon.
© 2019 Warren Powell
![Page 81: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/81.jpg)
Policy function approximations
Battery arbitrage – when to charge and when to discharge, given volatile locational marginal prices (LMPs)
![Page 82: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/82.jpg)
[Figure: hourly electricity prices, with the charge and discharge thresholds overlaid.]

Grid operators require that batteries bid charge and discharge prices, an hour in advance.

We have to search for the best values for the policy parameters $\theta^{\text{charge}}$ and $\theta^{\text{discharge}}$.
Policy function approximations
![Page 83: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/83.jpg)
Policy function approximations
Our policy function might be the parametric model (this is nonlinear in the parameters):
$$X^\pi(S_t|\theta) = \begin{cases} +1 & \text{if } p_t \le \theta^{\text{charge}} \\ 0 & \text{if } \theta^{\text{charge}} < p_t < \theta^{\text{discharge}} \\ -1 & \text{if } p_t \ge \theta^{\text{discharge}} \end{cases}$$
The state $S_t$ includes the energy in storage and the price of electricity $p_t$.
![Page 84: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/84.jpg)
Policy function approximations
Finding the best policy
» We need to maximize
$$\max_\theta F(\theta) = E\left\{\sum_{t=0}^T C\big(S_t, X^\pi(S_t|\theta)\big)\right\}$$
» We cannot compute the expectation, so we run simulations.
[Figure: simulated performance over a grid of values of $\theta^{\text{charge}}$ and $\theta^{\text{discharge}}$.]
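The simulation-based search over $\theta$ can be sketched as follows. The price process, storage capacity, horizon, and the grid of $\theta$ values are all made-up assumptions, not numbers from the lecture:

```python
import random

# Hypothetical simulator for the battery arbitrage PFA: charge (+1) when the
# price is at or below theta_charge, discharge (-1) at or above theta_discharge.
def simulate_policy(theta_charge, theta_discharge, T=1000, seed=0):
    rng = random.Random(seed)
    storage, profit = 0.0, 0.0
    for _ in range(T):
        p = max(0.0, rng.gauss(30.0, 10.0))        # assumed price process p_t
        if p <= theta_charge and storage < 1.0:    # x_t = +1: buy energy
            storage += 1.0
            profit -= p
        elif p >= theta_discharge and storage >= 1.0:  # x_t = -1: sell energy
            storage -= 1.0
            profit += p
        # otherwise x_t = 0: hold
    return profit

# Policy search: estimate F(theta) by simulation over a coarse grid of thresholds.
best = max(((lo, hi, simulate_policy(lo, hi))
            for lo in range(10, 35, 5)
            for hi in range(35, 60, 5)),
           key=lambda t: t[2])
```

Because each simulation reuses the same random seed, the comparison across grid points uses common random numbers, which reduces the noise in the search.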
![Page 85: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/85.jpg)
Policy function approximation
Optimal pricing
![Page 86: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/86.jpg)
Policy function approximation
Dynamic pricing
» After n sales, our estimate of the demand function is:
» Revenue is
» The optimal price given our estimates would be:
![Page 87: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/87.jpg)
Revenue management

Earning vs. learning
» You earn the most with prices near the middle.
» You learn the most with extreme prices.
» The challenge is to strike a balance.
![Page 88: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/88.jpg)
Policy function approximation
Designing a policy:
» Just doing what appears to be best will encourage prices too close to the middle.
» We can try an excitation policy by adding noise to the price that appears best, where the standard deviation of the noise is a tunable parameter.
» After charging this price, we observe revenue, which reflects the noise in observing demand.
© 2019 Warren Powell
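An excitation policy of this kind might look like the following sketch. The linear demand model D(p) = a - b*p, all parameter values, and the least-squares refit are assumptions for illustration only:

```python
import random

# Sketch of an excitation policy for dynamic pricing under an assumed
# linear demand model; none of these numbers come from the lecture.
rng = random.Random(1)
a_true, b_true = 100.0, 2.0          # unknown true demand parameters
a_hat, b_hat = 80.0, 1.0             # current estimates of (a, b)
rho = 3.0                            # tunable excitation (std dev of price noise)

history = []
for n in range(50):
    p_star = a_hat / (2.0 * b_hat)           # price maximizing p * (a_hat - b_hat*p)
    p = p_star + rng.gauss(0.0, rho)         # excitation: perturb the apparent best price
    demand = a_true - b_true * p + rng.gauss(0.0, 5.0)  # noisy observed demand
    history.append((p, demand))
    # Re-fit (a_hat, b_hat) by least squares on the observed (price, demand) pairs.
    m = len(history)
    pbar = sum(p_ for p_, _ in history) / m
    dbar = sum(d_ for _, d_ in history) / m
    sxx = sum((p_ - pbar) ** 2 for p_, _ in history)
    if sxx > 1e-9:
        b_hat = max(0.1, -sum((p_ - pbar) * (d_ - dbar) for p_, d_ in history) / sxx)
        a_hat = dbar + b_hat * pbar
```

Without the excitation noise every price would be identical, the regression could never identify the slope, and the estimates would never improve; that is the earning-versus-learning tradeoff in miniature.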
![Page 89: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/89.jpg)
Policy function approximation
Designing a policy:
» The noise level is a tunable parameter that needs to solve a policy-search problem: we choose it to maximize the total revenue earned over the horizon.
» The best value should strike a balance: doing a better job of exploration, without overdoing it.
© 2019 Warren Powell
![Page 90: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/90.jpg)
Policy function approximation
Other examples of PFAs:
» A basic linear decision rule:
$$X^{PFA}(S_t|\theta) = \sum_{f \in \mathcal{F}} \theta_f \phi_f(S_t)$$
» The features $\phi_f(S_t)$ have to be designed by hand.
© 2019 Warren Powell
![Page 91: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/91.jpg)
Cost function approximation
© 2019 Warren Powell Slide 91
Materials science example
![Page 92: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/92.jpg)
Materials science application

Heuristic measurement policies
» Boltzmann exploration
• Explore choice x with probability
$$P^n_x = \frac{e^{\theta \bar{\mu}^n_x}}{\sum_{x'} e^{\theta \bar{\mu}^n_{x'}}}$$
» Upper confidence bounding
$$X^{UCB}(S^n|\theta^{UCB}) = \arg\max_x \left( \bar{\mu}^n_x + \theta^{UCB} \sqrt{\frac{\log n}{N^n_x}} \right)$$
» Thompson sampling
$$X^{TS}(S^n) = \arg\max_x \hat{\mu}^n_x, \quad \text{where } \hat{\mu}^n_x \sim N(\bar{\mu}^n_x, \bar{\sigma}^{2,n}_x)$$
» Interval estimation (a form of upper confidence bounding)
• Choose the x which maximizes
$$X^{IE}(S^n|\theta^{IE}) = \arg\max_x \big( \bar{\mu}^n_x + \theta^{IE} \bar{\sigma}^n_x \big)$$
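The four policies above can be sketched for a lookup-table belief model. The means, standard deviations, counts, and parameter values below are illustrative assumptions:

```python
import math
import random

# Lookup-table beliefs: estimated mean mu[x], spread sigma[x] of each
# estimate, and the number of times N^n_x each choice has been tried.
rng = random.Random(0)
mu = [1.0, 2.0, 1.5]
sigma = [0.5, 0.8, 0.3]
counts = [4, 2, 6]
n = sum(counts)

def boltzmann(theta):
    """Sample x with probability proportional to exp(theta * mu[x])."""
    w = [math.exp(theta * m) for m in mu]
    r, acc = rng.random() * sum(w), 0.0
    for x, wx in enumerate(w):
        acc += wx
        if r <= acc:
            return x
    return len(mu) - 1

def ucb(theta_ucb):
    """Mean plus an uncertainty bonus that shrinks with the visit count."""
    return max(range(len(mu)),
               key=lambda x: mu[x] + theta_ucb * math.sqrt(math.log(n) / counts[x]))

def thompson():
    """Sample a mean from each belief, then pick the best-looking sample."""
    return max(range(len(mu)), key=lambda x: rng.gauss(mu[x], sigma[x]))

def interval_estimation(theta_ie):
    """Mean plus theta_IE standard deviations of the belief."""
    return max(range(len(mu)), key=lambda x: mu[x] + theta_ie * sigma[x])
```

Each policy trades off the estimated mean against some measure of uncertainty, and each (except Thompson sampling) exposes a tunable parameter, which is what makes them policy-search policies.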
![Page 93: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/93.jpg)
Materials science application
Notes:» Each one of these policies has an imbedded “max” (or
“min”) operator.» But each one still has a tunable parameter – this is the
distinguishing feature of “policy search” policies.» For these policies, the max involves a simple search
over a set of discrete choices, so this is not too difficult. » Another example is solving a shortest path into New
York. The tunable parameter might be how much time you add to the trip to deal with traffic.
» The max/min could also be a large optimization problem, such as what is solved to plan energy generation for tomorrow.
© 2019 Warren Powell
![Page 94: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/94.jpg)
Materials science application
Lookup table
» We can organize potential catalysts into groups.
» Scientists using domain knowledge can estimate correlations in experiments between similar catalysts.
![Page 95: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/95.jpg)
Materials science application
Correlated beliefs: Testing one material teaches us about other materials
![Page 96: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/96.jpg)
Materials science application

Heuristic measurement policies (the same four policies as above: Boltzmann exploration, upper confidence bounding, Thompson sampling, interval estimation).
![Page 97: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/97.jpg)
Materials science application
Notes:» Upper confidence bounding and Thompson sampling
are very popular in the computer science community.» These have been shown to have provable regret bounds,
which the CS community then uses to claim that they must be very good.
» My own experimental work has not supported this claim, but we have found that a properly tuned version of interval estimation tends to work surprisingly well.
» … The problem is the tuning.
© 2019 Warren Powell
![Page 98: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/98.jpg)
Materials science application

Picking $\theta^{IE} = 0$ means we are evaluating each choice at the mean.
![Page 99: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/99.jpg)
Materials science application

Picking $\theta^{IE} = 1.645$ (the 95th-percentile quantile of the normal distribution) means we are evaluating each choice at the 95th percentile.
![Page 100: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/100.jpg)
Materials science application

Optimizing the policy
» We optimize θ to maximize
$$\max_{\theta^{IE}} F^{IE}(\theta^{IE}) = E\, F(x^{\pi,N}, W)$$
where
$$x^n = X^{IE}(S^n|\theta^{IE}) = \arg\max_x \big( \bar{\mu}^n_x + \theta^{IE} \bar{\sigma}^n_x \big)$$

Notes:
» This can handle any belief model, including correlated beliefs and nonlinear belief models.
» All we require is that we be able to simulate a policy.

[Figure: regret as a function of the tunable parameter $\theta^{IE}$.]
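Tuning $\theta^{IE}$ by simulation, for a final-reward (offline) objective, might look like the following sketch. The Gaussian truth, the measurement budget N, the noise level, and the candidate grid of θ values are all assumptions:

```python
import random

# Simulate the interval estimation (IE) policy and estimate the final reward
# F = mu_truth[x^{pi,N}], the true value of the alternative we finally pick.
def run_ie(theta, rng, K=5, N=20, noise=1.0):
    truth = [rng.gauss(0.0, 1.0) for _ in range(K)]  # unknown true means
    mu = [0.0] * K                                   # prior means
    beta = [1.0] * K                                 # belief precisions (1/variance)
    for _ in range(N):
        # IE decision: mean plus theta standard deviations of the belief.
        x = max(range(K), key=lambda i: mu[i] + theta * (1.0 / beta[i]) ** 0.5)
        w = truth[x] + rng.gauss(0.0, noise)         # noisy experiment
        beta_w = 1.0 / noise ** 2
        mu[x] = (beta[x] * mu[x] + beta_w * w) / (beta[x] + beta_w)
        beta[x] += beta_w
    return truth[max(range(K), key=lambda i: mu[i])]

def estimate_F(theta, reps=200, seed=0):
    rng = random.Random(seed)
    return sum(run_ie(theta, rng) for _ in range(reps)) / reps

best_theta = max([0.0, 0.5, 1.0, 2.0, 3.0], key=estimate_F)
```

This is exactly "all we require is that we be able to simulate a policy": the same loop works unchanged if the belief model is replaced by a correlated or nonlinear one.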
![Page 101: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/101.jpg)
Materials science application
Other applications
» Airlines optimizing schedules with schedule slack to handle weather uncertainty.
» Manufacturers using buffer stocks to hedge against production delays and quality problems.
» Grid operators scheduling extra generation capacity in case of outages.
» Adding time to a trip planned by Google maps to account for uncertain congestion.
![Page 102: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/102.jpg)
Materials science application
Notes:
» CFA policies are exceptionally popular in internet applications: choosing ads, news articles, products, … that maximize ad-clicks.
» The ease of computation of these policies is very attractive.
» A representative from Google once noted: "We can use any policy that can be computed in under 50 milliseconds."
© 2019 Warren Powell
![Page 103: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/103.jpg)
Cost function approximation
© 2019 Warren Powell Slide 103
Fleet management example
![Page 104: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/104.jpg)
Schneider National
© 2018 Warren B. Powell
![Page 105: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/105.jpg)
Slide 105© 2018 Warren B. Powell
![Page 106: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/106.jpg)
Fleet management application
DemandsDrivers
© 2018 Warren B. Powell
![Page 107: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/107.jpg)
Fleet management application
t t+1 t+2
The assignment of drivers to loads evolves over time, with new loads being called in, along with updates to the status of a driver.
© 2018 Warren B. Powell
![Page 108: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/108.jpg)
Fleet management application
A purely myopic policy would solve this problem using

$$\min_x \sum_{d}\sum_{l} c_{tdl}\, x_{tdl}$$

where

$$x_{tdl} = \begin{cases} 1 & \text{if we assign driver } d \text{ to load } l \\ 0 & \text{otherwise} \end{cases}$$

$$c_{tdl} = \text{cost of assigning driver } d \text{ to load } l \text{ at time } t$$

What if a load is not assigned to any driver, and has been delayed for a while? This model ignores the fact that we eventually have to assign someone to the load.
© 2018 Warren B. Powell
![Page 109: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/109.jpg)
Fleet management application
We can minimize delayed loads by solving a modified objective function:

$$\min_x \sum_{d}\sum_{l} \left( c_{tdl} - \theta\, \tau_{tl} \right) x_{tdl}$$

where

$$\tau_{tl} = \text{how long load } l \text{ has been delayed by time } t$$
$$\theta = \text{bonus for moving a delayed load}$$

We refer to our modified objective function as a cost function approximation.
© 2018 Warren B. Powell
![Page 110: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/110.jpg)
Fleet management application
We now have to tune our policy, which we define as:

$$X^\pi(S_t \mid \theta) = \arg\min_x \sum_{d}\sum_{l} \left( c_{tdl} - \theta\, \tau_{tl} \right) x_{tdl}$$

We can now optimize $\theta$, another form of policy search, by solving

$$\min_\theta F(\theta) = \mathbb{E} \sum_{t=0}^{T} C\left(S_t, X^\pi(S_t \mid \theta)\right)$$

where $C(S_t, x_t \mid \theta)$ is the cost function being simulated.
© 2018 Warren B. Powell
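The tuning loop above can be sketched as a simulation plus a grid search over $\theta$. This is a minimal sketch, assuming a one-driver dispatch problem with made-up cost and arrival distributions and a lateness penalty of 2 per period of delay (all of these numbers are illustrative assumptions, not from the slides):

```python
import random

def simulate_policy(theta, T=50, seed=0):
    """Simulate a one-driver dispatch problem under the CFA policy
    argmin_l (c_tl - theta * tau_tl).  Unserved loads accumulate delay,
    and the base-model objective adds a lateness penalty when a delayed
    load is finally served."""
    rng = random.Random(seed)
    waiting = []          # list of [assignment_cost, delay_so_far]
    total_cost = 0.0
    for t in range(T):
        # one or two new loads arrive each period with random costs
        for _ in range(rng.randint(1, 2)):
            waiting.append([rng.uniform(1.0, 10.0), 0])
        # CFA decision: pick the load minimizing the modified cost
        best = min(range(len(waiting)),
                   key=lambda i: waiting[i][0] - theta * waiting[i][1])
        c, tau = waiting.pop(best)
        total_cost += c + 2.0 * tau   # realized cost + lateness penalty
        for load in waiting:          # loads left behind age one period
            load[1] += 1
    return total_cost

def tune_theta(candidates, n_runs=20):
    """Policy search: average simulated cost over sample paths, keep the
    best theta on the grid."""
    def avg_cost(theta):
        return sum(simulate_policy(theta, seed=s) for s in range(n_runs)) / n_runs
    return min(candidates, key=avg_cost)
```

Tuning then amounts to `tune_theta([0.0, 0.5, 1.0, 2.0])`: simulate each candidate over several sample paths and keep the one with the lowest average cost.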
![Page 111: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/111.jpg)
Cost function approximation
© 2019 Warren Powell Slide 111
Logistics example
![Page 112: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/112.jpg)
Logistics application
Inventory management
» How much product should I order to anticipate future demands?
» Need to accommodate different sources of uncertainty.
• Market behavior• Transit times• Supplier uncertainty• Product quality
© 2018 Warren B. Powell
![Page 113: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/113.jpg)
Logistics application
Imagine that we want to purchase parts from different suppliers. Let $x_{tp}$ be the amount of product we purchase at time $t$ from supplier $p$ to meet forecasted demand $D_t$. We would solve

$$X^\pi(S_t) = \arg\max_x \sum_{p \in P} c_p x_{tp}$$

subject to

$$\sum_{p \in P} x_{tp} = D_t$$
$$x_{tp} \le u_{tp}$$
$$x_{tp} \ge 0$$

» This assumes our demand forecast $D_t$ is accurate.
© 2018 Warren B. Powell
![Page 114: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/114.jpg)
Logistics application

Imagine that we want to purchase parts from different suppliers. Let $x_{tp}$ be the amount of product we purchase at time $t$ from supplier $p$ to meet forecasted demand $D_t$. We would solve

$$X^\pi(S_t \mid \theta) = \arg\max_x \sum_{p \in P} c_p x_{tp}$$

subject to

$$\sum_{p \in P} x_{tp} = D_t + \theta \qquad \text{(reserve buffer)}$$
$$x_{tp} \le u_{tp}$$
$$x_{tp} \ge 0$$

» This is a “parametric cost function approximation”: the buffer stock $\theta$ is a tunable parameter.
© 2018 Warren B. Powell
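The buffered-order idea can be sketched in a few lines. This is a one-supplier toy, with an illustrative price of 8, unit cost of 3, and Gaussian demand noise (none of these numbers come from the slides):

```python
import random

def simulate_profit(theta, n_days=500, price=8.0, cost=3.0, seed=0):
    """One-supplier sketch of the buffered-order CFA: order the forecast
    plus a buffer theta, observe noisy demand, and collect
    price * sales - cost * orders."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_days):
        forecast = 100.0
        demand = max(0.0, rng.gauss(forecast, 20.0))  # unbiased but noisy
        x = forecast + theta                          # X^pi(S_t | theta)
        total += price * min(x, demand) - cost * x
    return total / n_days

# Policy search over a finite grid of buffer values
candidates = [-10.0, -5.0, 0.0, 5.0, 10.0, 15.0]
best_theta = max(candidates, key=simulate_profit)
```

Note the buffer may tune to a negative value when the profit margin is thin: the critical ratio of the underlying newsvendor problem determines which side of the forecast the optimal order sits on.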
![Page 115: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/115.jpg)
Logistics application

An even more general CFA model:
» Define our policy:

$$X^\pi(S_t \mid \theta) = \arg\max_x \bar{C}^\pi(S_t, x \mid \theta) \qquad \text{(parametrically modified costs)}$$

subject to

$$A^\pi(\theta)\, x = b^\pi(\theta) \qquad \text{(parametrically modified constraints)}$$

» We tune $\theta$ by optimizing:

$$\min_\theta F(\theta) = \mathbb{E} \sum_{t=0}^{T} C\left(S_t, X^\pi(S_t \mid \theta)\right)$$
© 2018 Warren B. Powell
![Page 116: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/116.jpg)
Cost function approximation
© 2019 Warren Powell Slide 116
Energy storage example
![Page 117: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/117.jpg)
Parametric cost function approximation
Notes:
» In this set of slides, we are going to illustrate the use of a parametrically modified lookahead policy.
» This is designed to handle a nonstationary problem with rolling forecasts. This is just what you do when you plan your path with Google Maps.
» The lookahead policy is modified with a set of parameters that factor the forecasts. These parameters have to be tuned using the basic methods of policy search.
© 2018 W.B. Powell
![Page 118: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/118.jpg)
Parametric cost function approximation
An energy storage problem:
![Page 119: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/119.jpg)
Parametric cost function approximation

Forecasts evolve over time as new information arrives:

[Figure: actual values vs. rolling forecasts, updated each hour; the forecast made at midnight is shown for comparison.]
![Page 120: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/120.jpg)
Parametric cost function approximation
Benchmark policy – Deterministic lookahead

At time $t$, the lookahead model chooses, for each $t'$ in the horizon, the flows $x_{tt'}$ (wind$\to$demand $wd$, storage$\to$demand $rd$, grid$\to$demand $gd$, wind$\to$storage $wr$, grid$\to$storage $gr$, storage$\to$grid $rg$) subject to:

$$x^{wd}_{tt'} + x^{rd}_{tt'} + x^{gd}_{tt'} \le f^{D}_{tt'}$$
$$x^{gd}_{tt'} + x^{gr}_{tt'} \le f^{G}_{tt'}$$
$$x^{rd}_{tt'} + x^{rg}_{tt'} \le R_{tt'}$$
$$x^{wr}_{tt'} + x^{gr}_{tt'} \le R^{max} - R_{tt'}$$
$$x^{wr}_{tt'} + x^{wd}_{tt'} \le f^{E}_{tt'}$$
$$x^{wr}_{tt'} + x^{gr}_{tt'} \le \gamma^{ch}$$
$$x^{rd}_{tt'} + x^{rg}_{tt'} \le \gamma^{disch}$$

where $f^{D}_{tt'}$, $f^{G}_{tt'}$, and $f^{E}_{tt'}$ are the rolling forecasts of demand, grid capacity, and wind energy, $R_{tt'}$ is the storage level with capacity $R^{max}$, and $\gamma^{ch}$, $\gamma^{disch}$ are the charge and discharge rate limits.
![Page 121: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/121.jpg)
Lookahead policies

Lookahead policies peek into the future
» Optimize over a deterministic lookahead model

[Figure: the lookahead model extending from each time t, t+1, t+2, t+3 of the real process.]

© 2018 W.B. Powell
![Page 122: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/122.jpg)
![Page 123: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/123.jpg)
![Page 124: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/124.jpg)
![Page 125: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/125.jpg)
Parametric cost function approximation
Benchmark policy – Deterministic lookahead

At time $t$, the lookahead model chooses, for each $t'$ in the horizon, the flows $x_{tt'}$ (wind$\to$demand $wd$, storage$\to$demand $rd$, grid$\to$demand $gd$, wind$\to$storage $wr$, grid$\to$storage $gr$, storage$\to$grid $rg$) subject to:

$$x^{wd}_{tt'} + x^{rd}_{tt'} + x^{gd}_{tt'} \le f^{D}_{tt'}$$
$$x^{gd}_{tt'} + x^{gr}_{tt'} \le f^{G}_{tt'}$$
$$x^{rd}_{tt'} + x^{rg}_{tt'} \le R_{tt'}$$
$$x^{wr}_{tt'} + x^{gr}_{tt'} \le R^{max} - R_{tt'}$$
$$x^{wr}_{tt'} + x^{wd}_{tt'} \le f^{E}_{tt'}$$
$$x^{wr}_{tt'} + x^{gr}_{tt'} \le \gamma^{ch}$$
$$x^{rd}_{tt'} + x^{rg}_{tt'} \le \gamma^{disch}$$

where $f^{D}_{tt'}$, $f^{G}_{tt'}$, and $f^{E}_{tt'}$ are the rolling forecasts of demand, grid capacity, and wind energy, $R_{tt'}$ is the storage level with capacity $R^{max}$, and $\gamma^{ch}$, $\gamma^{disch}$ are the charge and discharge rate limits.
![Page 126: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/126.jpg)
Parametric cost function approximation
Parametric cost function approximations
» Replace the constraint

$$x^{wr}_{tt'} + x^{wd}_{tt'} \le f^{E}_{tt'}$$

with a lookup table of modified forecasts (one adjustment term $\theta_{t'-t}$ for each time in the future):

$$x^{wr}_{tt'} + x^{wd}_{tt'} \le \theta_{t'-t}\, f^{E}_{tt'}$$

» We can simulate the performance of a parameterized policy

$$F^\pi(\theta, W) = \sum_{t=0}^{T} C\left(S_t(\omega), X^\pi(S_t(\omega) \mid \theta)\right)$$

» The challenge is to optimize the parameters:

$$\min_\theta \mathbb{E} \sum_{t=0}^{T} C\left(S_t, X^\pi(S_t \mid \theta)\right)$$
![Page 127: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/127.jpg)
Parametric cost function approximation
One-dimensional contour plots – perfect forecast, for $i = 1, \ldots, 8$ hours into the future.

[Figure: performance as a function of $\theta$ over $0.0$–$1.2$.]

$\theta_i^* = 1$ for perfect forecasts.
![Page 128: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/128.jpg)
Parametric cost function approximation
One-dimensional contour plots – uncertain forecast, for $i = 9, \ldots, 15$ hours into the future.

[Figure: performance as a function of $\theta$ over $0.0$–$2.4$.]
![Page 129: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/129.jpg)
Parametric cost function approximation
[Figure: two-dimensional contour plot of performance over $(\theta_{10}, \theta_1)$, each axis ranging from 0.0 to 2.0.]
![Page 130: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/130.jpg)
Parametric cost function approximation
[Figure: two-dimensional contour plot of performance over $(\theta_2, \theta_1)$, each axis ranging from 0.0 to 2.0.]
![Page 131: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/131.jpg)
Parametric cost function approximation
[Figure: two-dimensional contour plot of performance over $(\theta_3, \theta_2)$, each axis ranging from 0.0 to 2.0.]
![Page 132: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/132.jpg)
Parametric cost function approximation
[Figure: two-dimensional contour plot of performance over $(\theta_5, \theta_3)$, each axis ranging from 0.0 to 2.0.]
![Page 133: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/133.jpg)
Numerical derivatives
Simultaneous perturbation stochastic approximation
» Let:
• $x$ be a $p$-dimensional vector.
• $\delta$ be a scalar perturbation.
• $Z$ be a $p$-dimensional vector, with each element drawn from a normal (0,1) distribution.
» We can obtain a sampled estimate of the gradient $\nabla_x F(x^n, W^{n+1})$ using two function evaluations, $F(x^n + \delta^n Z^n)$ and $F(x^n - \delta^n Z^n)$:

$$\nabla_x F(x^n, W^{n+1}) \approx \begin{bmatrix} \dfrac{F(x^n + \delta^n Z^n) - F(x^n - \delta^n Z^n)}{2 \delta^n Z^n_1} \\ \vdots \\ \dfrac{F(x^n + \delta^n Z^n) - F(x^n - \delta^n Z^n)}{2 \delta^n Z^n_p} \end{bmatrix}$$
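A minimal sketch of the SPSA estimator. The normal perturbations follow the slide; note that classical SPSA more often uses $\pm 1$ (Rademacher) perturbations, which keep the variance of the $1/Z_i$ terms bounded:

```python
import random

def spsa_gradient(F, x, delta=0.1, rng=None):
    """Sampled gradient of F at x from just two function evaluations,
    perturbing every coordinate at once along a random direction Z."""
    rng = rng or random.Random(0)
    Z = [rng.gauss(0.0, 1.0) for _ in range(len(x))]
    x_plus  = [xi + delta * zi for xi, zi in zip(x, Z)]
    x_minus = [xi - delta * zi for xi, zi in zip(x, Z)]
    diff = F(x_plus) - F(x_minus)   # two evaluations, regardless of dimension p
    return [diff / (2.0 * delta * zi) for zi in Z]

# For a one-dimensional linear function the estimate is exact
g = spsa_gradient(lambda v: 3.0 * v[0] + 1.0, [5.0], delta=0.1)
```

The key property is that the number of function evaluations is two no matter how large $p$ is, whereas coordinate-wise finite differences would need $2p$.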
![Page 134: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/134.jpg)
Numerical derivatives
Finite differences
» We use finite differences in MOLTE-DB:
• We wish to optimize the decision of when to charge or discharge a battery:

$$X^\pi(S_t \mid \theta) = \begin{cases} 1 & \text{if } p_t \le \theta^{charge} \\ 0 & \text{if } \theta^{charge} < p_t < \theta^{discharge} \\ -1 & \text{if } p_t \ge \theta^{discharge} \end{cases}$$
© 2019 Warren Powell
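A sketch of this threshold policy together with a finite-difference derivative of its simulated performance. The uniform price model, storage capacity of 10, and step size are illustrative assumptions:

```python
import random

def policy(price, theta_charge, theta_discharge):
    """Buy-low/sell-high threshold policy: charge below the lower
    threshold, discharge above the upper one, otherwise hold."""
    if price <= theta_charge:
        return 1      # charge
    if price >= theta_discharge:
        return -1     # discharge
    return 0          # hold

def simulate(theta, T=1000, seed=0):
    """Cumulative reward of the threshold policy on a random price path."""
    theta_charge, theta_discharge = theta
    rng = random.Random(seed)
    R, Rmax, total = 0.0, 10.0, 0.0
    for _ in range(T):
        p = rng.uniform(20.0, 60.0)
        x = policy(p, theta_charge, theta_discharge)
        if x == 1 and R < Rmax:
            R += 1.0
            total -= p          # pay to charge
        elif x == -1 and R > 0:
            R -= 1.0
            total += p          # earn by discharging
    return total

def finite_difference(theta, i, h=1.0, seed=0):
    """Numerical derivative of simulated performance in coordinate i,
    using common random numbers (same seed on both sides)."""
    up = list(theta); up[i] += h
    dn = list(theta); dn[i] -= h
    return (simulate(up, seed=seed) - simulate(dn, seed=seed)) / (2 * h)
```

Reusing the same seed on both sides of the difference (common random numbers) removes most of the simulation noise from the derivative estimate.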
![Page 135: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/135.jpg)
Numerical derivatives
Finding the best policy
» We need to maximize

$$\max_\theta F(\theta) = \mathbb{E} \sum_{t=0}^{T} C\left(S_t, X^\pi(S_t \mid \theta)\right)$$

» We cannot compute the expectation, so we run simulations.

[Figure: simulated price path with charge and discharge decisions marked.]
© 2019 Warren Powell Slide 135
![Page 136: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/136.jpg)
Cost function approximation
© 2019 Warren Powell Slide 136
Stochastic shortest path
![Page 137: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/137.jpg)
Stochastic shortest path problem

A stochastic network, costs revealed as we arrive to a node:

[Figure: network with nodes 1–11; sampled costs (3.6, 8.1, 13.5, 8.9, 12.7, 12.6, 8.4, 9.2, 15.9) shown on the arcs out of the current node.]
© 2018 W.B. Powell
![Page 138: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/138.jpg)
Stochastic shortest path problem
Modeling:
» State variable (iteration $n$, time $t$)
• $R^n_t$ = current node $i$ during pass $n$
• $I^n_t$ = information: $\bar{c}^n_t$ = estimates of the costs at time $t$ during pass $n$
• $S^n_t = (R^n_t, I^n_t)$
» Decisions
• $x_{t,i,j} = 1$ if we traverse $(i,j)$ at time $t$
• We want a policy $X^\pi(S_t) = (x_{t,i,j})$ for all links $(i,j)$.
» Costs
• $\hat{c}_{t,i,j}$ = actual realization of the costs at time $t$ to traverse $(i,j)$
© 2018 W.B. Powell
![Page 139: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/139.jpg)
Stochastic shortest path problem

Modeling:
» Objective function
• Cost per period:

$$C\left(S_t, X^\pi(S_t)\right) = \sum_{i,j} \hat{c}_{t,i,j}\, x_{t,i,j}$$

These are the costs incurred at time $t$.
• Total costs:

$$\min_\pi \mathbb{E} \sum_{t=0}^{T} C\left(S_t, X^\pi(S_t)\right)$$

» This is the base model.
© 2018 W.B. Powell
![Page 140: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/140.jpg)
Stochastic shortest path problem

A policy based on a lookahead model
» At each time $t$ we are going to optimize over an estimate of the network that we are going to call the lookahead model.
» Notation: all variables in the lookahead model have tildes, and two time indices.
• The first time index, $t$, is the time at which we are making a decision. This determines the information content of all parameters (e.g. costs) and decisions.
• The second time index, $t'$, is the time within the lookahead model.
» Decisions
• $\tilde{x}_{t,t',i,j} = 1$ if we plan on traversing link $(i,j)$ in the lookahead model.
• $\tilde{c}_{t,t',i,j}$ = estimated cost at time $t$ of traversing link $(i,j)$ in the lookahead model.
© 2018 W.B. Powell
![Page 141: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/141.jpg)
Stochastic shortest path problem
© 2018 W.B. Powell
![Page 142: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/142.jpg)
Stochastic shortest path problem
© 2018 W.B. Powell
![Page 143: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/143.jpg)
Stochastic shortest path problem

A static, deterministic network

[Figure: network with nodes 1–11 and a deterministic cost on every arc.]
© 2018 W.B. Powell
![Page 144: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/144.jpg)
Stochastic shortest path problem
A time-dependent, deterministic network

[Figure: the network replicated over times t, t+1, t+2, t+3.]

© 2018 W.B. Powell
![Page 145: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/145.jpg)
Stochastic shortest path problem

A time-dependent, deterministic lookahead network

[Figure: the base model over times t, t+1, t+2, t+3, ..., with the lookahead model extending over t' = t, t+1, t+2, t+3, t+4 at each decision point.]

© 2018 W.B. Powell
![Page 146: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/146.jpg)
![Page 147: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/147.jpg)
![Page 148: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/148.jpg)
Stochastic shortest path problem
Imagine that the lookahead is just a black box:
» Solve the optimization problem

$$X^\pi(S_t) = \arg\min_{\tilde{x}} \sum_{i \in N} \sum_{j \in N_i^+} \bar{c}^{\,n}_{tij}\, \tilde{x}_{tij}$$

» subject to

$$\sum_j \tilde{x}_{t,i,j} = 1 \qquad \text{flow out of the current node } i = n_t \text{ where we are located}$$
$$\sum_i \tilde{x}_{t,i,r} = 1 \qquad \text{flow into the destination node } r$$
$$\sum_i \tilde{x}_{t,i,j} - \sum_k \tilde{x}_{t,j,k} = 0 \qquad \text{for all other nodes } j$$

» This is a deterministic shortest path problem that we could solve using Bellman's equation, but for now we will just view it as a black box optimization problem.
© 2018 W.B. Powell
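The "black box" can be any shortest path code. Below is a minimal sketch using Dijkstra's algorithm (one way to implement Bellman's recursion when arc costs are nonnegative); the small network of estimated costs is a made-up example:

```python
import heapq

def shortest_path(costs, origin, destination):
    """Deterministic lookahead model: min-cost path through a network
    given estimated arc costs {(i, j): cbar_ij}, assumed nonnegative."""
    succ = {}
    for (i, j), c in costs.items():
        succ.setdefault(i, []).append((j, c))
    dist, prev = {origin: 0.0}, {}
    heap = [(0.0, origin)]
    while heap:
        d, i = heapq.heappop(heap)
        if d > dist.get(i, float("inf")):
            continue                     # stale heap entry
        for j, c in succ.get(i, []):
            if d + c < dist.get(j, float("inf")):
                dist[j] = d + c
                prev[j] = i
                heapq.heappush(heap, (dist[j], j))
    path = [destination]
    while path[-1] != origin:            # walk back along predecessors
        path.append(prev[path[-1]])
    path.reverse()
    return dist[destination], path

# Made-up estimated costs cbar for a 4-node lookahead network
cbar = {(1, 2): 1.0, (1, 3): 4.0, (2, 3): 1.0, (2, 4): 5.0, (3, 4): 1.0}
cost, path = shortest_path(cbar, 1, 4)
```

The first link on the returned path gives the decision vector for the current time period; the rest of the plan is discarded and recomputed at the next node.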
![Page 149: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/149.jpg)
Stochastic shortest path problem
Simulating a lookahead policy

We would like to compute

$$F^\pi = \mathbb{E} \sum_{t=0}^{T} \sum_{i,j} \hat{c}_{t,ij}\, X^\pi_{ij}(S_t)$$

but this is intractable. Let $\omega$ be a sample realization of costs

$$\hat{c}_{t,t',ij}(\omega),\ \hat{c}_{t+1,t',ij}(\omega),\ \hat{c}_{t+2,t',ij}(\omega), \ldots$$

Now simulate the policy

$$F^\pi(\omega^n) = \sum_{t=0}^{T} \sum_{i,j} \hat{c}_{t,ij}(\omega^n)\, X^\pi_{ij}(S_t(\omega^n))$$

Finally, get the average performance

$$\bar{F}^\pi = \frac{1}{N} \sum_{n=1}^{N} F^\pi(\omega^n)$$

Talk through how this works.
© 2018 W.B. Powell
![Page 150: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/150.jpg)
Stochastic shortest path problem
Notes:
» The deterministic lookahead is still a policy for a stochastic problem.
» Can we make it better?

Idea:
» Instead of using the expected cost, what about using a percentile?
» Use the pdf of $\hat{c}_{tij}$ to find a percentile. Let

$$\tilde{c}_{tij}(\theta) = \text{the } \theta\text{-percentile of } \hat{c}_{tij}$$

» which means $\mathbb{P}\left[\hat{c}_{tij} \le \tilde{c}_{tij}(\theta)\right] = \theta$.
© 2018 W.B. Powell
![Page 151: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/151.jpg)
Stochastic shortest path problem
The percentile policy.
» Solve the linear program (shortest path problem):

$$X^\pi(S_t \mid \theta) = \arg\min_{\tilde{x}} \sum_{i \in N} \sum_{j \in N_i^+} \tilde{c}_{tij}(\theta)\, \tilde{x}_{tij} \qquad (\tilde{x}_{tij} = 1 \text{ if the decision is to take } (i,j))$$

» subject to

$$\sum_j \tilde{x}_{t,i,j} = 1 \qquad \text{flow out of the current node } i = n_t \text{ where we are located}$$
$$\sum_i \tilde{x}_{t,i,r} = 1 \qquad \text{flow into the destination node } r$$
$$\sum_i \tilde{x}_{tij} - \sum_k \tilde{x}_{tjk} = 0 \qquad \text{for all other nodes } j$$

» This is a deterministic shortest path problem that we could solve using Bellman's equation, but for now we will just view it as a black box optimization problem.
© 2018 W.B. Powell
![Page 152: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/152.jpg)
Stochastic shortest path problem
Simulating a lookahead policy

Let $\omega$ be a sample realization of costs

$$\hat{c}_{t,t',ij}(\omega),\ \hat{c}_{t+1,t',ij}(\omega),\ \hat{c}_{t+2,t',ij}(\omega), \ldots$$

Now simulate the policy

$$F^\pi(\omega, \theta) = \sum_{t=0}^{T} \sum_{i,j} \hat{c}_{t,ij}(\omega)\, X^\pi_{ij}(S_t(\omega) \mid \theta)$$

Finally, get the average performance

$$\bar{F}^\pi(\theta) = \frac{1}{N} \sum_{n=1}^{N} F^\pi(\omega^n, \theta)$$
© 2018 W.B. Powell
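A toy simulation of the percentile idea on a made-up two-route network: a safe route with a known cost of 10 versus a risky route whose cost is uniform on [6, 16]. All numbers are illustrative assumptions:

```python
import random

def percentile(samples, theta):
    """Empirical theta-percentile of a list of cost samples."""
    s = sorted(samples)
    idx = min(int(theta * (len(s) - 1) + 0.5), len(s) - 1)
    return s[idx]

def simulate_percentile_policy(theta, n_trips=500, seed=0):
    """Plan with the theta-percentile c_tilde of the risky route's cost,
    then pay the realized cost of whichever route the plan chooses."""
    rng = random.Random(seed)
    risky_samples = [rng.uniform(6.0, 16.0) for _ in range(1000)]
    c_tilde = percentile(risky_samples, theta)   # planning cost, risky route
    total = 0.0
    for _ in range(n_trips):
        realized = rng.uniform(6.0, 16.0)        # actual risky-route cost
        total += realized if c_tilde < 10.0 else 10.0
    return total / n_trips
```

A small $\theta$ plans optimistically and takes the risky route (average realized cost about 11 here); a large $\theta$ plans conservatively and sticks to the safe route, which is the cost-vs-risk tradeoff that tuning $\theta$ explores.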
![Page 153: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/153.jpg)
Stochastic shortest path problem
Policy tuning
» Cost vs. lateness (risk)

[Figure: tradeoff of cost vs. lateness as the percentile parameter varies.]
© 2018 W.B. Powell
![Page 154: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/154.jpg)
Notes on cost function approximations
© 2019 Warren Powell Slide 154
![Page 155: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/155.jpg)
Cost function approximations
Notes:» CFAs are the dirty secret of real-world stochastic
optimization. They are easy to understand, easier to solve, and make it possible to incorporate problem structure.
» The first challenge is to identify an effective parameterization. This requires having insights into the problem.
» The second challenge is to optimize over the parameters.
» There are close parallels between policy search and machine learning.
© 2019 Warren Powell
![Page 156: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/156.jpg)
Cost function approximations
Tuning the parameters
» Let $X^\pi(S_t \mid \theta)$ be our tunable policy.
» The simulated performance of the policy is given by:
• Final reward: let $x^{\pi,N}(\theta)$ be the solution produced by following policy $\pi$ parameterized by $\theta$. The performance of the policy is given by

$$F^\pi(\theta, W) = F\left(x^{\pi,N}(\theta), \hat{W}\right)$$

• Cumulative reward:

$$F^\pi(\theta, W) = \sum_{t=0}^{T} C\left(S_t, X^\pi(S_t \mid \theta)\right)$$
© 2019 Warren Powell
![Page 157: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/157.jpg)
Cost function approximations
Tuning the parameters
» We can write any parameter tuning problem as

$$\max_\theta \mathbb{E}\, F^\pi(\theta, W)$$

» This is a basic stochastic learning problem. We can approach it using:
• Derivative-based stochastic search – we will probably need to use numerical derivatives.
• Derivative-free stochastic search – we might represent the set of possible values of $\theta \in \{\theta_1, \ldots, \theta_K\}$ and then search over this finite set.
© 2019 Warren Powell
![Page 158: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/158.jpg)
Learning and tuning policies
Slide 158
![Page 159: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/159.jpg)
Learning policies

Simulating a policy
» Regardless of our belief model (frequentist or Bayesian), a learning process looks like

$$\left(S^0, x^0, W^1, S^1, x^1, W^2, \ldots, S^{N-1}, x^{N-1}, W^N, S^N\right)$$

» Our observations might come from

$$W^{n+1}_x = \mu_x + \varepsilon^{n+1}$$

or perhaps

$$\hat{F}^{n+1} = F(x^n, W^{n+1})$$

There are different ways to interpret our “W” variable.
» Imagine that we have three policies $\pi_1$, $\pi_2$, and $\pi_3$. Let $\Pi = (\pi_1, \pi_2, \pi_3)$.
» In offline learning, we only care about our final answer, which we call:

$$x^{\pi,N} = \arg\max_x \bar{\mu}^N_x$$
![Page 160: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/160.jpg)
Learning policies

Learning the best policy
» Imagine that we have three policies $\pi_1$, $\pi_2$, and $\pi_3$. Let $\Pi = (\pi_1, \pi_2, \pi_3)$.
» In offline (final reward) learning, we will optimize the final reward

$$\max_{\pi \in \Pi} \mathbb{E}_\mu \mathbb{E}_{W^1,\ldots,W^N \mid \mu}\, \mu_{x^{\pi,N}}$$

or

$$\max_{\pi \in \Pi} \mathbb{E}_\mu \mathbb{E}_{W^1,\ldots,W^N \mid \mu} \mathbb{E}_{\hat{W} \mid \mu}\, F\left(x^{\pi,N}, \hat{W}\right)$$

» In online (cumulative reward) learning, we optimize the cumulative rewards:

$$\max_{\pi \in \Pi} \mathbb{E}_\mu \mathbb{E}_{W^1,\ldots,W^N \mid \mu} \sum_{n=0}^{N-1} F\left(X^\pi(S^n \mid \theta), W^{n+1}\right)$$
![Page 161: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/161.jpg)
Learning policies

Finding a policy
» Imagine that we have three classes of policies $\pi_1$, $\pi_2$, and $\pi_3$. Let $\Pi = (\pi_1, \pi_2, \pi_3)$.
» For each class of policy, we might have a set of parameters $\theta \in \Theta$ (for example) that have to be tuned.
» In offline learning, we only care about our final answer, which we call:

$$x^{\pi,N} = \arg\max_x \bar{\mu}^N_x$$

Note that the policy $\pi$ is implicit in the estimate $\bar{\mu}^N_x$. We make it explicit when we write the final design $x^{\pi,N}$.
» Now we wish to solve the optimization problem that finds the best policy (best class, and best within the class).
![Page 162: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/162.jpg)
Learning policies

Notes
» Buried in the estimate of $\bar{\mu}^N_x$ is:
• The truth $\mu_x$
• The set of observations $W^1, W^2, \ldots, W^N$
» ...where the choices $x^0, x^1, \ldots, x^{N-1}$ are determined by our learning policy $x^n = X^\pi(S^n)$. This is the reason we label

$$x^{\pi,N} = \arg\max_x \bar{\mu}^N_x$$

with the policy $\pi$.
![Page 163: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/163.jpg)
Learning policies

Evaluating a policy
» The final design $x^{\pi,N}$ is a random variable since it depends on:
• The true $\mu$ (in our Bayesian model, we treat this as a random variable)
• The observations $W^1, W^2, \ldots, W^N$
» We can evaluate a policy $X^\pi(S)$ using

$$F^\pi = \mathbb{E}\, \mu_{x^{\pi,N}} = \mathbb{E}\, \bar{\mu}^N_{x^{\pi,N}} \qquad \text{(these are equivalent - more on this later)}$$

» Finally, we write our optimization problem as

$$\max_\pi \mathbb{E}\, \mu_{x^{\pi,N}}$$

» It is useful to express what the expectations are over, so we write

$$\max_\pi \mathbb{E}_\mu \mathbb{E}_{W^1, W^2, \ldots, W^N \mid \mu}\, \mu_{x^{\pi,N}}$$
![Page 164: ORF 544 Stochastic Optimization and Learning · Derivative-free stochastic search Notes: » The material for this week and next will all be drawn from chapter 7. » Chapter 7 is 60](https://reader034.vdocuments.us/reader034/viewer/2022042319/5f08d25f7e708231d423e2d6/html5/thumbnails/164.jpg)
Learning policies

Simulating a policy
» We cannot compute expectations, so we have to learn how to simulate:

      $\max_\pi \mathbb{E}_\mu\,\mathbb{E}_{W^1,W^2,\ldots,W^N|\mu}\,\mu_{x^{\pi,N}}$

» Let $\mu(\psi)$ be a sampled truth, and let $W^1(\omega), \ldots, W^N(\omega)$ be a sample path of function observations.
» Assume we have truths $\psi_\ell, \ell = 1, \ldots, L$, and sample paths $\omega_1, \ldots, \omega_K$.
» Let $\bar{\mu}^{\pi,N}(\psi_\ell, \omega_k)$ be the estimate of $\mu$ when the truth is $\psi_\ell$ and the sample path of realizations is $\omega_k$, while following experimental policy $\pi$.
» Now evaluate the policy using

      $\bar{F}^\pi = \frac{1}{L} \sum_{\ell=1}^{L} \frac{1}{K} \sum_{k=1}^{K} \max_x \bar{\mu}^{\pi,N}_x(\psi_\ell, \omega_k)$
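This double loop over sampled truths and sample paths is straightforward to code. The sketch below makes illustrative assumptions (Gaussian truths and noise, a round-robin exploration policy as the $\pi$ being evaluated) and scores the final design against the sampled truth $\mu(\psi_\ell)$, which in expectation is equivalent to scoring the final estimate; none of the names below come from the book.

```python
import numpy as np

def simulate_policy(policy, L=20, K=20, M=5, N=50, sigma_w=1.0, seed=0):
    """Monte Carlo estimate of F^pi: average over L sampled truths psi_ell
    and, for each truth, K sample paths omega_k of N observations."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(L):
        mu_truth = rng.standard_normal(M)     # sampled truth mu(psi_ell)
        for _ in range(K):                    # one sample path omega_k
            mu_bar = np.zeros(M)
            counts = np.zeros(M)
            for n in range(N):
                x = policy(mu_bar, counts)
                W = mu_truth[x] + sigma_w * rng.standard_normal()
                counts[x] += 1
                mu_bar[x] += (W - mu_bar[x]) / counts[x]
            # value of the final design under the sampled truth
            total += mu_truth[int(np.argmax(mu_bar))]
    return total / (L * K)

# A simple experimental policy to evaluate: pure (round-robin) exploration
def round_robin(mu_bar, counts):
    return int(np.argmin(counts))   # always try the least-sampled alternative
```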
[Page 165]
“Contextual” learning
Slide 165
[Page 166]
Contextual learning

What if we are given information before making a decision?
» We see the attributes of a patient before prescribing a treatment.
» We are given a weather report before having to decide how many newspapers to put in our kiosk.
» Let’s call this an “environmental state” variable (this is not standard, but there is not a standard name for this).

Example:
» Newsvendor problem with a dynamic state variable (the price):

      $\max_{x \ge 0} \mathbb{E}\,C(S_t, x, W) = \mathbb{E}\big[\,p_t \min(x, W) - cx\,\big]$

» Further assume that $p_t$ is independent of any previous prices, and that it is revealed before we make our decision $x$.
» The price $p_t$ is known as a “context.” This would be called a “contextual learning problem” or a “contextual bandit problem.”

© 2019 Warren Powell
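To make the newsvendor example concrete, here is a Monte Carlo sketch of the objective $\mathbb{E}[p \min(x, W) - cx]$. The demand distribution (exponential with mean 10) and unit cost $c = 1$ are purely illustrative assumptions; the point is that the best order quantity $x$ shifts with the revealed price $p$.

```python
import numpy as np

def expected_profit(x, p, c=1.0, demand_samples=None):
    """Monte Carlo estimate of E[p*min(x, W) - c*x] for a given price p.
    Demand W ~ Exponential(mean 10) is an illustrative assumption."""
    if demand_samples is None:
        demand_samples = np.random.default_rng(0).exponential(scale=10.0,
                                                              size=100_000)
    return float(np.mean(p * np.minimum(x, demand_samples) - c * x))

# The profit-maximizing order quantity depends on the price context p:
rng = np.random.default_rng(0)
W = rng.exponential(scale=10.0, size=100_000)   # shared demand samples
grid = np.arange(0.0, 40.0, 0.5)
best_x = {p: grid[int(np.argmax([expected_profit(x, p, demand_samples=W)
                                 for x in grid]))]
          for p in (2.0, 5.0)}
```

A higher price makes lost sales more costly relative to overordering, so the optimal quantity rises with $p$.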
[Page 167]
Contextual learning

Why is this special?
» Without the dynamic information, we would be searching for the best single decision $x$.
» With the contextual information, we are now looking for $x(p_t)$, where $p_t$ is the revealed price. This means that instead of looking for a deterministic variable (or vector), we are now looking for a function.
» When we are looking for a decision as a function of a state variable (even part of a state variable), this is what we call a policy.
» The learning literature makes a big deal about “contextual bandit problems.” For us, there is nothing different about this than any other state-dependent problem.
© 2019 Warren Powell
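For the newsvendor, the function we are looking for can even be written in closed form: maximizing $\mathbb{E}[p \min(x, W) - cx]$ gives the critical-fractile condition $\mathbb{P}(W \le x^*) = (p - c)/p$, and if we assume (for illustration only) exponentially distributed demand, the policy $x(p)$ is:

```python
import numpy as np

def newsvendor_policy(p, c=1.0, mean_demand=10.0):
    """Order quantity as a function of the price context p.
    Critical fractile: P(W <= x*) = (p - c)/p. For W ~ Exponential(mean_demand),
    1 - exp(-x/mean_demand) = (p - c)/p solves to x*(p) = mean_demand*ln(p/c)."""
    if p <= c:
        return 0.0                       # price below cost: order nothing
    return mean_demand * np.log(p / c)
```

Note that the answer is not a number but a function of the revealed price — exactly what we mean by a policy.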