Sequential Off-line Learning with Knowledge Gradients. Peter Frazier, Warren Powell, Savas Dayanik. Department of Operations Research and Financial Engineering, Princeton University.

Page 1: Sequential Off-line Learning with Knowledge Gradients

Peter Frazier, Warren Powell, Savas Dayanik
Department of Operations Research and Financial Engineering, Princeton University

Page 2: Overview

• Problem Formulation
• Knowledge Gradient Policy
• Theoretical Results
• Numerical Results

Page 3: Measurement Phase

[Figure: alternatives 1…M, with N opportunities to do experiments; measured alternatives accumulate experimental outcomes such as +1, +.2, −1, −.2.]

Page 4: Implementation Phase

[Figure: alternatives 1…M; in the implementation phase a single alternative is chosen and its reward is collected.]

Page 5: Taxonomy of Sampling Problems

  Sampling \ Reward structure    On-line    Off-line
  Sequential                                   X
  Batch

Page 6: Example 1

• One common experimental design is to spread measurements equally across the alternatives.

    x^n = (n mod M) + 1

[Figure: prior distributions plotted as quality (y-axis) versus alternative (x-axis).]

Page 7: Example 1

    x^n = (n mod M) + 1   (Round-Robin Exploration)
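Round-robin exploration is simple enough to state in a few lines. A minimal sketch, using 0-based indices (the slides' formula x^n = (n mod M) + 1 is 1-based); the function name is mine:

```python
def round_robin(n, M):
    """Alternative measured at time n under round-robin exploration (0-based)."""
    return n % M

# Cycling through M = 3 alternatives over 7 measurement opportunities:
schedule = [round_robin(n, 3) for n in range(7)]
# schedule == [0, 1, 2, 0, 1, 2, 0]
```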

Page 8: Example 2

• How might we improve round-robin exploration for use with this prior?

    x^n = argmax_x σ^n_x

Page 9: Example 2

    x^n = argmax_x σ^n_x   (Largest-Variance Exploration)

Page 10: Example 3

    Exploitation:  x^n = argmax_x μ^n_x
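Both index policies, largest-variance exploration (Example 2) and exploitation (Example 3), are one-line argmax rules over the current beliefs. A sketch with plain lists and 0-based indices; the names are mine:

```python
def largest_variance(sigma):
    """Example 2: measure the alternative with the largest posterior std dev."""
    return max(range(len(sigma)), key=lambda x: sigma[x])

def exploitation(mu):
    """Example 3: measure the alternative with the largest posterior mean."""
    return max(range(len(mu)), key=lambda x: mu[x])

mu    = [0.2, 1.0, -0.5]
sigma = [2.0, 0.5,  1.0]
# largest_variance(sigma) == 0 while exploitation(mu) == 1
```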

Page 11: Model

• x^n is the alternative tested at time n.
• Measuring alternative x^n yields  y^{n+1} = Y_{x^n} + ε^{n+1}.
• At time n, Y_x ~ N(μ^n_x, (σ^n_x)²), independent across alternatives.
• The error ε^{n+1} is independent N(0, (σ^ε)²).
• At time N, choose an alternative so as to maximize  E[Y_{x^N}] = E[max_x μ^N_x].

Page 12: State Transition

• At time n we measure alternative x^n and update our estimate of Y_{x^n} based on the measurement y^{n+1}:

    μ^{n+1}_x = [ (σ^n_x)⁻² μ^n_x + (σ^ε)⁻² y^{n+1} ] / [ (σ^n_x)⁻² + (σ^ε)⁻² ]   for x = x^n
    (σ^{n+1}_x)⁻² = (σ^n_x)⁻² + (σ^ε)⁻²                                           for x = x^n

• Estimates of the other alternatives Y_x do not change:

    μ^{n+1}_x = μ^n_x,   σ^{n+1}_x = σ^n_x   for x ≠ x^n
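The transition above is the standard conjugate normal update: precisions add, and the posterior mean is the precision-weighted average of prior mean and observation. A sketch with plain floats; the function name is mine:

```python
def update(mu_x, sigma_x, y, sigma_eps):
    """Posterior (mean, std dev) of the measured alternative after observing y."""
    prec_prior = sigma_x ** -2        # (sigma^n_x)^(-2)
    prec_noise = sigma_eps ** -2      # (sigma^eps)^(-2)
    prec_post = prec_prior + prec_noise
    mu_post = (prec_prior * mu_x + prec_noise * y) / prec_post
    return mu_post, prec_post ** -0.5

# With equal prior and noise precision, the posterior mean is the simple
# average of the prior mean and the observation:
mu1, s1 = update(mu_x=0.0, sigma_x=1.0, y=2.0, sigma_eps=1.0)
# mu1 == 1.0 and s1 == 1/sqrt(2)
```

Estimates of the unmeasured alternatives are simply left untouched, as on the slide.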

Page 13

• At time n, μ^{n+1}_x is a normal random variable with mean μ^n_x and variance (σ̃^{n+1}_x)² satisfying

    (σ^{n+1}_x)² = (σ^n_x)² − (σ̃^{n+1}_x)²

  where (σ^n_x)² is the uncertainty about Y_x before the measurement, (σ^{n+1}_x)² is the uncertainty about Y_x after the measurement, and (σ̃^{n+1}_x)² is the change in the best estimate of Y_x due to the measurement.

• The value of the optimal policy satisfies Bellman's equation:

    V^N(μ^N, σ^N) = max_x μ^N_x
    V^n(μ^n, σ^n) = max_{x^n} E^n[ V^{n+1}(μ^{n+1}, σ^{n+1}) ]

Page 14: Utility of Information

Consider our "utility of information", U(μ^n, σ^n) := max_x μ^n_x, and consider the random change in utility due to a measurement at time n:

    ΔU^{n+1} := U(μ^{n+1}, σ^{n+1}) − U(μ^n, σ^n)

Page 15: Knowledge Gradient Definition

• The knowledge gradient policy chooses the measurement that maximizes this expected increase in utility,

    x^n = argmax_x E^n[ΔU^{n+1}]

Page 16: Knowledge Gradient

• We may compute the knowledge gradient policy via

    E^n[ΔU^{n+1}] := E^n[ U(μ^{n+1}, σ^{n+1}) − U(μ^n, σ^n) ]
                   = E^n[ (max_x μ^{n+1}_x) − max_x μ^n_x ]
                   = E^n[ (μ^{n+1}_{x^n} ∨ max_{x ≠ x^n} μ^n_x) − max_x μ^n_x ],

which is the expectation of the maximum of a normal random variable and a constant.

Page 17: Knowledge Gradient

The computation becomes

    x^n = argmax_x σ̃(σ^n_x) f(ζ^n_x)

where Φ is the standard normal cdf, φ is the standard normal pdf, and

    σ̃(s) := s² / √(s² + (σ^ε)²)
    ζ^n_x := −| μ^n_x − max_{x′ ≠ x} μ^n_{x′} | / σ̃(σ^n_x)
    f(z) := z Φ(z) + φ(z)
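The whole policy fits in a few lines of standard-library Python, since f needs only the normal cdf and pdf. A sketch: function names and 0-based indexing are mine, and `sigma` holds standard deviations (the numerical-example tables on later slides list variances):

```python
import math

def Phi(z):                       # standard normal cdf
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def phi(z):                       # standard normal pdf
    return math.exp(-z * z / 2.0) / math.sqrt(2.0 * math.pi)

def f(z):                         # f(z) = z * Phi(z) + phi(z)
    return z * Phi(z) + phi(z)

def sigma_tilde(s, sigma_eps):    # sigma~(s) = s^2 / sqrt(s^2 + (sigma^eps)^2)
    return s * s / math.sqrt(s * s + sigma_eps ** 2)

def kg_policy(mu, sigma, sigma_eps):
    """Choose x maximizing sigma~(sigma_x) * f(zeta_x)."""
    def index(x):
        st = sigma_tilde(sigma[x], sigma_eps)
        if st == 0.0:             # a perfectly known alternative is never measured
            return 0.0
        delta = abs(mu[x] - max(mu[i] for i in range(len(mu)) if i != x))
        return st * f(-delta / st)
    return max(range(len(mu)), key=index)
```

For the priors of Numerical Example 1 below (all means 0, variances 4, 2, 1, σ^ε = 1) this returns index 0, i.e. alternative 1 in the slides' 1-based numbering.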

Page 18: Optimality Results

• If our measurement budget allows only one measurement (N=1), the knowledge gradient policy is optimal.

Page 19: Optimality Results

• The knowledge gradient policy is optimal in the limit as the measurement budget N grows to infinity.

• This is really a convergence result.

Page 20: Optimality Results

• The knowledge gradient policy has sub-optimality bounded by

    V^n(μ^n, σ^n) − V^{KG,n}(μ^n, σ^n) ≤ (2π)^{−1/2} (N − n − 1) max_x σ̃(σ^n_x),

  where V^{KG,n} gives the value of the knowledge gradient policy and V^n the value of the optimal policy.

Page 21: Optimality Results

• If there are exactly 2 alternatives (M=2), the knowledge gradient policy is optimal.

• In this case, the optimal policy reduces to  x^n = argmax_x σ^n_x.

Page 22: Optimality Results

• If there is no measurement noise and alternatives may be reordered so that

    μ^0_1 ≥ μ^0_2 ≥ … ≥ μ^0_M
    σ^0_1 ≥ σ^0_2 ≥ … ≥ σ^0_M,

  then the knowledge gradient policy is optimal.

Page 23: Numerical Experiments

• 100 randomly generated problems
• M ~ Uniform{1, …, 100}
• N ~ Uniform{M, 3M, 10M}
• μ^0_x ~ Uniform[−1, 1]
• (σ^0_x)² = 1 with probability 0.9, 10⁻³ with probability 0.1
• σ^ε = 1

Page 24: Numerical Experiments

Page 25: Interval Estimation

• Compare alternatives via a linear combination of mean and standard deviation:

    x^n = argmax_x μ^n_x + z_{α/2} σ^n_x

• The parameter z_{α/2} controls the tradeoff between exploration and exploitation.
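Interval estimation is again a one-line argmax. A sketch with 0-based indices and `z` standing in for z_{α/2}; the function name is mine:

```python
def interval_estimation(mu, sigma, z):
    """Pick the alternative with the largest index mu_x + z * sigma_x."""
    return max(range(len(mu)), key=lambda x: mu[x] + z * sigma[x])

# With z = 2, a very uncertain alternative beats a better-known, higher-mean one:
choice = interval_estimation([1.0, 0.0], [0.1, 1.0], z=2.0)
# choice == 1, since 0.0 + 2*1.0 > 1.0 + 2*0.1
```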

Page 26: KG / IE Comparison

Page 27: KG / IE Comparison

[Figure: distribution of Value of KG − Value of IE across the test problems.]

Page 28: IE and "Sticking"

    μ = [2.5, 0, …, 0],   σ = [0, 1, …, 1]

Alternative 1 is known perfectly.

    x^n = argmax_x μ^n_x + z_{α/2} σ^n_x,   with z_{α/2} = 2.5

Page 29: IE and "Sticking"

    V^IE = 2.5,   while   V = E[max(2.5, Z_1, …, Z_M)] ↗ ∞ as M ↗ ∞
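The divergence can be checked by simulation: under the stated prior each unknown alternative has true value Z_x ~ N(0, 1), so E[max(2.5, Z_1, …, Z_M)] is easy to estimate by Monte Carlo. A sketch; the sample count and function name are mine:

```python
import random

def mc_value(M, samples=20000, seed=0):
    """Monte-Carlo estimate of E[max(2.5, Z_1, ..., Z_M)] with Z_x ~ N(0, 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        total += max([2.5] + [rng.gauss(0.0, 1.0) for _ in range(M)])
    return total / samples

# mc_value(1) stays near 2.5, mc_value(100) is clearly larger, and the gap
# keeps growing with M: this is the reward IE forgoes by sticking.
```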

Page 30: Thank You. Any Questions?

Page 31: Numerical Example 1

    x^n = argmax_x σ̃^n_x f(ζ^n_x) = 1

    x   μ_x   σ²_x   σ̃_x = σ²_x/√(σ²_x + (σ^ε)²)   Δ_x = |μ_x − max_{x′≠x} μ_{x′}|   ζ_x = −Δ_x/σ̃_x   σ̃_x f(ζ_x)
    1    0     4      1.789                          0                                  0                0.714
    2    0     2      1.155                          0                                  0                0.461
    3    0     1      0.707                          0                                  0                0.282

Page 32: Numerical Example 2

    x^n = argmax_x σ̃^n_x f(ζ^n_x) = 2

    x   μ_x   σ²_x   σ̃_x = σ²_x/√(σ²_x + (σ^ε)²)   Δ_x = |μ_x − max_{x′≠x} μ_{x′}|   ζ_x = −Δ_x/σ̃_x   σ̃_x f(ζ_x)
    1    1     0      0                              1                                  −∞               0
    2    0     1      0.707                          1                                  −1.414           0.025
    3   −1     1      0.707                          2                                  −2.828           0.001

Page 33: Numerical Example 3

    x^n = argmax_x σ̃^n_x f(ζ^n_x) = 2

    x   μ_x   σ²_x   σ̃_x = σ²_x/√(σ²_x + (σ^ε)²)   Δ_x = |μ_x − max_{x′≠x} μ_{x′}|   ζ_x = −Δ_x/σ̃_x   σ̃_x f(ζ_x)
    1    1     1      0.707                          1                                  −1.41            0.0251
    2    0     4      1.789                          1                                  −0.56            0.3223
    3   −1     1      0.707                          2                                  −2.83            0.0005

Page 34: Knowledge Gradient Example

Page 35: Interval Estimation Example

    x^n = argmax_x μ^n_x + 3 σ^n_x

Page 36: Boltzmann Exploration

Parameterized by a declining sequence of temperatures (T^0, …, T^{N−1}):

    P^n{x^n = x} = exp(μ^n_x / T^n) / Σ_{x′=1}^{M} exp(μ^n_{x′} / T^n)
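The sampling rule above can be sketched as a softmax draw; subtracting the max before exponentiating is a standard overflow guard, and the names are mine:

```python
import math
import random

def boltzmann_probs(mu, T):
    """P{x^n = x} proportional to exp(mu_x / T); max-shifted for stability."""
    m = max(mu)
    w = [math.exp((v - m) / T) for v in mu]
    s = sum(w)
    return [v / s for v in w]

def boltzmann_sample(mu, T, rng=random):
    """Draw one alternative according to the Boltzmann probabilities."""
    r, acc = rng.random(), 0.0
    for x, p in enumerate(boltzmann_probs(mu, T)):
        acc += p
        if r <= acc:
            return x
    return len(mu) - 1   # guard against floating-point round-off

# High temperature gives near-uniform probabilities, low temperature near-greedy:
p_hot  = boltzmann_probs([1.0, 0.0], T=100.0)   # both entries close to 0.5
p_cold = boltzmann_probs([1.0, 0.0], T=0.01)    # almost all mass on index 0
```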