Dynamic Optimization and Learning for Renewal Systems
Michael J. Neely, University of Southern California
Asilomar Conference on Signals, Systems, and Computers, Nov. 2010
PDF of paper at: http://ee.usc.edu/stochastic-nets/docs/renewal-systems-asilomar2010.pdf
Sponsored in part by the NSF CAREER grant CCF-0747525 and the ARL Network Science Collaborative Tech. Alliance
[Figure: a network of transmitter/receiver (T/R) nodes with a Network Coordinator; Tasks 1, 2, 3 are processed over successive renewal frames T[0], T[1], T[2].]
A General Renewal System

[Figure: timeline of renewal frames T[0], T[1], T[2] with penalty vectors y[0], y[1], y[2].]

•Renewal Frames r in {0, 1, 2, …}.
•π[r] = Policy chosen on frame r.
•P = Abstract policy space (π[r] in P for all r).
•Policy π[r] affects the frame size and penalty vector on frame r. These are random functions of π[r] (their distributions depend on π[r]):
•y[r] = [y0(π[r]), y1(π[r]), …, yL(π[r])] = Penalty Vector
•T[r] = T(π[r]) = Frame Duration
Example realizations over successive frames (same definitions as above; the values illustrate the randomness):
•y[r] = [1.2, 1.8, …, 0.4] with T[r] = 8.1
•y[r] = [0.0, 3.8, …, -2.0] with T[r] = 12.3
•y[r] = [1.7, 2.2, …, 0.9] with T[r] = 5.6
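To make the abstraction concrete, here is a minimal Python sketch of a renewal system; the two-policy space, the frame-length distributions, and the penalty statistics are illustrative assumptions, not from the paper:

```python
import random

# Illustrative renewal system: each frame, a policy pi is chosen from an
# abstract policy space P; the frame length T and penalty vector y are then
# random, with distributions that depend on pi (all values assumed).
POLICY_SPACE = ["policy_A", "policy_B"]

def run_frame(pi):
    """Simulate one renewal frame under policy pi.
    Returns (T, y): frame length T > 0 and penalty vector y."""
    if pi == "policy_A":
        T = random.expovariate(1.0 / 8.0)    # mean frame length 8
        y = [random.gauss(1.5, 0.5), random.gauss(2.0, 0.5)]
    else:
        T = random.expovariate(1.0 / 12.0)   # mean frame length 12
        y = [random.gauss(0.5, 0.5), random.gauss(3.5, 0.5)]
    return T, y

# Time averages are ratios of frame sums: sum(y0)/sum(T).
total_T, total_y0 = 0.0, 0.0
for r in range(10000):
    T, y = run_frame(random.choice(POLICY_SPACE))
    total_T += T
    total_y0 += y[0]
print("time average of y0:", total_y0 / total_T)
```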
Example 1: Opportunistic Scheduling
[Figure: three queues with time-varying channel states S1[r], S2[r], S3[r].]

•All Frames = 1 Slot.
•S[r] = (S1[r], S2[r], S3[r]) = Channel States for Slot r.
•Policy π[r]: On frame r, first observe S[r], then choose a channel to serve (i.e., one of {1, 2, 3}).
•Example Objectives: throughput, energy, fairness, etc.
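As an (assumed) concrete instance of such a policy, a max-weight rule serves the channel with the largest weighted rate; the rate model and the weights below are placeholders:

```python
import random

def observe_channels():
    # Channel states S[r] = (S1, S2, S3): rates available this slot (assumed model).
    return [random.choice([0.0, 1.0, 2.0]) for _ in range(3)]

def opportunistic_policy(S, weights):
    # Serve the channel with the largest weight * rate product (max-weight rule).
    return max(range(3), key=lambda i: weights[i] * S[i])

S = observe_channels()
served = opportunistic_policy(S, weights=[1.0, 1.0, 1.0])
print("states:", S, "-> serve channel", served + 1)
```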
Example 2: Markov Decision Problems
•M(t) = Recurrent Markov Chain (continuous or discrete time).
•Renewals are defined as recurrences to state 1.
•T[r] = random inter-renewal frame size (frame r).
•y[r] = penalties incurred over frame r.
•π[r] = policy that affects the transition probabilities over frame r.
•Objective: Minimize the time average of one penalty subject to time average constraints on the others.

[Figure: example Markov chain with states 1, 2, 3, 4; renewals occur on returns to state 1.]
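A small simulation sketch of how renewal frames arise from recurrences to a fixed state; the 4-state chain below is an assumption for illustration:

```python
import random

# Transition matrix of a 4-state recurrent Markov chain (assumed for illustration).
P = [[0.1, 0.3, 0.3, 0.3],
     [0.5, 0.2, 0.2, 0.1],
     [0.4, 0.3, 0.2, 0.1],
     [0.6, 0.1, 0.1, 0.2]]

def step(state):
    return random.choices(range(4), weights=P[state])[0]

# Walk the chain; each return to state 0 (i.e., "state 1") closes a renewal frame.
state, frame_len, frames = 0, 0, []
for _ in range(10000):
    state = step(state)
    frame_len += 1
    if state == 0:                 # recurrence to state 1 -> renewal
        frames.append(frame_len)   # T[r] = frame length
        frame_len = 0
print("mean inter-renewal frame size:", sum(frames) / len(frames))
```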
Example 3: Task Processing over Networks
[Figure: five transmitter/receiver (T/R) nodes and a Network Coordinator processing an incoming sequence of tasks.]
•Infinite Sequence of Tasks.
•E.g.: Query sensors and/or perform computations.
•Renewal Frame r = Processing Time for Frame r.
•Policy Types:
  • Low Level: {Specify Transmission Decisions over Net}
  • High Level: {Backpressure1, Backpressure2, Shortest Path}
•Example Objective: Maximize quality of information per unit time subject to per-node power constraints.
Quick Review of Renewal-Reward Theory (Pop Quiz Next Slide!)
Define the frame-average for y0[r]:

avg(y0)[R] = (1/R) Σ{r=0 to R-1} y0[r]

The time-average for y0[r] is then:

( Σ{r=0 to R-1} y0[r] ) / ( Σ{r=0 to R-1} T[r] ) = avg(y0)[R] / avg(T)[R]

*If i.i.d. over frames, by LLN this is the same as E{y0}/E{T}.
Pop Quiz: (10 points)

•Let y0[r] = Energy Expended on frame r.
•Time avg. power = (Total Energy Use)/(Total Time).
•Suppose (for simplicity) behavior is i.i.d. over frames.

To minimize time average power, which one should we minimize?

(a) E{y0[r]/T[r]}        (b) E{y0[r]}/E{T[r]}
Answer: (b). The time average power is (Total Energy)/(Total Time), which by the LLN converges to E{y0}/E{T}. So we should minimize the ratio of expectations E{y0[r]}/E{T[r]}, not the expected per-frame ratio E{y0[r]/T[r]}.
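A quick numeric check that the two quantities differ (frame outcomes assumed for illustration):

```python
import random

# Two equally likely frame outcomes (assumed): (y0, T) = (1, 1) or (10, 2).
def frame():
    return random.choice([(1.0, 1.0), (10.0, 2.0)])

total_y0, total_T = 0.0, 0.0
for _ in range(100000):
    y0, T = frame()
    total_y0 += y0
    total_T += T

# True time average power = ratio of expectations: E{y0}/E{T} = 5.5/1.5 ~ 3.67.
print(total_y0 / total_T)
# The expected per-frame ratio E{y0/T} = (1/1 + 10/2)/2 = 3.0 is NOT the time average.
```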
Two General Problem Types:

1) Minimize time average subject to time average constraints:

Minimize:   lim{R→∞} ( Σ{r<R} y0[r] ) / ( Σ{r<R} T[r] )
Subject to: lim{R→∞} ( Σ{r<R} yl[r] ) / ( Σ{r<R} T[r] ) ≤ cl for all l in {1, …, L},
            π[r] in P for all frames r.

2) Maximize a concave function φ(x1, …, xL) of the time averages:

Maximize:   φ(x1, …, xL), where xl = time average of yl[r] per unit time,
Subject to: the same form of time average constraints, with π[r] in P for all r.
Solving the Problem (Type 1):
Define a “Virtual Queue” for each inequality constraint:
[Figure: virtual queue Zl[r] with arrivals yl[r] and service clT[r] on each frame.]

Zl[r+1] = max[Zl[r] – clT[r] + yl[r], 0]

Stabilizing all queues Zl[r] ensures the time average of yl per unit time is at most cl.
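In code, the virtual queue update is one line per constraint (a sketch; the targets cl and the frame outcomes are assumed inputs):

```python
def update_virtual_queues(Z, y, T, c):
    """One frame's update for all L virtual queues.
    Z, y, c: length-L lists (queue sizes, penalties, constraint targets); T: frame length."""
    return [max(Z[l] - c[l] * T + y[l], 0.0) for l in range(len(Z))]

# Example: two constraints with targets c = [0.25, 1.0] (assumed values).
Z = [0.0, 0.0]
Z = update_virtual_queues(Z, y=[0.3, 0.8], T=1.0, c=[0.25, 1.0])
print(Z)  # ~[0.05, 0.0]
```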
Lyapunov Function and “Drift-Plus-Penalty Ratio”:

[Figure: sample paths of virtual queue sizes Z1(t) and Z2(t).]

•Scalar measure of queue sizes:
L[r] = Z1[r]^2 + Z2[r]^2 + … + ZL[r]^2

•Frame-Based Lyapunov Drift:
Δ(Z[r]) = E{L[r+1] – L[r] | Z[r]}

•Algorithm Technique: Every frame r, observe Z1[r], …, ZL[r]. Then choose a policy π[r] in P to minimize:

“Drift-Plus-Penalty Ratio” = ( Δ(Z[r]) + V·E{y0[r] | Z[r]} ) / E{T[r] | Z[r]}
The Algorithm Becomes:
•Observe Z[r] = (Z1[r], …, ZL[r]). Choose π[r] in P to minimize:

( Δ(Z[r]) + V·E{y0[r] | Z[r]} ) / E{T[r] | Z[r]}

•Then update the virtual queues:

Zl[r+1] = max[Zl[r] – clT[r] + yl[r], 0]
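A schematic of the per-frame selection, using the standard bound on the drift (Σl Zl[r]·E{yl – clT}) as the numerator; the expectation oracle is an assumption and would in practice be estimated, e.g., by the sampling approach later in the talk:

```python
def dpp_ratio_policy(Z, policies, expectations, c, V):
    """Choose pi in P minimizing the drift-plus-penalty ratio for frame r.
    expectations(pi) -> (E_T, E_y): expected frame length and penalty vector
    [E{y0}, ..., E{yL}] under pi -- an assumed oracle, estimated in practice.
    V >= 0 trades optimality of y0 against virtual queue backlog."""
    def ratio(pi):
        E_T, E_y = expectations(pi)
        # Standard bound on the drift: sum_l Z_l * (E{y_l} - c_l * E{T}).
        drift = sum(Z[l] * (E_y[l + 1] - c[l] * E_T) for l in range(len(Z)))
        return (drift + V * E_y[0]) / E_T
    return min(policies, key=ratio)

# After the frame completes, update each Zl with the observed (y, T) as above.
```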
Theorem: Assume the constraints are feasible. Then under this algorithm, for all frames r in {1, 2, 3, …}:

(a) All virtual queues remain bounded in expectation, so every time average constraint is satisfied.
(b) The time average of y0 is within O(1/V) of the optimal value (with a corresponding O(V) growth in queue sizes).
Solving the Problem (Type 2):
We reduce it to a problem with the structure of Type 1 via:
• Auxiliary Variables γ[r] = (γ1[r], …, γL[r]).
• The following variation on Jensen’s Inequality:

For any concave function φ(X1, …, XL) and any (arbitrarily correlated) vector of random variables (X1, X2, …, XL, T), where T > 0, we have:

E{T·φ(X1, …, XL)} / E{T}  ≤  φ( E{T·X1}/E{T}, …, E{T·XL}/E{T} )
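A quick Monte Carlo sanity check of this inequality in one dimension (the distribution and φ are assumed for illustration):

```python
import random

# Samples (X, T) with T > 0 (assumed distribution); concave phi = sqrt.
samples = [(random.uniform(0, 4), random.uniform(0.5, 2.0)) for _ in range(100000)]
phi = lambda x: x ** 0.5

lhs = sum(T * phi(X) for X, T in samples) / sum(T for _, T in samples)
rhs = phi(sum(T * X for X, T in samples) / sum(T for _, T in samples))
print(lhs <= rhs + 1e-9)  # True: E{T*phi(X)}/E{T} <= phi(E{T*X}/E{T})
```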
The Algorithm (Type 2) Becomes:

•On frame r, observe Z[r] = (Z1[r], …, ZL[r]) and the auxiliary queues G[r] = (G1[r], …, GL[r]).
•(Auxiliary Variables) Choose γ1[r], …, γL[r] to maximize the deterministic problem:
V·φ(γ1[r], …, γL[r]) – Σl Gl[r]·γl[r], over the feasible range of the γl values.
•(Policy Selection) Choose π[r] in P to minimize the drift-plus-penalty ratio, now with the Gl[r] queues included in the drift.
•Then update the virtual queues:
Zl[r+1] = max[Zl[r] – clT[r] + yl[r], 0],  Gl[r+1] = max[Gl[r] + γl[r]T[r] – yl[r], 0]
Example Problem – Task Processing:
[Figure: five T/R nodes and a Network Coordinator processing Tasks 1, 2, 3.]
•Every task reveals random task parameters η[r]:
η[r] = [(qual1[r], T1[r]), (qual2[r], T2[r]), …, (qual5[r], T5[r])]
•Choose π[r] = [which node to transmit, how much idle time] in {1, 2, 3, 4, 5} × [0, Imax].
•Transmissions incur power.
•We use a quality distribution that tends to be better for higher-numbered nodes.
•Maximize quality/time subject to pav ≤ 0.25 for all nodes.

[Figure: structure of frame r: Setup, Transmit, Idle I[r].]
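A sketch of the per-frame decision for this example; the unit transmit power, the omission of the setup time, and the quality model are assumptions for illustration:

```python
import random

V, I_MAX, P_TARGET = 10.0, 5.0, 0.25   # tradeoff weight, max idle time, power target

def reveal_task():
    # eta[r]: (quality, transmit-time) pairs for nodes 1..5; quality tends to be
    # better for higher-numbered nodes (assumed distributions).
    return [(random.uniform(0, 1) * (n + 1), random.uniform(1, 3)) for n in range(5)]

def choose(eta, Z):
    """Pick (node, idle) minimizing the per-frame drift-plus-penalty ratio.
    Maximizing quality/time means minimizing y0 = -quality, hence the -V*qual term.
    The ratio is monotone in the idle time, so checking the endpoints suffices."""
    best, best_ratio = None, float("inf")
    for n, (qual, t_tx) in enumerate(eta):
        for idle in (0.0, I_MAX):
            T = t_tx + idle                                       # frame length
            energy = [t_tx if m == n else 0.0 for m in range(5)]  # unit tx power
            drift = sum(Z[m] * (energy[m] - P_TARGET * T) for m in range(5))
            ratio = (drift - V * qual) / T
            if ratio < best_ratio:
                best, best_ratio = (n, idle), ratio
    return best

node, idle = choose(reveal_task(), Z=[0.0] * 5)
print("transmit via node", node + 1, "then idle", idle)
```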
Minimizing the Drift-Plus-Penalty Ratio:
•Minimizing a pure expectation, rather than a ratio, is typically easier (see Bertsekas & Tsitsiklis, Neuro-Dynamic Programming).
•Define θ[r] as the optimal (minimum) ratio value:
θ[r] = min over π in P of ( Δ(Z[r]) + V·E{y0[r] | Z[r]} ) / E{T[r] | Z[r]}
•“Bisection Lemma”: For any θ, let f(θ) = min over π in P of E{ (drift-plus-penalty numerator under π) – θ·T(π) | Z[r] }. Then f(θ) is nonincreasing in θ and f(θ[r]) = 0, so θ[r] can be found by bisection, where each step solves only a pure expectation minimization.
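A generic sketch of ratio minimization by bisection; the candidate set and the num/den evaluators stand in for the policy space and the (estimated) expectations:

```python
def min_ratio_by_bisection(candidates, num, den, lo=-100.0, hi=100.0, tol=1e-6):
    """Find min over candidates of num(c)/den(c) (den > 0) by bisection on theta:
    f(theta) = min_c [num(c) - theta*den(c)] is nonincreasing in theta, and the
    optimal ratio theta* is its root."""
    f = lambda theta: min(num(c) - theta * den(c) for c in candidates)
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if f(mid) > 0:
            lo = mid      # theta below the min ratio: raise it
        else:
            hi = mid      # theta at or above the min ratio: lower it
    return (lo + hi) / 2.0

# Example: minimize (c^2 + 1)/c over c in {0.5, 1, 2, 4} -> 2.0, attained at c = 1.
print(min_ratio_by_bisection([0.5, 1.0, 2.0, 4.0],
                             num=lambda c: c * c + 1.0,
                             den=lambda c: c, lo=0.0, hi=10.0))
```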
Learning via Sampling from the Past:

•Suppose the randomness is characterized by past random samples: {η1, η2, …, ηW}.
•Want to compute, for each candidate policy π (over the unknown random distribution of η):
E{ h(π, η) }   (e.g., the expectations appearing in the drift-plus-penalty ratio)
•Approximate this via the W samples from the past:
(1/W) Σ{w=1 to W} h(π, ηw)
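A minimal sketch of the sample-average approximation (the function h and the sample list are placeholders):

```python
def empirical_expectation(h, pi, past_samples):
    """Approximate E{h(pi, eta)} over the unknown distribution of eta
    by averaging h over the W observed samples {eta_1, ..., eta_W}."""
    return sum(h(pi, eta) for eta in past_samples) / len(past_samples)

# Usage sketch: rank policies by an estimated ratio using the same W samples:
# ratio_hat(pi) = empirical_expectation(numerator, pi, samples) /
#                 empirical_expectation(frame_length, pi, samples)
```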
Simulation:

[Plot: Quality of Information / Unit Time vs. Sample Size W, comparing the Drift-Plus-Penalty Ratio Algorithm with Bisection against an Alternative Algorithm with Time Averaging.]
Concluding Sims (values for W = 10):

[Table: concluding simulation values for W = 10.]
Quick Advertisement: New Book
M. J. Neely, Stochastic Network Optimization with Application to Communication and Queueing Systems. Morgan & Claypool, 2010.
http://www.morganclaypool.com/doi/abs/10.2200/S00271ED1V01Y201006CNT007
•PDF also available from the “Synthesis Lecture Series” (on the digital library).
•Lyapunov Optimization theory (including these renewal system problems).
•Detailed examples and problem set questions.