
Robbins, Monro: A Stochastic Approximation Method

Robert Bassett

University of California Davis

Student Run Optimization Seminar
Oct 10, 2017

Motivation

You are a carrot farmer. The amount of carrots that you plant plays a part in how much carrots cost in the store, and hence how much profit you make on your harvest.

- Too many carrots planted → glut in the market and low profit.

- Too few carrots planted → unexploited demand for carrots and low profit.

The connection between carrots planted and profit is complicated! We will model this as a stochastic optimization problem, in an attempt to deal with the random components of this model (weather, economic climate, transportation costs, etc.).


Goal: Find a carrot planting decision θ* which minimizes your expected loss.

- Model loss as a random function of your planting decision θ:

  Loss ∼ q(x, θ), with q known.

- x: random vector of external factors, independent of θ:

  x ∼ F(x), with F unknown.

- Carrot Planter's Problem (CPP):

  min_{θ ∈ Θ} Q(θ) = E[q(x, θ)]

- q : R^m × R^n → R; Θ ⊆ R^n is the set of possible decisions.
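As a concrete illustration (not from the slides), here is a minimal NumPy sketch of this setup: a hypothetical loss q(x, θ), a stand-in distribution for the unknown F that we can only sample from, and a Monte Carlo estimate of Q(θ) = E[q(x, θ)] used purely for comparison. The specific q, distribution, and dimensions are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def q(x, theta):
    """Hypothetical planting loss: quadratic in theta, shifted by the random factors x."""
    return 0.5 * np.sum((theta - x) ** 2)

def sample_x():
    """Draw the external factors x ~ F. In practice F is unknown; this Gaussian is a stand-in."""
    return rng.normal(loc=2.0, scale=1.0, size=3)

def monte_carlo_Q(theta, n_samples=10_000):
    """Estimate Q(theta) = E[q(x, theta)] by averaging the loss over samples of x."""
    return np.mean([q(sample_x(), theta) for _ in range(n_samples)])

theta = np.zeros(3)
print(monte_carlo_Q(theta))  # rough estimate of the expected loss at theta = 0
```

The point of the talk is that we will never need such a Monte Carlo evaluation of Q; the method below works directly from samples.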


min_{θ ∈ Θ} Q(θ) = E[q(x, θ)]    (CPP)

Assumptions

1. With probability 1, q(x, θ) is strongly convex in θ, with uniform modulus K. That is,

   〈θ_1 − θ_2, ∇q(x, θ_1) − ∇q(x, θ_2)〉 ≥ K ||θ_1 − θ_2||²   for a.e. x.

2. We have access to a random variable y ∼ P(y, θ) with the property that

   E[y | θ] = ∇Q(θ).

3. y has second moment uniformly bounded in θ:

   ∫_Y ||y||² dP(y, θ) ≤ C²   for all θ ∈ Θ.

4. Q is differentiable, so (CPP) is equivalent to solving ∇Q(θ) = 0.
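For intuition, Assumption 1 can be checked numerically for the squared loss used later in the talk, q(x, θ) = ½||Aθ − x||², whose gradient in θ is Aᵀ(Aθ − x). A minimal sketch (illustration only; the matrix A and the points are arbitrary, and full column rank of A is assumed) with modulus K equal to the smallest eigenvalue of AᵀA:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(5, 3))                 # assumed to have full column rank
K = np.linalg.eigvalsh(A.T @ A).min()       # strong convexity modulus of q(x, .) in theta

theta1, theta2 = rng.normal(size=3), rng.normal(size=3)
x = rng.normal(size=5)                      # for this loss the inequality holds for every x

grad = lambda th: A.T @ (A @ th - x)        # gradient of q(x, .) with respect to theta
lhs = np.dot(theta1 - theta2, grad(theta1) - grad(theta2))
rhs = K * np.sum((theta1 - theta2) ** 2)
print(lhs >= rhs - 1e-9)                    # True: Assumption 1 holds with modulus K
```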


Assumption (2) says we must have access to an unbiased estimator of our gradient.

For many common loss functions this is true, e.g. squared loss:

q(x, θ) = ½ ||Aθ − x||²,   so   Q(θ) = E[½ ||Aθ − x||²]

⇒ y = ∇q(x, θ) = Aᵀ(Aθ − x), with x ∼ F, satisfies

E[y | θ] = E[∇q(x, θ) | θ] = E[Aᵀ(Aθ − x) | θ] = ∇Q(θ)

The last equality follows from the mild regularity assumptions that allow us to interchange derivative and integral. Know them!
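A quick numerical sanity check of this unbiasedness claim (illustrative only; the matrix A and a Gaussian model for x with assumed mean mu are made up for the example): averaging many samples of y should approximate ∇Q(θ) = Aᵀ(Aθ − E[x]).

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(5, 3))
mu = np.array([1.0, -2.0, 0.5, 3.0, 0.0])     # E[x] under the stand-in distribution F

def sample_y(theta):
    """One stochastic gradient y = A^T (A theta - x), with x ~ N(mu, I)."""
    x = rng.normal(loc=mu, scale=1.0)
    return A.T @ (A @ theta - x)

theta = rng.normal(size=3)
avg_y = np.mean([sample_y(theta) for _ in range(50_000)], axis=0)
grad_Q = A.T @ (A @ theta - mu)               # exact gradient of Q for this choice of F
print(np.allclose(avg_y, grad_Q, atol=0.1))   # approximately True: E[y | theta] = ∇Q(theta)
```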


- We would like to find a minimizer of Q(θ) without ever attempting to calculate the expectation that defines it.

- Why? It involves random factors that are nasty!

- Instead, we will build a sequence θ_n which depends on random samples from y ∼ P(y, θ) for different θ.

- Hope to generate iterates from samples which allow us to minimize the expectation.

- We want θ_n to be a consistent estimator of θ*. That is,

  ∀ε > 0 ∃N s.t. n ≥ N ⇒ P(||θ_n − θ*||² > ε) < ε,

  i.e. θ_n → θ* in probability.


Facts of Life (give references!)

1. L²-convergence implies convergence in probability. That is,

   E[||θ_n − θ||²] → 0 ⇒ ||θ_n − θ|| → 0 in probability.

2. A differentiable convex function defines a monotone operator. That is, if f : R^n → R is convex, then

   〈∇f(x) − ∇f(y), x − y〉 ≥ 0.

3. If f(ξ, x) is convex in x for almost every ξ, then

   F(x) = E[f(ξ, x)]

   is convex.


Theorem. Let a_n be a positive sequence which satisfies:

- Σ_n a_n² < ∞

- Σ_n a_n / (a_1 + ... + a_{n−1}) = ∞

For some initial θ_0, define the sequence

θ_{n+1} = θ_n − a_n y_n,

where y_n ∼ P(y | θ_n). Then θ_n → θ* in probability.
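To make the iteration concrete, here is a minimal, self-contained sketch (an illustration, not from the slides): the squared-loss example from earlier with an assumed Gaussian model for x, run with the step sizes a_n = 1/n that are shown at the end of the talk to satisfy both conditions. The matrix A and mean mu are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical test problem: q(x, theta) = 0.5 * ||A theta - x||^2 with x ~ N(mu, I),
# so Q(theta) = E[q(x, theta)] is minimized at theta* solving A^T A theta* = A^T mu.
A = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.5, 0.5, 0.5]])
mu = np.array([1.0, -2.0, 0.5, 3.0])
theta_star = np.linalg.solve(A.T @ A, A.T @ mu)

theta = np.zeros(3)                               # theta_0
print("initial error:", np.linalg.norm(theta - theta_star))
for n in range(1, 100_001):
    x = rng.normal(loc=mu, scale=1.0)             # fresh sample of the external factors
    y = A.T @ (A @ theta - x)                     # y_n, an unbiased estimate of ∇Q(theta_n)
    theta = theta - (1.0 / n) * y                 # theta_{n+1} = theta_n - a_n y_n, with a_n = 1/n
print("final error:  ", np.linalg.norm(theta - theta_star))  # far smaller than the initial error
```

Note that no expectation is ever computed: each update uses a single fresh sample of x.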

Proof:

Because of F.O.L. 1, we will instead show L² convergence.


Define b_n = E[||θ_n − θ*||²]. We want to show that b_n → 0. We have:

b_{n+1} = E[||θ_{n+1} − θ*||²]

        = E[ E[||θ_{n+1} − θ*||² | θ_n] ]

        = E[ ∫_Y ||(θ_n − θ*) − a_n y||² dP(y | θ_n) ]

        = b_n + a_n² E[ ∫_Y ||y||² dP(y | θ_n) ] − 2 a_n E[〈θ_n − θ*, ∇Q(θ_n)〉]


In the middle term, E[ ∫_Y ||y||² dP(y | θ_n) ] ≤ C² by Assumption 3, so

b_{n+1} ≤ b_n + a_n² C² − 2 a_n E[〈θ_n − θ*, ∇Q(θ_n)〉].


For the last term: because θ* ∈ argmin_θ Q(θ), we have ∇Q(θ*) = 0, so

E[〈θ_n − θ*, ∇Q(θ_n)〉] = E[〈θ_n − θ*, ∇Q(θ_n) − ∇Q(θ*)〉] ≥ 0

by F.O.L. 2 and 3.


Combining the two bounds,

0 ≤ b_{n+1} ≤ b_n + a_n² C² − 2 a_n E[〈θ_n − θ*, ∇Q(θ_n)〉].

Iterating this from 0 to n,

0 ≤ b_{n+1} ≤ b_0 + C² Σ_{i=0}^{n} a_i² − 2 Σ_{i=0}^{n} a_i E[〈θ_i − θ*, ∇Q(θ_i)〉],

and rearranging,

0 ≤ Σ_{i=0}^{n} a_i E[〈θ_i − θ*, ∇Q(θ_i)〉] ≤ (1/2) ( b_0 + C² Σ_{i=0}^{n} a_i² ).

Letting n → ∞: the right-hand side stays bounded because Σ a_i² < ∞, so Σ_{i=0}^{∞} a_i E[〈θ_i − θ*, ∇Q(θ_i)〉] < ∞. Moreover, since b_{n+1} − b_n ≤ a_n² C², the upward moves of b_n are summable, and b_n ≥ 0, so the limit exists:

0 ≤ lim_{n→∞} b_n < ∞.


Great! This shows that lim_{n→∞} b_n exists. But is it equal to 0?

Consider the sequence

k_n = K / (a_1 + ... + a_{n−1}).

By strong convexity (Assumption 1 passed through the expectation) and ∇Q(θ*) = 0,

E[〈θ_n − θ*, ∇Q(θ_n)〉] ≥ K E[||θ_n − θ*||²] = K b_n ≥ k_n b_n

for large enough n, since a_1 + ... + a_{n−1} → ∞ under the step-size conditions and so eventually exceeds 1.

We proved previously that Σ_n a_n E[〈θ_n − θ*, ∇Q(θ_n)〉] converges. Since a_n, k_n, and b_n are all nonnegative, this gives

Σ_n a_n k_n b_n < ∞.


By the assumption on a_n, we also have

Σ_n a_n k_n = K Σ_n a_n / (a_1 + ... + a_{n−1}) = ∞.


We have established:

1. Σ_i a_i k_i b_i < ∞

2. Σ_i a_i k_i = ∞

If lim b_n were some c > 0, then a_i k_i b_i ≥ (c/2) a_i k_i for all large i, and (2) would force the sum in (1) to diverge. So...

b_n → 0!


So are there any sequences that satisfy the conditions in the theorem?

- Σ_n a_n² < ∞

- Σ_n a_n / (a_1 + ... + a_{n−1}) = ∞

Take a_n = 1/n. This is square summable, and

Σ_n (1/n) / (1 + 1/2 + ... + 1/(n−1)) ≈ Σ_n 1 / (n ln(n−1)) ≥ Σ_n 1 / (n ln n) = ∞,

since Σ 1/(n ln n) diverges (e.g. by the integral test).
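A quick numerical sanity check of these two conditions for a_n = 1/n (illustrative only): the partial sums of a_n² level off near π²/6, while the partial sums of a_n / (a_1 + ... + a_{n−1}) keep growing, albeit very slowly (roughly like ln ln n).

```python
import numpy as np

N = 10**6
a = 1.0 / np.arange(1, N + 1)                             # a_n = 1/n for n = 1, ..., N
sq_partial = np.cumsum(a ** 2)                            # partial sums of a_n^2 -> pi^2 / 6
prev_sums = np.concatenate(([0.0], np.cumsum(a)[:-1]))    # a_1 + ... + a_{n-1} for each n
ratio = a[1:] / prev_sums[1:]                             # skip n = 1, where the denominator is empty
ratio_partial = np.cumsum(ratio)                          # grows without bound, roughly like ln ln n

for n in (100, 10_000, 1_000_000):
    print(n, sq_partial[n - 1], ratio_partial[n - 2])
```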


Extensions and loose ends

- Actually converges with probability 1 (Blum).

- Convergence with rates:

  - E[Q(θ_n) − Q(θ*)] ∈ O(1/n) (with strong convexity)

  - E[Q(θ_n) − Q(θ*)] ∈ O(1/√n) (without strong convexity)

- √n (θ_n − θ*) is asymptotically normal. (Sacks)

- The 1/√n rate cannot be beaten in the general convex case. (Nemirovski et al.)
