Experiments in Robust Optimization
Daniel Bienstock
Columbia University
10-26-06
Motivation
Robust Optimization
Optimization under parameter (data) uncertainty
Ben-Tal and Nemirovski; El Ghaoui et al.; Bertsimas et al.
Uncertainty is modeled by assuming that the data is not known precisely, and will instead lie in known sets.
Example: a coefficient $a_i$ is uncertain. We allow $a_i \in [l_i, u_i]$.
Typically, a minimization problem becomes a min-max problem.
Motivation
Example: Linear Programs with Row-Wise Uncertainty
Ben-Tal and Nemirovski, 1999
$$\min\ c^T x \quad \text{subject to:} \quad Ax \ge b \ \ \forall A \in U, \qquad x \in X$$
$U$ = uncertainty set → the $i$th row of $A$ belongs to an ellipsoidal set $E_i$,
e.g. $\sum_j \alpha_{ij}^2 (a_{ij} - \bar a_{ij})^2 \le 1$
→ can be solved using SOCP techniques
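As an illustration (not from the talk), here is a minimal cvxpy sketch of that robust counterpart. It assumes row $i$ deviates as $a_i = \bar a_i + d$ with $\sum_j \alpha_{ij}^2 d_j^2 \le 1$, so the worst case of $a_i^T x$ is $\bar a_i^T x - \|(x_j/\alpha_{ij})_j\|_2$, one second-order-cone constraint per row; all data below is hypothetical, and $X$ is taken to be the nonnegative orthant.

```python
import cvxpy as cp
import numpy as np

np.random.seed(0)
m, n = 5, 10
abar = np.random.rand(m, n) + 1.0   # nominal rows of A (hypothetical data)
alpha = 5.0 * np.ones((m, n))       # ellipsoid scalings alpha_ij
b = np.ones(m)
c = np.random.rand(n)

x = cp.Variable(n, nonneg=True)     # X = nonnegative orthant here
# worst case of a_i^T x over the ellipsoid is abar_i^T x - || x / alpha_i ||_2
cons = [abar[i] @ x - cp.norm(cp.multiply(1.0 / alpha[i], x), 2) >= b[i]
        for i in range(m)]
cp.Problem(cp.Minimize(c @ x), cons).solve()
print(x.value)
```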
Motivation
Other forms of optimization under uncertainty
Stochastic programming
Adversarial queueing, online optimization
"Risk-aware" optimization
Optimization of utility functions as a substitute for handling infeasibilities
Motivation
Scenario I: Stability
Data is fairly accurate, though possibly noisy – small errors are possible
[figure]
→ Idiosyncratic decisions and small changes in data could have a major impact
Motivation
Scenario II: Hedging
Significant, but within order-of-magnitude, data uncertainty
Example: a certain parameter, $\alpha$, is volatile. Its long-term average is 1.5, but we could expect changes on the order of 0.3.
Possibly more than just noise
Could use deviations to our advantage, especially if there are several uncertain parameters that act "correlated"
Are we guarding against risk, or are we hedging?
Motivation
Scenario III: Insurance
Real-world data can exhibit undesirable and unexpected behavior
Classical goal: how can we protect without becoming too risk-averse?
Need to clearly spell out the desired tradeoff between risk and performance
Magnitude and geometry of risk are not the same
Motivation
Application: Portfolio Optimization
$$\min\ \lambda x^T Q x - \mu^T x \quad \text{subject to:} \quad Ax \ge b$$
$\mu$ = vector of "returns", $Q$ = "covariance" matrix
$x$ = vector of "asset weights"
$Ax \ge b$: general linear constraints
$\lambda \ge 0$ = "risk-aversion" multiplier
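For concreteness, a small cvxpy sketch of this nominal mean-variance problem with made-up data; the general constraints $Ax \ge b$ are instantiated here as a budget constraint plus long-only bounds (an assumption, not from the slides).

```python
import cvxpy as cp
import numpy as np

np.random.seed(1)
n = 8
F = np.random.randn(n, 3)
Q = F @ F.T + 0.05 * np.eye(n)   # a PSD stand-in for the covariance matrix
mu = np.random.rand(n)           # stand-in return vector
lam = 10.0                       # risk-aversion multiplier

x = cp.Variable(n)
prob = cp.Problem(cp.Minimize(lam * cp.quad_form(x, Q) - mu @ x),
                  [cp.sum(x) == 1, x >= 0])  # budget + long-only as "Ax >= b"
prob.solve()
print(x.value, prob.value)
```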
Motivation
Robust Portfolio Optimization
Goldfarb and Iyengar, 2001
→ $Q$ and $\mu$ are uncertain
Robust Problem
$$\min_x \Big\{ \max_{Q \in \mathcal{Q}} \lambda x^T Q x - \min_{\mu \in E} \mu^T x \Big\} \quad \text{subject to:} \quad \sum_j x_j = 1, \ x \ge 0$$
→ When $\mathcal{Q}$ is an ellipsoid and $E$ is a product of intervals, the robust problem can be solved as an SOCP
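A partial sketch of the interval piece only, under simplifying assumptions: with $x \ge 0$ and $E = \prod_j [\mu_j^{lo}, \mu_j^{hi}]$, the inner $\min_{\mu \in E} \mu^T x$ equals $(\mu^{lo})^T x$ and can be substituted in directly. The ellipsoidal max over $Q$ needs the Goldfarb and Iyengar SOCP machinery and is not reproduced; $Q$ is held fixed below, and the data is hypothetical.

```python
import cvxpy as cp
import numpy as np

np.random.seed(2)
n = 8
F = np.random.randn(n, 3)
Q = F @ F.T + 0.05 * np.eye(n)   # Q held fixed (the max over Q is omitted)
mu_nom = np.random.rand(n)
mu_lo = mu_nom - 0.1             # hypothetical lower interval endpoints

x = cp.Variable(n)
# since x >= 0, the worst return vector in the box is mu_lo
prob = cp.Problem(cp.Minimize(10.0 * cp.quad_form(x, Q) - mu_lo @ x),
                  [cp.sum(x) == 1, x >= 0])
prob.solve()
```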
Motivation
Linear Programs with Row-Wise Uncertainty
Bertsimas and Sim, 2002
$$\max\ c^T x \quad \text{subject to:} \quad Ax \le b \ \ \forall A \in U, \qquad x \ge 0$$
→ in every row $i$, at most $\Gamma_i$ coefficients can change: $\bar a_{ij} - \delta_{ij} \le a_{ij} \le \bar a_{ij} + \delta_{ij}$
→ for all other coefficients: $a_{ij} = \bar a_{ij}$.
Robust problem can be formulated as a (larger) linear program.
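A cvxpy sketch of that larger linear program, on hypothetical data. For $x \ge 0$, the worst-case protection of row $i$ (the max of $\sum_j \delta_{ij} x_j$ over at most $\Gamma_i$ deviating coefficients) dualizes into variables $z_i, p_{ij}$:

```python
import cvxpy as cp
import numpy as np

np.random.seed(3)
m, n = 4, 6
abar = np.random.rand(m, n)      # nominal coefficients
delta = 0.1 * abar               # deviation magnitudes delta_ij
Gamma = 2.0 * np.ones(m)         # at most 2 coefficients change per row
b = np.ones(m)
c = np.random.rand(n)

x = cp.Variable(n, nonneg=True)
z = cp.Variable(m, nonneg=True)
p = cp.Variable((m, n), nonneg=True)
cons = []
for i in range(m):
    # robust row i: nominal term + dualized protection term <= b_i
    cons.append(abar[i] @ x + Gamma[i] * z[i] + cp.sum(p[i]) <= b[i])
    cons.append(z[i] + p[i] >= cp.multiply(delta[i], x))
cp.Problem(cp.Maximize(c @ x), cons).solve()
```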
Motivation
Equivalent formulation
$$\max\ c^T x \quad \text{subject to:} \quad Ax \le b \ \ \forall A \in U, \qquad x \ge 0$$
→ in every row $i$, exactly $\Gamma_i$ coefficients change: $a_{ij} = \bar a_{ij} + \delta_{ij}$
→ for all other coefficients: $a_{ij} = \bar a_{ij}$.
Data-driven Model definition
Robust Portfolio Optimization
A different uncertainty model
→ Want to model that deviations of the returns $\mu_j$ from their nominal values are rare but could be significant
A simple example
Parameters: $0 \le \gamma \le 1$, integer $N \ge 0$; for each asset $j$: $\bar\mu_j$ = expected return, $0 \le \delta_j$ small (possibly zero)
Well-behaved asset $j$: $\bar\mu_j - \delta_j \le \mu_j \le \bar\mu_j + \delta_j$
Misbehaving asset $j$: $(1 - \gamma)\bar\mu_j \le \mu_j \le \bar\mu_j$
At most $N$ assets misbehave
Data-driven Model definition
A more comprehensive setting
Parameters: $0 \le \gamma_1 \le \gamma_2 \le \ldots \le \gamma_K \le 1$, integers $0 \le n_i \le N_i$, $1 \le i \le K$
for each asset $j$: $\bar\mu_j$ = expected return
between $n_i$ and $N_i$ assets $j$ satisfy:
$$(1 - \gamma_i)\bar\mu_j \le \mu_j \le (1 - \gamma_{i-1})\bar\mu_j, \quad \text{for each } i \ge 1 \ (\gamma_0 = 0)$$
Data-driven Model definition
A more comprehensive setting
Parameters: $0 \le \gamma_1 \le \gamma_2 \le \ldots \le \gamma_K \le 1$, integers $0 \le n_i \le N_i$, $1 \le i \le K$; for each asset $j$: $\bar\mu_j$ = expected return
between $n_i$ and $N_i$ assets $j$ satisfy: $(1 - \gamma_i)\bar\mu_j \le \mu_j \le (1 - \gamma_{i-1})\bar\mu_j$
$\sum_j \mu_j \ge \Gamma \sum_j \bar\mu_j$; $\Gamma > 0$ a parameter
(R. Tutuncu) For $1 \le h \le H$, a set ("tier") $T_h$ of assets, and a parameter $\Gamma_h > 0$:
for each $h$, $\sum_{j \in T_h} \mu_j \ge \Gamma_h \sum_{j \in T_h} \bar\mu_j$
Note: only downward changes are modeled
Data-driven Benders
General methodology:
Benders' decomposition (= cutting-plane algorithm)
Generic problem: $\min_{x \in X} \max_{d \in D} f(x, d)$
→ Maintain a finite subset $\hat D$ of $D$ (a "model")
GAME
1. Implementor: solve $\min_{x \in X} \max_{d \in \hat D} f(x, d)$, with solution $x^*$
2. Adversary: solve $\max_{d \in D} f(x^*, d)$, with solution $\hat d$
3. Add $\hat d$ to $\hat D$, and go to 1.
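In code, the game is a short loop; this is a schematic sketch in which solve_implementor and solve_adversary stand in for the concrete subproblems described later, and d0 is a nominal seed scenario.

```python
def cutting_plane(solve_implementor, solve_adversary, d0,
                  tol=1e-6, max_iters=1000):
    """Generic min-max loop: D_hat is the growing finite 'model' of D."""
    D_hat = [d0]                                   # seed with nominal data
    for _ in range(max_iters):
        x_star, lower = solve_implementor(D_hat)   # min_x max_{d in D_hat}
        d_new, upper = solve_adversary(x_star)     # max_{d in D} f(x*, d)
        if upper - lower <= tol:                   # model exact at x*: stop
            return x_star, upper
        D_hat.append(d_new)                        # add violating scenario
    return x_star, upper
```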
Data-driven Benders
Why this approach
Decoupling of implementor and adversary yields considerably simpler, and smaller, problems
Decoupling allows us to use more sophisticated uncertainty models
If the number of iterations is small, the implementor's problem is a small "convex" problem
Most progress will be achieved in initial iterations – permits "soft" termination criteria
Data-driven Algorithm
Implementor's problem
A convex quadratic program
At iteration $m$, solve
$$\min\ \lambda x^T Q x - r \quad \text{subject to:} \quad Ax \ge b, \qquad r \le \mu_{(i)}^T x, \ i = 1, \ldots, m$$
Here, $\mu_{(1)}, \ldots, \mu_{(m)}$ are given return vectors
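A direct cvxpy transcription of this QP (a sketch; argument shapes and names are assumptions):

```python
import cvxpy as cp

def solve_implementor(mus, Q, A, b, lam):
    """Implementor's QP given return vectors mu_(1), ..., mu_(m) in mus."""
    n = Q.shape[0]
    x = cp.Variable(n)
    r = cp.Variable()                       # worst (smallest) modeled return
    cons = [A @ x >= b] + [r <= mu_i @ x for mu_i in mus]
    prob = cp.Problem(cp.Minimize(lam * cp.quad_form(x, Q) - r), cons)
    prob.solve()
    return x.value, prob.value
```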
Data-driven Algorithm
Adversarial problem: A mixed-integer program
$x^*$ = given asset weights
$$\min\ \sum_j x_j^* \mu_j$$
Subject to:
$$\bar\mu_j \Big(1 - \sum_i \gamma_i y_{ij}\Big) \le \mu_j \le \bar\mu_j \Big(1 - \sum_i \gamma_{i-1} y_{ij}\Big), \quad \forall j$$
$\sum_i y_{ij} \le 1$, $\forall j$ (each asset in at most one segment)
$n_i \le \sum_j y_{ij} \le N_i$, $1 \le i \le K$ (segment cardinalities)
$\sum_{j \in T_h} \mu_j \ge \Gamma_h \sum_{j \in T_h} \bar\mu_j$, $1 \le h \le H$ (tier ineqs.)
$\mu_j$ free, $y_{ij} = 0$ or $1$, all $i, j$
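A matching cvxpy sketch of the adversarial MIP, in the same notation; here gam = [0, γ₁, ..., γ_K] (a numpy array) so gam[:-1] supplies the γ_{i−1} terms, tiers/Gammas encode the tier inequalities, and a MIP-capable solver (e.g. CBC or GLPK_MI) is needed for the Boolean variables.

```python
import cvxpy as cp

def solve_adversary(x_star, mu_bar, gam, n_lo, n_hi, tiers, Gammas):
    """Minimize sum_j x*_j mu_j over the segment/tier uncertainty set."""
    K, n = len(gam) - 1, len(mu_bar)
    mu = cp.Variable(n)
    y = cp.Variable((K, n), boolean=True)  # y[i-1, j]: asset j in segment i
    lo = cp.multiply(mu_bar, 1 - gam[1:] @ y)   # mu_bar_j (1 - sum gamma_i y_ij)
    hi = cp.multiply(mu_bar, 1 - gam[:-1] @ y)  # mu_bar_j (1 - sum gamma_{i-1} y_ij)
    cons = [lo <= mu, mu <= hi, cp.sum(y, axis=0) <= 1]
    cons += [cp.sum(y[i]) >= n_lo[i] for i in range(K)]
    cons += [cp.sum(y[i]) <= n_hi[i] for i in range(K)]
    cons += [cp.sum(mu[T]) >= G * mu_bar[T].sum()   # tier inequalities
             for T, G in zip(tiers, Gammas)]
    prob = cp.Problem(cp.Minimize(x_star @ mu), cons)
    prob.solve()
    return mu.value, prob.value
```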
Data-driven Experiments
Example: 2464 assets, 152-factor model. CPU time: 500 seconds
10 segments (a "heavy tail"); 6 tiers: the top five deciles lose at most 10% each, total loss ≤ 5%
[figure]
Data-driven Experiments
Same run
2464 assets, 152 factors; 10 segments, 6 tiers
[figure]
Data-driven Experiments
Summary of average problems with 3-4 segments, 2-3 tiers (all times in seconds)

      columns   rows   iterations     time   imp. time   adv. time
 1        500     20           47     1.85        1.34        0.46
 2        500     20            3     0.09        0.01        0.03
 3        703    108            1     0.29        0.13        0.04
 4        499    140            3     3.12        2.65        0.05
 5        499     20           19     0.42        0.21        0.17
 6       1338     81            7     0.45        0.17        0.08
 7       2019    140            8    41.53       39.6         0.36
 8       2443    153            2    12.32        9.91        0.07
 9       2464    153          111   100.81       60.93       36.78
Data-driven Approximation
Why the adversarial problem is "easy"
($x^*$ = given asset weights)
$$\min\ \sum_j x_j^* \mu_j$$
Subject to:
$$\bar\mu_j \Big(1 - \sum_i \gamma_i y_{ij}\Big) \le \mu_j \le \bar\mu_j \Big(1 - \sum_i \gamma_{i-1} y_{ij}\Big)$$
$\sum_i y_{ij} \le 1$, $\forall j$ (each asset in at most one segment)
$n_i \le \sum_j y_{ij} \le N_i$, $\forall i$ (segment cardinalities)
$\sum_{j \in T_h} \mu_j \ge \Gamma_h \big(\sum_{j \in T_h} \bar\mu_j\big)$, $\forall h$ (tier inequalities)
$\mu_j$ free, $y_{ij} = 0$ or $1$, all $i, j$
Data-driven Approximation
Why the adversarial problem is "easy"
($K$ = no. of segments, $H$ = no. of tiers)
Theorem. For every fixed $K$ and $H$, and for every $\varepsilon > 0$, there is an algorithm that finds a solution to the adversarial problem with relative optimality error $\le \varepsilon$, in time polynomial in $\varepsilon^{-1}$ and $n$ (= no. of assets).
Data-driven Approximation
The simplest case
$$\max\ \sum_j x_j^* \delta_j \quad \text{subject to:} \quad \sum_j \delta_j \le \Gamma, \qquad 0 \le \delta_j \le u_j y_j, \ y_j = 0 \text{ or } 1, \text{ all } j, \qquad \sum_j y_j \le N$$
· · · a cardinality-constrained knapsack problem
B. (1995), De Farias and Nemhauser (2004)
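A minimal MIP sketch of this simplest case on hypothetical data: the binary $y_j$ switches the continuous deviation $\delta_j$ on, with at most $N$ nonzero deviations (a MIP-capable solver is required).

```python
import cvxpy as cp
import numpy as np

np.random.seed(4)
n, N, Gamma = 10, 3, 1.5
xs = np.random.rand(n)           # given asset weights x*_j
u = np.random.rand(n)            # upper bounds u_j

d = cp.Variable(n, nonneg=True)  # deviations delta_j
y = cp.Variable(n, boolean=True) # support indicators
prob = cp.Problem(cp.Maximize(xs @ d),
                  [cp.sum(d) <= Gamma,
                   d <= cp.multiply(u, y),   # delta_j = 0 unless y_j = 1
                   cp.sum(y) <= N])
prob.solve()                     # needs a MIP solver, e.g. CBC or GLPK_MI
print(d.value, prob.value)
```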
Data-driven More experiments
What is the impact of the uncertainty model?
All runs on the same data set with 1338 columns and 81 rows
1 segment: (200, 0.5); robust random return = 4.57, 157 assets
2 segments: (200, 0.25), (100, 0.5); robust random return = 4.57, 186 assets
2 segments: (200, 0.2), (100, 0.6); robust random return = 3.25, 213 assets
2 segments: (200, 0.1), (100, 0.8); robust random return = 1.50, 256 assets
1 segment: (100, 1.0); robust random return = 1.24, 281 assets
VAR Definition
Ambiguous chance-constrained models
1. The implementor chooses a vector $x^*$ of assets
2. The adversary chooses a probability distribution $P$ for the returns vector
3. A random returns vector $\mu$ is drawn from $P$
→ Implementor wants to choose $x^*$ so as to minimize value-at-risk (conditional value-at-risk, etc.)
Erdogan and Iyengar (2004), Calafiore and Campi (2004)
→ We want to model correlated errors in the returns
VAR Definition
Uncertainty set
Given a vector $x^*$ of assets, the adversary
1. Chooses a vector $w \in \mathbb{R}^n$ ($n$ = no. of assets) with $0 \le w_j \le 1$ for all $j$.
2. Chooses a random variable $0 \le \delta \le 1$
→ Random return: $\mu_j = \bar\mu_j (1 - \delta w_j)$ ($\bar\mu$ = nominal returns).
Definition: Given reals $\nu$ and $0 \le \theta \le 1$, the value-at-risk of $x^*$ is the real $\rho \ge 0$ such that
$$\text{Prob}(\nu - \mu^T x^* \ge \rho) \ge \theta$$
→ The adversary wants to maximize VAR
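To make the definition concrete, a small helper (an illustration, not from the talk) that evaluates this value-at-risk when $\delta$ has a finite distribution, as in the adversarial model introduced later: the loss at $\delta_i$ is $\nu - \sum_j \bar\mu_j(1 - \delta_i w_j)x_j^*$, and VAR is the largest loss whose tail probability is at least $\theta$ (ties between equal losses are ignored in this sketch).

```python
import numpy as np

def value_at_risk(x_star, mu_bar, w, deltas, pi, nu, theta):
    """VaR of x* when delta takes value deltas[i] with probability pi[i]."""
    losses = np.array([nu - np.sum(mu_bar * (1 - d * w) * x_star)
                       for d in deltas])
    order = np.argsort(losses)                # losses in ascending order
    tail = np.cumsum(pi[order][::-1])[::-1]   # ~ P(loss >= losses[order[i]])
    ok = losses[order][tail >= theta]         # losses exceeded w.p. >= theta
    return max(ok.max(), 0.0) if ok.size else 0.0   # definition asks rho >= 0
```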
VAR Definition
OR ...
Given a vector $x^*$ of assets, the adversary
1. Chooses a vector $w \in \mathbb{R}^n$ ($n$ = no. of assets) with $0 \le w_j \le W$ for all $j$.
2. Chooses a random variable $0 \le \delta \le 1$
→ Random return: $\mu_j = \bar\mu_j - \delta w_j$ ($\bar\mu$ = nominal returns).
Definition: Given reals $\nu$ and $0 \le \theta \le 1$, the value-at-risk of $x^*$ is the real $\rho \ge 0$ such that
$$\text{Prob}(\nu - \mu^T x^* \ge \rho) \ge \theta$$
→ The adversary wants to maximize VAR
VAR Definition
The classical factor model for returns
$$\mu = \bar\mu + V^T f + \varepsilon$$
where
$\bar\mu$ = expected return,
$V$ = "factor exposure matrix",
$f$ = a bounded random variable,
$\varepsilon$ = residual errors
$V$ is $r \times n$ with $r \ll n$.
VAR Adversarial problem
→ Random return$_j$ = $\bar\mu_j (1 - \delta w_j)$, where $0 \le w_j \le 1$ $\forall j$, and $0 \le \delta \le 1$ is a random variable.
A discrete distribution:
We are given fixed values $0 = \delta_0 \le \delta_1 \le \ldots \le \delta_K = 1$; example: $\delta_i = i/K$
Adversary chooses $\pi_i = \text{Prob}(\delta = \delta_i)$, $0 \le i \le K$
The $\pi_i$ are constrained: we have fixed bounds, $\pi_i^l \le \pi_i \le \pi_i^u$ (and possibly other constraints)
Tier constraints: for sets ("tiers") $T_h$ of assets, $1 \le h \le H$, we require $E\big(\delta \sum_{j \in T_h} w_j\big) \le \Gamma_h$ (given), or, $\big(\sum_i \delta_i \pi_i\big) \sum_{j \in T_h} w_j \le \Gamma_h$
Cardinality constraint: $w_j > 0$ for at most $N$ indices $j$
VAR Adversarial problem
The adversarial problem is "easy"
($K$ = no. of points in the discrete distribution, $H$ = no. of tiers)
Theorem
Without the cardinality constraint, for each fixed $K$ and $H$ the adversarial problem can be solved as a polynomial number of linear programs.
With the cardinality constraint, for each fixed $K$ and $H$ the adversarial problem can be solved as a polynomial number of knapsack problems.
VAR Adversarial problem
Adversarial problem as an MIP
Recall: random return$_j$ is $\mu_j = \bar\mu_j (1 - \delta w_j)$, where $\delta = \delta_i$ (given) with probability $\pi_i$ (chosen by the adversary), $0 \le \delta_0 \le \delta_1 \le \ldots \le \delta_K = 1$ and $0 \le w$
$$\min_{\pi, w, V}\ \min_{1 \le i \le K} V_i$$
Subject to
$0 \le w_j \le 1$, all $j$; $\pi_i^l \le \pi_i \le \pi_i^u$, all $i$; $\sum_i \pi_i = 1$,
$V_i = \sum_j \bar\mu_j (1 - \delta_i w_j) x_j^*$, if $\pi_i + \pi_{i+1} + \ldots + \pi_K \ge 1 - \theta$
$V_i = M$ (large), otherwise
$\big(\sum_i \delta_i \pi_i\big) \sum_{j \in T_h} w_j \le \Gamma_h$, for each tier $h$
VAR Adversarial problem
Approximation
$$\Big(\sum_i \delta_i \pi_i\Big) \sum_{j \in T_h} w_j \le \Gamma_h, \quad \text{for each tier } h \qquad (*)$$
Let $N > 0$ be an integer. For $1 \le k \le N$, write
$$\frac{k}{N} \sum_{j \in T_h} w_j \le \Gamma_h + M(1 - z_{hk}), \quad \text{where}$$
$z_{hk} = 1$ if $\frac{k-1}{N} < \sum_i \delta_i \pi_i \le \frac{k}{N}$,
$z_{hk} = 0$ otherwise, $\sum_k z_{hk} = 1$,
and $M$ is large
Lemma. Under reasonable conditions, replacing $(*)$ with this system changes the value of the problem by at most a factor of $(1 + \frac{1}{N})$.
VAR
Implementor's problem
Find a near-optimal solution with minimum value-at-risk
Nominal problem:
$$v^* = \min_x\ \lambda x^T Q x - \bar\mu^T x \quad \text{subject to:} \quad Ax \ge b$$
VAR
Implementor's problem
Find a near-optimal solution with minimum value-at-risk
Given asset weights $x$, we have: value-at-risk $\ge \rho$ if the adversary can produce a return vector $\mu$ with
$$\text{Prob}(\nu - \mu^T x \ge \rho) \ge \theta$$
where $\nu$ is a fixed reference value.
VAR
Implementor's problem
Find a near-optimal solution with minimum value-at-risk
Implementor's problem at iteration $r$:
$$\min\ V$$
Subject to:
$$\lambda x^T Q x - \bar\mu^T x \le (1 + \varepsilon) v^*$$
$$Ax \ge b$$
$$V \ge \nu - \sum_j \bar\mu_j \big(1 - \delta_{i(t)} w_j^{(t)}\big) x_j, \quad t = 1, 2, \ldots, r - 1$$
Here, $\delta_{i(t)}$ and $w^{(t)}$ are the adversary's output at iteration $t < r$.
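In cvxpy form (a sketch; scenarios is assumed to hold the pairs $(\delta_{i(t)}, w^{(t)})$ returned by the adversary so far):

```python
import cvxpy as cp

def solve_var_implementor(scenarios, Q, A, b, mu_bar, lam, eps, v_star, nu):
    n = Q.shape[0]
    x, V = cp.Variable(n), cp.Variable()
    cons = [lam * cp.quad_form(x, Q) - mu_bar @ x <= (1 + eps) * v_star,
            A @ x >= b]
    cons += [V >= nu - (mu_bar * (1 - d * w)) @ x   # one cut per scenario
             for d, w in scenarios]
    prob = cp.Problem(cp.Minimize(V), cons)
    prob.solve()
    return x.value, V.value
```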
VAR Runs
First set of experiments
1338 assets, 41 factors, 81 rows
problem: find a "near optimal" solution with minimum value-at-risk, for a given threshold probability $\theta$
experiment: investigate different values of $\theta$
"near optimal": want solutions that are at most 1% more expensive than optimal
random variable $\delta$: [distribution shown in figure]
VAR Runs
1338 assets, 41 factors, 81 rows, ≤ 1% suboptimality

 θ     time (sec.)   iters.     VAR      VAR as %
.89        1.18         2     3.43131     71.50
.90        1.42         3     3.74498     78.04
.91        1.42         3     3.74498     78.04
.92        3.47        11     4.05669     84.53
.93        6.35        29     4.05721     84.54
.94        6.37        20     4.05721     84.54
.95       26.59        51     4.35481     90.74
.96       26.25        51     4.35481     90.74
.97       26.20        51     4.35481     90.74
.98       33.07        58     4.63938     96.67
.99       33.11        58     4.63938     96.67
VAR Runs
Second set of experiments
Fix $\theta = 0.90$ but vary the suboptimality criterion
[figure]
VAR Runs
Typical convergence behavior [figure]
VAR Runs
Heavy tail, proportional error (100 points): [figure]
Heavy tail, constant error (100 points): [figure]
VAR Runs

       Proportional          Constant
 θ     Time (sec.)  Its     Time (sec.)  Its
.85        4.79       8        11.84      11
.86        2.32       3         8.27       8
.87        6.40      10         9.55      11
.88        4.34       4        18.10      24
.89        8.00      14         5.85       6
.90        2.58       4        13.54      20
.91        4.79       9        16.31      23
.92        7.99      15        13.13      22
.93       13.43      27        22.47      40
.94       10.04      15        21.99      40
.95        9.59      16        11.90      25
.96        6.63      17        29.89      54
.97       48.43     110        16.45      35
.98       20.25      53        20.25      45
.99       22.02      52        21.89      47
VAR Harder case
A difficult case
2464 columns, 152 factors, 3 tiers
time = 6191 seconds
258 iterations
implementor time = 6123 seconds, adversarial time = 20 seconds
VAR Harder case
Implementor runtime [figure]
VAR Harder case
Implementor's problem at iteration $r$
$$\min\ V$$
Subject to:
$$\lambda x^T Q x - \bar\mu^T x \le (1 + \varepsilon) v^*$$
$$Ax \ge b$$
$$V \ge \nu - \sum_j \bar\mu_j \big(1 - \delta_{i(t)} w_j^{(t)}\big) x_j, \quad t = 1, 2, \ldots, r - 1$$
Here, $\delta_{i(t)}$ and $w^{(t)}$ are the adversary's output at iteration $t < r$.
VAR Harder case
Implementor's problem at iteration $r$
Approximate version
$$\min\ V$$
Subject to:
$$2\lambda\, x_{(k)}^T Q x - \lambda\, x_{(k)}^T Q x_{(k)} - \bar\mu^T x \le (1 + \varepsilon) v^*, \quad \forall k < r$$
$$Ax \ge b$$
$$V \ge \nu - \sum_j \bar\mu_j \big(1 - \delta_{i(k)} w_j^{(k)}\big) x_j, \quad \forall k < r$$
Here, $\delta_{i(k)}$ and $w^{(k)}$ are the adversary's output at iteration $k < r$, and $x_{(k)}$ is the implementor's output at iteration $k$.
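The quadratic budget constraint is thus replaced by tangent cuts at the previous iterates, valid by convexity: $x^T Q x \ge 2 x_{(k)}^T Q x - x_{(k)}^T Q x_{(k)}$. As a sketch:

```python
import cvxpy as cp

def variance_cut(x, x_k, Q, mu_bar, lam, eps, v_star):
    """Tangent cut of lam x^T Q x - mu^T x <= (1+eps) v* at the point x_k."""
    # convexity: x^T Q x >= 2 x_k^T Q x - x_k^T Q x_k, so this cut is valid
    return (2 * lam * (Q @ x_k) @ x - lam * (x_k @ Q @ x_k)
            - mu_bar @ x <= (1 + eps) * v_star)
```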
VAR Harder case
Does it work?
Before: 258 iterations, 6191 seconds
Linearized: 1776 iterations, 3969 seconds
VAR Harder case
Averaging
$x_{(k)}$ is the implementor's output at iteration $k$.
Define $y_{(1)} = x_{(1)}$
For $k > 1$, $y_{(k)} = \lambda\, x_{(k)} + (1 - \lambda)\, y_{(k-1)}$, $0 \le \lambda \le 1$
Input $y_{(k)}$ to the adversary
Old ideas; see also Nesterov, Nemirovski (2003)
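As code, the averaging step is one line (a sketch assuming numpy arrays; lam here is the averaging weight from this slide, not the risk-aversion multiplier):

```python
def averaged_iterate(x_new, y_prev, lam=0.5):
    """Running convex combination fed to the adversary: y_(1) = x_(1)."""
    return x_new if y_prev is None else lam * x_new + (1.0 - lam) * y_prev
```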
VAR Harder case
Does it work?
Default: 258 iterations, 6191 seconds
Linearized: 1776 iterations, 3969 seconds
Averaging plus Linearized: 860 iterations, 530 seconds
Future work
Other robust models
Min-max expected loss with orthogonal missing factor
Random return = $\bar\mu \bullet (1 - \delta w)$, where $-1 \le w_j \le 1$ $\forall j$, and $0 \le \delta \le 1$ is a random variable.
normalization constraints, e.g. $\sum_j w_j = 0$
errors in the covariance matrix $Q$
robust problem: → $\min_x \max_{Q \in \mathcal{Q}} \lambda x^T Q x - \mu^T x$
Future work
On-going work: a provably good version of Benders’ algorithm