
Mathematical Programming: Introduction for ITA Programmers

Carl de Marcken

Contents

1 Introduction
  1.1 Matrix notation and review
2 Examples
  2.1 Computer purchase
  2.2 Re-accommodation
  2.3 Consumer preferences
    2.3.1 Discussion
3 Linear Programs: Definition and Theory
  3.1 Standard form
  3.2 Interpretation of standard form
  3.3 Convexity
  3.4 Optimizing a linear function over a polyhedron
  3.5 Piecewise-linear functions
4 The Simplex Algorithm
  4.1 Introduction
  4.2 Extreme points and the basis set
  4.3 Simplex overview
  4.4 Computational details
  4.5 Initialization
  4.6 Other algorithms
5 Special Case: Equalities
  5.1 Projection matrices
  5.2 More on regression
6 Sensitivity and Duality
  6.1 Sensitivity
  6.2 Duality
    6.2.1 Network Revenue Management with Bid Prices
7 Integer Programming
  7.1 Computational complexity
  7.2 Encoding combinatorial problems as integer programs
    7.2.1 Logic gates
    7.2.2 One-time costs
    7.2.3 Configuration representation
  7.3 Solving integer programs
8 Relationship to Dynamic Programming
  8.1 Optimal decoding of a sequence
  8.2 Dynamic programming and graphical structure
  8.3 Alternatives
  8.4 Dynamic programming for continuous variables
9 Solving Large Structured Problems
  9.1 Delayed column generation
    9.1.1 Crew scheduling
    9.1.2 Column generation guided by reduced costs
    9.1.3 Implementation
  9.2 Dantzig-Wolfe decomposition
10 References


1 Introduction

Convex mathematical programs, encompassing such variations as linear programs (LPs), integer linear programs (IPs) and quadratic programs (QPs), are an extremely convenient formalism for expressing optimization problems:

• Many important optimization problems can be expressed as mathematical programs.

• Efficient algorithms exist for solving LPs and QPs, and small IPs.

• There are high-quality software packages (solvers) that accept and solve mathematical programs in standardized formats.

• There are many optimization problems that do not conveniently fit into the dynamic programming (DP) framework that can be expressed as mathematical programs. (However, DP is usually more efficient when it can be applied.)

While mathematical programming is second only to DP in terms of most useful algorithmic techniques to know, most introductory algorithms courses don't cover it at all, perhaps because the topic would simply consume too much time. In contrast, the operations research (OR) community that focuses on planning and optimization problems for industry (particularly airlines) seems to know no other way to solve a problem than to formulate it as an IP and feed it to a commercial solver.

For very big problems, or problems that need to be solved in microseconds, or problems for which dynamic programming can be applied, mathematical programming is usually not the best choice. But for many optimization and planning problems, especially one-offs where development time is more expensive than run time, there's no better tactic than to construct an LP or IP or QP and run an off-the-shelf (OTS) solver on it. This is especially true for (the large majority of) programmers that lack the capability to develop efficient tailored algorithms for a problem.

1.1 Matrix notation and review

An m × n matrix A has m rows and n columns; A_{ij} refers to the entry at the i-th row and j-th column. A_j is the j-th column of A and A_i is the i-th row of A. The transpose A^T is the n × m matrix formed by swapping the rows and columns of A, and A^{-1} refers to the matrix inverse of A.

    A = \begin{bmatrix} A_{11} & \cdots & A_{1n} \\ \vdots & \ddots & \vdots \\ A_{m1} & \cdots & A_{mn} \end{bmatrix}
      = \begin{bmatrix} A_1 & \cdots & A_n \end{bmatrix}   (columns)
      = \begin{bmatrix} A_1 \\ \vdots \\ A_m \end{bmatrix}   (rows)

    A^T = \begin{bmatrix} A_{11} & \cdots & A_{m1} \\ \vdots & \ddots & \vdots \\ A_{1n} & \cdots & A_{mn} \end{bmatrix}
        = \begin{bmatrix} (A_1)^T & \cdots & (A_m)^T \end{bmatrix}   (transposed rows of A)
        = \begin{bmatrix} (A_1)^T \\ \vdots \\ (A_n)^T \end{bmatrix}   (transposed columns of A)


A vector x = x_1 ... x_m is treated as an m × 1 column matrix. Typically single-letter lower-case variables without subscripts refer to vectors, and subscripts are used to index the vector components.

    x = \begin{bmatrix} x_1 \\ \vdots \\ x_m \end{bmatrix}    x^T = [x_1 \; \cdots \; x_m]

x = 0 and x ≥ 0 are shorthands for ∀k, x_k = 0 and ∀k, x_k ≥ 0.

The inner product of two vectors x and y, commonly expressed x · y or x^T y, is the component-wise sum \sum_k x_k y_k. The product of an m × k matrix B with a k × n matrix C is the m × n matrix A = BC where A_{ij} is the inner product of the row vector B_i with the column vector C_j.

    A = \begin{bmatrix} B_1 \cdot C_1 & \cdots & B_1 \cdot C_n \\ \vdots & \ddots & \vdots \\ B_m \cdot C_1 & \cdots & B_m \cdot C_n \end{bmatrix}
      = \begin{bmatrix} \sum_k B_{1k} C_{k1} & \cdots & \sum_k B_{1k} C_{kn} \\ \vdots & \ddots & \vdots \\ \sum_k B_{mk} C_{k1} & \cdots & \sum_k B_{mk} C_{kn} \end{bmatrix}

It takes approximately mnk multiplications and additions to compute the product of an m × k and a k × n matrix; matrix multiplication is associative ((AB)C = A(BC)), but for dissimilarly-shaped matrices the order does affect the amount of computation required.
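As a small illustration of how the association order changes the cost (a minimal NumPy sketch; the matrix shapes here are made up for the example):

```python
import numpy as np

# A is 1000 x 2, B is 2 x 1000, v is a length-1000 vector.
A = np.random.rand(1000, 2)
B = np.random.rand(2, 1000)
v = np.random.rand(1000)

# (A @ B) @ v first builds a 1000 x 1000 intermediate: about 1000*2*1000 = 2,000,000
# multiply-adds, plus another ~1,000,000 for the final matrix-vector product.
slow = (A @ B) @ v

# A @ (B @ v) keeps every intermediate small: about 2*1000 + 1000*2 = 4,000 multiply-adds.
fast = A @ (B @ v)

assert np.allclose(slow, fast)  # associativity: identical result, very different cost
```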

The matrix equation Ax = b expresses m linear constraints of the form A_i x = \sum_j A_{ij} x_j = b_i. Each of these restricts x to an (n − 1)-dimensional plane defined by A_i and b_i, where A_i is the normal vector perpendicular to the plane and b_i controls the distance of the plane from the origin, as measured along the normal line.

The matrix inequality Ax ≤ b expresses m linear inequalities of the form A_i x = \sum_j A_{ij} x_j ≤ b_i. Each of these restricts x to a half-space. Collectively, these inequalities define a polyhedral region of space, bounded on all sides by planes; this polyhedron will typically be called the feasible set. In 2 dimensions these planes are lines. For example,

    \begin{bmatrix} 1 & 1 \\ -1 & 2 \\ -1 & 0 \\ 0 & -1 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} \le \begin{bmatrix} 6 \\ 4 \\ 0 \\ 1 \end{bmatrix}

defines the feasible set pictured in figure 1.


Figure 1: Gray area is feasible set defined by 4 linear inequalities: x + y ≤ 6, −x + 2y ≤ 4, −x ≤ 0, −y ≤ 1.

2 Examples

2.1 Computer purchase

ITA Ops needs to purchase computers sufficient for processing a certain peak number of QPX queries per second Q without exceeding data center cooling limits P (measured in watts). Two types of computers are available for purchase: a fast, hot, expensive machine and a slower, cooler, cheaper machine. Assuming the goal is to minimize the price of computers necessary for satisfying these requirements, the problem of selecting computer quantities can be expressed as

    minimize:    c_1 x_1 + c_2 x_2
    variables:   x_1, x_2
    subject to:  q_1 x_1 + q_2 x_2 ≥ Q
                 p_1 x_1 + p_2 x_2 ≤ P
                 x_1 ≥ 0
                 x_2 ≥ 0

where x_i is the number of computers of type i purchased, c_i is the per-computer price, q_i is the number of queries per second that can be performed by one computer of type i, and p_i is the per-computer heat production. Notice the constraints to enforce that the number of purchased computers is non-negative.

This is a linear program (LP): a vector of real-valued decision variables x = [x_1 x_2], a cost vector c = [c_1 c_2] that defines the objective function, and a set of linear constraints in the variables, either equalities or inequalities. Linear programs can be efficiently solved to find the assignment to the variables that minimizes or maximizes the objective function c^T x and lies within the feasible set defined by the linear constraints.

The most general formulation of a linear program is:

    minimize (or maximize):  c^T x = \sum_i c_i x_i
    variables:               x = x_1 ... x_n
    subject to:              A'x = b'
                             A''x ≤ b''
                             A'''x ≥ b'''

If for the computer purchase problem the first computer type costs $3000, takes 10 seconds to answer a QPX query, and generates 150 watts of heat, and the second type costs $8000, takes 2.5 seconds to answer a query, and generates 950 watts of heat, and the constraints are at least 100 QPX queries per second and no more than 200 kilowatts of heat, then the LP is:

    x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}    c = \begin{bmatrix} 3000 \\ 8000 \end{bmatrix}

    A' = ∅    b' = ∅

    A'' = \begin{bmatrix} 150 & 950 \end{bmatrix}    b'' = \begin{bmatrix} 200{,}000 \end{bmatrix}

    A''' = \begin{bmatrix} 0.1 & 0.4 \\ 1 & 0 \\ 0 & 1 \end{bmatrix}    b''' = \begin{bmatrix} 100 \\ 0 \\ 0 \end{bmatrix}

The optimal solution to this LP is to spend $2,428,571.43 purchasing 428.57 computers of type 1 and 142.86 of type 2. That solution exactly satisfies both constraints. Because of the favorable price to QPX-query ratio of type 2, it is best to buy as many as possible up to the cooling capacity, which unfortunately restricts ops to buying mostly the more power-efficient type 1 computers. Figure 2 depicts the solution.
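A minimal sketch of solving this instance with an off-the-shelf solver (here SciPy's linprog, which minimizes c^T x subject to A_ub x ≤ b_ub; the ≥ constraint is negated to fit that form):

```python
from scipy.optimize import linprog

c = [3000, 8000]             # per-computer prices
A_ub = [[-0.1, -0.4],        # -(q1 x1 + q2 x2) <= -100, i.e. at least 100 queries/sec
        [150.0, 950.0]]      # p1 x1 + p2 x2 <= 200,000 watts
b_ub = [-100, 200_000]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print(res.x, res.fun)        # approximately [428.57, 142.86], 2,428,571.43
```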

This optimal “solution” is fractional. One might guess that given such large numbers, the optimal integer-valued solution is close by, and indeed it is: 432 of type 1, 142 of type 2, just slightly more expensive at $2,432,000. A linear program constrained to integer values is called an integer linear program (ILP), commonly abbreviated to integer program (IP). Integer programs can be much harder to solve than linear programs, though for this small problem simple enumeration of candidates would have been easy.


Figure 2: Gray area is feasible set; blue dot is optimal solution. (Plot of x_1 against x_2 showing the cooling constraint, the query capacity constraint, and the cost optimization direction.)

Suppose instead of a fixed cooling budget, ops can purchase air-conditioners with cooling capacity p_a for price c_a; let the number purchased be x_a. Then the problem becomes

    minimize:    c_1 x_1 + c_2 x_2 + c_a x_a
    variables:   x_1, x_2, x_a
    subject to:  q_1 x_1 + q_2 x_2 ≥ Q
                 p_1 x_1 + p_2 x_2 − p_a x_a ≤ 0
                 x_1 ≥ 0
                 x_2 ≥ 0
                 x_a ≥ 0

Clearly, the cost increases with the number of air conditioners purchased, so it is cheapest to never purchase any more air conditioners than is necessary given the computer heat output. Thus, it is highly likely that the solution to this problem has a fractional number of air conditioners. That isn't physically realizable, but if the number of air conditioners is rounded up or down, that has significant consequences for computer counts, and thus for this problem the optimal solution to the linear program is unlikely to be close to the optimal integer-valued solution. Indeed, if the cost c_a of an air-conditioning unit is $300,000 and each provides p_a = 53 kilowatts of cooling, then the optimal solutions with and without the integer constraint are substantially different:

           continuous   integer
    x_a         4.48      4.00
    x_1         0.00    292.00
    x_2       250.00    177.00


2.2 Re-accommodation

After a flight is cancelled, an airline needs to re-accommodate the displaced passengers. Each of these passengers may have a variety of alternative routes available to them, each with its own convenience factors and costs to the airline (for reduction in the capacity available for other paying passengers or for payments to other airlines, for example). Furthermore, there are seat capacity constraints on the alternative flights.

Suppose ITA has to implement an automated re-accommodation engine for an airline. QPX is used to calculate for each displaced passenger i a set of routes {r_{ij}}. Each route is a sequence of flights; let e_{ijf} = 1 if flight f is part of route r_{ij}, 0 otherwise.

ITA queries airline inventory for each flight f to determine the number of remaining free seats s_f, and the airline's revenue management system to determine the (incremental displacement) cost c_f to the airline for each seat allocated on flight f (if the airline uses a bid-price based RM system this should be available^1).

Each route r_{ij} for customer i has a certain (in)convenience value calculated by some arbitrary function based on such factors as number of flights and transfers, their duration, cabin-class or other seating differences, and how much later the arrival is than originally intended. This in turn is adjusted by the customer's value to the airline, based on loyalty program information, to generate an aggregate value c_{ij} that the airline believes reflects the loyalty cost of using route j to accommodate passenger i.

Let us summarize as an integer program. Let x_{ij} be a 0-1 decision variable for whether passenger i is re-accommodated on route j. Every passenger must be re-accommodated on exactly one route, resulting in constraints \sum_j x_{ij} = 1. The number of seats allocated on flight f is x_f = \sum_{ij} e_{ijf} x_{ij}, subject to the capacity constraint x_f ≤ s_f. The total cost, including both loyalty and lost revenue, is \sum_f c_f x_f + \sum_{ij} c_{ij} x_{ij}. In conclusion:^2

    minimize:    \sum_f c_f x_f + \sum_{ij} c_{ij} x_{ij}
    variables:   {x_f}, {x_{ij}}
    subject to:  x_f − \sum_{ij} e_{ijf} x_{ij} = 0   (∀f)
                 x_f ≤ s_f                            (∀f)
                 \sum_j x_{ij} = 1                    (∀i)
                 x_{ij} ≥ 0                           (∀i, j)
                 x_{ij} ≤ 1                           (∀i, j)

For 500 passengers and 50 routes per passenger there are 25,000 route decision variables x_{ij} and a smaller number of flight allocation variables x_f. The constraint matrices are quite sparse. As a linear program this is easily within the range of solvability, but there is the possibility that the result will be fractional. Imposing the integer constraint, the difficulty depends on how close an approximation the linear relaxation is to the integer problem, as will be discussed further in section 7.

^1 Q: What if each additional seat usage on a flight has an incrementally greater cost? Can this still be expressed as a linear program? How? (RM systems can usually calculate non-linear bid prices for multiple sales.)

^2 It should be clear that the decision variables x_f could be eliminated from this formulation, as they are directly determined by the x_{ij}, and this might reduce solver computation time, but they are notationally convenient.
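A minimal sketch of building and solving the LP relaxation of this program with SciPy, eliminating the x_f variables as footnote 2 suggests. All of the data here (route-flight incidence, costs, seat counts) is hypothetical placeholder data, and the seat counts are chosen large enough that the toy instance is feasible:

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical toy instance: 3 passengers, 2 candidate routes each, 4 flights.
n_pax, n_routes, n_flights = 3, 2, 4
rng = np.random.default_rng(0)
e = rng.integers(0, 2, size=(n_pax, n_routes, n_flights))  # e[i,j,f] = 1 if route j of pax i uses flight f
c_route = rng.uniform(0, 100, size=(n_pax, n_routes))      # loyalty cost c_ij
c_flight = rng.uniform(0, 50, size=n_flights)              # displacement cost c_f per seat
s = np.full(n_flights, n_pax, dtype=float)                 # free seats s_f (ample, so it's feasible)

nvar = n_pax * n_routes
idx = lambda i, j: i * n_routes + j                        # flatten (i, j) -> variable index

# Objective: c_ij plus the flight costs incurred by route (i, j); this folds in the x_f terms.
c = np.array([c_route[i, j] + e[i, j] @ c_flight
              for i in range(n_pax) for j in range(n_routes)])

# Equality constraints: each passenger is placed on exactly one route.
A_eq = np.zeros((n_pax, nvar)); b_eq = np.ones(n_pax)
for i in range(n_pax):
    for j in range(n_routes):
        A_eq[i, idx(i, j)] = 1.0

# Inequality constraints: seats used on flight f cannot exceed s_f.
A_ub = np.zeros((n_flights, nvar)); b_ub = s
for f in range(n_flights):
    for i in range(n_pax):
        for j in range(n_routes):
            A_ub[f, idx(i, j)] = e[i, j, f]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=[(0, 1)] * nvar)
print(res.x)   # fractional entries, if any, signal that the integer version needs more work
```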

2.3 Consumer preferences

Suppose each time a customer buys a ticket ITA records the purchased solution as well as some displayed alternatives that were not chosen. ITA's goal is to learn a function that summarizes the customer's preferences, so that we can better choose what solutions to show this customer in the future.

Let each solution s be characterized by a set of numerical features s_1 ... s_n, such as duration, time on each of the major airlines, number of flights, minimum layover, maximum layover, et cetera. For a given purchasing decision t, let s^t be the purchased solution and U^t be the set of unpurchased solutions. ITA assumes that each customer has a vector of weights w such that y(s) = \sum_i w_i s_i characterizes the value they assign solution s.^3 Thus, for every purchase t, for every s' ∈ U^t, we expect y(s^t) ≥ y(s'). Define m^t the per-purchase margin and M the total margin as follows:

    m^t = \min_{s' \in U^t} \left[ y(s^t) − y(s') \right] = \min_{s' \in U^t} \sum_i w_i (s^t_i − s'_i)

    M = \sum_t m^t = \sum_t \min_{s' \in U^t} \sum_i w_i (s^t_i − s'_i)

Thus for decision t, m^t is the difference in valuation between the purchased solution and the next best alternative, and M is that margin summed over all purchasing decisions; see figure 3. One can fit the customer's weights w to their purchasing decisions by finding the w that maximizes M, thus separating as greatly as possible the purchases from the alternatives. Is this formulation a linear program?

Not as stated. M is not a linear function of w, because of the presence of the min function in the definition of m^t. But this problem can be transformed into a linear program by replacing the non-linear definition of m^t with a set of inequalities. For each s' ∈ U^t an inequality is added:

    m^t ≤ \sum_i w_i (s^t_i − s'_i).

^3 Q: Suppose the customer's preferences are non-linear in the solution features - that is, they might depend on the square of the duration or the product of times on different airlines or even the cosine of the sum of the flight numbers. Can these non-linear terms be accommodated in this framework? (Yes, easily: how?)


Figure 3: The margin for one purchasing decision t: the blue dots are elements of U^t that were not purchased, the red dot is the purchased s^t, and the black line indicates the direction of w. The margin m^t is the distance between the two green lines drawn perpendicular to w. The goal is to find the w that maximizes the sum of margins over all purchasing decisions.

It should be clear that these individual constraints collectively ensure

    m^t ≤ \min_{s' \in U^t} \sum_i w_i (s^t_i − s'_i).

There is no need for an additional mechanism to enforce equality because the goal is to maximize M, which already ensures that the optimal answer will have each m^t as great as is allowed by the constraints. Thus, the final linear program is:

    maximize:    \sum_t m^t
    variables:   {m^t}, {w_i}
    subject to:  m^t ≤ \sum_i w_i (s^t_i − s'_i)   (∀t, ∀s' ∈ U^t)

Solving this LP generates a set of per-purchase margins m^t and an optimal weight vector w that can be used to model this customer's preferences (for example, to predict future purchase decisions).
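A minimal sketch of this margin-maximization LP in SciPy. The purchase data is hypothetical placeholder data, and the weights are bounded to [−1, 1] because without some bound on w the LP is unbounded (scaling w scales every margin); the discussion below mentions exactly this kind of −k ≤ w_i ≤ k constraint:

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical data: 4 purchase decisions, 3 solution features, 5 rejected alternatives each.
rng = np.random.default_rng(1)
n_feat, n_dec = 3, 4
s_purchased = rng.normal(size=(n_dec, n_feat))                     # s^t
s_rejected = [rng.normal(size=(5, n_feat)) for _ in range(n_dec)]  # U^t

# Variables, in order: w_1..w_n, then m_1..m_T.  linprog minimizes, so negate the objective.
nvar = n_feat + n_dec
c = np.zeros(nvar); c[n_feat:] = -1.0                              # maximize sum_t m^t

# One row per (t, alternative):  m^t - sum_i w_i (s^t_i - s'_i) <= 0
rows = []
for t in range(n_dec):
    for s_alt in s_rejected[t]:
        row = np.zeros(nvar)
        row[:n_feat] = -(s_purchased[t] - s_alt)
        row[n_feat + t] = 1.0
        rows.append(row)

bounds = [(-1, 1)] * n_feat + [(None, None)] * n_dec               # bounded weights, free margins
res = linprog(c, A_ub=np.array(rows), b_ub=np.zeros(len(rows)), bounds=bounds)
w, m = res.x[:n_feat], res.x[n_feat:]
print("weights:", w, "margins:", m)
```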

2.3.1 Discussion

These comments are not relevant to the broader discussion and can be skipped.

In practice, one would seek to avoid overfitting the weights to the particularities of the data, especially given that the number of training purchases for each customer is likely to be small.


To this end, one can either prevent individual weights from exceeding some threshold k (by imposing constraints −k ≤ w_i ≤ k), or even better, introduce regularization terms that penalize high weights. A good solution (with a lot of theory behind it) is to regularize by penalizing the sum of squares of weights, by adding to the objective function a term −β‖w‖² = −β \sum_i w_i² for some constant β. This changes the problem from a linear program to a quadratic program, but it is still readily solved by OTS algorithms.

It should be clear from figure 3 that the margin m^t for each purchasing decision is determined by the distance between the best unpurchased solution (as defined by the weights) and the purchased solution. It is possible to re-formulate the learning problem from that of fitting weights to that of selecting the unpurchased solutions that define the margins; this is the dual problem in the sense defined in section 6. The classification technique named support vector machines (SVMs) is based on the concept of maximizing margins and this duality between weights and training data points.

Solving this problem using the LP formulation given above might well work, but there are better ways to learn these kinds of decision and preference functions from examples. For further information, look at a book on machine learning, particularly on the subject of classification. A particular problem with the formulation given here is that it doesn't aggregate information across passengers, which could be a problem given the small number of data points likely to be available about each. A better scheme might, for example, assign passengers into broader classes such as business-like or leisure-like.


3 Linear Programs: Definition and Theory

3.1 Standard form

When constructing a linear program, it is convenient to use the general form:

    minimize (or maximize):  c^T x
    variables:               x = x_1 ... x_n
    subject to:              A'x = b'
                             A''x ≤ b''
                             A'''x ≥ b'''

but to reason about linear programs and write algorithms to solve them, it is convenient to restrict attention to a simpler standard form:

    minimize:    c^T x
    variables:   x = x_1 ... x_n
    subject to:  Ax = b
                 x ≥ 0

Any linear program can be converted to standard form, through various simple transformations:

• If the original goal is to maximize c^T x, instead minimize (−c)^T x.

• For each constraint of form \sum_j A_{ij} x_j ≤ b_i, introduce a slack variable s_i:

      \sum_j A_{ij} x_j + s_i = b_i,    s_i ≥ 0

• For each constraint of form \sum_j A_{ij} x_j ≥ b_i, introduce a surplus variable s_i:

      \sum_j A_{ij} x_j − s_i = b_i,    s_i ≥ 0

• For each variable x_j not already restricted to be non-negative, introduce two new variables x_j^+ ≥ 0 and x_j^− ≥ 0; wherever x_j appeared in the original problem, replace it with x_j^+ − x_j^−.

Thus, the program


    maximize:    c_1 x_1 + c_2 x_2
    variables:   x_1, x_2
    subject to:  q_1 x_1 + q_2 x_2 ≥ Q
                 p_1 x_1 + p_2 x_2 ≤ P

becomes

    minimize:    (−c_1) x_1^+ − (−c_1) x_1^− + (−c_2) x_2^+ − (−c_2) x_2^−
    variables:   x_1^+, x_1^−, x_2^+, x_2^−, s_q, s_p
    subject to:  q_1 x_1^+ − q_1 x_1^− + q_2 x_2^+ − q_2 x_2^− − s_q = Q
                 p_1 x_1^+ − p_1 x_1^− + p_2 x_2^+ − p_2 x_2^− + s_p = P
                 x_1^+, x_1^−, x_2^+, x_2^−, s_q, s_p ≥ 0

The solution to the original problem can easily be recovered from the solution to the standard-form problem.
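These transformations are mechanical enough to automate. Below is a minimal sketch (not from the text) of a converter that takes a general-form problem with ≤ and ≥ rows and free variables and produces standard-form data; for simplicity it applies the x^+/x^− split to every variable:

```python
import numpy as np

def to_standard_form(c, A_le, b_le, A_ge, b_ge, maximize=False):
    """Convert  min/max c^T x  s.t.  A_le x <= b_le,  A_ge x >= b_ge,  x free
    into       min c_std^T z  s.t.  A_std z = b_std,  z >= 0,
    where z = [x+, x-, slack, surplus]."""
    n = len(c)
    c = np.asarray(c, float) * (-1.0 if maximize else 1.0)
    A = np.vstack([A_le, A_ge])
    b = np.concatenate([b_le, b_ge])
    m_le, m_ge = len(b_le), len(b_ge)
    slack   = np.vstack([np.eye(m_le), np.zeros((m_ge, m_le))])   # +s on the <= rows
    surplus = np.vstack([np.zeros((m_le, m_ge)), -np.eye(m_ge)])  # -s on the >= rows
    A_std = np.hstack([A, -A, slack, surplus])
    c_std = np.concatenate([c, -c, np.zeros(m_le + m_ge)])
    return c_std, A_std, b

# The one-dimensional example from the text: minimize x subject to x >= 1.
c_std, A_std, b_std = to_standard_form(
    c=[1.0], A_le=np.zeros((0, 1)), b_le=np.zeros(0),
    A_ge=np.array([[1.0]]), b_ge=np.array([1.0]))
# c_std = [1, -1, 0], A_std = [[1, -1, -1]], b_std = [1].  (The text's version keeps a
# single x because it additionally assumes x >= 0, so no x+/x- split is needed there.)
```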

Notice that every inequality in the general form increases the dimensionality n of the problem, but the dimensionality of the feasible set doesn't change because an inequality is changed to an equality. Consider the one-dimensional problem of minimizing x subject to x ≥ 1. The standard form introduces a surplus variable: minimize x subject to x − s = 1, x ≥ 0, s ≥ 0. Graphically, we have figure 4.

Figure 4: Standard-form dimensionality change. Left: general form, x ≥ 1. Right: standard form, x − s = 1, x, s ≥ 0. In both cases the feasible set is the solid portion of the blue line.

3.2 Interpretation of standard form

The standard form of a linear program is: minimize c^T x subject to Ax = b, x ≥ 0. The m × n matrix A (m rows, n columns) and m-dimensional vector b encode m constraints on the feasible set of solutions over n variables, and the n-dimensional vector of weights c encodes the direction of optimization.


Each row i is a linear constraint A_i x = \sum_j A_{ij} x_j = b_i. In an n-dimensional space, such a linear constraint restricts the set of solutions to an (n − 1)-dimensional planar subspace. (With two variables, the constraint is a line; with three variables, a 2-dimensional plane.) In n dimensions it takes n (linearly independent) constraints to uniquely determine a point.

In most linear programs, n > m: there are more decision variables than constraints. Thus, the constraints Ax = b do not uniquely determine a solution. That's vital. Without such flexibility, there'd be no way to adjust x to minimize the objective c^T x. (Section 5 discusses the special situations n = m and n < m.)

Considering the columns A_j of A, we have in standard form that

    \sum_j A_j x_j = b.

That is, the sum of the columns as weighted by the decision variables must equal the column vector b. Each of the n columns of A can be thought of as a bundle of m different kinds of resources that can be purchased as a unit. In this column-space view, A_j is a resource bundle and x_j is the number of that bundle that are purchased. The set of feasible solutions is the set of ways to purchase bundles such that the total set of resources purchased is exactly b.

This interpretation is fairly clear in the computer purchase problem of section 2.1: each column is the bundle of QPX query throughput, power consumption and heat production associated with one type of computer.

Because in the normal case the number of bundles (columns) n exceeds the number of resources (rows) m, the columns are not linearly independent and therefore there are many bundle purchases that satisfy the requirements b. In fact, the set of feasible solutions forms an (n − m)-dimensional linear subspace.

To summarize: each row of the matrix A can be thought of as a linear constraint related to the production (typically ≥) or consumption (typically ≤) of a single resource, and each column of the matrix A can be thought of as a bundle of resources that can be purchased/consumed as a group, with the variables x determining how many of each bundle (column) to purchase.

3.3 Convexity

Efficiently optimizing a function y = f(x) over a feasible set x ∈ X is only possible if the function and the set are restricted to particularly simple forms. Mathematical programming algorithms generally require that both the function and the feasible set are convex.

A convex set X is one where if two points x_1 and x_2 are in X then any convex combination of the two is also in X. Mathematically, x_1, x_2 ∈ X ⇒ λx_1 + (1 − λ)x_2 ∈ X, ∀ 0 ≤ λ ≤ 1. Intuitively this means that a line segment drawn from x_1 to x_2 lies entirely within X: the points can see each other. All polyhedra, regions defined as the intersection of half-spaces (Ax ≤ b), are convex.^4 Balls and ellipsoids are convex. The intersection of convex sets is convex.

^4 In mathematics this is more precisely a polytope, but it's easier to just use the common words polyhedron (singular) and polyhedra (plural).

An extreme point of a convex set X is a point that cannot be expressed as a convex combination of two other points in X. For a polyhedron, the extreme points are the corners defined by the intersection of boundary planes. Every point in the interior of a convex set can be expressed as a convex combination of extreme points. Thus, two equally valid representations of polyhedra are as the intersection of a finite set of half-spaces, and as the set of convex combinations of a finite set of extreme points.^5

^5 The number of extreme points of a polyhedron can be exponentially greater than the number of boundary planes. Consider the d-dimensional hypercube defined by ∀i = 1 ... d, 0 ≤ x_i ≤ 1: there are 2d boundary planes but 2^d extreme points (corners). For this reason most linear programming algorithms represent the feasible set using linear constraints, but an exception is discussed in section 9.2.

A convex function f is defined similarly to a convex set. f is convex if f(λx_1 + (1 − λ)x_2) ≤ λf(x_1) + (1 − λ)f(x_2) for all 0 ≤ λ ≤ 1. That is, f is convex if the set {(x, y) | y ≥ f(x)} is convex. Intuitively, the function does not have local minima, and thus it is possible to minimize over the set by moving successively downhill, without the risk of getting trapped at a sub-optimal point. Linear functions are convex. Quadratic functions are convex if the quadratic coefficients are positive (the parabola opens up). Sums of convex functions are convex. The maximum of a set of convex functions is convex. The absolute value function is convex.

A function f(x) is concave if −f(x) is convex. Just as convex functions allow for gradient-based minimization procedures, concave functions allow for gradient-based maximization procedures. Figure 5 presents examples of convex and concave functions.

If the optimization function is differentiable, a sufficient condition for convexity is that the second derivative is everywhere ≥ 0 (and likewise, for concavity, ≤ 0).

Linear programs are those with a linear objective function c^T x optimized over a feasible set that is a polyhedron, defined by boundaries that are linear equations. Quadratic programs (QPs) are those with a polyhedral feasible set but a quadratic convex objective function, y = c^T x + x^T Q x, where Q is positive-definite, meaning that x^T Q x ≥ 0 for all x - ensuring the paraboloid opens up. An example of a quadratic objective function is the distance squared between two points. Quadratic programs are nicely behaved but generally more expensive to solve than linear programs, and unlike LPs the optimal value is not always on the boundary of the feasible set. QCQPs are quadratically-constrained quadratic programs, which allow for quadratically-defined boundaries of the (convex) feasible set; an example of such a problem would be minimizing a distance function over the interior of a sphere or ellipsoid. Again, QCQPs can be solved, but less efficiently than LPs. There are more general forms of convex mathematical programs that can be solved, such as semi-definite programs (SDPs), and in fact general gradient-descent based procedures can optimize arbitrary convex functions over convex sets, but usually offer no guarantees of search efficiency.

Example: Consider the problem of minimizing the distance between two points v and w confined to two polyhedra defined by Av ≤ c and Bw ≤ d. The distance between v and w is r = \sqrt{(v − w)^T (v − w)}, which is neither linear nor quadratic. However minimizing a distance is the same as minimizing the distance squared, so one can minimize (v − w)^T (v − w), which is a quadratic function. As a distance squared, it is clearly always positive, so this is a quadratic program and is easily solved by computer programs.^6 Now consider the case of maximizing the distance, or equivalently, minimizing the negative distance. It seems like it should be easy to solve for convex sets, but it isn't a convex problem, as is clear from the fact that the second derivative is always negative. To understand the difficulty in optimizing this problem, consider the regions defined in figure 6.

^6 Actually, this pairwise minimization over any two convex sets can also be solved by alternating minimization: first move point v to the feasible point closest to point w, then move point w to the feasible point closest to point v, and so forth.

Figure 5: The top line contains examples of convex functions. The bottom line contains examples of non-convex functions; those which are tinged green are concave, those which are red are neither concave nor convex. The defining quality of convexity is whether the shaded area is a convex set; for concavity, the unshaded region must be convex. The lower left and lower right functions are quasiconvex.

Figure 6: The red points are farther apart than the green points, and the green points are farther apart than the blue points. Since the blue points are a convex combination of red and green but have a worse value, this is not a convex problem, and a gradient-descent based algorithm that starts at green will never reach the optimum at red.
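Returning to the convex (minimization) version of this example, a minimal sketch using scipy.optimize.minimize with the SLSQP method; the two toy polyhedra below are made-up boxes used only for illustration:

```python
import numpy as np
from scipy.optimize import minimize

# Two hypothetical polyhedra: Av <= c (the unit square) and Bw <= d (a shifted box).
A = np.array([[1.0, 0], [-1, 0], [0, 1], [0, -1]]); c = np.array([1.0, 0, 1, 0])
B = np.array([[1.0, 0], [-1, 0], [0, 1], [0, -1]]); d = np.array([5.0, -3, 1, 0])

def sq_dist(z):                       # z packs both points: z = [v, w]
    v, w = z[:2], z[2:]
    return (v - w) @ (v - w)          # squared distance, a convex quadratic

constraints = [
    {"type": "ineq", "fun": lambda z: c - A @ z[:2]},   # Av <= c  written as  c - Av >= 0
    {"type": "ineq", "fun": lambda z: d - B @ z[2:]},   # Bw <= d
]
res = minimize(sq_dist, x0=np.zeros(4), constraints=constraints, method="SLSQP")
v, w = res.x[:2], res.x[2:]
print(v, w, np.sqrt(res.fun))         # closest pair of points and their distance (here 2.0)
```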

Quasiconvexity is a slight generalization of convexity for functions; a function f is quasiconvex if ∀ 0 ≤ λ ≤ 1, f(λx_1 + (1 − λ)x_2) ≤ max(f(x_1), f(x_2)). Quasiconvex problems can not be solved by the most widely used LP algorithms but many less-specialized convex optimization techniques can accommodate them.

3.4 Optimizing a linear function over a polyhedron

A linear function optimized over a polyhedral feasible set always attains its minimum or maximum at an extreme point defined by the intersection of boundary planes.^7 That fairly intuitive fact, easily understood by examining the diagram in figure 7, is the basis of the simplex algorithm for LPs that explores only the extreme points of the feasible set, never the interior or midpoints of edges.

^7 This and most discussion for simplicity assumes the feasible set is bounded in all directions. It is possible to remove this restriction, in which case the optimum solution to a problem may include a set of rays, directions under which the solution can be extended indefinitely to further improve the objective function.

Figure 7: Maximizing x + 2y over the feasible set. The black parallel lines are contour lines of the objective function x + 2y. The maximum over the feasible set occurs at the extreme point defined by the blue and green boundary planes. The yellow cones show the range of directions of optimization for which each extreme point is the optimum; they are the cones defined by the normals to the boundary planes, or in other words, the rows of the matrix A.

In contrast, quadratic functions optimized over a polyhedron may take their optimum at the midpoint of an edge or in the interior (see figure 8), and therefore QPs and other nonlinear convex problems are usually solved by algorithms that explore the interior of the feasible set rather than just the boundary.

3.5 Piecewise-linear functions

Frequently one wishes to optimize a convex function that is not linear but is piecewise-linear, or can be reasonably approximated by a piecewise-linear convex function. An example of a piecewise-linear convex function is the absolute value y = abs(x) = max(x, −x). Any piecewise-linear convex function can, like the absolute value function, be expressed as the maximum of a set of linear functions. Given equations for each linear function (indexed by i) in the form Y^i x, the objective function can be encoded in a linear program by minimizing a variable y subject to y ≥ Y^i x, as shown in figure 9.
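A minimal sketch of this encoding with SciPy: minimizing abs(x) = max(x, −x) over the (made-up) interval 3 ≤ x ≤ 10 by introducing the auxiliary variable y and the two constraints y ≥ x and y ≥ −x:

```python
from scipy.optimize import linprog

# Variables are [x, y].  Minimize y subject to y >= x and y >= -x, i.e. y >= abs(x).
# In linprog's A_ub x <= b_ub form:  x - y <= 0  and  -x - y <= 0.
c = [0.0, 1.0]
A_ub = [[1.0, -1.0],
        [-1.0, -1.0]]
b_ub = [0.0, 0.0]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(3, 10), (None, None)])
print(res.x)   # -> x = 3, y = 3 = abs(3): the minimum of abs(x) on [3, 10]
```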

Figure 8: Minimizing the quadratic function (x − 6)² + (y − 4)² over the feasible set. The black circles are contour lines of the objective function. In contrast to linear objective functions, in this case the minimum over the feasible set occurs at a midpoint of a boundary plane rather than a vertex.

Figure 9: Left: abs(x) as maximum of −x and x; right: piecewise-linear approximation to x² expressed as the maximum over 5 tangents to the curve.

4 The Simplex Algorithm

4.1 Introduction

There are many different classes of algorithms for solving linear programs. The simplest effective procedure is known as the simplex algorithm, developed in the 1940s and enhanced over time with various optimizations. The simplex algorithm moves from extreme point to extreme point around the edge of the feasible set, ever in a direction that reduces the objective function, until it can move no further. The basic concepts involved in the simplex algorithm are described below, with the goal of developing intuitions rather than enabling an implementation.

In practice, for general linear programming problems, simplex-based solvers are rarely the best choice, though a good simplex solver should be competitive with more sophisticated algorithms for small and medium-sized problems. For decades simplex was the most practical algorithm for solving linear programs, but it has since been overtaken by various other algorithms, which may have provably better worst-case performance or exhibit better performance on medium and large-scale problems.

One reason the simplex algorithm is important is that for problems with special structure, such as many network problems, the operations of the algorithm take especially simple or efficient forms, and thus the simplex algorithm is a starting point for the development of highly efficient algorithms targeted at the problem.

There are many, many details to get right in a high-quality implementation of simplex or other LP algorithms; understanding the tricks of the trade requires mathematical sophistication and a lot of experience with computer memory hierarchies.

4.2 Extreme points and the basis set

The simplex algorithm is most easily implemented and understood for problems in standard form, Ax = b, x ≥ 0.

Each row of the matrix A is a linear constraint that restricts the feasible set to an (n − 1)-dimensional subspace; the basic theory of linear algebra tells us that given n such linearly-independent constraints, a single point would be determined. (For example, in two dimensions, the intersection of two one-dimensional lines is a point; in three dimensions, the intersection of three two-dimensional planes is a point.)

We know from convexity theory that the optimal solution to a linear program must lie at an extreme point of the feasible set, defined by the intersection of boundary conditions. But there are only m linear conditions in Ax = b and n are needed to determine a point. Where do the remaining n − m conditions come from?

The answer is they come from the non-negativity conditions x ≥ 0. Each condition x_i ≥ 0 defines a boundary plane to the feasible set, but because these are inequalities rather than equalities, the constraints x_i = 0 do not necessarily hold at the optimum. However, since we know the extreme points of the feasible set must be defined by the intersection of n constraints, it must be true that each extreme point results from setting some choice of n − m variables to zero. Or to put it another way: at any extreme point, only m of the n variables can be non-zero.^8 Variables that are non-zero at a particular extreme point are called the basic variables at that point and form the basis set.

^8 If the idea that decision variables must be zero seems strange, remember that in standard form many of the variables are slack and surplus variables, so it may still be the case that all of the original general-form decision variables are non-zero.

As there are n-choose-m = n!/(m!(n − m)!) possible choices for basic variables at extreme points, the number of extreme points can grow exponentially with the dimension. It is almost impossible to visualize this extrapolating from simple 2- and 3-dimensional drawings, but it can be understood by considering the d-dimensional hypercube, defined by 2d constraints but with 2^d extreme points.

Notice that given a set of m basic variables with indices v_1 ... v_m, it is straightforward to determine the corresponding extreme point of the feasible set. The n − m non-basic variables have value 0. Given this, the constraints Ax = b simplify because the columns of A corresponding to non-basic variables drop out (the resource bundles are unpurchased). Let the basis matrix B be formed by the columns of A corresponding to the basic variables:

    B = [A_{v_1} | A_{v_2} | ... | A_{v_m}].

B is square (m × m) and if its columns are linearly independent it uniquely determines the values of the m basic variables:

    Bx = b
    x = B^{-1} b

This leads to a naive algorithm for solving linear programs: iterate over all the n-choose-m sets of m basic variables. For each set construct the basis matrix B and solve Bx = b for x. If a solution x exists (the columns of B are linearly independent) and x ≥ 0 then an extreme point of the feasible set has been found. Over all such points, return the one that minimizes c^T x. This simple algorithm is quite expensive: the number of basis sets is exponential in n and solving for x in each iteration is a Θ(m³) operation.
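A minimal sketch of this naive enumeration (illustration only; it is exponential in n):

```python
import numpy as np
from itertools import combinations

def naive_lp_solve(c, A, b):
    """Enumerate every n-choose-m basis set of the standard-form LP
    min c^T x s.t. Ax = b, x >= 0, and return the best extreme point found."""
    m, n = A.shape
    best_x, best_cost = None, np.inf
    for basis in combinations(range(n), m):
        B = A[:, list(basis)]
        try:
            xb = np.linalg.solve(B, b)          # fails if the chosen columns are dependent
        except np.linalg.LinAlgError:
            continue
        if np.all(xb >= -1e-9):                 # feasible => an extreme point
            x = np.zeros(n); x[list(basis)] = xb
            if c @ x < best_cost:
                best_x, best_cost = x, c @ x
    return best_x, best_cost

# Standard form of the earlier one-dimensional example: min x  s.t.  x - s = 1, x, s >= 0.
print(naive_lp_solve(np.array([1.0, 0.0]), np.array([[1.0, -1.0]]), np.array([1.0])))
# -> (array([1., 0.]), 1.0)
```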

4.3 Simplex overview

The simplex algorithm is a modification of the above procedure that replaces the brute-force iteration over all sets of m basic variables with a directed search for the best set. Simplex maintains a candidate set of basic variables along with the corresponding basis matrix B (usually stored in its inverted form, B^{-1}). At each step a swap of one non-basic variable for an existing basic variable is considered. If the result improves c^T x, it is accepted. If no such swap exists, the solution is optimal (this follows from the nature of convex sets defined by linear constraints).

Each iteration of the simplex algorithm corresponds to swapping out one column of B for a column of A that is not currently in B; call this column A_j. Equivalently, the iteration moves from one extreme point x with x_j = 0 to another where x_j = α > 0. Let us write the new extreme point x* = x + αe_j + y, where e_j is the indicator vector with 1 in position j and zero elsewhere, and the vector y is any other change necessary to account for the difference between x* and x; note that y_j = 0. Since x* is an extreme point, it remains within the feasible set and it must be the case that Ax* = b. Thus,

    Ax* = b
    A(x + αe_j + y) = b
    A(αe_j + y) = 0          (because Ax = b)
    αA_j + Ay = 0
    αA_j + By = 0            (because y_j = 0)
    By = −αA_j
    y = −αB^{-1}A_j

This means that if x is modified by raising x_j from 0 to α, there must be a compensating move in the direction y = −αB^{-1}A_j to keep x* within the feasible set.^9 The aggregate direction of the move is d_j = e_j − B^{-1}A_j: x* = x + αd_j. The derivative of the objective function under a unit move in the direction d_j is known as the reduced cost κ_j for variable j: κ_j = c^T d_j = c_j − c_B^T B^{-1}A_j, where c_B^T = [c_{v_1} ... c_{v_m}].

If the reduced cost κ_j is negative, swapping x_j into the basis set will improve the objective function. But how far can the simplex algorithm move in direction d_j before it leaves the feasible set? Each move is a swap of basic variables, so the simplex algorithm moves until the first currently basic variable becomes non-basic, that is, becomes zero. The length of a move from x in direction d_j that would zero variable k is −x_k / d_{jk}. So α = min_k (−x_k / d_{jk}), taken over the basic variables k with d_{jk} < 0, and if k* is the index of the variable that achieves the minimum, k* leaves the basis set when j enters.^10

To summarize: at every step, the simplex algorithm maintains a set of m basic variables v_1 ... v_m and the basis matrix B = [A_{v_1} | A_{v_2} | ... | A_{v_m}], in the form of its inverse B^{-1}. From this, it calculates the reduced cost κ_j = c_j − c_B^T B^{-1}A_j for every variable j not in the basis set.^11 If no negative reduced cost exists, the algorithm terminates with x = B^{-1}b. Otherwise it chooses a variable j from among all those with negative reduced costs and then calculates the length of the move in direction d_j = e_j − B^{-1}A_j by determining the first point at which a basic variable k becomes zero. Then, having determined j and k, it updates the basis set by adding j and removing k, which involves swapping A_j for A_k in B, and thus an update to B^{-1}.

^9 In linear algebra terms, this is projecting αe_j back into the null-space of A.

^10 Some care is needed in how ties are broken among variables entering and leaving the basis set, to avoid cycles.

^11 Exercise: How expensive is this? Hint: what order does one do the matrix multiplication in?


4.4 Computational details

Computationally, there are two expensive operations in the simplex algorithm. The first is calculating the reduced costs, so often tricks are used such as stopping the search after the first negative κ_j is found, rather than finding all of them and using steepest-descent heuristics to choose between them. The second expensive operation is updating the basis matrix B after the entering column A_j and exiting column A_k have been determined. Because B is only used in inverse form B^{-1}, and computing a matrix inverse is a Θ(m³) operation, most implementations of the simplex algorithm actually only maintain B^{-1} in memory, and use various linear algebra identities to update B^{-1} to reflect the column change in time O(m²), considerably less than it would take to update B and re-invert. A good explanation of how this is accomplished is found in the Bertsimas and Tsitsiklis textbook. However, while A (and B) may be sparse, allowing both to be stored in space much smaller than mn, B^{-1} is typically not sparse and the m² memory requirement can be a limiting factor. Without going into implementation details, a single step of the simplex algorithm takes best-case time Θ(m²) (if the update to the B^{-1} matrix is the dominant term) and worst-case Θ(mn) (if Θ(n) columns of A have to be scanned before finding one with negative reduced cost).
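One way such an O(m²) update can be realized is with the Sherman-Morrison rank-one update formula; this is only a sketch of the idea, not the specific scheme the textbook describes (production codes typically maintain LU factorizations with product-form or eta updates instead):

```python
import numpy as np

def replace_basis_column(B_inv, A_j, k):
    """Return the inverse of B with its k-th column replaced by A_j, computed from
    B_inv = B^{-1} via Sherman-Morrison in O(m^2) instead of re-inverting in O(m^3)."""
    m = len(A_j)
    w = B_inv @ A_j                      # B^{-1} A_j
    e_k = np.eye(m)[k]
    # New matrix is B + u e_k^T with u = A_j - B e_k, so B^{-1} u = w - e_k and
    # the denominator 1 + e_k^T B^{-1} u reduces to w[k].
    return B_inv - np.outer(w - e_k, B_inv[k]) / w[k]

# Sanity check on random data (sizes are arbitrary).
rng = np.random.default_rng(0)
m, k = 5, 2
B = rng.normal(size=(m, m))
A_j = rng.normal(size=m)
B_new = B.copy(); B_new[:, k] = A_j
assert np.allclose(replace_basis_column(np.linalg.inv(B), A_j, k), np.linalg.inv(B_new))
```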

The simplex algorithm moves from extreme point to extreme point, around the edge of the feasible set. There are, worst-case, an exponential number of such extreme points. How many steps does the algorithm take in practice? The answer is that it is possible to construct pernicious examples where the simplex algorithm takes an exponential number of steps to arrive at the solution, but widespread experience on practical problems is that the number of steps grows polynomially and worst-case behavior is not relevant. A good rule of thumb is that the number of steps usually grows linearly with the number of constraints m, suggesting an overall practical complexity of O(m³) or O(m²n). Again, this is not a formal result, as the number of steps cannot be easily predicted and depends on problem structure.

4.5 Initialization

The simplex algorithm works by iteratively updating the set of basic variables. What basis set does it start from? Not every selection of m basic variables results in a feasible solution, so initialization is a serious issue: how does one find an initial extreme point of the feasible set and the basis set it corresponds to?

In many problems expressed in general form Ax ≤ b, x ≥ 0, the value x = 0 is a feasible solution, corresponding to a standard-form solution where the original variables are non-basic and the slack variables are all basic (Ax + s = b, x ≥ 0, s ≥ 0 → x = 0, s = b). Thus, the m columns corresponding to slack variables are a valid starting point for the algorithm (this is particularly efficient because B and B^{-1} are both the identity matrix).

Simplex-based LP solvers usually either allow the user to specify an initial set of basis columns (for efficiency) or have an initialization subroutine that finds an initial extreme point of the feasible set and corresponding basis set. One way this can be accomplished is by solving a different LP with additional variables for which the initialization problem is trivial. For an original problem in standard form Ax = b, x ≥ 0, initialize by solving Ax + y = b, x ≥ 0, y ≥ 0, minimizing y_1 + ... + y_m. This revised problem has a feasible starting point of x = 0, y = b, with the slack variables y forming the initial basis set. If the original problem's feasible set is non-empty, the optimal solution to the revised problem will have value y = 0 with x a feasible extreme point of the original problem, and the m basic columns can be directly transferred to form the starting basis matrix B.
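A minimal sketch of constructing that auxiliary ("phase one") problem. Purely for illustration the auxiliary LP is handed to SciPy's linprog rather than to a hand-written simplex step, and only a feasible point is returned, not the basis bookkeeping a real implementation would carry along; rows are flipped so that b ≥ 0 and x = 0, y = b is a valid start:

```python
import numpy as np
from scipy.optimize import linprog

def phase_one_feasible_point(A, b):
    """Find a feasible point of Ax = b, x >= 0 by minimizing the sum of
    artificial variables y in Ax + y = b, x >= 0, y >= 0."""
    A = np.array(A, float); b = np.array(b, float)
    flip = b < 0
    A[flip] *= -1; b[flip] *= -1                       # make b >= 0 so x = 0, y = b is feasible
    m, n = A.shape
    c_aux = np.concatenate([np.zeros(n), np.ones(m)])  # minimize y_1 + ... + y_m
    A_aux = np.hstack([A, np.eye(m)])
    res = linprog(c_aux, A_eq=A_aux, b_eq=b)           # variables default to >= 0
    if res.fun > 1e-9:
        raise ValueError("the original feasible set is empty")
    return res.x[:n]

# Standard form of the earlier example: min x  s.t.  x - s = 1, x, s >= 0.
print(phase_one_feasible_point([[1.0, -1.0]], [1.0]))  # e.g. [1, 0]
```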

There are many circumstances where a set of similar LP problems each need to be solved and the solution to one can serve as the initialization point of the next, greatly reducing the number of steps the simplex algorithm will take. An example of this is some implementations of delayed column generation, explained in section 9.1.

4.6 Other algorithms

There are a wide variety of algorithms for solving linear programs beyond simplex. Typical words in their names include: primal-dual, barrier, path-following, interior-point, ellipsoid, predictor-corrector, et cetera. Many of these have worst-case or expected-case behavior that improves over simplex, especially for large problems or problems with particular structures, and no "industrial" solvers use plain simplex as their general-purpose linear solver.

The simplex algorithm moves between the vertices at the boundary of the feasibility set. So-called interior-point algorithms move within the interior of the set, often asymptotically approaching the optimum solution.^12 Many interior-point algorithms are adaptable to more general non-linear convex optimization (including the special cases of quadratic programming and semi-definite programming).

A number of linear programming algorithms have been proven to be polynomial time, though some of the early examples of "provably efficient" algorithms are slower than simplex in practice. The ellipsoid algorithm for solving LPs is well known in theoretical computer science because it has running time polynomial in the number of variables n but independent of the number of constraints m if there is an efficient way to test whether a point x is in the feasible set, and if not, find a separating hyperplane. Thus, there are certain LPs with an exponential number of constraints (m = Θ(2^n)) that are nevertheless provably solvable in time O(n^c). Although not provably able to solve problems in polynomial time, the simplex algorithm augmented with delayed column generation (section 9.1) is a method for dealing with cases where the number of variables is exponentially greater than the number of constraints.

^12 The Bertsimas and Tsitsiklis book has a readable introduction to interior point methods and the ellipsoid algorithm.


5 Special Case: Equalities

If all constraints in a linear program are equalities, Ax = b, then the feasible set takes on a particularly simple form and there are substantially simpler and faster ways to solve the program than running an LP solver. Interestingly, in each case the best solution x* can be found by computing a pseudoinverse projection matrix A^+ from A. Then, the optimal answer can be computed by a simple projection: x* = A^+ b. Therefore re-computing the solution for other constraint vectors b is a fast matrix-vector multiply operation.

Figure 10: The three cases of equality constraints. Left (case 1: m = n), a unique solution. Middle (case 2: m > n), overconstrained and solved by minimizing the sum of squared residues (black lines). Right (case 3: m < n), underconstrained but solved by minimizing the magnitude of the solution.

There are three important cases, as summarized in figure 10:

1. m = n: There are exactly as many (linearly independent) constraints as variables. In this case, x is uniquely determined as A^{-1}b and can be found in time Θ(n³) by simple Gaussian elimination, or by computing the projection matrix A^+ = A^{-1}.

2. m > n: There are more (linearly independent) constraints m than variables n. The problem is over-constrained and there does not exist a feasible solution. Nevertheless, one can find the x* that most closely satisfies the constraints. If this is taken to mean minimizing the squared magnitude of the residue vector ‖Ax − b‖² = (Ax − b)^T(Ax − b), then this is the classic problem known as least-squares fitting, or linear regression with a quadratic loss function, discussed further below. The solution can be found in time Θ(n³) by calculating A^+ = (A^T A)^{-1}A^T, though faster computation methods are available.

3. m < n: There are fewer (linearly independent) constraints than variables. Then the feasible set is an (n − m)-dimensional space and some objective function has to be defined to select between the points in the feasible set. Given no problem-specific reasons for choosing a different objective function, there are arguments for minimizing ||x||^2 = x^T x. This is equivalent to ensuring x⋆ has no components other than those necessary to satisfy the m constraints, and also helps to reduce the sensitivity of x⋆ to any noise in A. For this case, the pseudoinverse matrix is usually obtained from A using a technique called singular value decomposition (SVD). Libraries for performing SVD are readily available, and solving this way can be more efficient than solving the minimization as a quadratic program. The details can be found in linear algebra textbooks.
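As a concrete illustration (not from the original text), the three cases map onto a few lines of NumPy; numpy.linalg.pinv computes the SVD-based pseudoinverse used in case 3, and the random matrices below are stand-ins for real problem data.

import numpy as np

rng = np.random.default_rng(0)

# Case 1: m = n, A square and invertible -> unique solution x = A^{-1} b.
A1 = rng.normal(size=(3, 3)); b1 = rng.normal(size=3)
x1 = np.linalg.solve(A1, b1)               # same as np.linalg.inv(A1) @ b1

# Case 2: m > n, overconstrained -> least-squares solution x = (A^T A)^{-1} A^T b.
A2 = rng.normal(size=(5, 3)); b2 = rng.normal(size=5)
x2 = np.linalg.lstsq(A2, b2, rcond=None)[0]

# Case 3: m < n, underconstrained -> minimum-norm solution via the SVD pseudoinverse.
A3 = rng.normal(size=(3, 5)); b3 = rng.normal(size=3)
x3 = np.linalg.pinv(A3) @ b3               # x of smallest ||x|| satisfying A3 x = b3

# In every case, once A+ is known the "solve" step is just a matrix-vector product:
for A, b, x in [(A1, b1, x1), (A2, b2, x2), (A3, b3, x3)]:
    print(np.allclose(np.linalg.pinv(A) @ b, x))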

5.1 Projection matrices

The fact that all three cases can be solved using projection matrices is startling! Based on only the matrix of constraints A one can construct a matrix A+ that allows one to quickly find the optimal x⋆ for any b.

For example, consider the case of fitting a linear function to data (linear regression, a form of function approximation). A set of m data points is received, each an n-dimensional input vector z_i = z_i1 . . . z_in and a one-dimensional output y_i. The data is approximated by ŷ_i = f(z_i) = w^T z_i using a vector of coefficients w = w_1 . . . w_n. In matrix terms, letting Z_ij = z_ij, the output vector is ŷ = Zw, and the goal is to find the w that minimizes the sum of squared residues ||y − ŷ||^2 = (y − Zw)^T(y − Zw). This can be solved by computing Z+ = (Z^T Z)^{-1}Z^T and then the best-fit weights w⋆ = Z+y. Since Z+ does not depend on y, the computationally difficult part of linear function approximation depends only on the function inputs, not the outputs, and the problem can be re-solved for different outputs very quickly.

5.2 More on regression

A common problem in linear regression is overfitting of the data, leading to poor predictive performance on future test cases. This is especially the case if the dimensionality n of the points is high (there are many weights). A solution is to regularize by penalizing the magnitude of the weight vector, minimizing (y − Zw)^T(y − Zw) + λ||w||_α where λ is an arbitrary coefficient usually selected by cross-validation13 and α is either 1 (lasso regression) or 2 (ridge regression).14 Ridge regression has the disadvantage of favoring many small weights; lasso regression tends to move weights to 0, making for interpretable and more computationally efficient models. In either case the regression problem becomes a quadratic program:

minimize: Σ_i (y_i − ŷ_i)^2 + λ Σ_j w_j^2     (Ridge)
          Σ_i (y_i − ŷ_i)^2 + λ Σ_j |w_j|     (Lasso)

variables: w_j, ŷ_i

subject to: ŷ_i = Σ_j w_j z_ij   ∀i

However ridge regression can be solved more efficiently than running a general QP solver, again using a projection matrix. In fact the quadratic penalty function simply changes the projection matrix to Z+ = (Z^T Z + λI)^{-1}Z^T.
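A minimal sketch of the ridge projection matrix at work; the data and the value of λ below are invented, and in practice λ would be chosen by cross-validation as in footnote 13.

import numpy as np

rng = np.random.default_rng(1)
m, n = 50, 5
Z = rng.normal(size=(m, n))                     # inputs, one row per data point
w_true = np.array([1.0, 0.0, -2.0, 0.0, 0.5])
y = Z @ w_true + 0.1 * rng.normal(size=m)       # noisy outputs

lam = 0.5                                        # regularization strength
Z_plus = np.linalg.inv(Z.T @ Z + lam * np.eye(n)) @ Z.T   # ridge projection matrix
w_ridge = Z_plus @ y                             # best-fit weights for this output vector

# Z_plus depends only on the inputs Z, so refitting for new outputs is a single multiply:
y2 = Z @ w_true + 0.1 * rng.normal(size=m)
w_ridge2 = Z_plus @ y2
print(w_ridge, w_ridge2)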

13 That is, the problem is solved for a variety of different λs, generating a set of weight vectors. Each weight vector is evaluated by testing the quality of the function approximation on some held-out training data and the weights that perform best are chosen. An equivalent formulation is to replace the objective function penalty with a linear constraint ||w||_α ≤ γ and search over γ.

14 Choosing 0 < α < 1 would force more coefficients to zero, but the problem would no longer be convex.


6 Sensitivity and Duality

6.1 Sensitivity

The optimum value y = c^T x⋆ of a linear program is a function of the cost vector c and the constraint vector b. Calculating ∂y/∂c and ∂y/∂b is a simple form of sensitivity analysis. Much more extensive forms of sensitivity analysis are possible, such as calculating sensitivity to the elements of A and the range for each element of A, b and c for which x⋆ and y vary linearly; the Hillier and Lieberman book has extensive discussion.

Sensitivity of y to changes in the cost vector c is easily computed as ∂y/∂c = ∂(c^T x⋆)/∂c = x⋆. However x⋆ can undergo large discontinuous jumps from extreme point to extreme point as the angle of c changes slightly. For small angular changes such jumps are orthogonal to c and thus inconsequential (to the value y). It is possible to put lower bounds on the range c can vary without moving x⋆ by considering the simplex stopping condition, namely that the reduced cost of every column be non-negative: ∀j, c_j − c_B^T B^{-1} A_j ≥ 0. In other words, x⋆ will remain constant until a new column enters the basis set, and ranges on each c_j can be derived from that condition. For example, if x_j is not in the basis set, then it stays out until c_j drops below c_B^T B^{-1} A_j.

Sensitivity of y to changes in the constraint vector b can be understood by considering the feasible set defined by Ax ≤ b. Any infinitesimal change to an element b_i expands or contracts the feasible set by moving one of the boundary planes. If the optimal solution x⋆ lies on that plane and there are no redundant constraints, the solution will move, in turn altering the value y = c^T x⋆. Thus, in principle one can calculate the shadow price vector p = ∂y/∂b, also known as the row prices or marginal prices.

The simplex algorithm gives a practical way to compute shadow prices. At termination, the optimum point x⋆ is calculated from the final basis matrix B using x⋆ = B^{-1}b, so y = c_B^T B^{-1}b, and thus p^T = c_B^T B^{-1}.
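A small sketch (the three-constraint production LP below is invented) that estimates the shadow prices p = ∂y/∂b numerically by re-solving the LP with each b_i perturbed; SciPy's linprog does the solves.

import numpy as np
from scipy.optimize import linprog

c = np.array([-3.0, -5.0])                 # maximize 3x1 + 5x2, so negate for linprog
A = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [3.0, 2.0]])
b = np.array([4.0, 12.0, 18.0])

def optimum(b_vec):
    res = linprog(c, A_ub=A, b_ub=b_vec, bounds=(0, None))
    return -res.fun                        # value of the maximization

y0 = optimum(b)
eps = 1e-6
shadow = np.array([(optimum(b + eps * np.eye(3)[i]) - y0) / eps for i in range(3)])
print(y0, shadow)                          # expect prices near (0, 1.5, 1); the slack first constraint gets 0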

Shadow prices have many interpretations. In an open marketplace they are the price one should pay for a unit of resource i, since they reflect the benefit conferred by being able to alter b_i. Below we will see their fundamental role as the solution of a dual problem and an application in airline revenue management.

6.2 Duality

Consider the shortest-path problem: Given a directed graph G = (V, E) with weighted edges (w(e_ij) = w_ij), find the directed path between vertices v_s and v_t of minimal total edge weight. How can the shortest-path problem be formulated as a linear program? (You might want to think about this before reading further.)

Here’s one way: create decision variables d_v for every vertex in the graph, representing the length of the shortest path from v_s to v. Each edge e_ij imposes the constraint d_j ≤ d_i + w_ij, or d_j − d_i ≤ w_ij. These constraints provide upper bounds on distances to each vertex. If we set d_s = 0 and maximize d_t, then d_t will be the largest distance that is no larger than the shortest path, the value we seek.

maximize: d_t
variables: {d_v}
subject to: d_s = 0
            d_j − d_i ≤ w_ij   (∀e_ij)

One can visualize the maximization in this problem as building the network from string, fixing v_s at the origin, and pulling v_t to the right. When the edge strings (constraints) prevent it from moving any further, the position of v_t is the length of the shortest path and those strings that are taut are on a shortest path. The slack variable for edges on the shortest path will be zero.

A different formulation is based on flow through the network. Picture a unit of flow entering at v_s and propagating along edges, with the only possible exit v_t. Label the flow along every edge f_ij and assume that there is a cost to transporting the flow across any edge proportional to the edge length w_ij. The goal is therefore to find the flow that minimizes Σ_{e_ij} f_ij w_ij. Conservation of flow constraints are imposed at every vertex v_i: Σ_{e_ij} f_ij − Σ_{e_ki} f_ki = s_i, where s_i is the amount of flow sourced or sunk at vertex i (1 for v_s, −1 for v_t, 0 elsewhere). It is not hard to see that minimizing the cost of this flow is equivalent to finding a shortest path.

minimize: Σ_{e_ij} w_ij f_ij
variables: {f_ij}
subject to: Σ_{e_ij} f_ij − Σ_{e_ki} f_ki = s_i   (∀v_i)
            f_ij ≥ 0   (∀e_ij)

In the distance formulation, the edge weights appear in constraints. In the flow formulation they appear in the objective function. The distance problem maximizes, the flow minimizes. In the distance formulation there’s a constraint for every edge and a variable for every vertex; in the flow formulation a constraint for every vertex and a variable for every edge. But the optimal answer is the same for both.
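To make the two formulations concrete, the sketch below (graph and weights invented) solves the distance LP and the flow LP for the same small instance with SciPy's linprog and checks that the two optima agree.

import numpy as np
from scipy.optimize import linprog

# Vertices s=0, a=1, b=2, t=3; each edge is (i, j, weight).
edges = [(0, 1, 1.0), (0, 2, 4.0), (1, 2, 2.0), (1, 3, 6.0), (2, 3, 3.0)]
nv, s, t = 4, 0, 3

# Distance formulation: maximize d_t subject to d_j - d_i <= w_ij and d_s = 0.
c_dist = np.zeros(nv); c_dist[t] = -1.0              # minimize -d_t == maximize d_t
A_ub = np.zeros((len(edges), nv)); b_ub = np.zeros(len(edges))
for k, (i, j, w) in enumerate(edges):
    A_ub[k, j], A_ub[k, i], b_ub[k] = 1.0, -1.0, w
A_eq = np.zeros((1, nv)); A_eq[0, s] = 1.0           # d_s = 0
dist = linprog(c_dist, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[0.0])

# Flow formulation: minimize sum w_ij f_ij subject to conservation, f >= 0.
c_flow = np.array([w for _, _, w in edges])
A_cons = np.zeros((nv, len(edges)))
for k, (i, j, _) in enumerate(edges):
    A_cons[i, k] += 1.0                              # flow out of i
    A_cons[j, k] -= 1.0                              # flow into j
supply = np.zeros(nv); supply[s], supply[t] = 1.0, -1.0
flow = linprog(c_flow, A_eq=A_cons[:-1], b_eq=supply[:-1])   # one redundant row dropped

print(-dist.fun, flow.fun)   # both equal the shortest-path length (6 for this graph)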

The phenomenon is known as duality. In fact, there is a simple procedure for generating a dual LP from every primal LP by swapping costs for constraint values (c ↔ b), and constraint rows for variable columns (A ↔ A^T). It is possible to prove that the optimal objective function values for the primal and the dual are the same, so the two formulations are in a strong sense equivalent, though in some problems one formulation seems conceptually simpler and offers a more efficient implementation. Taking the dual of the dual brings one back to the primal. A number of linear programming algorithms are based on considering both the primal and dual simultaneously. For many problems it is very educational and suggestive to consider the physical interpretation of the dual.

Figure 11 provides a generic recipe for converting from a primal to a dual program. The proof of this construction is most easily understood for standard-form LPs:


             primal                        dual
minimize:    c^T x                         maximize: p^T b
variables:   x                             variables: p
subject to:  A_i x ≥ b_i                   p_i ≥ 0
             A_i x ≤ b_i                   p_i ≤ 0
             A_i x = b_i                   p_i unbounded
             x_j ≥ 0                       p^T A_j ≤ c_j
             x_j ≤ 0                       p^T A_j ≥ c_j
             x_j unbounded                 p^T A_j = c_j

Figure 11: Recipe for constructing a dual program from a primal. A_i indicates the i-th row of A, A_j the j-th column.

             primal                        dual
minimize:    c^T x                         maximize: p^T b
variables:   x                             variables: p
subject to:  A_i x = b_i                   subject to: p^T A_j ≤ c_j
             x_j ≥ 0

Consider the simplex stopping condition that all reduced costs be non-negative: c − c_B^T B^{-1}A ≥ 0. If we define a vector p^T = c_B^T B^{-1} then at an optimal feasible solution p^T A ≤ c. Thus, the dual constraints p^T A ≤ c are exactly the optimality condition of the primal problem. The dual objective function p^T b is equal to c^T x when x is the optimal basic solution to the primal: p^T b = c_B^T B^{-1}b = c_B^T x_B = c^T x. Hence, we have strong duality: if there is an optimal solution to the primal then there is an optimal solution to the dual with the same objective function value.15

It is no coincidence that dual problem decision variables are denoted by the letter p: the values p^T = c_B^T B^{-1} that optimize the dual are the shadow prices of the primal problem! In standard form Ax = b, duality creates an equivalence between the price of a resource bundle (c_j) and the price of a primitive resource (p_i), requiring that the value of the resource bundle b be the same whether composed from column bundles (c^T x) or atomic resources (p^T b).

Duality for linear programs is deeper than the above analysis suggests. For example, a slack constraint in the primal corresponds to a zero shadow price and a non-zero shadow price corresponds to a tight constraint in the primal; this is known as complementary slackness. A consequence of complementary slackness is that each basis set in the primal has a corresponding basis set in the dual, and vice versa, and the only case where both are feasible simultaneously is at the optimum point where c^T x = p^T b.

15 This result requires in addition a proof that the dual objective p^T b achieves as its maximum the minimum of c^T x, a result that follows from a lemma p^T b ≤ c^T x known as weak duality.


6.2.1 Network Revenue Management with Bid Prices

An example of the use of dual shadow prices in the airline industry is in network revenue management via bid prices. The discussion below is a substantial simplification of actual practice.

Suppose an airline with a flight network wants to maximize profits. The airline operates a set of flights {f}. Each flight has a maximum capacity c_f. For RM purposes they collect commonly flown sequences of flights into routes {r}; let e_rf = 1 if flight f is part of route r. For each route the airline offers multiple “products” at different prices, each associated with a booking code b; in reality the products might all be physically the same seat, but they are differentiated by price. For a given route and booking code, the airline has a price table p_rb (also known as the market value table, which may in practice depend only on the endpoints of the route and the date and not on the specific flight sequence). From a database of historical purchasing patterns, the airline estimates an expected demand d_rb for each route and product, the maximum number of customers willing to buy that product.16

The airline’s goal is to maximize revenue by deciding (when a customer inquires) whether to offer a given product for sale. One way to accomplish this is to solve the following linear program with decision variables n_rb, the number of products b to sell on route r:

maximize: Σ_{rb} n_rb p_rb
variables: {n_rb}
subject to: Σ_{rb} n_rb e_rf ≤ c_f   ∀f (flight capacity)
            n_rb ≤ d_rb   ∀r, b (demand)
            n_rb ≥ 0   ∀r, b

Solving this LP produces a table of how many of each product to sell, but that is not terribly easy to use in actual sales. Instead, the shadow prices π_f for each capacity constraint Σ_{rb} n_rb e_rf ≤ c_f provide an estimate of the cost to the airline of selling a seat on the flight, in the sense that if a product is sold on that flight one less seat will be available for other products and π_f is the lost future revenue (∂y/∂c_f) due to that seat being unavailable. The shadow prices are called bid prices in this context. So the argument goes that it is worth selling a product rb if the revenue exceeds the sum of the component bid prices: p_rb ≥ Σ_f e_rf π_f.
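A toy sketch of the bid-price computation (the two-flight network, fares, demands and capacities below are invented): it re-solves the revenue LP with each capacity perturbed to estimate π_f = ∂y/∂c_f, exactly as in section 6.1; most LP solvers can also report these duals directly.

import numpy as np
from scipy.optimize import linprog

# Flights F1: A-B and F2: B-C; routes r1={F1}, r2={F2}, r3={F1,F2}, one product each.
fares = np.array([100.0, 120.0, 180.0])        # p_rb
demand = np.array([60.0, 50.0, 80.0])          # d_rb
e = np.array([[1.0, 0.0, 1.0],                 # e[f, r] = 1 if flight f is on route r
              [0.0, 1.0, 1.0]])
cap = np.array([100.0, 100.0])                 # c_f

def revenue(capacity):
    res = linprog(-fares, A_ub=e, b_ub=capacity, bounds=[(0.0, d) for d in demand])
    return -res.fun                            # linprog minimizes, so negate

y0 = revenue(cap)
eps = 1e-5
bid = np.array([(revenue(cap + eps * np.eye(2)[f]) - y0) / eps for f in range(2)])
print(y0, bid)                                 # expect revenue 20000 and bid prices near (100, 80)
print(fares >= e.T @ bid - 1e-6)               # sell a product only if its fare covers its bid prices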

There are a number of substantial deficiencies of this simplistic model. If the capacity constraint for a flight f is slack (demand is insufficient to pressure the constraint), then π_f = 0, so it is always worth selling a product, regardless of the price. In reality one would like to charge as much as the customer would be willing to pay. And the expected demand estimate d_rb is deterministic; in reality demand is better modeled as a random variable, in fact a random variable with joint probability distribution over all products simultaneously (demand for different products is correlated). Furthermore, the model must be re-computed after every sale (capacity reduction) and as time passes (since reduced time to departure changes the demand forecast) to work well. More sophisticated models that fix some of these shortcomings are described in the Talluri and Van Ryzin book.

16 Estimating demand is extremely difficult: past purchase patterns for a flight will have been under a different environment (prices, competition, general economic conditions, et cetera) and values may have been truncated because insufficient seats were available to service the demand, or the airline refused sales.


7 Integer Programming

Many optimization problems naturally involve either integer or boolean decision variables, representing whole numbers of products or resources, truth or falsity of a logical proposition, or discrete choices. Such problems are often called combinatorial and can be expressed by adding restrictions to a linear program that limit decision variables to integer values (frequently, to just the binary case of 0-1). Good examples are the computer purchasing problem and the re-accommodation problem. A linear problem constrained to integer-valued variables is called an integer program (IP); one that mixes integer and real-valued variables is often called a mixed integer program (MIP).

7.1 Computational complexity

Combinatorial problems are in general much harder to solve than convex continuous problems; linear programs can be solved in polynomial time but integer programming is NP-complete. Linear programs with millions of variables and constraints can be solved if there is sufficient sparsity in the constraint matrix, but hard integer programs with just a hundred variables can be utterly impractical to solve.

In practice, how much harder an IP is to solve than an LP depends entirely on the problem structure. In the computer purchasing scenario with fixed cooling capacity it seems likely that the optimum integer solution will lie close to the optimum relaxed LP solution, both in variable values and objective function value, though that is not certain. It is in fact fairly common for the best integer variables to lie far from the LP relaxation solution but for the objective function value to be very close, because for many problems there are a wide range of LP solutions of approximately equal value. However in the case where additional air conditioning resources can be purchased, it seems likely that the integer condition on the number of air conditioners may radically alter the structure of the problem and the relaxation may provide little useful information.

A classic demonstration of the worst-case difficulty of IP vs. LP is the expression of the NP-complete 3-SAT problem as an IP. Recall that 3-SAT is the problem of finding a satisfying assignment of true or false to a set of variables x_1 . . . x_m subject to a set of k “or” clauses over subsets of 3 variables or their negations. For example, (x_1 ∨ x_2 ∨ x_3) ∧ (¬x_1 ∨ ¬x_2 ∨ x_3) ∧ (¬x_1 ∨ x_2 ∨ ¬x_3). If each 3-SAT logical variable is represented by a zero-one decision variable, any 3-SAT problem can be translated to an integer program as a set of k inequalities, such as (for this problem):

x_1 + x_2 + x_3 ≥ 1
(1 − x_1) + (1 − x_2) + x_3 ≥ 1
(1 − x_1) + x_2 + (1 − x_3) ≥ 1

This demonstrates that finding a feasible solution for an integer program is NP-complete. But notice that in this case solving the LP relaxation is not only easy but trivial: simply assign each variable the value 1/2, ensuring every clause is satisfied (with sum 3/2). This relaxed solution provides no guidance whatsoever as to how to solve the original combinatorial problem, which suggests programs with no weights to favor one relaxed solution over another may be especially hard for IP solvers.
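As a sketch of how this encoding can be handed to a solver (the clause representation below is an illustrative assumption, and scipy.optimize.milp requires SciPy 1.9 or later):

import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

# Clauses as signed literals: +i means x_i, -i means NOT x_i (variables numbered from 1).
# These are the three clauses from the text.
clauses = [(+1, +2, +3), (-1, -2, +3), (-1, +2, -3)]
n = 3

# Each clause becomes one row: the sum of x_i over positive literals plus (1 - x_i) over
# negative literals must be >= 1, i.e. (+1/-1 coefficients) x >= 1 - (#negative literals).
A = np.zeros((len(clauses), n))
lb = np.ones(len(clauses))
for row, clause in enumerate(clauses):
    for lit in clause:
        A[row, abs(lit) - 1] = 1.0 if lit > 0 else -1.0
        if lit < 0:
            lb[row] -= 1.0

res = milp(c=np.zeros(n),                          # pure feasibility problem
           constraints=LinearConstraint(A, lb, np.inf),
           integrality=np.ones(n),                 # all variables integer...
           bounds=Bounds(0, 1))                    # ...and restricted to 0-1
print(res.x)                                       # a satisfying assignment, if one exists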

There are some problems for which the LP relaxation of an IP problem is guaranteed to produce integer-valued solutions. Network-flow problems, including therefore shortest-path problems, where the values of A and b are all integral are guaranteed to have integer-valued optimal solutions. Most LP textbooks explain the unimodularity conditions on A necessary for this nice property to hold.

7.2 Encoding combinatorial problems as integer programs

The following sections present various useful techniques for encoding combinatorial problems as integer programs.

7.2.1 Logic gates

Often integer program encodings of combinatorial problems involve some number of logical expressions over zero-one variables. Consider solving a shortest-path problem over flights subject to ticketing restrictions that prevent flights of various pairs of different airlines from co-occurring, a so-called network flow problem with side-constraints (constraints that can’t be encoded in the network-flow framework). For example, no Air Apple flights on the same ticket as Air Banana flights. Using the flow formulation of the shortest-path problem from section 6, this could be expressed as a quadratic number of constraints x_a + x_b ≤ 1 for every pair of flights a ∈ A, b ∈ B. It would be more efficient to reduce the number of constraints by introducing a single zero-one variable u_A to represent whether there are any Air Apple flights on the ticket, similarly u_B for Air Banana. Then a single restriction u_A + u_B ≤ 1 suffices. What is required is a way to express that u_A is the logical inclusive-or over the activation of all Air Apple flights a ∈ A. The following constraints accomplish this:

u_A ≤ Σ_{a∈A} x_a
u_A ≥ x_a   (∀a ∈ A)

It is not hard to construct linear constraints that enforce any logical relationship between zero-one variables; implementations of the basic logic gates NOT, AND and IOR (inclusive OR) are given in figure 12. Since the combination of NOT and AND (or IOR) forms a universal set of logic gates, it is clear that any logical constraint over a set of binary variables can be constructed, though in many cases auxiliary variables must be introduced. For example, to encode z = XOR(x, y):

z = IOR(a, b)

a = AND(x, 1 − y)

b = AND(1 − x, y)


A = B                    a = b
A ⇒ B                    a ≤ b
A ⇐ B                    a ≥ b

A = NOT(B)               a + b = 1
A ⇒ NOT(B)               a + b ≤ 1
A ⇐ NOT(B)               a + b ≥ 1

A = IOR(B1 . . . Bk)     a − b1 − . . . − bk ≤ 0
                         a ≥ b1, . . . , a ≥ bk
A ⇒ IOR(B1 . . . Bk)     a − b1 − . . . − bk ≤ 0
A ⇐ IOR(B1 . . . Bk)     a ≥ b1, . . . , a ≥ bk

A = AND(B1 . . . Bk)     a − b1 − . . . − bk ≥ 1 − k
                         a ≤ b1, . . . , a ≤ bk
A ⇒ AND(B1 . . . Bk)     a ≤ b1, . . . , a ≤ bk
A ⇐ AND(B1 . . . Bk)     a − b1 − . . . − bk ≥ 1 − k

Figure 12: Linear constraints to enforce common logical relations between zero-one variables.

However it is important to understand that the constraints in figure 12 do not themselves restrict variables to zero-one values. In fact they allow a convex feasible set of intermediate values, as seen in figure 13. The logical constraint emerges only when decision variables are restricted to integer values. For this reason, linear relaxations of logical constraints can be quite uninformative, as in the 3-SAT example, and the manner in which logical constraints are encoded is quite important.
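As a quick sanity check of the figure 12 encodings (a sketch, not from the text), one can enumerate 0-1 assignments and verify that the IOR constraints admit exactly the IOR truth table, while fractional points slip through, which is precisely the point made above.

from itertools import product

def ior_constraints_hold(a, bs):
    """Constraints from figure 12 for A = IOR(B1..Bk): a - b1 - ... - bk <= 0 and a >= each b."""
    return a - sum(bs) <= 1e-9 and all(a >= b - 1e-9 for b in bs)

# Over 0-1 assignments the feasible points are exactly the IOR truth table...
for bs in product([0, 1], repeat=3):
    feasible_a = [a for a in (0, 1) if ior_constraints_hold(a, bs)]
    assert feasible_a == [max(bs)]          # a is forced to equal IOR(b1, b2, b3)

# ...but fractional points such as a = b1 = b2 = b3 = 0.4 also satisfy the inequalities,
# so the logical meaning only emerges under the integrality restriction.
print(ior_constraints_hold(0.4, (0.4, 0.4, 0.4)))   # True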

7.2.2 One-time costs

Many optimization problems involve a one-time cost for any non-zero use of a resource, plus a linear term. Mathematically, f(0) = 0, f(x > 0) = b + cx. The one-time cost b might be machine set-up time or the purchase price of equipment necessary for some process. This form of non-convex objective function does not fit into the linear programming framework, but can be accommodated in an integer program so long as the range of x is bounded, 0 ≤ x ≤ X. An auxiliary variable s is introduced, s = 0 if x = 0, s = 1 if x > 0, and the cost is rewritten f(x) = bs + cx.



Figure 13: Feasible sets enforcing the IOR and AND logic gates, for 0-1 variables. The valid combinations of logical input and output values are found at the (red) extreme points.

The relation between x and s can be expressed using linear inequalities: s ∈ {0, 1}, x/X ≤ s ≤ x/ε for small ε, as diagrammed in figure 14. Only s needs to be confined to integral values; the formulation works whether x is an integer or a real variable.


Figure 14: Linear constraints used to enforce s = 0 ⇔ x = 0, s = 1 ⇔ x > 0 over the domain 0 ≤ x ≤ 5. Red dots are integral elements of the feasible set.

However in this formulation the feasible set of the linear relaxation differs markedly from the unrelaxed case. For example, if X = 5 and x = 1, the relaxed solution will have s = 0.2 instead of s = 1, assessing only 20% of the fixed cost.

7.2.3 Configuration representation

In some problems logical constraints between integral decision variables are sufficiently complex that it is awkward to construct linear constraint equations of the sort in figure 12, and the result is often weak in the sense that the linear relaxation is uninformative. A possible solution is to consider a “dual” formulation where logical constraints are replaced by decision variables.

Consider a logical constraint among a subset x_1 . . . x_n of all the decision variables in an IP that constrains the consistent configurations (assignments) for these variables to a set R ⊂ X_1 × . . . × X_n. If R is sufficiently small it is practical to introduce new binary activation decision variables r_1 . . . r_|R|, one per consistent assignment. Let ρ_ij = z if x_i = z in assignment r_j, 0 otherwise. Then x_i = Σ_j r_j ρ_ij; such an equation is introduced for every relation a variable x_i appears in, and enforces consistency of assignments to a variable across the relations.

A well-known application of this idea is in error-correcting codes (coding for reliable data transmission over a noisy communication channel). Suppose one wishes to reliably transmit k_1 data bits over a communication channel that independently flips bits with probability p. To provide redundancy, k_2 extra parity-check bits are transmitted. Each parity-check bit is the XOR of a subset of the data bits, perhaps a small randomly chosen subset of size r; such transmission schemes are called Low Density Parity Check (LDPC) codes. The decoding problem is, upon receipt of k = k_1 + k_2 bits, to find a consistent decoding of maximal probability (one that minimizes the number of bit flips between the decoding and the received message). It is not difficult to formulate the decoding process as an integer program with k binary decision variables corresponding to the decoded message. The cost function penalizes decoded bits that differ from the received message. Logic gates of the sort in figure 12 can be used to encode the XOR restriction between the decoded parity and message decision variables.

The problem with this formulation is that the decoding process is quite expensive, since it requires solving a possibly large (depending on k) integer program. Simply solving the linear relaxation is not enough, as the linear constraints that encode the XOR gates will be insufficiently restrictive without the 0-1 constraint for the result to be informative.

The configuration representation solves this problem by replacing each XOR gate with 2^r boolean decision variables representing the consistent configurations of the r + 1 bits in each gate (r data and one parity). It turns out that with this formulation the linear relaxation is much more informative, and simply solving the linear problem and rounding the decision variables is a quite effective decoding scheme.

One advantage of the configuration representation is that it is clear how to improve the fidelity of the linear relaxation, at the expense of an exponentially growing problem size: one simply expands the subsets of variables configurations are computed over. In the limit, configurations are over all original decision variables, and the set R corresponds to the set of consistent assignments to the original integer problem. In this case the LP relaxation is trivially guaranteed to find the optimum answer, though of course for problems of any size this is impractical as it corresponds to brute-force enumeration of assignments.

Another advantage of the configuration representation is that it can handle non-linear objective function terms within the LP framework, if the domain of the nonlinearity is entirely within a configuration.

TODO: A simple diagram of LDPC encoding: 4 message bits, 3 parity bits connected (maybe 3-way each). Lines from underlying bits going out to transmission bits. Variable names on all bits. Actual IP for configuration representation for this problem.


7.3 Solving integer programs

Combinatorial problems with various special structures can be solved rapidly using dynamic programming or other techniques tailored to the problem. For all other cases, one must use a general-purpose IP solver. Usually these are implemented on top of an LP solver base.

IP solvers typically first relax integer constraints (for example, replacing an x ∈ {0, 1} constraint with 0 ≤ x ≤ 1), then solve the resulting linear program. If there is no solution or the solution is integral, the LP answer is returned. Otherwise some restriction must be added to the problem that eliminates the fractional solution but no integral ones. A simple implementation of this is to pick a variable x with fractional value v in the LP solution and then divide the space of that variable into two pieces, x ≤ ⌊v⌋ and x ≥ ⌈v⌉. Adding these constraints creates two LP sub-problems, which collectively cover the feasible set of the original integer problem, but which have eliminated some fractional solutions. The system can then recursively solve the sub-problems, eventually achieving an integral solution, if there is any. However solving the entire tree of sub-problems would be prohibitively expensive, so a branch-and-bound technique is used. For each sub-problem the LP solution value provides a lower bound (assuming minimization) on the best IP solution that can result. Thus, sub-problems are explored in order of their LP value. In some cases it is possible to stop an expensive search early and extract a (possibly sub-optimal) solution via a technique like rounding.
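The sketch below (not from the text) is a bare-bones version of this loop for a pure 0-1 minimization problem, using LP relaxations from SciPy's linprog with best-first exploration; real IP solvers add cutting planes, heuristics and far better branching rules.

import heapq
import math
import numpy as np
from scipy.optimize import linprog

def branch_and_bound(c, A_ub, b_ub, tol=1e-6):
    """Minimize c @ x over 0-1 vectors x with A_ub @ x <= b_ub, via LP relaxations."""
    n = len(c)
    best_val, best_x = math.inf, None
    heap = [(-math.inf, 0, {})]           # nodes: (LP bound, tie-breaker, fixed variables)
    counter = 1
    while heap:
        bound, _, fixed = heapq.heappop(heap)
        if bound >= best_val:             # cannot beat the incumbent
            continue
        bounds = [(fixed.get(j, 0), fixed.get(j, 1)) for j in range(n)]
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
        if not res.success or res.fun >= best_val:
            continue                      # infeasible or dominated sub-problem
        frac = [j for j in range(n) if min(res.x[j] - math.floor(res.x[j]),
                                           math.ceil(res.x[j]) - res.x[j]) > tol]
        if not frac:                      # integral LP solution: new incumbent
            best_val, best_x = res.fun, np.round(res.x)
            continue
        j = frac[0]                       # branch on the first fractional variable
        for v in (0, 1):
            heapq.heappush(heap, (res.fun, counter, {**fixed, j: v}))
            counter += 1
    return best_val, best_x

# Tiny invented example: a 0-1 knapsack-style problem (maximize 5x1 + 4x2 + 3x3).
c = np.array([-5.0, -4.0, -3.0])
A = np.array([[2.0, 3.0, 1.0]]); b = np.array([4.0])
print(branch_and_bound(c, A, b))          # expect x = (1, 0, 1), value -8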

There are more sophisticated techniques for dividing the search space than simply breaking a single variable into two regions, and most IP solvers provide the user with a wide variety of controls over the form of boundary planes introduced, stopping criteria, and other aspects of the search process.

For problems with particular structures it may be possible to derive an upper bound on the value of a sub-problem, in turn enabling the search procedure to prune other sub-problems with higher-valued lower bounds, but this is difficult to implement in general; it depends on a fast procedure to verify that a sub-problem contains at least one feasible (integer-valued) answer, an NP-complete problem in the general case.

It is clear that the efficiency of general-purpose IP solvers built on top of LP solvers depends on how close an approximation the LP relaxation is to the true problem, and on the detailed structure of the problem. Worst-case, an exponential number of LP sub-problems must be solved, limiting problem sizes to the tens of variables. Best-case, problems with tens of thousands or more variables can be explored to a provably optimal solution.

The details of how problems are formulated are important. Often there are many equivalent ways to express a problem as an integer program, some with substantially fewer variables and constraints than others. Some LP and IP solvers have a presolver that looks for obvious simplifications, such as a factorable problem, variables that are fixed or multiples of another, and so forth. For some problems pre-solvers offer substantial performance benefits, for others they don’t.


8 Relationship to Dynamic Programming

Dynamic programming (DP) and mathematical programming complement each other, as is befitting for the two most important optimization techniques in computer science. DP tends to be highly efficient when the decision variables of a problem, connected by their constraints, form linear or tree-like structures. But DP breaks down when the variables form more general structures. For those cases, linear and integer programming are often the best alternative. It is important to understand when each method is most appropriate.

8.1 Optimal decoding of a sequence

Consider the following problem: a traveler specifies various qualities of each of a sequence of flights, such as the flight number, airline, date, or endpoints, but not enough about each to uniquely determine the flights; decode this input to the most likely17 consistent flight sequence.

Assume that given the input, subroutines return the number of flights n and a sequence of sets of matching flights F^1 . . . F^n as well as sets of feasible connections C^1 . . . C^{n−1}. To simplify notation in the discussion, introduce artificial initial flight α and final flight ω.18


Figure 15: The flight network.

This decoding problem can be expressed using linear programming as that of finding a minimum-weight network flow, with a unit sourced at α and flowing to ω through flights and connections: see figure 15. Create decision variables x^t_i, i ∈ F^t for flights and y^t_ij, ij ∈ C^t for connections. Using weights w^t_i and w^t_ij to express the objective, and inflow and outflow constraints to relate the flights x to the connections y, the LP is:

minimize: Σ w^t_i x^t_i + Σ w^t_ij y^t_ij
variables: {x^t_i}, {y^t_ij}
subject to: x^0_α = 1
            x^t_j − Σ_{ij∈C^{t−1}} y^{t−1}_ij = 0   ∀t = 1 . . . n + 1, j ∈ F^t   (flow into x^t_j)
            x^t_i − Σ_{ij∈C^t} y^t_ij = 0   ∀t = 0 . . . n, i ∈ F^t   (flow out of x^t_i)
            y^t_ij ≥ 0

17 Perhaps defined as shortest total duration.
18 F^0 = {α}, C^0 = {αi | i ∈ F^1}, F^{n+1} = {ω}, C^n = {iω | i ∈ F^n}.


The optimal solution can be found using a generic LP solver like simplex but the computational expense is likely to be superlinear in n (recall that the simplex algorithm is often cubic in the number of constraints). Contrast this with the Viterbi algorithm, a Θ(n) dynamic programming algorithm for decoding sequences. The Viterbi algorithm incrementally constructs a table BestPath[t, i] of the lowest-weight flight subsequence ending at time t with flight i:19

BestPath[0, α] = ⟨0, {}⟩
for t = 1 . . . n + 1
    for j ∈ F^t
        let ⟨c⋆, p⋆⟩ = ⟨∞, {}⟩
        for i ∈ F^{t−1}
            let ⟨c, p⟩ = BestPath[t − 1, i]
            c = c + w^{t−1}_ij
            if c < c⋆
                ⟨c⋆, p⋆⟩ = ⟨c, p⟩
        BestPath[t, j] = ⟨c⋆ + w^t_j, p⋆ + {j}⟩
return BestPath[n + 1, ω]

TODO: show diagram of calculation of single value in table from previous entries, by showing trellis with previous values filled in and particular weights and result.

Clearly the dynamic-programming Viterbi algorithm is preferable to linear programming for this problem: not only is the Viterbi algorithm Θ(n), but the run-time constants are very small.
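For concreteness, a direct Python transcription of the pseudocode follows; the data structures for F, C and the weight tables, and the tiny instance, are assumptions about how the subroutines of section 8.1 might present their results.

import math

def viterbi(F, C, w_node, w_edge):
    """F[t]: flights at position t (F[0] = ['alpha'], F[n+1] = ['omega']);
    C[t]: set of feasible connections (i, j) from F[t] to F[t+1];
    w_node[t][i]: weight of flight i at position t; w_edge[t][(i, j)]: connection weight."""
    n = len(F) - 2
    best = {(0, F[0][0]): (0.0, [])}               # BestPath[0, alpha]
    for t in range(1, n + 2):
        for j in F[t]:
            c_star, p_star = math.inf, []
            for i in F[t - 1]:
                if (i, j) not in C[t - 1] or (t - 1, i) not in best:
                    continue
                c, p = best[(t - 1, i)]
                c += w_edge[t - 1][(i, j)]
                if c < c_star:
                    c_star, p_star = c, p
            if c_star < math.inf:
                best[(t, j)] = (c_star + w_node[t].get(j, 0.0), p_star + [j])
    return best[(n + 1, F[n + 1][0])]

# Minimal invented instance: two positions with two candidate flights each.
F = [['alpha'], ['AA10', 'UA22'], ['AA30', 'UA44'], ['omega']]
C = [{('alpha', 'AA10'), ('alpha', 'UA22')},
     {('AA10', 'AA30'), ('UA22', 'AA30'), ('UA22', 'UA44')},
     {('AA30', 'omega'), ('UA44', 'omega')}]
w_node = [{}, {'AA10': 2.0, 'UA22': 1.0}, {'AA30': 3.0, 'UA44': 1.0}, {}]
w_edge = [{('alpha', 'AA10'): 0.0, ('alpha', 'UA22'): 0.0},
          {('AA10', 'AA30'): 1.0, ('UA22', 'AA30'): 2.0, ('UA22', 'UA44'): 0.5},
          {('AA30', 'omega'): 0.0, ('UA44', 'omega'): 0.0}]
print(viterbi(F, C, w_node, w_edge))   # -> (2.5, ['UA22', 'UA44', 'omega'])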

8.2 Dynamic programming and graphical structure

The term dynamic programming is used to cover a wide variety of techniques and algorithms. Most have in common the construction of tables of optimal solutions for subproblems. The key to efficiency is whether these tables are small, which roughly speaking depends on the number of decision variables that connect the subproblem to the whole.

Consider a slight abstraction of the sequence decoding problem: the goal is to optimize a function over an ordered set of decision variables x_1 . . . x_n, x_i ∈ X_i,20 and the function depends linearly on adjacent pairs of decision variables. This is solvable via LP and even more efficiently by the Viterbi DP algorithm, as the flight decoding example demonstrated. Dynamic programming on this and most other problems can be thought of as a form of divide-and-conquer. A naive algorithm that enumerates all assignments to the set of decision variables will be exponential time because there are an exponential number of assignments, but this can be reduced to linear time using the divide-and-conquer strategy of constructing solutions to subsequences by recursively partitioning them and combining the answers to subproblems.

19 The Viterbi algorithm can actually be discovered by inspection of the dual of the network-flow LP formulation.
20 For discussion purposes, assume these are discrete variables over finite sets.



Figure 16: Divide-and-conquer applied to a generic sequence optimization problem. Decision variables form the vertices of the constraint graph; edges connect two variables if they co-occur in a constraint or a term of the objective function. A table T(x_i, x_k) of best solutions for the subsequence x_i . . . x_k is constructed by recursively solving x_i . . . x_j and x_{j+1} . . . x_k and combining the results. The table is over all assignments of values to (red) variables in the node boundary and each entry is constructed by iterating over assignments to the node boundaries of the subproblems (yellow).

Suppose as per figure 16 the solver’s current task is to find the optimal assignment for the decision variables x_i . . . x_k. Eventually this will be combined with assignments for the rest of the decision variables, and the value and validity of the combination will depend on the values of variables in the node boundary, those variables in the subproblem with outside edges, in this case x_i and x_k (red in figure 16). Therefore the solver must return not one assignment but rather a table T(x_i, x_k) of the best assignment and associated objective function value for every pair of values in X_i and X_k. This table is constructed by breaking the sequence x_i . . . x_k between any pair x_j and x_{j+1} and solving the two subproblems to produce tables T_1(x_i, x_j) and T_2(x_{j+1}, x_k). Finally, for each pair x_i and x_k the entry T(x_i, x_k) is computed by minimizing over possible assignments to variables in the subproblem node boundaries that are not in the parent node boundary, namely x_j and x_{j+1} (yellow in figure 16):

T(x_i, x_k) = min_{x_j ∈ X_j, x_{j+1} ∈ X_{j+1}} [ T_1(x_i, x_j) + T_2(x_{j+1}, x_k) + w_{x_j x_{j+1}} ].

Regardless of where sequences are split, divide-and-conquer constructs a total of n − 1 tables, each with |X|^2 entries (the size of the cross product of the domains of the (red) node boundary variables). The computation of each table entry requires minimizing over the Θ(|X|^2) cross product of the domains of the (yellow) variables that are in sub node boundaries but not the parent node boundary. Thus, the entire divide-and-conquer process has time complexity Θ(n|X|^4). The Viterbi algorithm is simply the divide-and-conquer algorithm optimized so that the set x_1 . . . x_n is always divided into the sets x_1 . . . x_{n−1} and x_n, which simplifies the process substantially and reduces computation to Θ(n|X|^2).

Consider now the case where each decision variable in the sequence participates in constraints or terms of the objective function with the two neighbors on either side instead of just one, as depicted in figure 17.

The size of the node boundaries increases by 2, so the number of entries in each table is multiplied by a factor of |X|^2, as is the computation for each table entry. In general, if each variable is connected to the κ variables on either side then the size of the node boundaries is 2κ and the divide-and-conquer algorithm has complexity Θ(n|X|^{4κ}), exponential in κ. The Viterbi algorithm, because of its more efficient division strategy, has complexity Θ(n|X|^{κ+1}), substantially better but still exponential in κ.


Figure 17: A more densely connected constraint graph.

This exponential dependence fundamentally limits the problems to which dynamic programming can be efficiently applied.

The parameter of efficiency κ can be defined for more general graphs and goes by such names as the graph’s tree width or induced width. The width depends on how the graph is decomposed into subproblems, but for many graphs there is no decomposition that results in small width, and for these dynamic programming is inherently impractical. This is generally the case for densely connected graphs of decision variables: see figure 18.


Figure 18: Constraint graphs with sample top-level decomposition. Running time of DP is exponential in the number of (red) decision variables in the node boundaries. (a), (b) and (c) are amenable to DP though the practicality of (c) depends on the height being fixed at a small constant; compare with (f). Even moderate-sized bipartite graphs, cliques and grids usually have too large boundary sets for DP to be practical.


8.3 Alternatives

If dynamic programming can be used to solve a combinatorial problem, it usually should be. For other problems, solving the relaxed form of the problem with linear programming often provides good bounds in polynomial time and can be a good guide for integer programming methods. Indeed, linear programming is a popular method for solving many combinatorial problems over complex constraint graphs, such as image reconstruction (grid graphs) and decoding of error-correcting codes (sparse bipartite graphs, see section 7.2.3).

Sometimes both DP and LP are too expensive. A wide variety of faster optimization algorithms are available that work even on graphs of high tree width, but they usually offer few guarantees: they may not converge, or may not converge to the best answer. Good algorithms to explore go under such names as mean field and belief propagation. Typically fast algorithms maintain a value or distribution over values for each decision variable and perform parallel local updates based on the values of neighbors in the constraint graph, without any global considerations (contrast with the global information in the B^{-1} matrix of simplex); they often suffer from local minima problems: they perform well only when the objective function is convex or the starting point is near the best answer.

For combinatorial problems of small to medium size other good alternatives to explore are general graph-search methods such as those typically discussed in the constraint-processing and satisfiability literature, which are based on brute-force search but with extra cleverness about how variable selection and backtracking take place, early pruning of search options and so forth. AI textbooks often cover such algorithms; a good reference to combinatorial search and the ramifications of the structure of the constraint graph is Constraint Processing by Dechter.

8.4 Dynamic programming for continuous variables

It may not be immediately obvious how to apply dynamic programming in the case where the constraint graph has low tree width but the decision variables are continuous. The Viterbi and divide-and-conquer algorithms build a table of the best solution for every assignment to variables on the node boundary, and the number of assignments would seem to be uncountable for continuous variables.

However it is not actually necessary to store solutions for all possible assignments to the variables in the node boundary. In fact all that is necessary is to construct the set of extreme points, the points at the corners of the convex hull of the feasible set over the node boundary (and objective function). Any point in the interior of the feasible set can be expressed as a convex combination of these points. If the number of variables in the node boundary is small, the number of extreme points is usually also small.

Depending on exactly how this concept is implemented, one arrives at a variety of algorithms, some with names. The Dantzig-Wolfe decomposition algorithm (section 9.2) can be used as a divide-and-conquer algorithm that divides the decision variables of an LP into multiple sets, solves each to find extreme points, and then combines the results. Fourier elimination (aka Fourier-Motzkin elimination) is a scheme for iteratively eliminating a variable at a time from an LP and can be thought of as the Viterbi algorithm for LPs. Each time a variable is eliminated the constraints it is part of are updated. Fourier elimination is very inefficient in general but can be quite effective for low-tree-width graphs. The algorithm is simple and can be found in books on linear programming.

If the decision variables in an LP are sequentially structured then certain general-purpose LP algorithms are particularly efficient and may substantially outperform simplex; for more information see p. 440 of Bertsimas and Tsitsiklis.


9 Solving Large Structured Problems

Large linear programs can be impractical to solve using generic LP algorithms, and in some cases even impractical to generate. Various faster algorithms have been developed for problems with special structures, and two are particularly important because they can often be applied to large real-world problems and can result in speedups of several orders of magnitude.

The first, delayed column generation, is appropriate for problems where there are many columns in the constraint matrix, with a regular structure to them. An example would be the computer purchase problem (section 2.1) if computers could be custom ordered, so that the set of possibilities was very large but simply described.

The second, Dantzig-Wolfe decomposition, applies when an LP is “almost” simple, in the sense that if some constraints or objective function terms were removed there would be a very fast routine for finding the optimal solution. Dantzig-Wolfe layers an LP solver that accounts for the problematic constraints on top of the efficient procedure.

9.1 Delayed column generation

9.1.1 Crew scheduling

Consider a much-simplified form of the crew-scheduling problem: an airline needs to allocate pilots to flights, minimizing the total number of pilots they employ. Pilots are hired to live at hub cities, but they may overnight away. To simplify operations the airline uses a repeating two-day schedule that ensures each pilot returns to their home at least every other evening with plenty of time to sleep and be ready to handle any route the next day. Thus every day the available pilots at a hub either idle, fly a one-day circle, or start a two-day circle.

The total assignment must guarantee that every flight in the airline’s network is captained, though dead-heading (over-coverage) is allowed. Individual routes are subject to a variety of union work rules (maximum legs per day, maximum air time, minimum overnight away rest time and so forth). Routes can be classified by their hub and whether they depart and return on even or odd days.

Let us formulate this problem as an IP. There are a few different things going on. First, every flight must be covered, suggesting a constraint for every flight in the two-day cyclic network. Second, individual routes must be selected, suggesting either a boolean indicator variable for every route or an integral count of how many pilots fly the route (in this example, the only reason for multiple pilots to fly the same route would be a case where dead-heading is necessary to return somebody to their hub). A fixed number of pilots are employed and transfer between routes at hubs, suggesting per-hub flow conservation equations.

Here’s one formulation, which assumes the presence of one-day “idle” routes for each hub and day consisting of no flights:

43

Page 44: Mathematical Programming: Introduction for ITA …...While mathematical programming is second only to DP in terms of most useful algorithmic techniques to know, most introductory algorithms

minimize: Σ_r (δ_rde + δ_rdo δ_rae) n_r   (total pilots, summing over even-day activities)
variables: n_r ∀r   (number flying route r)
subject to: Σ_r δ_rf n_r ≥ 1   ∀f (flight coverage)
            Σ_r δ_rb δ_rao n_r = Σ_r δ_rb δ_rde n_r   ∀b (even night pilot conservation)
            Σ_r δ_rb δ_rae n_r = Σ_r δ_rb δ_rdo n_r   ∀b (odd night pilot conservation)
            n_r ∈ {0, 1}   ∀r (zero-one IP)

where δ_rf = 1 if route r includes flight f, 0 otherwise
      δ_rb = 1 if route r departs and arrives hub b, 0 otherwise
      δ_rde = 1 if route r departs on even day, 0 otherwise
      δ_rdo = 1 if route r departs on odd day, 0 otherwise
      δ_rae = 1 if route r arrives on even day, 0 otherwise
      δ_rao = 1 if route r arrives on odd day, 0 otherwise

This is a very common structure for scheduling problems: variables for each unit of schedule assignment (here, a one or two-day flight route) coupled with coverage constraints (here, for each flight) and conservation equations (here, for each night). Each column of A is a route, which can be thought of as a resource bundle for covering flights.

In this problem, the number of flight constraints should be reasonable (over two days, several thousand flights perhaps), so m should not be a limiting factor for any algorithm. But what about n: how many possible one or two-day circle routes are there that a pilot could fly? Depending on the details of the airline’s route structure, the number could be in the millions, and exponentially higher if many-day routes are considered! Just generating a list of them could be very expensive, and feeding them to an LP or IP solver is likely impractical, even though the final solution will include a very small portion of them.

9.1.2 Column generation guided by reduced costs

At this point, it helps to recall the basic mechanism of the simplex algorithm as presented in section 4.3. Simplex maintains an m × m matrix B^{-1}. The larger m × n matrix A is only used as a source of possible columns, and columns are only relevant if they allow for an optimization improvement, that is, they have negative reduced cost. Furthermore, as a column is an optional resource bundle, not having it at any stage can reduce the quality of the final solution but never result in an answer outside the feasible set.

For large planning problems of this sort, delayed column generation can be used. In delayed column generation, the simplex algorithm is modified so that A exists only implicitly, never explicitly. A set of m or more initial columns is generated and used to start the algorithm. Then, at the point when the set of non-basic candidate replacement columns A_j would normally be rated by their reduced cost, κ_j = c_j − c_B^T B^{-1} A_j, the candidate columns are instead generated on-the-fly using some method that knows how to target possibilities that would have negative reduced cost.

The −c_B^T B^{-1} term of the reduced cost, known as the row price vector, is a simple m-dimensional vector that assigns a value to every resource in the resource bundle in a manner dependent on the current solution, reflecting the total cost change if more of that resource were consumed, accounting for other changes to remain within the feasible set. That’s the magic of the matrix B^{-1}: it captures the global ramifications of local changes! In the context of the crew scheduling problem, the row price vector −c_B^T B^{-1} assigns a value (positive or negative) to every flight (as well as to route endpoint constraints). If a route can be constructed on the fly that has net negative cost, expressed as a constant (c_j) plus a sum over per-flight and endpoint row prices, it can be swapped into the basis set; otherwise the algorithm has converged to the best solution.

Thus, the problem of the exponentially large set of columns A is replaced with the problem of generating (in a subroutine) a negative-weight sequence of flights that satisfies work rules, given an arbitrary weight table. That’s effectively a shortest-path problem, and can be solved easily and efficiently.
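A structural sketch of the column-generation loop follows. It uses the classic cutting-stock problem rather than crew pairing, because its pricing subroutine (a tiny knapsack enumeration) fits in a few lines; the instance data is invented, and reading the duals from res.ineqlin.marginals assumes a recent SciPy with the HiGHS backend (otherwise the row prices can be estimated by perturbing the demands, as in section 6.1).

import itertools
import numpy as np
from scipy.optimize import linprog

# Toy cutting-stock instance (invented): cut stock rolls of width 10 into pieces of
# widths 3, 5 and 7 to meet the demands, using as few rolls as possible.
widths = np.array([3.0, 5.0, 7.0])
demand = np.array([25.0, 20.0, 18.0])
W = 10.0

def price(duals):
    """Pricing subroutine: brute-force the small knapsack for the pattern (column)
    with the most negative reduced cost 1 - duals @ pattern, or None if there is none."""
    best, best_rc = None, -1e-9
    for counts in itertools.product(range(4), repeat=3):
        a = np.array(counts, dtype=float)
        if a @ widths <= W and 1.0 - duals @ a < best_rc:
            best, best_rc = a, 1.0 - duals @ a
    return best

# Start from trivial patterns that cut a single piece per roll, then add columns on demand.
columns = [np.array(col) for col in np.eye(3)]
while True:
    A = np.column_stack(columns)
    res = linprog(np.ones(A.shape[1]), A_ub=-A, b_ub=-demand, method="highs")
    duals = -res.ineqlin.marginals        # row prices of the "meet demand" constraints
    new_col = price(duals)
    if new_col is None:
        break                             # no column with negative reduced cost: optimal
    columns.append(new_col)

print(res.fun, [c.tolist() for c in columns])   # LP-relaxation rolls used, patterns generated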

The dual of delayed column generation is delayed constraint generation, appropriate for the case where the number of constraints m is very large but few are likely to be relevant. In such a case, the problem can be solved with a small initial set of constraints, and if the solution is infeasible under the larger set, one of the violated constraints is added. If delayed constraint generation is necessary but a solver only supports the incremental addition of columns, not rows, then it may be best to work in the dual formulation.

Structurally similar but substantially more complicated IP formulations are commonly used to solve airline crew-scheduling problems, usually with some form of delayed column generation. See the Yu book for more details.

9.1.3 Implementation

Most solvers don’t offer integrated support for delayed column generation in the form of a call-out to a user-supplied procedure that accepts a table of reduced costs and returns a possible column. However, many offer the next best thing, namely library calls that can be conjoined to extract reduced costs from a solution, then add a column to A and re-solve starting from the previous basis matrix. With no initialization costs, this is no less efficient than support for a call-out procedure.

It may seem daunting that the simplex algorithm must be initialized with a consistent basis matrix B, seemingly implying that at least m linearly independent columns must be generated before delayed column generation can be started. But for many problems involving inequalities the starting basis set can consist of the slack variables and B initialized to the identity matrix. In other problems it is trivial to generate a single sub-optimal column capable of satisfying the constraints, and this can be combined with m − 1 slack variables.


9.2 Dantzig-Wolfe decomposition

Consider the problem of finding the shortest path through a network of flights, each of which has a price, under the constraint that the total price is less than some fixed constant; this is an example of a CSPP (constrained shortest-path problem). Solving CSPPs is NP-hard in general.

The global price constraint prevents the application of ordinary fast shortest-path algorithms (price must be included in any dynamic-programming state, and the number of cumulative prices is exponential). It can be expressed as an LP by converting the shortest-path problem to a network flow and then adding the price as an additional linear side-constraint, though there is of course no longer a guarantee that the relaxed solution will be integral. However converting all the flights in a big network into decision variables would result in a huge LP or IP and seems a very extravagant way to attack a problem that is just a small variation on shortest-path.

It would be nice to be able to apply linear programming techniques to this problem but maintain the efficiencies of traditional SP algorithms. The Dantzig-Wolfe decomposition technique lets us do this.

Suppose (working here in standard form) the constraints of an LP are divided into two sets, a relatively small number of “global” ones Gx = g that make the problem difficult and a potentially larger number of “local” ones Lx = l that are manageable:

minimize: c^T x = Σ_i c_i x_i
variables: x = {x_i}
subject to: Gx = g
            Lx = l
            x ≥ 0

Dantzig-Wolfe decomposition depends on there being a specialized algorithm for solving the reduced linear program L(κ) that results from dropping the global constraints and replacing the cost vector c with an arbitrary new vector κ. That is, L(κ) is the problem of minimizing κ^T x over the feasible set L = {x | Lx = l, x ≥ 0}.

In the example, L(κ) is the shortest-path problem without the price constraint; it can be solved rapidly by any ordinary SP algorithm. The result will be an extreme point of L. Any point in the interior of a polyhedron can be expressed as a convex combination of extreme points, so if l^1 . . . l^z are the extreme points of L, then any feasible x can be expressed x = α_1 l^1 + . . . + α_z l^z, where each l^j is an n-dimensional vector. This lets us restate the original problem as an LP over the convex combination parameters α:

minimize: c^T x = Σ_{j=1..z} (Σ_i c_i l^j_i) α_j
variables: α_1 . . . α_z
subject to: G(α_1 l^1 + . . . + α_z l^z) = g
            Σ_j α_j = 1
            α ≥ 0

This new master linear program M may have a very large number of decision variables (columns), since the number of extreme points of a linear convex set can be exponential in the number of dimensions. So what is the advantage of this formulation? Delayed column generation!

Consider the master LP M. The number of columns is very large, but the number of constraints is just m = rows(G) + 1, potentially very small; indeed, in the shortest-path example m = 2. Therefore, using delayed column generation, only 2 extreme points need be maintained at a time and solving the master LP is extremely easy and fast. Using simplex to solve the master generates as a by-product the m-dimensional vector of row prices p, with p1 . . . pm−1 reflecting the global constraints and pm the convexity condition.

From the row prices p the DW method computes a revised cost vector κ to pass to the local problem solver, so that solving L(κ) is equivalent to finding the extreme point with the best reduced cost for the original problem: κ = c − (p1 . . . pm−1)G. The specialized local problem solver is run, generating a new column. As with any variation of delayed column generation, if the column does not have negative reduced cost the algorithm terminates; otherwise the column is added to the master problem and simplex is continued.
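A sketch of one such pricing iteration for the shortest-path example, again against hypothetical interfaces (master.row_prices and master.add_column are illustrative, not a real library's API; shortest_path is the user-supplied local solver, which would use Bellman-Ford or another label-correcting method since the adjusted costs κ may be negative even when the original costs are not):

    # One round of Dantzig-Wolfe pricing for the constrained shortest-path example.
    # master: restricted master LP over the alpha variables.
    # G: matrix of global (price) constraints; c: original cost vector over arcs.
    # shortest_path(kappa): returns an extreme point of L, i.e. the 0/1 incidence
    # vector of a shortest path under arc costs kappa.
    import numpy as np

    def dantzig_wolfe_step(master, G, c, shortest_path):
        p = master.row_prices()          # p[:-1]: global constraints, p[-1]: convexity row
        kappa = c - G.T @ p[:-1]         # revised cost vector for the local problem
        l = shortest_path(kappa)         # candidate extreme point (a path)
        reduced_cost = c @ l - p[:-1] @ (G @ l) - p[-1]
        if reduced_cost >= 0:            # no improving column: master is optimal
            return None
        master.add_column(cost=c @ l, column=np.append(G @ l, 1.0))
        return l                         # caller re-solves the master and repeats

After each added column the master is re-solved from the previous basis, exactly as in ordinary delayed column generation.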

For the shortest-path example, each run of L(κ) generates an integral path. Convex combinations of these paths are taken to solve the globally constrained problem. The best combination may or may not be integral, though integer-programming techniques can of course be applied (this is not as simple as imposing integrality constraints on the master problem M; it is not α that must be integral but x).

Dantzig-Wolfe decomposition can be extremely efficient if the local algorithm is fast. In many cases there are ways for the local algorithm to take advantage of the fact that κ may change only slightly from iteration to iteration.

It would appear that the master problem must be initialized with a set of m + 1 extreme points, so that B is well-defined, but there are often ways around this (section 9.1.3).

The decomposition part of the Dantzig-Wolfe decomposition name comes from the fact that the technique can be applied equally well if the decision variables can be partitioned into a number of sets which each have a local solver, coupled by global constraints. An example would be generating two otherwise independent paths subject to a constraint that the total path length is less than some constant.



10 References

There are many, many books about linear and integer programming and more general aspects of convex optimization, as well as other topics covered in this discussion. Some recommendations:

• Introduction to Linear Optimization, by Bertsimas and Tsitsiklis. Good, easily readable intro to LP and IP including theory, interior point methods, and network flow. Some discussion of implementation issues.

• Introduction to Operations Research, by Hillier and Lieberman. Standard introductory operations research text; covers problem formulation in depth with many examples, LP, IP, simplex, network problems, QP and general nonlinear optimization, dynamic programming, decision analysis and other topics. Good applied text, little mathematics required.

• Convex Optimization, by Boyd and Vandenberghe. Excellent book on more general forms of convex optimization including quadratic and semi-definite programming. Lots of examples, but mathematical. On the web at www.stanford.edu/~boyd/cvxbook/bv_cvxbook.pdf.

• Nonlinear Programming, by Bertsekas. Coverage of more general algorithms for convex optimization that do not assume the objective function and constraints are linear or quadratic.

• Introduction to Algorithms, 2nd Ed., by Cormen, Leiserson, Rivest, Stein. Brief explanation of simplex; coverage of dynamic programming, network flow and matching algorithms.

• Network Flows, by Ahuja, Magnanti, Orlin. Specialized coverage of network flow algorithms, spanning trees and related topics.

• Numerical Recipes, by Press, Flannery, Teukolsky, Vetterling. Lots of material on concepts and implementation of numerical algorithms.

• Constraint Processing, by Dechter. Explanation of combinatorial search algorithms for constraint problems and conditions for efficient dynamic programming.

• Operations Research in the Airline Industry, edited by Yu. Chapters on many aspects of OR relating to the airline industry, including crew scheduling and revenue management, providing detailed examples of the application of LP and IP.

• The Theory and Practice of Revenue Management, by Talluri and van Ryzin. Comprehensive intro to RM.

• The Elements of Statistical Learning, by Hastie, Tibshirani, Friedman. Good overview of machine learning and statistics, covering topics related to sections 2.3 and 5.

Likewise, there are almost countless software packages implementing various LP, IP, QP and SDP algorithms, as well as (usually less efficient) algorithms for more general continuous convex problems. There are many packaged algorithms specialized for particular problem structures such as matchings, the transportation problem and general network flows.



• CPLEX: The de-facto standard for commercial LP and IP solvers is CPLEX by ILOG, which for many problems is substantially better than free alternatives (faster, numerically more stable, capable of handling bigger problem sizes, capable of parallel processing across a cluster, et cetera). CPLEX comes with a wide variety of solvers tailored for various use cases as well as analysis and visualization tools.

• GLPK (GNU Linear Programming Kit): A free LP and IP solver with a programming API and a command-line program that reads and writes some standard formats, including a variant of the AMPL problem-specification language. Easy to install and use as a library; a good choice for the problem set.

• Microsoft Excel: Excel and other commercial spreadsheet programs contain robust and flexible linear and non-linear solvers suitable for small to medium sized problems; Excel can solve non-linear problems with arbitrary objective functions using gradient-based search procedures, though of course convergence is only to a local optimum and thus the system can be highly sensitive to starting values.

• MATLAB: Many people like to use MATLAB; MATLAB doesn’t come with LP or IP solvers by default, though the Optimization Toolbox extension offers LP, QP and IP solvers, as well as more general convex optimizers.
