
CSCI 5654 (Linear Programming, Fall 2013), Lectures 10-12

Lectures 10,11 Slide# 1


Today’s Lecture

1. Introduction to norms: L1, L2, L∞.

2. Casting absolute value and max operators.

3. Norm minimization problems.

4. Applications: fitting, classification, denoising, etc.

Lectures 10,11 Slide# 2


Norm

Basic idea: a notion of length of a vector and of distance between vectors.

Definition (Norm): Let ℓ : X → R≥0 be a mapping from a vector space to the non-negative reals. The function ℓ is a norm iff

1. ℓ(~x) = 0 iff ~x = ~0,

2. ℓ(a~x) = |a| ℓ(~x),

3. ℓ(~x + ~y) ≤ ℓ(~x) + ℓ(~y) (Triangle Inequality).

“Length” of ~x : ℓ(~x).

“Distance” between ~x , ~y is ℓ(~y − ~x)(= ℓ(~x − ~y)).

Lectures 10,11 Slide# 3


L2 (Euclidean) norm

Distance between (x1, y1) and (x2, y2) is √((x1 − x2)^2 + (y1 − y2)^2).

In general, ||~x||2 = √(~x^t ~x). The distance between ~x, ~y is given by ||~x − ~y||2.

Lectures 10,11 Slide# 4


Lp norm

For p ≥ 1, we define Lp norm as

||~x||p = (|x1|^p + |x2|^p + . . . + |xn|^p)^(1/p) .

The Lp norm distance between two vectors is ||~x − ~y||p.

L1 norm: ||~x||1 = |x1| + |x2| + . . . + |xn|.

Class exercise: verify that the L1 norm distance is indeed a norm.

Q: Why is p ≥ 1 necessary?

Lectures 10,11 Slide# 5


L∞ norm

Obtained as the limit lim_{p→∞} ||~x||p.

Definition: For a vector ~x, ||~x||∞ is defined as

||~x ||∞ = max(|x1|, |x2|, . . . , |xn|) .

Ex-1: Can we verify that L∞ is really a norm?

Ex-2 (challenge): Prove that the limit definition coincides with the explicit form of the L∞ distance.

Lectures 10,11 Slide# 6


Norm: Exercise

Take vectors ~x = (0, 1, −1), ~y = (2, 1, 1).

Write down the L1, L2, L∞ norm distances between ~x, ~y.
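A quick numeric check of this exercise (a small Python/NumPy sketch added here for illustration; the course's own experiments use Matlab):

    import numpy as np

    x = np.array([0.0, 1.0, -1.0])
    y = np.array([2.0, 1.0, 1.0])
    d = x - y                          # difference vector (-2, 0, -2)

    print(np.linalg.norm(d, 1))        # L1 distance: 4.0
    print(np.linalg.norm(d, 2))        # L2 distance: 2.828... (= sqrt(8))
    print(np.linalg.norm(d, np.inf))   # L-infinity distance: 2.0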

Lectures 10,11 Slide# 7


Norm Spheres

ε-Sphere: {~x | d(~x, 0) ≤ ε} for a norm d. The 1-spheres corresponding to the norms L1, L∞, L2:

[Figure: the unit (1-)spheres of the L2, L1, and L∞ norms.]

Note: Norm spheres are convex sets.

Lectures 10,11 Slide# 8


Norm Minimization problems

General form:

minimize ||~x||p
s.t. A~x ≤ ~b

Note-1: We always minimize the norm functions (in this course).

Note-2: Maximization of a norm is generally a hard (non-convex) problem.

Lectures 10,11 Slide# 9


Unconstrained Norm Minimization

Let's first study the following problem:

min. ||Ax − b||p .

1. For p = 2, this problem is called (unconstrained) least squares. We will solve this using calculus.

2. For p = 1, ∞, this problem is called L1 (L∞) least squares. We will reduce this problem to LP.

3. Applications: solving (noisy) linear systems, regularization, denoising, max. likelihood estimation, regression and so on.

Lectures 10,11 Slide# 10


Unconstrained Least Squares

Decision Variables: ~x = (x1, . . . , xn) ∈ Rn.

Objective function: min. ||A~x − ~b||2.

Constraints: No explicit constraint.

Lectures 10,11 Slide# 11


Least Squares: Example

Example: min. ||(2x + 3y − 1, y)||2.

Solution: We will equivalently minimize (2x + 3y − 1)^2 + y^2. Setting the partial derivatives to zero (the first-order optimality conditions), we obtain:

∂/∂x [(2x + 3y − 1)^2 + y^2] = 0
∂/∂y [(2x + 3y − 1)^2 + y^2] = 0

In other words, we obtain:

4(2x + 3y − 1) = 0
6(2x + 3y − 1) + 2y = 0

The optimum lies at y = 0, x = 1/2.

Verify that this is a minimum by checking the second-order derivative (the Hessian matrix).

Lectures 10,11 Slide# 12


Unconstrained Least Squares

Problem: minimize ||A~x − ~b||2.

Solution: We find the minimum using calculus (partial derivatives). Recall:

∇f (x1, . . . , xn) = (∂x1f , ∂x2f , . . . , ∂xn f ) .

The criterion for an optimal point of f(~x) is that ∇f = (0, . . . , 0).

∇(||A~x − ~b||2^2) = ∇((A~x − ~b)^t (A~x − ~b))
                   = ∇(~x^t A^t A ~x − 2 ~x^t A^t ~b + ~b^t ~b)
                   = 2 A^t A ~x − 2 A^t ~b
                   = 0

The minimum occurs at ~x* = (A^t A)^{-1} A^t ~b (assume A is “full rank”).

Similar solution for Lp norm for even number p.
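As an illustration (a minimal NumPy sketch, not part of the original slides), the Page 12 example can be checked numerically against this formula:

    import numpy as np

    # Page 12 example: minimize ||(2x + 3y - 1, y)||_2, i.e. ||A z - b||_2
    A = np.array([[2.0, 3.0],
                  [0.0, 1.0]])
    b = np.array([1.0, 0.0])

    # Normal equations: z* = (A^t A)^{-1} A^t b  (A assumed to have full column rank)
    z_star = np.linalg.solve(A.T @ A, A.T @ b)
    print(z_star)                                    # approximately [0.5, 0.0]

    # Numerically preferable equivalent:
    z_lstsq, *rest = np.linalg.lstsq(A, b, rcond=None)
    print(z_lstsq)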

Lectures 10,11 Slide# 13


L1 norm minimization

Decision Variables: ~x = (x1, . . . , xn) ∈ Rn.

Objective function: min. ||A~x − ~b||1.

Constraints: No explicit constraint.

We will reduce this to a linear program.

Lectures 10,11 Slide# 14


Example

Example: min. ||(2x + 3y − 1, y)||1 = |2x + 3y − 1| + |y|.

Trick: Let t1 ≥ 0, t2 ≥ 0 be s.t.

|2x + 3y − 1| ≤ t1
|y| ≤ t2

Consider the following problem:

min. t1 + t2
|2x + 3y − 1| ≤ t1
|y| ≤ t2
t1, t2 ≥ 0

Goal: convince the class that this problem is equivalent to the original.

Note: |y| ≤ t2 can be rewritten as y ≤ t2 ∧ −y ≤ t2.

Lectures 10,11 Slide# 15


Example: LP formulation

min. t1 + t2
2x + 3y − 1 ≤ t1
−2x − 3y + 1 ≤ t1
y ≤ t2
−y ≤ t2
t1, t2 ≥ 0

Solution: x = 1/2, y = 0.

Q: Can we repeat the same trick if |y| ≥ t2?

A: No. A disjunction of inequalities would be needed.

Lectures 10,11 Slide# 16


L1 norm minimization

Problem: min. ||A~x − ~b||1.

Solution: Let A be an m × n matrix. Add variables t1, . . . , tm corresponding to the m rows of A.

min. Σ_{i=1}^m ti
|Ai ~x − bi| ≤ ti, for i = 1, . . . , m
t1, . . . , tm ≥ 0

Equivalently:

min. Σ_{i=1}^m ti
A~x − ~b ≤ ~t
−A~x + ~b ≤ ~t
t1, . . . , tm ≥ 0

We write it in the general form:

max. −~1^t ~t
A~x − ~t ≤ ~b
−A~x − ~t ≤ −~b
t1, . . . , tm ≥ 0
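A sketch of this reduction using SciPy's LP solver (the helper name l1_min and the use of scipy.optimize.linprog are ours, not part of the slides):

    import numpy as np
    from scipy.optimize import linprog

    def l1_min(A, b):
        """Solve min_x ||A x - b||_1 via the LP reduction above (sketch)."""
        m, n = A.shape
        # decision vector z = (x_1..x_n, t_1..t_m); objective is sum of t_i
        c = np.concatenate([np.zeros(n), np.ones(m)])
        # A x - t <= b   and   -A x - t <= -b
        A_ub = np.block([[A, -np.eye(m)],
                         [-A, -np.eye(m)]])
        b_ub = np.concatenate([b, -b])
        bounds = [(None, None)] * n + [(0, None)] * m
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
        return res.x[:n], res.fun      # optimal x and the L1 norm value

    # Page 16 example: minimize |2x + 3y - 1| + |y|  ->  x = 0.5, y = 0
    x_opt, val = l1_min(np.array([[2.0, 3.0], [0.0, 1.0]]), np.array([1.0, 0.0]))
    print(x_opt, val)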

Lectures 10,11 Slide# 17


L∞ norm minimization

Decision Variables: ~x = (x1, . . . , xn) ∈ Rn.

Objective function: min. ||A~x − ~b||∞.

Constraints: No explicit constraint.

We will reduce this to a linear program.

Lectures 10,11 Slide# 18


Example

Example: min. ||(2x + 3y − 1, y)||∞ = min. max(|2x + 3y − 1|, |y|).

Trick: Let t ≥ 0 be s.t.

|2x + 3y − 1| ≤ t

|y | ≤ t

Consider the following problem:

min. t

|2x + 3y − 1| ≤ t

|y | ≤ t

t ≥ 0

Goal: convince the class that this problem is equivalent to the original.

Note: |y| ≤ t can be rewritten as −t ≤ y ≤ t.

Lectures 10,11 Slide# 19


Example: LP formulation

min. t

2x + 3y − 1 ≤ t

−2x − 3y + 1 ≤ t

y ≤ t

−y ≤ t

t ≥ 0

Solution: x = 1/2, y = 0.

Lectures 10,11 Slide# 20


L∞ norm minimization

Problem: min. ||A~x − ~b||∞.

Solution: Let A be an m × n matrix. Add a single variable t bounding every row.

min. t
|Ai ~x − bi| ≤ t, for i = 1, . . . , m
t ≥ 0

Equivalently:

min. t
A~x − ~b ≤ t · ~1
−A~x + ~b ≤ t · ~1
t ≥ 0

We write it in the general form:

max. −t
A~x − t · ~1 ≤ ~b
−A~x − t · ~1 ≤ −~b
t ≥ 0
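The analogous SciPy sketch for the L∞ reduction (linf_min is again a hypothetical helper name, not from the slides):

    import numpy as np
    from scipy.optimize import linprog

    def linf_min(A, b):
        """Solve min_x ||A x - b||_inf via the LP reduction above (sketch)."""
        m, n = A.shape
        c = np.concatenate([np.zeros(n), [1.0]])       # minimize the single variable t
        ones = np.ones((m, 1))
        A_ub = np.block([[A, -ones],
                         [-A, -ones]])                  # A x - t*1 <= b,  -A x - t*1 <= -b
        b_ub = np.concatenate([b, -b])
        bounds = [(None, None)] * n + [(0, None)]
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
        return res.x[:n], res.fun

    # Page 20 example: minimize max(|2x + 3y - 1|, |y|)  ->  x = 0.5, y = 0
    x_opt, t_opt = linf_min(np.array([[2.0, 3.0], [0.0, 1.0]]), np.array([1.0, 0.0]))
    print(x_opt, t_opt)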

Lectures 10,11 Slide# 21


Application-1: Estimation of Solutions

Lectures 10,11 Slide# 22


Application-1: Solution Estimation

Problem: We are trying to solve the linear equation

A~x = ~b

1. The entries in A,~b could have random errors (noisy measurements).

2. There are many more equations than variables.

[Figure: block diagram: ~x is mapped through A, and noise is added to A~x to produce ~b.]

Lectures 10,11 Slide# 23


Application: Estimation

Example Problem: We are trying to measure a quantity x using noisy measurements:

2x = 5.1
3x = 7.6
4x = 9.91
5x = 12.45
6x = 15.1

Q-1: Is there a value of x that explains the measurements?

Q-2: What do we do in this case?

Lectures 10,11 Slide# 24


Application: Estimation

Example Problem: We are trying to measure a quantity x using noisy measurements:

2x = 5.1
3x = 7.6
4x = 9.91
5x = 12.45
6x = 15.1

Find x that nearly fits all the measurements, i.e.,

min.||(2x − 5.1, 3x − 7.6, 4x − 9.91, 5x − 12.45, 6x − 15.1)||p .

We can choose p = 1, 2,∞ and see what answer we get.

Lectures 10,11 Slide# 25


Application: Estimation

L2 norm: Apply least squares with

A = (2, 3, 4, 5, 6)^t, ~b = (5.1, 7.6, 9.91, 12.45, 15.1)^t.

Solving for x = argmin ||Ax − ~b||2 yields x ≈ 2.505.

Residual error: (0.09, 0.08, −0.11, −0.08, −0.06) .

Lectures 10,11 Slide# 26


Application: Estimation

L1 norm: Reduce to a linear program using

A = (2, 3, 4, 5, 6)^t, ~b = (5.1, 7.6, 9.91, 12.45, 15.1)^t.

Solution: x = 2.49.
Residual error: (−0.12, −0.13, −0.05, 0.0, 0.16) .

L∞ norm: Reduce to a linear program.
Solution: x = 2.501.
Residual error: (−0.1, −0.1, 0.1, 0.05, −0.1) .
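The least-squares fit above can be checked directly (a NumPy sketch; the L1 and L∞ fits can be reproduced with the l1_min / linf_min sketches added on the earlier slides):

    import numpy as np

    # Noisy measurements from Page 24:  a_i * x = b_i
    a = np.array([2.0, 3.0, 4.0, 5.0, 6.0])
    b = np.array([5.1, 7.6, 9.91, 12.45, 15.1])

    # One-variable least squares has the closed form x* = (a . b) / (a . a)
    x_ls = a.dot(b) / a.dot(a)
    print(x_ls)                 # about 2.505
    print(a * x_ls - b)         # residual vector

    # For the L1 / L-infinity estimates, call the earlier LP sketches with
    # A = a.reshape(-1, 1), e.g. l1_min(a.reshape(-1, 1), b).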

Lectures 10,11 Slide# 27


Experiment with Larger Matrices (matlab)

◮ L1 norm: large number of entries with small residuals, spread is larger.

◮ L2 norm: residuals look like a gaussian distribution.

◮ L∞ norm: residuals look more uniform, smaller spread.

Matlab code is available, run your own experiments!

Lectures 10,11 Slide# 28


Regularized Approximation

Ordinary least squares can have poor performance (the ill-conditioning problem).

Tikhonov Regularized Approximation:

||A~x − ~b||2^2 + λ ||~x||2^2 ,

where λ ≥ 0 is a tuning parameter.

Note-1: This is the same as the following norm minimization:

|| [A; λ^(1/2) I] ~x − [~b; ~0] ||2^2 ,

where [A; λ^(1/2) I] stacks A on top of λ^(1/2) I.

Note-2: In general, we can also have regularized approximation using L1, L∞ and other norms.
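A sketch of the stacked-system view of Tikhonov regularization (NumPy; the function name tikhonov is ours):

    import numpy as np

    def tikhonov(A, b, lam):
        """Minimize ||A x - b||_2^2 + lam * ||x||_2^2 via the stacked least-squares system."""
        n = A.shape[1]
        A_aug = np.vstack([A, np.sqrt(lam) * np.eye(n)])   # [A; sqrt(lam) I]
        b_aug = np.concatenate([b, np.zeros(n)])           # [b; 0]
        x, *rest = np.linalg.lstsq(A_aug, b_aug, rcond=None)
        return x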

Lectures 10,11 Slide# 29


Penalty Function Approximation

Problem: Solve

minimize φ(A~x − ~b) ,

where φ is a penalty function.

If φ = L1, L2, L∞, this is exactly the same as norm minimization.

Note-1: In general, φ need not be a norm.Note-2: φ is sometimes called a loss function.

Lectures 10,11 Slide# 30


Penalty Functions: Example

Deadzone Linear Penalty: Let ~x = (x1, . . . , xn)^t. The deadzone linear penalty is the sum of penalties on each individual component x1, . . . , xn:

φdz(~x) : φdz(x1) + φdz(x2) + · · · + φdz(xn) ,

wherein

φdz(u) = { 0            if |u| ≤ a
         { b(|u| − a)   if |u| > a

[Figure: the deadzone penalty φdz compared with the L1 and L2 penalties; it is zero on [−a, a].]

Lectures 10,11 Slide# 31


Deadzone Linear Penalty (cont)

Deadzone vs. L2:

◮ L2 norm imposes a large penalty when the residual is high. Therefore, outliers in the data can affect performance drastically.

◮ Deadzone penalty function is generally less sensitive to outliers.

Q: How do we solve the deadzone penalty approximation problem?

A: Apply the tricks for L1, L∞ (upcoming assignment).

Lectures 10,11 Slide# 32


Other penalty functions

Huber Penalty: The Huber penalty is the sum of functions on the individual components x1, . . . , xn:

φH(x1) + φH(x2) + · · ·+ φH(xn) ,

wherein

φH(u) = { u^2            if |u| ≤ a
        { a(2|u| − a)    if |u| > a

[Figure: the Huber penalty φH compared with the L2 penalty.]

Lectures 10,11 Slide# 33


Penalty Function Approximation: Summary

General observations:

◮ L2: Natural, easy to solve (using the pseudo-inverse). However, sensitive to outliers.

◮ L1: Sparse: ensures that most residuals are very small. Insensitive to outliers. However, the spread of the errors is larger.

◮ L∞ : Non-sparse but minimizes spread. Very sensitive to outliers.

◮ Other loss functions: provide various tradeoffs.

Details: Cf. Boyd’s book.

Lectures 10,11 Slide# 34


Application-2: Signal Denoising

Lectures 10,11 Slide# 35


Application-2: Signal Denoising

Input: Given samples of a noisy signal:

~y : (y1, y2, . . . , yN)^t .

Original signal is unknown (very little data available).

Problem: Find a new signal ~y* by minimizing

||~y* − ~y||p^p + φs(~y*) .

φs is a smoothing function.

[Figure: the original signal and the signal with noise.]

Lectures 10,11 Slide# 36


Smoothing Criteria

◮ Quadratic Smoothing: L2 norm on successive differences.

φq(~y) = Σ_{i=2}^{N} (yi − yi−1)^2 .

◮ Total Variation Smoothing: L1 norm on successive differences.

φtv(~y) = Σ_{i=2}^{N} |yi − yi−1| .

◮ Maximum Variation Smoothing: L∞ norm on successive differences.

φmax(~y) = max_{i=2,...,N} |yi − yi−1| .

Lectures 10,11 Slide# 37


Quadratic Smoothing

Problem: minimize ||~y* − ~yn||2^2 + φq(~y*).

This can be viewed as a least squares problem:

minimize || [I; ∆] ~y* − [~yn; ~0] ||2^2 ,

where ∆ is the (N−1) × N first-difference matrix:

∆ = [ −1  1  0 · · ·  0  0 ]
    [  0 −1  1 · · ·  0  0 ]
    [           . . .      ]
    [  0  0  0 · · · −1  1 ]
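A small NumPy sketch of quadratic smoothing via the stacked least-squares problem above (the function names are ours):

    import numpy as np

    def difference_matrix(N):
        """The (N-1) x N first-difference matrix Delta used above."""
        Delta = np.zeros((N - 1, N))
        for i in range(N - 1):
            Delta[i, i] = -1.0
            Delta[i, i + 1] = 1.0
        return Delta

    def quadratic_smooth(y_noisy):
        """Solve min ||y* - y_noisy||_2^2 + sum_i (y*_i - y*_{i-1})^2 as stacked least squares."""
        N = len(y_noisy)
        A = np.vstack([np.eye(N), difference_matrix(N)])   # [I; Delta]
        b = np.concatenate([y_noisy, np.zeros(N - 1)])     # [y_noisy; 0]
        y_star, *rest = np.linalg.lstsq(A, b, rcond=None)
        return y_star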

Lectures 10,11 Slide# 38


Total/Maximum Variation Smoothing

Total variation smoothing: an L1 least squares problem.

minimize || [I; ∆] ~y* − [~yn; ~0] ||1 .

Maximum variation smoothing: an L∞ least squares problem.

minimize || [I; ∆] ~y* − [~yn; ~0] ||∞ ,

where ∆ is the same first-difference matrix:

∆ = [ −1  1  0 · · ·  0  0 ]
    [  0 −1  1 · · ·  0  0 ]
    [           . . .      ]
    [  0  0  0 · · · −1  1 ]
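Both variants can be handled with the LP reductions sketched earlier. A brief sketch (it assumes the hypothetical l1_min / linf_min and difference_matrix helpers from the previous sketches are in scope):

    import numpy as np

    def tv_smooth(y_noisy):
        """Total-variation smoothing: min ||[I; Delta] y* - [y_noisy; 0]||_1 via the L1 LP sketch."""
        N = len(y_noisy)
        A = np.vstack([np.eye(N), difference_matrix(N)])
        b = np.concatenate([y_noisy, np.zeros(N - 1)])
        y_star, _ = l1_min(A, b)      # swap in linf_min for maximum-variation smoothing
        return y_star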

Lectures 10,11 Slide# 39


Experiment

[Figure: the original signal, the signal with noise, and the L1, L2, and L∞ reconstructed signals with their residual histograms.]

Lectures 10,11 Slide# 40


Other Smoothing Functions

Consider three successive data points instead of two:

φT(~y) : Σ_{i=2}^{N−1} |2yi − (yi−1 + yi+1)| .

Modify the matrix ∆ as follows:

∆ = [ −1  2 −1  0 · · ·  0  0  0 ]
    [  0 −1  2 −1 · · ·  0  0  0 ]
    [               . . .        ]
    [  0  0  0  0 · · · −1  2 −1 ]

∆ is called a Toeplitz matrix.

Lectures 10,11 Slide# 41


Application-3: Regression

Lectures 10,11 Slide# 42


Linear Regression

Inputs: Data points ~xi : (xi1, . . . , xin) and outcomes yi.

Goal: Find a function

f (~x) = c1x1 + c2x2 + . . .+ cnxn + c0

that best fits the data.

[Figure: input vs. output plot for the given data.]

Lectures 10,11 Slide# 43



Linear Regression

Let X be the data matrix and ~y the corresponding outcome vector.

Note: The i-th row Xi represents data point i; yi is the corresponding outcome for Xi.

If ~c^t ~x + c0 is the best linear fit, the error for the i-th data point is:

εi : (Xi · ~c) + c0 − yi .

We can write the overall error as: || [X ~1] ~c − ~y ||p .

Lectures 10,11 Slide# 45


Regression

Inputs: Data X, outcomes ~y.

Decision Vars: c0, c1, . . . , cn, the coefficients of the straight line.

Solution: minimize || [X ~1] ~c − ~y ||p .

Note: Instead of the norm, we could use any penalty function.

Ordinary Regression: min || [X ~1] ~c − ~y ||p^p .

Tikhonov Regression: min || [X ~1] ~c − ~y ||2^2 + λ · ||~c||2^2 .
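A minimal sketch of ordinary (L2) regression with the [X ~1] design matrix (NumPy; the data below is synthetic and purely for illustration):

    import numpy as np

    def fit_linear(X, y):
        """Ordinary (L2) regression: minimize ||[X 1] c - y||_2; returns (c_1..c_n, c_0)."""
        D = np.hstack([X, np.ones((X.shape[0], 1))])   # append the constant column
        c, *rest = np.linalg.lstsq(D, y, rcond=None)
        return c[:-1], c[-1]

    # synthetic usage example
    rng = np.random.default_rng(0)
    X = rng.uniform(0, 10, size=(80, 1))
    y = 1.2 * X[:, 0] + 0.5 + rng.normal(0.0, 0.3, size=80)
    coeffs, intercept = fit_linear(X, y)
    print(coeffs, intercept)

    # For L1 or L-infinity regression, replace the lstsq call with the l1_min / linf_min
    # LP sketches applied to the design matrix D and y.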

Lectures 10,11 Slide# 46


Experiment

Experiment: Create noisy data set with outliers.

80 regular data points with low noise; 5 outlier data points with large noise.

Goal: Compare effect of L1, L2, L∞ norm regressions.

Lectures 10,11 Slide# 47


Regression with Outliers

[Figure: regression with outliers, showing the L1, L2, and L∞ regression lines.]

Lectures 10,11 Slide# 48


Regression with Outliers (Error Histogram)

[Figure: histograms of the L1, L2, and L∞ norm residuals for the regression with outliers.]

Lectures 10,11 Slide# 49


Linear Regression With Non-Linear Models

Input: Data X, outcomes ~y, and functions g1, . . . , gk.

Goal: Fit a model of the form

yi = c0 + c1 g1(~xi) + c2 g2(~xi) + . . . + ck gk(~xi) .

Solution: Transform the data matrix as follows:

X′ = [ g1(X1) . . . gk(X1) ]
     [ g1(X2) . . . gk(X2) ]
     [        . . .        ]
     [ g1(Xn) . . . gk(Xn) ]

Apply linear regression on (X ′, ~y).
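A sketch of the feature transformation step (the function list below mirrors the model used on the next slide and is chosen only for illustration):

    import numpy as np

    def feature_map(x, funcs):
        """Build the transformed data matrix X' whose columns are g_1(x), ..., g_k(x)."""
        return np.column_stack([g(x) for g in funcs])

    funcs = [np.sin, np.cos, lambda x: np.cos(1.5 * x), lambda x: x]
    x = np.linspace(0.0, 10.0, 85)
    X_prime = feature_map(x, funcs)
    # Now run any linear regression routine on (X_prime, y),
    # e.g. fit_linear(X_prime, y) from the earlier sketch.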

Lectures 10,11 Slide# 50


Experiment

Experiment: Create a noisy non-linear data set with outliers.

~y = 0.2 sin(x) + 1.5 cos(x) + 2 cos(1.5 x) + 0.2 x − 3 + noise .

80 regular data points with low noise; 5 outlier data points with large noise.

[Figure: regression with a non-linear model: the noisy data with the outliers marked.]

Goal: Compare the effect of L1, L2, L∞ norm regressions.

Lectures 10,11 Slide# 51


Experiment: linear regression with a non-linear model

[Figure: L1, L2, and L∞ regression fits on the non-linear data set, with the outliers marked.]

Lectures 10,11 Slide# 52


Experiment: residuals

[Figure: histograms of the L1, L2, and L∞ norm residuals for the non-linear model fits.]

Lectures 10,11 Slide# 53


Classification

Problem: Given two classes of data points: ~x+1, . . . , ~x+N and ~x−1, . . . , ~x−M.

Goal: Find a separating hyperplane between the classes.

[Figure: two classes of points (x+ and x−) separated by a classifier line.]

Lectures 10,11 Slide# 54


Application-4: Classification

Lectures 10,11 Slide# 55


Linear Classification

Problem: Given two classes of data points: ~x+1, . . . , ~x+N and ~x−1, . . . , ~x−M.

Goal: Find a separating hyperplane between the classes.

Q-1: Will such a hyperplane always exist?

A: Not always! Only if data is linearly separable.

1. Assume that such a hyperplane exists.

2. Deal with cases where a hyperplane does not exist.

Lectures 10,11 Slide# 56


Linear Classification (cont)

Solution: Use linear programming.

Decision Variables: w0, w1, w2, . . . , wn, representing the classifier:

fw (~x) : w0 + w1x1 + · · ·+ wnxn .

Linear Programming:

max. ???
X+ ~w ≥ 0
X− ~w ≤ 0

where

X+ : [ ~x+1  1 ]        X− : [ ~x−1  1 ]
     [ ~x+2  1 ]             [ ~x−2  1 ]
     [ ~x+3  1 ]             [  ...  1 ]
     [  ...  1 ]             [ ~x−M  1 ]
     [ ~x+N  1 ]

Lectures 10,11 Slide# 57


Classification Using Linear Programs

Objective: Let's try Σ_i wi = ~1^t ~w.

Note: We write ⟨~a, ~b⟩ for the dot product ~a^t ~b.

Simplified data representation: represent the two classes by labels +1, −1.

Data: (~x1, y1), . . . , (~xN, yN) ⇒ (X, ~y),

where yi = +1 for class I and yi = −1 for class II data.

Linear Program:

min. ⟨~1, ~w⟩ + w0
yi (⟨~xi, ~w⟩ + w0) ≤ 0, for i = 1, . . . , N

Verify that this formulation works.
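For comparison, here is a standard strict-separation feasibility LP in SciPy (note: this is a common variant, not the slide's exact formulation; it places class +1 on the positive side of the hyperplane and normalizes the margin to 1):

    import numpy as np
    from scipy.optimize import linprog

    def separate(X, y):
        """Find (w, w0) with y_i * (<x_i, w> + w0) >= 1 for all i, if the data is separable."""
        N, n = X.shape
        # variables z = (w_1, ..., w_n, w0); rewrite the constraint as -y_i*(<x_i,w> + w0) <= -1
        A_ub = -(y[:, None] * np.hstack([X, np.ones((N, 1))]))
        b_ub = -np.ones(N)
        c = np.zeros(n + 1)                                 # pure feasibility problem
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * (n + 1))
        if not res.success:
            return None                                     # not linearly separable (or solver failure)
        return res.x[:n], res.x[n]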

Lectures 10,11 Slide# 58


Experiment: Classification

Experiment: Find a classifier for the following points.

[Figure: data for classification.]

Lectures 10,11 Slide# 59


Experiment: Classification

Experiment: Outcome.

[Figure: the data with several possible separating classifier lines.]

Note-1: Many different classifiers are possible.

Note-2: What is the deal with points that lie on the classifier line?

Lectures 10,11 Slide# 60


Dual, Complementary Slackness

Linear Program:

min. ⟨~1, ~w⟩ + w0
yi (⟨~xi, ~w⟩ + w0) ≤ 0, for i = 1, . . . , N

Complementary Slackness: Let α1, . . . , αN be the dual variables.

αiyi (〈~xi , ~w〉+ w0) = 0, αi ≥ 0 .

Claim: There are at least n + 1 points ~xj1 , . . . , ~xjn+1 s.t.

〈~xji , ~w〉+ w0 = 0 .

Note: n + 1 points fully determine the hyperplane (~w, w0).

Lectures 10,11 Slide# 61


Classifying Points

Problem: Given a new point ~x, let us find which class it belongs to.

Solution-1: Just find the sign of ⟨~w, ~x⟩ + w0; we know ~w, w0 from our LP solution.

Solution-2: More complicated. Express ~x as a linear combination of the support vectors:

~x = λ1 ~xj1 + λ2 ~xj2 + · · · + λ(n+1) ~xj(n+1)

We can write ⟨~w, ~x⟩ + w0 as

⟨~w, ~x⟩ + w0 = Σ_i λi (w0 + ⟨~w, ~xji⟩) + (1 − Σ_i λi) w0 = (1 − Σ_i λi) w0 ,

since w0 + ⟨~w, ~xji⟩ = 0 for each support vector.

This is impractical for linear programming formulations.

Lectures 10,11 Slide# 62


Non-Separable Data

Q: What if the data is not linearly separable?

A: The linear program will be primal infeasible.

In practice, linear separation may hold without noise but not with noise.

Solution: Relax the conditions on half-space w .

[Figure: non-separable data; slack variables ξ1, ξ2 measure the violations.]

Lectures 10,11 Slide# 63


Non-Separable Data: Linear Programming

Classifier: (old) w0 + ⟨~w, ~x⟩; (new) ⟨~w, ~x⟩ + 1.

Note: The classifier Σ_i wi xi + w0 is equivalent to Σ_i (wi / w0) xi + 1 if w0 ≠ 0.

LP Formulation:

minimize ||~ξ||1 + · · ·
⟨~x+i, ~w⟩ + 1 ≤ ξ+i
⟨~x−j, ~w⟩ + 1 ≥ −ξ−j
ξ+i, ξ−j ≥ 0

The variables ~ξ encode the tolerance. We minimize ~ξ as part of the objective.

Exercise: Verify that the support vector interpretation of dual works.

Lectures 10,11 Slide# 64


Non Linear Model Classification

Linear Classifier: 〈~w , ~x〉+ 1.

Non-linear functions: ~φ : Rn → Rm, ~φ = (φ1, . . . , φm). The classifier becomes

⟨~w, (φ1(~x), φ2(~x), . . . , φm(~x))⟩ .

This is exactly same idea as linear regression with non-linear functions.

Lectures 10,11 Slide# 65