
Page 1: Linear and Logistic Regression

Marta Arias, [email protected]

Dept. LSI, UPC

Fall 2012

Page 2: Linear regression - Simple case: R^2

Here is the idea:

1. We have a set of points $\{(x_i, y_i)\}$ in $\mathbb{R}^2$.
2. We want to fit a line $y = ax + b$ that describes the trend.
3. We define a cost function that computes the total squared error of our predictions w.r.t. the observed values $y_i$,
$$J(a, b) = \sum_i (a x_i + b - y_i)^2,$$
which we want to minimize.
4. See $J$ as a function of $a$ and $b$: compute both partial derivatives, set them equal to zero, and solve for $a$ and $b$.
5. The coefficients you get achieve the minimum squared error.
6. This can be done for a specific set of points, or in general to derive closed-form formulas.
7. A more general version works in $\mathbb{R}^n$.

Page 3: Linear regression - Simple case: R^2

Let $h(x) = ax + b$, and $J(a, b) = \sum_i (h(x_i) - y_i)^2$.

$$
\begin{aligned}
\frac{\partial J(a,b)}{\partial a}
&= \frac{\partial \sum_i (h(x_i) - y_i)^2}{\partial a}
 = \sum_i \frac{\partial (a x_i + b - y_i)^2}{\partial a} \\
&= \sum_i 2\,(a x_i + b - y_i)\,\frac{\partial (a x_i + b - y_i)}{\partial a}
 = 2 \sum_i (a x_i + b - y_i)\,\frac{\partial (a x_i)}{\partial a} \\
&= 2 \sum_i (a x_i + b - y_i)\, x_i
\end{aligned}
$$

Page 4: Linear regression - Simple case: R^2

Let $h(x) = ax + b$, and $J(a, b) = \sum_i (h(x_i) - y_i)^2$.

$$
\begin{aligned}
\frac{\partial J(a,b)}{\partial b}
&= \frac{\partial \sum_i (h(x_i) - y_i)^2}{\partial b}
 = \sum_i \frac{\partial (a x_i + b - y_i)^2}{\partial b} \\
&= \sum_i 2\,(a x_i + b - y_i)\,\frac{\partial (a x_i + b - y_i)}{\partial b}
 = 2 \sum_i (a x_i + b - y_i)\,\frac{\partial b}{\partial b} \\
&= 2 \sum_i (a x_i + b - y_i)
\end{aligned}
$$

Page 5: Linear regression - Simple case: R^2

Normal equations: given $\{(x_i, y_i)\}_i$, solve for $a$, $b$:

$$\sum_i (a x_i + b)\, x_i = \sum_i x_i y_i$$

$$\sum_i (a x_i + b) = \sum_i y_i$$
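For concreteness, here is a small Octave sketch (our addition, not part of the original slides) that sets up and solves these two normal equations; the example points and variable names are only illustrative.

Octave code (sketch):

% Hypothetical example points
x = [1; 2; 3; 4; 5];
y = [2.1; 3.9; 6.2; 8.1; 9.8];
% The two normal equations in matrix form:
%   a*sum(x.^2) + b*sum(x) = sum(x.*y)
%   a*sum(x)    + b*n      = sum(y)
n = length(x);
A = [sum(x.^2) sum(x); sum(x) n];
c = [sum(x .* y); sum(y)];
ab = A \ c;    % ab(1) is the slope a, ab(2) is the intercept b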

Page 6: Linear regression - General case: R^n

- Now, each example is $\mathbf{x}^i = \langle x^i_0, x^i_1, x^i_2, \ldots, x^i_n \rangle$, where $x^i_0 = 1$ for all $i$
- The parameters to estimate are $\mathbf{a} = \langle a_0, \ldots, a_n \rangle^T$ [1]
- For $j = 0, \ldots, n$, we have $\frac{\partial J(\mathbf{a})}{\partial a_j} = \sum_i \big( \sum_{k=0}^{n} a_k x^i_k - y^i \big)\, x^i_j$

Normal equations: given $\{(\mathbf{x}^i, y^i)\}_i$, solve for $a_0, a_1, \ldots, a_n$:

$$\sum_i \Big( \sum_{k=0}^{n} a_k x^i_k \Big)\, x^i_j = \sum_i x^i_j\, y^i \qquad \text{(for each } j = 0, \ldots, n\text{)}$$

[1] Notice $\mathbf{a}$ is defined as a column vector.

Page 7: Linear regression - General case: R^n

- Remember $\mathbf{a} = \langle a_0, a_1, a_2, \ldots, a_n \rangle^T$
- Let $\mathbf{y} = \langle y^1, y^2, \ldots, y^m \rangle^T$ [2]
- Let

$$\mathbf{X} = \begin{pmatrix} \mathbf{x}^1 \\ \mathbf{x}^2 \\ \vdots \\ \mathbf{x}^m \end{pmatrix}
 = \begin{pmatrix} x^1_0 & x^1_1 & \cdots & x^1_n \\ x^2_0 & x^2_1 & \cdots & x^2_n \\ \vdots & \vdots & & \vdots \\ x^m_0 & x^m_1 & \cdots & x^m_n \end{pmatrix}
\qquad \text{where all } x^i_0 = 1$$

Now, the normal equation $\sum_i (\sum_{k=0}^{n} a_k x^i_k)\, x^i_j = \sum_i x^i_j\, y^i$ can be rewritten as

$$\sum_i x^i_j \Big( \sum_{k=0}^{n} a_k x^i_k \Big) = \sum_i x^i_j\, (\mathbf{x}^i \mathbf{a}) = \mathbf{X}_j^T\, \mathbf{y}$$

where $\mathbf{X}_j$ is the $j$-th column of $\mathbf{X}$ (note that the left-hand side is exactly $\mathbf{X}_j^T \mathbf{X} \mathbf{a}$).

[2] Notice $\mathbf{y}$ is defined as a column vector.

Page 8: Linear regression - General case: R^n

We have $\sum_i x^i_j\, (\mathbf{x}^i \mathbf{a}) = \mathbf{X}_j^T\, \mathbf{y}$ for each $j = 0, \ldots, n$. Compactly:

$$\mathbf{X}^T \mathbf{X}\, \mathbf{a} = \mathbf{X}^T \mathbf{y}$$

which can be solved as

$$\mathbf{a} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$$

How to compute the parameters in GNU Octave [3]: given $\mathbf{X}$ of size $m \times (n+1)$ [4] and the label vector $\mathbf{y}$, you can solve the least-squares regression problem with the single command

pinv(X' * X) * X' * y   [5]

[3] http://www.gnu.org/software/octave/
[4] Assuming the original data matrix has been prepended with an all-1 column.
[5] Equivalent to X \ y using the built-in operator '\'.
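As footnote [5] notes, the built-in backslash operator computes the same least-squares solution; a quick sketch (with a made-up X and y, our addition) to check the equivalence:

X = [1 2 1; 1 3 2; 1 5 4; 1 7 5];   % made-up m x (n+1) matrix with an all-1 first column
y = [3; 5; 9; 12];                  % made-up label vector
a1 = pinv(X' * X) * X' * y;         % normal-equations solution
a2 = X \ y;                         % built-in least-squares solve
disp(max(abs(a1 - a2)))             % should be numerically zero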

Page 9: Linear regression - Practical example with Octave

We have a dataset with data for 20 cities; for each city we have information on:

- Nr. of inhabitants
- Percentage of families' incomes below 5000 USD
- Percentage of unemployed
- Number of murders per $10^6$ inhabitants per annum

We wish to perform regression analysis on the number of murders based on the other 3 features.

Page 10: Linear regression - Practical example with Octave

Octave code:

load data.txt
n = size(data, 2)
m = size(data, 1)
X = [ ones(m, 1) data(:, 1:n-1) ]
y = data(:, n)
a = pinv(X' * X) * X' * y

Result:

a =
  -3.6765e+01
   7.6294e-07
   1.1922e+00
   4.7198e+00

So, we see that the variable that has the most impact is the percentage of unemployed.

Page 11: Linear regression - What if n is too large?

Computing $\mathbf{a} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$ may not be feasible if $n$ is large, since it involves inverting a matrix of size $n \times n$ (or $(n+1) \times (n+1)$ if we added the extra "all 1" column).

Gradient descent: an iterative optimization solution.
Start with any parameters $\mathbf{a}$, and update $\mathbf{a}$ iteratively in order to minimize $J(\mathbf{a})$. Gradient descent tells us that $J(\mathbf{a})$ should decrease fastest if we follow the direction of the negative gradient of the cost function $J(\mathbf{a})$:

$$\mathbf{a} = \mathbf{a} - \alpha\, \nabla J(\mathbf{a})$$

where $\alpha$ is a positive, real-valued parameter dictating how large each step is, and $\nabla J(\mathbf{a}) = \langle \frac{\partial J(\mathbf{a})}{\partial a_0}, \frac{\partial J(\mathbf{a})}{\partial a_1}, \ldots, \frac{\partial J(\mathbf{a})}{\partial a_n} \rangle^T$.

Page 12: Gradient descent

Page 13: Gradient descent - Algorithm, I

Pseudocode: given $J$, $\alpha$

- Initialize $\mathbf{a}$ to a random non-zero vector
- Repeat until convergence:
  - for all $j = 0, \ldots, n$, do $a'_j = a_j - \alpha \frac{\partial J(\mathbf{a})}{\partial a_j}$
  - for all $j = 0, \ldots, n$, do $a_j = a'_j$
- Output $\mathbf{a}$

Should be careful with:

- setting $\alpha$ small enough so that the algorithm converges, but not too small, since then it may need unnecessarily many iterations
- performing feature scaling so that all features are "on the same range" (this is necessary because they share the same $\alpha$ in the updates); a sketch of one common scaling helper follows below
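The Octave examples later in these slides call a helper named studentize to normalize X; it is not a built-in Octave function, so here is a minimal sketch of what such a helper might look like, assuming mean-0 / standard-deviation-1 scaling of each column (save it as studentize.m):

function Xs = studentize(X)
  % Scale each column of X to zero mean and unit standard deviation.
  mu = mean(X);              % row vector of column means
  sigma = std(X);            % row vector of column standard deviations
  sigma(sigma == 0) = 1;     % avoid dividing by zero on constant columns
  Xs = (X - mu) ./ sigma;    % relies on automatic broadcasting (Octave >= 3.6)
endfunction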

Page 14: Gradient descent - Algorithm, II

- $m$ examples $\{(\mathbf{x}^i, y^i)\}_i$
- example $\mathbf{x} = \langle x_0, x_1, \ldots, x_n \rangle$
- $h_{\mathbf{a}}(\mathbf{x}) = a_0 x_0 + a_1 x_1 + \cdots + a_n x_n = \sum_{j=0}^{n} a_j x_j = \mathbf{x}\mathbf{a}$
- $J(\mathbf{a}) = \frac{1}{2m} \sum_{i=1}^{m} (h_{\mathbf{a}}(\mathbf{x}^i) - y^i)^2$
- $\frac{\partial J(\mathbf{a})}{\partial a_j} = \frac{1}{m} \sum_{i=1}^{m} x^i_j\, (h_{\mathbf{a}}(\mathbf{x}^i) - y^i) = \frac{1}{m} \mathbf{X}_j^T (\mathbf{X}\mathbf{a} - \mathbf{y})$
- $\nabla J(\mathbf{a}) = \frac{1}{m} \mathbf{X}^T (\mathbf{X}\mathbf{a} - \mathbf{y})$

Pseudocode: given $\alpha$, $\mathbf{X}$, $\mathbf{y}$

- Initialize $\mathbf{a} = \langle 1, \ldots, 1 \rangle^T$
- Normalize $\mathbf{X}$
- Repeat until convergence:
  - $\mathbf{a} = \mathbf{a} - \frac{\alpha}{m} \mathbf{X}^T (\mathbf{X}\mathbf{a} - \mathbf{y})$
- Output $\mathbf{a}$


Page 16: Linear regression - Practical example with Octave

Octave code:

% X is the original m x n data matrix (n features, no bias column yet)
a = ones(n + 1, 1)      % initial value for the parameter vector (n features + bias)
X = studentize(X)       % normalize X
X = [ones(m, 1) X]      % prepend an all-1s column; X is now m x (n+1)
for t = 1:100           % repeat 100 times
  D = X*a - y
  a = a - alpha / m * X' * D
  J(t) = 1/2/m * D' * D % we store consecutive values of J over iterations t
end
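A quick way to check that the learning rate alpha is reasonable (our addition, not in the original slides) is to plot the stored cost values and verify that they decrease and flatten out:

plot(1:100, J)          % J should decrease steadily if alpha is well chosen
xlabel('iteration'), ylabel('cost J')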

Page 17: Logistic regression - What if y^i is in {0, 1} instead of a continuous real value?

Binary classification.
Now, datasets are of the form $\{(\mathbf{x}^1, 1), (\mathbf{x}^2, 0), \ldots\}$. In this case, linear regression will not do a good job of classifying examples as positive ($y^i = 1$) or negative ($y^i = 0$).

Page 18: Logistic regression - Hypothesis space

- $h_{\mathbf{a}}(\mathbf{x}) = g\big(\sum_{j=0}^{n} a_j x_j\big) = g(\mathbf{x}\mathbf{a})$
- $g(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid function (a.k.a. logistic function)
- $0 \leq g(z) \leq 1$, for all $z \in \mathbb{R}$
- $\lim_{z \to -\infty} g(z) = 0$ and $\lim_{z \to +\infty} g(z) = 1$
- $g(z) \geq 0.5$ iff $z \geq 0$
- Given an example $\mathbf{x}$:
  - predict positive iff $h_{\mathbf{a}}(\mathbf{x}) \geq 0.5$ iff $g(\mathbf{x}\mathbf{a}) \geq 0.5$ iff $\mathbf{x}\mathbf{a} \geq 0$
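The logistic regression Octave code at the end of these slides applies a sigmoid helper to whole vectors; it is not a built-in Octave function, so here is a minimal element-wise sketch (the name sigmoid simply follows the slides' usage):

function s = sigmoid(z)
  % Element-wise logistic function g(z) = 1 / (1 + exp(-z)).
  s = 1 ./ (1 + exp(-z));
endfunction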

Page 19: Logistic regression - Least-squares minimization for logistic regression

Let us assume that

- $P(y = 1 \mid \mathbf{x}; \mathbf{a}) = h_{\mathbf{a}}(\mathbf{x})$, and so
- $P(y = 0 \mid \mathbf{x}; \mathbf{a}) = 1 - h_{\mathbf{a}}(\mathbf{x})$

Given $m$ training examples $\{(\mathbf{x}^i, y^i)\}_i$ where $y^i \in \{0, 1\}$, we compute the likelihood (assuming independence of the training examples):

$$L(\mathbf{a}) = \prod_i p(y^i \mid \mathbf{x}^i; \mathbf{a}) = \prod_i h_{\mathbf{a}}(\mathbf{x}^i)^{y^i} \big(1 - h_{\mathbf{a}}(\mathbf{x}^i)\big)^{1 - y^i}$$

Our strategy will be to maximize the log likelihood

Page 20: Logistic regression

We will run gradient ascent to maximize the log likelihood, using:

- for any function $f(x)$, $\frac{\partial \log f(x)}{\partial x} = \frac{1}{f(x)} \frac{\partial f(x)}{\partial x}$
- for the sigmoid function $g(x)$,

$$
\begin{aligned}
\frac{\partial g(x)}{\partial x}
&= \frac{\partial}{\partial x}\, \frac{1}{1 + e^{-x}}
 = \frac{-1}{(1 + e^{-x})^2}\, \frac{\partial e^{-x}}{\partial x}
 = \frac{1}{(1 + e^{-x})^2}\, e^{-x} \\
&= \frac{1}{1 + e^{-x}} \left( 1 - \frac{1}{1 + e^{-x}} \right)
 = g(x)\,(1 - g(x))
\end{aligned}
$$

Page 21: Logistic regression - Maximizing the log likelihood

$$
\begin{aligned}
\log L(\mathbf{a}) &= \log \prod_i p(y^i \mid \mathbf{x}^i; \mathbf{a}) = \sum_i \log p(y^i \mid \mathbf{x}^i; \mathbf{a}) \\
&= \sum_i \log \Big( h_{\mathbf{a}}(\mathbf{x}^i)^{y^i} \big(1 - h_{\mathbf{a}}(\mathbf{x}^i)\big)^{1 - y^i} \Big) \\
&= \sum_i y^i \log h_{\mathbf{a}}(\mathbf{x}^i) + (1 - y^i) \log\big(1 - h_{\mathbf{a}}(\mathbf{x}^i)\big)
\end{aligned}
$$

Page 22: Logistic regression - Computing partial derivatives

$$
\begin{aligned}
\frac{\partial \log L(\mathbf{a})}{\partial a_j}
&= \sum_i \frac{\partial\, y^i \log h_{\mathbf{a}}(\mathbf{x}^i)}{\partial a_j} + \frac{\partial\, (1 - y^i) \log\big(1 - h_{\mathbf{a}}(\mathbf{x}^i)\big)}{\partial a_j} \\
&= \sum_i y^i\, \frac{\partial \log g(\mathbf{x}^i\mathbf{a})}{\partial a_j} + (1 - y^i)\, \frac{\partial \log\big(1 - g(\mathbf{x}^i\mathbf{a})\big)}{\partial a_j} \\
&= \sum_i \frac{y^i}{g(\mathbf{x}^i\mathbf{a})}\, \frac{\partial g(\mathbf{x}^i\mathbf{a})}{\partial a_j} - \frac{1 - y^i}{1 - g(\mathbf{x}^i\mathbf{a})}\, \frac{\partial g(\mathbf{x}^i\mathbf{a})}{\partial a_j} \\
&= \sum_i \left( \frac{y^i}{g(\mathbf{x}^i\mathbf{a})} - \frac{1 - y^i}{1 - g(\mathbf{x}^i\mathbf{a})} \right) \frac{\partial g(\mathbf{x}^i\mathbf{a})}{\partial a_j} \\
&= \sum_i \left( \frac{y^i}{g(\mathbf{x}^i\mathbf{a})} - \frac{1 - y^i}{1 - g(\mathbf{x}^i\mathbf{a})} \right) g(\mathbf{x}^i\mathbf{a})\,\big(1 - g(\mathbf{x}^i\mathbf{a})\big)\, \frac{\partial\, \mathbf{x}^i\mathbf{a}}{\partial a_j} \\
&= \sum_i \left( \frac{y^i}{g(\mathbf{x}^i\mathbf{a})} - \frac{1 - y^i}{1 - g(\mathbf{x}^i\mathbf{a})} \right) g(\mathbf{x}^i\mathbf{a})\,\big(1 - g(\mathbf{x}^i\mathbf{a})\big)\, x^i_j \\
&= \sum_i \big(y^i - g(\mathbf{x}^i\mathbf{a})\big)\, x^i_j \\
&= \sum_i \big(y^i - h_{\mathbf{a}}(\mathbf{x}^i)\big)\, x^i_j
\end{aligned}
$$

Page 23: Gradient ascent for logistic regression - Algorithm, I

Pseudocode: given $\alpha$, $\{(\mathbf{x}^i, y^i)\}_{i=1}^{m}$

- Initialize $\mathbf{a} = \langle 1, \ldots, 1 \rangle^T$
- Perform feature scaling on the examples' attributes
- Repeat until convergence:
  - for each $j = 0, \ldots, n$: $a'_j = a_j + \alpha \sum_i \big(y^i - h_{\mathbf{a}}(\mathbf{x}^i)\big)\, x^i_j$
  - for each $j = 0, \ldots, n$: $a_j = a'_j$
- Output $\mathbf{a}$

Page 24: Gradient ascent for logistic regression - Algorithm, II

- $m$ examples $\{(\mathbf{x}^i, y^i)\}_i$
- $g$ is the sigmoid function; $\mathbf{g}$ is its generalization to vectors: $\mathbf{g}(\langle z_1, \ldots, z_k \rangle) = \langle g(z_1), \ldots, g(z_k) \rangle$
- $h_{\mathbf{a}}(\mathbf{x}) = g\big(\sum_{j=0}^{n} a_j x_j\big) = g(\mathbf{x}\mathbf{a})$
- $J(\mathbf{a}) = \frac{1}{m} \sum_i y^i \log h_{\mathbf{a}}(\mathbf{x}^i) + (1 - y^i) \log\big(1 - h_{\mathbf{a}}(\mathbf{x}^i)\big)$
- $\frac{\partial J(\mathbf{a})}{\partial a_j} = \frac{1}{m} \sum_{i=1}^{m} x^i_j\, \big(y^i - h_{\mathbf{a}}(\mathbf{x}^i)\big) = \frac{1}{m} \mathbf{X}_j^T \big(\mathbf{y} - \mathbf{g}(\mathbf{X}\mathbf{a})\big)$
- $\nabla J(\mathbf{a}) = \frac{1}{m} \mathbf{X}^T \big(\mathbf{y} - \mathbf{g}(\mathbf{X}\mathbf{a})\big)$

Pseudocode: given $\alpha$, $\mathbf{X}$, $\mathbf{y}$

- Initialize $\mathbf{a} = \langle 1, \ldots, 1 \rangle^T$
- Normalize $\mathbf{X}$
- Repeat until convergence:
  - $\mathbf{a} = \mathbf{a} + \frac{\alpha}{m} \mathbf{X}^T \big(\mathbf{y} - \mathbf{g}(\mathbf{X}\mathbf{a})\big)$
- Output $\mathbf{a}$

Page 25: Logistic regression - Practical example with Octave

Octave code:

% X is the original m x n data matrix (n features, no bias column yet)
a = ones(n + 1, 1)      % initial value for the parameter vector (n features + bias)
X = studentize(X)       % normalize X
X = [ones(m, 1) X]      % prepend an all-1s column; X is now m x (n+1)
for t = 1:100           % repeat 100 times
  D = y - sigmoid(X*a)
  a = a + alpha / m * X' * D
  % we store consecutive values of J (the log likelihood) over iterations t
  G = sigmoid(X*a)
  J(t) = 1/m * (log(G)'*y + log(1-G)'*(1-y))
end
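Once the parameters have converged, predictions follow the decision rule from the hypothesis-space slide (predict positive iff $g(\mathbf{x}\mathbf{a}) \geq 0.5$, i.e. $\mathbf{x}\mathbf{a} \geq 0$). A small sketch of how one might use the fitted a on the normalized, bias-extended X (our addition, not in the original slides):

p = sigmoid(X*a) >= 0.5;     % predicted 0/1 labels for the training examples
accuracy = mean(p == y)      % fraction of correctly classified examples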