
Page 1: Linear and Logistic Regression

Marta Arias, [email protected]

Dept. LSI, UPC

Fall 2012

Page 2: Linear regression - Simple case: R^2

Here is the idea:

1. We have a set of points $\{(x_i, y_i)\}$ in $\mathbb{R}^2$.
2. We want to fit a line $y = ax + b$ that describes the trend.
3. We define a cost function that computes the total squared error of our predictions w.r.t. the observed values $y_i$,
$$J(a, b) = \sum_i (a x_i + b - y_i)^2,$$
which we want to minimize.
4. See $J$ as a function of $a$ and $b$: compute both partial derivatives, set them equal to zero, and solve for $a$ and $b$.
5. The coefficients you get achieve the minimum squared error.
6. This can be done for a specific set of points, or in general to derive closed-form formulas.
7. A more general version works in $\mathbb{R}^n$.

Page 3: Linear regression - Simple case: R^2

Let $h(x) = ax + b$, and $J(a, b) = \sum_i (h(x_i) - y_i)^2$.

$$
\begin{aligned}
\frac{\partial J(a,b)}{\partial a}
&= \frac{\partial \sum_i (h(x_i) - y_i)^2}{\partial a}
 = \sum_i \frac{\partial (a x_i + b - y_i)^2}{\partial a} \\
&= \sum_i 2\,(a x_i + b - y_i)\,\frac{\partial (a x_i + b - y_i)}{\partial a}
 = 2 \sum_i (a x_i + b - y_i)\,\frac{\partial (a x_i)}{\partial a} \\
&= 2 \sum_i (a x_i + b - y_i)\, x_i
\end{aligned}
$$

Page 4: Linear regression - Simple case: R^2

Let $h(x) = ax + b$, and $J(a, b) = \sum_i (h(x_i) - y_i)^2$.

$$
\begin{aligned}
\frac{\partial J(a,b)}{\partial b}
&= \frac{\partial \sum_i (h(x_i) - y_i)^2}{\partial b}
 = \sum_i \frac{\partial (a x_i + b - y_i)^2}{\partial b} \\
&= \sum_i 2\,(a x_i + b - y_i)\,\frac{\partial (a x_i + b - y_i)}{\partial b}
 = 2 \sum_i (a x_i + b - y_i)\,\frac{\partial b}{\partial b} \\
&= 2 \sum_i (a x_i + b - y_i)
\end{aligned}
$$

Page 5: Linear regression - Simple case: R^2

Normal equations: given $\{(x_i, y_i)\}_i$, solve for $a$, $b$:

$$\sum_i (a x_i + b)\, x_i = \sum_i x_i y_i$$

$$\sum_i (a x_i + b) = \sum_i y_i$$
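For concreteness, here is a small Octave sketch (our addition, not part of the original slides) that sets up and solves these two normal equations; the example points and variable names are only illustrative.

Octave code (sketch):

% Hypothetical example points
x = [1; 2; 3; 4; 5];
y = [2.1; 3.9; 6.2; 8.1; 9.8];
% The two normal equations in matrix form:
%   a*sum(x.^2) + b*sum(x) = sum(x.*y)
%   a*sum(x)    + b*n      = sum(y)
n = length(x);
A = [sum(x.^2) sum(x); sum(x) n];
c = [sum(x .* y); sum(y)];
ab = A \ c;    % ab(1) is the slope a, ab(2) is the intercept b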

Page 6: Linear regression - General case: R^n

- Now, each example is $\mathbf{x}^i = \langle x^i_0, x^i_1, x^i_2, \ldots, x^i_n \rangle$, where $x^i_0 = 1$ for all $i$
- The parameters to estimate are $\mathbf{a} = \langle a_0, \ldots, a_n \rangle^T$ [1]
- For $j = 0, \ldots, n$, we have $\frac{\partial J(\mathbf{a})}{\partial a_j} = \sum_i \big( \sum_{k=0}^{n} a_k x^i_k - y^i \big)\, x^i_j$

Normal equations: given $\{(\mathbf{x}^i, y^i)\}_i$, solve for $a_0, a_1, \ldots, a_n$:

$$\sum_i \Big( \sum_{k=0}^{n} a_k x^i_k \Big)\, x^i_j = \sum_i x^i_j\, y^i \qquad \text{(for each } j = 0, \ldots, n\text{)}$$

[1] Notice $\mathbf{a}$ is defined as a column vector.

Page 7: Linear regression - General case: R^n

- Remember $\mathbf{a} = \langle a_0, a_1, a_2, \ldots, a_n \rangle^T$
- Let $\mathbf{y} = \langle y^1, y^2, \ldots, y^m \rangle^T$ [2]
- Let

$$\mathbf{X} = \begin{pmatrix} \mathbf{x}^1 \\ \mathbf{x}^2 \\ \vdots \\ \mathbf{x}^m \end{pmatrix}
 = \begin{pmatrix} x^1_0 & x^1_1 & \cdots & x^1_n \\ x^2_0 & x^2_1 & \cdots & x^2_n \\ \vdots & \vdots & & \vdots \\ x^m_0 & x^m_1 & \cdots & x^m_n \end{pmatrix}
\qquad \text{where all } x^i_0 = 1$$

Now, the normal equation $\sum_i (\sum_{k=0}^{n} a_k x^i_k)\, x^i_j = \sum_i x^i_j\, y^i$ can be rewritten as

$$\sum_i x^i_j \Big( \sum_{k=0}^{n} a_k x^i_k \Big) = \sum_i x^i_j\, (\mathbf{x}^i \mathbf{a}) = \mathbf{X}_j^T\, \mathbf{y}$$

where $\mathbf{X}_j$ is the $j$-th column of $\mathbf{X}$ (note that the left-hand side is exactly $\mathbf{X}_j^T \mathbf{X} \mathbf{a}$).

[2] Notice $\mathbf{y}$ is defined as a column vector.

Page 8: Linear regression - General case: R^n

We have $\sum_i x^i_j\, (\mathbf{x}^i \mathbf{a}) = \mathbf{X}_j^T\, \mathbf{y}$ for each $j = 0, \ldots, n$. Compactly:

$$\mathbf{X}^T \mathbf{X}\, \mathbf{a} = \mathbf{X}^T \mathbf{y}$$

which can be solved as

$$\mathbf{a} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$$

How to compute the parameters in GNU Octave [3]: given $\mathbf{X}$ of size $m \times (n+1)$ [4] and the label vector $\mathbf{y}$, you can solve the least-squares regression problem with the single command

pinv(X' * X) * X' * y   [5]

[3] http://www.gnu.org/software/octave/
[4] Assuming the original data matrix has been prepended with an all-1 column.
[5] Equivalent to X \ y using the built-in operator '\'.
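As footnote [5] notes, the built-in backslash operator computes the same least-squares solution; a quick sketch (with a made-up X and y, our addition) to check the equivalence:

X = [1 2 1; 1 3 2; 1 5 4; 1 7 5];   % made-up m x (n+1) matrix with an all-1 first column
y = [3; 5; 9; 12];                  % made-up label vector
a1 = pinv(X' * X) * X' * y;         % normal-equations solution
a2 = X \ y;                         % built-in least-squares solve
disp(max(abs(a1 - a2)))             % should be numerically zero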

Page 9: Linear regression - Practical example with Octave

We have a dataset with data for 20 cities; for each city we have information on:

- Nr. of inhabitants
- Percentage of families' incomes below 5000 USD
- Percentage of unemployed
- Number of murders per $10^6$ inhabitants per annum

We wish to perform regression analysis on the number of murders based on the other 3 features.

Page 10: Linear regression - Practical example with Octave

Octave code:

load data.txt
n = size(data, 2)
m = size(data, 1)
X = [ ones(m, 1) data(:, 1:n-1) ]
y = data(:, n)
a = pinv(X' * X) * X' * y

Result:

a =
  -3.6765e+01
   7.6294e-07
   1.1922e+00
   4.7198e+00

So, we see that the variable that has the most impact is the percentage of unemployed.

Page 11: Linear regression - What if n is too large?

Computing $\mathbf{a} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$ may not be feasible if $n$ is large, since it involves inverting a matrix of size $n \times n$ (or $(n+1) \times (n+1)$ if we added the extra "all 1" column).

Gradient descent: an iterative optimization solution.
Start with any parameters $\mathbf{a}$, and update $\mathbf{a}$ iteratively in order to minimize $J(\mathbf{a})$. Gradient descent tells us that $J(\mathbf{a})$ should decrease fastest if we follow the direction of the negative gradient of the cost function $J(\mathbf{a})$:

$$\mathbf{a} = \mathbf{a} - \alpha\, \nabla J(\mathbf{a})$$

where $\alpha$ is a positive, real-valued parameter dictating how large each step is, and $\nabla J(\mathbf{a}) = \langle \frac{\partial J(\mathbf{a})}{\partial a_0}, \frac{\partial J(\mathbf{a})}{\partial a_1}, \ldots, \frac{\partial J(\mathbf{a})}{\partial a_n} \rangle^T$.

Page 12: Gradient descent

Page 13: Gradient descent - Algorithm, I

Pseudocode: given $J$, $\alpha$

- Initialize $\mathbf{a}$ to a random non-zero vector
- Repeat until convergence:
  - for all $j = 0, \ldots, n$, do $a'_j = a_j - \alpha \frac{\partial J(\mathbf{a})}{\partial a_j}$
  - for all $j = 0, \ldots, n$, do $a_j = a'_j$
- Output $\mathbf{a}$

Should be careful with:

- setting $\alpha$ small enough so that the algorithm converges, but not too small, since then it may need unnecessarily many iterations
- performing feature scaling so that all features are "on the same range" (this is necessary because they share the same $\alpha$ in the updates); a sketch of one common scaling helper follows below
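The Octave examples later in these slides call a helper named studentize to normalize X; it is not a built-in Octave function, so here is a minimal sketch of what such a helper might look like, assuming mean-0 / standard-deviation-1 scaling of each column (save it as studentize.m):

function Xs = studentize(X)
  % Scale each column of X to zero mean and unit standard deviation.
  mu = mean(X);              % row vector of column means
  sigma = std(X);            % row vector of column standard deviations
  sigma(sigma == 0) = 1;     % avoid dividing by zero on constant columns
  Xs = (X - mu) ./ sigma;    % relies on automatic broadcasting (Octave >= 3.6)
endfunction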

Page 14: Gradient descent - Algorithm, II

- $m$ examples $\{(\mathbf{x}^i, y^i)\}_i$
- example $\mathbf{x} = \langle x_0, x_1, \ldots, x_n \rangle$
- $h_{\mathbf{a}}(\mathbf{x}) = a_0 x_0 + a_1 x_1 + \cdots + a_n x_n = \sum_{j=0}^{n} a_j x_j = \mathbf{x}\mathbf{a}$
- $J(\mathbf{a}) = \frac{1}{2m} \sum_{i=1}^{m} (h_{\mathbf{a}}(\mathbf{x}^i) - y^i)^2$
- $\frac{\partial J(\mathbf{a})}{\partial a_j} = \frac{1}{m} \sum_{i=1}^{m} x^i_j\, (h_{\mathbf{a}}(\mathbf{x}^i) - y^i) = \frac{1}{m} \mathbf{X}_j^T (\mathbf{X}\mathbf{a} - \mathbf{y})$
- $\nabla J(\mathbf{a}) = \frac{1}{m} \mathbf{X}^T (\mathbf{X}\mathbf{a} - \mathbf{y})$

Pseudocode: given $\alpha$, $\mathbf{X}$, $\mathbf{y}$

- Initialize $\mathbf{a} = \langle 1, \ldots, 1 \rangle^T$
- Normalize $\mathbf{X}$
- Repeat until convergence:
  - $\mathbf{a} = \mathbf{a} - \frac{\alpha}{m} \mathbf{X}^T (\mathbf{X}\mathbf{a} - \mathbf{y})$
- Output $\mathbf{a}$


Page 16: Linear regression - Practical example with Octave

Octave code:

% X is the original m x n data matrix (n features, no bias column yet)
a = ones(n + 1, 1)      % initial value for the parameter vector (n features + bias)
X = studentize(X)       % normalize X
X = [ones(m, 1) X]      % prepend an all-1s column; X is now m x (n+1)
for t = 1:100           % repeat 100 times
  D = X*a - y
  a = a - alpha / m * X' * D
  J(t) = 1/2/m * D' * D % we store consecutive values of J over iterations t
end
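A quick way to check that the learning rate alpha is reasonable (our addition, not in the original slides) is to plot the stored cost values and verify that they decrease and flatten out:

plot(1:100, J)          % J should decrease steadily if alpha is well chosen
xlabel('iteration'), ylabel('cost J')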

Page 17: Logistic regression - What if y^i is in {0, 1} instead of a continuous real value?

Binary classification.
Now, datasets are of the form $\{(\mathbf{x}^1, 1), (\mathbf{x}^2, 0), \ldots\}$. In this case, linear regression will not do a good job of classifying examples as positive ($y^i = 1$) or negative ($y^i = 0$).

Page 18: Logistic regression - Hypothesis space

- $h_{\mathbf{a}}(\mathbf{x}) = g\big(\sum_{j=0}^{n} a_j x_j\big) = g(\mathbf{x}\mathbf{a})$
- $g(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid function (a.k.a. logistic function)
- $0 \leq g(z) \leq 1$, for all $z \in \mathbb{R}$
- $\lim_{z \to -\infty} g(z) = 0$ and $\lim_{z \to +\infty} g(z) = 1$
- $g(z) \geq 0.5$ iff $z \geq 0$
- Given an example $\mathbf{x}$:
  - predict positive iff $h_{\mathbf{a}}(\mathbf{x}) \geq 0.5$ iff $g(\mathbf{x}\mathbf{a}) \geq 0.5$ iff $\mathbf{x}\mathbf{a} \geq 0$
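The logistic regression Octave code at the end of these slides applies a sigmoid helper to whole vectors; it is not a built-in Octave function, so here is a minimal element-wise sketch (the name sigmoid simply follows the slides' usage):

function s = sigmoid(z)
  % Element-wise logistic function g(z) = 1 / (1 + exp(-z)).
  s = 1 ./ (1 + exp(-z));
endfunction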

Page 19: Logistic regression - Least-squares minimization for logistic regression

Let us assume that

- $P(y = 1 \mid \mathbf{x}; \mathbf{a}) = h_{\mathbf{a}}(\mathbf{x})$, and so
- $P(y = 0 \mid \mathbf{x}; \mathbf{a}) = 1 - h_{\mathbf{a}}(\mathbf{x})$

Given $m$ training examples $\{(\mathbf{x}^i, y^i)\}_i$ where $y^i \in \{0, 1\}$, we compute the likelihood (assuming independence of the training examples):

$$L(\mathbf{a}) = \prod_i p(y^i \mid \mathbf{x}^i; \mathbf{a}) = \prod_i h_{\mathbf{a}}(\mathbf{x}^i)^{y^i} \big(1 - h_{\mathbf{a}}(\mathbf{x}^i)\big)^{1 - y^i}$$

Our strategy will be to maximize the log likelihood

Page 20: Logistic regression

We will run gradient ascent to maximize the log likelihood, using:

- for any function $f(x)$, $\frac{\partial \log f(x)}{\partial x} = \frac{1}{f(x)} \frac{\partial f(x)}{\partial x}$
- for the sigmoid function $g(x)$,

$$
\begin{aligned}
\frac{\partial g(x)}{\partial x}
&= \frac{\partial}{\partial x}\, \frac{1}{1 + e^{-x}}
 = \frac{-1}{(1 + e^{-x})^2}\, \frac{\partial e^{-x}}{\partial x}
 = \frac{1}{(1 + e^{-x})^2}\, e^{-x} \\
&= \frac{1}{1 + e^{-x}} \left( 1 - \frac{1}{1 + e^{-x}} \right)
 = g(x)\,(1 - g(x))
\end{aligned}
$$

Page 21: Logistic regression - Maximizing the log likelihood

$$
\begin{aligned}
\log L(\mathbf{a}) &= \log \prod_i p(y^i \mid \mathbf{x}^i; \mathbf{a}) = \sum_i \log p(y^i \mid \mathbf{x}^i; \mathbf{a}) \\
&= \sum_i \log \Big( h_{\mathbf{a}}(\mathbf{x}^i)^{y^i} \big(1 - h_{\mathbf{a}}(\mathbf{x}^i)\big)^{1 - y^i} \Big) \\
&= \sum_i y^i \log h_{\mathbf{a}}(\mathbf{x}^i) + (1 - y^i) \log\big(1 - h_{\mathbf{a}}(\mathbf{x}^i)\big)
\end{aligned}
$$

Page 22: Logistic regression - Computing partial derivatives

$$
\begin{aligned}
\frac{\partial \log L(\mathbf{a})}{\partial a_j}
&= \sum_i \frac{\partial\, y^i \log h_{\mathbf{a}}(\mathbf{x}^i)}{\partial a_j} + \frac{\partial\, (1 - y^i) \log\big(1 - h_{\mathbf{a}}(\mathbf{x}^i)\big)}{\partial a_j} \\
&= \sum_i y^i\, \frac{\partial \log g(\mathbf{x}^i\mathbf{a})}{\partial a_j} + (1 - y^i)\, \frac{\partial \log\big(1 - g(\mathbf{x}^i\mathbf{a})\big)}{\partial a_j} \\
&= \sum_i \frac{y^i}{g(\mathbf{x}^i\mathbf{a})}\, \frac{\partial g(\mathbf{x}^i\mathbf{a})}{\partial a_j} - \frac{1 - y^i}{1 - g(\mathbf{x}^i\mathbf{a})}\, \frac{\partial g(\mathbf{x}^i\mathbf{a})}{\partial a_j} \\
&= \sum_i \left( \frac{y^i}{g(\mathbf{x}^i\mathbf{a})} - \frac{1 - y^i}{1 - g(\mathbf{x}^i\mathbf{a})} \right) \frac{\partial g(\mathbf{x}^i\mathbf{a})}{\partial a_j} \\
&= \sum_i \left( \frac{y^i}{g(\mathbf{x}^i\mathbf{a})} - \frac{1 - y^i}{1 - g(\mathbf{x}^i\mathbf{a})} \right) g(\mathbf{x}^i\mathbf{a})\,\big(1 - g(\mathbf{x}^i\mathbf{a})\big)\, \frac{\partial\, \mathbf{x}^i\mathbf{a}}{\partial a_j} \\
&= \sum_i \left( \frac{y^i}{g(\mathbf{x}^i\mathbf{a})} - \frac{1 - y^i}{1 - g(\mathbf{x}^i\mathbf{a})} \right) g(\mathbf{x}^i\mathbf{a})\,\big(1 - g(\mathbf{x}^i\mathbf{a})\big)\, x^i_j \\
&= \sum_i \big(y^i - g(\mathbf{x}^i\mathbf{a})\big)\, x^i_j \\
&= \sum_i \big(y^i - h_{\mathbf{a}}(\mathbf{x}^i)\big)\, x^i_j
\end{aligned}
$$

Page 23: Gradient ascent for logistic regression - Algorithm, I

Pseudocode: given $\alpha$, $\{(\mathbf{x}^i, y^i)\}_{i=1}^{m}$

- Initialize $\mathbf{a} = \langle 1, \ldots, 1 \rangle^T$
- Perform feature scaling on the examples' attributes
- Repeat until convergence:
  - for each $j = 0, \ldots, n$: $a'_j = a_j + \alpha \sum_i \big(y^i - h_{\mathbf{a}}(\mathbf{x}^i)\big)\, x^i_j$
  - for each $j = 0, \ldots, n$: $a_j = a'_j$
- Output $\mathbf{a}$

Page 24: Gradient ascent for logistic regression - Algorithm, II

- $m$ examples $\{(\mathbf{x}^i, y^i)\}_i$
- $g$ is the sigmoid function; $\mathbf{g}$ is its generalization to vectors: $\mathbf{g}(\langle z_1, \ldots, z_k \rangle) = \langle g(z_1), \ldots, g(z_k) \rangle$
- $h_{\mathbf{a}}(\mathbf{x}) = g\big(\sum_{j=0}^{n} a_j x_j\big) = g(\mathbf{x}\mathbf{a})$
- $J(\mathbf{a}) = \frac{1}{m} \sum_i y^i \log h_{\mathbf{a}}(\mathbf{x}^i) + (1 - y^i) \log\big(1 - h_{\mathbf{a}}(\mathbf{x}^i)\big)$
- $\frac{\partial J(\mathbf{a})}{\partial a_j} = \frac{1}{m} \sum_{i=1}^{m} x^i_j\, \big(y^i - h_{\mathbf{a}}(\mathbf{x}^i)\big) = \frac{1}{m} \mathbf{X}_j^T \big(\mathbf{y} - \mathbf{g}(\mathbf{X}\mathbf{a})\big)$
- $\nabla J(\mathbf{a}) = \frac{1}{m} \mathbf{X}^T \big(\mathbf{y} - \mathbf{g}(\mathbf{X}\mathbf{a})\big)$

Pseudocode: given $\alpha$, $\mathbf{X}$, $\mathbf{y}$

- Initialize $\mathbf{a} = \langle 1, \ldots, 1 \rangle^T$
- Normalize $\mathbf{X}$
- Repeat until convergence:
  - $\mathbf{a} = \mathbf{a} + \frac{\alpha}{m} \mathbf{X}^T \big(\mathbf{y} - \mathbf{g}(\mathbf{X}\mathbf{a})\big)$
- Output $\mathbf{a}$

Page 25: Logistic regression - Practical example with Octave

Octave code:

% X is the original m x n data matrix (n features, no bias column yet)
a = ones(n + 1, 1)      % initial value for the parameter vector (n features + bias)
X = studentize(X)       % normalize X
X = [ones(m, 1) X]      % prepend an all-1s column; X is now m x (n+1)
for t = 1:100           % repeat 100 times
  D = y - sigmoid(X*a)
  a = a + alpha / m * X' * D
  % we store consecutive values of J (the log likelihood) over iterations t
  G = sigmoid(X*a)
  J(t) = 1/m * (log(G)'*y + log(1-G)'*(1-y))
end
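Once the parameters have converged, predictions follow the decision rule from the hypothesis-space slide (predict positive iff $g(\mathbf{x}\mathbf{a}) \geq 0.5$, i.e. $\mathbf{x}\mathbf{a} \geq 0$). A small sketch of how one might use the fitted a on the normalized, bias-extended X (our addition, not in the original slides):

p = sigmoid(X*a) >= 0.5;     % predicted 0/1 labels for the training examples
accuracy = mean(p == y)      % fraction of correctly classified examples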