TRANSCRIPT
-
TODAY’S LECTURE
Data collection
Data processing
Exploratory analysis & data viz
Analysis, hypothesis testing, & ML
Insight & policy decision
1
-
FEATURE SCALING
2
Price (in $1000) | Area (sq. ft.) | # Bathrooms | # Bedrooms | x0
220              | 1600           | 2.5         | 3          | 1
180              | 1400           | 1.5         | 3          | 1
350              | 2100           | 3.5         | 4          | 1
…                | …              | …           | …          | …
500              | 2400           | 4           | 5          | 1
-
NORMALIZATION- ZERO MEAN UNIT STANDARD DEVIATION
3
x_ij := (x_ij − μ_j) / σ_j,   j: Area, Bathrooms, Bedrooms

(applied to the housing table above)
-
MAX-MIN
4
x_ij := (x_ij − x_j^min) / (x_j^max − x_j^min),   j: Area, Bathrooms, Bedrooms

(applied to the housing table above)
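Both scalings can be sketched in a few lines of NumPy. The feature values below are the Area, # Bathrooms, and # Bedrooms columns from the housing table above.

```python
import numpy as np

# Feature columns from the housing table: Area, # Bathrooms, # Bedrooms
X = np.array([[1600, 2.5, 3],
              [1400, 1.5, 3],
              [2100, 3.5, 4],
              [2400, 4.0, 5]], dtype=float)

# Zero mean, unit standard deviation: x_ij := (x_ij - mu_j) / sigma_j
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)

# Max-min: x_ij := (x_ij - min_j) / (max_j - min_j)
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
```

After either scaling, all features fall in a comparable range, which keeps gradient descent steps balanced across coordinates.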
-
All Observation Model
■ Matrix notation for all observations:

  ⎡ y1 ⎤   ⎡ 1  x11  x12  …  x1n ⎤ ⎡ θ0 ⎤
  ⎢ y2 ⎥   ⎢ 1  x21  x22  …  x2n ⎥ ⎢ θ1 ⎥
  ⎢ y3 ⎥ = ⎢ 1  x31  x32  …  x3n ⎥ ⎢ θ2 ⎥
  ⎢ ⋮  ⎥   ⎢ ⋮   ⋮    ⋮        ⋮ ⎥ ⎢ ⋮  ⎥
  ⎣ ym ⎦   ⎣ 1  xm1  xm2  …  xmn ⎦ ⎣ θn ⎦

  Y = Xθ
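In code, the stacked form Y = Xθ is a single matrix product. The numbers below are made-up illustrative values, not fitted parameters.

```python
import numpy as np

# Design matrix: the leading column of ones multiplies theta_0 (the intercept).
X = np.array([[1.0, 1600, 2.5, 3],
              [1.0, 1400, 1.5, 3],
              [1.0, 2100, 3.5, 4]])
theta = np.array([10.0, 0.1, 5.0, 2.0])   # [theta_0, ..., theta_n], made up

Y = X @ theta   # all m predictions at once
```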
-
GRADIENT DESCENT
6
Algorithm for any* hypothesis function hθ, loss function f(θ), and step size α:
• Initialize the parameter vector: θ := 0
• Repeat until satisfied (e.g., exact or approximate convergence):
  • Compute the gradient: ∇θ f(θ)
  • Update the parameters: θ := θ − α ∇θ f(θ)
*must be reasonably well behaved
-
GRADIENT DESCENT - MULTIVARIATE
7
Repeat {
  θj := θj − α (1/m) ∑_{i=1}^m (hθ(x_i) − y_i) x_{ij}
} (update θj for all j = 0…n simultaneously)

Initialize θ = 0.  Note that (1/m) ∑_{i=1}^m (hθ(x_i) − y_i) x_{ij} = ∂f(θ)/∂θj.

Written out per coordinate:
  θ0 := θ0 − α (1/m) ∑_{i=1}^m (hθ(x_i) − y_i) x_{i0}
  θ1 := θ1 − α (1/m) ∑_{i=1}^m (hθ(x_i) − y_i) x_{i1}
  θ2 := θ2 − α (1/m) ∑_{i=1}^m (hθ(x_i) − y_i) x_{i2}
  …
  θn := θn − α (1/m) ∑_{i=1}^m (hθ(x_i) − y_i) x_{in}
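The update above translates into a short NumPy loop. This is a minimal sketch for linear regression (hθ(x) = θᵀx) on made-up toy data.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, iters=1000):
    """Batch gradient descent; X already contains the column of ones."""
    m, n = X.shape
    theta = np.zeros(n)                       # initialize theta = 0
    for _ in range(iters):
        grad = X.T @ (X @ theta - y) / m      # (1/m) sum_i (h(x_i) - y_i) x_ij
        theta = theta - alpha * grad          # simultaneous update of all theta_j
    return theta

# Toy data generated from y = 1 + 2x, so theta should approach [1, 2]
X = np.column_stack([np.ones(4), np.array([0.0, 1.0, 2.0, 3.0])])
y = np.array([1.0, 3.0, 5.0, 7.0])
theta = gradient_descent(X, y)
```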
-
GRADIENT DESCENT - MULTIVARIATE
8
Repeat {
  θj := θj − α (1/m) ∑_{i=1}^m (hθ(x_i) − y_i) x_{ij}
} (update θj for all j = 0…n simultaneously)

Initialize θ = 0.  Note that (1/m) ∑_{i=1}^m (hθ(x_i) − y_i) x_{ij} = ∂f(θ)/∂θj.
-
STOCHASTIC GRADIENT DESCENT - MULTIVARIATE
9
Batch gradient descent:
  θ = 0
  Repeat {
    θj := θj − α (1/m) ∑_{i=1}^m (hθ(x_i) − y_i) x_{ij}
  } (update θj for all j = 0…n simultaneously)

Stochastic gradient descent:
  θ = 0
  Repeat {
    i = random index between 1 and m
    θj := θj − α (hθ(x_i) − y_i) x_{ij}
  } (update θj for all j = 0…n)
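The stochastic variant replaces the full sum with a single randomly chosen example per update. A sketch on the same kind of made-up toy data:

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd(X, y, alpha=0.01, iters=5000):
    """Stochastic gradient descent: one random example per parameter update."""
    m, n = X.shape
    theta = np.zeros(n)                        # initialize theta = 0
    for _ in range(iters):
        i = rng.integers(m)                    # random index into the m examples
        err = X[i] @ theta - y[i]              # h_theta(x_i) - y_i
        theta = theta - alpha * err * X[i]     # update all theta_j at once
    return theta

# Noiseless data from y = 1 + 2x: SGD should still land near [1, 2]
X = np.column_stack([np.ones(4), np.array([0.0, 1.0, 2.0, 3.0])])
y = np.array([1.0, 3.0, 5.0, 7.0])
theta = sgd(X, y)
```

Each update is much cheaper than a batch update, at the cost of a noisier path toward the minimum.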
-
GRADIENT DESCENT
10
θ0 = 0.1,  θ1 = 0.1,  ŷ = θ0 + θ1x

x     y      ŷ = θ0 + θ1x   ½(ŷ − y)²   ∂(SSE)/∂θ0 = ŷ − y   ∂(SSE)/∂θ1 = (ŷ − y)x
0.2   0.44   0.12           0.0512      −0.32                −0.064
0.31  0.123  0.131          0.000032    0.008                0.00248
0.45  0.75   0.145          0.183       −0.605               −0.27225
0.75  0.39   0.175          0.0231      −0.215               −0.16125
                                        Σ = −1.132           Σ = −0.495
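The column sums can be checked directly. The four (x, y) pairs below are the ones read off the slide's table.

```python
# Worked example: theta_0 = theta_1 = 0.1, y_hat = theta_0 + theta_1 * x
xs = [0.2, 0.31, 0.45, 0.75]
ys = [0.44, 0.123, 0.75, 0.39]
theta0, theta1 = 0.1, 0.1

grad0 = 0.0   # sum of d(SSE)/d(theta_0) terms: (y_hat - y)
grad1 = 0.0   # sum of d(SSE)/d(theta_1) terms: (y_hat - y) * x
for x, y in zip(xs, ys):
    y_hat = theta0 + theta1 * x
    grad0 += y_hat - y
    grad1 += (y_hat - y) * x
# grad0 is -1.132 and grad1 is about -0.495, matching the table
```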
-
GRADIENT DESCENT
11
θ0 = 0.1,  θ1 = 0.1,  ŷ = θ0 + θ1x

x     y      ŷ = θ0 + θ1x   ½(ŷ − y)²   ∂(SSE)/∂θ0 = ŷ − y   ∂(SSE)/∂θ1 = (ŷ − y)x
0.31  0.123  0.131          0.000032    0.008                0.00248
…     …      …              …           …                    …
-
STOCHASTIC GRADIENT DESCENT
12
-
STOCHASTIC GRADIENT DESCENT - MINI BATCH
13
θ = 0
Repeat {
  i1, …, il = l random indices between 1 and m
  θj := θj − α (1/l) ∑_{k=1}^l (hθ(x_{i_k}) − y_{i_k}) x_{i_k j}
} (update θj for all j = 0…n)
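A sketch of the mini-batch update with batch size l, on made-up toy data (hyperparameters are assumptions, not from the slide):

```python
import numpy as np

rng = np.random.default_rng(0)

def minibatch_sgd(X, y, alpha=0.05, l=2, iters=2000):
    """Mini-batch SGD: average the gradient over l randomly drawn examples."""
    m, n = X.shape
    theta = np.zeros(n)                           # initialize theta = 0
    for _ in range(iters):
        idx = rng.integers(0, m, size=l)          # indices i_1, ..., i_l
        err = X[idx] @ theta - y[idx]             # h_theta(x_i) - y_i for the batch
        theta = theta - alpha * (X[idx].T @ err) / l
    return theta

# Same noiseless toy problem, y = 1 + 2x
X = np.column_stack([np.ones(4), np.array([0.0, 1.0, 2.0, 3.0])])
y = np.array([1.0, 3.0, 5.0, 7.0])
theta = minibatch_sgd(X, y)
```

Mini-batches interpolate between batch gradient descent (l = m) and pure SGD (l = 1), trading gradient noise against per-step cost.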
-
CLASSIFICATION PROBLEM
14
-
GENERAL CLASSIFICATION PROBLEM
15
• Can we predict a categorical response/output, Y, from a set of predictors?
• For example, an individual's choice of transportation:
  • Predictors: income, cost, and time
  • Response: car, bike, bus, or train
• From this classification model, an inference task: how do people value price and time when considering a transportation choice?
θ1, θ2, …, θn
-
WHY NOT LINEAR REGRESSION
16
• For categorical responses with more than two values, if order and scale don't make sense, it is not a regression problem.

  Y = { 1 if stroke
        2 if drug overdose
        3 if epileptic seizure

• For binary responses, it is a little better:

  Y = { 0 if stroke
        1 if drug overdose
-
BINARY RESPONSES
17
• We could use linear regression and interpret the response, Y, as a probability.
• For example, if ŷ > 0.5, predict drug overdose.
-
CLASSIFICATION AS PROBABILITY ESTIMATION
• Instead of modeling classes 0 or 1, model the conditional class probability p(Y = 1 | X = x), and classify based on this probability.
• Use of discriminant functions (think of scoring).
• One way to do this is logistic regression.
18
-
LOGISTIC REGRESSION
19
• Basic idea: build a linear model related to p(x) (i.e. p(x) = θ0 + θ1x)
• Linear regression directly doesn't work. Why?
• Instead, build a linear model of the log-odds:

  log [ p(x) / (1 − p(x)) ] = θ0 + θ1x
-
LOGISTIC REGRESSION - ODDS
20
• Odds are ratios of probabilities:  odds = p(x) / (1 − p(x))
• For example, "two to one odds that the US wins the Women's soccer World Cup" means the probability that the US wins is double the probability that they lose.
• So, if odds = 2, probability p(x) = ??
• If odds = 1/2, p(x) = ??
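The conversion between odds and probability follows directly from odds = p/(1 − p); a small helper sketch (function names are mine):

```python
def prob_to_odds(p):
    # odds = p / (1 - p)
    return p / (1 - p)

def odds_to_prob(odds):
    # solving odds = p / (1 - p) for p gives p = odds / (1 + odds)
    return odds / (1 + odds)
```

For example, "two to one odds" corresponds to odds = 2, and odds_to_prob then gives the implied probability of winning.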
-
LOGISTIC REGRESSION
• Suppose an individual has a 16% chance of defaulting on their credit card payment. What are the odds that they will default?
• On average, what fraction of people with odds of 0.37 of defaulting on their credit card payment will in fact default?
21
-
LOGISTIC REGRESSION - PREDICTIONS
The sigmoid function gives the inverse of the log-odds model:

  log [ p(x) / (1 − p(x)) ] = θ0 + θ1x  ⟹  p(x) = 1 / (1 + e^(−θᵀx))

  hθ(x) = 1 / (1 + e^(−θᵀx))

  y = hθ(x),  z = θᵀx
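The sigmoid and the resulting class-probability prediction, as a short sketch (function names are mine):

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(theta, x):
    # p(x) = 1 / (1 + e^(-theta^T x)), with z = theta^T x
    return sigmoid(np.dot(theta, x))
```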
-
ESTIMATION OF PARAMETERS
23
• Bernoulli probability (think of flipping a coin weighted by p(x)).
• Estimate the parameters to maximize the likelihood of the observed training data under this binomial model.
• Minimize the negative of the log likelihood of the model.
• Optimization problem:

  min_{θ0,θ1}  − ∑_{i=1}^m [ y_i f(x_i) − log(1 + e^(f(x_i))) ],   where f(x_i) = θ0 + θ1x_i
-
LOGISTIC REGRESSION
24
• Hypothesis function / classifier:

  hθ(x) = 1 / (1 + e^(−z))   (sigmoid function / logistic function)

  y = hθ(x),  z = θᵀx,  0 ≤ hθ(x) ≤ 1
-
LOGISTIC REGRESSION
25
• Hypothesis function / classifier:

  hθ(x) = 1 / (1 + e^(−z)) = p(y = 1 | x, θ)

  y = hθ(x),  z = θᵀx,  0 ≤ hθ(x) ≤ 1

  Predict y = 1 if hθ(x) ≥ 0.5
  Predict y = 0 if hθ(x) < 0.5
-
ESTIMATION OF PARAMETERS
26
Training set: {(x_1, y_1), (x_2, y_2), …, (x_m, y_m)}

  x = [x0, x1, ⋯, xn]ᵀ,  x0 = 1,  y ∈ {0, 1}

  hθ(x) = 1 / (1 + e^(−θᵀx))

Parameters θ?
-
LOSS FUNCTION
27
Linear regression:

  f(θ) = (1/m) ∑_{i=1}^m ½ (hθ(x_i) − y_i)²

Logistic regression:  x = [x0, x1, ⋯, xn]ᵀ,  x0 = 1,  y ∈ {0, 1},  hθ(x) = 1 / (1 + e^(−θᵀx))

  min_{θ0,θ1,…,θn} (1/m) ∑_{i=1}^m [ −y_i log(hθ(x_i)) − (1 − y_i) log(1 − hθ(x_i)) ]
-
LOSS FUNCTION
28
  x = [x0, x1, ⋯, xn]ᵀ,  x0 = 1,  y ∈ {0, 1}

  f(θ) = −(1/m) [ ∑_{i=1}^m y_i log(hθ(x_i)) + (1 − y_i) log(1 − hθ(x_i)) ]

  min_{θ0,θ1,…,θn} (1/m) ∑_{i=1}^m [ −y_i log(hθ(x_i)) − (1 − y_i) log(1 − hθ(x_i)) ]

To predict a new observation x, output  1 / (1 + e^(−θᵀx)) = p(y = 1 | x, θ)
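The loss translates line for line into NumPy, vectorized over the m training examples. The design matrix and labels below are made up for illustration.

```python
import numpy as np

def log_loss(theta, X, y):
    """f(theta) = -(1/m) sum_i [ y_i log h(x_i) + (1 - y_i) log(1 - h(x_i)) ]"""
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))    # h_theta(x_i) for every row
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

X = np.array([[1.0, 0.5], [1.0, -0.5]])       # toy design matrix (made up)
y = np.array([1.0, 0.0])
loss_at_zero = log_loss(np.zeros(2), X, y)    # equals log(2): h = 0.5 everywhere
```

At θ = 0 every prediction is 0.5, so the loss is log 2 regardless of the labels, a handy sanity check before training.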
-
GRADIENT DESCENT
29
Want:  min_{θ0,θ1,…,θn} f(θ), where

  f(θ) = −(1/m) [ ∑_{i=1}^m y_i log(hθ(x_i)) + (1 − y_i) log(1 − hθ(x_i)) ]

Repeat {
  θj := θj − α ∂f(θ)/∂θj
} (update all θj simultaneously)

  ∂f(θ)/∂θj = (1/m) ∑_{i=1}^m (hθ(x_i) − y_i) x_{ij},  where hθ(x) = 1 / (1 + e^(−θᵀx)),  θ = [θ0, θ1, …, θn]ᵀ
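Putting the gradient formula into a training loop: a minimal sketch on tiny made-up separable data (step size and iteration count are assumptions).

```python
import numpy as np

def logistic_gd(X, y, alpha=0.5, iters=2000):
    """Gradient descent for logistic regression: same update shape as the
    linear case, but with the sigmoid hypothesis."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        h = 1.0 / (1.0 + np.exp(-(X @ theta)))
        theta = theta - alpha * (X.T @ (h - y)) / m   # (1/m) sum (h - y_i) x_ij
    return theta

# Made-up 1-feature data: negative x -> class 0, positive x -> class 1
X = np.column_stack([np.ones(4), np.array([-2.0, -1.0, 1.0, 2.0])])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = logistic_gd(X, y)
preds = 1.0 / (1.0 + np.exp(-(X @ theta))) >= 0.5
```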
-
PREDICTION
30
-
PREDICTION
31
• For the Iris data, versicolor classification: intercept −4.220, petal_width coefficient 2.617
• Probability that a flower with a given petal width is a versicolor:

  p̂(1.7) = 1 / (1 + e^(−(θ0 + θ1x))) = 1 / (1 + e^(−(−4.220 + 2.617·1.7))) ≈ 0.55

  p̂(2.5) = 1 / (1 + e^(−(θ0 + θ1x))) = 1 / (1 + e^(−(−4.220 + 2.617·2.5))) ≈ 0.91
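The two probabilities can be reproduced numerically, using the coefficients fitted on the slide:

```python
import numpy as np

theta0, theta1 = -4.220, 2.617   # intercept and petal_width coefficient

def p_versicolor(petal_width):
    # p(x) = 1 / (1 + e^(-(theta_0 + theta_1 * x)))
    return 1.0 / (1.0 + np.exp(-(theta0 + theta1 * petal_width)))

p_17 = p_versicolor(1.7)   # about 0.55
p_25 = p_versicolor(2.5)   # about 0.91
```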
-
MULTIPLE LOGISTIC REGRESSION
32
  log [ p(x) / (1 − p(x)) ] = θ0 + θ1x1 + … + θnxn
-
EXERCISE
• Suppose we collect data for a group of students in a statistics class with variables x1 = hours studied, x2 = undergrad GPA, and y = receive an A. We fit a logistic regression and find the coefficients θ0 = −6, θ1 = 0.05, θ2 = 1. Estimate the probability that a student who studies 40 hours and has an undergraduate GPA of 3.5 gets an A in the class.
• With the estimated parameters from the previous question, and a GPA of 3.5 as before, how many hours would the student need to study to have a 50% chance of getting an A in the class?
33
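A numeric sketch for checking answers to this exercise; the only facts used are the sigmoid form and the observation that a 50% probability corresponds to log-odds of zero.

```python
import numpy as np

theta0, theta1, theta2 = -6.0, 0.05, 1.0   # coefficients given in the exercise

def p_A(hours, gpa):
    # p = 1 / (1 + e^(-(theta_0 + theta_1 * hours + theta_2 * gpa)))
    return 1.0 / (1.0 + np.exp(-(theta0 + theta1 * hours + theta2 * gpa)))

p = p_A(40, 3.5)   # probability of an A after 40 hours at GPA 3.5

# 50% chance <=> log-odds = 0 <=> theta_0 + theta_1 * hours + theta_2 * 3.5 = 0
hours_needed = -(theta0 + theta2 * 3.5) / theta1
```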