TRANSCRIPT
-
TODAY’S LECTURE
Data collection
Data processing
Exploratory analysis & data viz
Analysis, hypothesis testing, & ML
Insight & policy decision
1
-
FEATURE SCALING
2
Price (in $1000) | Area (sq. ft.) | # Bathrooms | # Bedrooms | x0
220              | 1600           | 2.5         | 3          | 1
180              | 1400           | 1.5         | 3          | 1
350              | 2100           | 3.5         | 4          | 1
…                | …              | …           | …          | …
500              | 2400           | 4           | 5          | 1
-
NORMALIZATION- ZERO MEAN UNIT STANDARD DEVIATION
3
x_ij := (x_ij − μ_j) / σ_j,   j: Area, Bathrooms, Bedrooms

(applied to the housing table above)
-
MAX-MIN
4
x_ij := (x_ij − x_j^min) / (x_j^max − x_j^min),   j: Area, Bathrooms, Bedrooms

(applied to the housing table above)
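Both scalings can be sketched in a few lines of NumPy. The feature values below are the Area, # Bathrooms, and # Bedrooms columns from the housing table above.

```python
import numpy as np

# Feature columns from the housing table: Area, # Bathrooms, # Bedrooms
X = np.array([[1600, 2.5, 3],
              [1400, 1.5, 3],
              [2100, 3.5, 4],
              [2400, 4.0, 5]], dtype=float)

# Zero mean, unit standard deviation: x_ij := (x_ij - mu_j) / sigma_j
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)

# Max-min: x_ij := (x_ij - min_j) / (max_j - min_j)
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
```

After either scaling, all features fall in a comparable range, which keeps gradient descent steps balanced across coordinates.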
-
All Observation Model
■ Matrix notation for all observations:

  ⎡ y1 ⎤   ⎡ 1  x11  x12  …  x1n ⎤ ⎡ θ0 ⎤
  ⎢ y2 ⎥   ⎢ 1  x21  x22  …  x2n ⎥ ⎢ θ1 ⎥
  ⎢ y3 ⎥ = ⎢ 1  x31  x32  …  x3n ⎥ ⎢ θ2 ⎥
  ⎢ ⋮  ⎥   ⎢ ⋮   ⋮    ⋮        ⋮ ⎥ ⎢ ⋮  ⎥
  ⎣ ym ⎦   ⎣ 1  xm1  xm2  …  xmn ⎦ ⎣ θn ⎦

  Y = Xθ
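In code, the stacked form Y = Xθ is a single matrix product. The numbers below are made-up illustrative values, not fitted parameters.

```python
import numpy as np

# Design matrix: the leading column of ones multiplies theta_0 (the intercept).
X = np.array([[1.0, 1600, 2.5, 3],
              [1.0, 1400, 1.5, 3],
              [1.0, 2100, 3.5, 4]])
theta = np.array([10.0, 0.1, 5.0, 2.0])   # [theta_0, ..., theta_n], made up

Y = X @ theta   # all m predictions at once
```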
-
GRADIENT DESCENT
6
Algorithm for any* hypothesis function hθ, loss function f(θ), and step size α:
• Initialize the parameter vector: θ := 0
• Repeat until satisfied (e.g., exact or approximate convergence):
  • Compute the gradient: ∇θ f(θ)
  • Update the parameters: θ := θ − α ∇θ f(θ)
*must be reasonably well behaved
-
GRADIENT DESCENT - MULTIVARIATE
7
Repeat {
  θj := θj − α (1/m) ∑_{i=1}^m (hθ(x_i) − y_i) x_{ij}
} (update θj for all j = 0…n simultaneously)

Initialize θ = 0.  Note that (1/m) ∑_{i=1}^m (hθ(x_i) − y_i) x_{ij} = ∂f(θ)/∂θj.

Written out per coordinate:
  θ0 := θ0 − α (1/m) ∑_{i=1}^m (hθ(x_i) − y_i) x_{i0}
  θ1 := θ1 − α (1/m) ∑_{i=1}^m (hθ(x_i) − y_i) x_{i1}
  θ2 := θ2 − α (1/m) ∑_{i=1}^m (hθ(x_i) − y_i) x_{i2}
  …
  θn := θn − α (1/m) ∑_{i=1}^m (hθ(x_i) − y_i) x_{in}
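The update above translates into a short NumPy loop. This is a minimal sketch for linear regression (hθ(x) = θᵀx) on made-up toy data.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, iters=1000):
    """Batch gradient descent; X already contains the column of ones."""
    m, n = X.shape
    theta = np.zeros(n)                       # initialize theta = 0
    for _ in range(iters):
        grad = X.T @ (X @ theta - y) / m      # (1/m) sum_i (h(x_i) - y_i) x_ij
        theta = theta - alpha * grad          # simultaneous update of all theta_j
    return theta

# Toy data generated from y = 1 + 2x, so theta should approach [1, 2]
X = np.column_stack([np.ones(4), np.array([0.0, 1.0, 2.0, 3.0])])
y = np.array([1.0, 3.0, 5.0, 7.0])
theta = gradient_descent(X, y)
```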
-
GRADIENT DESCENT - MULTIVARIATE
8
Repeat {
  θj := θj − α (1/m) ∑_{i=1}^m (hθ(x_i) − y_i) x_{ij}
} (update θj for all j = 0…n simultaneously)

Initialize θ = 0.  Note that (1/m) ∑_{i=1}^m (hθ(x_i) − y_i) x_{ij} = ∂f(θ)/∂θj.
-
STOCHASTIC GRADIENT DESCENT - MULTIVARIATE
9
Batch gradient descent:
  θ = 0
  Repeat {
    θj := θj − α (1/m) ∑_{i=1}^m (hθ(x_i) − y_i) x_{ij}
  } (update θj for all j = 0…n simultaneously)

Stochastic gradient descent:
  θ = 0
  Repeat {
    i = random index between 1 and m
    θj := θj − α (hθ(x_i) − y_i) x_{ij}
  } (update θj for all j = 0…n)
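The stochastic variant replaces the full sum with a single randomly chosen example per update. A sketch on the same kind of made-up toy data:

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd(X, y, alpha=0.01, iters=5000):
    """Stochastic gradient descent: one random example per parameter update."""
    m, n = X.shape
    theta = np.zeros(n)                        # initialize theta = 0
    for _ in range(iters):
        i = rng.integers(m)                    # random index into the m examples
        err = X[i] @ theta - y[i]              # h_theta(x_i) - y_i
        theta = theta - alpha * err * X[i]     # update all theta_j at once
    return theta

# Noiseless data from y = 1 + 2x: SGD should still land near [1, 2]
X = np.column_stack([np.ones(4), np.array([0.0, 1.0, 2.0, 3.0])])
y = np.array([1.0, 3.0, 5.0, 7.0])
theta = sgd(X, y)
```

Each update is much cheaper than a batch update, at the cost of a noisier path toward the minimum.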
-
GRADIENT DESCENT
10
θ0 = 0.1,  θ1 = 0.1,  ŷ = θ0 + θ1x

x     y      ŷ = θ0 + θ1x   ½(ŷ − y)²   ∂(SSE)/∂θ0 = ŷ − y   ∂(SSE)/∂θ1 = (ŷ − y)x
0.2   0.44   0.12           0.0512      −0.32                −0.064
0.31  0.123  0.131          0.000032    0.008                0.00248
0.45  0.75   0.145          0.183       −0.605               −0.27225
0.75  0.39   0.175          0.0231      −0.215               −0.16125
                                        Σ = −1.132           Σ = −0.495
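The column sums can be checked directly. The four (x, y) pairs below are the ones read off the slide's table.

```python
# Worked example: theta_0 = theta_1 = 0.1, y_hat = theta_0 + theta_1 * x
xs = [0.2, 0.31, 0.45, 0.75]
ys = [0.44, 0.123, 0.75, 0.39]
theta0, theta1 = 0.1, 0.1

grad0 = 0.0   # sum of d(SSE)/d(theta_0) terms: (y_hat - y)
grad1 = 0.0   # sum of d(SSE)/d(theta_1) terms: (y_hat - y) * x
for x, y in zip(xs, ys):
    y_hat = theta0 + theta1 * x
    grad0 += y_hat - y
    grad1 += (y_hat - y) * x
# grad0 is -1.132 and grad1 is about -0.495, matching the table
```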
-
GRADIENT DESCENT
11
θ0 = 0.1,  θ1 = 0.1,  ŷ = θ0 + θ1x

x     y      ŷ = θ0 + θ1x   ½(ŷ − y)²   ∂(SSE)/∂θ0 = ŷ − y   ∂(SSE)/∂θ1 = (ŷ − y)x
0.31  0.123  0.131          0.000032    0.008                0.00248
…     …      …              …           …                    …
-
STOCHASTIC GRADIENT DESCENT
12
-
STOCHASTIC GRADIENT DESCENT - MINI BATCH
13
θ = 0
Repeat {
  i1, …, il = l random indices between 1 and m
  θj := θj − α (1/l) ∑_{k=1}^l (hθ(x_{i_k}) − y_{i_k}) x_{i_k j}
} (update θj for all j = 0…n)
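A sketch of the mini-batch update with batch size l, on made-up toy data (hyperparameters are assumptions, not from the slide):

```python
import numpy as np

rng = np.random.default_rng(0)

def minibatch_sgd(X, y, alpha=0.05, l=2, iters=2000):
    """Mini-batch SGD: average the gradient over l randomly drawn examples."""
    m, n = X.shape
    theta = np.zeros(n)                           # initialize theta = 0
    for _ in range(iters):
        idx = rng.integers(0, m, size=l)          # indices i_1, ..., i_l
        err = X[idx] @ theta - y[idx]             # h_theta(x_i) - y_i for the batch
        theta = theta - alpha * (X[idx].T @ err) / l
    return theta

# Same noiseless toy problem, y = 1 + 2x
X = np.column_stack([np.ones(4), np.array([0.0, 1.0, 2.0, 3.0])])
y = np.array([1.0, 3.0, 5.0, 7.0])
theta = minibatch_sgd(X, y)
```

Mini-batches interpolate between batch gradient descent (l = m) and pure SGD (l = 1), trading gradient noise against per-step cost.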
-
CLASSIFICATION PROBLEM
14
-
GENERAL CLASSIFICATION PROBLEM
15
• Can we predict a categorical response/output, Y, from a set of predictors?
• For example, an individual's choice of transportation:
  • Predictors: income, cost, and time
  • Response: car, bike, bus, or train
• From this classification model, an inference task: how do people value price and time when considering a transportation choice?
θ1, θ2, …, θn
-
WHY NOT LINEAR REGRESSION
16
• For categorical responses with more than two values, if order and scale don't make sense, it is not a regression problem.

  Y = { 1 if stroke
        2 if drug overdose
        3 if epileptic seizure

• For binary responses, it is a little better:

  Y = { 0 if stroke
        1 if drug overdose
-
BINARY RESPONSES
17
• We could use linear regression and interpret the response, Y, as a probability.
• For example, if ŷ > 0.5, predict drug overdose.
-
CLASSIFICATION AS PROBABILITY ESTIMATION
• Instead of modeling classes 0 or 1, model the conditional class probability p(Y = 1 | X = x), and classify based on this probability.
• Use of discriminant functions (think of scoring).
• One way to do this is logistic regression.
18
-
LOGISTIC REGRESSION
19
• Basic idea: build a linear model related to p(x) (i.e. p(x) = θ0 + θ1x)
• Linear regression directly doesn't work. Why?
• Instead, build a linear model of the log-odds:

  log [ p(x) / (1 − p(x)) ] = θ0 + θ1x
-
LOGISTIC REGRESSION - ODDS
20
• Odds are ratios of probabilities:  odds = p(x) / (1 − p(x))
• For example, "two to one odds that the US wins the Women's soccer World Cup" means the probability that the US wins is double the probability that they lose.
• So, if odds = 2, probability p(x) = ??
• If odds = 1/2, p(x) = ??
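The conversion between odds and probability follows directly from odds = p/(1 − p); a small helper sketch (function names are mine):

```python
def prob_to_odds(p):
    # odds = p / (1 - p)
    return p / (1 - p)

def odds_to_prob(odds):
    # solving odds = p / (1 - p) for p gives p = odds / (1 + odds)
    return odds / (1 + odds)
```

For example, "two to one odds" corresponds to odds = 2, and odds_to_prob then gives the implied probability of winning.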
-
LOGISTIC REGRESSION
• Suppose an individual has a 16% chance of defaulting on their credit card payment. What are the odds that they will default?
• On average, what fraction of people with odds of 0.37 of defaulting on their credit card payment will in fact default?
21
-
LOGISTIC REGRESSION - PREDICTIONS
The sigmoid function gives the inverse of the log-odds model:

  log [ p(x) / (1 − p(x)) ] = θ0 + θ1x  ⟹  p(x) = 1 / (1 + e^(−θᵀx))

  hθ(x) = 1 / (1 + e^(−θᵀx))

  y = hθ(x),  z = θᵀx
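The sigmoid and the resulting class-probability prediction, as a short sketch (function names are mine):

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(theta, x):
    # p(x) = 1 / (1 + e^(-theta^T x)), with z = theta^T x
    return sigmoid(np.dot(theta, x))
```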
-
ESTIMATION OF PARAMETERS
23
• Bernoulli probability (think of flipping a coin weighted by p(x)).
• Estimate the parameters to maximize the likelihood of the observed training data under this binomial model.
• Minimize the negative of the log likelihood of the model.
• Optimization problem:

  min_{θ0,θ1}  − ∑_{i=1}^m [ y_i f(x_i) − log(1 + e^(f(x_i))) ],   where f(x_i) = θ0 + θ1x_i
-
LOGISTIC REGRESSION
24
• Hypothesis function / classifier:

  hθ(x) = 1 / (1 + e^(−z))   (sigmoid function / logistic function)

  y = hθ(x),  z = θᵀx,  0 ≤ hθ(x) ≤ 1
-
LOGISTIC REGRESSION
25
• Hypothesis function / classifier:

  hθ(x) = 1 / (1 + e^(−z)) = p(y = 1 | x, θ)

  y = hθ(x),  z = θᵀx,  0 ≤ hθ(x) ≤ 1

  Predict y = 1 if hθ(x) ≥ 0.5
  Predict y = 0 if hθ(x) < 0.5
-
ESTIMATION OF PARAMETERS
26
Training set: {(x_1, y_1), (x_2, y_2), …, (x_m, y_m)}

  x = [x0, x1, ⋯, xn]ᵀ,  x0 = 1,  y ∈ {0, 1}

  hθ(x) = 1 / (1 + e^(−θᵀx))

Parameters θ?
-
LOSS FUNCTION
27
Linear regression:

  f(θ) = (1/m) ∑_{i=1}^m ½ (hθ(x_i) − y_i)²

Logistic regression:  x = [x0, x1, ⋯, xn]ᵀ,  x0 = 1,  y ∈ {0, 1},  hθ(x) = 1 / (1 + e^(−θᵀx))

  min_{θ0,θ1,…,θn} (1/m) ∑_{i=1}^m [ −y_i log(hθ(x_i)) − (1 − y_i) log(1 − hθ(x_i)) ]
-
LOSS FUNCTION
28
  x = [x0, x1, ⋯, xn]ᵀ,  x0 = 1,  y ∈ {0, 1}

  f(θ) = −(1/m) [ ∑_{i=1}^m y_i log(hθ(x_i)) + (1 − y_i) log(1 − hθ(x_i)) ]

  min_{θ0,θ1,…,θn} (1/m) ∑_{i=1}^m [ −y_i log(hθ(x_i)) − (1 − y_i) log(1 − hθ(x_i)) ]

To predict a new observation x, output  1 / (1 + e^(−θᵀx)) = p(y = 1 | x, θ)
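The loss translates line for line into NumPy, vectorized over the m training examples. The design matrix and labels below are made up for illustration.

```python
import numpy as np

def log_loss(theta, X, y):
    """f(theta) = -(1/m) sum_i [ y_i log h(x_i) + (1 - y_i) log(1 - h(x_i)) ]"""
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))    # h_theta(x_i) for every row
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

X = np.array([[1.0, 0.5], [1.0, -0.5]])       # toy design matrix (made up)
y = np.array([1.0, 0.0])
loss_at_zero = log_loss(np.zeros(2), X, y)    # equals log(2): h = 0.5 everywhere
```

At θ = 0 every prediction is 0.5, so the loss is log 2 regardless of the labels, a handy sanity check before training.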
-
GRADIENT DESCENT
29
Want:  min_{θ0,θ1,…,θn} f(θ), where

  f(θ) = −(1/m) [ ∑_{i=1}^m y_i log(hθ(x_i)) + (1 − y_i) log(1 − hθ(x_i)) ]

Repeat {
  θj := θj − α ∂f(θ)/∂θj
} (update all θj simultaneously)

  ∂f(θ)/∂θj = (1/m) ∑_{i=1}^m (hθ(x_i) − y_i) x_{ij},  where hθ(x) = 1 / (1 + e^(−θᵀx)),  θ = [θ0, θ1, …, θn]ᵀ
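Putting the gradient formula into a training loop: a minimal sketch on tiny made-up separable data (step size and iteration count are assumptions).

```python
import numpy as np

def logistic_gd(X, y, alpha=0.5, iters=2000):
    """Gradient descent for logistic regression: same update shape as the
    linear case, but with the sigmoid hypothesis."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        h = 1.0 / (1.0 + np.exp(-(X @ theta)))
        theta = theta - alpha * (X.T @ (h - y)) / m   # (1/m) sum (h - y_i) x_ij
    return theta

# Made-up 1-feature data: negative x -> class 0, positive x -> class 1
X = np.column_stack([np.ones(4), np.array([-2.0, -1.0, 1.0, 2.0])])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = logistic_gd(X, y)
preds = 1.0 / (1.0 + np.exp(-(X @ theta))) >= 0.5
```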
-
PREDICTION
30
-
PREDICTION
31
• For the Iris data, versicolor classification: intercept −4.220, petal_width coefficient 2.617
• Probability that a flower with a given petal width is a versicolor:

  p̂(1.7) = 1 / (1 + e^(−(θ0 + θ1x))) = 1 / (1 + e^(−(−4.220 + 2.617·1.7))) ≈ 0.55

  p̂(2.5) = 1 / (1 + e^(−(θ0 + θ1x))) = 1 / (1 + e^(−(−4.220 + 2.617·2.5))) ≈ 0.91
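The two probabilities can be reproduced numerically, using the coefficients fitted on the slide:

```python
import numpy as np

theta0, theta1 = -4.220, 2.617   # intercept and petal_width coefficient

def p_versicolor(petal_width):
    # p(x) = 1 / (1 + e^(-(theta_0 + theta_1 * x)))
    return 1.0 / (1.0 + np.exp(-(theta0 + theta1 * petal_width)))

p_17 = p_versicolor(1.7)   # about 0.55
p_25 = p_versicolor(2.5)   # about 0.91
```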
-
MULTIPLE LOGISTIC REGRESSION
32
  log [ p(x) / (1 − p(x)) ] = θ0 + θ1x1 + … + θnxn
-
EXERCISE
• Suppose we collect data for a group of students in a statistics class with variables x1 = hours studied, x2 = undergrad GPA, and y = receive an A. We fit a logistic regression and find the coefficients θ0 = −6, θ1 = 0.05, θ2 = 1. Estimate the probability that a student who studies 40 hours and has an undergraduate GPA of 3.5 gets an A in the class.
• With the estimated parameters from the previous question, and a GPA of 3.5 as before, how many hours would the student need to study to have a 50% chance of getting an A in the class?
33
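A numeric sketch for checking answers to this exercise; the only facts used are the sigmoid form and the observation that a 50% probability corresponds to log-odds of zero.

```python
import numpy as np

theta0, theta1, theta2 = -6.0, 0.05, 1.0   # coefficients given in the exercise

def p_A(hours, gpa):
    # p = 1 / (1 + e^(-(theta_0 + theta_1 * hours + theta_2 * gpa)))
    return 1.0 / (1.0 + np.exp(-(theta0 + theta1 * hours + theta2 * gpa)))

p = p_A(40, 3.5)   # probability of an A after 40 hours at GPA 3.5

# 50% chance <=> log-odds = 0 <=> theta_0 + theta_1 * hours + theta_2 * 3.5 = 0
hours_needed = -(theta0 + theta2 * 3.5) / theta1
```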