Regression
Practical Machine Learning
Fabian Wauthier
09/10/2009
Adapted from slides by Kurt Miller and Romain Thibaux
1
-
Outline
• Ordinary Least Squares Regression
- Online version
- Normal equations
- Probabilistic interpretation
• Overfitting and Regularization
• Overview of additional topics
- L1 Regression
- Quantile Regression
- Generalized linear models
- Kernel Regression and Locally Weighted Regression
2
-
Regression vs. Classification: X ⇒ Y
X: Anything
• continuous (ℜ, ℜ^d, …)
• discrete ({0,1}, {1,…,k}, …)
• structured (tree, string, …)
• …
Y (Classification): Discrete
– {0,1}: binary
– {1,…,k}: multi-class
– tree, etc.: structured
4
-
Regression vs. Classification: X ⇒ Y
X: Anything
• continuous (ℜ, ℜ^d, …)
• discrete ({0,1}, {1,…,k}, …)
• structured (tree, string, …)
• …
Classification methods: Perceptron, Logistic Regression, Support Vector Machine, Decision Tree, Random Forest, Kernel trick
5
-
Regression vs. Classification: X ⇒ Y
X: Anything
• continuous (ℜ, ℜ^d, …)
• discrete ({0,1}, {1,…,k}, …)
• structured (tree, string, …)
• …
Y (Regression): continuous
– ℜ, ℜ^d
6
-
Examples
• Voltage ⇒ Temperature
• Processes, memory ⇒ Power consumption
• Protein structure ⇒ Energy
• Robot arm controls ⇒ Torque at effector
• Location, industry, past losses ⇒ Premium
7
-
Linear regression
[Scatter plots of the example data: y against x in one and two input dimensions]
Given examples (x_i, y_i), i = 1, …, n.
Predict ŷ_{n+1} given a new point x_{n+1}.
8
-
Linear regression
[Scatter plots of the example data, as on the previous slide]
We wish to estimate ŷ by a linear function of our data x:
\hat{y}_{n+1} = w_0 + w_1 x_{n+1,1} + w_2 x_{n+1,2} = w^\top x_{n+1}
where w is a parameter to be estimated and we have used the standard convention of letting the first component of x be 1.
9
-
Choosing the regressor
Of the many regression fits that approximate the data, which should we choose?
[Plot of observations with several candidate fits]
X_i = \begin{pmatrix} 1 \\ x_i \end{pmatrix}
10
-
LMS Algorithm (Least Mean Squares)
In order to clarify what we mean by a good choice of w, we will define a cost function for how well we are doing on the training data:
\mathrm{Cost} = \frac{1}{2} \sum_{i=1}^{n} (w^\top x_i - y_i)^2, \qquad X_i = \begin{pmatrix} 1 \\ x_i \end{pmatrix}
[Plot: observations, predictions, and the error or "residual" between them]
11
-
LMS Algorithm (Least Mean Squares)
The best choice of w is the one that minimizes our cost function
E = \frac{1}{2} \sum_{i=1}^{n} (w^\top x_i - y_i)^2 = \sum_{i=1}^{n} E_i
In order to optimize this equation, we use standard gradient descent
w^{t+1} := w^t - \alpha \frac{\partial E}{\partial w}
where
\frac{\partial E}{\partial w} = \sum_{i=1}^{n} \frac{\partial E_i}{\partial w}
\quad \text{and} \quad
\frac{\partial E_i}{\partial w} = \frac{1}{2} \frac{\partial}{\partial w} (w^\top x_i - y_i)^2 = (w^\top x_i - y_i)\, x_i
12
-
LMS Algorithm (Least Mean Squares)
The LMS algorithm is an online method that performs the following update for each new data point:
w^{t+1} := w^t - \alpha \frac{\partial E_i}{\partial w} = w^t + \alpha (y_i - x_i^\top w^t)\, x_i
13
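A minimal sketch of this online update in NumPy. Nothing here is taken from the slides: the synthetic data, the step size alpha, and the number of passes are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.uniform(0, 5, size=n)
X = np.column_stack([np.ones(n), x])                 # first component of each x_i is 1
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=n)

w = np.zeros(2)                                      # w^0
alpha = 0.01                                         # step size (assumed)
for _ in range(50):                                  # several passes over the data
    for x_i, y_i in zip(X, y):
        # w^{t+1} := w^t + alpha * (y_i - x_i^T w^t) * x_i
        w = w + alpha * (y_i - x_i @ w) * x_i
print(w)                                             # close to the true weights [1.0, 2.0]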
-
LMS, Logistic Regression, and Perceptron updates
• LMS:                 w^{t+1} := w^t + \alpha (y_i - x_i^\top w)\, x_i
• Logistic Regression: w^{t+1} := w^t + \alpha (y_i - f_w(x_i))\, x_i
• Perceptron:          w^{t+1} := w^t + \alpha (y_i - f_w(x_i))\, x_i
14
-
Ordinary Least Squares (OLS)
[Plot: observations, predictions, and the error or "residual" between them]
\mathrm{Cost} = \frac{1}{2} \sum_{i=1}^{n} (w^\top x_i - y_i)^2, \qquad X_i = \begin{pmatrix} 1 \\ x_i \end{pmatrix}
15
-
Minimize the sum squared error
Let X be the n × d matrix whose rows are the X_i^\top and y the vector of targets. Then
E = \frac{1}{2} \sum_{i=1}^{n} (w^\top x_i - y_i)^2
  = \frac{1}{2} (Xw - y)^\top (Xw - y)
  = \frac{1}{2} \left( w^\top X^\top X w - 2 y^\top X w + y^\top y \right)
\frac{\partial E}{\partial w} = X^\top X w - X^\top y
Setting the derivative equal to zero gives us the Normal Equations:
X^\top X w = X^\top y
w = (X^\top X)^{-1} X^\top y
16
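A small NumPy sketch of this solve. The data is made up, and np.linalg.solve is used instead of forming an explicit inverse, which is the usual practice.

import numpy as np

rng = np.random.default_rng(0)
n = 50
x = rng.uniform(0, 20, size=n)
y = 3.0 + 1.5 * x + rng.normal(scale=2.0, size=n)

X = np.column_stack([np.ones(n), x])      # n x d design matrix with rows X_i = (1, x_i)
w = np.linalg.solve(X.T @ X, X.T @ y)     # solve the Normal Equations X^T X w = X^T y

print(w)                                  # approximately [3.0, 1.5]
print(X.T @ (X @ w - y))                  # residuals are orthogonal to the columns of X (~ 0)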
-
A geometric interpretation
We solved
\frac{\partial E}{\partial w} = X^\top (Xw - y) = 0
⇒ Residuals are orthogonal to the columns of X
⇒ ŷ = Xw gives the best reconstruction of y in the range of X
17
-
[Figure: subspace S spanned by the columns [X]_1, [X]_2 of X, with y projected onto it]
y' is an orthogonal projection of y onto S.
The residual vector y − y' is orthogonal to the subspace S.
18
-
Computing the solution
We compute w = (X^\top X)^{-1} X^\top y.
If X^\top X is invertible, then (X^\top X)^{-1} X^\top coincides with the pseudoinverse X^+ of X, and the solution is unique.
If X^\top X is not invertible, there is no unique solution. In that case w = X^+ y chooses the solution with the smallest Euclidean norm.
An alternative way to deal with a non-invertible X^\top X is to add a small portion of the identity matrix (= Ridge regression).
19
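A short NumPy illustration of the pseudoinverse route. The duplicated column is a contrived way to make X^T X non-invertible and is not from the lecture.

import numpy as np

rng = np.random.default_rng(0)
n = 30
x = rng.uniform(0, 10, size=n)
y = 2.0 * x + rng.normal(scale=0.5, size=n)

X = np.column_stack([x, x])          # duplicated column, so X^T X is singular
w = np.linalg.pinv(X) @ y            # w = X^+ y, the smallest-Euclidean-norm solution
print(w)                             # the weight ~2.0 is split evenly across the two columns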
-
Beyond lines and planes
[Plot: a non-linear curve fitted to the data]
Linear models become powerful function approximators when we consider non-linear feature transformations.
All the math is the same! Predictions are still linear in X!
20
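A sketch of the feature-transformation idea with polynomial features. The degree and data are assumptions; the solve is the same least-squares solve as before.

import numpy as np

rng = np.random.default_rng(0)
n = 60
x = rng.uniform(0, 10, size=n)
y = np.sin(x) + 0.1 * rng.normal(size=n)            # clearly non-linear in x

degree = 5
X = np.vander(x, degree + 1, increasing=True)       # features 1, x, x^2, ..., x^degree
w = np.linalg.lstsq(X, y, rcond=None)[0]            # ordinary least squares, unchanged

x_new = np.array([4.2])
print(np.vander(x_new, degree + 1, increasing=True) @ w)   # prediction, linear in the features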
-
Geometric interpretation
[Matlab demo]
[3D plot of the fitted quadratic surface]
\hat{y} = w_0 + w_1 x + w_2 x^2
21
-
Ordinary Least Squares [summary]
Given examples (x_i, y_i), i = 1, …, n.
Let X be the n × d matrix of inputs and y the vector of targets; for example X_i = (1, x_i)^\top.
Minimize E = \frac{1}{2} (Xw - y)^\top (Xw - y) by solving X^\top X w = X^\top y.
Predict \hat{y}_{n+1} = w^\top X_{n+1}.
22
-
Probabilistic interpretation
[Plot of the data with a fitted regression line]
Likelihood: y_i \mid x_i \sim \mathcal{N}(X_i^\top w, \sigma^2)
23
-
[Plot: conditional Gaussians p(y|x) along the regression line, with means µ = 3, 5, 8]
24
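Spelling out the argument this picture suggests (a sketch under the Gaussian noise assumption above, not reproduced from the slides): maximizing the likelihood over w is exactly minimizing the OLS cost.

L(w) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(X_i^\top w - y_i)^2}{2\sigma^2} \right\}
\log L(w) = \text{const} - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (X_i^\top w - y_i)^2
\arg\max_w \log L(w) = \arg\min_w \frac{1}{2} \sum_{i=1}^{n} (X_i^\top w - y_i)^2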
-
BREAK
25
-
Outline
• Ordinary Least Squares Regression
- Online version
- Normal equations
- Probabilistic interpretation
• Overfitting and Regularization
• Overview of additional topics
- L1 Regression
- Quantile Regression
- Generalized linear models
- Kernel Regression and Locally Weighted Regression
26
-
Overfitting
• So the more features the better? NO!
• Carefully selected features can improve model accuracy.
• But adding too many can lead to overfitting.
• Feature selection will be discussed in a separate lecture.
27
-
Overfitting
[Plot: a degree 15 polynomial overfitting the training data]
[Matlab demo]
28
-
Ridge Regression (Regularization)
[Plot: effect of regularization on a degree 19 polynomial fit]
Minimize E = \frac{1}{2} \sum_{i=1}^{n} (w^\top x_i - y_i)^2 + \frac{\lambda}{2} \|w\|_2^2 with "small" \lambda by solving
(X^\top X + \lambda I)\, w = X^\top y
[Continue Matlab demo]
29
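A minimal NumPy sketch of the regularized solve. The value of lam, the polynomial degree, and the data are illustrative assumptions rather than the settings of the demo.

import numpy as np

rng = np.random.default_rng(0)
n = 25
x = rng.uniform(0, 20, size=n)
y = np.sin(x / 3.0) + 0.1 * rng.normal(size=n)

degree = 19
X = np.vander(x / 20.0, degree + 1, increasing=True)   # rescaled so high powers stay bounded
lam = 1e-3                                             # "small" regularization weight (assumed)

w = np.linalg.solve(X.T @ X + lam * np.eye(degree + 1), X.T @ y)   # (X^T X + lam I) w = X^T y
print(w)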
-
Probabilistic interpretation
Likelihood, prior, and posterior:
P(w \mid X, y) = \frac{P(w, x_1, \ldots, x_n, y_1, \ldots, y_n)}{P(x_1, \ldots, x_n, y_1, \ldots, y_n)}
\propto P(w, x_1, \ldots, x_n, y_1, \ldots, y_n)
\propto \underbrace{\exp\left\{ -\frac{\lambda}{2\sigma^2} \|w\|_2^2 \right\}}_{\text{prior}} \; \underbrace{\prod_i \exp\left\{ -\frac{1}{2\sigma^2} (X_i^\top w - y_i)^2 \right\}}_{\text{likelihood}}
= \exp\left\{ -\frac{1}{2\sigma^2} \left[ \lambda \|w\|_2^2 + \sum_i (X_i^\top w - y_i)^2 \right] \right\}
30
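A step left implicit here (a sketch, not from the slides): maximizing this posterior is the same as minimizing the bracketed expression, which recovers the regularized Normal Equations of Ridge regression.

\arg\max_w P(w \mid X, y) = \arg\min_w \; \lambda \|w\|_2^2 + \sum_i (X_i^\top w - y_i)^2
\Rightarrow (X^\top X + \lambda I)\, w = X^\top y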
-
Outline
• Ordinary Least Squares Regression
- Online version
- Normal equations
- Probabilistic interpretation
• Overfitting and Regularization
• Overview of additional topics
- L1 Regression
- Quantile Regression
- Generalized linear models
- Kernel Regression and Locally Weighted Regression
31
-
Errors in Variables (Total Least Squares)
[Plot: a fit accounting for noise in the inputs as well as the outputs]
32
-
Sensitivity to outliers
High weight given to outliers
[Plot: temperature at noon data with outliers pulling the least-squares fit; influence function]
33
-
L1 Regression
• Linear program
• Influence function
[Matlab demo]
34
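A sketch of the linear-program formulation using scipy.optimize.linprog. The auxiliary variables t_i that bound the absolute residuals, and all data and settings, are assumptions for illustration, not the code behind the demo.

import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n = 40
x = rng.uniform(0, 10, size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)
y[::10] += 30.0                                    # a few gross outliers

X = np.column_stack([np.ones(n), x])
d = X.shape[1]

# minimize sum_i |X_i^T w - y_i|  as  min sum_i t_i  s.t.  -t_i <= X_i^T w - y_i <= t_i
c = np.concatenate([np.zeros(d), np.ones(n)])      # objective over z = [w, t]
A_ub = np.block([[X, -np.eye(n)],
                 [-X, -np.eye(n)]])
b_ub = np.concatenate([y, -y])
bounds = [(None, None)] * d + [(0, None)] * n      # w free, t >= 0

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
print(res.x[:d])                                   # much less distorted by the outliers than OLS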
-
Quantile Regression
[Scatter plot: workload (ViewItem.php) [req/s] vs. CPU utilization [MHz], with fits of the mean CPU and the 95th percentile of CPU]
Slide courtesy of Peter Bodik
35
-
Generalized Linear Models
Probabilistic interpretation of OLS: the mean is linear in X_i.
OLS: linearly predict the mean of a Gaussian conditional.
GLM: predict the mean of some other conditional density:
y_i \mid x_i \sim p\left( f(X_i^\top w) \right)
May need to transform the linear prediction by f(\cdot) to produce a valid parameter.
36
-
Example: “Poisson regression”
Suppose data are event counts: y \in \mathbb{N}_0
Typical distribution for count data: Poisson
\mathrm{Poisson}(y \mid \lambda) = \frac{e^{-\lambda} \lambda^y}{y!}, \qquad \text{mean parameter } \lambda > 0
Say we predict \lambda = f(x^\top w) = \exp\{x^\top w\}
GLM: y_i \mid x_i \sim \mathrm{Poisson}\left( f(X_i^\top w) \right)
37
-
[Plot: conditional Poissons p(y|x) along the regression curve, with means λ = 3, 5, 8]
38
-
Poisson regression: learning
As for OLS: optimize w by maximizing the likelihood of the data.
Equivalently: maximize the log likelihood.
Likelihood: L = \prod_i \mathrm{Poisson}\left( y_i \mid f(X_i^\top w) \right)
Log likelihood: l = \sum_i \left( X_i^\top w \, y_i - \exp\{X_i^\top w\} \right) + \text{const.}
Batch gradient: \frac{\partial l}{\partial w} = \sum_i \left( y_i - \exp\{X_i^\top w\} \right) X_i = \sum_i \underbrace{\left( y_i - f(X_i^\top w) \right)}_{\text{"residual"}} X_i
39
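A small NumPy sketch of this batch gradient ascent. The step size, iteration count, and synthetic counts are assumptions for illustration.

import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 2, size=n)
X = np.column_stack([np.ones(n), x])
w_true = np.array([0.5, 1.0])
y = rng.poisson(np.exp(X @ w_true))          # counts with lambda_i = exp(X_i^T w_true)

w = np.zeros(2)
alpha = 0.01                                 # step size (assumed)
for _ in range(2000):
    residual = y - np.exp(X @ w)             # y_i - f(X_i^T w)
    w = w + alpha * (X.T @ residual) / n     # step along the (averaged) gradient of l
print(w)                                     # close to w_true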
-
LMS, Logistic Regression, Perceptron and GLM updates
• GLM (online):        w^{t+1} := w^t + \alpha (y_i - f_w(x_i))\, x_i
• LMS:                 w^{t+1} := w^t + \alpha (y_i - x_i^\top w)\, x_i
• Logistic Regression: w^{t+1} := w^t + \alpha (y_i - f_w(x_i))\, x_i
• Perceptron:          w^{t+1} := w^t + \alpha (y_i - f_w(x_i))\, x_i
40
-
Kernel Regression and Locally Weighted Linear Regression
• Kernel Regression: Take a very, very conservative function approximator called AVERAGING. Locally weight it.
• Locally Weighted Linear Regression: Take a conservative function approximator called LINEAR REGRESSION. Locally weight it.
Slide from Paul Viola 2003
41
-
Kernel Regression
[Plot: kernel regression fit (sigma = 1)]
42
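A sketch of kernel regression as a locally weighted average with a Gaussian kernel (sigma = 1 as in the plot). The data and the particular weighted-average form are standard choices, not taken from the demo.

import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.uniform(0, 20, size=n)
y = np.sin(x / 2.0) + 0.2 * rng.normal(size=n)

def kernel_regression(x_query, x, y, sigma=1.0):
    # Weighted average of the y_i, with weights k(x_i - x_query) from a Gaussian kernel.
    weights = np.exp(-(x - x_query) ** 2 / (2.0 * sigma ** 2))
    return np.sum(weights * y) / np.sum(weights)

print(kernel_regression(10.0, x, y))   # roughly sin(5)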
-
! " # $ % &! &" &$ &% "!!&!
!'
!
'
&!
&'
Locally Weighted Linear Regression(LWR)
Kernel regression (sigma=1)
E =12
n∑i=1
(w!xi − yi)2
OLS cost function:
LWR cost function:
E′ =n∑
i=1
k(xi − x)(w"xi − yi)2
[Matlab demo]43
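A sketch of LWR as a weighted least-squares solve at the query point, following the cost E' above. The Gaussian kernel, sigma, and the synthetic data are assumptions.

import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.uniform(0, 20, size=n)
y = np.sin(x / 2.0) + 0.2 * rng.normal(size=n)
X = np.column_stack([np.ones(n), x])               # the local model is a line

def lwr_predict(x_query, X, x, y, sigma=1.0):
    # Minimize E' = sum_i k(x_i - x_query) (w^T X_i - y_i)^2,
    # a weighted least-squares problem solved via weighted normal equations.
    k = np.exp(-(x - x_query) ** 2 / (2.0 * sigma ** 2))
    XTK = X.T * k                                  # X^T K with K = diag(k)
    w = np.linalg.solve(XTK @ X, XTK @ y)
    return np.array([1.0, x_query]) @ w            # evaluate the local line at the query

print(lwr_predict(10.0, X, x, y))                  # roughly sin(5)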
-
Heteroscedasticity
[Plot: #requests per minute over time (days), showing non-constant variance]
44
-
What we covered
• Ordinary Least Squares Regression
- Online version
- Normal equations
- Probabilistic interpretation
• Overfitting and Regularization
• Overview of additional topics
- L1 Regression
- Quantile Regression
- Generalized linear models
- Kernel Regression and Locally Weighted Regression
45