
  • Regression

    Practical Machine Learning

    Fabian Wauthier

    09/10/2009

    Adapted from slides by Kurt Miller and Romain Thibaux

    1

  • Outline

    • Ordinary Least Squares Regression

    - Online version

    - Normal equations

    - Probabilistic interpretation

    • Overfitting and Regularization

    • Overview of additional topics

    - L1 Regression

    - Quantile Regression

    - Generalized linear models

    - Kernel Regression and Locally Weighted Regression

    2


  • Regression vs. Classification: X ⇒ Y

    The input X can be anything:

    • continuous (ℜ, ℜd, …)

    • discrete ({0,1}, {1,…,k}, …)

    • structured (tree, string, …)

    • …

    Classification: the output Y is discrete:

    – {0,1} binary

    – {1,…,k} multi-class

    – tree, etc. structured

    4

  • Regression vs. Classification: X ⇒ Y

    The input X can be anything:

    • continuous (ℜ, ℜd, …)

    • discrete ({0,1}, {1,…,k}, …)

    • structured (tree, string, …)

    • …

    Classification methods: Perceptron, Logistic Regression, Support Vector Machine, Decision Tree, Random Forest, Kernel trick

    5

  • Regression vs. Classification: X ⇒ Y

    The input X can be anything:

    • continuous (ℜ, ℜd, …)

    • discrete ({0,1}, {1,…,k}, …)

    • structured (tree, string, …)

    • …

    Regression: the output Y is continuous:

    – ℜ, ℜd

    6

  • Examples

    • Voltage ⇒ Temperature

    • Processes, memory ⇒ Power consumption

    • Protein structure ⇒ Energy

    • Robot arm controls ⇒ Torque at effector

    • Location, industry, past losses ⇒ Premium

    7

  • Linear regression

    [Figure: scatter plots of y against x for one- and two-dimensional inputs]

    Given examples $(x_i, y_i)$, $i = 1, \dots, n$, predict $y_{n+1}$ given a new point $x_{n+1}$.

    8

  • Linear regression

    [Figure: scatter plots of y against x with a fitted line and plane]

    We wish to estimate $\hat{y}$ by a linear function of our data $x$:

    $\hat{y}_{n+1} = w_0 + w_1 x_{n+1,1} + w_2 x_{n+1,2} = w^\top x_{n+1}$

    where $w$ is a parameter to be estimated and we have used the standard convention of letting the first component of $x$ be 1.

    9

  • Choosing the regressor

    [Figure: observations with several candidate regression lines, where $X_i = \begin{pmatrix} 1 \\ x_i \end{pmatrix}$]

    Of the many regression fits that approximate the data, which should we choose?

    10

  • LMS Algorithm (Least Mean Squares)

    [Figure: an observation, its prediction on the fitted line, and the error or "residual" between them]

    In order to clarify what we mean by a good choice of $w$, we will define a cost function for how well we are doing on the training data:

    $\mathrm{Cost} = \frac{1}{2} \sum_{i=1}^{n} \left( w^\top x_i - y_i \right)^2, \qquad X_i = \begin{pmatrix} 1 \\ x_i \end{pmatrix}$

    11

  • LMS Algorithm (Least Mean Squares)

    The best choice of $w$ is the one that minimizes our cost function

    $E = \frac{1}{2} \sum_{i=1}^{n} \left( w^\top x_i - y_i \right)^2 = \sum_{i=1}^{n} E_i$

    In order to optimize this, we use standard gradient descent,

    $w_{t+1} := w_t - \alpha \frac{\partial E}{\partial w}$

    where

    $\frac{\partial E}{\partial w} = \sum_{i=1}^{n} \frac{\partial E_i}{\partial w}$ and $\frac{\partial E_i}{\partial w} = \frac{1}{2} \frac{\partial}{\partial w} \left( w^\top x_i - y_i \right)^2 = \left( w^\top x_i - y_i \right) x_i$

    12

  • LMS Algorithm (Least Mean Squares)

    The LMS algorithm is an online method that performs the following update for each new data point:

    $w_{t+1} := w_t - \alpha \frac{\partial E_i}{\partial w} = w_t + \alpha \left( y_i - x_i^\top w \right) x_i$

    13
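    As an illustration, here is a minimal NumPy sketch of the online LMS update above; the toy data, step size, and number of passes are assumptions chosen for the example, not part of the lecture.

```python
import numpy as np

def lms(X, y, alpha=0.01, passes=50):
    """Online LMS: one step per data point, w += alpha * (y_i - x_i^T w) * x_i."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(passes):
        for i in range(n):
            residual = y[i] - X[i] @ w
            w += alpha * residual * X[i]
    return w

# Toy data: y ≈ 2 + 3x with noise; prepend a column of ones so w[0] is the intercept.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
X = np.column_stack([np.ones_like(x), x])
y = 2 + 3 * x + rng.normal(0, 1, size=100)
print(lms(X, y))  # roughly [2, 3]
```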

  • LMS, Logistic Regression, and Perceptron updates

    • LMS: $w_{t+1} := w_t + \alpha \left( y_i - x_i^\top w \right) x_i$

    • Logistic Regression: $w_{t+1} := w_t + \alpha \left( y_i - f_w(x_i) \right) x_i$

    • Perceptron: $w_{t+1} := w_t + \alpha \left( y_i - f_w(x_i) \right) x_i$

    14

  • Ordinary Least Squares (OLS)

    [Figure: an observation, its prediction on the fitted line, and the error or "residual" between them]

    $\mathrm{Cost} = \frac{1}{2} \sum_{i=1}^{n} \left( w^\top x_i - y_i \right)^2, \qquad X_i = \begin{pmatrix} 1 \\ x_i \end{pmatrix}$

    15

  • Minimize the sum squared error

    Stack the examples into an $n \times d$ matrix $X$ (row $i$ is $X_i^\top$) and a target vector $y$. Then

    $E = \frac{1}{2} \sum_{i=1}^{n} \left( w^\top x_i - y_i \right)^2 = \frac{1}{2} (Xw - y)^\top (Xw - y) = \frac{1}{2} \left( w^\top X^\top X w - 2 y^\top X w + y^\top y \right)$

    $\frac{\partial E}{\partial w} = X^\top X w - X^\top y$

    Setting the derivative equal to zero gives us the Normal Equations:

    $X^\top X w = X^\top y \quad\Longrightarrow\quad w = (X^\top X)^{-1} X^\top y$

    16
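    For concreteness, a minimal NumPy sketch (not from the lecture) that forms and solves the normal equations on assumed toy data:

```python
import numpy as np

# Toy data: y ≈ 1 + 2x with noise; the leading column of ones gives the intercept w0.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
X = np.column_stack([np.ones_like(x), x])          # n x d design matrix
y = 1 + 2 * x + rng.normal(0, 0.5, size=200)

# Normal equations: solve (X^T X) w = X^T y rather than forming an explicit inverse.
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)  # roughly [1, 2]
```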

  • A geometric interpretation

    We solved $\frac{\partial E}{\partial w} = X^\top (Xw - y) = 0$

    ⇒ Residuals are orthogonal to the columns of $X$

    ⇒ $\hat{y} = Xw$ gives the best reconstruction of $y$ in the range of $X$

    17

  • [Figure: vector $y$ and its projection $y'$ onto the subspace $S$ spanned by the columns $[X]_1$ and $[X]_2$ of $X$]

    $y'$ is an orthogonal projection of $y$ onto $S$.

    The residual vector $y - y'$ is orthogonal to the subspace $S$.

    18

  • Computing the solution

    We compute $w = (X^\top X)^{-1} X^\top y$.

    If $X^\top X$ is invertible, then $(X^\top X)^{-1} X^\top$ coincides with the pseudoinverse $X^{+}$ of $X$, and the solution is unique.

    If $X^\top X$ is not invertible, there is no unique solution. In that case $w = X^{+} y$ chooses the solution $w$ with the smallest Euclidean norm.

    An alternative way to deal with a non-invertible $X^\top X$ is to add a small multiple of the identity matrix (= Ridge regression).

    19
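    A short NumPy sketch of the two computational routes mentioned above, on assumed toy data: the pseudoinverse $X^{+}$, and the equivalent SVD-based least-squares solver, which also returns the minimum-norm solution when $X^\top X$ is not invertible.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=50)
X = np.column_stack([np.ones_like(x), x])
y = 1 + 2 * x + rng.normal(0, 0.5, size=50)

# w = X^+ y; equals (X^T X)^{-1} X^T y whenever X^T X is invertible.
w_pinv = np.linalg.pinv(X) @ y

# Equivalent least-squares solve; gives the minimum-norm solution if X is rank deficient.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_pinv, w_lstsq)  # both roughly [1, 2]
```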

  • Beyond lines and planes

    [Figure: non-linear curve fitted to one-dimensional data]

    Linear models become powerful function approximators when we consider non-linear feature transformations.

    ⇒ All the math is the same!

    Predictions are still linear in $X$!

    20

  • Geometric interpretation

    [Matlab demo]

    [Figure: quadratic fit viewed in the expanded feature space]

    $\hat{y} = w_0 + w_1 x + w_2 x^2$

    21
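    As a rough stand-in for the Matlab demo (an assumption, not the lecture's code), here is a NumPy sketch that fits the quadratic model above with plain OLS on expanded features:

```python
import numpy as np

# Synthetic data from a noisy quadratic (an assumption for the example).
rng = np.random.default_rng(2)
x = rng.uniform(0, 20, size=150)
y = 5 - 2 * x + 0.8 * x**2 + rng.normal(0, 5, size=150)

# Non-linear feature transformation: columns 1, x, x^2. The model is still linear in w.
X = np.column_stack([np.ones_like(x), x, x**2])
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)  # roughly [5, -2, 0.8]

# Predict at a new point with the same feature map.
x_new = 12.0
print(np.array([1.0, x_new, x_new**2]) @ w)
```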

  • Ordinary Least Squares [summary]

    Given examples $(x_i, y_i)$, $i = 1, \dots, n$.

    Let $X$ be the $n \times d$ matrix whose rows are the (possibly feature-transformed) inputs $X_i^\top$, and let $y$ be the vector of targets. For example, $X_i = (1, x_i, x_i^2)^\top$.

    Minimize $E = \frac{1}{2} (Xw - y)^\top (Xw - y)$ by solving $X^\top X w = X^\top y$.

    Predict $\hat{y} = w^\top X_{n+1}$.

    22

  • Probabilistic interpretation

    [Figure: fitted line with Gaussian noise around each prediction]

    Likelihood: $p(y \mid X, w) = \prod_i \mathcal{N}\!\left( y_i \mid X_i^\top w, \sigma^2 \right) \propto \prod_i \exp\left\{ -\frac{1}{2\sigma^2} \left( X_i^\top w - y_i \right)^2 \right\}$

    Maximizing this likelihood in $w$ is equivalent to minimizing the OLS cost.

    23

  • [Figure: conditional Gaussians $p(y\,|\,x)$ at several values of $x$, with means $\mu = 3, 5, 8$; the mean $\mu$ is linear in $x$]

    24

  • BREAK

    25

  • Outline

    • Ordinary Least Squares Regression

    - Online version

    - Normal equations

    - Probabilistic interpretation

    • Overfitting and Regularization

    • Overview of additional topics

    - L1 Regression

    - Quantile Regression

    - Generalized linear models

    - Kernel Regression and Locally Weighted Regression

    26

  • Overfitting

    • So the more features the better? NO!

    • Carefully selected features can improve model accuracy.

    • But adding too many can lead to overfitting.

    • Feature selection will be discussed in a separate lecture.

    27

  • Overfitting

    [Matlab demo]

    [Figure: degree-15 polynomial fit to the data]

    28

  • Ridge Regression (Regularization)

    [Figure: effect of regularization on a degree-19 polynomial fit]

    Minimize $\frac{1}{2} \sum_{i=1}^{n} \left( w^\top x_i - y_i \right)^2 + \frac{\lambda}{2} \|w\|_2^2$ with a "small" $\lambda$ by solving

    $(X^\top X + \lambda I)\, w = X^\top y$

    [Continue Matlab demo]

    29
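    Standing in for the Matlab demo, a minimal NumPy sketch of the ridge solve $(X^\top X + \lambda I) w = X^\top y$; the degree-19 polynomial features, the rescaling of $x$, $\lambda$, and the data are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0, 20, 30)
y = 10 * np.sin(x) + rng.normal(0, 2, size=30)

# Degree-19 polynomial features; rescale x to [0, 1] so the high powers stay well scaled.
degree = 19
X = np.vander(x / 20.0, degree + 1, increasing=True)

lam = 1e-3
d = X.shape[1]
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)  # (X^T X + lam I) w = X^T y

# For comparison: the unregularized fit has far larger weights at this degree.
w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.linalg.norm(w_ridge), np.linalg.norm(w_ols))
```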

  • Probabilistic interpretation

    Posterior ∝ Prior × Likelihood:

    $P(w \mid X, y) = \frac{P(w, x_1, \dots, x_n, y_1, \dots, y_n)}{P(x_1, \dots, x_n, y_1, \dots, y_n)} \propto P(w, x_1, \dots, x_n, y_1, \dots, y_n)$

    $\propto \underbrace{\exp\left\{ -\frac{\lambda}{2\sigma^2} \|w\|_2^2 \right\}}_{\text{Prior}} \; \underbrace{\prod_i \exp\left\{ -\frac{1}{2\sigma^2} \left( X_i^\top w - y_i \right)^2 \right\}}_{\text{Likelihood}}$

    $= \exp\left\{ -\frac{1}{2\sigma^2} \left[ \lambda \|w\|_2^2 + \sum_i \left( X_i^\top w - y_i \right)^2 \right] \right\}$

    Maximizing this posterior in $w$ is exactly Ridge regression.

    30

  • Outline

    • Ordinary Least Squares Regression

    - Online version

    - Normal equations

    - Probabilistic interpretation

    • Overfitting and Regularization

    • Overview of additional topics

    - L1 Regression

    - Quantile Regression

    - Generalized linear models

    - Kernel Regression and Locally Weighted Regression

    31

  • Errors in Variables (Total Least Squares)

    [Figure]

    32

  • Sensitivity to outliers

    [Figure: "Temperature at noon" data with an outlier pulling the least-squares fit; inset shows the influence function]

    High weight is given to outliers.

    33

  • L1 Regression

    L1 regression minimizes the sum of absolute residuals, $\sum_i |w^\top x_i - y_i|$, instead of squared residuals. It can be solved as a linear program, and its influence function is bounded, so outliers have far less pull than in OLS.

    [Figure: influence function]

    [Matlab demo]

    34

  • Quantile Regression

    [Figure: CPU utilization [MHz] vs. workload (ViewItem.php) [req/s], with regression lines for the mean CPU and the 95th percentile of CPU]

    Slide courtesy of Peter Bodik

    35

  • Generalized Linear Models

    Probabilistic interpretation of OLS: the mean is linear in $X_i$.

    OLS: linearly predict the mean of a Gaussian conditional.

    GLM: predict the mean of some other conditional density,

    $y_i \mid x_i \sim p\!\left( f(X_i^\top w) \right)$

    We may need to transform the linear prediction by $f(\cdot)$ to produce a valid parameter.

    36

  • Example: "Poisson regression"

    Suppose the data are event counts: $y \in \mathbb{N}_0$.

    The typical distribution for count data is the Poisson:

    $\mathrm{Poisson}(y \mid \lambda) = \frac{e^{-\lambda} \lambda^{y}}{y!}$, with mean parameter $\lambda > 0$.

    Say we predict $\lambda = f(x^\top w) = \exp\{x^\top w\}$.

    GLM: $y_i \mid x_i \sim \mathrm{Poisson}\!\left( f(X_i^\top w) \right)$

    37

  • [Figure: conditional Poisson distributions $p(y\,|\,x)$ at several values of $x$, with means $\lambda = 3, 5, 8$]

    38

  • Poisson regression: learning

    As for OLS: optimize $w$ by maximizing the likelihood of the data. Equivalently, maximize the log likelihood.

    Likelihood: $L = \prod_i \mathrm{Poisson}\!\left( y_i \mid f(X_i^\top w) \right)$

    Log likelihood: $l = \sum_i \left( X_i^\top w \, y_i - \exp\{ X_i^\top w \} \right) + \text{const.}$

    Batch gradient:

    $\frac{\partial l}{\partial w} = \sum_i \left( y_i - \exp\{ X_i^\top w \} \right) X_i = \sum_i \underbrace{\left( y_i - f(X_i^\top w) \right)}_{\text{“residual”}} X_i$

    39
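    A minimal NumPy sketch of batch gradient ascent on this log likelihood; the step size, iteration count, and simulated data are assumptions for the example.

```python
import numpy as np

def poisson_regression(X, y, alpha=0.01, iters=2000):
    """Batch gradient ascent on the Poisson log likelihood: grad = X^T (y - exp(Xw))."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        residual = y - np.exp(X @ w)              # y_i - f(X_i^T w)
        w += alpha * X.T @ residual / len(y)      # averaged gradient step
    return w

# Simulated count data with lambda_i = exp(0.5 + 0.3 x_i).
rng = np.random.default_rng(5)
x = rng.uniform(0, 5, 500)
X = np.column_stack([np.ones_like(x), x])
y = rng.poisson(np.exp(0.5 + 0.3 * x))
print(poisson_regression(X, y))                   # roughly [0.5, 0.3]
```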

  • LMS, Logistic Regression, Perceptron, and GLM updates

    • LMS: $w_{t+1} := w_t + \alpha \left( y_i - x_i^\top w \right) x_i$

    • Logistic Regression: $w_{t+1} := w_t + \alpha \left( y_i - f_w(x_i) \right) x_i$

    • Perceptron: $w_{t+1} := w_t + \alpha \left( y_i - f_w(x_i) \right) x_i$

    • GLM (online): $w_{t+1} := w_t + \alpha \left( y_i - f_w(x_i) \right) x_i$

    40

  • Kernel Regression and Locally Weighted Linear Regression

    • Kernel Regression: take a very conservative function approximator called AVERAGING, and locally weight it.

    • Locally Weighted Linear Regression: take a conservative function approximator called LINEAR REGRESSION, and locally weight it.

    Slide from Paul Viola, 2003

    41

  • Kernel Regression

    [Figure: kernel regression fit (sigma = 1)]

    42
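    The slide's formula is not preserved in this transcript, but a standard kernel-regression (Nadaraya-Watson) estimator with a Gaussian kernel looks like the sketch below; treat the kernel choice, sigma, and data as assumptions.

```python
import numpy as np

def kernel_regression(x_train, y_train, x_query, sigma=1.0):
    """Nadaraya-Watson estimate: a locally weighted average of the training targets."""
    # Gaussian kernel weights between every query point and every training point.
    diffs = x_query[:, None] - x_train[None, :]
    k = np.exp(-diffs**2 / (2 * sigma**2))
    return (k @ y_train) / k.sum(axis=1)

rng = np.random.default_rng(6)
x = np.sort(rng.uniform(0, 20, 80))
y = 5 * np.sin(x) + rng.normal(0, 1, 80)
xq = np.linspace(0, 20, 200)
y_hat = kernel_regression(x, y, xq, sigma=1.0)
print(y_hat[:5])
```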

  • Locally Weighted Linear Regression (LWR)

    [Figure: kernel regression (sigma = 1)]

    OLS cost function: $E = \frac{1}{2} \sum_{i=1}^{n} \left( w^\top x_i - y_i \right)^2$

    LWR cost function: $E' = \sum_{i=1}^{n} k(x_i - x) \left( w^\top x_i - y_i \right)^2$

    [Matlab demo]

    43
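    A minimal NumPy sketch of LWR under the cost above: for each query point $x$, solve a weighted least-squares problem with kernel weights $k(x_i - x)$. The Gaussian kernel and the data are assumptions.

```python
import numpy as np

def lwr_predict(X, y, x_query, sigma=1.0):
    """Locally weighted linear regression prediction at a single query point."""
    # Kernel weights from the distance of each training input to the query
    # (computed on the raw input, excluding the bias column).
    k = np.exp(-(X[:, 1] - x_query)**2 / (2 * sigma**2))
    K = np.diag(k)
    # Weighted normal equations: (X^T K X) w = X^T K y
    w = np.linalg.solve(X.T @ K @ X, X.T @ K @ y)
    return np.array([1.0, x_query]) @ w

rng = np.random.default_rng(7)
x = np.sort(rng.uniform(0, 20, 80))
X = np.column_stack([np.ones_like(x), x])
y = 5 * np.sin(x) + rng.normal(0, 1, 80)
print([round(lwr_predict(X, y, xq), 2) for xq in (2.0, 10.0, 18.0)])
```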

  • Heteroscedasticity

    [Figure: #requests per minute over Time (days); the spread of the data changes over time]

    44

  • What we covered

    • Ordinary Least Squares Regression

    - Online version

    - Normal equations

    - Probabilistic interpretation

    • Overfitting and Regularization

    • Overview of additional topics

    - L1 Regression

    - Quantile Regression

    - Generalized linear models

    - Kernel Regression and Locally Weighted Regression

    45