Introduction to Predictive Models: The Bias Variance Tradeoff, Cross Validation. Some of the figures in this presentation are taken from An Introduction to Statistical Learning, with applications in R (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani. Carlos Carvalho, Mladen Kolar, and Rob McCulloch


  • Introduction to Predictive Models

    The Bias Variance Tradeoff

    Cross Validation

    Some of the figures in this presentation are taken from An Introduction to Statistical Learning, with applications in R (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani

    Carlos Carvalho, Mladen Kolar, and Rob McCulloch

  • 1. Introduction to Predictive Models
    2. Measuring Accuracy
    3. Out-of-Sample Predictions
    4. Bias-Variance Trade-Off
    5. Cross-Validation
    6. More on k-Nearest Neighbors, p > 1
    7. Doing CV with a Bigger n

  • 1. Introduction to Predictive Models

    Simply put, the goal is to predict a target variable Y with input variables X!

    In data mining terminology this is known as supervised learning (also called predictive analytics).

    In general, a useful way to think about it is that Y and X are related in the following way:

    $Y_i = f(X_i) + \epsilon_i$

    The main purpose of this part of the course is to learn or estimate f(·) from data.

  • Examples:

    - Y: will a customer respond to a promotion (target marketing).

    - Y: which customer is likely to cancel.

    - Y: the lifetime value of a customer (how much will they spend).

    - Y: pregnancy (from shopping behaviour) so you can target.

    - Y: will a customer defect.

    - Y: predict which products a customer will like (Pandora, Amazon).

    - Y: predict age of death (insurance companies).

    - ...

    See Tables 1-9 after page 142 of “Predictive Analytics” by Eric Siegel for many examples.

  • $Y = f(X) + \epsilon$

    - f(x): the part of Y you learn from X, the signal.

    - $\epsilon$: the part of Y you don’t learn from X, the noise.

    More generally, we want the conditional distribution of Y given X = x.
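To make the signal/noise split concrete, here is a minimal simulation sketch in Python. The function `f`, the noise standard deviation, and the range of x are made-up assumptions for illustration only; none of them come from the slides or from the Boston data.

```python
import random

def f(x):
    # Hypothetical signal, chosen only for illustration
    return 35.0 - 0.9 * x

def simulate(n, noise_sd=5.0, seed=0):
    """Draw n observations from Y = f(X) + eps with mean-zero Gaussian noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(2.0, 38.0) for _ in range(n)]
    eps = [rng.gauss(0.0, noise_sd) for _ in range(n)]
    ys = [f(x) + e for x, e in zip(xs, eps)]
    return xs, ys, eps

xs, ys, eps = simulate(500)

# The noise has mean zero, so its sample average should be close to zero
mean_noise = sum(eps) / len(eps)
print("average noise:", round(mean_noise, 3))
```

In real data we only observe the pairs (x, y); the split into `f(x)` and `eps` is hidden, which is why f(·) has to be estimated.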

  • Example: Boston Housing

    We might be interested in predicting the median house value as a function of some measure of socioeconomic level... here’s some data:

    Each observation corresponds to a town in the Boston area.

    medv: median house value (data is old).

    lstat: % lower status.

    [Figure: scatter plot of medv (vertical axis, 10 to 50) versus lstat (horizontal axis, 10 to 30)]

    What should f(·) be?

  • How about this...

    [Figure: scatter plot of medv versus lstat]

    If lstat = 30, what is the prediction for medv?

  • or this?

    [Figure: scatter plot of medv versus lstat]

    If lstat = 30, what is the prediction for medv?

  • How do we estimate f(·)?

    - Using training data:

      $\{(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)\}$

    - We use a statistical method to estimate the function f(·).

    - Two general methodological strategies:

      1. simple parametric models (restricted assumptions about f(·))
      2. non-parametric models (flexibility in defining f(·))

    [Figure: two panels, each plotting Income against Years of Education and Seniority]

  • Back to Boston Housing

    Parametric model ($Y = \alpha + \beta x + \epsilon$) versus non-parametric model (k-nearest neighbors)

    [Figure: two panels of medv versus lstat, one per model]

  • Simple parametric model:

    $Y_i = \alpha + \beta x_i + \epsilon_i$

    Using the training data, we estimate f(x) as

    $\hat{f}(x) = \hat{\alpha} + \hat{\beta} x$

    where $\hat{\alpha}$ and $\hat{\beta}$ are the linear regression estimates.

    [Figure: scatter plot of medv versus lstat]
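As a sketch of how the linear regression estimates can be computed for a single input, the code below uses the standard least-squares formulas: the slope estimate is the sum of cross-deviations of x and y divided by the sum of squared deviations of x, and the intercept estimate is the mean of y minus the slope times the mean of x. The helper names `fit_line` and `predict` and the toy numbers are hypothetical; the toy data is not the Boston data and lies exactly on a line so the result is easy to check by hand.

```python
def fit_line(xs, ys):
    """Least-squares estimates (alpha_hat, beta_hat) for y = alpha + beta * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    beta_hat = sxy / sxx
    alpha_hat = y_bar - beta_hat * x_bar
    return alpha_hat, beta_hat

def predict(alpha_hat, beta_hat, x):
    """Evaluate the fitted line f_hat(x) = alpha_hat + beta_hat * x."""
    return alpha_hat + beta_hat * x

# Made-up training data lying exactly on y = 35 - x
xs = [5.0, 10.0, 15.0, 20.0, 30.0]
ys = [30.0, 25.0, 20.0, 15.0, 5.0]

alpha_hat, beta_hat = fit_line(xs, ys)
print(alpha_hat, beta_hat)                # intercept 35, slope -1 (up to rounding)
print(predict(alpha_hat, beta_hat, 30.0))  # fitted value at x = 30
```

The whole fitted function is summarized by two numbers, which is what makes the model parametric: once `alpha_hat` and `beta_hat` are known, the training data is no longer needed to predict.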

  • To get this estimate we used kNN: k-nearest neighbors.

    To estimate $f(x_f)$, average the y values for the k training observations with x closest to $x_f$.

    [Figure: scatter plot of medv versus lstat]

    What do I mean by closest? We will choose the k = 50 points that are closest to the X value at which we are trying to predict.
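The kNN rule described above can be sketched in a few lines of Python. This is a minimal illustration for a single input variable, using absolute difference as the distance; the function name `knn_predict` and the toy data are hypothetical, and ties in distance are broken by the order of the training data, which is an arbitrary choice.

```python
def knn_predict(xs, ys, x_f, k):
    """Average the y values of the k training points whose x is closest to x_f."""
    if not 1 <= k <= len(xs):
        raise ValueError("k must be between 1 and the number of training points")
    # Sort training indices by distance |x_i - x_f|
    order = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x_f))
    nearest = order[:k]
    return sum(ys[i] for i in nearest) / k

# Made-up training data (not the Boston data)
xs = [1.0, 2.0, 3.0, 4.0, 10.0]
ys = [10.0, 8.0, 6.0, 4.0, 0.0]

# The three closest x values to 2.6 are 3, 2 and 4, so we average 6, 8 and 4
pred = knn_predict(xs, ys, x_f=2.6, k=3)
print(pred)
```

Unlike the linear model, there are no fitted coefficients here: every prediction goes back to the training data to find the neighbors.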

  • [Figure, shown over six slides: medv versus lstat, k = 50]
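A sketch of how the neighborhood moves as the prediction point moves: for each query value the code below lists which training points are among the k closest and the resulting local average. The helper `knn_neighbors`, the toy data, and the choice k = 3 are assumptions made for illustration (the slides use k = 50 on the Boston data).

```python
def knn_neighbors(xs, x_f, k):
    """Indices of the k training points whose x is closest to x_f."""
    order = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x_f))
    return sorted(order[:k])

# Made-up training data (not the Boston data)
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [34.0, 30.0, 27.0, 22.0, 20.0, 17.0, 15.0, 14.0]

k = 3
fits = {}
for x_f in (1.4, 4.4, 7.6):
    idx = knn_neighbors(xs, x_f, k)
    # The local average over the current neighborhood is the kNN estimate at x_f
    fits[x_f] = sum(ys[i] for i in idx) / k
    print(x_f, idx, round(fits[x_f], 2))
```

Repeating this over a fine grid of query values traces out the whole estimated curve, one local average at a time.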

  • Okay, that seems sensible, but 2 neighbors or 200 neighbors?

    [Figure: medv versus lstat, k = 2]
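One way to see what is at stake in choosing k is to look at the two extremes. With k = 1 and distinct x values, the fit at each training point is that point's own y, so the curve chases every observation. With k = n, every prediction is the overall mean of y, so the curve is flat. The sketch below checks both facts on made-up data; `knn_predict` is a hypothetical helper, not code from the slides.

```python
def knn_predict(xs, ys, x_f, k):
    """Average the y values of the k training points whose x is closest to x_f."""
    order = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x_f))
    return sum(ys[i] for i in order[:k]) / k

# Made-up training data with distinct x values (not the Boston data)
xs = [2.0, 5.0, 9.0, 14.0, 20.0, 27.0]
ys = [40.0, 31.0, 33.0, 18.0, 21.0, 9.0]
n = len(xs)

# k = 1: each training point is its own nearest neighbor
fit_k1 = [knn_predict(xs, ys, x, 1) for x in xs]

# k = n: every neighborhood is the full training set
fit_kn = [knn_predict(xs, ys, x, n) for x in xs]
y_bar = sum(ys) / n

print("k = 1:", fit_k1)
print("k = n:", [round(v, 2) for v in fit_kn], "mean of y:", round(y_bar, 2))
```

Values of k between these extremes trade off the two behaviors: small k gives a wiggly fit, large k gives a smooth one.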

  • [Figure: medv versus lstat, k = 10]

  • [Figure: medv versus lstat, k = 50]

  • [Figure: medv versus lstat, k = 100]

  • [Figure: medv versus lstat, k = 150]

  • [Figure: scatter plot]