data analytics for social science logistic regression · data analytics for social science logistic...

36
logistic regression Classification Linear regression Linear discriminant analysis Logistic regression Presentation Model fit References Data Analytics for Social Science Logistic regression Johan A. Elkink School of Politics & International Relations University College Dublin 17 October 2017

Upload: others

Post on 21-Oct-2019

5 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Data Analytics for Social Science Logistic regression · Data Analytics for Social Science Logistic regression Johan A. Elkink School of Politics & International Relations University

logisticregression

Classification

Linearregression

Lineardiscriminantanalysis

Logisticregression

Presentation

Model fit

References

Data Analytics for Social ScienceLogistic regression

Johan A. Elkink

School of Politics & International Relations

University College Dublin

17 October 2017

Page 2: Data Analytics for Social Science Logistic regression · Data Analytics for Social Science Logistic regression Johan A. Elkink School of Politics & International Relations University

logisticregression

Classification

Linearregression

Lineardiscriminantanalysis

Logisticregression

Presentation

Model fit

References

Outline

1 Classification

2 Linear regression

3 Linear discriminant analysis

4 Logistic regression

5 Presentation

6 Model fit

Page 3: Data Analytics for Social Science Logistic regression · Data Analytics for Social Science Logistic regression Johan A. Elkink School of Politics & International Relations University

logisticregression

Classification

Linearregression

Lineardiscriminantanalysis

Logisticregression

Presentation

Model fit

References

Levels of measurement

Discreet Continuous

Nominal party choice -Ordinal

education level

--

Interval

how likely to vote ... temperature

Ratio

deaths in war ideological distance

Categories in no particular order

(examples in cells)

Page 4: Data Analytics for Social Science Logistic regression · Data Analytics for Social Science Logistic regression Johan A. Elkink School of Politics & International Relations University

logisticregression

Classification

Linearregression

Lineardiscriminantanalysis

Logisticregression

Presentation

Model fit

References

Levels of measurement

Discreet Continuous

Nominal party choice -Ordinal education level -

-

Interval

how likely to vote ... temperature

Ratio

deaths in war ideological distance

Categories in a specific order

(examples in cells)

Page 5: Data Analytics for Social Science Logistic regression · Data Analytics for Social Science Logistic regression Johan A. Elkink School of Politics & International Relations University

logisticregression

Classification

Linearregression

Lineardiscriminantanalysis

Logisticregression

Presentation

Model fit

References

Levels of measurement

Discreet Continuous

Nominal party choice -Ordinal education level -

-

Interval how likely to vote ... temperatureRatio

deaths in war ideological distance

All values possible

, with a meaningful zero point

(examples in cells)

Page 6: Data Analytics for Social Science Logistic regression · Data Analytics for Social Science Logistic regression Johan A. Elkink School of Politics & International Relations University

logisticregression

Classification

Linearregression

Lineardiscriminantanalysis

Logisticregression

Presentation

Model fit

References

Levels of measurement

Discreet Continuous

Nominal party choice -Ordinal education level -

-

Interval how likely to vote ... temperatureRatio deaths in war ideological distance

All values possible, with a meaningful zero point

(examples in cells)

Page 7: Data Analytics for Social Science Logistic regression · Data Analytics for Social Science Logistic regression Johan A. Elkink School of Politics & International Relations University

logisticregression

Classification

Linearregression

Lineardiscriminantanalysis

Logisticregression

Presentation

Model fit

References

Levels of measurement

Discreet Continuous

Nominal party choice -Ordinal education level -Binary turnout -

Interval how likely to vote ... temperatureRatio deaths in war ideological distance

Two categories, coded as 0 and 1

(examples in cells)

Page 8: Data Analytics for Social Science Logistic regression · Data Analytics for Social Science Logistic regression Johan A. Elkink School of Politics & International Relations University

logisticregression

Classification

Linearregression

Lineardiscriminantanalysis

Logisticregression

Presentation

Model fit

References

Binary models

Binary models have a dependent variable consisting of twocategories.

For example,

• Vote on a particular law

• Turning out in an election

• Approval in a referendum

• Bankrupt or not

Page 9: Data Analytics for Social Science Logistic regression · Data Analytics for Social Science Logistic regression Johan A. Elkink School of Politics & International Relations University

logisticregression

Classification

Linearregression

Lineardiscriminantanalysis

Logisticregression

Presentation

Model fit

References

Classification

With linear regression, we try to:

1 estimate the relationship between two variables, and

2 predict the mean of yi based on the values in xi .

When the dependent variable is categorical, we instead try to:

1 estimate the relationship between two variables, and

2 predict the category of yi based on the values in xi .

The former is typically called regression, the latterclassification.

Note that we are still in a supervised learning context: weestimate the model to classify cases, based on observedcategorisation y.

Page 10: Data Analytics for Social Science Logistic regression · Data Analytics for Social Science Logistic regression Johan A. Elkink School of Politics & International Relations University

logisticregression

Classification

Linearregression

Lineardiscriminantanalysis

Logisticregression

Presentation

Model fit

References

Classification: example data

●●

●●

●●

−5.0

−2.5

0.0

2.5

−4 −2 0 2 4

x1

x2

For example, lets assume wehave two continuous variablesx1 and x2 and try tounderstand a classify based ona binary dependent variable y.

Page 11: Data Analytics for Social Science Logistic regression · Data Analytics for Social Science Logistic regression Johan A. Elkink School of Politics & International Relations University

logisticregression

Classification

Linearregression

Lineardiscriminantanalysis

Logisticregression

Presentation

Model fit

References

Outline

1 Classification

2 Linear regression

3 Linear discriminant analysis

4 Logistic regression

5 Presentation

6 Model fit

Page 12: Data Analytics for Social Science Logistic regression · Data Analytics for Social Science Logistic regression Johan A. Elkink School of Politics & International Relations University

logisticregression

Classification

Linearregression

Lineardiscriminantanalysis

Logisticregression

Presentation

Model fit

References

Linear regression

We could regress a dependent variable y coded as ones andzeros on X using linear regression and then apply a decisionrule that we classify observation i as 1 if yi > 0.5 and as 0otherwise.

model <- lm(y ~ x1 + x2)

predicted <- ifelse(predict(model) > 0.5, 1, 0)

The decision boundary is then a straight line, as in:

yi = β1 + β2xi1 + β3xi2 = 0.5

xi2 = −β1

β3− β2

β3xi1

Page 13: Data Analytics for Social Science Logistic regression · Data Analytics for Social Science Logistic regression Johan A. Elkink School of Politics & International Relations University

logisticregression

Classification

Linearregression

Lineardiscriminantanalysis

Logisticregression

Presentation

Model fit

References

Linear regression

●●

●●

●●

−5.0

−2.5

0.0

2.5

−4 −2 0 2 4

x1

x2

Page 14: Data Analytics for Social Science Logistic regression · Data Analytics for Social Science Logistic regression Johan A. Elkink School of Politics & International Relations University

logisticregression

Classification

Linearregression

Lineardiscriminantanalysis

Logisticregression

Presentation

Model fit

References

Outline

1 Classification

2 Linear regression

3 Linear discriminant analysis

4 Logistic regression

5 Presentation

6 Model fit

Page 15: Data Analytics for Social Science Logistic regression · Data Analytics for Social Science Logistic regression Johan A. Elkink School of Politics & International Relations University

logisticregression

Classification

Linearregression

Lineardiscriminantanalysis

Logisticregression

Presentation

Model fit

References

Linear discriminant analysis

An alternative might be to find the mean on X for each groupin y and then classify based on the nearest mean.

This is problematic, as the direction in which the means aremost distinct, there might also be a lot more overlap in thedata, e.g. x1 in this plot:

Gutierrez-Osuna, http://www.nada.kth.se/∼stefanc/DATORSEENDE AK/pr l10.pdf

Page 16: Data Analytics for Social Science Logistic regression · Data Analytics for Social Science Logistic regression Johan A. Elkink School of Politics & International Relations University

logisticregression

Classification

Linearregression

Lineardiscriminantanalysis

Logisticregression

Presentation

Model fit

References

Linear discriminant analysis

Due to the variation withineach group, a differentprojection of the means mightbe more suitable.

Linear discriminant analysis(LDA) projects the data suchthat the difference in means ismaximized and the variancesminimized.

The decision boundary is then orthogonal to this projection.

model <- lda(y ~ x1 + x2)

predicted <- predict(model)$class

Gutierrez-Osuna, http://www.nada.kth.se/∼stefanc/DATORSEENDE AK/pr l10.pdf

Page 17: Data Analytics for Social Science Logistic regression · Data Analytics for Social Science Logistic regression Johan A. Elkink School of Politics & International Relations University

logisticregression

Classification

Linearregression

Lineardiscriminantanalysis

Logisticregression

Presentation

Model fit

References

Linear discriminant analysis

●●

●●

●●

−5.0

−2.5

0.0

2.5

−4 −2 0 2 4

x1

x2

Page 18: Data Analytics for Social Science Logistic regression · Data Analytics for Social Science Logistic regression Johan A. Elkink School of Politics & International Relations University

logisticregression

Classification

Linearregression

Lineardiscriminantanalysis

Logisticregression

Presentation

Model fit

References

Linear discriminant analysis

Unlike the linear regression example, LDA can be used forcategorical variables with more than two categories.

(Hastie, Tibshirani and Friedman, 2009, 118)

Page 19: Data Analytics for Social Science Logistic regression · Data Analytics for Social Science Logistic regression Johan A. Elkink School of Politics & International Relations University

logisticregression

Classification

Linearregression

Lineardiscriminantanalysis

Logisticregression

Presentation

Model fit

References

Outline

1 Classification

2 Linear regression

3 Linear discriminant analysis

4 Logistic regression

5 Presentation

6 Model fit

Page 20: Data Analytics for Social Science Logistic regression · Data Analytics for Social Science Logistic regression Johan A. Elkink School of Politics & International Relations University

logisticregression

Classification

Linearregression

Lineardiscriminantanalysis

Logisticregression

Presentation

Model fit

References

Limited dependent variables

For binary dependent variables we might estimate theprobability of observing a one.

When a dependent variable is not continuous, or is truncatedfor some reason, a linear model would lead to implausiblepredictions, i.e. impossible probabilities.

• Prediction below 0 and above 1 would not make sense.

• For any case where the predicted probability is alreadyhigh, it cannot increase much with a change in X (andvice versa for low probabilities).

• A linear regression implies high levels ofheteroskedasticity and therefore unreliable t-tests.

Page 21: Data Analytics for Social Science Logistic regression · Data Analytics for Social Science Logistic regression Johan A. Elkink School of Politics & International Relations University

logisticregression

Classification

Linearregression

Lineardiscriminantanalysis

Logisticregression

Presentation

Model fit

References

Estimators

A typical approach is to have an estimator that is “linear in theparameters” – i.e. it generates a linear prediction based on Xand β – but then transforms this linear prediction into onebounded between 0 and 1.

−6 −4 −2 0 2 4 6

0.0

0.2

0.4

0.6

0.8

1.0

Logistic transformation

Linear prediction

Pre

dict

ed p

roba

bilit

y

Pr(Y = 1) = 0.5

Page 22: Data Analytics for Social Science Logistic regression · Data Analytics for Social Science Logistic regression Johan A. Elkink School of Politics & International Relations University

logisticregression

Classification

Linearregression

Lineardiscriminantanalysis

Logisticregression

Presentation

Model fit

References

Logistic regression

The most common transformation is the logistictransformation, which relates to the log-odds:

log

(Pr(yi = 1)

Pr(yi = 0)

)= β1 + β2xi1 + β3xi2,

which can also be formulated as:

Pr(yi = 1) =1

1 + e−(β1+β2xi1+β3xi2).

model <- glm(y ~ x1 + x2, family =

binomial(link = "logit"))

predicted <- ifelse(predict(model) > 0, 1, 0)

Page 23: Data Analytics for Social Science Logistic regression · Data Analytics for Social Science Logistic regression Johan A. Elkink School of Politics & International Relations University

logisticregression

Classification

Linearregression

Lineardiscriminantanalysis

Logisticregression

Presentation

Model fit

References

Estimating a logistic regression

Estimating a logisticregression is straightforwardand output will look similar tothat of linear regression.

E.g. explaining “Yes” in theMarriage EqualityReferendum.

Note the use of continuousand discreet independentvariables.

Age 25-34 −0.152(0.410)

35-44 −0.707∗

(0.386)45-54 −0.865∗∗

(0.390)55-64 −1.084∗∗∗

(0.399)65+ −1.857∗∗∗

(0.374)Urban 0.305∗

(0.168)Pro-abortion 0.221∗∗∗

attitude (0.028)intercept 0.358

(0.372)

N 851

Page 24: Data Analytics for Social Science Logistic regression · Data Analytics for Social Science Logistic regression Johan A. Elkink School of Politics & International Relations University

logisticregression

Classification

Linearregression

Lineardiscriminantanalysis

Logisticregression

Presentation

Model fit

References

Logistic regression

●●

●●

●●

−5.0

−2.5

0.0

2.5

−4 −2 0 2 4

x1

x2

Page 25: Data Analytics for Social Science Logistic regression · Data Analytics for Social Science Logistic regression Johan A. Elkink School of Politics & International Relations University

logisticregression

Classification

Linearregression

Lineardiscriminantanalysis

Logisticregression

Presentation

Model fit

References

Linear, logistic, and discriminant

●●

●●

●●

−5.0

−2.5

0.0

2.5

−4 −2 0 2 4

x1

x2

Page 26: Data Analytics for Social Science Logistic regression · Data Analytics for Social Science Logistic regression Johan A. Elkink School of Politics & International Relations University

logisticregression

Classification

Linearregression

Lineardiscriminantanalysis

Logisticregression

Presentation

Model fit

References

Choosing between LDA and logistic

Generally, they provide very similar results, but:

• LDA is more sensitive to outliers, logistic regression morerobust.

• LDA assumes normally distributed X, while logisticregression makes no such assumption.

• If the assumptions are satisfied, LDA is more precise, haslower estimation variance.

• If the two classes are perfectly separable, logisticregression will fail, but LDA will not.

• Logistic regression more appropriate for imbalancedcategories (e.g. 80% ones).

(Hastie, Tibshirani and Friedman, 2009, 128)

Page 27: Data Analytics for Social Science Logistic regression · Data Analytics for Social Science Logistic regression Johan A. Elkink School of Politics & International Relations University

logisticregression

Classification

Linearregression

Lineardiscriminantanalysis

Logisticregression

Presentation

Model fit

References

Stepwise and regularized

As with linear regression, stepwise variable selection approachesand regularized versions for logistic regression and discriminantanalysis exist, but this is beyond the scope of this module.

https://read01.com/jEdyx.html

Page 28: Data Analytics for Social Science Logistic regression · Data Analytics for Social Science Logistic regression Johan A. Elkink School of Politics & International Relations University

logisticregression

Classification

Linearregression

Lineardiscriminantanalysis

Logisticregression

Presentation

Model fit

References

Outline

1 Classification

2 Linear regression

3 Linear discriminant analysis

4 Logistic regression

5 Presentation

6 Model fit

Page 29: Data Analytics for Social Science Logistic regression · Data Analytics for Social Science Logistic regression Johan A. Elkink School of Politics & International Relations University

logisticregression

Classification

Linearregression

Lineardiscriminantanalysis

Logisticregression

Presentation

Model fit

References

Presentation of results

Contrary to linear regression, the coefficients for LDA orlogistic regression are not straightforward to interpret.

It is therefore useful to try to graphically represent the effect.

Note that a quick method to interpret logistic regressioncoefficients is to divide them by 4 to get the slope at the pointwhere P(y = 1|X) = 0.5.

Note that the graphical visualisation in the lab is asimplification and not really suitable for publication. Theproblem is that we present results from the simple regression(one independent variable), not the multiple regression—thelatter is more involved due to the non-linear nature of therelation between X and y.

Page 30: Data Analytics for Social Science Logistic regression · Data Analytics for Social Science Logistic regression Johan A. Elkink School of Politics & International Relations University

logisticregression

Classification

Linearregression

Lineardiscriminantanalysis

Logisticregression

Presentation

Model fit

References

Outline

1 Classification

2 Linear regression

3 Linear discriminant analysis

4 Logistic regression

5 Presentation

6 Model fit

Page 31: Data Analytics for Social Science Logistic regression · Data Analytics for Social Science Logistic regression Johan A. Elkink School of Politics & International Relations University

logisticregression

Classification

Linearregression

Lineardiscriminantanalysis

Logisticregression

Presentation

Model fit

References

Confusion matrix

Evaluating the performance of the binary model can be done byusing the confusion matrix:

True value1 0

Pre

dic

tion 1 True positive (TP) False positive (FP)

Precision: TPTP+FP

0 False negative (TN) True negative (FN)

Sensitivity: TPTP+FN Specificity: TN

FP+TN

Page 32: Data Analytics for Social Science Logistic regression · Data Analytics for Social Science Logistic regression Johan A. Elkink School of Politics & International Relations University

logisticregression

Classification

Linearregression

Lineardiscriminantanalysis

Logisticregression

Presentation

Model fit

References

Confusion matrix

Evaluating the performance of the binary model can be done byusing the confusion matrix:

True value1 0

Pre

dic

tion 1 True positive (TP) False positive (FP) Precision: TP

TP+FP

0 False negative (TN) True negative (FN)

Sensitivity: TPTP+FN Specificity: TN

FP+TN

Page 33: Data Analytics for Social Science Logistic regression · Data Analytics for Social Science Logistic regression Johan A. Elkink School of Politics & International Relations University

logisticregression

Classification

Linearregression

Lineardiscriminantanalysis

Logisticregression

Presentation

Model fit

References

Receiver Operating Characteristic curve

The accuracy of predictions will depend on the thresholdprobability—variations on default of π = 0.5 are possible.

Depending on the application, it might be better or worse toover- or underestimate ones relative to zeros.

The ROC-curve plots, for all possible thresholds, the truepositive rate against the false positive rate.

An ROC-curve further from (above) the 45 degree lineindicates a better predictive performance; any predictions underthis line indicate worse than random prediction.

Page 34: Data Analytics for Social Science Logistic regression · Data Analytics for Social Science Logistic regression Johan A. Elkink School of Politics & International Relations University

logisticregression

Classification

Linearregression

Lineardiscriminantanalysis

Logisticregression

Presentation

Model fit

References

ROC curves

True negative rate (specificity)

True

pos

itive

rat

e (s

ensi

tivity

)

1.0 0.8 0.6 0.4 0.2 0.0

0.0

0.2

0.4

0.6

0.8

1.0

●●●

Linear regressionLinear discriminant analysisLogistic regression

Page 35: Data Analytics for Social Science Logistic regression · Data Analytics for Social Science Logistic regression Johan A. Elkink School of Politics & International Relations University

logisticregression

Classification

Linearregression

Lineardiscriminantanalysis

Logisticregression

Presentation

Model fit

References

Area Under Curve (AUC)

Given the above, we can also calculate the area under theROC-curve as a measure of prediction quality.

This is somewhat related to the Gini coefficient for incomedistributions (G = 2AUC − 1).

library(pROC)

model <- lm(y ~ x1 + x2)

auc(y, predict(model))

plot(roc(y, predict(model)))

Page 36: Data Analytics for Social Science Logistic regression · Data Analytics for Social Science Logistic regression Johan A. Elkink School of Politics & International Relations University

logisticregression

Classification

Linearregression

Lineardiscriminantanalysis

Logisticregression

Presentation

Model fit

References

Hastie, Trevor, Robert Tibshirani and Jerome Friedman. 2009. The Elements of Statistical Learning: DataMining, Inference, and Prediction. 2nd edition ed. New York: Springer.