Motivation Ordered Outcomes Nominal Outcomes Event Counts Conclusion
POLS/PUAD6080: Statistical Regression with Other Categorical Outcomes
David A. Hughes, Ph.D.
Auburn University at Montgomery
April 20, 2020
Overview
1 Motivation
2 Ordered Outcomes
3 Nominal Outcomes
4 Event Counts
5 Conclusion
Categorical data and regression analysis
• Last week, we examined binary dependent variables and discussed why the CLRM was inappropriate.
• This week, we extend our analysis to other types of categorical dependent variables.
• These include ordered, nominal, and event count outcomes.
• What are these, and how could we think about regressions given their levels of measurement?
Ordered Logit/Probit
• Start with a latent variable such that:

Y*_i = µ + u_i.

• Similar to how we motivated the binary response model, suppose that:

Y_i = j if τ_{j−1} ≤ Y*_i < τ_j, j ∈ {1, …, J}.

• Therefore, Y has J ordered outcome categories and J − 1 “cutpoints” (τ).
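As a minimal sketch of this setup (illustrative only, and not part of the course materials), the following Python snippet maps a latent value Y* into one of J ordered categories; the cutpoint values are made up for the example:

```python
import math

def latent_to_category(y_star, cutpoints):
    """Map a latent value Y* to an ordered category j in {1, ..., J}.

    Category j is assigned when tau_{j-1} <= Y* < tau_j, with
    tau_0 = -inf and tau_J = +inf implied.
    """
    taus = [-math.inf] + sorted(cutpoints) + [math.inf]
    for j in range(1, len(taus)):
        if taus[j - 1] <= y_star < taus[j]:
            return j

# Three hypothetical cutpoints -> four ordered categories
cuts = [-0.87, 0.37, 1.41]
print(latent_to_category(-2.0, cuts))  # 1
print(latent_to_category(0.0, cuts))   # 2
print(latent_to_category(0.5, cuts))   # 3
print(latent_to_category(3.0, cuts))   # 4
```

Note that only the ordering of the cutpoints matters: any latent value falling between the same pair of τs gets the same category.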
Estimating the ordered logit/probit
• We can express the probability of any particular discrete outcome on Y as:

Pr(Y_i = j) = Pr(τ_{j−1} ≤ Y*_i < τ_j)
            = Pr(τ_{j−1} ≤ µ + u_i < τ_j).

• With minimal assumptions, we can substitute X_iβ for µ:

µ_i = X_iβ.
Estimating the ordered logit/probit (cont’d.)
• We can rewrite the above equations as:

Pr(Y_i = j | X, β) = Pr(τ_{j−1} ≤ Y*_i < τ_j | X)
                   = Pr(τ_{j−1} ≤ X_iβ + u_i < τ_j)
                   = Pr(τ_{j−1} − X_iβ ≤ u_i < τ_j − X_iβ)
                   = ∫_{−∞}^{τ_j − X_iβ} f(u_i) du − ∫_{−∞}^{τ_{j−1} − X_iβ} f(u_i) du
                   = F(τ_j − X_iβ) − F(τ_{j−1} − X_iβ),

where f is the density for u, and F is the corresponding cdf.

• The intuition is that we “cut” the density at different points, and the probability of a given observation receiving the value of Y associated with this interval is simply the area under the density curve between those points.
Estimating the ordered logit/probit (cont’d.)
• If we make certain distributional assumptions about the error terms, then we’re well on our way to forming a likelihood. We’re either going to assume that errors are distributed according to the logistic or standard normal distributions.

• Proceeding with the standard normal, and assuming we have a three-outcome DV:

Pr(Y_i = 1) = Φ(τ_1 − X_iβ) − 0
Pr(Y_i = 2) = Φ(τ_2 − X_iβ) − Φ(τ_1 − X_iβ)
Pr(Y_i = 3) = 1 − Φ(τ_2 − X_iβ).
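The three-category probabilities above can be computed directly from the standard normal cdf, available in the Python standard library via the error function. A minimal sketch, where the value of X_iβ and the two cutpoints are hypothetical:

```python
import math

def norm_cdf(x):
    """Standard normal cdf, Phi(x), via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def ordered_probit_probs(xb, cutpoints):
    """Pr(Y = j) = Phi(tau_j - Xb) - Phi(tau_{j-1} - Xb) for each category."""
    taus = [-math.inf] + sorted(cutpoints) + [math.inf]
    probs = []
    for j in range(1, len(taus)):
        upper = 1.0 if math.isinf(taus[j]) else norm_cdf(taus[j] - xb)
        lower = 0.0 if math.isinf(taus[j - 1]) else norm_cdf(taus[j - 1] - xb)
        probs.append(upper - lower)
    return probs

# Hypothetical linear predictor and cutpoints; J = 3 outcomes
probs = ordered_probit_probs(xb=0.3, cutpoints=[-0.5, 0.8])
print([round(p, 3) for p in probs])  # three probabilities summing to 1
```

Swapping the normal cdf for the logistic cdf, 1/(1 + exp(−x)), would give the ordered logit version of the same calculation.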
What about the intercept?
• We often think of the τs in ordered models as being a series of “intercepts.”

• In the binary model, the intercept tells us the probability that Y = 1 | X_i = 0, that is, the probability of being in either category of Y.

• An identification problem occurs if we try to estimate the intercept and all J − 1 cutpoints.
Estimating Ordered Logit/Probit in Stata
• Estimating an ordered logit (ologit) or ordered probit (oprobit) involves basically the same syntax in Stata, so I will only present the ordered probit.

• Basic syntax:

oprobit depvar [indepvars] [if] [in] [weight] [, options]

• The if, in, and weight commands are the same as with the standard logit and probit models.

• The option commands are very similar as well, with options for estimation of the standard errors being the most commonly used (use help oprobit for a full list of options).
Example Output
. ologit afterlife age attendance partyid protestant catholic jewish
Iteration 0: log likelihood = -1176.262
Iteration 1: log likelihood = -1080.5213
Iteration 2: log likelihood = -1078.1519
Iteration 3: log likelihood = -1078.1468
Iteration 4: log likelihood = -1078.1468
Ordered logistic regression Number of obs = 1,068
LR chi2(6) = 196.23
Prob > chi2 = 0.0000
Log likelihood = -1078.1468 Pseudo R2 = 0.0834
------------------------------------------------------------------------------
afterlife | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | .0047425 .0036268 1.31 0.191 -.0023659 .011851
attendance | -.243044 .0267099 -9.10 0.000 -.2953945 -.1906935
partyid | -.0630709 .0332457 -1.90 0.058 -.1282312 .0020895
protestant | -.8593461 .1643058 -5.23 0.000 -1.18138 -.5373126
catholic | -.5447963 .1764368 -3.09 0.002 -.890606 -.1989865
jewish | .2911736 .458702 0.63 0.526 -.6078657 1.190213
-------------+----------------------------------------------------------------
/cut1 | -.8667024 .2075344 -1.273462 -.4599425
/cut2 | .3661047 .2063156 -.0382665 .7704759
/cut3 | 1.408207 .2170333 .9828294 1.833584
------------------------------------------------------------------------------
Example Output Using Margins
• When compared to logit models, the margins command for ordered models will produce a much larger matrix due to the increased number of outcome categories.

• Even in a simple case, with 4 outcome categories and a single binary explanatory variable, the margins results can be somewhat confusing to interpret.

• The variables in the following examples are:
  • A four-category variable on belief in the afterlife (firm belief to none)
  • A binary measure of whether a respondent is Protestant (1 if yes, 0 else)
  • A 7-point scale of a respondent’s party id (strong D to strong R)
Example Output Using Margins
. margins protestant
Predictive margins Number of obs = 1,073
Model VCE : OIM
1._predict : Pr(afterlife==1), predict(pr outcome(1))
2._predict : Pr(afterlife==2), predict(pr outcome(2))
3._predict : Pr(afterlife==3), predict(pr outcome(3))
4._predict : Pr(afterlife==4), predict(pr outcome(4))
-------------------------------------------------------------------------------------
| Delta-method
| Margin Std. Err. z P>|z| [95% Conf. Interval]
--------------------+----------------------------------------------------------------
_predict#protestant |
1 0 | .475419 .0205397 23.15 0.000 .4351618 .5156761
1 1 | .7031778 .0198939 35.35 0.000 .6641865 .742169
2 0 | .2573557 .0150527 17.10 0.000 .2278529 .2868585
2 1 | .1746949 .0126143 13.85 0.000 .1499714 .1994185
3 0 | .1457501 .0129964 11.21 0.000 .1202776 .1712225
3 1 | .0721371 .0080107 9.01 0.000 .0564365 .0878377
4 0 | .1214752 .0123616 9.83 0.000 .0972469 .1457036
4 1 | .0499902 .0065168 7.67 0.000 .0372174 .062763
-------------------------------------------------------------------------------------
Example Output Using Margins

• A better approach with continuous and quasi-continuous variables may be to evaluate predicted probabilities from the minimum to maximum values, where theoretically appropriate.
[Figure: predicted probabilities of each answer to “Is there an afterlife?” (Yes, Definitely; Yes, Probably; No, Probably Not; No, Definitely Not) for Non-Protestants vs. Protestants.]
Example output using margins
[Figure: predicted probabilities of each answer to “Is there an afterlife?” across the 7-point party identification scale, from Strong Democrat to Strong Republican.]
Example output using margins
[Figure: predicted probabilities of each answer to “Is there an afterlife?” across church attendance, from Never to More Than Once a Week.]
Motivating the model
• Consider a set of N individuals, i ∈ {1, 2, …, N}, with dependent variable Y_i that takes on J unordered responses.

• Let Pr(Y_i = j) = P_ij and note that Σ_{j=1}^J P_ij = 1. That is, every i is required to make at least some choice in J.

• Naturally, we want to allow P_ij to vary as a function of some k independent variable(s), X_i, indexed by a k × 1 vector of parameters specific to that outcome, β_j.
Motivating the model (cont’d.)
• As before, we’ll make use of the exponential function:

P_ij = exp(X_iβ_j).

• However, Σ_{j=1}^J P_ij ≠ 1, which it must be. Therefore, we rescale P_ij by dividing each by the sum of all P_ijs:

Pr(Y_i = j) ≡ P_ij = exp(X_iβ_j) / Σ_{j=1}^J exp(X_iβ_j).  (1)
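Equation 1 is just a rescaling of exponentiated linear predictors. A minimal Python sketch, where the linear-predictor values X_iβ_j are hypothetical:

```python
import math

def mnl_probs(xb_by_outcome):
    """Multinomial logit: rescale exp(X_i b_j) so the probabilities sum to one."""
    exps = [math.exp(v) for v in xb_by_outcome]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical linear predictors X_i b_j for three outcomes
probs = mnl_probs([0.2, -1.0, 0.5])
print([round(p, 3) for p in probs])

# Adding the same constant to every linear predictor leaves the
# probabilities unchanged -- the identification problem noted below.
shifted = mnl_probs([1.2, 0.0, 1.5])
```

The `shifted` call illustrates why infinitely many β_j vectors produce identical probabilities.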
Motivating the model (cont’d.)
• Equation 1 helps us to express what we will term the multinomial logit (MNL).

• Unfortunately, as specified, Equation 1 is unidentified.

• That is, there are an infinite set of β_js that will render identical sets of probabilities.

• This problem is similar to what we encountered with the ordinal logit/probit vis-à-vis the constant term.
Motivating the model (cont’d.)
• To address the identification problem, we constrain the parameters for one of the outcomes, J, to zero, making that category the baseline for comparison to other outcomes.

• If we omit the first category, then Equation 1 changes as such:

Pr(Y_i = 1) = 1 / (1 + Σ_{j=2}^J exp(X_iβ′_j)),

where β′_j represents the rescaled influence of the various Xs on a given outcome, relative to Pr(Y_i = 1).

• We express the probability of the other J − 1 alternatives as:

Pr(Y_i = j) = exp(X_iβ′_j) / (1 + Σ_{j=2}^J exp(X_iβ′_j)).
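The baseline-constrained probabilities can be sketched the same way: fixing the first outcome's parameters at zero puts exp(0) = 1 in the denominator. The linear-predictor values below are hypothetical:

```python
import math

def mnl_probs_baseline(xb_others):
    """MNL probabilities with the first outcome's parameters fixed at zero.

    xb_others holds X_i b'_j for outcomes j = 2, ..., J; outcome 1 is the
    baseline, contributing exp(0) = 1 to the denominator.
    """
    denom = 1.0 + sum(math.exp(v) for v in xb_others)
    return [1.0 / denom] + [math.exp(v) / denom for v in xb_others]

# Hypothetical rescaled predictors for outcomes 2 and 3
probs = mnl_probs_baseline([-1.2, 0.3])
print([round(p, 3) for p in probs])
```

Because only differences in linear predictors matter, these are the same probabilities the unconstrained form would give after subtracting the baseline's predictor from every outcome's predictor.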
Interpreting MLE coefficients for the MNL
• MLE yields separate βs for each of the alternatives (except the baseline, which is omitted as its parameters are set to zero).

• Coefficients on given covariates reflect the change in the log-odds of a given outcome, relative to the omitted category.
An example: Judicial departures
• Let’s look at data on the choices elected judges make over their employment.

• In a given year, a judge can (1) stay on the bench, (2) retire for good, or (3) retire for some other work.

• For now, we’ll just consider the lone covariate, “Election Year,” which is a dummy variable for whether the judge is currently up for election.
MNL output for judicial departures
. mlogit risk i.elecyr, base(0)
Multinomial logistic regression Number of obs = 1,992
LR chi2(2) = 93.92
Prob > chi2 = 0.0000
Log likelihood = -543.69618 Pseudo R2 = 0.0795
------------------------------------------------------------------------------
risk | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
0 | (base outcome)
-------------+----------------------------------------------------------------
1 |
1.elecyr | -15.41396 469.3825 -0.03 0.974 -935.3867 904.5587
_cons | -2.605912 .0970537 -26.85 0.000 -2.796133 -2.41569
-------------+----------------------------------------------------------------
2 |
1.elecyr | 2.814694 .4362368 6.45 0.000 1.959686 3.669703
_cons | -5.396386 .3788515 -14.24 0.000 -6.138921 -4.653851
------------------------------------------------------------------------------
Analyzing the output
• Note a few things:
  • First, “0” (staying on the bench) is the omitted category.
  • Second, the coefficient on election year is negative on outcome “1” (retire for good) but positive on outcome “2” (seek other employment).

• These results would change if we changed the baseline category.

• Nevertheless, model fit statistics would not.
Results with a different baseline category
. mlogit risk i.elecyr, base(2)
Multinomial logistic regression Number of obs = 1,992
LR chi2(2) = 93.92
Prob > chi2 = 0.0000
Log likelihood = -543.69618 Pseudo R2 = 0.0795
------------------------------------------------------------------------------
risk | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
0 |
1.elecyr | -2.814694 .4362368 -6.45 0.000 -3.669703 -1.959686
_cons | 5.396386 .3788515 14.24 0.000 4.653851 6.138921
-------------+----------------------------------------------------------------
1 |
1.elecyr | -18.22866 469.3826 -0.04 0.969 -938.2017 901.7444
_cons | 2.790474 .3894259 7.17 0.000 2.027214 3.553735
-------------+----------------------------------------------------------------
2 | (base outcome)
------------------------------------------------------------------------------
Presenting results from the MNL (cont’d.)
• The margins command in Stata is going to be a friend of ours.

• Returning to the previous example, let’s include a few additional covariates (remember the unit of analysis is a judge-year).

• Let’s estimate the likelihood of a given retirement strategy given covariates: Election Year, Workload, Vested, and Age, and let’s interact age with vested.
Model results

Multinomial logistic regression                 Number of obs     =      1,932
LR chi2(10) = 187.21
Prob > chi2 = 0.0000
Log likelihood = -482.84756 Pseudo R2 = 0.1624
------------------------------------------------------------------------------
risk | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
0 | (base outcome)
-------------+----------------------------------------------------------------
1 |
1.elecyr | -15.49575 600.7121 -0.03 0.979 -1192.87 1161.878
workload | -.0000559 .0001426 -0.39 0.695 -.0003354 .0002237
1.vested | -5.669977 2.338391 -2.42 0.015 -10.25314 -1.086814
age | .0403759 .0288213 1.40 0.161 -.0161129 .0968647
|
vested#c.age |
1 | .1031974 .0386273 2.67 0.008 .0274893 .1789055
|
_cons | -5.601525 1.608329 -3.48 0.000 -8.753792 -2.449258
-------------+----------------------------------------------------------------
2 |
1.elecyr | 2.899413 .4652694 6.23 0.000 1.987501 3.811324
workload | .000246 .0002613 0.94 0.346 -.0002661 .0007581
1.vested | -14.76489 6.873756 -2.15 0.032 -28.2372 -1.292574
age | .0310025 .0325234 0.95 0.340 -.0327422 .0947472
|
vested#c.age |
1 | .2157043 .1047291 2.06 0.039 .0104391 .4209695
|
_cons | -7.402557 1.843319 -4.02 0.000 -11.0154 -3.789718
------------------------------------------------------------------------------
Estimating predicted probabilities
• Let’s look at the interaction effect between age and vested.
• margins vested, at(age=(30(10)80))
• marginsplot
Graphed predicted probabilities
[Figure: Predictive Margins of vested with 95% CIs. Predicted probability (0 to 1) of each outcome (0, 1, 2) by age (30 to 80), separately for vested = 0 and vested = 1.]
The Poisson process
• A good way to think about an event count outcome is to think of it as events that occur over time.

• Let λ denote the constant rate at which events occur; this could be the expected number of events in a given period of length h.

• Then the probability that an event occurs in a given interval is λh, and the probability it does not occur is 1 − λh.
The Poisson process (cont’d.)
• Let Y_t reflect the number of events that occur in the interval t of length h.

• The probability that the number of events that occur in (t, t + h] is equal to some value y ∈ {0, 1, 2, 3, …} is:

Pr(Y_t = y) = exp(−λh)(λh)^y / y!.  (2)

• And if all the intervals are of equal length 1, Equation (2) becomes:

Pr(Y_t = y) = exp(−λ)λ^y / y!.  (3)
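Equation (3) is easy to evaluate numerically. The sketch below (with an illustrative λ) also confirms that the distribution's mean and variance both equal λ:

```python
import math

def poisson_pmf(y, lam):
    """Pr(Y = y) = exp(-lambda) * lambda**y / y!, for y = 0, 1, 2, ..."""
    return math.exp(-lam) * lam**y / math.factorial(y)

lam = 2.5  # illustrative rate
ys = range(50)  # truncation point; the tail beyond is negligible for lam = 2.5
probs = [poisson_pmf(y, lam) for y in ys]

mean = sum(y * p for y, p in zip(ys, probs))
var = sum((y - mean) ** 2 * p for y, p in zip(ys, probs))
print(round(mean, 6), round(var, 6))  # both approximately lambda = 2.5
```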
The Poisson distribution
• Equations (2) and (3) give the Poisson distribution.

• Critically, we assume that values of Y_t arrive at a constant rate (λ) and are independent across draws from the distribution.

• The parameter λ is interpreted as a rate or the expected number of events during a given period, t. That is, E(Y) = λ.

• As λ increases:
  • The mean of the distribution gets bigger (shockingly enough).
  • The variance of the distribution gets larger too, and it turns out that E(Y) = Var(Y) = λ.
  • The distribution approaches a normal distribution (relevant for deciding between MLE and OLS).
Example: The Chief Justice’s Year-End Report
• As a running example, consider the number of requests the Chief Justice of the United States makes to Congress in his annual Year-End Report.

• Consider the following covariates: (1) Public approval of the Court, (2, 3, 4) The Court’s ideological distance to the House, Senate, and President, and (5) The number of years the chief justice has held his position.
Requests for institutional reform
[Figure: histogram of the total number of requests (0 to 15), by frequency.]
Poisson in Stata
• The standard framework for Poisson regressions follows what ought now to feel like a pretty familiar pattern:

poisson y x1 x2 ... xk [if...] [, options]
Poisson results
. poisson requests_total gss_conf house sen pres cjtenurebyte
Iteration 0: log likelihood = -95.624953
Iteration 1: log likelihood = -95.432778
Iteration 2: log likelihood = -95.432711
Iteration 3: log likelihood = -95.432711
Poisson regression Number of obs = 39
LR chi2(5) = 21.63
Prob > chi2 = 0.0006
Log likelihood = -95.432711 Pseudo R2 = 0.1018
--------------------------------------------------------------------------------
requests_total | Coef. Std. Err. z P>|z| [95% Conf. Interval]
---------------+----------------------------------------------------------------
gss_conf | -.0157822 .0292324 -0.54 0.589 -.0730767 .0415124
house_dist | -2.289797 1.082493 -2.12 0.034 -4.411444 -.1681492
sen_dist | .6036195 .852443 0.71 0.479 -1.067138 2.274377
pres_dist | -1.951108 .7159464 -2.73 0.006 -3.354337 -.5478791
cjtenurebyte | .0442441 .0179891 2.46 0.014 .0089861 .0795022
_cons | 2.802411 .5752195 4.87 0.000 1.675001 3.92982
--------------------------------------------------------------------------------
Using Stata to recover count estimates
• Stata’s margins command will once again prove quite useful.

• For example, to estimate the predicted number of requests for institutional reform in a given year by a chief justice’s tenure in office across a 15-year period: margins, at(cjtenurebyte=(0(3)15)).

• We can then use the marginsplot command to recover a graphical presentation of this relationship.
Exposure and offsets in Poisson models
• We’ve been modeling the number of outcomes in a given period so far.

• But what if there never were any opportunities during a given period for a non-zero outcome to occur?

• For example, if I were to model the number of congressional acts the Supreme Court invalidated in a given year, it might be pertinent to know if no congressional acts were even reviewed.
Exposure and offsets in Poisson models (cont’d.)
• We need to account for the exposure term, and the easiest way to do this is to include M_i as an “offset” in the model:

λ_i = exp[X_iβ + ln(M_i)],

which constrains the effect of the offset to a coefficient of 1.

• In Stata, we use the exposure option when estimating a Poisson.

• We could also include M_i as a covariate and model its effect on E(Y_i) directly, examining its coefficient to see how close to one it really is.
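The offset's role is easiest to see by noting that exp[X_iβ + ln(M_i)] = M_i · exp(X_iβ): the expected count scales proportionally with exposure. A small sketch with hypothetical values:

```python
import math

def poisson_rate_with_offset(xb, exposure):
    """lambda_i = exp(X_i b + ln(M_i)) = M_i * exp(X_i b).

    The offset ln(M_i) enters with an implied coefficient of 1, so the
    expected count is proportional to the exposure M_i.
    """
    return math.exp(xb + math.log(exposure))

# Doubling exposure doubles the expected count, holding X_i b fixed
lam1 = poisson_rate_with_offset(xb=0.4, exposure=10)
lam2 = poisson_rate_with_offset(xb=0.4, exposure=20)
print(round(lam2 / lam1, 6))  # 2.0
```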
Some problems with Poisson
• We’ve made strong assumptions in setting up the Poisson model.

• First, we required that the probability of some event occurring is constant within a given period. And second, we required that the probability of some event occurring was independent of other events during the same period.

• But what if these assumptions were violated?
Contagion and dispersion
• Suppose I count the total number of leaves on my hydrangea over four periods: winter, spring, summer, and fall.

• How likely is it that the rate of occurrence (λ) is constant across all four seasons?

• Because we find that observing one leaf increases the likelihood of observing another, we have a “positive contagion.”

• This increases the variance of the observed counts, which is bad mojo when we have assumed that E(Y) = Var(Y) = λ, and leads to a problem known as “overdispersion.”
Testing for over-dispersion
• We can test whether we have over/under-dispersion in our data.

• The easiest way to do this is by running a negative binomial regression and checking the statistical significance of the dispersion parameter, α.
Addressing overdispersion
• If we have problems with dispersion, it makes sense to just go ahead and model it directly rather than to rely upon inadequate results from a Poisson.

• Dropping the assumption that λ is a constant, we can instead treat it as a random variable:

E(Y_i) ≡ λ_i = exp(X_iβ + u_i)
             = exp(X_iβ) exp(u_i)
             = λ_iν_i.  (4)

• All that’s left now is to specify a distribution on u_i. We usually use the Gamma distribution.
The negative binomial distribution
• If ν_i is assumed to be randomly distributed according to a one-parameter Gamma distribution with mean E(ν) = 1 and variance Var(ν) = α, then the marginal density of Y is said to be negative binomial:

Pr(Y_i = y | λ_i, α) = [Γ(α^{−1} + y) / (Γ(α^{−1}) Γ(y + 1))] · [α^{−1} / (α^{−1} + λ_i)]^{α^{−1}} · [λ_i / (λ_i + α^{−1})]^y,  (5)

where Γ is the gamma function.

• We model λ_i = exp(X_iβ), which has E(Y) = λ and Var(Y) = λ(1 + αλ), where α > 0.

• Note that as α → 0, the negative binomial reduces to the Poisson.
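Equation (5) is best evaluated on the log scale for numerical stability, using the log-gamma function. The sketch below uses illustrative values of λ and α, and also checks that the negative binomial pmf approaches the Poisson pmf as α shrinks:

```python
import math

def negbin_pmf(y, lam, alpha):
    """Negative binomial Pr(Y = y) with mean lam and Var = lam * (1 + alpha*lam)."""
    inv_a = 1.0 / alpha
    log_p = (
        math.lgamma(inv_a + y) - math.lgamma(inv_a) - math.lgamma(y + 1)
        + inv_a * math.log(inv_a / (inv_a + lam))
        + y * math.log(lam / (lam + inv_a))
    )
    return math.exp(log_p)

def poisson_pmf(y, lam):
    return math.exp(-lam) * lam**y / math.factorial(y)

# As alpha shrinks toward zero, the NB pmf converges to the Poisson pmf
for alpha in (1.0, 0.1, 0.001):
    print(alpha, round(negbin_pmf(3, 2.5, alpha), 6))
print("poisson", round(poisson_pmf(3, 2.5), 6))
```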
Negative binomial regressions in Stata
• To estimate the negative binomial in Stata, we simply replace poisson with nbreg.

• We really want to pay attention to the value of α and ln(α).

• If we find statistically significant evidence that α ≠ 0, then we likely shouldn’t be using Poisson, as our data are overdispersed.
Example with data
Negative binomial regression Number of obs = 39
LR chi2(5) = 12.69
Dispersion = mean Prob > chi2 = 0.0264
Log likelihood = -93.718196 Pseudo R2 = 0.0634
--------------------------------------------------------------------------------
requests_total | Coef. Std. Err. z P>|z| [95% Conf. Interval]
---------------+----------------------------------------------------------------
gss_conf | -.0165893 .0353904 -0.47 0.639 -.0859531 .0527746
house_dist | -2.168843 1.319143 -1.64 0.100 -4.754315 .4166289
sen_dist | .6064278 1.078424 0.56 0.574 -1.507245 2.7201
pres_dist | -1.896643 .8535061 -2.22 0.026 -3.569484 -.2238019
cjtenurebyte | .0444888 .0222564 2.00 0.046 .0008671 .0881106
_cons | 2.76428 .6811576 4.06 0.000 1.429236 4.099324
---------------+----------------------------------------------------------------
/lnalpha | -2.384044 .7152737 -3.785954 -.982133
---------------+----------------------------------------------------------------
alpha | .0921771 .0659318 .0226872 .3745114
--------------------------------------------------------------------------------
LR test of alpha=0: chibar2(01) = 3.43 Prob >= chibar2 = 0.032
Assessing the model’s results
• We see from the results that the data appear to be overdispersed. When the chief justice asks for one form of institutional reform, he’s likely to request even more.

• Also note what happened to the estimated effect of the CJ’s distance to the House. When we didn’t account for overdispersion, we found the CJ catering to House preferences, but under the negative binomial, we’re unable to reject the null that House members’ preferences are unrelated to the CJ’s requests for institutional reform.
Discussion
• In this unit, we analyzed three additional statistical regression models useful for the examination of categorical dependent variables.

• We learned that the ordered logit/probit is appropriate when the DV has a level of measurement that is ordinal.

• The multinomial logit is appropriate when the DV has a level of measurement that is nominal.

• And finally, we learned that the Poisson or negative binomial is appropriate when our DV is an event count.