beyond the two-part model: methods for handling truncated and skewed dependent variables eric p....

Beyond the Two-Part Model: Methods for Handling Truncated and Skewed Dependent Variables

Eric P. Slade, PhD (1, 2)

1. VISN5 Capitol Network Mental Illness Research and Education Clinical Center (VISN5 MIRECC), Baltimore, MD

2. University of Maryland, School of Medicine, Baltimore, MD

Background

Dependent variables with truncated and/or skewed distributions are common in health economics.

Health care expenditures or costs

Scores on symptom or physical function scales

Number of service visits or days of care

Background

Distributions often have mass points at zero and long right tails.

[Figure: histogram of y5. Vertical axis: Percent (0 to 40); horizontal axis: y5 (0 to 5000). Label: Normal Distribution.]

Background

What’s the Objective?

To estimate a marginal effect (M.E.) or an elasticity, often evaluated at the means of all covariates.

$$\left. \frac{\partial E(y \mid x)}{\partial x_j} \right|_{x = \bar{x}}$$

Background

Other evaluation points (besides the mean) might also be of interest.

For example, we might want to know the effect on the expenditures of low-income consumers if their co-payment changed from $0 to $40.

Background

What functional form should I use to estimate these effects?

Specification for Y

Mass points create non-linearities.

Skewness of Y may cause violation of the homoskedasticity assumption.

Specification for the x’s

Over the years, several estimators have been proposed.

Background

Tobit

Assumes Y* is normally distributed but only partially observed.

Observed Y is truncated/censored when Y* < c.

Coefficients may be biased if the normality or homoskedasticity assumption fails.

The same linear index (X’b) characterizes all Y values.
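To make the Tobit assumptions concrete, here is a minimal Python sketch of the conditional mean and marginal effect they imply. It assumes censoring from below at c = 0 (so Y = max(0, Y*)) and Y* ~ N(X’b, sigma^2). The function names and numeric inputs are illustrative and do not come from the slides.

```python
import math


def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def tobit_mean(xb, sigma):
    """E(Y | x) when Y = max(0, Y*) and Y* ~ N(xb, sigma^2).

    Assumption: censoring point c = 0.
    """
    z = xb / sigma
    return norm_cdf(z) * xb + sigma * norm_pdf(z)


def tobit_marginal_effect(beta_j, xb, sigma):
    """dE(Y | x)/dx_j under the same assumptions: beta_j scaled by Pr(Y* > 0)."""
    return beta_j * norm_cdf(xb / sigma)


# Illustrative values only: index xb = 0, sigma = 1, coefficient beta_j = 2.
mean_at_zero_index = tobit_mean(0.0, 1.0)
effect_at_zero_index = tobit_marginal_effect(2.0, 0.0, 1.0)
```

Because one index X’b drives both the probability of a positive outcome and its expected size, the marginal effect is always the coefficient scaled by a probability between 0 and 1. That restriction is what the two-part and CDE approaches relax.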

Background

2-Part Model

Addresses mass point and non-linearity at Y=0.

Part 1: Use probit to model whether (or not) Y>0.

Part 2: Use ordinary least squares (OLS) to model Y conditional on Y>0

Untransformed Y

Log-transformed ln(Y)
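The two parts combine into a single conditional mean. As a sketch, assume Y is non-negative, so that all of E(Y | x) comes from positive outcomes. Φ (the standard normal CDF) and γ (the part 1 probit coefficients) are notation introduced here, not taken from the slides:

$$E(Y \mid x) = \Pr(Y > 0 \mid x) \times E(Y \mid Y > 0, x) = \Phi(x'\gamma) \times E(Y \mid Y > 0, x)$$

With untransformed Y in part 2, the second factor is the OLS prediction. With ln(Y), the prediction must first be retransformed to the Y scale.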

Background

2-Part model (cont.)

Heteroskedasticity can result in bias.

Even log-transformed estimates of E(Y) can be imprecise.

Also, estimated M.E.’s apply to log-scaled Y, requiring retransformation of variance to the untransformed Y scale.

See Mullahy JHE 17(3), 1998; Manning JHE 17(3), 1998.
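The retransformation issue can be illustrated with a small Python sketch. It uses Duan's smearing factor, which is one common retransformation and is not named on the slides. It relies on homoskedastic log-scale errors, which is exactly the assumption the slide warns about. The data and function names are hypothetical.

```python
import math


def ols_simple(x, y):
    """Intercept and slope of a one-regressor OLS fit."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def fit_log_ols_with_smearing(x, y):
    """Fit ln(y) = b0 + b1*x by OLS and compute the smearing factor.

    The smearing factor is the sample mean of exp(residual); it corrects
    exp(b0 + b1*x), which otherwise under-predicts E(y | x).
    """
    log_y = [math.log(v) for v in y]
    b0, b1 = ols_simple(x, log_y)
    residuals = [ly - (b0 + b1 * xi) for xi, ly in zip(x, log_y)]
    smear = sum(math.exp(r) for r in residuals) / len(residuals)
    return b0, b1, smear


# Hypothetical positive expenditures for two groups (x = 0 and x = 1).
x = [0.0, 0.0, 1.0, 1.0]
y = [1.0, 4.0, 2.0, 8.0]

b0, b1, smear = fit_log_ols_with_smearing(x, y)
naive_prediction = math.exp(b0)            # ignores retransformation
smeared_prediction = math.exp(b0) * smear  # retransformed prediction at x = 0
```

In this toy data the observed mean of y at x = 0 is 2.5. The naive prediction falls below it, and the smeared prediction recovers it.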

Background

Generalized Linear Model (GLM) with a “log link” (Manning and Mullahy, JHE 20(4), 2001)

Either a 2-part model with probit for part 1, or a 1-part model using only observations with y>0.

ln[E(Y | X)] = Xβ, Y>0.

No “re-transformation” of the variance required.

More precise than OLS when the data are generated by the assumed distribution.

Background

One drawback to GLM is that E(Y | X) is non-linear in parameters.

How should we specify the X’s?

Imprecision (large standard errors) can also be a problem if the distribution generating the data is misspecified.
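The non-linearity shows up directly in the marginal effect. A short derivation, assuming the index Xβ is linear in x_j with no polynomial or interaction terms:

$$E(Y \mid X) = \exp(X\beta) \quad \Rightarrow \quad \frac{\partial E(Y \mid X)}{\partial x_j} = \beta_j \exp(X\beta)$$

The effect of x_j therefore varies with every covariate through exp(Xβ), which is why both the evaluation point and the specification of the X’s matter.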

CDE Intro

Conditional Density Estimation (CDE)

Gilleskie, D.B. and Mroz, T.A. 2004. A flexible approach for estimating the effects of covariates on health expenditures. J. Health Econ, 23(2), pp. 217-420.

Let the data tell you the model!

CDE allows us to construct a non-linear empirical approximation to:

$$E(Y \mid x) = \int y \, f(y \mid x) \, dy$$

CDE Intro

The CDE approximates E(Y | x) with:

$$E(Y \mid x) \approx \sum_{k=1}^{K} E[\, y \mid y_{k-1} \le Y < y_k, K \,] \times p[\, y_{k-1} \le Y < y_k \mid x \,]$$

First factor: the approximate value of y drawn from the kth interval of Y. Does not depend on x.

Second factor: the approximate prob. that y is from the kth interval of Y. Depends on x.

• K is the total number of intervals supporting the distribution of Y.
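The approximation is a probability-weighted sum, which a few lines of Python can illustrate. The interval values and probabilities below are made-up numbers for two hypothetical covariate profiles, and the function name is an assumption.

```python
def cde_expected_value(interval_values, interval_probs):
    """Sum over intervals of (representative y value) x (probability given x)."""
    if len(interval_values) != len(interval_probs):
        raise ValueError("need exactly one probability per interval")
    if abs(sum(interval_probs) - 1.0) > 1e-9:
        raise ValueError("interval probabilities must sum to 1")
    return sum(h * p for h, p in zip(interval_values, interval_probs))


# Hypothetical K = 4 intervals; the first is the mass point at zero.
# The representative values do not depend on x ...
h = [0.0, 150.0, 900.0, 4000.0]

# ... while the interval probabilities do.
probs_profile_a = [0.50, 0.30, 0.15, 0.05]
probs_profile_b = [0.30, 0.30, 0.25, 0.15]

ey_a = cde_expected_value(h, probs_profile_a)
ey_b = cde_expected_value(h, probs_profile_b)
```

Covariates change E(Y | x) only by shifting probability mass across intervals, which is what lets the effect of x differ across the distribution of Y.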

CDE Method

Conceptually, 4 steps are required to construct the approximation.

Step 1: Discretize the support of Y into K intervals.

Probability attached to interval k: $p[y_{k-1} \le Y < y_k \mid x]$

CDE Method

K trades off smoothness against fit.

As K ↑, the estimated function better predicts observed values y (i.e., better fit).

However, higher K means fewer data points available for estimation in each interval k.

Potential over-fitting

Some intervals may have relatively large standard errors.

CDE Method

4 steps (cont.)

Step 2: Estimate the conditional density:

$p[y_{k-1} \le Y < y_k \mid x]$, for all k = 1 to K.

By Bayes’ Rule:

$$p[y_{k-1} \le Y < y_k \mid x] = p[y_{k-1} \le Y < y_k \mid x, Y \ge y_{k-1}] \times p[Y \ge y_{k-1} \mid x]$$

CDE Method

Step 2: Estimate the conditional density (cont.)

The “Hazard Rate Decomposition”

$$p[Y \ge y_{k-1} \mid x] = 1 - p[Y < y_{k-1} \mid x]$$

[Figure: line AB marks y_{k-1} on the distribution of Y. p[Y ≥ y_{k-1} | x] is the prob. represented by the shaded area to the right of AB, or equivalently 1 minus the prob. represented by the area to the left of AB.]

CDE Method

HAZARD RATE DECOMPOSITION

$$p[y_{k-1} \le Y < y_k \mid x, Y \ge y_{k-1}] = \lambda(k, x)$$

is the “Discrete Time Hazard” function: the probability that an observation on Y is drawn from the kth interval, given that it is not drawn from the first k-1 intervals.

[Figure: p[y_{k-1} ≤ Y < y_k | x, Y ≥ y_{k-1}] is the prob. represented by the area of ABCD as a fraction of the entire shaded area to the right of AB.]

So,

$$p[y_{k-1} \le Y < y_k \mid x] = \lambda(k, x) \prod_{j=1}^{k-1} [1 - \lambda(j, x)] \qquad (4)$$

The product term is the probability that an observation on Y is not drawn from the first k-1 intervals.
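The hazard rate decomposition can be sketched in a few lines of Python: given the discrete-time hazards for one covariate profile, each interval probability is the hazard times the probability of not having been drawn from any earlier interval. The hazard values and the function name are hypothetical.

```python
def interval_probs_from_hazards(hazards):
    """Convert discrete-time hazards lambda(1..K) into interval probabilities.

    p_k = lambda_k * product over j < k of (1 - lambda_j)
    """
    probs = []
    survival = 1.0  # prob. of not being drawn from any earlier interval
    for lam in hazards:
        if not 0.0 <= lam <= 1.0:
            raise ValueError("each hazard must lie in [0, 1]")
        probs.append(lam * survival)
        survival *= 1.0 - lam
    return probs


# Hypothetical hazards for K = 4 intervals. The last hazard is 1 because an
# observation that reaches the final interval must be drawn from it.
hazards = [0.4, 0.5, 0.5, 1.0]
probs = interval_probs_from_hazards(hazards)
total = sum(probs)
```

Modelling the hazards rather than the interval probabilities directly means the probabilities sum to one by construction whenever the final hazard equals 1.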

CDE Method

4 steps (cont.)

Step 3: Approximate y in interval k, k = 1 to K. Specify:

$$E[\, y \mid y_{k-1} \le Y < y_k, K \,] = h^*(k \mid K)$$

h* is fixed for all values of x within the kth interval, i.e., the estimate of y is a constant.

Therefore the CDE approximation of E(Y|x) is:

$$E(Y \mid x) \approx \sum_{k=1}^{K} h^*(k \mid K) \, \lambda(k, x) \prod_{j=1}^{k-1} [1 - \lambda(j, x)] \qquad (6)$$

The approximation works like a random histogram, conditional on K and x.

CDE Estimation

SUMMARY (5 Steps)

(1) Choose K, the number of intervals, and (2) the interval boundaries.

G&M treat the mass point at 0 separately.

G&M impose the restriction that all K intervals must have an equal number of observations with positive values for Y.

G&M propose a likelihood-based criterion for selecting K.

(3) Choose constants for h*(k), k=1 to K.

G&M use the mean of observed y in interval k.

CDE Estimation

(4) Decide how to approximate conditional densities.

G&M use a single flexibly specified logit.

The R.H.S. includes a transformation of the interval number k: α_k = -ln(K-k) for k<K.

The R.H.S. also includes polynomials in the X’s and interactions between the X’s and the α_k’s.

Standard significance testing (Wald) is used to determine the order of the polynomials.
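One way to read the α_k term (an interpretation, not stated on the slide): if the logit index were α_k alone, with a coefficient of 1 and no covariates, the implied hazard would be 1/(K-k+1). That is exactly the hazard when every interval is equally likely, as with equal-count intervals. The Python sketch below checks this numerically. Setting the final hazard to 1 is an assumption, since α_k is defined only for k<K.

```python
import math


def alpha(k, K):
    """Interval-number transformation alpha_k = -ln(K - k), defined for k < K."""
    if not 1 <= k < K:
        raise ValueError("alpha_k is only defined for 1 <= k < K")
    return -math.log(K - k)


def logistic(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))


K = 5
# Baseline hazards from the alpha_k term alone (assumed coefficient of 1);
# the last interval gets hazard 1 by assumption.
baseline_hazards = [logistic(alpha(k, K)) for k in range(1, K)] + [1.0]

# Convert hazards to interval probabilities via the hazard decomposition.
baseline_probs = []
survival = 1.0
for lam in baseline_hazards:
    baseline_probs.append(lam * survival)
    survival *= 1.0 - lam
```

Under this reading, the polynomial and interaction terms describe how covariates move the hazards away from an equal-probability baseline.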

CDE Estimation

(5) Decide how to approximate partial derivatives (marginal effects).

G&M calculate “arc” derivatives.

Evaluate E(Y|x) at various values of the explanatory variables.

The population average derivative is the average of the arc derivatives.
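A minimal Python sketch of an averaged arc derivative follows. It assumes the arc derivative is a finite difference in one covariate, computed at each sample point and then averaged; G&M's exact procedure may differ. `toy_expected_y` is a hypothetical stand-in for a fitted E(Y|x), and the step size is arbitrary.

```python
def arc_derivative(expected_y, x, j, delta):
    """Finite-difference change in E(Y|x) per unit change in covariate j."""
    x_shifted = list(x)
    x_shifted[j] += delta
    return (expected_y(x_shifted) - expected_y(x)) / delta


def average_arc_derivative(expected_y, sample, j, delta):
    """Average the arc derivative over all covariate profiles in the sample."""
    arcs = [arc_derivative(expected_y, x, j, delta) for x in sample]
    return sum(arcs) / len(arcs)


def toy_expected_y(x):
    """Hypothetical stand-in for an estimated E(Y|x); not a fitted CDE."""
    return 100.0 + 50.0 * x[0] + 10.0 * x[1] ** 2


# Three hypothetical covariate profiles (x0, x1).
sample = [[0.0, 0.0], [1.0, 1.0], [0.0, 2.0]]

avg_effect_x0 = average_arc_derivative(toy_expected_y, sample, j=0, delta=1.0)
avg_effect_x1 = average_arc_derivative(toy_expected_y, sample, j=1, delta=1.0)
```

The same helper covers discrete policy changes, such as moving a co-payment from $0 to $40, by setting the baseline value and `delta` accordingly.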

CDE Performance

Monte Carlo simulations suggest that, for most data generating processes (DGPs), CDE outperforms both GLM and a flexibly specified OLS one-part model.

CDE’s advantage over GLM and OLS is greater the more heteroskedastic the DGP is.

CDE Performance

The GLM model performs poorly when the model specification for explanatory variables is not sufficiently flexible.

The Akaike Information Criterion produces too conservative a specification.

OLS tends to outperform GLM, and performs almost as well as CDE.

Summary

Extra effort required by CDE could be “worth it” if obtaining robust coefficient estimates is paramount.

e.g., if accurate prediction really matters.

CDE offers the advantage of being able to simulate variation in M.E.’s at different points in the covariate distribution.

The CDE learning curve could be steep.

However, in the VA, there is a recurring need for robust cost estimation.

Next Steps

Plan to use CDE in upcoming MHICM project.

Modify CDE for 2-stage estimation

CDE could be sensitive to outliers.

Problem of over-fitting

Newer methods test for mixtures, which could identify outliers.
