Building useful models: Some new developments and easily avoidable errors Michael Babyak, PhD


Post on 14-Dec-2015


What is a model?

Y = f(x1, x2, x3, …, xn)

Y = a + b1x1 + b2x2 + … + bnxn

Y = e^(a + b1x1 + b2x2 + … + bnxn)


“All models are wrong, some are useful” -- George Box

• A useful model is
– Not very biased
– Interpretable
– Replicable (predicts in a new sample)


Some Premises

• “Statistics” is a cumulative, evolving field

• Newer is not necessarily better, but should be entertained in the context of the scientific question at hand

• Data analytic practice resides along a continuum, from exploratory to confirmatory. Both are important, but the difference has to be recognized.

• There’s no substitute for thinking about the problem


Statistics is a cumulative, evolving field: How do we know this stuff?

• Theory

• Simulation

Concept of Simulation

[Diagram: starting from a known population model, Y = b X + error, draw k samples and estimate b in each one (bs1, bs2, bs3, bs4, …, bsk-1, bsk); then evaluate the collection of estimates]

Simulation Example

[Diagram: the same scheme with the true model Y = .4 X + error]

[Figure: histogram of the simulated estimates. X-axis: value of beta for x1 (0.2 to 0.6); y-axis: frequency of beta value (0 to 2500). True Model: Y = .4*x1 + e]
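The simulation idea on these slides can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the slides: standard normal x and error, 100 cases per sample, 2,000 samples, and a fixed seed. The function name `simulate_slopes` is hypothetical.

```python
import random
import statistics

def simulate_slopes(true_b=0.4, n=100, n_samples=2000, seed=1):
    """Draw many samples from Y = true_b * X + error and estimate b in each."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(n_samples):
        x = [rng.gauss(0, 1) for _ in range(n)]          # assumed: standard normal x
        y = [true_b * xi + rng.gauss(0, 1) for xi in x]  # assumed: standard normal error
        mx, my = statistics.fmean(x), statistics.fmean(y)
        sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
        sxx = sum((xi - mx) ** 2 for xi in x)
        slopes.append(sxy / sxx)                         # least-squares slope for this sample
    return slopes

slopes = simulate_slopes()
# "Evaluate": the estimates should center near the true value of .4
print(round(statistics.fmean(slopes), 2), round(statistics.stdev(slopes), 2))
```

Because the true model is known, the spread of the estimates around .4 shows directly how much a single sample of this size can mislead.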


Ingredients of a Useful Model

Correct probability model

Good measures/no loss of information

Based on theory

Comprehensive

Parsimonious

Flexible

Tested fairly

Useful Model

Correct Model

• Gaussian: General Linear Model
– Multiple linear regression

• Binary (or ordinal): Generalized Linear Model
– Logistic Regression
– Proportional Odds/Ordinal Logistic

• Time to event:
– Cox Regression or parametric survival models

Generalized Linear Model

[Diagram: members of the generalized linear model family, grouped by type of outcome]

• Normal: General Linear Model/Linear Regression; ANOVA/t-test, ANCOVA
• Binary/Binomial: Logistic Regression; Chi-square
• Count, heavy skew, lots of zeros: Poisson, ZIP, negbin, gamma; Regression w/Transformed DV

Can be applied to clustered (e.g., repeated measures) data

Factor Analytic Family

[Diagram: related techniques]

• Structural Equation Models
• Partial Least Squares
• Latent Variable Models (Confirmatory Factor Analysis)
• Multiple regression
• Principal Components
• Common Factor Analysis

Use Theory

• Theory and expert information are critical in helping sift out artifact

• Numbers can look very systematic when they are in fact random
– http://www.tufts.edu/~gdallal/multtest.htm

Measure well

• Adequate range
• Representative values
• Watch for ceiling/floor effects

Using all the information

Preserving cases in data sets with missing data

Conventional approaches:
• Use only complete cases
• Fill in with mean or median
• Use a missing data indicator in the model


Missing Data

• Imputation or related approaches are almost ALWAYS better than deleting incomplete cases

• Multiple Imputation

• Full Information Maximum Likelihood


Multiple Imputation


Modern Missing Data Techniques

Preserve more information from original sample

Incorporate uncertainty about missingness into final estimates

Produce better estimates of population (true) values
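One reason the conventional fill-in approach falls short can be shown with a small sketch: filling every gap with the observed mean makes the variable look less variable than it is, so the uncertainty about the missing values never reaches the final estimates. The data, the 30% missingness rate, and the seed below are invented for illustration, and the sketch does not implement multiple imputation itself.

```python
import random
import statistics

rng = random.Random(7)
complete = [rng.gauss(50, 10) for _ in range(200)]   # hypothetical full data

# Hypothetical missingness: about 30% of values lost completely at random.
observed = [v for v in complete if rng.random() > 0.3]
n_missing = len(complete) - len(observed)

# Conventional fix: fill every missing value with the observed mean.
filled = observed + [statistics.fmean(observed)] * n_missing

var_observed = statistics.variance(observed)
var_filled = statistics.variance(filled)

# The filled-in variable has the same sum of squares spread over more cases,
# so its variance (and any standard error built on it) is too small.
print(n_missing, round(var_observed, 1), round(var_filled, 1))
```

Multiple imputation avoids this by drawing several plausible values for each gap and combining the results, so that the extra uncertainty is reflected in the standard errors.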

Don’t waste information from variables

• Use all the information about the variables of interest

• Don’t create “clinical cutpoints” before modeling

• Model with ALL the data first, then use prediction to make decisions about cutpoints

Dichotomizing for Convenience = Dubious Practice

(C.R.A.P.*)

• *Convoluted Reasoning and Anti-intellectual Pomposity
• Streiner & Norman: Biostatistics: The Bare Essentials

Implausible measurement assumption

[Figure: depression score axis running from 0 to 44, with three scores A, B, and C marked; the scale is split into “not depressed” and “depressed”]


http://psych.colorado.edu/~mcclella/MedianSplit/

http://www.bolderstats.com/jmsl/doc/medianSplit.html

Loss of power

Sometimes, through sampling error, you can get a ‘lucky cut.’

Dichotomization, by definition, reduces the magnitude of the estimate by a minimum of about 30%

Dear Project Officer,

In order to facilitate analysis and interpretation, we have decided to throw away about 30% of our data. Even though this will waste about 3 or 4 hundred thousand dollars worth of subject recruitment and testing money, we are confident that you will understand.

Sincerely,

Dick O. Tomi, PhD
Prof. Richard Obediah Tomi, PhD
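The loss from dichotomizing can be checked by simulation. The sketch below assumes a standard normal x, the model y = .4x + e with standard normal error, a median split, and a large sample so the ratio is stable; none of these settings comes from the slides. For a normally distributed x, theory puts the ratio of the two correlations at sqrt(2/π), about 0.80, which corresponds to losing roughly a third of the explained variance.

```python
import random
import statistics

def pearson_r(a, b):
    """Pearson correlation between two equal-length lists."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / (saa * sbb) ** 0.5

rng = random.Random(11)
n = 20000
x = [rng.gauss(0, 1) for _ in range(n)]
y = [0.4 * xi + rng.gauss(0, 1) for xi in x]

# Median split: everything above the median becomes 1, the rest 0.
cut = statistics.median(x)
x_split = [1.0 if xi > cut else 0.0 for xi in x]

r_continuous = pearson_r(x, y)
r_split = pearson_r(x_split, y)
print(round(r_continuous, 3), round(r_split, 3), round(r_split / r_continuous, 2))
```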

Power to detect non-zero b-weight when x is continuous versus dichotomized

[Figure: % correct rejections of null hypothesis (y-axis, 50 to 100) by reliability of x (0.85, 0.75, 0.65), for continuous x versus dichotomized x. True model: y = .4x + e]

Dichotomizing will obscure non-linearity

[Figure: bar chart, dichotomized at median (CES-D = 7). Y-axis: percent with wall motion abnormality (0 to 30); x-axis: CESD score, Low (“Not Depressed”) versus High (“Depressed”)]

Dichotomizing will obscure non-linearity: Same data as previous slide modeled continuously

[Figure: WMA on at least 1 task, using cubic spline. Y-axis: probability of WMA (0.0 to 1.0); x-axis: CES-D score (0 to 40)]

Type I error rates for the relation between x2 and y after dichotomizing two continuous predictors.

Maxwell and Delaney calculated the effect of dichotomizing two continuous predictors as a function of the correlation between them. The true model is y = .5x1 + 0x2, where all variables are continuous. If x1 and x2 are dichotomized, the error rate for the relation between x2 and y increases as the correlation between x1 and x2 increases.

       Correlation between x1 and x2
N      0      .3     .5     .7
50     .05    .06    .08    .10
100    .05    .08    .12    .18
200    .05    .10    .19    .31
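The pattern in the table can be reproduced approximately by simulation. The sketch below is not Maxwell and Delaney's calculation: it assumes standard normal predictors correlated at .7, standard normal error, median splits, N = 200, and 500 replications, so the rates it prints need not match the table. The helper names are hypothetical.

```python
import random
import statistics

T_CRIT = 1.972  # approximate two-sided 5% critical value of t with 197 df

def t_for_x2(x1, x2, y):
    """t statistic for the x2 coefficient in the regression y ~ x1 + x2."""
    n = len(y)
    m1, m2, my = statistics.fmean(x1), statistics.fmean(x2), statistics.fmean(y)
    d1 = [v - m1 for v in x1]
    d2 = [v - m2 for v in x2]
    dy = [v - my for v in y]
    s11 = sum(a * a for a in d1)
    s22 = sum(b * b for b in d2)
    s12 = sum(a * b for a, b in zip(d1, d2))
    s1y = sum(a * c for a, c in zip(d1, dy))
    s2y = sum(b * c for b, c in zip(d2, dy))
    syy = sum(c * c for c in dy)
    det = s11 * s22 - s12 * s12
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    resid_var = (syy - b1 * s1y - b2 * s2y) / (n - 3)
    return b2 / (resid_var * s11 / det) ** 0.5

def split(values):
    """Median split into 0/1."""
    cut = statistics.median(values)
    return [1.0 if v > cut else 0.0 for v in values]

def rejection_rates(rho=0.7, n=200, reps=500, seed=3):
    """Share of samples in which the (truly null) x2 effect is 'significant'."""
    rng = random.Random(seed)
    hits_cont = hits_split = 0
    for _ in range(reps):
        x1 = [rng.gauss(0, 1) for _ in range(n)]
        x2 = [rho * a + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1) for a in x1]
        y = [0.5 * a + rng.gauss(0, 1) for a in x1]   # x2 has no true effect
        hits_cont += int(abs(t_for_x2(x1, x2, y)) > T_CRIT)
        hits_split += int(abs(t_for_x2(split(x1), split(x2), y)) > T_CRIT)
    return hits_cont / reps, hits_split / reps

rate_continuous, rate_split = rejection_rates()
print(rate_continuous, rate_split)
```

With continuous predictors the false-positive rate for x2 should stay near the nominal .05; after both predictors are split, the information about x1 that the split throws away is partly recovered through x2, which inflates the rate.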

Is it ever a good idea to categorize quantitatively measured variables?

• Yes:
– when the variable is truly categorical
– for descriptive/presentational purposes
– for hypothesis testing, if enough categories are made.

• However, using many categories can lead to problems of multiple significance tests and still run the risk of misclassification

CONCLUSIONS

• Cutting:
– Doesn’t always make measurement sense
– Almost always reduces power
– Can fool you with too much power in some instances
– Can completely miss important features of the underlying function

• Modern computing/statistical packages can “handle” continuous variables

• Want to make good clinical cutpoints? Model first, decide on cuts afterward.


Sample size and the problem of underfitting vs overfitting

• Model assumption is that “ALL” relevant variables be included—the “antiparsimony principle”

• Tempered by fact that estimating too many unknowns with too little data will yield junk

Sample Size Requirements

• Linear regression
– Minimum of N = 50 + 8 per predictor (Green, 1990)

• Logistic Regression
– Minimum of N = 10-15/predictor among smallest group (Peduzzi et al., 1990a)

• Survival Analysis
– Minimum of N = 10-15/predictor (Peduzzi et al., 1990b)
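The rules of thumb above are simple enough to turn into a quick planning check. The function names below are hypothetical, and the example numbers (6 predictors, 45 cases in the smaller outcome group) are invented for illustration.

```python
def min_n_linear(n_predictors):
    """Rule of thumb from the slide: N = 50 + 8 per predictor."""
    return 50 + 8 * n_predictors

def max_predictors(n_limiting, per_predictor=10):
    """Largest number of predictors supported when each one needs
    `per_predictor` cases (10 to 15 on the slide) in the limiting group."""
    return n_limiting // per_predictor

print(min_n_linear(6))          # 98
print(max_predictors(45))       # 4
print(max_predictors(45, 15))   # 3
```

For logistic regression the limiting group is the smaller of the two outcome groups, so a study with 500 participants but only 45 events supports about three or four predictors, not fifty.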


Consequences of inadequate sample size

• Lack of power for individual tests

• Unstable estimates

• Spurious good fit—lots of unstable estimates will produce spurious ‘good-looking’ (big) regression coefficients

All-noise, but good fit

[Figure: density of R-square from the full model (x-axis, 0.0 to 1.0), one curve per events-per-predictor ratio: n/p ≈ 3, n/p ≈ 6.6, n/p = 10, n/p ≈ 13.3]

R-squares from a population model of completely random variables
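For ordinary linear regression there is a closed-form counterpart to this picture: when none of the p predictors is related to y (and errors are normal), the expected R-square is p/(n − 1). The sketch below simply tabulates that expression; it assumes least squares with an intercept, which may differ from the model behind the figure, and the choice of 10 predictors is arbitrary.

```python
def expected_null_r2(n, p):
    """Expected R-square from least squares with an intercept when none of
    the p predictors is related to y (normal errors assumed)."""
    return p / (n - 1)

p = 10  # hypothetical number of pure-noise predictors
for ratio in (3, 5, 10, 20, 50):
    n = ratio * p
    print(ratio, round(expected_null_r2(n, p), 3))
```

At three cases per predictor, pure noise is expected to "explain" about a third of the variance, which is why a good-looking fit from a small sample carries so little evidence.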


Simulation: number of events/predictor ratio

Y = .5*x1 + 0*x2 + .2*x3 + 0*x4

-- Where the correlation between x1 and x4 = .4

-- N/p = 3, 5, 10, 20, 50

Parameter stability and n/p ratio

[Figure: four panels (x1, x2, x3, x4), each showing the density of the parameter estimate (x-axis, -2.0 to 2.0) for n/p = 3, 5, 10, 20, and 50]


Peduzzi’s Simulation: number of events/predictor ratio

P(survival) =a + b1*NYHA + b2*CHF + b3*VES+b4*DM + b5*STD + b6*HTN + b7*LVC

--Events/p = 2, 5, 10, 15, 20, 25

--% relative bias = ((estimated b - true b)/true b)*100

Simulation results: number of events/predictor ratio

[Figure: % relative bias (y-axis, -20 to 50) by events per variable (2, 5, 10, 15, 20, 25), one line per predictor: NYHA, CHF, VES, DM, STD, HTN, LVC]

Simulation results: number of events/predictor ratio

[Figure: proportion of samples with bias > 100% (y-axis, 0 to 0.7) by events per variable (2, 5, 10, 15, 20, 25), one line per predictor: NYHA, CHF, VES, DM, STD, HTN, LVC]

Approaches to variable selection

• “Stepwise” automated selection
• Pre-screening using univariate tests
• Combining or eliminating redundant predictors
• Fixing some coefficients
• Theory, expert opinion and experience
• Penalization/Random effects
• Propensity Scoring
– “Matches” individuals on multiple dimensions to improve “baseline balance”
• Tibshirani’s “Lasso”

Any variable selection technique based on looking at the data first will likely be biased

“I now wish I had never written the stepwise selection code for SAS.”
--Frank Harrell, author of forward and backwards selection algorithm for SAS PROC REG


Automated Selection: Derksen and Keselman (1992) Simulation Study

• Studied backward and forward selection

• Some authentic variables and some noise variables among candidate variables

• Manipulated correlation among candidate predictors

• Manipulated sample size


Automated Selection: Derksen and Keselman (1992) Simulation Study

• “The degree of correlation between candidate predictors affected the frequency with which the authentic predictors found their way into the model.”

• “The greater the number of candidate predictors, the greater the number of noise variables were included in the model.”

• “Sample size was of little practical importance in determining the number of authentic variables contained in the final model.”

Simulation results: Number of noise variables included

20 candidate predictors; 100 samples

[Figure: % of samples (y-axis, 0 to 35) by number of variables in final model (0 to 7), one series per sample size: 100, 200, 500, 1000, 10000]

Simulation results: R-square from noise variables

20 candidate predictors; 100 samples

[Figure: % of samples (y-axis, 0 to 100) by % variance explained (0, 0-5, 5-10, 10-15, 15-20, 20-25, > 25), one series per sample size: 100, 200, 500, 1000, 10000]

Simulation results: R-square from noise variables

20 candidate predictors; 100 samples

[Figure: R-square (y-axis, 0 to 0.3) across samples ordered in deciles, one line per sample size: 10,000, 1,000, 500, 200, 100]

SOME of the problems with stepwise variable selection.

1. It yields R-squared values that are badly biased high
2. The F and chi-squared tests quoted next to each variable on the printout do not have the claimed distribution
3. The method yields confidence intervals for effects and predicted values that are falsely narrow (See Altman and Anderson Stat in Med)
4. It yields P-values that do not have the proper meaning and the proper correction for them is a very difficult problem
5. It gives biased regression coefficients that need shrinkage (the coefficients for remaining variables are too large; see Tibshirani, 1996).
6. It has severe problems in the presence of collinearity
7. It is based on methods (e.g. F tests for nested models) that were intended to be used to test pre-specified hypotheses.
8. Increasing the sample size doesn't help very much (see Derksen and Keselman)
9. It allows us to not think about the problem
10. It uses a lot of paper
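Why the P-values lose their meaning can be seen with a deliberately simplified stand-in for automated selection: screen 20 candidate predictors that are all pure noise and keep whichever ones pass p < .05. This is not the Derksen and Keselman design; the sample size of 100, the 300 replications, the seed, and the use of simple correlations are all assumptions made for the sketch.

```python
import random
import statistics

R_CRIT = 0.197  # approximate |r| needed for two-sided p < .05 when n = 100

def pearson_r(a, b):
    """Pearson correlation between two equal-length lists."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / (saa * sbb) ** 0.5

rng = random.Random(5)
n, n_candidates, n_samples = 100, 20, 300

hits = []  # number of noise predictors "selected" in each simulated sample
for _ in range(n_samples):
    y = [rng.gauss(0, 1) for _ in range(n)]
    selected = 0
    for _ in range(n_candidates):
        x = [rng.gauss(0, 1) for _ in range(n)]   # unrelated to y by construction
        if abs(pearson_r(x, y)) > R_CRIT:
            selected += 1
    hits.append(selected)

share_with_any = sum(h > 0 for h in hits) / n_samples
mean_hits = statistics.fmean(hits)
print(round(share_with_any, 2), round(mean_hits, 2))
```

With 20 independent tests at the .05 level, about 1 − .95^20 ≈ 64% of samples are expected to yield at least one "significant" noise variable, and each survivor then appears in the final model with a P-value below .05.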

Chatfield, C. (1995). Model uncertainty, data mining and statistical inference (with discussion). JRSSA, 158, 419-466.

Annotation:
– Bias by selecting model because it fits the data well; bias in standard errors.
– P. 420: ... need for a better balance in the literature and in statistical teaching between techniques and problem solving strategies.
– P. 421: It is `well known' to be `logically unsound and practically misleading' (Zhang, 1992) to make inferences as if a model is known to be true when it has, in fact, been selected from the same data to be used for estimation purposes. However, although statisticians may admit this privately (Breiman (1992) calls it a `quiet scandal'), they (we) continue to ignore the difficulties because it is not clear what else could or should be done.
– P. 421: Estimation errors for regression coefficients are usually smaller than errors from failing to take into account model specification.
– P. 422: Statisticians must stop pretending that model uncertainty does not exist and begin to find ways of coping with it.
– P. 426: It is indeed strange that we often admit model uncertainty by searching for a best model but then ignore this uncertainty by making inferences and predictions as if certain that the best fitting model is actually true.

Phantom Degrees of Freedom

• Faraway (1992) showed that any pre-modeling strategy costs df over and above the df used later in modeling.

• Premodeling strategies included: variable selection, outlier detection, linearity tests, residual analysis.

• Thus, although not accounted for in final model, these phantom df will render the model too optimistic


Phantom Degrees of Freedom

• Therefore, if you transform, select, etc., you must include the DF in (i.e., penalize for) the “Final Model”

Conventional Univariate Pre-selection

• Non-significant tests also cost a DF
• Non-significance is NOT necessarily related to importance
• Variables may not behave the same way in a multivariable model: a variable “not significant” at univariate test may be very important in the presence of other variables


• Despite the convention, testing for confounding has not been systematically studied—in many cases leads to overadjustment and underestimate of true effect of variable of interest.

• At the very least, pulling variables in and out of models inflates the model fit, often dramatically

Conventional Univariate Pre-selection

Better approach

• Pick variables a priori
• Stick with them
• Penalize appropriately for any data-driven decision about how to model a variable

Spending DF wisely

• If not enough N/predictor, combine covariates using techniques that do not look at Y in the sample: PCA, FA, conceptual clustering, collapsing, scoring, established indexes.

• Save DF for finer-grained look at variables of most interest, e.g., non-linear functions

Help is on the way?

• Penalization/Random effects

• Propensity Scoring
– “Matches” individuals on multiple dimensions to improve “baseline balance”

• Tibshirani’s Lasso


http://myspace.com/monkeynavigatedrobots

Validation

• Apparent fit
– Usually too optimistic

• Internal
– cross-validation, bootstrap
– honest estimate for model performance
– provides an upper limit to what would be found on external validation

• External validation
– replication with new sample, different circumstances

Validation

• Steyerberg, et al. (1999) compared validation methods

• Found that split-half was far too conservative

• Bootstrap was equal or superior to all other techniques
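A common form of bootstrap validation estimates how optimistic the apparent fit is and subtracts that optimism. The sketch below applies the idea to R-square from a one-predictor regression; the data, sample size of 40, 200 resamples, and helper names are all assumptions for illustration, and it is not a reproduction of the comparison cited above.

```python
import random
import statistics

def fit_line(x, y):
    """Least-squares intercept and slope for one predictor."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def r_square(model, x, y):
    """1 - SSE/SST of a fitted line evaluated on the data (x, y)."""
    intercept, slope = model
    my = statistics.fmean(y)
    sse = sum((b - intercept - slope * a) ** 2 for a, b in zip(x, y))
    sst = sum((b - my) ** 2 for b in y)
    return 1 - sse / sst

rng = random.Random(2)
n = 40
x = [rng.gauss(0, 1) for _ in range(n)]          # hypothetical data
y = [0.4 * a + rng.gauss(0, 1) for a in x]

apparent = r_square(fit_line(x, y), x, y)        # fit and test on the same sample

optimism, test_scores = [], []
for _ in range(200):
    idx = [rng.randrange(n) for _ in range(n)]   # resample WITH replacement
    xb = [x[i] for i in idx]
    yb = [y[i] for i in idx]
    model = fit_line(xb, yb)
    train = r_square(model, xb, yb)              # performance in the resample
    test = r_square(model, x, y)                 # same model on the original sample
    optimism.append(train - test)
    test_scores.append(test)

corrected = apparent - statistics.fmean(optimism)
print(round(apparent, 3), round(corrected, 3))
```

Unlike a split-half approach, every case is used both to build and to test the model, which is one reason the bootstrap wastes less information.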

Conclusions

• Measure well
• Use all the information
• Recognize the limitations based on how much data you actually have
• In the confirmatory mode, be as explicit as possible about the model a priori, test it, and live with it
• By all means, explore data, but recognize, and state frankly, the limits post hoc analysis places on inference


Advanced topics and examples

Bootstrap

[Diagram: from My Sample, draw k resamples WITH REPLACEMENT; compute the estimate in each resample (1, 2, 3, 4, …, k-1, k); then evaluate the collection of estimates]

[Illustration: an original sample (1, 3, 4, 5, 7, 10) followed by five resamples of six values each, shown on the slide as 7114510, 1032221, 351427, 211727, and 4414210]
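Resampling with replacement takes one line in Python. The sketch below reuses the six values from the slide; the seed, the 1,000 resamples, and the choice of the mean as the statistic to evaluate are assumptions for illustration.

```python
import random
import statistics

sample = [1, 3, 4, 5, 7, 10]
rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Each resample has the same size as the original and is drawn WITH
# replacement, so some values repeat and others drop out.
resamples = [rng.choices(sample, k=len(sample)) for _ in range(1000)]
boot_means = [statistics.fmean(r) for r in resamples]

print(resamples[:5])
# "Evaluate": the spread of the resampled means estimates the standard
# error of the sample mean.
print(statistics.fmean(sample), round(statistics.stdev(boot_means), 2))
```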


Can use data to determine where to spend DF

• Use Spearman’s Rho to test “importance”

• Not peeking because we have chosen to include the term in the model regardless of relation to Y

• Use more DF for non-linearity

Example-Predict Survival from age, gender, and fare on Titanic: example using S-Plus (or R) software


If you have already decided to include them (and promise to keep them in the model) you can peek at predictors in order to see where to add complexity

Spearman Test

[Dot chart: adjusted rho^2 (x-axis, 0.0 to 0.25) for each predictor]

Predictor   N      df
age         1046   1
fare        1308   1
sex         1309   1


Non-linearity using splines

Linear Spline (piecewise regression)

[Figure: Y (0 to 2.5) against X (0 to 25), fitted as straight-line segments]

Y = a + b1(x<10) + b2(10<x<20) + b3 (x >20)
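One common way to code a linear spline is with truncated-line ("hinge") terms, so the fitted line bends at each knot but stays continuous. The sketch below assumes knots at 10 and 20, matching the breakpoints in the formula above; the function names and the coefficients in the usage line are hypothetical. In this parameterization the slope is b1 below the first knot, b1 + b2 between the knots, and b1 + b2 + b3 above the second knot.

```python
def linear_spline_basis(x, knots=(10, 20)):
    """Return [x, (x - k1)+, (x - k2)+, ...] where (u)+ = max(u, 0)."""
    return [x] + [max(x - k, 0) for k in knots]

def predict(x, a, coefs, knots=(10, 20)):
    """Piecewise-linear prediction that stays continuous at the knots."""
    basis = linear_spline_basis(x, knots)
    return a + sum(c * b for c, b in zip(coefs, basis))

print(linear_spline_basis(5))    # [5, 0, 0]
print(linear_spline_basis(15))   # [15, 5, 0]
print(linear_spline_basis(25))   # [25, 15, 5]

# Hypothetical coefficients: rises to x = 10, flat to x = 20, then rises slowly.
print([round(predict(x, 0.5, (0.1, -0.1, 0.05)), 2) for x in (0, 10, 20, 25)])
```

The basis columns can be fed to any regression routine as ordinary predictors, which is all a spline costs: one extra df per knot.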

Cubic Spline (non-linear piecewise regression)

[Figure: Y (0 to 2.5) against X, fitted as a smooth curve, with the knots marked]

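# In R, lrm() and rcs() come from Harrell's rms package (successor to Design)
library(rms)

# Assumes survived, fare, age, and sex are available in the workspace.
# Logistic regression model with all two-way interactions;
# rcs(fare, 3) is a restricted cubic spline for fare with 3 knots
fitfare <- lrm(survived ~ (rcs(fare, 3) + age + sex)^2, x = TRUE, y = TRUE)

anova(fitfare)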

Wald Statistics          Response: survived

Factor                                          Chi-Square  d.f.  P
fare  (Factor+Higher Order Factors)                   55.1     6  <.0001
 All Interactions                                     13.8     4  0.0079
 Nonlinear (Factor+Higher Order Factors)              21.9     3  0.0001
age  (Factor+Higher Order Factors)                    22.2     4  0.0002
 All Interactions                                     16.7     3  0.0008
sex  (Factor+Higher Order Factors)                   208.7     4  <.0001
 All Interactions                                     20.2     3  0.0002
fare * age  (Factor+Higher Order Factors)              8.5     2  0.0142
 Nonlinear                                             8.5     1  0.0036
 Nonlinear Interaction : f(A,B) vs. AB                 8.5     1  0.0036
fare * sex  (Factor+Higher Order Factors)              6.4     2  0.0401
 Nonlinear                                             1.5     1  0.2153
 Nonlinear Interaction : f(A,B) vs. AB                 1.5     1  0.2153
age * sex  (Factor+Higher Order Factors)               9.9     1  0.0016
TOTAL NONLINEAR                                       21.9     3  0.0001
TOTAL INTERACTION                                     24.9     5  0.0001
TOTAL NONLINEAR + INTERACTION                         38.3     6  <.0001
TOTAL                                                245.3     9  <.0001


Predictors of Survival on Titanic

[Figure: estimated effects with 0.95 confidence bars (x-axis, 0.50 to 12.00) for fare - 31:7.9, age - 39:21, and sex - female:male. Adjusted to: fare=14 age=28 sex=male]

Fare and Age Interaction

[Figure: surface plot of probability of survival (0 to 1) by fare (0 to 250) and age (10 to 60). Adjusted to: sex=male]

Fare and Gender Interaction

[Figure: probability of survival (0.2 to 1.0) by fare (0 to 300), with separate curves for female and male. Adjusted to: age=28]

Bootstrap Validation

Index       Training   Corrected
Dxy         0.6565     0.646
R2          0.4273     0.407
Intercept   0.0000     -0.011
Slope       1.0000     0.952

Summary

• Think about your model
• Collect enough data
• Measure well
• Don’t destroy what you’ve measured
• Pick your variables ahead of time and collect enough data to test the model you want
• Keep all your variables in the model unless extremely unimportant
• Use more df on important variables, fewer df on “nuisance” variables
• Don’t peek at Y to combine, discard, or transform variables
• Estimate validity and shrinkage with bootstrap
• By all means, tinker with the model later, but be aware of the costs of tinkering
• Don’t forget to say you tinkered
• Go collect more data

Web links for references, software, and more

• Harrell’s regression modeling text
– http://hesweb1.med.virginia.edu/biostat/rms/

• SAS Macros for spline estimation
– http://hesweb1.med.virginia.edu/biostat/SAS/survrisk.txt

• Some results comparing validation methods
– http://hesweb1.med.virginia.edu/biostat/reports/logistic.val.pdf

• SAS code for bootstrap
– ftp://ftp.sas.com/pub/neural/jackboot.sas

• S-Plus home page
– insightful.com

• Mike Babyak’s e-mail
– [email protected]

• This presentation
– http://www.duke.edu/~mbabyak


• www.duke.edu/~mababyak

• michael.babyak @ duke.edu

• symptomresearch.nih.gov/chapter_8/