sc968: panel data methods for sociologists

SC968: Panel Data Methods for Sociologists

Introduction to survival/event history models

Types of outcome

• Continuous: OLS linear regression
• Binary: binary regression (logistic or probit regression)
• Time-to-event data: survival or event history analysis

Examples of time to event data

• Time to death
• Time to incidence of disease
• Unemployed: time till finding a job
• Time to birth of first child
• Smokers: time till quitting smoking

Time to event data

• A finite set of discrete states
• Units (individuals, firms, households, etc.) are each in one state
• Transitions between states
• Time until a transition takes place

4 key concepts for survival analysis

• States
• Events
• Risk period
• Duration / time

States

• States are categories of the outcome variable of interest
• Each unit (e.g. person, household) occupies exactly one state at any moment in time
• Examples:
    alive, dead
    single, married, divorced, widowed
    never smoker, smoker, ex-smoker
    employed, unemployed, inactive
• The set of possible states is called the state space

Events

• Event = a transition from one state to another
• From an origin state to a destination state
• Examples:
    From smoker to ex-smoker
    From married to widowed
• Not all transitions may be possible
    E.g. from smoker to never smoker

Risk period

• 2 states: A & B
• Event: transition from A → B
• To be able to undergo this transition, one must be in state A (someone already in state B cannot make the transition)
• Not all individuals will be in state A at any given time
• Example: one can only experience divorce if married
• The period of time that someone is at risk of a particular event is called the risk period
• All subjects at risk of an event at a given point in time are called the risk set

Time

• Various meanings...
• Calendar time
    ...but the onset of risk is usually not simultaneous for all units
    Ex: by age 40, some individuals will have smoked for 20+ years, others for 1 year
• Duration = time since onset of risk
    divorce: time since getting married
    finding a job: time since becoming unemployed
    death: time since being born

Duration

• Event history analysis: analysing the length of duration, i.e. the length of time between the onset of risk and the occurrence of an event
• Examples:
    Duration of marriage
    Length of life
• In practice we model the probability of a transition conditional on being in the risk set

Example data

Person ID   Married (onset of risk)   Divorce (event)   End of observation period
1           01/01/1991                -                 01/01/2008
2           01/01/1991                01/01/2000        01/01/2000
3           01/01/1995                -                 01/01/2005
4           01/01/1994                01/07/2004        01/07/2004
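The two variables that survival analysis needs (duration and an event indicator) can be derived from these four records. The following is a minimal Python sketch, not part of the original Stata workflow; the function name `duration_and_event` is hypothetical, the dates are assumed to be day/month/year, and a year is approximated as 365.25 days.

```python
from datetime import date

# Example spells from the table: (person, married, divorced or None, end of observation).
# Assumption: dates are day/month/year, so 01/07/2004 is read as 1 July 2004.
spells = [
    (1, date(1991, 1, 1), None, date(2008, 1, 1)),
    (2, date(1991, 1, 1), date(2000, 1, 1), date(2000, 1, 1)),
    (3, date(1995, 1, 1), None, date(2005, 1, 1)),
    (4, date(1994, 1, 1), date(2004, 7, 1), date(2004, 7, 1)),
]


def duration_and_event(start, event_date, end_obs):
    """Return (duration in years, event indicator) for one spell."""
    # The spell ends at the event if one was observed, otherwise at censoring.
    exit_date = event_date if event_date is not None else end_obs
    years = (exit_date - start).days / 365.25  # approximate year length
    return round(years, 1), int(event_date is not None)


results = {pid: duration_and_event(s, e, end) for pid, s, e, end in spells}
for pid, (years, event) in results.items():
    print(pid, years, event)
```

Persons 1 and 3 contribute a duration but no event: they are right-censored at the end of observation.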

[Figure: the example spells plotted against calendar time (axis 1991 to 2009), with the end of study follow-up marked]

Censoring

• Ideally: observe each individual from the onset of risk until the event has occurred
• ...but this is very demanding in terms of data collection (e.g. the risk of death starts at birth)
• Usually the data are incomplete → censoring
• An observation is censored if it has incomplete information → duration cannot be measured accurately
• Types of censoring:
    Right censoring
    Left censoring

Censoring

• Right censoring: the person did not experience the event during the time that they were studied
• Common reasons for right censoring:
    the study ends
    the person drops out of the study
• We do not know when the person experiences the event, but we do know that it is later than a given time T
• Left censoring: the person became at risk before we started observing them
    If we do not know when the person entered the risk set → EHA cannot deal with this
    If we do know when the person entered the risk set → condition on the person having survived long enough to enter the study
• Censoring must be independent of the survival process!

[Figure: four example spells plotted against study time in years (axis 0 to 18); two end in an event and two are censored]

Why a special set of methods?

• Duration is a continuous variable, so why not OLS?
• Censoring
    If censored cases are excluded → longer durations are more likely to be thrown out
    If censored cases are treated as complete → duration is mis-measured
• Non-normality of residuals
• Time-varying covariates
• Interest is in the probability of a transition at any given time rather than in the length of completed spells
• Need to take into account simultaneously:
    Whether the event has taken place or not
    The length of the period at risk before the event occurred

Survival function

Length of time (duration) before an event occurs (length of 'spell'): T

Probability density function (pdf), f(t):

$$f(t) = \lim_{\Delta t \to 0} \frac{\Pr(t \le T \le t + \Delta t)}{\Delta t} = \frac{dF(t)}{dt}$$

Cumulative distribution function (cdf), F(t):

$$F(t) = \Pr(T \le t) = \int_0^t f(s)\,ds$$

Survival function:

$$S(t) = 1 - F(t)$$

Survival function

[Figure: sketches of the CDF, the PDF and S(t), each plotted against duration (T)]

Hazard rate

• h(t) = f(t) / S(t)
• The exact definition and interpretation of h(t) differ according to whether:
    duration is continuous
    duration is discrete
• Conditional on having survived up to t, what is the probability of leaving between t and t+Δt?
• It is a measure of risk intensity
• h(t) ≥ 0
• In principle h(t) is a rate, not a probability
• There is a one-to-one relationship between h(t), f(t), F(t) and S(t)
• EHA: h(t) = g(t, Xs), where g takes parametric or semi-parametric specifications
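The one-to-one relationship can be written out as follows for the continuous-time case. This is a standard derivation added for reference; the cumulative hazard H(t) is a symbol introduced here and does not appear on the slides.

```latex
\begin{aligned}
h(t) &= \frac{f(t)}{S(t)} = -\frac{d}{dt}\,\ln S(t) \\
H(t) &= \int_0^t h(s)\,ds \\
S(t) &= \exp\bigl(-H(t)\bigr) \\
F(t) &= 1 - S(t) \\
f(t) &= h(t)\,S(t)
\end{aligned}
```

Knowing any one of the four functions is therefore enough to recover the other three.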

Data

• Survival or event history data are characterised by 2 variables:
    Time, or duration of the risk period
    Failure (event)
        • 1 if not survived or event observed
        • 0 if censored or event not yet occurred
• The data structure differs according to whether:
    Duration is discrete
    Duration is continuous
• Assume: 2 states; 1 transition; no repeated events

Data structure-Discrete time

ID   Entry        End date     Event        X at t0   X at t1   ....
1    01/01/1991   01/01/2008   01/01/2002
2    01/01/1991   01/01/2008   -

ID   Date         Duration (t)   Event   X
1    01/01/1991   1              0
1    01/01/1992   2              0
...  .....        ....           .....
1    01/01/2002   11             1
2    01/01/1991   1              0
...  ....         ....           ....
2    01/01/2008   17             0

t records (1 for each unit of time the person is at risk)

Data structure-Discrete time

• Each row is one individual in one period (a person-period)
• An individual has as many rows as the number of periods he or she is observed to be at risk
• No longer at risk when:
    the event has been experienced
    no longer under observation (censored)
• For each period (row) there is a value of the explanatory variable X → very easy to incorporate time-varying covariates
• Stata: reshape long
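The person-period layout can be illustrated outside Stata as well. The following Python sketch (the helper `person_period_rows` is a hypothetical name, not a library function) expands one spell into one row per period at risk, using the two people from the discrete-time example: person 1 has the event in period 11 and person 2 is censored after 17 periods.

```python
def person_period_rows(pid, duration, event):
    """Expand one spell into one row per period at risk.

    Returns a list of (pid, t, event_in_period) tuples. The event flag is 1
    only in the final period, and only if the spell ended in an event.
    """
    rows = []
    for t in range(1, duration + 1):
        is_last = t == duration
        rows.append((pid, t, 1 if (event and is_last) else 0))
    return rows


# Person 1: event in period 11. Person 2: censored after 17 periods.
rows = person_period_rows(1, 11, 1) + person_period_rows(2, 17, 0)

# Show the first rows for person 1 and the switch from person 1 to person 2.
for row in rows[:3] + rows[10:12]:
    print(row)
```

A censored person simply contributes rows whose event flag is always 0, which is how the discrete-time layout carries censoring information.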

Data structure-continuous time

ID   Entry        Died         End date     Duration   Event   X
1    01/01/1991                01/01/2008   17.0       0       0
2    01/01/1991   01/01/2002   01/01/2002   11.0       1       0
3    01/01/1995                01/01/2000   5.0        0       0
3    01/01/2000   01/01/2005   01/01/2005   5.0        1       1

Data structure-continuous time

• Each row is a person
• An indicator distinguishes observed events from censored cases
• Calculate duration = exit date - entry date
• Exit date is either:
    the failure date
    the censoring date
• If there are time-varying covariates:
    Split the period an individual is under observation each time a time-varying X changes
    If there are many Xs that change often → multiple rows per person
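Episode splitting for a time-varying covariate can be sketched as follows. This is an illustrative Python example rather than the Stata procedure; `split_spell` is a hypothetical helper, and times are given as calendar years for simplicity. It reproduces the two rows for person 3 in the continuous-time example, whose X changes from 0 to 1 in 2000.

```python
def split_spell(pid, exit_time, event, changes):
    """Split one spell into episodes at the times a covariate changes.

    changes: list of (time, value) pairs sorted by time; the first pair gives
    the entry time and the covariate value at entry.
    Returns one (pid, start, stop, event, x) tuple per episode. Only the last
    episode can carry the event; earlier episodes end in a covariate change.
    """
    episodes = []
    for i, (start, x) in enumerate(changes):
        is_last = i + 1 == len(changes)
        stop = exit_time if is_last else changes[i + 1][0]
        episodes.append((pid, start, stop, event if is_last else 0, x))
    return episodes


# Person 3: enters in 1995 with X = 0, X becomes 1 in 2000, event in 2005.
episodes = split_spell(3, 2005, 1, [(1995, 0), (2000, 1)])
for episode in episodes:
    print(episode)
```

Each episode is five years long, matching the two Duration values of 5.0 shown for person 3.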

Worked example

• Random 20% sample from the BHPS, Waves 1 to 15
• One record per person/wave
• Outcome: duration of cohabitation
• Condition on cohabiting
• Survival time: years from starting cohabitation till the year living without a partner

The data

+----------------------------+
|      pid   wave    mastat  |
|----------------------------|
| 10081798      1    single  |
| 10081798      2   married  |
| 10081798      3   married  |
| 10081798      4   married  |
| 10081798      5   married  |
| 10081798      6   married  |
| 10081798      7   widowed  |
| 10081798      8   widowed  |
| 10081798      9   widowed  |
| 10081798     10   widowed  |
| 10081798     11   widowed  |
| 10081798     12   widowed  |
| 10081798     13   widowed  |
| 10081798     14   widowed  |
| 10081798     15   widowed  |
|----------------------------|

• Duration = 5 years
• Event = 1
• Ignore data after event = 1

The data (continued)

+----------------------------+
|      pid   wave    mastat  |
|----------------------------|
| 10162747      1  living a  |
| 10162747      2  living a  |
| 10162747      3  living a  |
| 10162747      4  living a  |
| 10162747      5  living a  |
| 10162747      6  living a  |
| 10162747     10  separate  |
| 10162747     11         .  |
| 10162747     12         .  |
| 10162747     13         .  |
| 10162747     14  never ma  |
| 10162747     15  never ma  |
+----------------------------+

Note the missing waves (7 to 9) before the event

Preparing the data

. sort pid wave
. recode mastat (1/2=1) (3/6=0) (.=.), gen(couple)
. * Select records after onset of risk
. by pid: gen skey=1 if couple==1 & couple[_n-1]==0
. by pid: replace skey=1 if skey[_n-1]==1
. keep if skey==1
. * Generate new duration variable
. by pid: gen diff=wave[1] //first wave
. by pid: gen time=wave-diff+1
. * Declare that you want to set the data to survival time
. stset time, id(pid) failure(couple=0) exit(couple=0, .)

                id:  pid
     failure event:  couple == 0
obs. time interval:  (time[_n-1], time]
 exit on or before:  couple==0 .

------------------------------------------------------------------------------
     1804  total obs.
      572  obs. begin on or after exit
------------------------------------------------------------------------------
     1232  obs. remaining, representing
      181  subjects
       62  failures in single failure-per-subject data
     1234  total analysis time at risk, at risk from t =         0
                             earliest observed entry t =         0
                                  last observed exit t =        14

Important to check that you have set data as intended

Checking the data setup

. list pid wave time couple mastat _st _d _t _t0 if pid==11032391, sepby(pid) noobs

+------------------------------------------------------------------+
|      pid   wave   time   couple     mastat   _st   _d   _t   _t0 |
|------------------------------------------------------------------|
| 11032391      7      1        1   living a     1    0    1     0 |
| 11032391      8      2        1   living a     1    0    2     1 |
| 11032391      9      3        1   living a     1    0    3     2 |
| 11032391     10      4        1   living a     1    0    4     3 |
| 11032391     11      5        1   living a     1    0    5     4 |
| 11032391     12      6        0   never ma     1    1    6     5 |
| 11032391     13      7        0   never ma     0    .    .     . |
| 11032391     14      8        0   never ma     0    .    .     . |
| 11032391     15      9        0   never ma     0    .    .     . |
+------------------------------------------------------------------+

• _st: 1 if observation is to be used and 0 otherwise
• _d: 1 if event, 0 if censoring or event not yet occurred
• _t: time of exit
• _t0: time of entry

Checking the data setup

. list pid time wave couple mastat _st _d _t _t0 if pid==12230278, sepby(pid) noobs

+------------------------------------------------------------------+
|      pid   time   wave   couple     mastat   _st   _d   _t   _t0 |
|------------------------------------------------------------------|
| 12230278      1      4        1    married     1    0    1     0 |
| 12230278      2      5        1    married     1    0    2     1 |
| 12230278      3      6        1    married     1    0    3     2 |
| 12230278      4      7        1    married     1    0    4     3 |
| 12230278      5      8        1    married     1    0    5     4 |
| 12230278      6     11        0   divorced     1    1    6     5 |
| 12230278      7     12        0   divorced     0    .    .     . |
| 12230278      8     13        0   divorced     0    .    .     . |
| 12230278      9     14        0   divorced     0    .    .     . |
| 12230278     10     15        .          .     0    .    .     . |
+------------------------------------------------------------------+

Waves 9 and 10 are missing, so how do we know when this person divorced?

Trying again!

. fillin pid time
. stset time, id(pid) failure(couple=0) exit(couple=0, .)

                id:  pid
     failure event:  couple == 0
obs. time interval:  (time[_n-1], time]
 exit on or before:  couple==0 .

------------------------------------------------------------------------------
     2534  total obs.
     1222  obs. begin on or after exit
------------------------------------------------------------------------------
     1312  obs. remaining, representing
      181  subjects
       61  failures in single failure-per-subject data
     1312  total analysis time at risk, at risk from t =         0
                             earliest observed entry t =         0
                                  last observed exit t =        14

Checking the new data setup

. list pid time wave couple mastat _st _d _t _t0 if pid==12230278, sepby(pid) noobs

+------------------------------------------------------------------+
|      pid   time   wave   couple     mastat   _st   _d   _t   _t0 |
|------------------------------------------------------------------|
| 12230278      1      4        1    married     1    0    1     0 |
| 12230278      2      5        1    married     1    0    2     1 |
| 12230278      3      6        1    married     1    0    3     2 |
| 12230278      4      7        1    married     1    0    4     3 |
| 12230278      5      8        1    married     1    0    5     4 |
| 12230278      6      .        .          .     1    0    6     5 |
| 12230278      7      .        .          .     0    .    .     . |
| 12230278      8     11        0   divorced     0    .    .     . |
| 12230278      9     12        0   divorced     0    .    .     . |
| 12230278     10     13        0   divorced     0    .    .     . |
| 12230278     11     14        0   divorced     0    .    .     . |
| 12230278     12     15        .          .     0    .    .     . |
| 12230278     13      .        .          .     0    .    .     . |
| 12230278     14      .        .          .     0    .    .     . |
+------------------------------------------------------------------+

Now censored (at time 6) instead of an event

Summarising time to event data

• Individuals are followed up for different lengths of time
• So we can't use prevalence rates (% of people who have an event)
• Use rates instead, which take account of person-years at risk
    Incidence rate per year
    Death rate per 1000 person-years

Summarising time to event data

. stsum

         failure _d:  couple == 0
   analysis time _t:  time
  exit on or before:  couple==0 .
                 id:  pid

         |               incidence       no. of    |------ Survival time -----|
         | time at risk     rate        subjects        25%       50%       75%
---------+---------------------------------------------------------------------
   total |         1312   .0464939           181          5         .         .

• time at risk: person-years
• incidence rate: rate per year
• no. of subjects: number of observations
• Less than 50% of the sample has experienced the event by the end of the study (the 50% and 75% survival times are missing)
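The incidence rate reported by stsum is simply the number of events divided by the person-years at risk. A quick check in Python, using the 61 failures and 1312 person-years reported by stset after the fillin step (the variable names are illustrative):

```python
failures = 61        # events reported by stset after the fillin step
person_years = 1312  # total analysis time at risk

# Incidence rate per person-year, and the same rate per 1000 person-years.
incidence_rate = failures / person_years
rate_per_1000 = incidence_rate * 1000

print(round(incidence_rate, 7), round(rate_per_1000, 1))
```

This reproduces the incidence rate of .0464939 in the stsum output, i.e. roughly 46 partnership breakdowns per 1000 person-years.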

List the failure function, 1 - S(t) (the default is the survivor function)

. sts list, failure

         failure _d:  couple == 0
   analysis time _t:  time
  exit on or before:  couple==0 .
                 id:  pid

           Beg.          Net      Failure      Std.
  Time    Total   Fail   Lost     Function     Error     [95% Conf. Int.]
-------------------------------------------------------------------------------
     2      181     26      7       0.1436    0.0261     0.1002    0.2038
     3      148      8      8       0.1899    0.0294     0.1396    0.2555
     4      132      5     10       0.2206    0.0313     0.1662    0.2895
     5      117      5      5       0.2539    0.0333     0.1953    0.3262
     6      107      1      8       0.2609    0.0337     0.2014    0.3339
     7       98      6      9       0.3062    0.0364     0.2412    0.3837
     8       83      4      8       0.3396    0.0383     0.2706    0.4204
     9       71      2     10       0.3582    0.0394     0.2869    0.4410
    10       59      1      9       0.3691    0.0402     0.2962    0.4534
    11       49      3      7       0.4077    0.0435     0.3283    0.4981
    12       39      0     11       0.4077    0.0435     0.3283    0.4981
    13       28      0      9       0.4077    0.0435     0.3283    0.4981
    14       19      0     19       0.4077    0.0435     0.3283    0.4981
-------------------------------------------------------------------------------
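The Failure Function column can be reproduced from the Beg. Total and Fail columns with the Kaplan-Meier (product-limit) formula: at each time the survival probability is multiplied by (1 - failures / number at risk). The Python sketch below is an illustration of that arithmetic only (the function name `kaplan_meier` is hypothetical); it does not compute the standard errors or confidence intervals.

```python
# (time, number at risk at the start, failures), copied from the sts list output.
table = [
    (2, 181, 26), (3, 148, 8), (4, 132, 5), (5, 117, 5), (6, 107, 1),
    (7, 98, 6), (8, 83, 4), (9, 71, 2), (10, 59, 1), (11, 49, 3),
    (12, 39, 0), (13, 28, 0), (14, 19, 0),
]


def kaplan_meier(rows):
    """Return {time: survival probability} using the product-limit formula."""
    surv = 1.0
    out = {}
    for time, at_risk, failures in rows:
        # Conditional probability of surviving this time point, given at risk.
        surv *= 1 - failures / at_risk
        out[time] = surv
    return out


survival = kaplan_meier(table)
failure = {t: 1 - s for t, s in survival.items()}

for t in (2, 5, 11):
    print(t, round(failure[t], 4))
```

Censored cases (the Net Lost column) never enter the formula directly; they only reduce the number at risk at later times, which is why the failure function stays flat from time 11 onwards.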

Graphs of survival time

[Figure: Kaplan-Meier survival estimate; vertical axis 0.00 to 1.00, horizontal axis: analysis time (0 to 15)]

Kaplan-Meier graphs

• Can read off the estimated probability of a relationship surviving at any time point on the graph
    E.g. at 5 years 88% are still cohabiting
• The survival probability only changes when an event occurs
• So the graph is stepped and not a smooth curve

[Figure: Kaplan-Meier survival estimate; vertical axis 0.00 to 1.00, horizontal axis: time in years (0 to 15)]

Comparing survival by group using Kaplan-Meier graphs

[Figure: Kaplan-Meier survival curves for sex = male and sex = female; vertical axis 0.00 to 1.00, horizontal axis: analysis time (0 to 15)]

Testing equality of survival curves among groups

The log-rank test

A non-parametric test that assesses the null hypothesis that there are no differences in survival times between groups

Log-rank test example

. sts test sex, logrank

         failure _d:  mastat == 3 4 5 6
   analysis time _t:  wave
  exit on or before:  mastat==3 4 5 6 .
                 id:  pid

Log-rank test for equality of survivor functions

       |   Events         Events
sex    |  observed       expected
-------+-------------------------
male   |        98         113.59
female |       136         120.41
-------+-------------------------
Total  |       234         234.00

             chi2(1) =       4.25
             Pr>chi2 =     0.0392

Significant difference between men and women (p = 0.039): women had more events than expected under the null hypothesis, men fewer
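The reported p-value can be checked from the chi-squared statistic, and the statistic itself can be roughly approximated from the observed and expected counts. The Python sketch below uses only the standard library. Note the assumption: the sum of (O - E)^2 / E is a common hand approximation, not the variance-based formula Stata uses, so it gives a slightly smaller value than the reported 4.25.

```python
import math

observed = {"male": 98, "female": 136}
expected = {"male": 113.59, "female": 120.41}

# Hand approximation to the log-rank statistic: sum over groups of (O - E)^2 / E.
# Stata's statistic uses the hypergeometric variance, so it differs slightly.
approx_chi2 = sum(
    (observed[g] - expected[g]) ** 2 / expected[g] for g in observed
)


def chi2_1df_pvalue(stat):
    """Upper-tail p-value for a chi-squared statistic with 1 degree of freedom."""
    # A chi-squared(1) variable is a squared standard normal, so
    # P(X > stat) = P(|Z| > sqrt(stat)) = erfc(sqrt(stat / 2)).
    return math.erfc(math.sqrt(stat / 2))


p_reported = chi2_1df_pvalue(4.25)  # statistic reported by sts test

print(round(approx_chi2, 2), round(p_reported, 3))
```

Both the approximation (about 4.16) and the reported statistic (4.25) exceed 3.84, the 5% critical value for one degree of freedom.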

The Cox regression model

Cox regression model

• Also known as the Cox proportional hazards model
• No longer modelling the duration
• Modelling the hazard rate
• Hazard rate: h(t) = f(t) / S(t)
    conditional on having survived up to t, what is the probability of leaving between t and t+Δt?

Some hazard shapes

• Increasing: as time elapses, more likely to experience the event
    Ex: onset of Alzheimer's
• Decreasing: as time elapses, less likely to experience the event
    Ex: survival after surgery
• U-shaped: the hazard rate is highest when duration is low or high
    Ex: age-specific mortality
• Constant: the hazard rate does not change with time
    Ex: time till next email arrives

Hazard rate varying with time

[Figure: smoothed hazard estimate; vertical axis .035 to .055, horizontal axis: analysis time (2 to 12)]

Cox regression model

• Regression model for survival analysis
• Can model time-invariant and time-varying explanatory variables
• Produces estimated hazard ratios (sometimes called rate ratios or risk ratios)
• Regression coefficients are on a log scale
    Exponentiate to get the hazard ratio
    Similar to odds ratios from logistic models
• Model is semi-parametric:
    Does not estimate how the hazard rate changes with time
    Estimates the effect of covariates in shifting a baseline hazard rate

Cox regression equation (i)

$$h_i(t) = h_0(t)\,\exp(\beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_n x_{in})$$

where
    $h_i(t)$ is the hazard function for individual i
    $h_0(t)$ is the baseline hazard function and can take any form
    $x_{i1}, x_{i2}, \dots, x_{in}$ are the covariates
    $\beta_1, \beta_2, \dots, \beta_n$ are the regression coefficients estimated from the data

One important assumption!
• the effect of covariates does not change with time (proportional hazards)

Cox regression equation (ii)

If we divide both sides of the equation on the previous slide by $h_0(t)$ and take logarithms, we obtain:

$$\ln\!\left(\frac{h_i(t)}{h_0(t)}\right) = \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_n x_{in}$$

• We call $h_i(t) / h_0(t)$ the hazard ratio
• The coefficients $\beta_1, \dots, \beta_n$ are estimated by Cox regression, and can be interpreted in a similar manner to those of multiple logistic regression
• $\exp(\beta_k)$ is the instantaneous relative risk of an event for a one-unit increase in covariate k

Cox regression assumptions

• Assumption of proportional hazards
• No censoring patterns (i.e. the censoring process is independent of the survival process)
• No left censoring
• True starting time (onset of risk unambiguous)
• Plus the assumptions for all modelling:
    Sufficient sample size, proper model specification, independent observations, exogenous covariates, no high multicollinearity, random sampling, and so on

Proportional hazards assumption

• The ratio of the hazard functions of two groups is constant over time
• Cox estimates of the βs = change in the hazard rate relative to the baseline hazard (= hazard function of the reference group)
• If a covariate fails this assumption:
    for hazard ratios that increase over time for that covariate, relative risk is overestimated
    for ratios that decrease over time, relative risk is underestimated
    standard errors are incorrect and significance tests are decreased in power

$$h_i(t) = h_0(t)\,\exp(\beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_n x_{in})$$
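Why the Cox specification implies a constant ratio can be seen by dividing the hazards of two individuals, i and j (the index j is introduced here for illustration). The baseline hazard cancels, so nothing that depends on t remains:

```latex
\frac{h_i(t)}{h_j(t)}
  = \frac{h_0(t)\,\exp(\beta_1 x_{i1} + \dots + \beta_n x_{in})}
         {h_0(t)\,\exp(\beta_1 x_{j1} + \dots + \beta_n x_{jn})}
  = \exp\bigl(\beta_1 (x_{i1} - x_{j1}) + \dots + \beta_n (x_{in} - x_{jn})\bigr)
```

If the two individuals differ by one unit in a single covariate k and are identical otherwise, the ratio reduces to $\exp(\beta_k)$, the hazard ratio for that covariate.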

Cox regression in Stata

Will first model a time invariant covariate (sex) on risk of partnership ending

Then will add a time dependent covariate (age) to the model

Cox regression in Stata

. stcox female

         failure _d:  couple == 0
   analysis time _t:  time
  exit on or before:  couple==0 .
                 id:  pid

Iteration 0:   log likelihood = -295.48855
Iteration 1:   log likelihood = -295.34519
Iteration 2:   log likelihood = -295.34518
Refining estimates:
Iteration 0:   log likelihood = -295.34518

Cox regression -- Breslow method for ties

No. of subjects =          181                     Number of obs   =      1231
No. of failures =           61
Time at risk    =         1231
                                                   LR chi2(1)      =      0.29
Log likelihood  =   -295.34518                     Prob > chi2     =    0.5923

------------------------------------------------------------------------------
          _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |   .8705857   .2259992    -0.53   0.593     .5234128    1.448034
------------------------------------------------------------------------------
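The z statistic and confidence interval in this output are computed on the log (coefficient) scale and then exponentiated. The Python sketch below recovers them from the reported hazard ratio and its standard error. It assumes the reported standard error is the delta-method value, HR × se(β), and uses 1.96 as the normal critical value, so the results match the output only to about three decimal places.

```python
import math

hazard_ratio = 0.8705857  # Haz. Ratio for female, from the stcox output
se_hr = 0.2259992         # its reported standard error

beta = math.log(hazard_ratio)   # coefficient on the log-hazard scale
se_beta = se_hr / hazard_ratio  # assumed delta-method relation: se(HR) = HR * se(beta)

z = beta / se_beta
ci_low = math.exp(beta - 1.96 * se_beta)
ci_high = math.exp(beta + 1.96 * se_beta)

print(round(beta, 4), round(z, 2), round(ci_low, 3), round(ci_high, 3))
```

Because the interval is symmetric on the log scale, it is asymmetric around the hazard ratio itself: it extends further above 0.87 than below it.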

Interpreting output from Cox regression

• Hazard ratios (exp(β))
    The effect of the covariate on the hazard rate relative to the baseline hazard
• Baseline hazard function not estimated → no intercept
• In our example, the baseline hazard is when sex=1 (male), i.e. female = 0
• The hazard ratio is the ratio of the hazards for a unit change in the covariate
    HR = 0.87 (roughly 0.9) for women vs. men
    The risk of partnership breakdown is lowered by about 10% for women compared with men, although the confidence interval (0.52 to 1.45) includes 1, so the difference is not statistically significant
• Hazard ratio assumed constant over time
    At any time point, the hazard of partnership breakdown for a woman is 0.9 times the hazard for a man

Interpreting output from Cox regression (ii)

• The hazard rate may change with time but the hazard ratio is constant
• So if we know that the probability of a man having a partnership breakdown in the following year is 1.5%, then the probability of a woman having a partnership breakdown in the following year is approximately 0.015 × 0.9 = 1.35%
• The (instantaneous) hazard rate cannot be derived from Cox estimates alone
    Need to estimate the baseline hazard function
• But... can calculate the probability of experiencing the event first = (hazard ratio) / (1 + hazard ratio)
• So in our example, a HR of 0.9 corresponds to a probability of 0.47 that a woman will experience a partnership breakdown first
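The two calculations on this slide are simple enough to check directly. A short Python sketch follows (the function name `prob_event_first` is illustrative, and the 1.5% figure for men is the hypothetical value used on the slide, not an estimate from the data):

```python
def prob_event_first(hazard_ratio):
    """Probability that the higher-index group experiences the event first,
    given a constant hazard ratio relative to the comparison group."""
    return hazard_ratio / (1 + hazard_ratio)


hr = 0.9                    # rounded hazard ratio, women vs. men
p_man_next_year = 0.015     # hypothetical probability for a man, from the slide

p_woman_next_year = p_man_next_year * hr
p_first = prob_event_first(hr)

print(round(p_woman_next_year, 4), round(p_first, 2))
```

A hazard ratio of exactly 1 gives a probability of 0.5, i.e. neither group is more likely to experience the event first.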

Time dependent covariates

• Examples:
    Current age group rather than age at baseline
    GHQ score may change over time and predict break-ups
• Will use age to predict duration of cohabitation
    Nonlinear relationship hypothesised
    Recode age into 8 equally spaced age groups

Cox regression with time dependent covariates

. stcox female i.agecat

         failure _d:  couple == 0
   analysis time _t:  time
  exit on or before:  couple==0 .
                 id:  pid

Iteration 0:   log likelihood = -889.54681
Iteration 1:   log likelihood = -882.64848
Iteration 2:   log likelihood = -881.56019
Iteration 3:   log likelihood = -881.54391
Iteration 4:   log likelihood = -881.54388
Refining estimates:
Iteration 0:   log likelihood = -881.54388

Cox regression -- Breslow method for ties

No. of subjects =          474                     Number of obs   =      2527
No. of failures =          156
Time at risk    =         2527
                                                   LR chi2(8)      =     16.01
Log likelihood  =   -881.54388                     Prob > chi2     =    0.0423

------------------------------------------------------------------------------
          _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |   .9966605   .1622162    -0.02   0.984     .7244458    1.371161
             |
      agecat |
          36 |   .7828117   .1630285    -1.18   0.240     .5204583    1.177413
          45 |   .7128063   .1901955    -1.27   0.205     .4225212    1.202526
          54 |   .6585124   .2116483    -1.30   0.194     .3507395    1.236355
          63 |   1.103972    .430024     0.25   0.800     .5145101    2.368764
          73 |   1.389126   .7298926     0.63   0.532     .4960122    3.890371
          82 |   4.411906   1.931261     3.39   0.001      1.87078    10.40471
          91 |   5.355065   5.439317     1.65   0.099     .7314272    39.20652
------------------------------------------------------------------------------

Testing the proportional hazards assumption

Graphical methods

• Comparison of Kaplan-Meier observed & predicted curves by group. Observed lines should be close to predicted
• Survival probability plots (cumulative survival against time for each group). Lines should not cross
• Log minus log plots (minus log cumulative hazard against log survival time). Lines should be parallel

Testing the proportional hazards assumption

Formal tests of the proportional hazards assumption

• Include an interaction between the covariate and a function of time. Log time is often used but it could be any function. If the interaction is significant then the assumption is violated
• Test the proportional hazards assumption on the basis of partial residuals, a type of residual known as Schoenfeld residuals (in Stata: estat phtest after stcox)

When assumptions are not met

If categorical covariate, include the variable as a strata variable

Allows underlying hazard function to differ between categories and be non proportional

Estimates separate underlying baseline hazard for each stratum

When assumptions are not met

If a continuous covariate

• Consider splitting the follow-up time. For example, the hazard may be proportional within the first 5 years, the next 5-10 years and so on
• Could the covariate be included as a time-dependent covariate?
• Consider another (parametric) model

Censoring assumptions

Censored cases must be independent of the survival distribution. There should be no pattern to these cases, which instead should be missing at random.

If censoring is not independent, then censoring is said to be informative

• You have to judge this for yourself
• Usually don't have any data that can be used to test the assumption
• Think carefully about start and end dates
• Always check a sample of records

True starting time

The ideal model for survival analysis would be where there is a true zero time

If the zero point is arbitrary or ambiguous, the data series will be different depending on starting point. The computed hazard rate coefficients could differ, sometimes markedly

Conduct a sensitivity analysis to see how coefficients may change according to different starting points

Other extensions to survival analysis

• Repeated events
• Multi-state models (more than 1 event type), competing risks
    Transition from employment to unemployment or leaving the labour market
    Modelling type of exit from a cohabiting relationship: separation/divorce/widowhood

Could you use logistic regression instead?

May produce similar results for short or fixed follow-up periods

Examples
• everyone followed up for 7 years
• maximum follow-up 5 years

Results may differ if there are varying follow-up times

If dates of entry and dates of events are available then better to use Cox regression

Finally….

This is just an introduction to survival/ event history analysis

• Only reviewed the Cox regression model
• There are also parametric survival methods
• But Cox regression is likely to suit the type of analyses of interest to sociologists

Consider an intensive course if you want to use survival analysis in your own work
