
WHITE PAPER

GUIDE TO SEGMENTATION FOR SURVIVAL MODELS USING SAS

Alok Rustagi, Vice President, EXL

Swagata Majumder, Senior Manager, EXL

Written by

April 26, 2018

Contributor:

EXLSERVICE.COM 2

The approach described in this paper relies on comparing the survivor functions across subgroups through the Log-Rank test (PROC LIFETEST). The examples given are from the credit card domain, but the technique can be applied to any kind of survival data to generate intuitive segmentation trees.

The importance of segmentation in any kind of modelling exercise is undeniable. Segmentation into different population sets enables a modeller to develop separate models for different subsets of the population. This often outperforms a single standalone model through higher accuracy in predictions, lower bias, or both. The relationship between the predictors and target variables is often different in each subpopulation, which can be effectively captured by a segmented model leading to its better performance.

Popular segmentation tools like Classification and Regression Tree (CART) were originally developed to analyse cross-sectional data, where several subjects are observed at the same point in time. Applying these techniques is complex in the case of survival data, where subjects are observed over different periods of time or until the event of interest occurs. A distinguishing feature of this kind of data, called “censoring”, makes it difficult to handle with conventional statistical methods. In simple terms, if subjects are observed over a five-year duration to see whether an event of interest (for example, default on a credit card payment) occurs, some subjects will not have defaulted by the end of the study. Such cases are referred to as censored. It is not known when or if a censored customer will experience the event, only that he or she has not done so by the end of the observation period.

Methodology

Interest in using survival analysis for credit scoring is quite recent, and is aimed at assessing the risk of customers who have already been assigned credit cards. The reason is that the objective of credit scoring, also known as credit risk modelling, has recently shifted towards choosing the customers that will provide the highest profit. To do so, loan decisions must consider not only whether a customer will default, but also when. This knowledge can be gained through survival models. The use of survival models also avoids the need to define a fixed period within which the default event is measured, a step inherent to logistic regression. They also allow the inclusion of risk factors that change over time, such as behavioural and macroeconomic variables. There are several alternative survival models to estimate the hazard or survivor function, the most popular in the credit scoring literature being the Cox Proportional Hazards (PH) model.

However, before proceeding to the actual modelling exercise, it often makes sense to split the data into subgroups and build separate models for each of these groups. This can allow for greater accuracy in predictions and better portfolio management. The question then becomes how many models are optimal, and which segmentation structure will provide a client the best business results.

This paper highlights how to tackle segmentation structure in the case of survival data, and also elaborates on its implementation in SAS. This is a step prior to the actual model building exercise, and is about dividing the population into segments which are homogeneous within themselves and heterogeneous amongst themselves, so that separate probability of default models can be developed on each of these segments.


In the case of cross-sectional data, a classification tree technique called CHAID (Chi-Square Automatic Interaction Detection) is very popular for segmentation. This technique recursively partitions a population into separate and distinct groups defined by a set of independent predictor variables, such that the variance of the target variable is minimized within the groups and maximized across the groups. The advantage of CHAID is that the output is highly visual and easy to interpret. The development of the decision, or classification, tree starts with identifying the target (dependent) variable, which is considered the root. CHAID analysis splits the target into two or more categories, called the initial or parent nodes, and the nodes are then split into child nodes using statistical algorithms.

The methodology outlined in this paper is somewhat inspired by CHAID, but it has been adapted to suit the requirements of a time series data structure. This paper provides a step-by-step guide to choosing the appropriate predictor or predictors to segment the population, and highlights how to use a Log-Rank test to decide the potential candidates for segmentation so that the underlying survivor functions of the subgroups are statistically different from each other.

A. Data Structure

The first step is to develop an appropriate segmentation structure, so that separate survival models can be built for each of these segments. The segmentation structure should ensure that accounts with similar default patterns are grouped together.

The data available for this example analysis is at account and monthly level: for each month (referred to as a snapshot from now on), there is a dataset containing all the non-defaulted accounts as of that snapshot, and their characteristics such as month on book (MOB), delinquency status (DELQ), utilization (UTIL), balance (BAL), payments (PMT), full-payer indicator (FULL_PAY_IND) and so on. In addition, two variables denote the default performance of each account up to the most recent date for which data is available: the default indicator (t_PD), a binary variable taking a value of 1 if the account has ever defaulted and 0 otherwise, and the default month (def_month), which denotes the month in which the account defaulted, counted in months from the snapshot, and takes a value of 0 if the account has never defaulted.

A snippet of the data structure for a particular snapshot (in this case, May 2014) is illustrated below. A similar dataset is available for the other snapshots as well.

Snapshot  Account ID  MOB  DELQ  UTIL  BAL   PMT  FULL_PAY_IND  t_PD  def_month
201405    A1          36   1     80    1000  20   0             1     14
201405    A2          60   0     50    800   100  1             0     0
201405    A3          .5   2     90    920   40   0             1     3

Table 1: Snapshot of Account Data

For Example Purposes Only

A1, A2 and A3 have a non-default status as of May 2014. Assuming that data is available until December 2015, each account can be observed for a performance window of 19 months from the snapshot date. A1 and A3 hit a default status in this performance window, as indicated by the value of t_PD, and def_month records the month of the window in which each of them defaults.
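For readers who want to follow the later steps on a small example, the three accounts of Table 1 can be entered with a short data step. This is only a sketch: the dataset name Y_201405 is an assumption chosen to match the Y_&SAMPDATE. naming used by the macro in this paper, ACCOUNT_ID and SNAPSHOT are assumed variable names, and the values are copied as printed in the table.

```sas
/* Hypothetical input dataset for the May 2014 snapshot (rows of Table 1) */
DATA Y_201405;
  INPUT SNAPSHOT ACCOUNT_ID $ MOB DELQ UTIL BAL PMT FULL_PAY_IND T_PD DEF_MONTH;
  DATALINES;
201405 A1 36 1 80 1000 20 0 1 14
201405 A2 60 0 50 800 100 1 0 0
201405 A3 .5 2 90 920 40 0 1 3
;
RUN;

/* Quick look at the result */
PROC PRINT DATA = Y_201405 NOOBS;
RUN;
```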


B. Identification of Potential Candidates for Segmentation

Once the data has been converted to the desired format, the next task is to identify a set of potential candidates that can be used to segment the population using the account characteristics available. Business intuition comes in handy at this stage, as there may be certain variables that are important based on policies, underwriting strategies etc. Another approach is to shortlist the variables by fitting a survival model on the entire population using all possible predictors. This can be achieved in SAS using the stepwise selection methods within PROC PHREG. Variables that are highly significant in terms of p-value of Chi-Square in this model are most likely to be good candidates for segmenting the population.
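A minimal sketch of the stepwise shortlist described above is given below. It assumes that the stacked dataset is called STACKED_DATA and already contains EVENT_DURATION and T_PD; the predictor list and the 0.05 entry and stay thresholds are illustrative assumptions, not values taken from the paper.

```sas
/* Cox model on the full population with stepwise selection.     */
/* Predictors that enter with large Chi-Square values are the    */
/* natural first candidates for segmentation.                    */
PROC PHREG DATA = STACKED_DATA;
  MODEL EVENT_DURATION*T_PD(0) = MOB DELQ UTIL BAL PMT FULL_PAY_IND
        / SELECTION = STEPWISE SLENTRY = 0.05 SLSTAY = 0.05;
RUN;
```

The summary of the selection steps in the output lists the variables in order of entry together with their Chi-Square statistics, which can be read alongside business intuition when choosing candidates.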

The segmentation structure is also governed by data availability. There are certain subgroups of the population which have more information available than others, and it is a good idea to develop a separate model for this segment which optimally utilizes all of the available extra information.

Irrespective of whether a variable is shortlisted through business intuition or statistical techniques, it is essential to convert continuous variables to categorical ones in order to compare the survivor functions across the different categories of the variable. This can be achieved by binning the values of the variable into groups.
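One way to carry out this conversion is to bin a continuous variable into ten groups of roughly equal size with PROC RANK and then compare default rates across the bins. The sketch below assumes a stacked dataset named STACKED_DATA; the output names STACKED_BINNED, UTIL_BIN and BAL_BIN and the 12-month horizon are assumptions for illustration. The simple rate shown ignores censoring, so it is only meaningful for snapshots with at least 12 months of performance data.

```sas
/* Assign each account to one of ten bins (numbered 0 to 9) per variable */
PROC RANK DATA = STACKED_DATA OUT = STACKED_BINNED GROUPS = 10;
  VAR UTIL BAL;
  RANKS UTIL_BIN BAL_BIN;
RUN;

/* Share of accounts in each utilization bin that default within 12 months */
PROC SQL;
  SELECT UTIL_BIN,
         COUNT(*) AS ACCOUNTS,
         MEAN(T_PD = 1 AND EVENT_DURATION <= 12) AS DEFAULT_RATE_12M
  FROM STACKED_BINNED
  GROUP BY UTIL_BIN;
QUIT;
```

Adjacent bins with similar rates can then be merged to form the categories of a variable such as UTIL_FMT.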

Returning to the accounts in Table 1, def_month takes a value of 14 for A1 and 3 for A3, implying that they default in July 2015 and August 2014 respectively. On the other hand, A2 does not default over the entire performance window, and hence both t_PD and def_month are 0 for this account. In the context of survival analysis, A2 is a censored case, as the study ends before default occurs. A1 and A3, by contrast, have uncensored default times.

The next step is to convert the data into a format which can be easily handled by the survival analysis procedures in SAS, be it LIFETEST, LIFEREG or PHREG. For each account in the sample, there must be one variable (named event_duration in this example) that contains either the time that an event occurred or, for censored cases, the last time at which that account was observed, both measured from the chosen origin. A second variable is required to denote the status of the account at the time recorded in the event_duration variable. Fortunately, this variable is already available in the data (t_PD) which takes a value of 1 for uncensored cases and 0 for censored ones. The variable event_duration can be created by a simple data step within SAS as follows:

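*CREATING EVENT_DURATION VARIABLE;
%MACRO INCL_DATE(SAMPDATE = , TERM = );
  DATA X_&SAMPDATE.;
    SET Y_&SAMPDATE.;
    /* Censored accounts (DEF_MONTH = 0) are observed for the full window of TERM months */
    IF DEF_MONTH = 0 THEN EVENT_DURATION = &TERM.;
    /* Defaulted accounts: months from the snapshot to default */
    ELSE EVENT_DURATION = DEF_MONTH;
  RUN;
%MEND;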

The parameter TERM in the above macro is simply the number of months between the snapshot date and the last date of the study (December 2015 in this example). Depending on the snapshot date, the macro can be invoked as follows:

%INCL_DATE(SAMPDATE = 201409, TERM = 15);

%INCL_DATE(SAMPDATE = 201203, TERM = 45);

The data for each month is then appended to generate a master data on which the segmentation exercise is to be carried out.
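The appending step can be sketched as a single data step. The output name STACKED_DATA is an assumption taken from the dataset referenced in the summarizing query of this paper, and only the two snapshots from the macro calls above are listed; in practice every X_ dataset produced by the macro would be included.

```sas
/* Stack the per-snapshot datasets created by %INCL_DATE into one master dataset */
DATA STACKED_DATA;
  SET X_201203 X_201409;
RUN;
```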


The conversion of continuous variables to categorical ones can be achieved by grouping the accounts into ten equal bins based on the values of the variable concerned. Adjacent groups can then be clubbed if their default rates over the next X months are similar (X can typically be 12 or 18 months, depending on the length of the performance window of the latest snapshot available).

C. GENERATION OF SEGMENTATION STRUCTURE

I. Theoretical Background of Statistical Tests Used

The Log-Rank test is used as the approach for sub-segmenting the population. It is a non-parametric hypothesis test that compares the survival distributions of two or more samples. It compares estimates of the hazard functions of the groups at each observed event time, that is, each unique time at which any individual from any group experiences the event. The null hypothesis is that the hazard functions of all groups are equal over the entire study time.

The idea behind segmentation is to divide the population in such a way that the survival functions are statistically different across the sub-categories. The Log-Rank test can be represented mathematically as follows.

$$H_0: h_1(t) = h_2(t) = \dots = h_k(t) \quad \text{for all } t \le \tau$$

$$H_1: \text{at least one } h_j(t) \text{ is different for some } t \le \tau$$

Here $\tau$ is the largest time at which each group has at least one individual at risk, and $h_j(t)$ represents the hazard function at time $t$ for group $j$. The Log-Rank test compares the hazards from each group at each event time between 0 and $\tau$.

If the hazards are equal for all groups, then the proportion of each group experiencing the event at any given event time $t_i$ is expected to equal the proportion of the overall population experiencing the event at that same time:

$$\frac{d_{ij}}{Y_{ij}} = \frac{d_i}{Y_i} \quad \text{for all event times } t_i,\ i = 1, 2, \dots, D,\ j = 1, 2, \dots, k$$

Where:

$d_{ij}$ = Number of events experienced by group $j$ at event time $t_i$

$Y_{ij}$ = Number of individuals at risk in group $j$ just prior to time $t_i$

$d_i$ = Total number of events experienced by the entire study population at event time $t_i$

$Y_i$ = Total number of persons at risk in the entire study population just prior to time $t_i$

The Log-Rank test begins by calculating, for each group $j = 1$ through $k$, a statistic representing the sum of weighted differences between $d_{ij}/Y_{ij}$ and $d_i/Y_i$ over the event times $t_i$. For the Log-Rank test, the weights applied to these differences are all equal to 1, so each event time has an equal weighting on the value of the statistic. The statistics calculated for the $k$ groups are linearly dependent, and therefore only $(k-1)$ of them may be used to calculate a test statistic. To calculate the test statistic, $(k-1)$ of the statistics are formed into a vector called $Z$. The variances and covariances of these $(k-1)$ statistics are placed into a variance-covariance matrix called $\Sigma$. The test statistic is then calculated as:

$$\chi^2 = Z \, \Sigma^{-1} Z^{t}$$

This has a chi-squared distribution with $(k-1)$ degrees of freedom when the null hypothesis is true.

The Log-Rank test allows for two types of imperfect survival data: left-truncated data and right-censored data. If censored observations are not present in the data, then the Wilcoxon rank sum test is more appropriate.

II. Application of the Log-Rank Test

The LIFETEST procedure in SAS can be used to generate the Log-Rank test for comparison of survival patterns across different groups.
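As a minimal illustration of the Log-Rank comparison of groups, the sketch below runs PROC LIFETEST on a small made-up dataset with two groups. The dataset TOY, the variable GRP and all of its values are hypothetical and serve only to show the shape of the call; with two strata the test has one degree of freedom.

```sas
/* Hypothetical toy data: duration in months, T_PD = 1 for default, 0 for censored */
DATA TOY;
  INPUT GRP $ EVENT_DURATION T_PD;
  DATALINES;
A 3 1
A 5 1
A 8 0
A 12 1
A 19 0
B 9 1
B 14 1
B 19 0
B 19 0
B 19 0
;
RUN;

/* Log-Rank test of equal hazards across the two groups */
PROC LIFETEST DATA = TOY NOTABLE;
  TIME EVENT_DURATION*T_PD(0);
  STRATA GRP / TEST = LOGRANK;
RUN;
```

The Chi-Square value and p-value appear in the test of equality over strata in the output, which is the same quantity that is ranked across candidate variables in the segmentation exercise.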


It is essential to configure some options of the LIFETEST procedure before executing it:

• In the TIME statement, the survival time variable, EVENT_DURATION, is crossed with the censoring variable, T_PD, with the value 0 indicating censoring. Hence the values of EVENT_DURATION are considered censored if the corresponding values of T_PD are 0. Otherwise, they are considered as event times.

• In the STRATA statement, the variable name is specified, which indicates that the data are to be divided into strata based on the values of that particular variable. In this example, a separate PROC LIFETEST is run for each of the five shortlisted variables - UTIL_FMT, DELQ, MOB, BAL_FMT and FULL_PAY_IND.

• The METHOD option specifies the method to be used to compute the survival function estimates. LT refers to the life-table (actuarial) method, which is preferred when the number of observations is large (see End Note 2).

• The INTERVALS option specifies interval endpoints for life-table estimates. Each interval contains its lower endpoint but does not contain its upper endpoint.

Continuing with the credit card example as before, once a few categorical variables have been shortlisted through either business intuition or statistical techniques as potential candidates for segmenting the population, a Log-Rank test is applied to each of them to test whether the survivor and hazard functions differ across the categories of that variable. The variable having the highest Chi-Square statistic is used to create the first split of the population.

Before going ahead with the segmentation methodology, it is advisable to summarize the data across the shortlisted variables for ease of computation. Assuming that there are five shortlisted variables, MOB, DELQ, UTIL, BAL and FULL_PAY_IND, this can be achieved in SAS through a simple SQL procedure. Continuous variables like utilization and balance need to be converted to categorical variables, UTIL_FMT and BAL_FMT.

*SUMMARIZING THE DATA;
PROC SQL;
  CREATE TABLE SUMMARY1 AS
  SELECT EVENT_DURATION, UTIL_FMT, DELQ, MOB, BAL_FMT, FULL_PAY_IND, T_PD,
         COUNT(*) AS NUMBER,        /* accounts in each combination */
         SUM(T_PD = 1) AS DEFAULTS  /* defaulted accounts in each combination */
  FROM STACKED_DATA
  GROUP BY 1,2,3,4,5,6,7;
QUIT;

Next, the Log-Rank test is computed for each of the selected variables in turn by specifying it in the STRATA statement of PROC LIFETEST. A separate survivor function is then estimated for each stratum, and tests of the homogeneity of strata are performed. The SAS code is as follows, with VAR_NAME standing for the variable being tested:

*LIFETEST FOR EACH VARIABLE;
ODS OUTPUT SURVDIFF = SD HOMTESTS = HT;
PROC LIFETEST DATA = SUMMARY1 METHOD = LT INTERVALS = 0 TO 108 BY 2;
  TIME EVENT_DURATION*T_PD(0);
  STRATA VAR_NAME / ADJUST = TUKEY;
  FREQ NUMBER;
RUN;
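Running the step above once per candidate variable can be wrapped in a macro, so that the Log-Rank results for all candidates end up in one ranked table. The following is a sketch under stated assumptions rather than the authors' program: the macro name RUN_LOGRANK, the parameter VAR_SEG, and the datasets HT_<variable> and ALL_TESTS are hypothetical names, and the pairwise ADJUST option is left out because only the overall test is needed for ranking.

```sas
%MACRO RUN_LOGRANK(VAR_SEG = );
  /* Capture the homogeneity tests for this candidate variable */
  ODS OUTPUT HOMTESTS = HT;
  PROC LIFETEST DATA = SUMMARY1 METHOD = LT INTERVALS = 0 TO 108 BY 2;
    TIME EVENT_DURATION*T_PD(0);
    STRATA &VAR_SEG.;
    FREQ NUMBER;
  RUN;

  /* Keep the Log-Rank row and tag it with the variable name */
  DATA HT_&VAR_SEG.;
    LENGTH VAR $30;
    SET HT;
    WHERE TEST = "Log-Rank";
    VAR = "&VAR_SEG.";
  RUN;
%MEND RUN_LOGRANK;

%RUN_LOGRANK(VAR_SEG = DELQ)
%RUN_LOGRANK(VAR_SEG = UTIL_FMT)
%RUN_LOGRANK(VAR_SEG = BAL_FMT)
%RUN_LOGRANK(VAR_SEG = MOB)
%RUN_LOGRANK(VAR_SEG = FULL_PAY_IND)

/* Append the per-variable results and rank them by Chi-Square */
DATA ALL_TESTS;
  SET HT_DELQ HT_UTIL_FMT HT_BAL_FMT HT_MOB HT_FULL_PAY_IND;
RUN;

PROC SORT DATA = ALL_TESTS;
  BY DESCENDING CHISQ;
RUN;
```

The first row of ALL_TESTS then names the variable with the largest Chi-Square statistic, which is the candidate for the next split.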


• With INTERVALS = 0 TO 108 BY 2, each interval contains its lower endpoint but not its upper endpoint, so the specification in the above code produces the set of intervals

{[0,2), [2,4), ..., [106,108), [108, ∞)}

• The FREQ statement is useful for producing life tables when the data are already in the form of a summary dataset. The FREQ statement identifies a variable (NUMBER in this case) that contains the frequency of occurrence of each observation. PROC LIFETEST treats each observation as if it appeared n times, where n is the value of the FREQ variable for the observation.

Once the LIFETEST procedure has been run for a shortlisted variable, the homogeneity test results are stored in the dataset named HT. The Log-Rank row is kept and tagged with the name of the variable, and the results for all the variables are appended together to create a final table of Chi-Square test results.

DATA HT1;
  LENGTH VAR $30;
  SET HT;
  WHERE TEST = "Log-Rank";
  /* VAR_SEG is assumed to be a macro variable holding the name of the variable being tested */
  VAR = "&VAR_SEG.";
RUN;

The Chi-Square test results for each variable are then appended to create a table like Table 2 below, sorted in descending order of Chi-Square values. The top variable, DELQ, is used as the first segmentation split.

Level 1

Test      ChiSq    DF  ProbChiSq  Var
Log-Rank  734,420  3   <.0001     DELQ
Log-Rank  457,622  5   <.0001     UTIL_FMT
Log-Rank  340,373  5   <.0001     BAL_FMT
Log-Rank  295,331  9   <.0001     MOB
Log-Rank  294,356  1   <.0001     FULL_PAY_IND

Table 2: Chi-Square Test Results

Now, DELQ has 4 categories: cycle 0, 1, 2 and 3. Since cycle 0 comprises around 95% of the non-default population, the data is divided into two categories: InOrder (cycle 0) and Delinquent (cycle 1+). Since the Delinquent population is relatively small, it is not split further. The InOrder population is then considered, and the segmentation exercise is carried out on this subset using the remaining variables.

Level 2: DELQ = 0 (InOrder)

Test      ChiSq    DF  ProbChiSq  Var
Log-Rank  275,918  1   <.0001     FULL_PAY_IND
Log-Rank  230,824  5   <.0001     UTIL_FMT
Log-Rank  175,167  9   <.0001     MOB
Log-Rank  169,771  5   <.0001     BAL_FMT

Table 3: In Order Population Segmentation

The InOrder population is further split into Full Payer and Revolver populations according to the top splitter in this subset, FULL_PAY_IND. Each of these subsets can be split further using the remaining variables, following the same steps as before. It should be kept in mind that every final node after segmentation should have sufficient volume for the model to be robust. In this example, the final segmentation structure is obtained by further splitting each of the Full Payer and Revolver populations by MOB.
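The tree obtained in this way (DELQ first, then FULL_PAY_IND, then MOB) can be encoded as a single segment indicator on the stacked data. The sketch below is illustrative only: the paper does not state the MOB threshold, so MOB_CUT = 12 is a placeholder, and the numbering of the five segments and the output name SEGMENTED are assumptions. The indicator is named after the pd_seg_ind_sm_2 variable used in section D.

```sas
/* Placeholder threshold: the paper refers to it only as X */
%LET MOB_CUT = 12;

DATA SEGMENTED;
  SET STACKED_DATA;
  /* Segment 1: delinquent accounts, not split further */
  IF DELQ > 0 THEN PD_SEG_IND_SM_2 = 1;
  /* Segments 2 and 3: in-order full payers, split by months on book */
  ELSE IF FULL_PAY_IND = 1 AND MOB < &MOB_CUT. THEN PD_SEG_IND_SM_2 = 2;
  ELSE IF FULL_PAY_IND = 1 THEN PD_SEG_IND_SM_2 = 3;
  /* Segments 4 and 5: in-order revolvers, split by months on book */
  ELSE IF MOB < &MOB_CUT. THEN PD_SEG_IND_SM_2 = 4;
  ELSE PD_SEG_IND_SM_2 = 5;
RUN;

/* Check that every final node has sufficient volume */
PROC FREQ DATA = SEGMENTED;
  TABLES PD_SEG_IND_SM_2;
RUN;
```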



D. ANALYSIS OF SEGMENTATION PERFORMANCE

The blue highlighted boxes in Figure 1 are the final segments for this population. Once the final segmentation structure has been decided, it makes sense to check the survival distributions across the five segments. The following program can be used for that, assuming that the variable “pd_seg_ind_sm_2” captures the new segmentation structure:

ods output SurvDiff = SD HomTests = HT;
proc lifetest data=<DATA> method=lt intervals=0 to 108 by 2 plots=(s,h);
  time event_duration * t_pd(0);
  strata pd_seg_ind_sm_2 / adjust=tukey;
run;

The ADJUST option (new in SAS 9.2) tells PROC LIFETEST to produce p-values for all ten pairwise comparisons of the five strata and to report p-values that have been adjusted for multiple comparisons using Tukey’s method. The results are shown in Tables 4 and 5.

Table 4 shows the overall chi-square tests of the null hypothesis that the survivor functions are identical across the five segments. All three tests are highly significant, unanimously rejecting the null hypothesis and providing evidence that the hazard function of at least one of the five strata is significantly different from the others for some value of t ≤ τ.

Table 5 shows the Log-Rank tests comparing each possible pair of strata. All the tests are significant, both using the raw p-values and after the Tukey adjustment, suggesting that each segment is significantly different from every other. This rules out collapsing any of the segments.

The graph in Figure 2 shows some evidence of difference in survival functions across the five strata, thereby supporting the results already obtained from the Chi-Square tests.

Figure 1: Final Segmentation Tree

Cards (Stacked Sample): Volume Percentage 100%, 12M Default Rate 6.43%, 60M Default Rate 27.22%
  DELQ > 0: Volume Percentage 4.0%, 12M Default Rate 62.70%, 60M Default Rate 93.92%
  DELQ = 0: Volume Percentage 96.0%, 12M Default Rate 3.62%, 60M Default Rate 21.76%
    FULL_PAY_IND = 1: Volume Percentage 52.44%, 12M Default Rate 0.75%, 60M Default Rate 3.69%
      MOB < X: Volume Percentage 6.11%
      MOB >= X: Volume Percentage 46.34%, 12M Default Rate 0.33%, 60M Default Rate 3.03%
    FULL_PAY_IND = 0: Volume Percentage 43.55%, 12M Default Rate 9.28%, 60M Default Rate 42.32%
      MOB < X: Volume Percentage 5.41%
      MOB >= X: Volume Percentage 38.15%

Test       ChiSq    DF  ProbChiSq
Log-Rank   559,397  4   <.0001
Wilcoxon   643,746  4   <.0001
-2Log(LR)  455,316  4   <.0001

Table 4: Comparison of Results



Figure 2: Survival Graphs for Chosen Segments

[Plot of the survival distribution function (vertical axis, 0.00 to 1.00) against time (horizontal axis, 0 to 100), with one curve for each stratum pd_seg_ind_sm_2 = 1, 2, 3, 4 and 5.]

Adjustment for Multiple Comparisons for the Log-Rank Test

Strata Comparison                                p-values
pd_seg_ind_sm_2  pd_seg_ind_sm_2  Chi-Square  Raw     Tukey-Kramer
1                2                160,134     <.0001  <.0001
1                3                82,894      <.0001  <.0001
1                4                29,425      <.0001  <.0001
1                5                124,817     <.0001  <.0001
2                3                335,678     <.0001  <.0001
2                4                138,374     <.0001  <.0001
2                5                388,489     <.0001  <.0001
3                4                95          <.0001  <.0001
3                5                3,176       <.0001  <.0001
4                5                393         <.0001  <.0001

Table 5: Log Rank Test Results


Finally, the default rates of the five segments at two horizons, 12 months and 60 months, are considerably different from one another, which further indicates that the segments differ in terms of default behaviour.

Conclusion and Limitations

Survival analysis can be applied to build models for time to default on credit cards. This knowledge helps the issuer to pre-empt attrition and devise customer engagement strategies. This paper proposed creating an intuitive segmentation structure on a large dataset of credit card accounts before the onset of the actual modelling exercise, using the Log-Rank test to compare the hazard functions across the different subgroups. The program used in this paper serves as a fast, efficient way to churn through a large quantity of data and provide the client with the information needed for a final decision on modelling splits.

This approach also has its own set of limitations:

First, the test statistics for the Log-Rank test are based on large-sample approximations and give good results when the sample size is large. The number of comparison segments should not be allowed to get too large, to avoid having segments with too few subjects. Each group should contain at least 30 subjects, and preferably more.

Secondly, the Log-Rank test is most powerful for detecting differences of the form $S_1(t) = [S_2(t)]^{\gamma}$, where $\gamma$ is some positive number other than 1.0. This equation defines a proportional hazards model, and the Log-Rank test is not particularly good at detecting differences when survival curves cross.

Segmentation is a unique aspect of modelling in that it blends art and science in almost equal measure. There are times when a segmentation structure based entirely on statistical measures does not add enough value; it is effective only when these numbers are coupled with business requirements and common sense, as was demonstrated in the example discussed above. Dividing the population into five different groups and building separate survival models on each of them yielded better results than building a single standalone model on the entire population, as these groups are inherently different in terms of their survival patterns.

References

1. Allison, P. D. (2010). Survival Analysis Using SAS®: A Practical Guide, Second Edition. Cary, NC: SAS Institute Inc.

2. Bellotti, T., & Crook, J. (2007, May 7). Credit Scoring With Macroeconomic Variables Using Survival Analysis.

3. Man, R. (2014, May 9). Survival analysis in credit scoring: A framework for PD estimation.

4. Pazdera, J., Rychnovsky, M., & Zahradnik, P. (2009, Feb 1). Survival analysis in credit scoring.

5. Sayles, H., & Soulakova, J. (n.d.). Log-Rank Test for More Than Two Groups.

6. Weldon, G., & Zidun, H. (n.d.). Segmentation of Data Prior to Modeling. Atlanta: Merkle, Inc.

End Notes

1 Refer to (Bellotti & Crook, 2007), (Pazdera, Rychnovsky, & Zahradnik, 2009), (Man, 2014).

2 The Kaplan-Meier method of estimating survivor functions is more suitable when the sample size is small and event times are measured with precision. It is in fact the default method in PROC LIFETEST.

EXLSERVICE.COM

GLOBAL HEADQUARTERS: 280 Park Avenue, 38th Floor, New York, New York 10017, T +1 212.277.7100, F +1 212.771.7111

United States • United Kingdom • Czech Republic • Romania • Bulgaria • India • Philippines • Colombia • South Africa

EXL (NASDAQ: EXLS) is a leading operations management and analytics company that designs and enables agile, customer-centric operating models to help clients improve their revenue growth and profitability. Our delivery model provides market-leading business outcomes using EXL’s proprietary Business EXLerator Framework®, cutting-edge analytics, digital transformation and domain expertise. At EXL, we look deeper to help companies improve global operations, enhance data-driven insights, increase customer satisfaction, and manage risk and compliance. EXL serves the insurance, healthcare, banking and financial services, utilities, travel, transportation and logistics industries. Headquartered in New York, New York, EXL has more than 27,000 professionals in locations throughout the United States, Europe, Asia (primarily India and Philippines), South America, Australia and South Africa.

© 2018 ExlService Holdings, Inc. All Rights Reserved. For more information, see www.exlservice.com/legal-disclaimer
