
AN EXL WHITE PAPER

Guide to Segmentation for Survival Models using SAS

Written by:
Swagata Majumder, Senior Manager, EXL

Contributor:
Alok Rustagi, Vice President, EXL

[email protected]

This paper highlights how to tackle segmentation structure in the case of survival data, and also elaborates on its implementation in SAS. This is a step prior to the actual model-building exercise: dividing the population into segments which are homogeneous within themselves and heterogeneous amongst themselves, so that separate probability-of-default models can be developed on each of these segments. It relies on comparing the survivor functions across subgroups through the Log-Rank Test (PROC LIFETEST). The examples given in this paper are from the credit card domain, but this technique can be effectively applied to any kind of survival data to generate intuitive segmentation trees.

The importance of segmentation in any kind of modelling exercise is undeniable. Segmentation into different population sets enables a modeller to develop separate models for different subsets of the population. This often outperforms a single standalone model through higher accuracy in predictions, lower bias, or both. The relationship between the predictors and target variables is often different in each subpopulation; a segmented model can capture this effectively, leading to better performance.

Popular segmentation tools like Classification and Regression Trees (CART) were originally developed to analyse cross-sectional data, where several subjects are observed at the same point in time. Applying these techniques is complex in the case of survival data, where subjects are observed over different periods of time or until the event of interest occurs. A distinguishing feature of this kind of data, called "censoring", can make it difficult to handle with conventional statistical methods. In simple terms, if subjects are observed over a five-year duration to see whether an event of interest (for example, default on a credit card payment) occurs, there will be subjects at the end of the study who have not defaulted within the time period. Such cases are referred to as censored. It is not known when, or if, a censored customer will experience the event, only that he or she has not done so by the end of the observation period.
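The censoring convention described above can be sketched in code. The following is a minimal Python illustration (the helper name is hypothetical), mirroring the t_PD/def_month encoding used later in this paper:

```python
def event_record(def_month, term):
    """Return (event_duration, t_pd) for one account.

    def_month: month of default, 0 if no default within the window.
    term: number of months from the snapshot to the end of the study.
    """
    if def_month == 0:
        # No default observed: censored at the end of the window.
        return term, 0
    # Default observed within the window: uncensored event time.
    return def_month, 1
```

For an account observed over a 19-month window that defaults in month 14, this yields an uncensored duration of 14; an account that never defaults is censored at 19.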

Methodology

Interest in using survival analysis for credit scoring is quite recent, and is aimed at assessing the risk of customers who have already been assigned credit cards. The reason is that the objective of credit scoring, also known as credit risk modelling, has recently shifted towards choosing the customers that will provide the highest profit. To do so, loan offers must consider not only whether a customer will default, but also when they will default. This knowledge can be gained through survival models. The use of survival models also avoids the need to define a fixed period within which the default event is measured, a step inherent to logistic regression. They also allow the inclusion of behavioural and economic risk factors over time, like macroeconomic variables. There are several alternative survival models for estimating the hazard/survivor function, the most popular in the credit scoring literature being the Cox Proportional Hazards (PH) model.

However, before proceeding to the actual modelling exercise, it often makes sense to split the data into subgroups and build separate models for each of these groups. This allows for a much greater level of accuracy in predictions and portfolio management. The question then becomes how many models are optimal, and which segmentation structure will provide a client the best business results.

In the case of cross-sectional data, a classification tree technique called CHAID (Chi-Square Automatic Interaction Detection) is very popular for segmentation. This technique recursively partitions a population into separate and distinct groups defined by a set of independent predictor variables, such that the variance of the target variable is minimized within the groups and maximized across the groups. The advantage of CHAID is that the output is highly visual and easy to interpret. The development of the decision, or classification, tree starts with identifying the target variable, or dependent variable, which is considered the root. CHAID analysis splits the target into two or more categories, called the initial or parent nodes; the nodes are then split using statistical algorithms into child nodes.

The methodology outlined in this paper is somewhat inspired by CHAID, but has been adapted to suit the requirements of a time-series data structure. This paper provides a step-by-step guide to choosing the appropriate predictor or predictors to segment the population, and highlights how to use a Log-Rank test to decide the potential candidates for segmentation, so that the underlying survivor functions of the subgroups are statistically different from each other.

A. DATA STRUCTURE

The first step is to develop an appropriate segmentation structure, so that separate survival models can be built for each segment. The segmentation structure should ensure that accounts with similar default patterns are grouped together.

The data available for this example analysis is at the account and monthly level: for each month (referred to as a snapshot from now on), there is a dataset containing all the non-defaulted accounts as of that snapshot, and their characteristics, like months on book (MOB), delinquency status (DELQ), utilization (UTIL), balance (BAL), payments (PMT), full-payer indicator (FULL_PAY_IND) and so on. In addition, there are two variables which denote the default performance of that account using the most recent date for which data is available: the default indicator (t_PD), a binary variable taking a value of 1 if the account has ever defaulted and 0 otherwise, and the default month (def_month), which denotes the month when the account defaulted, taking a value of 0 if the account has never defaulted.

An example snippet of the data structure for a particular snapshot (in this case, May 2014) is illustrated below. A similar dataset is available for the other snapshots as well.

Snapshot | Account ID | MOB | DELQ | UTIL | BAL  | PMT | FULL_PAY_IND | t_PD | def_month
201405   | A1         | 36  | 1    | 80   | 1000 | 20  | 0            | 1    | 14
201405   | A2         | 60  | 0    | 50   | 800  | 100 | 1            | 0    | 0
201405   | A3         | .5  | 2    | 90   | 920  | 40  | 0            | 1    | 3

Table 1: Snapshot of Account Data (for example purposes only)


A1, A2 and A3 have a non-default status as of May 2014. Assuming that data is available until December 2015, each account can be observed over a performance window of 19 months from the snapshot date. A1 and A3 hit a default status in this performance window, as indicated by the value of t_PD. The variable def_month takes a value of 14 for A1 and 3 for A3, implying that they default in July 2015 and August 2014 respectively. On the other hand, A2 does not default over the entire performance window, and hence both t_PD and def_month are 0 for this account. In the context of survival analysis, A2 is a censored case, as the study ends before default occurs. A1 and A3, instead, have uncensored default times.

The next step is to convert the data into a format which can be easily handled by the survival analysis procedures in SAS, be it LIFETEST, LIFEREG or PHREG. For each account in the sample, there must be one variable (named event_duration in this example) that contains either the time at which the event occurred or, for censored cases, the last time at which the account was observed, both measured from the chosen origin. A second variable is required to denote the status of the account at the time recorded in event_duration. Fortunately, this variable is already available in the data: t_PD, which takes a value of 1 for uncensored cases and 0 for censored ones. The variable event_duration can be created by a simple DATA step in SAS as follows:

*CREATING EVENT_DURATION VARIABLE;
%MACRO INCL_DATE(SAMPDATE = , TERM = );
DATA X_&SAMPDATE.;
    SET Y_&SAMPDATE.;
    IF DEF_MONTH = 0 THEN EVENT_DURATION = &TERM.;
    ELSE EVENT_DURATION = DEF_MONTH;
RUN;
%MEND;

The TERM parameter in the above macro is simply the difference between the snapshot date and the last date of the study. Depending on the snapshot date, the macro can be invoked as follows:

%INCL_DATE(SAMPDATE = 201409, TERM = 15);
%INCL_DATE(SAMPDATE = 201203, TERM = 45);

The data for each month is then appended to generate a master dataset on which the segmentation exercise is carried out.

B. IDENTIFICATION OF POTENTIAL CANDIDATES FOR SEGMENTATION

Once the data has been converted to the desired format, the next task is to identify a set of potential candidates that can be used to segment the population, using the account characteristics available. Business intuition comes in handy at this stage, as there may be certain variables that are important based on policies, underwriting strategies and so on. Another approach is to shortlist variables by fitting a survival model on the entire population using all possible predictors. This can be achieved in SAS using the stepwise selection methods within PROC PHREG. Variables that are highly significant in this model, in terms of the p-value of the Chi-Square statistic, are likely to be good candidates for segmenting the population.

The segmentation structure is also governed by data availability. Certain subgroups of the population have more information available than others, and it is a good idea to develop a separate model for such a segment that optimally utilizes all of the available extra information.

Irrespective of whether a variable is shortlisted through business intuition or statistical techniques, it is essential to convert continuous variables to categorical ones in order to compare the survivor functions across different categories of the variable. This can be achieved by grouping the accounts into ten equal bins based on the values of the variable concerned. Adjacent bins can then be clubbed together if their default rates over the next X months are similar (X is typically 12 or 18 months, depending on the length of the performance window of the latest snapshot available).
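This binning-and-clubbing step can be sketched in Python rather than SAS; the function names and the similarity tolerance below are illustrative assumptions, not part of the paper's SAS implementation:

```python
def decile_bins(values, n_bins=10):
    """Assign each value to one of n_bins equal-frequency bins (0-based)."""
    s = sorted(values)
    # Cut points at the 1/n, 2/n, ... quantiles of the observed values.
    cuts = [s[len(s) * k // n_bins] for k in range(1, n_bins)]
    return [sum(v >= c for c in cuts) for v in values]

def club_adjacent(bin_default_rates, tol=0.01):
    """Club adjacent bins whose default rates differ by less than tol.

    Returns a group label per bin; bins sharing a label form one
    category of the final categorical variable.
    """
    groups = [0]
    for prev, cur in zip(bin_default_rates, bin_default_rates[1:]):
        if abs(cur - prev) < tol:
            groups.append(groups[-1])      # similar rate: same category
        else:
            groups.append(groups[-1] + 1)  # rate jumps: new category
    return groups
```

In practice the default rates fed to the clubbing step would be the X-month default rates computed per bin from the snapshot data.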

C. GENERATION OF SEGMENTATION STRUCTURE

I. Theoretical Background of Statistical Tests Used

A Log-Rank test was used as the approach for sub-segmenting the population. It is a non-parametric hypothesis test for comparing the survival distributions of two or more samples. It compares estimates of the hazard functions of the groups at each observed event time, that is, at each unique time at which any individual from any group experiences the event, the null hypothesis being that the hazard functions of all groups are equal over the whole study time.

The idea behind segmentation is to divide the population in such a way that the survival functions are statistically different across the sub-categories. The Log-Rank test can be represented mathematically as follows:

H0: h1(t) = h2(t) = ... = hk(t) for all t ≤ τ
H1: at least one hj(t) is different for some t ≤ τ

Here τ is the largest time at which each group has at least one individual at risk, and hj(t) represents the hazard function at time t for group j. The Log-Rank test compares the hazards from each group at each event time between 0 and τ.

If all hazards are equal across groups, then the proportion of each group experiencing the event at any given time ti will be equal to the proportion of the overall population experiencing the event at that same time:

dij / Yij = di / Yi   for all event times ti, i = 1, 2, ..., D, and groups j = 1, 2, ..., k

where:
dij = number of events experienced by group j at event time ti
Yij = number of individuals at risk in group j just prior to time ti
di  = total number of events experienced by the entire study population at event time ti
Yi  = total number of individuals at risk in the entire study population just prior to time ti

The Log-Rank test begins by calculating a statistic representing the sum of weighted differences between dij/Yij and di/Yi at each event time ti for each group j = 1 through k. For the Log-Rank test, the weights applied to these differences are all equal to 1, so each event time has an equal weighting in the value of the statistic. The statistics calculated for the k groups are linearly dependent, and therefore only (k-1) of them may be used to calculate a test statistic. To do so, (k-1) of the statistics are formed into a vector called Z, and their variances and covariances are placed into a variance-covariance matrix called Σ. The test statistic is then calculated as:

χ² = Z Σ⁻¹ Zᵀ

This has a chi-squared distribution with (k-1) degrees of freedom when the null hypothesis is true.

The Log-Rank test allows for two types of imperfect survival data: left-truncated data and right-censored data. If no censored observations are present in the data, the Wilcoxon rank-sum test is more appropriate.
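To make the mechanics concrete, the two-group version of the statistic above can be sketched in Python. This is a plain illustration of the formula, not the SAS implementation used in this paper:

```python
def logrank_chisq(times1, events1, times2, events2):
    """Two-group Log-Rank chi-square statistic with unit weights.

    times*: observed durations; events*: 1 = event (default), 0 = censored.
    """
    # Distinct event times observed in either group.
    event_times = sorted({t for t, e in zip(times1, events1) if e} |
                         {t for t, e in zip(times2, events2) if e})
    o_minus_e = 0.0  # sum over event times of (d1i - Y1i * di / Yi)
    var = 0.0        # variance of that sum under the null hypothesis
    for t in event_times:
        y1 = sum(1 for u in times1 if u >= t)  # at risk in group 1
        y2 = sum(1 for u in times2 if u >= t)  # at risk in group 2
        d1 = sum(1 for u, e in zip(times1, events1) if u == t and e)
        d2 = sum(1 for u, e in zip(times2, events2) if u == t and e)
        y, d = y1 + y2, d1 + d2
        if y < 2:
            continue  # variance term is undefined with fewer than 2 at risk
        o_minus_e += d1 - y1 * d / y
        var += y1 * y2 * d * (y - d) / (y * y * (y - 1))
    return o_minus_e ** 2 / var if var > 0 else 0.0
```

Under the null hypothesis the statistic is approximately chi-squared with one degree of freedom, so a value above 3.84 rejects equality of the two hazard functions at the 5% level.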

II. Application of the Log-Rank Test

The LIFETEST procedure in SAS can be used to generate the Log-Rank test for comparing survival patterns across different groups.

Continuing with the credit card example, once a few categorical variables have been shortlisted, through either business intuition or statistical techniques, as potential candidates for segmenting the population, a Log-Rank test is applied to each of them to test whether the survivor/hazard functions differ across the categories of that variable. The variable with the highest Chi-Square statistic is used to create the first split of the population.

Before proceeding with the segmentation methodology, it is advisable to summarize the data across the shortlisted variables for ease of computation. Assuming that there are five shortlisted variables, MOB, DELQ, UTIL, BAL and FULL_PAY_IND, this can be achieved in SAS through a simple SQL procedure. Continuous variables like utilization and balance first need to be converted to categorical variables, UTIL_FMT and BAL_FMT.

*SUMMARIZING THE DATA;
PROC SQL;
    CREATE TABLE SUMMARY1 AS
    SELECT EVENT_DURATION, UTIL_FMT, DELQ, MOB, BAL_FMT, FULL_PAY_IND, T_PD,
        COUNT(*) AS NUMBER,
        SUM(T_PD = 1) AS DEFAULTS
    FROM STACKED_DATA
    GROUP BY 1,2,3,4,5,6,7;
QUIT;

Next, the Log-Rank test is computed iteratively for each of the selected variables by specifying it in the STRATA statement of PROC LIFETEST. A separate survivor function is estimated for each stratum, and tests of the homogeneity of the strata are performed. The SAS code is as follows:

*LIFETEST FOR EACH VARIABLE;
ODS OUTPUT SURVDIFF = SD HOMTESTS = HT;
PROC LIFETEST DATA = SUMMARY1 METHOD = LT
    INTERVALS = 0 TO 108 BY 2;
    TIME EVENT_DURATION*T_PD(0);
    STRATA VAR_NAME / ADJUST = TUKEY;
    FREQ NUMBER;
RUN;

It is essential to configure some options of the LIFETEST procedure before executing it:

• In the TIME statement, the survival time variable, EVENT_DURATION, is crossed with the censoring variable, T_PD, with the value 0 indicating censoring. Hence values of EVENT_DURATION are considered censored if the corresponding values of T_PD are 0; otherwise, they are considered event times.

• In the STRATA statement, the variable name is specified, indicating that the data are to be divided into strata based on the values of that particular variable. In this example, a separate PROC LIFETEST is run for each of the five shortlisted variables: UTIL_FMT, DELQ, MOB, BAL_FMT and FULL_PAY_IND.

• The METHOD option specifies the method used to compute the survival function estimates. LT refers to the life-table (actuarial) estimates. This method is preferred when the number of observations is large².

• The INTERVALS option specifies interval endpoints for the life-table estimates. Each interval contains its lower endpoint but not its upper endpoint. Hence the specification in the above code produces the set of intervals {[0,2), [2,4), ..., [106,108), [108, ∞)}.

• The FREQ statement is useful for producing life tables when the data are already in the form of a summary dataset. It identifies a variable (NUMBER in this case) that contains the frequency of occurrence of each observation. PROC LIFETEST treats each observation as if it appeared n times, where n is the value of the FREQ variable for that observation.
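The life-table estimate that METHOD=LT produces can be sketched as follows. This is an illustrative Python rendering of the standard actuarial formula (censored cases counted as at risk for half their interval), not PROC LIFETEST itself:

```python
def life_table_survival(durations, events, width=2, max_t=108):
    """Actuarial survivor estimates over intervals [0,width), [width,2*width), ...

    durations: event_duration values; events: 1 = default, 0 = censored.
    Returns (lower, upper, S) per interval, where S estimates the
    probability of surviving beyond the interval's upper endpoint.
    """
    surv, s = [], 1.0
    for lo in range(0, max_t, width):
        hi = lo + width
        at_risk = sum(1 for t in durations if t >= lo)
        if at_risk == 0:
            break
        d = sum(1 for t, e in zip(durations, events) if lo <= t < hi and e == 1)
        c = sum(1 for t, e in zip(durations, events) if lo <= t < hi and e == 0)
        effective = at_risk - c / 2.0  # censored: at risk for half the interval
        if effective > 0:
            s *= 1.0 - d / effective   # conditional survival of this interval
        surv.append((lo, hi, s))
    return surv
```

With large account volumes this grouped estimator is cheaper and smoother than the Kaplan-Meier product over every distinct event time, which is why the life-table method is preferred here.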

Once the LIFETEST procedure has been run for each of the shortlisted variables, with the homogeneity test results stored in the dataset named HT, the outputs are appended to create a final table containing the Chi-Square test results for each variable.

DATA HT1;
    SET HT;
    WHERE TEST = "Log-Rank";
    LENGTH VAR $30.;
    VAR = "VAR_SEG.";
RUN;

The Chi-Square test results for each variable are then appended to create a table like the one below, sorted in descending order of Chi-Square value. The top variable, DELQ, is used as the first segmentation split.

Level 1
Test     | ChiSq   | DF | ProbChiSq | Var
Log-Rank | 734,420 | 3  | <.0001    | DELQ
Log-Rank | 457,622 | 5  | <.0001    | UTIL_FMT
Log-Rank | 340,373 | 5  | <.0001    | BAL_FMT
Log-Rank | 295,331 | 9  | <.0001    | MOB
Log-Rank | 294,356 | 1  | <.0001    | FULL_PAY_IND

Table 2: Chi-Square Test Results (for example purposes only)


DELQ has four categories: cycle 0, 1, 2 and 3. Since cycle 0 comprises around 95% of the non-default population, the data is first divided into two categories: InOrder (cycle 0) and Delinquent (cycle 1+). Since the Delinquent population is relatively small, it is not split further. The InOrder population is then considered, and the segmentation exercise is carried out on this subset using the remaining variables.

The InOrder population is further split into Full Payer and Revolver populations according to the top splitter in this subset, FULL_PAY_IND. Each of these subsets can be split further using the remaining variables, following the same steps as before. It should be kept in mind that every final node post-segmentation must have sufficient volume for the model to be robust. In this example, the final segmentation structure is obtained by further splitting each of the Full Payer and Revolver populations by MOB.

Level 2: DELQ = 0 (InOrder)
Test     | ChiSq   | DF | ProbChiSq | Var
Log-Rank | 275,918 | 1  | <.0001    | FULL_PAY_IND
Log-Rank | 230,824 | 5  | <.0001    | UTIL_FMT
Log-Rank | 175,167 | 9  | <.0001    | MOB
Log-Rank | 169,771 | 5  | <.0001    | BAL_FMT

Table 3: InOrder Population Segmentation (for example purposes only)

D. ANALYSIS OF SEGMENTATION PERFORMANCE

Figure 1 shows the final segmentation tree; the final segments for this population are the Delinquent node and the four MOB-split nodes. Once the final segmentation structure has been decided, it makes sense to check the survival distributions across the five segments. The following program can be used for that, assuming that the variable pd_seg_ind_sm_2 captures the new segmentation structure:

ods output SurvDiff = SD HomTests = HT;
proc lifetest data = <DATA> method = lt
    intervals = 0 to 108 by 2 plots = (s,h);
    time event_duration * t_pd(0);
    strata pd_seg_ind_sm_2 / adjust = tukey;
run;

The ADJUST option (new in SAS 9.2) tells PROC LIFETEST to produce p-values for all ten pairwise comparisons of the five strata, and to report p-values that have been adjusted for multiple comparisons using Tukey's method. Results are shown in Table 4.

Figure 1: Final Segmentation Tree (for example purposes only)

• Cards (stacked sample): volume 100%; 12M default rate 6.43%; 60M default rate 27.22%
    • DELQ > 0 (Delinquent): volume 4.0%; 12M default rate 62.70%; 60M default rate 93.92%
    • DELQ = 0 (InOrder): volume 96.0%; 12M default rate 3.62%; 60M default rate 21.76%
        • FULL_PAY_IND = 1 (Full Payer): volume 52.44%; 12M default rate 0.75%; 60M default rate 3.69%
            • MOB < X: volume 6.11%; 12M default rate 3.29%; 60M default rate 18.71%
            • MOB >= X: volume 46.34%; 12M default rate 0.33%; 60M default rate 3.03%
        • FULL_PAY_IND = 0 (Revolver): volume 43.55%; 12M default rate 9.28%; 60M default rate 42.32%
            • MOB < X: volume 5.41%; 12M default rate 24.96%; 60M default rate 87.68%
            • MOB >= X: volume 38.15%; 12M default rate 6.47%; 60M default rate 36.43%

Test      | ChiSq   | DF | ProbChiSq
Log-Rank  | 559,397 | 4  | <.0001
Wilcoxon  | 643,746 | 4  | <.0001
-2Log(LR) | 455,316 | 4  | <.0001

Table 4: Comparison of Results (for example purposes only)

Table 4 shows the overall chi-square tests of the null hypothesis that the survivor functions are identical across the five segments. All three tests are highly significant, unanimously rejecting the null hypothesis and providing evidence that at least one of the five stratum hazard plots is significantly different from the others for some value of t ≤ τ.

The second output, in Table 5, shows the Log-Rank tests comparing each possible pair of strata. All of the tests are significant, both using the raw p-values and after the Tukey adjustment, suggesting that each segment is significantly different from the others. This rules out the possibility of collapsing any of the segments.

Adjustment for Multiple Comparisons for the Log-Rank Test

Strata Comparison (pd_seg_ind_sm_2) | Chi-Square | Raw p-value | Tukey-Kramer p-value
1 vs 2 | 160,134 | <.0001 | <.0001
1 vs 3 |  82,894 | <.0001 | <.0001
1 vs 4 |  29,425 | <.0001 | <.0001
1 vs 5 | 124,817 | <.0001 | <.0001
2 vs 3 | 335,678 | <.0001 | <.0001
2 vs 4 | 138,374 | <.0001 | <.0001
2 vs 5 | 388,489 | <.0001 | <.0001
3 vs 4 |      95 | <.0001 | <.0001
3 vs 5 |   3,176 | <.0001 | <.0001
4 vs 5 |     393 | <.0001 | <.0001

Table 5: Log-Rank Test Results (for example purposes only)


The graph in Figure 2 shows some evidence of difference in the survival functions across the five strata, supporting the results already obtained from the Chi-Square tests. Finally, the default rates of the five segments at the two time horizons, 12 months and 60 months, are considerably different across segments, further indicating that the segments differ in terms of default behaviour.

Figure 2: Survival Graphs for Chosen Segments (for example purposes only). The plot shows the estimated survival distribution functions over study time 0 to 100 for the five strata, pd_seg_ind_sm_2 = 1 through 5.

Conclusion and Limitations

This approach has its own set of limitations.

First, the test statistics for the Log-Rank test are based on large-sample approximations and give good results only when the sample size is large. The number of comparison segments should therefore not be allowed to grow too large, to avoid having segments with too few subjects; each group should contain at least 30 subjects, preferably more.

Second, the Log-Rank test is most powerful for detecting differences of the form S1(t) = [S2(t)]^γ, where γ is some positive number other than 1.0. This equation defines a proportional hazards model, and the Log-Rank test is not particularly good at detecting differences when survival curves cross.

Segmentation is a unique aspect of modelling in that it blends art and science in almost equal measure. A segmentation structure based entirely on statistical measures does not always add enough value; it is effective only when the numbers are coupled with business requirements and common sense, as demonstrated in the example discussed above. Dividing the population into five different groups and building separate survival models on each of them yielded better results than building a single standalone model on the entire population, as these groups are inherently different in terms of their survival patterns.

Survival analysis can be applied to build models for the time of default on credit cards. This knowledge helps the issuer pre-empt attrition and devise customer engagement strategies. This paper proposed creating an intuitive segmentation structure on a large dataset of credit card accounts, before the onset of the actual modelling exercise, by using the Log-Rank test to compare the hazard functions across the different subgroups. The program used in this paper serves as a fast, efficient way to churn through a large quantity of data and provide the client the information needed for a final decision on modelling splits.

References
1. Allison, P. D. (2010). Survival Analysis Using SAS®: A Practical Guide, Second Edition. Cary, NC: SAS Institute Inc.
2. Bellotti, T., & Crook, J. (2007, May 7). Credit Scoring with Macroeconomic Variables Using Survival Analysis.
3. Man, R. (2014, May 9). Survival Analysis in Credit Scoring: A Framework for PD Estimation.
4. Pazdera, J., Rychnovsky, M., & Zahradnik, P. (2009, Feb 1). Survival Analysis in Credit Scoring.
5. Sayles, H., & Soulakova, J. (n.d.). Log-Rank Test for More than Two Groups.
6. Weldon, G., & Zidun, H. (n.d.). Segmentation of Data Prior to Modeling. Atlanta: Merkle, Inc.

End Notes
1. Refer to Bellotti & Crook (2007); Pazdera, Rychnovsky, & Zahradnik (2009); Man (2014).
2. The Kaplan-Meier method of estimating survivor functions is more suitable when the sample size is small and event times are measured with precision. It is in fact the default method in PROC LIFETEST.


GLOBAL HEADQUARTERS
280 Park Avenue, 38th Floor, New York, NY 10017
T: +1.212.277.7100 • F: +1.212.277.7111

United States • United Kingdom • Czech Republic • Romania • Bulgaria • India • Philippines • Colombia • South Africa

Email us: [email protected] On the web: EXLservice.com

EXL (NASDAQ: EXLS) is a leading operations management and analytics company that designs and enables agile, customer-centric operating models to help clients improve their revenue growth and profitability. Our delivery model provides market-leading business outcomes using EXL's proprietary Business EXLerator Framework®, cutting-edge analytics, digital transformation and domain expertise. At EXL, we look deeper to help companies improve global operations, enhance data-driven insights, increase customer satisfaction, and manage risk and compliance. EXL serves the insurance, healthcare, banking and financial services, utilities, travel, transportation and logistics industries. Headquartered in New York, New York, EXL has more than 27,000 professionals in locations throughout the United States, Europe, Asia (primarily India and Philippines), South America, Australia and South Africa.

© 2017 ExlService Holdings, Inc. All Rights Reserved.

For more information, see www.exlservice.com/legal-disclaimer