Credit Risk Estimation



Master's thesis presented to obtain the Master of Science degree:

    On Credit Scoring Estimation

    by

Karel Komorád

    (180174)

    Submitted to:

Prof. Dr. Wolfgang Härdle

    Institute for Statistics and Econometrics

    Humboldt University

    Spandauer Str. 1

D-10178 Berlin

    Berlin, December 18, 2002


Acknowledgment

At this place I would like to express my gratitude to the persons who contributed to the rise of this thesis and who influenced my attitude to statistics and to study in general. In the first place I would like to thank my advisor, Professor Wolfgang Härdle, who inspired me to think on other than the purely mathematical level and gave me the key to connect the real world with the world of theoretical books. Further, my thanks go to Mgr. Zdeněk Hlávka, Ph.D. for his readiness to help at any time. Without his support this thesis would not exist at all.

The two years I spent at the Institute for Statistics and Econometrics at Humboldt University in Berlin gave me more experience than any years before, and I am grateful to all colleagues of this institute, especially to Prof. Dr. Bernd Rönz, Axel Werwatz, Ph.D. and Ing. Pavel Čížek, Ph.D., who taught me a lot about the mysteries of programming. And last, but not least, my gratitude goes to Julka Smoljaninova, who gave me the strength to finish the work once commenced and never stopped trusting me.

Declaration of Authorship

I hereby confirm that I have authored this master's thesis independently and without the use of resources other than those indicated. All passages which are taken literally or in substance from publications or other resources are marked as such.

Berlin, December 18, 2002

Karel Komorád


    Abstract

Credit scoring methods have become a standard tool of banks and other financial institutions, direct marketing retailers and advertising companies to estimate whether an applicant for credit/goods will pay back his liabilities. In this thesis we give a short overview of credit scoring and its methods. We investigate the usage of some of these methods and their performance on a data set from a French bank. Our results indicate that the examined methods, namely logistic regression, multi-layer perceptron (MLP) and radial basis function (RBF) neural networks, give very similar results; however, the traditional logit model seems to be the best one. We also describe the RBF architecture and a simple RBF program we implemented in the statistical computing environment XploRe.


    Contents

1 Introduction
2 Credit Scoring in Overview
   2.1 History
   2.2 Problems
   2.3 Credit Scoring Today
3 Data Set Description
4 Credit Scoring Methods
   4.1 Logistic Regression
   4.2 Multi-layer Perceptron
   4.3 Radial Basis Function Neural Network
5 Other Methods
   5.1 Probit Regression
   5.2 Semiparametric Regression
   5.3 Classification Trees
   5.4 Linear Discriminant Analysis
   5.5 Panel Data Analysis
   5.6 Hazard Regression
   5.7 Genetic Algorithms
   5.8 Linear Programming
   5.9 Treed Logits
6 Summary and Conclusion
7 Extensions
Appendix
A Radial Basis Function Neural Networks
   A.1 The Model
   A.2 RBF Neural Networks in XploRe
   A.3 An Example
   A.4 Detailed Description of the Algorithm
B Suggestions for Improvements in XploRe
   B.1 grdotd
   B.2 nnrpredict
   B.3 nnrnet
C CD contents


Notation

X               random variable
X               random vector
X               data matrix
x_i = (x_{i,1}, ..., x_{i,p})   i-th observation of a random vector X ∈ R^p
⊤               transposition
french.dat      file name
grdotd          quantlet name
rbfnet          parameter name

Abbreviations

ANN        artificial neural network
Basel II   the new Basel Capital Accord
CART       classification and regression trees
ECOA       Equal Credit Opportunity Act
GLM        generalized linear model
GPLM       generalized partially linear model
LDA        linear discriminant analysis
MSE        mean squared error
RBF        radial basis function
VaR        value at risk


    1 Introduction

One of the basic tasks which any financial institution must deal with is to minimize its credit risk. Scoring methods traditionally estimate the creditworthiness of a credit card applicant. They predict the probability that an applicant or existing borrower will default or become delinquent. Credit scoring studies the creditworthiness of:

any of the many forms of commerce under which an individual obtains money or goods or services on condition of a promise to repay the money or to pay for the goods or services, along with a fee (the interest), at some specific future date or dates. (Lewis, 1994)

The statistical methods we will present are based on a particular amount of historical data. Computers make it possible to treat large data sets and to come to a decision more quickly, more cheaply and more realistically than the primarily judgmental assessments historically made by credit experts or loan officers. Used correctly, these methods give even better predictions.

Bank credit card issuers in the U.S. lose about $1 billion each year in fraud, mostly from stolen cards. They also lose another $3 billion in fraudulent bankruptcy filings. Merchants, however, absorb more than $10 billion each year in credit card related fraud.¹

The founders of credit scoring, Bill Fair and Earl Isaac, designed a complete billing system for one of the first credit cards, Carte Blanche, in 1957 and built the first credit scoring system for American Investments one year later. Since then credit scoring has become a broad system for predicting consumer behaviour and has spread into many other areas, e.g. consumer lending (especially on credit cards), mortgage lending, small and medium business loan lending, direct marketing or advertising. Back et al. (1996) used the same procedures to predict the failure of companies.

Methods used for credit scoring include various statistical procedures, like the most commonly used logistic regression and its alternative probit regression, (linear) discriminant analysis, the fashionable artificial neural networks or genetic algorithms, and further linear programming, nonparametric classification trees or semiparametric regression.

The aim of this thesis is to give an overview of credit scoring and to compare various statistical methods by using them on a data set from a French bank. We want to compare their ability to distinguish between two subgroups of a highly complicated sample, and to see the performance and computational demands of these procedures.

¹ http://www.cardweb.com/cardlearn/stat.html


[Figure 1.1: Number of credit cards and number of transactions per credit card (both per 1000 inhabitants) in the European Union, 1995–1999 (http://www.eurofinas.org).]

Credit scoring gains new importance when thinking about the New Basel Capital Accord. The so-called Basel II replaces the current 1988 Capital Accord and focuses on techniques that allow banks and supervisors to evaluate properly the various risks that banks face. Thus credit scoring may contribute to the internal assessment process of an institution, which is desirable.

This master's thesis is organized as follows. First, we give a short overview of credit scoring, recall some of its history, discuss the accompanying problems and mention the direction credit scoring may take in the future. The data set for the analysis is described in the third section. Section 4 explains the statistical methods used (in particular logistic regression, semiparametric regression, the multi-layer perceptron neural network and the radial basis function neural network) and applies them to the data set. Other methods used for credit scoring are presented in Section 5. A summary of our analyses together with the conclusion is given in Section 6. Section 7 remarks on some possible extensions of this topic. The method of radial basis function neural networks, which we programmed in the C language for the purpose of this thesis, is described thoroughly in Appendix A. Appendix B gives some suggestions for improvements in XploRe that we discovered during the work on this thesis. Finally, Appendix C lists the files stored on the compact disk enclosed with this thesis.


    2 Credit Scoring in Overview

Risk forecasting is topic number one in modern finance. Apart from portfolio management, the pricing of options (and other financial instruments) or bond pricing, credit scoring represents another important set of procedures for estimating and reducing credit risk. It involves techniques that help financial organizations decide whether or not to grant credit to applicants. Basically, credit scoring tries to distinguish two different subgroups in the data sample. The aim is to choose a method which is computable in real time and predicts sufficiently precisely.

There is a vast number of articles treating credit scoring in recent issues of trade publications in the credit and banking area. An exhaustive overview of the literature on credit scoring can be found in Thomas (2000), Mester (1997) or in Kaiser and Szczesny (2000a). Professor D. J. Hand, head of the statistics section in the Department of Mathematics at Imperial College, has also published many books on this topic.

    2.1 History

The statistical techniques used for credit scoring are based on the idea of discrimination between several groups in a data sample. These procedures originate in the thirties and forties of the previous century (Fisher, 1936; Durand, 1941). At that time some of the finance houses and mail order firms were having difficulties with their credit management. Decisions whether to give loans or send merchandise to the applicants were made judgmentally by credit analysts. The decision procedure was nonuniform, subjective and opaque; it depended on the rules of each financial house and on the personal and empirical knowledge of each single clerk. With the rising number of people applying for a credit card in the late 1960s it was impossible to rely on credit analysts only; an automated system was necessary. The first consultancy was formed in San Francisco by Bill Fair and Earl Isaac in the late 1950s. Their system spread fast, as the financial institutions found out that using credit scoring was cheaper, faster, more objective, and mainly much more predictive than any judgmental scheme. It is estimated that default rates dropped by 50% (Thomas, 2000) after the implementation of credit scoring. Another advantage of the usage of credit scoring is that it allows lenders to underwrite and monitor loans without actually meeting the borrower.

The success of credit scoring in credit card issuing was a significant sign for the banks to apply scoring methods to other products like personal loans, mortgage loans, small business loans etc. However, commercial lending is more heterogeneous, its documentation is not standardized within or across institutions, and thus the results


are not so clear. The growth of direct marketing in the 1990s has led to the use of scorecards to improve the response rate to advertising campaigns.

    2.2 Problems

Most of the problems one must face when using credit scoring are of a rather technical than theoretical nature. First of all, one should think of the data necessary to implement the scoring. It should include as many relevant factors as possible; there is a trade-off between expensive data and low accuracy due to insufficient information. Banks collect the data from their internal sources (the applicant's previous credit history), from external sources (questionnaires, interviews with the applicants) and from third parties. From the applicant's background the following information is usually collected: age, gender, marital status, nationality, education, number of children, job, income, lease rental charges, etc. The following questions from the applicant's credit history are especially interesting: Does the applicant already have a credit? How much did he borrow? Has the applicant ever delayed his payments? Is he asking for another credit as well? Under third parties we understand special houses specialized in collecting credit information about potential clients.

[Figure 2.1: Number of credit cards per 1000 inhabitants in the years 1994 (thin line) and 1999 (thick line) for Belgium, France, Germany, Italy, Sweden and the U.K. (http://www.eurofinas.org).]

The variables entering the credit

scoring procedures should be chosen carefully, as the amount of the data may be vast indeed and thus computationally problematic. For instance, the German central bank (Deutsche Bundesbank) lists about 325,000 units. Since most of the attributes in credit scoring are categorical, imposing dummy variables gives a matrix with several million elements (Enache (1998) mentions 180 variables in his analysis).

Müller et al. (2002) treat a very important feature of credit scoring data: there is usually no information on the performance of rejected customers. This causes a bias in the sample. Hand and Henley (1993) concluded that it cannot be overcome unless one can assume a particular relationship between the distributions of the good and bad clients which holds for both the accepted and the rejected applicants. This problem may be solved by some organizations if they accept everybody for a short time; afterward they can build a scorecard based on the unbiased data sample. However, this is possible only for retailers, mail order firms or advertising companies, not for banks and financial institutions.

The American banks have another problem. The law does not allow the use of information about race, nationality, religion, gender or marital status to build a scorecard. This is stated in the Equal Credit Opportunity Act (ECOA) and in the Consumer Credit Protection Act. Moreover, the attribute age plays a special role: it can be used unless people older than 62 years are discriminated against. Legal fundamentals for the consumer credit business in Germany are given in Schnurr (1997).

    2.3 Credit Scoring Today

As mentioned above, credit scoring methods are widely used to estimate and to minimize credit risk. Mail order companies, advertising companies, banks and other financial institutions use these methods to score their clients, applicants and potential customers. There is an effort to make all procedures used to estimate and decrease credit risk more precise. Both the U.S. Federal Home Loan Mortgage Corporation and the U.S. Federal National Mortgage Association have encouraged mortgage lenders to use credit scoring, which should provide consistency across underwriters.

International banking supervision also appeals for more precise internal assessments by banks. The Basel Committee on Banking Supervision is an international organization which formulates broad supervisory standards and guidelines for banks. It encourages convergence toward common approaches and common standards. The Committee's members come from Belgium, Canada, France, Germany, Italy, Japan, Luxembourg,


the Netherlands, Spain, Sweden, Switzerland, the United Kingdom and the United States. In 1988, the Committee decided to introduce a capital measurement system (the Basel Capital Accord). This framework has been progressively introduced not only in member countries but also in other countries with active international banks. In June 1999, the Committee issued a proposal for a New Capital Adequacy Framework to replace the 1988 Accord (http://www.bis.org). The proposed capital framework consists of three pillars:

1. minimum capital requirements,

2. supervisory review of the internal assessment process and capital adequacy,

3. effective use of disclosure to strengthen market discipline.

The New Basel Capital Accord is to be implemented by 2004. Consequently, Basel II (the New Capital Accord) places more emphasis on banks' own internal methodologies. Therefore credit scoring and its methods may become a subject of banks' extensive interest, as they will try to make their internal assessments as precise and correct as possible.


    3 Data Set Description

In this section we briefly describe the data set used in our analysis and show some of its basic characteristics to give a better insight into the sample. The data set analyzed in this thesis stems from a French bank. However, the source is confidential and therefore the names of all variables have been removed, categorical values have been changed to meaningless symbols and the metric variables have been standardized to mean 0 and variance 1. The same data set was used in Müller and Rönz (1999) and Härdle et al. (2001). The original file, french.dat, contains 8830 observations with one response variable, 8 metric and 15 categorical predictor variables.

[Figure 3.1: Density dot plots for the original metric variables X1, ..., X8.]

    "=================================================="

    " Variable X5"

    "=================================================="

    " | Frequency Percent Cumulative "

    "--------------------------------------------------"

    " -0.537 | 3815 0.617 0.617"

    " 0.203 | 934 0.151 0.768"

    " 0.943 | 907 0.147 0.915"

    " 1.682 | 431 0.070 0.985"

    " 2.422 | 93 0.015 1.000"

    "--------------------------------------------------"

    " | 6180 1.000"

    "=================================================="

    Table 3.1: Frequency Table of the metric variable X5.

We have removed the observations with the response classified as 9, a class originally used for testing. Since we do not know their real classification, we cannot use this class for the purpose of our analysis. The remaining data contain 6672 observations. In addition we have changed the classes of the independent categorical variables from 1, ..., K to 0, ..., K − 1 and ordered them in accordance with the number of categories. Let Y denote the response variable, X1, ..., X8 (in sequence) the metric variables and X9, ..., X23 the categorical variables.

The density dot plots² in Figure 3.1 show estimated densities of the metric variables and indicate some suspicious outlying values. The problem is that the usual outlier tests assume a normal distribution and that testing for normality is itself affected by these outliers (Rönz, 1998). Note that the last observations (number 6662 and higher) take the lower extremes in almost all metric variables. Since the metric variables were already standardized, we decided to restrict them to the range [−3, 3] in order to get rid of the outliers. Thereby we get a new subsample containing only 6180 cases. Density dot plots of the metric variables for this data set are shown in Figure 3.2, and we can see that the shape of the variables' densities improved. Table 3.1 shows the frequencies of the variable X5, which is obviously discrete.

For the purpose of our analysis we have randomly divided the data sample into two subsamples. About two thirds of the whole data set (4135 observations) build the first subsample, the TRAIN sample; it will be used for model estimation. The second subsample, TEST, with 2045 observations, will be used in an overall procedure to compare the particular methods and to check the predictive power of the models used; this will be based on misclassification rates. The subsamples are stored in the files data-train.dat and data-test.dat.

² For the density dot plots in this section we used the quantlet myGrdotd.xpl, which is a modified version of the faulty original grdotd. For more information see Appendix B.1.

[Figure 3.2: Density dot plots for all metric variables as used in the analysis.]
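As a rough illustration of this preprocessing step, a minimal Python sketch (our own, with assumed array layout and column indices; the thesis itself works in XploRe) could perform the trimming to [−3, 3] and the random TRAIN/TEST split as follows:

```python
import numpy as np

def trim_and_split(data, metric_cols, train_frac=2/3, seed=42):
    """Drop rows where any standardized metric variable falls outside
    [-3, 3], then split the remainder randomly into TRAIN and TEST."""
    keep = np.all(np.abs(data[:, metric_cols]) <= 3, axis=1)
    data = data[keep]
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(data))
    n_train = int(round(train_frac * len(data)))
    return data[idx[:n_train]], data[idx[n_train:]]

# For the French data: 6672 cases shrink to 6180 after the trimming; the
# split then yields roughly the 4135 TRAIN and 2045 TEST observations.
```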

At this stage it is worth recalling that we do not know anything about the economic interpretation of the predictors. We also do not know what the response variable means (is it a credit card, loan or mortgage application?). And even the coding is primarily unclear: does 1 stand for "the client is creditworthy" or for "the client has some problems with repaying the debt"? The frequencies of the response variable, summarized in Table 3.2, tell us more. Since only about 6% of the data set is classified as 1, we will call this class "clients that have some problems with repaying their liability".


        TRAIN sample    TEST sample     total
0       3888 (94.0%)    1920 (93.9%)    5808 (94.0%)
1        247 (6.0%)      125 (6.1%)      372 (6.0%)
total   4135            2045            6180

Table 3.2: Frequencies of the two response outcomes.

Our sample is in this sense unbiased, because the percentage of faulty loans in consumer credit commonly varies between 1% and 7% (Arminger et al., 1997). Note that many credit scoring analyses use data sets with an overrepresented rate of bad loans (West, 2000). However, Desai et al. (1996) use 3 data sets from 3 credit unions in the Southeastern US which consist of 81.58%, 74.02% and 78.85% good loans, respectively. The data set of Fahrmeir et al. (1984) contains 70% good credits. Enache (1998) analyzed 38,000 applications, 16.8% of which were rejected. Cardweb, the U.S. payment card information network (http://www.cardweb.com), mentions that about 78% of U.S. households are considered creditworthy.

Let us now examine the variables in detail. All descriptive statistics and graphics used in this section are computed by the quantlet fr descr.data.xpl. We start with the metric variables. Tables 3.3 and 3.4 show their basic characteristics for the TRAIN and TEST data sets: minimum, first quartile, median, third quartile, maximum, mean and standard error. These statistics express in numbers our first finding from the density dot plots, namely that the data are extremely right-skewed. Figures 3.3 and 3.4 show box plots for the TRAIN and TEST data sets. The left box plot in each display stands for the good clients and the right one for the bad clients. From the box plots we cannot see any substantial differences between these two groups.

Next, we examine the categorical variables. Frequencies for the dichotomous categorical variables (with two outcomes only) are shown in Table 3.5. Figures 3.5–3.13 show bar charts for the variables X15–X23. The upper displays correspond to the TRAIN data set, the lower displays to the TEST data set. The displays on the left reveal the outcomes when the response is 0, and the displays on the right side when Y = 1. Remarkable is the change in variable X23: while the third category is the most plentiful among the creditworthy clients and the ninth category has only about half as many observations, in the case of the non-creditworthy clients the ninth category increases its relative number and the third category drops to about two thirds of the ninth category. Characteristics for the variables X15–X23 are summarized in Tables 3.6–3.14.


[Figure 3.3: Box plots of metric variables in the TRAIN subsample.]

        Min.     25% Q.   Median   75% Q.   Max.    Mean    Std.Err.
X1     -1.519   -0.766   -0.349    0.403    2.994   0.119   0.892
X2     -1.188   -0.810   -0.307    0.323    2.968   0.122   0.847
X3     -0.830   -0.695   -0.426    0.113    2.940   0.083   0.851
X4     -0.825   -0.694   -0.432    0.223    2.973   0.113   0.816
X5     -0.537   -0.537   -0.537    0.203    2.422   0.019   0.772
X6     -0.626   -0.363   -0.167    0.117    2.962   0.069   0.492
X7     -0.302   -0.302   -0.302    0.138    2.924   0.030   0.408
X8     -0.346   -0.211   -0.211   -0.211    2.835   0.106   0.340

Table 3.3: Minimum, 1st quartile, median, 3rd quartile, maximum, mean and standard error of the metric variables from the TRAIN subsample.


[Figure 3.4: Box plots of metric variables in the TEST subsample.]

        Min.     25% Q.   Median   75% Q.   Max.    Mean    Std.Err.
X1     -1.519   -0.766   -0.182    0.487    2.994   0.062   0.892
X2     -1.188   -0.810   -0.307    0.449    2.968   0.072   0.880
X3     -0.830   -0.695   -0.291    0.247    2.940   0.037   0.889
X4     -0.825   -0.694   -0.432    0.223    2.973   0.086   0.832
X5     -0.537   -0.537   -0.537    0.203    2.422   0.012   0.780
X6     -0.626   -0.375   -0.180    0.109    2.858   0.083   0.473
X7     -0.302   -0.302   -0.302    0.138    2.631   0.029   0.404
X8     -0.211   -0.211   -0.211   -0.211    2.631   0.099   0.367

Table 3.4: Minimum, 1st quartile, median, 3rd quartile, maximum, mean and standard error of the metric variables from the TEST subsample.


           TRAIN sample    TEST sample     total
X9:  0     1145 (27.7%)     620 (30.3%)    1765 (28.6%)
X9:  1     2990 (72.3%)    1425 (69.7%)    4415 (71.4%)
X10: 0      708 (17.1%)     361 (17.7%)    1069 (17.3%)
X10: 1     3427 (82.9%)    1684 (82.3%)    5111 (82.7%)
X11: 0      569 (13.8%)     308 (15.1%)     877 (14.2%)
X11: 1     3566 (86.2%)    1737 (84.9%)    5303 (85.8%)
X12: 0     2521 (61.0%)    1286 (62.9%)    3807 (61.6%)
X12: 1     1614 (39.0%)     759 (37.1%)    2373 (38.4%)
X13: 0      217 (5.2%)      100 (4.9%)      317 (5.1%)
X13: 1     3918 (94.8%)    1945 (95.1%)    5863 (94.9%)
X14: 0     3965 (95.9%)    1959 (95.8%)    5924 (95.9%)
X14: 1      170 (4.1%)       86 (4.2%)      256 (4.1%)

Table 3.5: Outcome frequencies of the dichotomous categorical variables X9–X14.

[Figure 3.5: Bar charts of the categorical variable X15.]


      TRAIN sample    TEST sample     total
0     1229 (29.7%)     590 (28.9%)    1819 (29.4%)
1      245 (5.9%)      131 (6.4%)      376 (6.1%)
2     2661 (64.4%)    1324 (64.7%)    3985 (64.5%)

Table 3.6: Outcome frequencies of the categorical variable X15.

[Figure 3.6: Bar charts of the categorical variable X16.]

      TRAIN sample    TEST sample     total
0     3071 (74.3%)    1513 (74.0%)    4584 (74.2%)
1      535 (12.9%)     293 (14.3%)     828 (13.4%)
2      529 (12.8%)     239 (11.7%)     768 (12.4%)

Table 3.7: Outcome frequencies of the categorical variable X16.


[Figure 3.7: Bar charts of the categorical variable X17.]

      TRAIN sample    TEST sample     total
0     1440 (34.8%)     728 (35.6%)    2168 (35.1%)
1      223 (5.4%)      118 (5.8%)      341 (5.5%)
2      639 (15.5%)     311 (15.2%)     950 (15.4%)
3     1833 (44.3%)     888 (43.4%)    2721 (44.0%)

Table 3.8: Outcome frequencies of the categorical variable X17.


[Figure 3.8: Bar charts of the categorical variable X18.]

      TRAIN sample    TEST sample     total
0      874 (21.1%)     441 (21.6%)    1315 (21.3%)
1      809 (19.6%)     378 (18.5%)    1187 (19.2%)
2      595 (14.4%)     319 (15.6%)     914 (14.8%)
3      232 (5.6%)      125 (6.1%)      357 (5.8%)
4      120 (2.9%)       77 (3.8%)      197 (3.2%)
5     1505 (36.4%)     705 (34.5%)    2210 (35.8%)

Table 3.9: Outcome frequencies of the categorical variable X18.


[Figure 3.9: Bar charts of the categorical variable X19.]

      TRAIN sample    TEST sample     total
0      369 (8.9%)      209 (10.2%)     578 (9.4%)
1      634 (15.3%)     295 (14.4%)     929 (15.0%)
2     1093 (26.4%)     548 (26.8%)    1641 (26.6%)
3      255 (6.2%)      132 (6.5%)      387 (6.3%)
4     1145 (27.7%)     552 (27.0%)    1697 (27.5%)
5      639 (15.5%)     309 (15.1%)     948 (15.3%)

Table 3.10: Outcome frequencies of the categorical variable X19.


[Figure 3.10: Bar charts of the categorical variable X20.]

      TRAIN sample    TEST sample     total
0      710 (17.2%)     370 (18.1%)    1080 (17.5%)
1      267 (6.5%)      151 (7.4%)      418 (6.8%)
2      372 (9.0%)      198 (9.7%)      570 (9.2%)
3      255 (6.2%)      121 (5.9%)      376 (6.1%)
4       60 (1.5%)       25 (1.2%)       85 (1.4%)
5     2471 (59.8%)    1180 (57.7%)    3651 (59.1%)

Table 3.11: Outcome frequencies of the categorical variable X20.


[Figure 3.11: Bar charts of the categorical variable X21.]

      TRAIN sample    TEST sample     total
0      737 (17.8%)     387 (18.9%)    1124 (18.2%)
1      412 (10.0%)     183 (8.9%)      595 (9.6%)
2      708 (17.1%)     318 (15.6%)    1026 (16.6%)
3      461 (11.1%)     214 (10.5%)     675 (10.9%)
4     1069 (25.9%)     554 (27.1%)    1623 (26.3%)
5      473 (11.4%)     237 (11.6%)     710 (11.5%)
6      275 (6.7%)      152 (7.4%)      427 (6.9%)

Table 3.12: Outcome frequencies of the categorical variable X21.


[Figure 3.12: Bar charts of the categorical variable X22.]

      TRAIN sample    TEST sample     total
0      383 (9.3%)      198 (9.7%)      581 (9.4%)
1      420 (10.2%)     228 (11.1%)     648 (10.5%)
2     1189 (28.8%)     572 (28.0%)    1761 (28.5%)
3      202 (4.9%)       78 (3.8%)      280 (4.5%)
4      210 (5.1%)       95 (4.6%)      305 (4.9%)
5      317 (7.7%)      175 (8.6%)      492 (8.0%)
6      218 (5.3%)       97 (4.7%)      315 (5.1%)
7      227 (5.5%)      112 (5.5%)      339 (5.5%)
8      143 (3.5%)       69 (3.4%)      212 (3.4%)
9      826 (20.0%)     421 (20.6%)    1247 (20.2%)

Table 3.13: Outcome frequencies of the categorical variable X22.


[Figure 3.13: Bar charts of the categorical variable X23.]

      TRAIN sample    TEST sample     total
0      279 (6.7%)      145 (7.1%)      424 (6.9%)
1      298 (7.2%)      167 (8.2%)      465 (7.5%)
2     1025 (24.8%)     510 (24.9%)    1535 (24.8%)
3      375 (9.1%)      180 (8.8%)      555 (9.0%)
4      226 (5.5%)       99 (4.8%)      325 (5.3%)
5      104 (2.5%)       59 (2.9%)      163 (2.6%)
6      269 (6.5%)      132 (6.5%)      401 (6.5%)
7      389 (9.4%)      194 (9.5%)      583 (9.4%)
8      561 (13.6%)     263 (12.9%)     824 (13.3%)
9       65 (1.6%)       25 (1.2%)       90 (1.5%)
10     544 (13.2%)     271 (13.3%)     815 (13.2%)

Table 3.14: Outcome frequencies of the categorical variable X23.


    4 Credit Scoring Methods

We suppose that we are given a set $\{x_i\}_{i=1}^{N}$ of N observations of a random vector X in $\mathbb{R}^p$. That is, there are p independent variables (predictors) X1, ..., Xp and one dependent variable (response) Y. Each observation is thus a row vector $(x_i, y_i)$, $i = 1, 2, \ldots, N$, and the whole data set can be written in matrix form:

$$ X = \begin{pmatrix} x_{1,1} & \cdots & x_{1,p} \\ \vdots & & \vdots \\ x_{N,1} & \cdots & x_{N,p} \end{pmatrix}, \qquad Y = \begin{pmatrix} y_1 \\ \vdots \\ y_N \end{pmatrix}. $$

In our particular case we have two data sets with $N_{TRAIN} = 4135$ and $N_{TEST} = 2045$. The dimension of the predictor space is p = 23 (or p = 61 after introducing dummy variables).

    4.1 Logistic Regression

Logistic regression stems from a wider class of models: it is a special case of the generalized linear model (GLM). The logistic regression model assumes that the conditional expectation $E[Y \mid X = x]$ of Y is given in the following way:

$$ E[Y \mid x] = G(\beta_0 + x^\top \beta) = \frac{e^{\beta_0 + x^\top \beta}}{1 + e^{\beta_0 + x^\top \beta}}, $$

where $\beta_0$ is the regression constant, $\beta = (\beta_1, \ldots, \beta_p)^\top$ are the regression coefficients, and $G(\cdot)$ is the logistic cumulative distribution function (also called the link function). The distribution of Y (Bernoulli distribution) belongs to the exponential family of distributions:

$$ f(y_i, \pi) = P(Y = y_i \mid X = x_i) = \pi^{y_i} (1 - \pi)^{1 - y_i}, \qquad i = 1, 2, \ldots, N, $$

where $\pi = P(Y = 1 \mid x)$ is the probability that the applicant will not be able to pay his liabilities and $1 - \pi = P(Y = 0 \mid x)$ is the probability of a creditworthy applicant. The parameters of a GLM are, in general, estimated by the maximum likelihood method, maximizing

$$ L(\beta_0, \beta_1, \ldots, \beta_p) = \prod_{i=1}^{N} \pi_i^{y_i} (1 - \pi_i)^{1 - y_i} = \prod_{i=1}^{N} G(\beta_0 + x_i^\top \beta)^{y_i} \bigl(1 - G(\beta_0 + x_i^\top \beta)\bigr)^{1 - y_i}. $$

The estimation is carried out by the Newton–Raphson algorithm. Computational details are given in Härdle et al. (2000b).
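To make the estimation step concrete, here is a minimal Python sketch of the Newton–Raphson (equivalently, iteratively reweighted least squares) iteration for the logit model; it is our own illustration under assumed variable names, not the XploRe code used in the thesis:

```python
import numpy as np

def fit_logit(X, y, max_iter=25, tol=1e-8):
    """Maximize the Bernoulli likelihood of the logit model
    E[Y|x] = G(beta0 + x'beta) by Newton-Raphson."""
    Z = np.column_stack([np.ones(len(X)), X])   # prepend intercept column
    beta = np.zeros(Z.shape[1])
    for _ in range(max_iter):
        pi = 1.0 / (1.0 + np.exp(-Z @ beta))    # fitted probabilities G(.)
        W = pi * (1.0 - pi)                     # Bernoulli variances
        score = Z.T @ (y - pi)                  # gradient of the log-likelihood
        info = Z.T @ (Z * W[:, None])           # Fisher information matrix
        step = np.linalg.solve(info, score)
        beta += step
        if np.max(np.abs(step)) < tol:          # converged
            break
    return beta
```

Standard errors and t-values of the kind reported in Tables 4.1–4.3 then come from the inverse of the information matrix at the optimum.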


Variable    Coeff.   Std.Err.  t-value     Variable    Coeff.   Std.Err.  t-value
Const.      -2.747    0.066   -41.811      Const.      -2.757    0.066   -41.976
X1           0.191    0.069     2.771      X5           0.093    0.082     1.135
Const.      -2.823    0.071   -39.873      Const.      -2.754    0.066   -41.861
X2          -0.313    0.088    -3.539      X6           0.035    0.131     0.267
Const.      -2.767    0.067   -41.518      Const.      -2.830    0.071   -39.635
X3          -0.097    0.082    -1.189      X7          -0.872    0.217    -4.013
Const.      -2.752    0.066   -41.721      Const.      -2.948    0.099   -29.807
X4           0.047    0.078     0.596      X8          -1.306    0.431    -3.032

Table 4.1: Logistic regression coefficients for each metric variable separately. Bold parameter values are significant at 5%.

    Results

The estimations are based on the TRAIN data set. Since the logistic regression cannot deal with categorical variables directly, we must recode each of the variables X9–X23 with a set of new contrast variables (the first category, 0, is taken as the reference category for each variable). In this manner we get a larger data set with one response, 8 metric and 53 dummy variables (stored in the files fr train C.dat and fr test C.dat). Let Xa(b) denote the b-th contrast variable for the original variable Xa; e.g. X23 is then recoded into the contrast variables X23(1), X23(2), ..., X23(10). First we tried to fit the response on each variable separately. Table 4.1 shows the estimation results: the parameters, their standard errors and t-values for the metric variables. X1, X2, X7 and X8 are significant at 5% (and thus emphasized in bold), X3–X6 are insignificant.
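The recoding described above is mechanical; a hedged Python sketch (using pandas as a stand-in for the XploRe data preparation, with invented toy values) shows the idea of dropping the reference category 0:

```python
import pandas as pd

# toy stand-in for two of the categorical predictors
df = pd.DataFrame({"X15": [0, 2, 1, 2, 0], "X23": [0, 3, 10, 2, 8]})

# one dummy per non-reference category; category 0 is the reference
dummies = pd.get_dummies(df.astype("category"), drop_first=True)
print(list(dummies.columns))   # e.g. ['X15_1', 'X15_2', 'X23_2', 'X23_3', ...]
```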

Results for the categorical variables estimated separately are given in Table 4.2. Parameters significant at 5% are again emphasized in bold. The significant binary variables are X9, X10 and X14. No variable with more than two categories has significant coefficients for all of its dummy variables. All coefficients for X16 and X17 are insignificant.

Of the significant variables, the categorical variable X10 has the lowest deviance (1802.6). The sequence of the significant variables sorted with respect to the deviance (in decreasing order) is: X10, X20, X22, X23, X21, and the corresponding R² values decrease from 3.7% to 1.5%. This is really a very poor performance. Hence we use


Variable    Coeff.   Std.Err.  t-value     Variable    Coeff.   Std.Err.  t-value
Const.      -2.403    0.107   -22.426      Const.      -2.121    0.121   -17.475
X9(1)       -0.524    0.136    -3.864      X20(1)      -0.392    0.262    -1.496
Const.      -1.864    0.110   -16.910      X20(2)      -1.048    0.290    -3.613
X10(1)      -1.206    0.138    -8.737      X20(3)      -0.343    0.263    -1.304
Const.      -2.609    0.166   -15.727      X20(4)       0.932    0.328     2.836
X11(1)      -0.172    0.181    -0.954      X20(5)      -1.024    0.158    -6.481
Const.      -2.841    0.087   -32.563      Const.      -3.160    0.186   -16.951
X12(1)       0.206    0.132     1.557      X21(1)      -0.049    0.316    -0.155
Const.      -2.344    0.240    -9.759      X21(2)       0.142    0.258     0.549
X13(1)      -0.440    0.250    -1.763      X21(3)       1.008    0.241     4.184
Const.      -2.792    0.068   -41.015      X21(4)       0.561    0.222     2.528
X14(1)       0.659    0.258     2.549      X21(5)       0.274    0.277     0.987
Const.      -3.140    0.143   -21.952      X21(6)       0.667    0.294     2.271
X15(1)       0.174    0.329     0.528      Const.      -3.133    0.255   -12.267
X15(2)       0.540    0.162     3.329      X22(1)      -0.096    0.361    -0.266
Const.      -2.771    0.077   -36.160      X22(2)       0.678    0.277     2.445
X16(1)       0.255    0.181     1.405      X22(3)      -0.354    0.487    -0.726
X16(2)      -0.192    0.215    -0.892      X22(4)       0.638    0.365     1.749
Const.      -2.833    0.115   -24.628      X22(5)       0.262    0.357     0.735
X17(1)      -0.034    0.318    -0.106      X22(6)       0.894    0.343     2.604
X17(2)      -0.077    0.213    -0.363      X22(7)      -1.590    0.755    -2.107
X17(3)       0.192    0.148     1.297      X22(8)       0.989    0.374     2.645
Const.      -3.147    0.170   -18.492      X22(9)       0.255    0.299     0.854
X18(1)       0.568    0.219     2.596      Const.      -2.561    0.232   -11.035
X18(2)       0.246    0.251     0.982      X23(1)      -0.308    0.346    -0.890
X18(3)       0.891    0.281     3.168      X23(2)      -0.811    0.290    -2.794
X18(4)       0.635    0.386     1.645      X23(3)      -0.264    0.323    -0.816
X18(5)       0.416    0.201     2.065      X23(4)      -0.622    0.412    -1.509
Const.      -3.030    0.248   -12.204      X23(5)      -0.425    0.514    -0.826
X19(1)       0.636    0.287     2.217      X23(6)       0.092    0.325     0.284
X19(2)       0.253    0.280     0.904      X23(7)      -0.161    0.313    -0.513
X19(3)       0.670    0.334     2.009      X23(8)       0.342    0.272     1.257
X19(4)       0.025    0.285     0.086      X23(9)      -1.598    1.034    -1.545
X19(5)       0.241    0.301     0.802      X23(10)      0.054    0.283     0.191

Table 4.2: Logistic regression coefficients for each categorical variable separately. Bold parameter values are significant at 5%.


[Figure 4.1: Predicted probabilities. GLM logistic fit, n = 4135: index η plotted against the link μ and the responses y.]

the quantlet glmforward³ to find a more appropriate model. The quantlet starts with a null model (containing the intercept only) and adds particular variables consecutively. The measure of goodness of the model is Akaike's criterion. In this way we can identify subsets of independent variables that are good predictors of the response. Note that this quantlet must evaluate up to p(p+1)/2 models, which in our case means 62·63/2 = 1953 models, and therefore the computation takes a while. The forward stepwise procedure suggests including the following variables in the model: X1, X2, X5, X7–X10, X12–X14, X16(2), X18(3), X18(5), X19(1), X19(3), X20(2), X20(4), X20(5), X21(3), X21(4), X21(6), X22(2), X22(4), X22(6)–X22(9), X23(2), X23(8), X23(9). The Akaike information criterion of this model is 1643.6. It is remarkable that the variable X23(3) is insignificant, even though it seemed to be quite predictive from Figure 3.13. Table 4.3 shows the estimated parameters, their standard errors and corresponding t-values in this model. Since by using dummy variables one considers all possible effects, the modelling of the categorical variables cannot be further improved, but the influence of the metric variables may be investigated better by using a semiparametric model. Figure 4.1 shows a plot of the index η against Y together with a plot of the index against the link function G(η).

³ Files o-logit*.* contain the complete computer output related to this section.
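The following Python sketch mimics what glmforward does, as we understand it: greedy forward selection of logit terms by Akaike's criterion (a statsmodels-based illustration with our own function names, not the XploRe quantlet itself):

```python
import numpy as np
import statsmodels.api as sm

def forward_aic(X, y):
    """Greedy forward selection for a logit model: start from the
    intercept-only null model and add the column that lowers the AIC
    most; stop when no remaining column improves it.  In the worst
    case this evaluates on the order of p(p+1)/2 candidate models."""
    def aic_of(cols):
        Z = sm.add_constant(X[:, cols]) if cols else np.ones((len(y), 1))
        return sm.Logit(y, Z).fit(disp=0).aic

    selected, remaining = [], list(range(X.shape[1]))
    best_aic = aic_of(selected)
    while remaining:
        trials = [(aic_of(selected + [j]), j) for j in remaining]
        trial_aic, j_best = min(trials)
        if trial_aic >= best_aic:
            break                      # no further improvement
        best_aic = trial_aic
        selected.append(j_best)
        remaining.remove(j_best)
    return selected, best_aic
```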


Variable    Coeff.    Std.Err.   t-value
Const.     -2.153     0.3653    -5.8936
X1          0.1952    0.1126     1.7338
X2         -0.3533    0.0967    -3.6518
X5          0.1618    0.1047     1.546
X7         -0.8900    0.2198    -4.0488
X8         -0.958     0.4080    -2.3482
X9(1)      -0.3269    0.1706    -1.9162
X10(1)     -0.9780    0.1519    -6.4356
X12(1)      0.2551    0.1598     1.5961
X13(1)     -0.511     0.2678    -1.9091
X14(1)      0.6736    0.2850     2.3636
X16(2)     -0.3144    0.2351    -1.3372
X18(3)      0.4531    0.265      1.7056
X18(5)      0.5652    0.1940     2.913
X19(1)      0.4681    0.1768     2.6472
X19(3)      0.5394    0.2546     2.1181
X20(2)     -0.8286    0.2982    -2.7789
X20(4)      1.262     0.362      3.4868
X20(5)     -0.8635    0.1506    -5.7339
X21(3)      0.7086    0.1949     3.6355
X21(4)      0.2385    0.1687     1.4136
X21(6)      0.5004    0.2640     1.8953
X22(2)      0.6722    0.1839     3.6541
X22(4)      0.7068    0.3161     2.236
X22(6)      0.9538    0.2967     3.2146
X22(7)     -1.341     0.6702    -2.0007
X22(8)      0.880     0.4261     2.0666
X22(9)      0.350     0.2182     1.6064
X23(2)     -0.6462    0.2001    -3.2293
X23(8)      0.3820    0.1766     2.163
X23(9)     -1.552     0.9915    -1.566

Table 4.3: Logistic regression coefficients of the model suggested by glmforward. Bold parameter values are significant at 5%.


    Prediction

We use the model described in the previous paragraph (suggested by the glmforward quantlet) to estimate the outcomes of the TEST data set and to compute the misclassification rate.

First we check the prediction on the TRAIN data set, which built the model. Table 4.4 shows the results when observations with a probability higher than 0.5 are assigned to be non-creditworthy clients. For comparison we show the results using prior probabilities in Table 4.5. At first sight, using prior probabilities gives worse results: the overall misclassification rate of 8.95% is higher than the rate when using the 0.5 threshold (6.0%). But in fact, the 0.5 threshold misclassifies 93.9% of the bad clients, while prior probabilities misclassify only 74.9%! The 0.5 threshold gains its better overall misclassification rate from the low number of misclassified good clients (0.4%), whereas using prior probabilities misclassifies 4.76% of them. From the bank's point of view it is worse to grant a loan to a non-creditworthy applicant than to reject a creditworthy one, and thus we will further use prior probabilities to decide on the creditworthiness of a client.

Then we use the model on the testing data, stored in the file data-test C.dat. Table 4.6 shows the results. In the TEST data set, 98 of the 1920 good applicants were denoted as bad, 101 of the 125 bad clients were assigned as good, and thus 199 applicants of the entire TEST sample were misclassified, which results in an overall misclassification rate of about 9.7%.
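For clarity, a small Python sketch of how such a misclassification table can be computed from predicted probabilities (our own illustration; the cutoff is either 0.5 or the prior rate of bad clients, roughly 0.06 in the TRAIN sample):

```python
import numpy as np

def misclass_table(y, p_hat, cutoff):
    """Confusion counts and misclassification rates for a given cutoff;
    a client is predicted 'bad' (1) when p_hat exceeds the cutoff."""
    pred = (p_hat > cutoff).astype(int)
    n = {(a, b): int(np.sum((y == a) & (pred == b)))
         for a in (0, 1) for b in (0, 1)}
    good_rate = n[0, 1] / (n[0, 0] + n[0, 1])   # misclassified good clients
    bad_rate = n[1, 0] / (n[1, 0] + n[1, 1])    # misclassified bad clients
    overall = (n[0, 1] + n[1, 0]) / len(y)
    return n, good_rate, bad_rate, overall
```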

observed    predicted 0    predicted 1    misclass.
0                  3872             16       0.41%
1                   232             15      93.93%
overall misclassified: 248 (6.00%)

Table 4.4: Misclassification rates of the logit model for the TRAIN data set.

observed    predicted 0    predicted 1    misclass.
0                  3703            185       4.76%
1                   185             62      74.90%
overall misclassified: 370 (8.95%)

Table 4.5: Misclassification rates of the logit model for the TRAIN data set (prior probabilities).


observed    predicted 0    predicted 1    misclass.
0                  1822             98       5.10%
1                   101             24      80.80%
overall misclassified: 199 (9.73%)

Table 4.6: Misclassification rates of the logit model for the TEST data set (prior probabilities).

    Discussion

Since the logit model belongs to the traditional techniques, one may find logistic regression in almost every paper treating credit scoring, at least as a reference method for comparison with other models. In general, logistic regression is easy to fit and works well in practice. However, binary output models (like logit or probit) describe best the output category which occurs most frequently. Therefore each output category should have at least 5% of the observations. If an output class has too small a number of observations, one should use so-called rare event models (Kaiser and Szczesny, 2000a).

    4.2 Multi-layer Perceptron

The multi-layer perceptron (MLP) is a simple feed-forward neural network with an input layer, several hidden layers and one output layer. This means that information can only flow forward, from the input units to the hidden layer and then to the output unit(s). The MLP is the most often used neural network architecture, and there is a great deal of literature concerning it; see for example Bishop (1995). For the purpose of credit scoring an MLP with one hidden layer and only one or two output units is sufficient. Its basic structure is illustrated in Figure 4.2. The value of the output unit can be expressed as:

$$ f(x) = F_2\Biggl( w_0^{(2)} + \sum_{j=1}^{r} w_j^{(2)} \, F_1\Bigl( w_{j0}^{(1)} + \sum_{i=1}^{p} w_{ji}^{(1)} x_i \Bigr) \Biggr), $$

where the $x_i$ are the input units, $w_{ji}^{(1)}$ and $w_j^{(2)}$ are the weights of the hidden and output layer, respectively, and $F_1$ and $F_2$ are the transfer functions from the input to the hidden layer and from the hidden to the output layer, respectively. The transfer function is usually a sigmoid, e.g. the logistic function. The parameters of the network are determined iteratively, commonly via the backpropagation procedure.
    determined iteratively, commonly via the backpropagation procedure.


[Figure 4.2: A multi-layer perceptron network with one hidden layer (units z1, ..., zr, weights w(1) and w(2)) and one output unit y = f(x).]

    Results

For probabilistic models the softmax transfer function is intended for the outputs. However, we found a faulty usage of this function in XploRe; it is commented on in Appendix B.2. Therefore we used an MLP network with a logistic transfer function for the output, a quadratic least squares error function and no skip connections.

We split the TRAIN sample into two subsamples, stored in the files data-nn-1.dat (with 2779 observations) and data-nn-2.dat (with 1356 observations), respectively. The first subsample is used for the actual network training; on it we look for an MLP network with the minimal mean squared error (MSE). The second subsample is used for validation; it should prevent overfitting, so that the model is not built for one particular sample only.
observed    predicted 0    predicted 1    misclass.
0                  2579             50        1.9%
1                    60             90       40.0%
overall misclassified: 110 (4.0%)

Table 4.7: Misclassification rates of the 23-12-1 MLP network for the small training sample.


observed    predicted 0    predicted 1    misclass.
0                  1197             62        4.9%
1                    86             11       88.7%
overall misclassified: 148 (10.9%)

Table 4.8: Misclassification rates of the 23-12-1 MLP network for the small validation sample.

observed    predicted 0    predicted 1    misclass.
0                  2569             60        2.3%
1                    60             90       40.0%
overall misclassified: 120 (4.3%)

Table 4.9: Misclassification rates of the 17-13-1 MLP network for the small training sample.

observed    predicted 0    predicted 1    misclass.
0                  1205             54        4.3%
1                    81             16       83.5%
overall misclassified: 135 (10.0%)

Table 4.10: Misclassification rates of the 17-13-1 MLP network for the small validation sample.


We computed many models and looked at the MSE and the misclassification rates on the validation data set. The misclassification rates are computed using prior probabilities obtained from the small training sample. Finally we chose an MLP network with 12 units in the hidden layer, which reached an MSE of 243.58. The final 23-12-1 MLP network⁴ is saved in the files mlp1.*.
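A hedged sketch of this search over network sizes, written here with scikit-learn's MLPClassifier instead of the XploRe trainer; the selection criterion (validation squared error of the predicted probabilities) follows the text, while the other settings are our assumptions:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def select_mlp(X_tr, y_tr, X_val, y_val, sizes=range(4, 16)):
    """Train one-hidden-layer MLPs of several sizes on the small
    training sample; keep the net with the lowest validation error."""
    best_net, best_err = None, np.inf
    for r in sizes:
        net = MLPClassifier(hidden_layer_sizes=(r,), activation="logistic",
                            max_iter=2000, random_state=0).fit(X_tr, y_tr)
        p_val = net.predict_proba(X_val)[:, 1]
        err = np.sum((y_val - p_val) ** 2)   # squared-error criterion
        if err < best_err:
            best_net, best_err = net, err
    return best_net, best_err
```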

Tables 4.7 and 4.8 show the misclassification rates which the final MLP network reaches in the small training and in the validation sample, respectively. The resulting MLP network predicts more than 98% of the good clients correctly.

⁴ Here 23-12-1 denotes 23 input units, 12 units in the hidden layer and 1 output unit.


Of the bad clients, 40% are misclassified; altogether there are 4% misclassified clients. The results get slightly worse when the MLP network is used on the small validation set: almost 5% of the good clients and more than 88% of the bad clients are misclassified!

A problem with neural networks is the missing procedure for choosing significant input units. Thus we let ourselves be inspired by the logistic regression and restricted the input layer to 17 units only, as suggested in Subsection 4.1 by the quantlet glmforward. That is, we built the network only on the knowledge of the variables X1, X2, X5, X7–X10, X12–X14, X16 and X18–X23. The resulting network has 13 hidden units and an MSE of 310.35. It is stored in the files mlp2.*. Table 4.9 shows its misclassification rates. This restricted MLP network has a higher misclassification rate for the good clients from the small training sample (2.3%), but its performance on the small validation sample is a little bit better: 4.3% of the good clients and only 83.5% of the bad clients are misclassified (Table 4.10).

    Prediction

Before we compute the misclassification rates for the TEST data set, we check the prediction on the whole TRAIN sample which built the model. The results for the 23-12-1 MLP and the 17-13-1 MLP network are shown in Tables 4.11 and 4.12, respectively.
observed    predicted 0    predicted 1    misclass.
0                  3771            117        3.0%
1                   146            101       59.1%
overall misclassified: 263 (6.4%)

Table 4.11: Misclassification rates of the 23-12-1 MLP network for the TRAIN data set.

observed    predicted 0    predicted 1    misclass.
0                  3751            137        3.5%
1                   138            109       55.9%
overall misclassified: 275 (6.7%)

Table 4.12: Misclassification rates of the 17-13-1 MLP network for the TRAIN data set.


observed    predicted 0    predicted 1    misclass.
0                  1815            105        5.5%
1                   110             15       88.0%
overall misclassified: 215 (10.5%)

Table 4.13: Misclassification rates of the 23-12-1 MLP network for the TEST data set.

observed    predicted 0    predicted 1    misclass.
0                  1813            107        5.6%
1                   110             15       88.0%
overall misclassified: 217 (10.6%)

Table 4.14: Misclassification rates of the 17-13-1 MLP network for the TEST data set.

    clientsit misclassified 8 clients (3.2%) less than the full network, on the other side

    by the good clients it misclassifies 20 clients (0.5%) more than the full network.

    Altogether the restricted 17-13-1 MLP network describes the TRAIN data set slightly

    worse than the full 23-12-1 MLP network. It misclassifies 12 clients more (0.3%).

Note that the number of misclassified clients in the small training sample plus the number of misclassified clients in the small validation sample is not exactly equal to the number of misclassified clients in the TRAIN data set, no matter which network is used (110 + 148 ≠ 263 and 120 + 135 ≠ 275). This is due to the prior probabilities we used: we denote a client as bad if the neural network's output is greater than the rate of good clients in the corresponding training sample, that is, 94.6% in the small training sample and 94.0% in the TRAIN data set.
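As a minimal illustration of this cutoff rule, sketched in Python (which is not the software used in this thesis; only the rates 0.946 and 0.940 are taken from the text above, everything else is hypothetical):

    import numpy as np

    # Sketch of the prior-probability cutoff described above: a client is
    # labelled bad (1) when the network output exceeds the share of good
    # clients in the sample the network was trained on.
    def classify(network_outputs, good_rate=0.946):
        return (np.asarray(network_outputs) > good_rate).astype(int)

With good_rate=0.946 this mimics the rule for the small training sample; for the TRAIN data set one would use 0.940.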

Table 4.13 and Table 4.14 show the final results of the prediction for the TEST data set. There is no difference in predicting the bad clients, and the restricted 17-13-1 MLP network misclassifies only 2 clients more than the full 23-12-1 MLP network. Thus these two neural networks give almost the same results.

    Discussion

Neural networks are very flexible models with good performance. Although there are various architectures of neural networks, more than 50% of applications use the multi-layer perceptron (MLP) network, which is both simple and well known.

Choosing the number of units in the hidden layer is problematic. West (2000) uses an analogy of the forward stepwise procedure in logistic regression: the so-called cascade learning starts with one neuron in the hidden layer and adds further neurons as long as the performance improves. Seen this way, logistic regression may be viewed as a simple MLP with one processing unit in one hidden layer and the logistic function as the sigmoid activation function.
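A minimal sketch of such a cascade search, written here with scikit-learn in Python rather than with the tools used in this thesis (MLPRegressor and the simple stopping rule are illustrative assumptions, not West's exact procedure):

    from sklearn.neural_network import MLPRegressor
    from sklearn.metrics import mean_squared_error

    # Cascade learning sketch: grow the hidden layer one unit at a time
    # and keep the network as long as the validation MSE still improves.
    def cascade_search(X_tr, y_tr, X_va, y_va, max_units=30):
        best_net, best_mse = None, float("inf")
        for h in range(1, max_units + 1):
            net = MLPRegressor(hidden_layer_sizes=(h,), max_iter=2000,
                               random_state=0).fit(X_tr, y_tr)
            mse = mean_squared_error(y_va, net.predict(X_va))
            if mse >= best_mse:      # performance stopped improving
                break
            best_net, best_mse = net, mse
        return best_net, best_mse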

    4.3 Radial Basis Function Neural Network

The radial basis function (RBF) network is another architecture of feed-forward neural networks, which has in principle only one hidden layer. The hidden units are also called clusters, as the observations are clustered to one of the hidden units, which are represented by radially symmetric functions. The weights of the hidden layer then represent the centers of these clusters (the means of the radial functions). At the first stage, these centers as well as the widths of the radial basis functions are found (via so-called unsupervised learning). Next, the weights of the output layer are determined via supervised learning (as in the MLP networks). RBF neural networks are explained comprehensively in Appendix A.

                 predicted
    observed      0      1    misclass.   overall misclass.
       0        2505    124     4.7%          248
       1         124     26    82.7%         (8.9%)

    Table 4.15: Misclassification rates of the 23-100-1 RBF network for the small training sample.

                 predicted
    observed      0      1    misclass.   overall misclass.
       0        1200     59     4.7%          142
       1          83     14    85.6%        (10.5%)

    Table 4.16: Misclassification rates of the 23-100-1 RBF network for the small validation sample.


    Results

The RBF neural network is built in the same way as the MLP network. We trained the network on the small training sample with 2779 observations (data-nn-1.dat) in order to minimize the mean squared error, and at the same time checked for overfitting by evaluating the model on the small validation sample (data-nn-2.dat).

In the same manner as with the MLP network, we tried many different networks with various learning parameters and different numbers of hidden units until we obtained optimal results. There are again two models, one with 23 input units (the full model) and another one with only 17 inputs, as suggested in Section 4.1.

The first model uses 100 clusters; we denote it the 23-100-1 RBF neural network and save it in rbf1.rbf. The latter contains only 80 clusters and we denote it the 17-80-1 RBF network (stored in rbf2.rbf). Misclassification rates for the 23-100-1 RBF network are given in Table 4.15 for the small training sample and in Table 4.16 for the validation sample. The 17-80-1 RBF neural network's misclassification rates are shown in Tables 4.17 and 4.18. In comparison with the MLP networks, both the 23-100-1 and the 17-80-1 RBF networks misclassify almost twice as many observations in the small training sample as the MLP does. However, the misclassification rates in the validation sample are reasonable: 10.5% and 10.8% of observations misclassified overall.

                 predicted
    observed      0      1    misclass.   overall misclass.
       0        2510    119     4.5%          238
       1         119     31    79.3%         (8.6%)

    Table 4.17: Misclassification rates of the 17-80-1 RBF network for the small training sample.

                 predicted
    observed      0      1    misclass.   overall misclass.
       0        1198     61     4.8%          146
       1          85     12    87.6%        (10.8%)

    Table 4.18: Misclassification rates of the 17-80-1 RBF network for the small validation sample.


    Prediction

Table 4.19 shows the results of the 23-100-1 RBF neural network. It misclassifies exactly the same number of observations as the 23-12-1 MLP network: 10.5%. However, the RBF network predicts the bad applicants better: there are 109 misclassifications (87.2%). Of the good applicants, 106 (5.5%) were misclassified. Results for the 17-80-1 RBF network are shown in Table 4.20. It performs even a little better: 106 of the bad and 103 of the good applicants were misclassified, which is altogether 209 (10.2%) misclassified applicants.

                 predicted
    observed      0      1    misclass.   overall misclass.
       0        1814    106     5.5%          215
       1         109     16    87.2%        (10.5%)

    Table 4.19: Misclassification rates of the 23-100-1 RBF network for the TEST data set.

                 predicted
    observed      0      1    misclass.   overall misclass.
       0        1817    103     5.4%          209
       1         106     19    84.8%        (10.2%)

    Table 4.20: Misclassification rates of the 17-80-1 RBF network for the TEST data set.

    Discussion

Radial basis function neural networks are supposed to give better predictions than MLP networks. That is true indeed; however, their performance is only slightly better in our case (8 misclassified observations fewer than the MLP). Due to the unsupervised learning, the computation of RBF networks proceeds very fast. One may also notice the misclassification rates of the small training sample: while MLP networks tend to overfit, the results from the RBF networks are more robust, as the misclassification rates between the small training and the validation sample differ by only about 2 percentage points.


    5 Other Methods

In the vast literature on credit scoring there are many other techniques which may also be used. This section gives a short overview of some of these methods and refers to further literature.

    5.1 Probit Regression

Sometimes an alternative to the logit model is used. The probit model is another variant of generalized linear models. It is derived by letting the link function be the standard normal distribution function. Since the logistic link function closely approximates that of a normal random variable, the results are similar.
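This similarity can be checked with a small Python sketch using statsmodels on synthetic data (not the thesis data; the coefficients below are arbitrary):

    import numpy as np
    import statsmodels.api as sm

    # Fit logit and probit to the same synthetic data; the predicted
    # probabilities are typically almost identical.
    rng = np.random.default_rng(0)
    X = sm.add_constant(rng.normal(size=(500, 3)))
    beta = np.array([0.2, 1.0, -0.5, 0.8])
    y = (X @ beta + rng.logistic(size=500) > 0).astype(int)

    logit  = sm.Logit(y, X).fit(disp=0)
    probit = sm.Probit(y, X).fit(disp=0)
    print(np.corrcoef(logit.predict(X), probit.predict(X))[0, 1])  # close to 1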

    5.2 Semiparametric Regression

In order to give more attention to metric variables, one may estimate them nonparametrically. Härdle et al. (2000a) show that semiparametric methods perform better than the logistic regression. The resulting model then consists of a linear and a nonlinear part:

    E[Y | X] = E[Y | (V, W)] = G(V⊤β + m(W)),

where β = (β1, . . . , βv)⊤ is the parameter vector for the categorical variables and m(·) is a smooth real function which can be estimated nonparametrically. In practice, one chooses for the nonparametric part only those continuous explanatory variables which have the most influence on the dependent variable Y. Estimators for β and m(·) are computed by semiparametric maximum likelihood (this is reviewed in Härdle et al. (2000b)).

    5.3 Classification Trees

A classification tree (usually subsumed under the more general name classification and regression trees, CART) is a nonparametric method to analyze categorical dependent variables as a function of metric and/or categorical explanatory variables. Classification trees have become a standard tool for developing credit scoring systems, since they are easily interpretable and may be presented graphically.

Classification and regression trees use a recursive partitioning algorithm (split-and-conquer): the basic idea is to split the sample into two subsamples, each of which contains, as far as possible, only cases from one response category, and then to repeatedly split each of the subsamples. The subsamples are called nodes; the entire sample is called the root node.

First, one looks for the explanatory variable which splits the sample in a node into two subgroups in such a way that these child nodes are internally as homogeneous as possible and differ from each other as much as possible. The splits are chosen according to some statistical criterion. Usually the tree is grown until it overfits and is afterwards pruned back to obtain a more robust model. Once the final tree is determined, each terminal node is classified as either good or bad depending on the majority of observations in that node. Figure 5.1 shows an example of a classification tree used for credit scoring. There are 1000 clients in the beginning (140 of them bad). The procedure splits this sample on particular values of some variables until seven final groups are determined. The upper number in each box represents the number of observations, the lower number stands for the number of bad clients in that node. A 0 or 1 above a box denotes whether the terminal node is classified as one of good or bad clients. The basic monograph explaining CART is Breimann et al. (1983). Thomas (2000) mentions that, in comparison with linear discriminant analysis (LDA), CART is better for predictors with interactions, while LDA is better for predictors with intercorrelations.
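The grow-then-prune procedure can be sketched in a few lines of Python with scikit-learn (synthetic data; neither the software nor the data of this thesis):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier, export_text

    # Grow a tree and prune it back via cost-complexity pruning; each
    # terminal node then classifies by the majority of its observations.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 4))                  # e.g. income, age, ...
    y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=1000) > 1.1).astype(int)

    tree = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X, y)
    print(export_text(tree, feature_names=["x1", "x2", "x3", "x4"]))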

[Figure 5.1 shows a fictive tree: the root node of 1000 clients (140 bad) is split on previous default, marriage, income (with cut-offs such as > $2,970, > $2,680 and > $2,820) and another credit into seven terminal nodes labelled good (0) or bad (1).]

Figure 5.1: Example of a fictive classification tree.


    5.4 Linear Discriminant Analysis

Discriminant analysis is a set of methods for separating subgroups in data using discriminant rules; it is well explained in Härdle and Simar (2002). Many papers on credit scoring also use discriminant analysis to distinguish good and bad clients in the sample, e.g. Back et al. (1996) or Desai et al. (1996). However, the linear discriminant approach assumes metric variables which are normally distributed and a variance matrix that is the same in each of the groups. In credit scoring most of the variables are categorical, the metric variables are usually not normally distributed (see Section 3) and, moreover, there is no reason to assume that the good and bad clients have the same variance matrices (the case of unequal matrices may be handled nonlinearly). From this point of view any usage of discriminant analysis in the credit scoring framework is faulty; however, Desai et al. (1996) state that the empirical performance of linear discriminant analysis is relatively good.
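For illustration, both the linear and the quadratic (unequal covariance) variants can be compared in Python with scikit-learn; the following sketch uses synthetic data with deliberately unequal group covariances (an assumption of the sketch, not the thesis data):

    import numpy as np
    from sklearn.discriminant_analysis import (
        LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis)

    # LDA assumes a common covariance matrix; QDA drops that assumption.
    rng = np.random.default_rng(0)
    good = rng.multivariate_normal([0, 0], [[1.0, 0.3], [0.3, 1.0]], 900)
    bad  = rng.multivariate_normal([1, 1], [[2.0, 0.0], [0.0, 0.5]], 100)
    X = np.vstack([good, bad])
    y = np.r_[np.zeros(900), np.ones(100)]

    print(LinearDiscriminantAnalysis().fit(X, y).score(X, y))
    print(QuadraticDiscriminantAnalysis().fit(X, y).score(X, y))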

    5.5 Panel Data Analysis

Logit and probit models may be extended to panel data analysis. Panel data denote data which have not only a cross-section but also a time dimension. In credit scoring this means that for every client we also have observations over time t. As a rule, panel data contain more information than cross-section or time-series data alone, and therefore we may get more precise results. Panel data analysis for credit scoring is explained in Kaiser and Szczesny (2000a).

    5.6 Hazard Regression

Hazard regression models analyze survival data. In particular, they estimate probabilities that the observed unit stays in the current state, and they solve the problem of censored data. In credit scoring this corresponds to the fact that we either know that the client became delinquent at some time t, or that the client is still creditworthy, which does not necessarily mean that he will not become delinquent at a future time. Kaiser and Szczesny (2000b) describe hazard regression in credit scoring and other related topics in detail.

    5.7 Genetic Algorithms

Genetic algorithms are another class of general optimization schemes based on biological analogies (as artificial neural networks are), described in the early 1990s. They simulate Darwinian evolution: they recombine possible vector solutions (so-called chromosomes) from a space of candidate solutions (a population of chromosomes) and try to maximize some fitness function (which determines how good a chromosome is). The recombination is done via three operators whose names originate from the biological background: reproduction, cross-over and mutation; a toy example is sketched below. Genetic algorithms are mentioned in the analysis of Back et al. (1996) and in Thomas (2000).
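The following Python sketch illustrates the three operators on binary chromosomes whose fitness is simply the number of ones (a deliberately artificial fitness function, not a credit scoring model):

    import numpy as np

    rng = np.random.default_rng(0)
    pop = rng.integers(0, 2, size=(20, 12))          # population of chromosomes

    for generation in range(40):
        fitness = pop.sum(axis=1)                    # fitness function
        parents = pop[np.argsort(fitness)[-10:]]     # reproduction: keep fittest
        pairs = rng.permutation(10).reshape(5, 2)
        children = []
        for a, b in pairs:
            cut = rng.integers(1, 12)                # cross-over point
            children.append(np.r_[parents[a, :cut], parents[b, cut:]])
            children.append(np.r_[parents[b, :cut], parents[a, cut:]])
        children = np.array(children)
        mutate = rng.random(children.shape) < 0.01   # mutation
        children[mutate] ^= 1
        pop = np.vstack([parents, children])

    print(pop.sum(axis=1).max())                     # best fitness found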

    5.8 Linear Programming

Linear programming searches for a cut-off point c and a weight vector (w1, . . . , wp) such that the scalar product w·xi of the weight vector and the vector of observations is above this cut-off point for the non-creditworthy clients and below this point for the creditworthy clients. The sum of errors Σ ei is minimized with respect to the unknown parameters:

    min  e1 + e2 + . . . + e_{nG+nB}

subject to

    w1 xi1 + w2 xi2 + . . . + wp xip ≥ c − ei ,   1 ≤ i ≤ nB ,
    w1 xi1 + w2 xi2 + . . . + wp xip ≤ c + ei ,   nB + 1 ≤ i ≤ nG + nB = N ,
    ei ≥ 0 ,   1 ≤ i ≤ nG + nB ,

where nB is the number of bad and nG the number of good clients. Linear programming for credit scoring is described in Thomas (2000).
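This program can be transcribed into Python with scipy (synthetic data; to exclude the trivial solution w = 0, c = 0, the cut-off is normalized to c = 1, which is one common convention and an assumption of this sketch):

    import numpy as np
    from scipy.optimize import linprog

    rng = np.random.default_rng(0)
    nB, nG, p = 40, 160, 3
    Xb = rng.normal( 1.0, 1.0, size=(nB, p))   # bad clients: scores above c
    Xg = rng.normal(-1.0, 1.0, size=(nG, p))   # good clients: scores below c
    N, c = nB + nG, 1.0

    # decision variables: (w_1..w_p, e_1..e_N); minimize the sum of the e_i
    # bad:  -x_i'w - e_i <= -c   <=>   x_i'w >= c - e_i
    # good:  x_i'w - e_i <=  c   <=>   x_i'w <= c + e_i
    A = np.block([[-Xb, -np.eye(nB), np.zeros((nB, nG))],
                  [ Xg,  np.zeros((nG, nB)), -np.eye(nG)]])
    b = np.r_[-c * np.ones(nB), c * np.ones(nG)]
    obj = np.r_[np.zeros(p), np.ones(N)]
    bounds = [(None, None)] * p + [(0, None)] * N

    res = linprog(obj, A_ub=A, b_ub=b, bounds=bounds)
    print(res.x[:p], res.fun)                  # weight vector, total error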

    5.9 Treed Logits

Chipman et al. (2001) studied the prediction problem in direct marketing. They generalized and conjoined logistic regression and the CART techniques: the space of all observations is partitioned through a binary tree, and in each bottom node a different logit model is fitted (instead of simply using the observed response frequencies). Treed logits are interpretable, small and prevent overfitting.


    6 Summary and Conclusion

Credit scoring represents a set of common techniques used to decide whether a bank should grant a loan (or issue a credit card) to an applicant or not. We have presented several methods and shown their usage in XploRe. The results of these methods are summarized in Table 6.1.

Artificial neural networks are very flexible models; however, they show slightly worse performance than the traditional logistic regression, which misclassifies 10 observations (0.5%) fewer than the best of the neural network models. The logit model misclassifies 98 good clients and denotes them as bad, while 101 of the bad clients are granted the loan, as they are supposed to be creditworthy. Additionally, logistic regression provides statistical tests to identify how important each of the predictor variables is. Our analysis thus showed that the neural network models did not manage to beat the logit model in prediction. However, due to the flat-maximum effect (Lovie and Lovie, 1986), one is unlikely to achieve a great deal of improvement from better statistical modelling on the same set of matching variables.

    misclass.       logit    MLP          MLP          RBF           RBF
                            (23-12-1)    (17-13-1)    (23-100-1)    (17-80-1)
    good              98     105          107          106           103
    bad              101     110          110          109           106
    overall          199     215          217          215           209
    overall in %    9.7%    10.5%        10.6%        10.5%         10.2%

    Table 6.1: Misclassification rates of the methods tested.


    7 Extensions

The objective of most credit scoring models is to minimize the misclassification rate or the expected default rate. However, one should pay more attention to the term misclassification rate itself. Usually, the overall misclassification rates are compared to judge the prediction of various models, but there are two types of misclassification: one may denote a non-creditworthy client as creditworthy or, the other way round, denote a creditworthy client as non-creditworthy. The latter is a loss of profit, but it is not as bad as the former mistake, which means a direct loss for the bank. Therefore the bank is not trying to minimize the misclassification rate, but to maximize its profit. One possible solution would be the implementation of a cost matrix. The preferred method then minimizes the term L = r1 w1 + r2 w2, where r1 and r2 are the numbers of bad applicants classified as creditworthy and of good applicants classified as not creditworthy respectively, and w1, w2 are weights mirroring the loss and the lost profit. These weights have to be estimated. Unfortunately, there are not many papers on this topic yet. Tam and Kiang (1992) show that incorporating the cost matrix into neural networks and discriminant analysis is possible.
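A simple way to sketch this idea in Python is to choose the cut-off on the score scale that minimizes L rather than the raw error count (the weights w1 > w2 below are illustrative placeholders, not estimated values):

    import numpy as np

    # y: 0 = good, 1 = bad; a client is rejected when score > cutoff.
    # r1 = bad accepted (weight w1), r2 = good rejected (weight w2).
    def best_cutoff(scores, y, w1=5.0, w2=1.0):
        scores, y = np.asarray(scores), np.asarray(y)
        grid = np.unique(scores)
        losses = [((y == 1) & (scores <= t)).sum() * w1 +
                  ((y == 0) & (scores >  t)).sum() * w2
                  for t in grid]
        return grid[int(np.argmin(losses))]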

At present the emphasis is shifting from trying to minimize the chance that a client will default on a particular product to looking at how the firm can maximize the profit it can make from that client. As an example we can mention insurance sold on loans (Stanghellini, 1999): a non-creditworthy client may be profitable if he buys the insurance on his loan and becomes delinquent relatively late.

Credit scoring performs better than decisions made by loan officers. However, one bad property of scoring methods is that they are static and estimated at a particular time, so other tools should also be used. Jacobson and Roszbach (1998) showed that banks using credit scoring grant loans inconsistently with default risk minimization; they suggest value at risk (VaR) as a more adequate measure of losses than default risk. Future research should therefore concentrate more on such topics. We will study methods of credit scoring again in the near future, and in more detail, in a diploma thesis at the Charles University in Prague.


    Appendix

    A Radial Basis Function Neural Networks

This section describes radial basis function (RBF) neural networks and explains their implementation in XploRe. In general, artificial neural networks (ANN) appeared in the 1940s, but the progress and practical usage of these theories were enabled only in the last decade, especially with the development of personal computers. Nowadays they are very widely used in many research and commercial fields and may be successfully applied in credit scoring as well. The fashion of neural networks, originating in the analogy with brain nerve cells, gave rise to a new terminology, although the roots of neural networks lie in much older techniques. Table A.1 compares the different names for the same concepts in the artificial neural networks and statistics frameworks:

    statistics                 neural networks
    model                      network
    estimation                 learning
    regression                 supervised learning
    interpolation              generalization
    observations               training set
    parameters                 weights
    independent variables      inputs
    dependent variables        outputs

    Table A.1: Comparison of the neural networks and statistician terminology.

The simplest example of an ANN is the perceptron. Multi-layer perceptrons were described in Section 4.2; for a thorough explanation see Bishop (1995). Radial basis function neural networks are, like the MLPs, feed-forward networks, which means that the signal in the network is passed forward only. But unlike the MLPs, RBF networks stand for a class of neural network models in which the hidden units are activated according to the distance between the input vector and the unit's center. RBF networks combine two different types of learning: supervised and unsupervised. First, during the hidden layer training, the input vectors are joined into several clusters (unsupervised learning); afterwards, during the output layer training, the output of the RBF network is determined by supervised learning.

[Figure A.1 depicts the layout: an input layer x1, . . . , x4, a hidden layer z1, z2, z3 with a bias unit, and an output layer computing y = f(x).]

Figure A.1: RBF scheme for p = 4, q = 1, r = 3. Dashed lines show various weights of the connections.

While for the supervised learning we have both the independent variables and the response variable(s), the unsupervised learning must work without the knowledge of the response variable. This concept will become clearer in the next subsection.

Training an RBF network can be essentially faster than the methods used to train MLP networks. Furthermore, a multi-layer feed-forward network trained with backpropagation does not match the approximating capabilities of RBF networks. The theory of RBF neural networks is therefore still the subject of extensive ongoing research (Orr, 1999). One remarkable feature of RBF neural networks is:

    RBF networks possess the property of best approximation. An approximation scheme has this property if, in the set of approximating functions (i.e. the set of functions corresponding to all possible choices of the adjustable parameters), there is one function which has minimum approximating error for any given function to be approximated. This property is not shared by MLPs. (Bishop, 1995)


    A.1 The Model

Radial basis function neural networks have only one hidden layer. Each of the hidden units (the so-called clusters) implements a radial function, and the output units are weighted sums of the clusters' outputs. This is illustrated in Figure A.1. We suppose that we are given a set {xi}, i = 1, . . . , N, of N observations from a p-dimensional space. That is, we are given p input variables X1, . . . , Xp and, in general, q output variables Y1, . . . , Yq. Let f : R^p → R^q denote the function we want to approximate via the RBF neural network. The output of an RBF neural network with p input units, r clusters and one output unit is

    f(x) = Φ( w0 + Σ_{j=1}^{r} wj φj(x) ),

where w0 is the weight of the bias, wj, j = 1, . . . , r, are the output weights, φj is a radially symmetric function with two parameters cj and σj, and Φ is the output transfer function. During training, the observation points are first joined into r clusters, for instance by a K-means clustering algorithm. Each cluster is represented by a radial function. A radially symmetric function φ(·) is required to fulfill the condition that if ‖xi − c‖ = ‖xj − c‖ then φ(xi) = φ(xj). The norm ‖·‖ is usually taken to be the L2-norm. The best known radially symmetric function is the p-variate Gaussian function:

    φj(x) = exp( −‖x − cj‖² / (2σj²) ) ,   σj > 0 .

Another popular activation function is the generalized inverse multi-quadric function:

    φj(x) = (‖x − cj‖² + σj²)^(−α) ,   σj > 0, α > 0 .

Both of them have the property that φ → 0 as ‖x‖ → ∞. Other possible choices of radially symmetric functions are the thin-plate spline function:

    φj(x) = ‖x − cj‖² ln ‖x − cj‖ ,

or the generalized multi-quadric function:

    φj(x) = (‖x − cj‖² + σj²)^β ,   σj > 0, 1 > β > 0 .

The last two functions have the property that φ → ∞ as ‖x‖ → ∞. The above mentioned radially symmetric functions are plotted in Figure A.2.


[Figure A.2 shows four panels plotting the Gaussian, thin-plate spline, generalized inverse multi-quadric and generalized multi-quadric functions over x ∈ [−5, 5].]

    Figure A.2: Radially symmetric activation functions.

However, both theoretical and empirical studies show that estimation results are relatively insensitive to the exact form of the radially symmetric activation function. The most commonly used radial function is the p-variate Gaussian function. It has not only the attractive property of product separability (it can be rewritten as a product of p univariate functions), but also other useful analytical properties.
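For concreteness, the Gaussian basis defined above can be evaluated for a whole sample in a few lines of Python (a sketch, not the XploRe implementation):

    import numpy as np

    # phi_j(x) = exp(-||x - c_j||^2 / (2 sigma_j^2)) for all inputs at once.
    def gaussian_rbf(X, centers, sigmas):
        # X: (N, p); centers: (r, p); sigmas: (r,) -> activations (N, r)
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-d2 / (2.0 * np.asarray(sigmas) ** 2))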

We must estimate the centers (cj) and widths (σj) of these clusters. The cluster weights are actually the coordinates of the cluster centers. To get the proper number of clusters one may grow the network to a suitable size starting from one cluster, or, alternatively, start with as many clusters as observations and trim the clusters as desired. However, the number of clusters (the number of units in the hidden layer) is typically much smaller than N. After the initial training, when the clusters are already known, one may apply supervised learning to estimate the weights of the output units. The determination of the cluster weights (i.e. the centers of the clusters) and widths at the first stage may be seen as a parametric approach.


The second stage of training may be seen as a nonparametric approach, so that RBF networks constitute a semiparametric procedure.
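The two training stages can be sketched in Python as follows (K-means for the centers, a plain least-squares fit for the output weights; the width heuristic is an assumption of this sketch, and gaussian_rbf refers to the function sketched above):

    import numpy as np
    from scipy.cluster.vq import kmeans2

    def train_rbf(X, y, r=10):
        centers, _ = kmeans2(X, r, minit="++")                  # stage 1
        sigmas = np.full(r, np.ptp(centers) / np.sqrt(2 * r))   # heuristic widths
        Phi = np.c_[np.ones(len(X)), gaussian_rbf(X, centers, sigmas)]
        w, *_ = np.linalg.lstsq(Phi, y, rcond=None)             # stage 2
        return centers, sigmas, w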

Finally, let us mention that the radially symmetric constraint on the activation function is sometimes violated in order to decrease the number of units in the hidden layer or to improve the performance of the network. For instance, the multivariate Gaussian function may be generalized to an elliptically symmetric function by replacing the L2 norm by the Mahalanobis distance (Härdle and Simar, 2002).

    A.2 RBF Neural Networks in XploRe

In this section we briefly describe how to run a RBF neural network in XploRe.

    {inp,net,err} = rbftrain(x,y,clust,learn,epochs,mMSE,activ)

    trains a radial basis function neural network (slow)

    {inp,net,err} = rbftrain2(x,y,clust,learn,epochs,mMSE,activ)

    trains a radial basis function neural network (fast)

The quantlets rbftrain and rbftrain2 build a radial basis function neural network. They use the same algorithm; the only difference is that the former is written directly in XploRe while the latter uses a dynamically linked library written in the C programming language. On account of this, rbftrain works slowly, but allows the user to change the source code directly, to add new features and to see exactly what is happening. On the other hand, rbftrain2 is a closed product, which cannot be changed, but runs fast.

The input parameters x and y are the input and output variables respectively. We assume that x and y have dimensions N × p and N × q respectively. The number of units in the hidden layer is given by the parameter clust; it must be determined by the user, usually with clust much smaller than N. The parameter learn contains the learning rates, one for training the hidden layer and one for training the output weights. Each of these learning rates must be from the range (0, 1). The vector epochs has two rows: the first row is the number of training epochs for the hidden layer and the second row contains the number of epochs to train the output layer. The training is stopped either when the output units have already been trained epochs[2] times or when the mean squared error reaches the value given by the parameter mMSE. The optional input parameter activ

determines whether the bipolar sigmoid function (activ = 1):

    (1 − e^(−W)) / (1 + e^(−W)) ,

should be used instead of the default binary activation function (activ = 0):