Master's Thesis presented to obtain the Master of Science degree:
On Credit Scoring Estimation
by
Karel Komorad
(180174)
Submitted to:
Prof. Dr. Wolfgang Härdle
Institute for Statistics and Econometrics
Humboldt University
Spandauer Str. 1
D-10178 Berlin
Berlin, December 18, 2002
Acknowledgment

At this point I would like to express my gratitude to the people who contributed to the creation of this thesis and who influenced my attitude to statistics and to study in general. In the first place I would like to thank my advisor, Professor Wolfgang Härdle, who inspired me to think on other than the purely mathematical level and gave me the key to connecting the real world and the world of theoretical books. Further, my thanks go to Mgr. Zdeněk Hlávka, Ph.D. for his readiness to help at any time. Without his support this thesis would not exist at all.
The two years I spent at the Institute for Statistics and Econometrics at Humboldt-University in Berlin gave me more experience than any years before, and I am grateful to all colleagues of this institute, especially to Prof. Dr. Bernd Rönz, Axel Werwatz, Ph.D. and Ing. Pavel Čížek, Ph.D., who taught me a lot about the mysteries of programming.
And last, but not least, my gratitude goes to Julka Smoljaninova, who gave me the strength to finish the work once commenced and never stopped trusting.
Declaration of Authorship

I hereby confirm that I have authored this master thesis independently and without the use of resources other than those indicated.
All passages which are taken literally or in substance from publications or other resources are marked as such.
Berlin, December 18, 2002
Karel Komorad
Abstract
Credit scoring methods have become a standard tool of banks and other financial institutions, direct marketing retailers and advertising companies for estimating whether an applicant for credit/goods will pay back his liabilities. In this thesis we give a short overview of credit scoring and its methods. We investigate the usage of some of these methods and their performance on a data set from a French bank. Our results indicate that the methods examined, namely logistic regression, multi-layer perceptron (MLP) and radial basis function (RBF) neural networks, give very similar results; however, the traditional logit model seems to be the best one. We also describe the RBF architecture and a simple RBF program we implemented in the statistical computing environment XploRe.
Contents
1 Introduction 6
2 Credit Scoring in Overview 8
2.1 History . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3 Credit Scoring Today . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3 Data Set Description 12
4 Credit Scoring Methods 27
4.1 Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.2 Multi-layer Perceptron . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.3 Radial Basis Function Neural Network . . . . . . . . . . . . . . . . . . . 38
5 Other Methods 41
5.1 Probit Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.2 Semiparametric Regression . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.3 Classification Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.4 Linear Discriminant Analysis . . . . . . . . . . . . . . . . . . . . . . . . 43
5.5 Panel Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.6 Hazard Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.7 Genetic Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.8 Linear Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.9 Treed Logits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
6 Summary and Conclusion 45
7 Extensions 46
Appendix 47
A Radial Basis Function Neural Networks 47
A.1 The Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
A.2 RBF Neural Networks in XploRe . . . . . . . . . . . . . . . . . . . . . . 51
A.3 An Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
A.4 Detailed Description of the Algorithm . . . . . . . . . . . . . . . . . . . 61
B Suggestions for Improvements in XploRe 65
B.1 grdotd . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
B.2 nnrpredict . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
B.3 nnrnet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
C CD contents 69
Notation
X            random variable
X            random vector
X            data matrix
x_i = (x_{i,1}, . . . , x_{i,p})^⊤    i-th observation of a random vector X ∈ R^p
⊤            transposition
french.dat file name
grdotd quantlet name
rbfnet parameter name
Abbreviations
ANN artificial neural network
Basel II the new Basel Capital Accord
CART classification and regression trees
ECOA Equal Credit Opportunity Act
GLM generalized linear model
GPLM generalized partially linear model
LDA linear discriminant analysis
MSE mean squared error
RBF radial basis function
VaR value at risk
1 Introduction
One of the basic tasks which any financial institution must deal with is to minimize its credit risk. Scoring methods traditionally estimate the creditworthiness of a credit card applicant. They predict the probability that an applicant or existing borrower will default or become delinquent. Credit scoring studies the creditworthiness of:
any of the many forms of commerce under which an individual obtains money or goods or services on condition of a promise to repay the money or to pay for the goods or services, along with a fee (the interest), at some specific future date or dates. (Lewis, 1994)
The statistical methods we will present are based on a particular amount of historical data. Computers make it possible to treat large data sets and to come to a decision more quickly, more cheaply and more reliably than the earlier judgmental assessments made by credit experts or loan officers. If used correctly, these methods also give better predictions.
Bank credit card issuers in the U.S. lose about $1 billion each year to fraud, mostly from stolen cards. They lose another $3 billion in fraudulent bankruptcy filings, and merchants absorb more than $10 billion each year in credit card related fraud 1.
The founders of credit scoring, Bill Fair and Earl Isaac, designed a complete billing system for one of the first credit cards, Carte Blanche, in 1957 and built the first credit scoring system for American Investments one year later. Since then credit scoring has become a broad system for predicting consumer behaviour and has spread into many other areas, e.g. consumer lending (especially on credit cards), mortgage lending, small and medium business lending, direct marketing or advertising. Back et al. (1996) used the same procedures to predict the failure of companies.
Methods used for credit scoring include various statistical procedures: the most commonly used logistic regression and its alternative, probit regression; (linear) discriminant analysis; fashionable artificial neural networks or genetic algorithms; further linear programming, nonparametric classification trees or semiparametric regression.
The aim of this thesis is to give an overview of credit scoring and to compare various statistical methods by using them on a data set from a French bank. We want to compare their ability to distinguish between two subgroups of a highly complicated sample, and to see the performance and computational demands of these procedures.
1http://www.cardweb.com/cardlearn/stat.html
Figure 1.1: Number of credit cards and number of transactions per credit card (per 1000 inhabitants) in the European Union between 1995 and 1999 (http://www.eurofinas.org).
Credit scoring gains new importance when thinking about the New Basel Capital Accord. The so-called Basel II replaces the current 1988 Capital Accord and focuses on techniques that allow banks and supervisors to properly evaluate the various risks that banks face. Thus credit scoring may contribute to the internal assessment process of an institution, which is desirable.
This master thesis is organized as follows: Firstly, we give a short overview of credit scoring, touch on its history, refer to accompanying problems and mention the way credit scoring may go in the future. The data set for the analysis is described in the third section. Section 4 explains the statistical methods used (in particular logistic regression, semiparametric regression, the multi-layer perceptron neural network and the radial basis function neural network) and applies them to the data set. Other methods used for credit scoring are presented in Section 5. A summary of our analyses with the conclusion is given in Section 6. Section 7 remarks on some possible extensions to this topic. The method of radial basis function neural networks, which we programmed in the C language for the purpose of this thesis, is described thoroughly in Appendix A. Appendix B gives some suggestions for improvements in XploRe that we discovered during work on this thesis. Finally, Appendix C lists the files stored on the compact disc enclosed with this thesis.
2 Credit Scoring in Overview
Risk forecasting is topic number one in modern finance. Apart from portfolio management, the pricing of options (and other financial instruments) or bond pricing, credit scoring represents another important set of procedures for estimating and reducing credit risk. It involves techniques that help financial organizations decide whether or not to grant credit to applicants. Basically, credit scoring tries to distinguish two different subgroups in the data sample. The aim is to choose a method which is computable in real time and predicts sufficiently precisely.
There is a vast number of articles treating credit scoring in recent issues of trade publications in the credit and banking area. An exhaustive overview of the literature on credit scoring can be found in Thomas (2000), Mester (1997) or in Kaiser and Szczesny (2000a). Professor D.J. Hand, head of the statistics section in the department of mathematics at Imperial College, has also published many books on this topic.
2.1 History
The statistical techniques used for credit scoring are based on the idea of discrimination
between several groups in a data sample. These procedures originate in the thirties and
forties of the previous century (Fisher, 1936; Durand, 1941). At that time some of the
finance houses and mail order firms were having difficulties with their credit manage-
ment. Decisions whether to give loans or send merchandise to the applicants were made judgmentally by credit analysts. The decision procedure was nonuniform, subjective and opaque; it depended on the rules of each financial house and on the personal and empirical knowledge of each single clerk. With the rising number of people applying for a credit card in the late 1960s it was impossible to rely on credit analysts only; an automated system was necessary. The first consultancy was formed in San Francisco by Bill Fair and Earl Isaac in the late 1950s. Their system spread fast as the financial institutions found out that using credit scoring was cheaper, faster, more objective, and above all much more predictive than any judgmental scheme. It is estimated that default rates dropped by 50% after the implementation of credit scoring (Thomas, 2000). Another advantage of the usage of credit scoring is that it allows lenders to
underwrite and monitor loans without actually meeting the borrower.
The success of credit scoring in credit card issuing was a significant signal for the banks to apply scoring methods to other products like personal loans, mortgage loans,
small business loans etc. However, commercial lending is more heterogeneous, its
documentation is not standardized within or across institutions and thus the results
are not so clear. The growth of direct marketing has led to the use of scorecards to
improve the response rate to advertising campaigns in the 1990s.
2.2 Problems
Most of the problems one must face when using credit scoring are of a technical rather than theoretical nature. First of all, one should think of the data necessary to implement the scoring. They should include as many relevant factors as possible. There is a trade-off between expensive data and low accuracy due to insufficient information.
Banks collect the data from their internal sources (the applicant's previous credit history), from external sources (questionnaires, interviews with the applicant) and from third parties. From the applicant's background the following information is usually collected: age, gender, marital status, nationality, education, number of children, job, income, lease rental charges, etc. The following questions about the applicant's credit history are especially interesting: Does the applicant already have a credit? How much did he borrow? Has the applicant ever delayed his payments? Is he asking for another credit as well? By third parties we understand specialized agencies oriented towards collecting credit information about potential clients.

Figure 2.1: Number of credit cards per 1000 persons in the years 1994 (thin line) and 1999 (thick line) for Belgium, France, Germany, Italy, Sweden and the U.K. (http://www.eurofinas.org).
The variables entering the credit scoring procedures should be chosen carefully, as the amount of data may be vast indeed and thus computationally problematic. For instance, the German central bank (Deutsche Bundesbank) lists about 325,000 units. Since most of the attributes in credit scoring are categorical, imposing dummy variables gives a matrix with several million elements (Enache (1998) mentions 180 variables in his analysis).
Müller et al. (2002) treat a very important feature of credit scoring data: there is usually no information on the performance of rejected customers. This causes bias in the sample. Hand and Henley (1993) concluded that it cannot be overcome unless one can assume a particular relationship between the distributions of the good and bad clients which holds for both the accepted and the rejected applicants. This problem may be solved by some organizations if they accept everybody for a short time. Afterward they can build a scorecard based on the unbiased data sample. However, this is possible only for retailers, mail order firms or advertising companies, not for banks and financial institutions.
The American banks have another problem. The law does not allow the use of information about race, nationality, religion, gender or marital status to build a scorecard. This is stated in the Equal Credit Opportunity Act (ECOA) and in the Consumer Credit Protection Act. Moreover, the attribute age plays a special role: it can be used, as long as people older than 62 years are not discriminated against. Legal fundamentals for the consumer credit business in Germany are given in Schnurr (1997).
2.3 Credit Scoring Today
As mentioned above, credit scoring methods are widely used to estimate and to minimize credit risk. Mail order companies, advertising companies, banks and other financial institutions use these methods to score their clients, applicants and potential customers. There is an effort to make all procedures used to estimate and decrease credit risk more precise. Both the U.S. Federal Home Loan Mortgage Corporation and the U.S. Federal National Mortgage Corporation have encouraged mortgage lenders to use credit scoring, which should provide consistency across underwriters.
International banking supervision also calls for more precise internal assessments by banks: the Basel Committee on Banking Supervision is an international organization which formulates broad supervisory standards and guidelines for banks. It encourages convergence toward common approaches and common standards. The Committee's members come from Belgium, Canada, France, Germany, Italy, Japan, Luxembourg,
the Netherlands, Spain, Sweden, Switzerland, United Kingdom and United States. In
1988, the Committee decided to introduce a capital measurement system (the Basel
Capital Accord). This framework has been progressively introduced not only in mem-
ber countries but also in other countries with active international banks. In June 1999,
the Committee issued a proposal for a New Capital Adequacy Framework to replace
the 1988 Accord (http://www.bis.org). The proposed capital framework consists of
three pillars:
1. minimum capital requirements,
2. supervisory review of internal assessment process and capital adequacy,
3. effective use of disclosure to strengthen market discipline.
The New Basel Capital Accord is to be implemented by 2004. Consequently, Basel II (the New Capital Accord) puts more emphasis on banks' own internal methodologies. Therefore credit scoring and its methods may become a subject of banks' extensive interest, as they will try to make their internal assessments as precise and correct as possible.
3 Data Set Description
In this section we shortly describe the data set used in our analysis and show some of its basic characteristics to give a better insight into the sample. The data set analyzed in this thesis stems from a French bank. However, the source is confidential and therefore the names of all variables have been removed, categorical values have been changed to meaningless symbols and metric variables have been standardized to mean 0 and variance 1. The same data set was used in Müller and Rönz (1999) and Härdle et al. (2001). The original file, french.dat, contains 8830 observations with one response variable, 8 metric and 15 categorical predictor variables. We have
Figure 3.1: Density dot plots for the original metric variables X1, . . . , X8.
"=================================================="
" Variable X5"
"=================================================="
" | Frequency Percent Cumulative "
"--------------------------------------------------"
" -0.537 | 3815 0.617 0.617"
" 0.203 | 934 0.151 0.768"
" 0.943 | 907 0.147 0.915"
" 1.682 | 431 0.070 0.985"
" 2.422 | 93 0.015 1.000"
"--------------------------------------------------"
" | 6180 1.000"
"=================================================="
Table 3.1: Frequency Table of the metric variable X5.
removed observations with the response classified as 9, a class originally used for testing. Since we do not know the real classification, we cannot use this class for the purpose of our analysis. The remaining data contain 6672 observations. In addition, we have changed the classes of the independent categorical variables from 1, . . . , K to 0, . . . , K − 1 and ordered them in accordance with the number of categories. Let Y denote the response variable, X1, . . . , X8, in sequence, the metric variables and X9, . . . , X23 the categorical variables.
The density dot plots 2 in Figure 3.1 show estimated densities of the metric variables and indicate some suspicious outlying values. The problem is that the usual outlier tests assume a normal distribution and that testing for normality is affected by these outliers (Rönz, 1998). Note that the last observations (number 6662 and higher) take the lower extremes in almost all metric variables. Since the metric variables were already standardized, we decided to restrict them to the range [−3, 3] in order to get rid of the outliers. Thereby we get a new subsample containing only 6180 cases. Density dot plots of the metric variables for this data set are shown in Figure 3.2, and we can see that the shapes of the variables' densities improved. Table 3.1 shows the frequencies of the variable X5, which is obviously discrete.
For the purpose of our analysis we have randomly divided the data sample into two subsamples. About two thirds of the whole data set (4135 observations) form the first subsample, the TRAIN sample. It will be used for model estimation. The second subsample, TEST, with 2045 observations, will be used in an overall procedure to compare the particular methods and to check the predictive power of the models used. This will be based on misclassification rates. The subsamples are stored in the files data-train.dat and data-test.dat.
2 For the density dot plots in this section we used the quantlet myGrdotd.xpl, which is a modified version of the faulty original grdotd. For more information see Appendix B.1.
Figure 3.2: Density dot plots for all metric variables as used in the analysis.
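The random split itself can be sketched as follows (an illustration with a stand-in array; only the subsample sizes 4135 and 2045 come from the text):

    import numpy as np

    rng = np.random.default_rng(7)
    data = rng.normal(size=(6180, 24))                 # stand-in for the cleaned sample
    idx = rng.permutation(len(data))
    train, test = data[idx[:4135]], data[idx[4135:]]   # TRAIN (~2/3) and TEST
    print(train.shape, test.shape)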
At this stage it is worth recalling that we do not know anything about the economic interpretation of the predictors. We also do not know what the response variable means (is it a credit card, loan or mortgage application?). And even the coding is primarily unclear: does 1 stand for "the client is creditworthy" or for "the client has some problems with repaying the debt"? The frequencies of the response variable, summarized in Table 3.2, tell us more. Since only about 6% of the data set is classified as 1, we will call this class: clients that have some problems with repaying their liability. Our sample is in this
TRAIN sample TEST sample total
0 3888 (94.0%) 1920 (93.9%) 5808 (94.0%)
1 247 (6.0%) 125 (6.1%) 372 (6.0%)
total 4135 2045 6180
Table 3.2: Frequencies of two response outcomes.
sense unbiased, because the percentage of faulty loans in consumer credit commonly varies between 1% and 7% (Arminger et al., 1997). Note that many credit scoring analyses use data sets with an overrepresented rate of bad loans (West, 2000). For example, Desai et al. (1996) use 3 data sets from 3 credit unions in the Southeastern US which consist of 81.58%, 74.02% and 78.85% good loans respectively. The data set of Fahrmeir et al. (1984) contains 70% good credits. Enache (1998) analyzed 38,000 applications, 16.8% of which were rejected. Cardweb, the U.S. payment card information network (http://www.cardweb.com), mentions that about 78% of U.S. households are considered creditworthy.
Let us now examine the variables in detail. All descriptive statistics and graphics used in this section are computed by the quantlet fr_descr.data.xpl. We start with the metric variables. Tables 3.3 and 3.4 show their basic characteristics for the TRAIN and TEST data sets: minimum, first quartile, median, third quartile, maximum, mean and standard error. These statistics express in numbers our first finding from the density dot plots, namely that the data are extremely right-skewed. Figures 3.3 and 3.4 show box plots for the TRAIN and TEST data sets. The left box plot in each display stands for the good clients and the right one for the bad clients. From the box plots we cannot see any substantial differences between these two groups.
Next, we examine the categorical variables. Frequencies for the dichotomous categorical variables (with two outcomes only) are shown in Table 3.5. Figures 3.5-3.13 show bar charts for the variables X15-X23. The upper displays correspond to the TRAIN data set, the lower displays to the TEST data set. The displays on the left show the outcomes when the response is 0, those on the right when Y = 1. Remarkable is the change in variable X23: while the third category is the most plentiful among the creditworthy clients and the ninth category has only about one half as many observations, among the non-creditworthy clients the ninth category increases its relative number and the third category drops to about two thirds of the ninth category. Characteristics for the variables X15-X23 are summarized in Tables 3.6-3.14.
Figure 3.3: Box plots of metric variables in the TRAIN subsample.
Min. 25% Q. Median 75% Q. Max. Mean Std.Err.
X1 1.519 0.766 0.349 0.403 2.994 0.119 0.892
X2 1.188 0.810 0.307 0.323 2.968 0.122 0.847
X3 0.830 0.695 0.426 0.113 2.940 0.083 0.851
X4 0.825 0.694 0.432 0.223 2.973 0.113 0.816
X5 0.537 0.537 0.537 0.203 2.422 0.019 0.772
X6 0.626 0.363 0.167 0.117 2.962 0.069 0.492
X7 0.302 0.302 0.302 0.138 2.924 0.030 0.408
X8 0.346 0.211 0.211 0.211 2.835 0.106 0.340
Table 3.3: Minimum, 1st quartile, median, 3rd quartile, maximum, mean and
the standard error of metric variables from the TRAIN subsample.
Figure 3.4: Box plots of metric variables in the TEST subsample.
Min. 25% Q. Median 75% Q. Max. Mean Std.Err.
X1 1.519 0.766 0.182 0.487 2.994 0.062 0.892
X2 1.188 0.810 0.307 0.449 2.968 0.072 0.880
X3 0.830 0.695 0.291 0.247 2.940 0.037 0.889
X4 0.825 0.694 0.432 0.223 2.973 0.086 0.832
X5 0.537 0.537 0.537 0.203 2.422 0.012 0.780
X6 0.626 0.375 0.180 0.109 2.858 0.083 0.473
X7 0.302 0.302 0.302 0.138 2.631 0.029 0.404
X8 0.211 0.211 0.211 0.211 2.631 0.099 0.367
Table 3.4: Minimum, 1st quartile, median, 3rd quartile, maximum, mean and
the standard error of metric variables from the TEST subsample.
TRAIN sample TEST sample total
X9: 0 1145 (27.7%) 620 (30.3%) 1765 (28.6%)
X9: 1 2990 (72.3%) 1425 (69.7%) 4415 (71.4%)
X10: 0 708 (17.1%) 361 (17.7%) 1069 (17.3%)
X10: 1 3427 (82.9%) 1684 (82.3%) 5111 (82.7%)
X11: 0 569 (13.8%) 308 (15.1%) 877 (14.2%)
X11: 1 3566 (86.2%) 1737 (84.9%) 5303 (85.8%)
X12: 0 2521 (61.0%) 1286 (62.9%) 3807 (61.6%)
X12: 1 1614 (39.0%) 759 (37.1%) 2373 (38.4%)
X13: 0 217 (5.2%) 100 (4.9%) 317 (5.1%)
X13: 1 3918 (94.8%) 1945 (95.1%) 5863 (94.9%)
X14: 0 3965 (95.9%) 1959 (95.8%) 5924 (95.9%)
X14: 1 170 (4.1%) 86 (4.2%) 256 (4.1%)
Table 3.5: Outcome frequencies of dichotomous categorical variables X9X14.
Figure 3.5: Bar charts of the categorical variable X15.
TRAIN sample TEST sample total
0 1229 (29.7%) 590 (28.9%) 1819 (29.4%)
1 245 (5.9%) 131 (6.4%) 376 (6.1%)
2 2661 (64.4%) 1324 (64.7%) 3985 (64.5%)
Table 3.6: Outcome frequencies of the categorical variable X15.
Figure 3.6: Bar charts of the categorical variable X16.
TRAIN sample TEST sample total
0 3071 (74.3%) 1513 (74.0%) 4584 (74.2%)
1 535 (12.9%) 293 (14.3%) 828 (13.4%)
2 529 (12.8%) 239 (11.7%) 768 (12.4%)
Table 3.7: Outcome frequencies of the categorical variable X16.
Figure 3.7: Bar charts of the categorical variable X17.
TRAIN sample TEST sample total
0 1440 (34.8%) 728 (35.6%) 2168 (35.1%)
1 223 (5.4%) 118 (5.8%) 341 (5.5%)
2 639 (15.5%) 311 (15.2%) 950 (15.4%)
3 1833 (44.3%) 888 (43.4%) 2721 (44.0%)
Table 3.8: Outcome frequencies of the categorical variable X17.
Figure 3.8: Bar charts of the categorical variable X18.
TRAIN sample TEST sample total
0 874 (21.1%) 441 (21.6%) 1315 (21.3%)
1 809 (19.6%) 378 (18.5%) 1187 (19.2%)
2 595 (14.4%) 319 (15.6%) 914 (14.8%)
3 232 (5.6%) 125 (6.1%) 357 (5.8%)
4 120 (2.9%) 77 (3.8%) 197 (3.2%)
5 1505 (36.4%) 705 (34.5%) 2210 (35.8%)
Table 3.9: Outcome frequencies of the categorical variable X18.
Figure 3.9: Bar charts of the categorical variable X19.
TRAIN sample TEST sample total
0 369 (8.9%) 209 (10.2%) 578 (9.4%)
1 634 (15.3%) 295 (14.4%) 929 (15.0%)
2 1093 (26.4%) 548 (26.8%) 1641 (26.6%)
3 255 (6.2%) 132 (6.5%) 387 (6.3%)
4 1145 (27.7%) 552 (27.0%) 1697 (27.5%)
5 639 (15.5%) 309 (15.1%) 948 (15.3%)
Table 3.10: Outcome frequencies of the categorical variable X19.
Figure 3.10: Bar charts of the categorical variable X20.
TRAIN sample TEST sample total
0 710 (17.2%) 370 (18.1%) 1080 (17.5%)
1 267 (6.5%) 151 (7.4%) 418 (6.8%)
2 372 (9.0%) 198 (9.7%) 570 (9.2%)
3 255 (6.2%) 121 (5.9%) 376 (6.1%)
4 60 (1.5%) 25 (1.2%) 85 (1.4%)
5 2471 (59.8%) 1180 (57.7%) 3651 (59.1%)
Table 3.11: Outcome frequencies of the categorical variable X20.
Figure 3.11: Bar charts of the categorical variable X21.
TRAIN sample TEST sample total
0 737 (17.8%) 387 (18.9%) 1124 (18.2%)
1 412 (10.0%) 183 (8.9%) 595 (9.6%)
2 708 (17.1%) 318 (15.6%) 1026 (16.6%)
3 461 (11.1%) 214 (10.5%) 675 (10.9%)
4 1069 (25.9%) 554 (27.1%) 1623 (26.3%)
5 473 (11.4%) 237 (11.6%) 710 (11.5%)
6 275 (6.7%) 152 (7.4%) 427 (6.9%)
Table 3.12: Outcome frequencies of the categorical variable X21.
Figure 3.12: Bar charts of the categorical variable X22.
TRAIN sample TEST sample total
0 383 (9.3%) 198 (9.7%) 581 (9.4%)
1 420 (10.2%) 228 (11.1%) 648 (10.5%)
2 1189 (28.8%) 572 (28.0%) 1761 (28.5%)
3 202 (4.9%) 78 (3.8%) 280 (4.5%)
4 210 (5.1%) 95 (4.6%) 305 (4.9%)
5 317 (7.7%) 175 (8.6%) 492 (8.0%)
6 218 (5.3%) 97 (4.7%) 315 (5.1%)
7 227 (5.5%) 112 (5.5%) 339 (5.5%)
8 143 (3.5%) 69 (3.4%) 212 (3.4%)
9 826 (20.0%) 421 (20.6%) 1247 (20.2%)
Table 3.13: Outcome frequencies of the categorical variable X22.
Figure 3.13: Bar charts of the categorical variable X23.
TRAIN sample TEST sample total
0 279 (6.7%) 145 (7.1%) 424 (6.9%)
1 298 (7.2%) 167 (8.2%) 465 (7.5%)
2 1025 (24.8%) 510 (24.9%) 1535 (24.8%)
3 375 (9.1%) 180 (8.8%) 555 (9.0%)
4 226 (5.5%) 99 (4.8%) 325 (5.3%)
5 104 (2.5%) 59 (2.9%) 163 (2.6%)
6 269 (6.5%) 132 (6.5%) 401 (6.5%)
7 389 (9.4%) 194 (9.5%) 583 (9.4%)
8 561 (13.6%) 263 (12.9%) 824 (13.3%)
9 65 (1.6%) 25 (1.2%) 90 (1.5%)
10 544 (13.2%) 271 (13.3%) 815 (13.2%)
Table 3.14: Outcome frequencies of the categorical variable X23.
4 Credit Scoring Methods
We suppose that we are given a set {x_i}_{i=1}^N of N observations of a random vector X in R^p. That is, there are p independent variables (predictors) X1, . . . , Xp and one dependent variable (response) Y. Each observation is thus a row vector (x_i^⊤, y_i), i = 1, 2, . . . , N, and the whole data set can be written in matrix form:

X = [ x_{1,1} . . . x_{1,p} ; . . . ; x_{N,1} . . . x_{N,p} ],   Y = (y_1, . . . , y_N)^⊤.

In our particular case we have two data sets with N_TRAIN = 4135 and N_TEST = 2045. The dimension of the predictor space is either p = 23, or p = 61 after introducing dummy variables.
4.1 Logistic Regression
Logistic regression stems from a wider class of models: it is a special case of the generalized linear model (GLM). The logistic regression model assumes that the conditional expectation E[Y | X = x] of Y is given in the following way:

E[Y | x] = G(β_0 + x^⊤β) = e^{β_0 + x^⊤β} / (1 + e^{β_0 + x^⊤β}),

where β_0 is the regression constant, β = (β_1, . . . , β_p)^⊤ are the regression coefficients, and G(·) is the logistic cumulative distribution function (also called the link function). The distribution of Y (the Bernoulli distribution) belongs to the exponential family of distributions:

f(y_i, π) = P(Y = y_i | X = x_i) = π^{y_i} (1 − π)^{1−y_i},   i = 1, 2, . . . , N,

where π = P(Y = 1 | x) is the probability that the applicant will not be able to pay his liabilities and 1 − π = P(Y = 0 | x) is the probability of a creditworthy applicant. The parameters of a GLM are, in general, estimated by the maximum likelihood method, i.e. by maximizing the term

L(β_0, β_1, . . . , β_p) = ∏_{i=1}^N π_i^{y_i} (1 − π_i)^{1−y_i} = ∏_{i=1}^N G(β_0 + x_i^⊤β)^{y_i} (1 − G(β_0 + x_i^⊤β))^{1−y_i}.

The estimating procedure is done by the Newton-Raphson algorithm. Computational details are given in Härdle et al. (2000b).
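As an illustration of this estimation step, a minimal Newton-Raphson sketch in Python/NumPy on simulated data (this is not the XploRe implementation used in the thesis):

    import numpy as np

    def logit_fit(X, y, iters=25, tol=1e-8):
        """Newton-Raphson for the logit model; X without the intercept column."""
        X1 = np.column_stack([np.ones(len(X)), X])    # prepend the constant beta_0
        beta = np.zeros(X1.shape[1])
        for _ in range(iters):
            pi = 1.0 / (1.0 + np.exp(-(X1 @ beta)))   # G(beta_0 + x'beta)
            score = X1.T @ (y - pi)                   # gradient of the log likelihood
            info = X1.T @ (X1 * (pi * (1 - pi))[:, None])  # Fisher information
            step = np.linalg.solve(info, score)
            beta += step
            if np.max(np.abs(step)) < tol:
                break
        return beta

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 3))
    p_true = 1 / (1 + np.exp(-(0.5 + X @ np.array([1.0, -2.0, 0.5]))))
    y = (rng.random(500) < p_true).astype(float)
    print(logit_fit(X, y))       # approximately (0.5, 1.0, -2.0, 0.5)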
Variable Coef. Std.Err. t-value    Variable Coef. Std.Err. t-value
Const. -2.747 0.066 -41.811 Const. -2.757 0.066 -41.976
X1 0.191 0.069 2.771 X5 0.093 0.082 1.135
Const. -2.823 0.071 -39.873 Const. -2.754 0.066 -41.861
X2 -0.313 0.088 -3.539 X6 0.035 0.131 0.267
Const. -2.767 0.067 -41.518 Const. -2.830 0.071 -39.635
X3 -0.097 0.082 -1.189 X7 -0.872 0.217 -4.013
Const. -2.752 0.066 -41.721 Const. -2.948 0.099 -29.807
X4 0.047 0.078 0.596 X8 -1.306 0.431 -3.032
Table 4.1: Logistic regression coefficients for each metric variable separately.
Bold parameter values are significant at 5%.
Results
The estimations are based on the TRAIN data set. Since the logistic regression cannot deal with categorical variables directly, we must recode each of the variables X9-X23 with a set of new contrast variables (the first category, 0, is taken as the reference category for each variable). In this manner we get a larger data set with one response, 8 metric and 53 dummy variables (stored in the files fr_train_C.dat and fr_test_C.dat). Let Xa(b) denote the b-th contrast variable for the original variable Xa; e.g. X23 is then recoded into the contrast variables X23(1), X23(2), . . . , X23(10). Firstly we tried to fit the response on each variable separately. Table 4.1 shows the estimation results: the parameters, their standard errors and t-values for the metric variables. X1, X2, X7 and X8 are significant at 5% (and thus emphasized in bold), X3-X6 are insignificant.
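The contrast recoding can be sketched with pandas (a toy data frame below; drop_first=True takes category 0 as the reference, as in the text):

    import pandas as pd

    # toy stand-in: one metric and two categorical predictors
    df = pd.DataFrame({"X1": [0.1, -0.5, 1.2, 0.3],
                       "X9": [0, 1, 1, 0],
                       "X15": [0, 1, 2, 2]})
    # drop_first=True makes category 0 the reference category of each variable;
    # applied to X9-X23 this yields the 53 contrast variables mentioned above
    print(pd.get_dummies(df, columns=["X9", "X15"], drop_first=True))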
Results for the categorical variables estimated separately are given in Table 4.2. Parameters significant at 5% are again emphasized in bold. The significant binary variables are X9, X10 and X14. No variable with more than two categories has significant coefficients for all of its dummy variables. All coefficients for X16 and X17 are insignificant.
Of the significant variables, the categorical variable X10 has the lowest deviance (1802.6). The sequence of the significant variables sorted with respect to the deviance (in increasing order) is: X10, X20, X22, X23, X21, and the corresponding R2 values decrease from 3.7% to 1.5%. This is really a very poor performance. Hence we use
Variable Coef. Std.Err. t-value    Variable Coef. Std.Err. t-value
Const. -2.403 0.107 -22.426 Const. -2.121 0.121 -17.475
X9(1) -0.524 0.136 -3.864 X20(1) -0.392 0.262 -1.496
Const. -1.864 0.110 -16.910 X20(2) -1.048 0.290 -3.613
X10(1) -1.206 0.138 -8.737 X20(3) -0.343 0.263 -1.304
Const. -2.609 0.166 -15.727 X20(4) 0.932 0.328 2.836
X11(1) -0.172 0.181 -0.954 X20(5) -1.024 0.158 -6.481
Const. -2.841 0.087 -32.563 Const. -3.160 0.186 -16.951
X12(1) 0.206 0.132 1.557 X21(1) -0.049 0.316 -0.155
Const. -2.344 0.240 -9.759 X21(2) 0.142 0.258 0.549
X13(1) -0.440 0.250 -1.763 X21(3) 1.008 0.241 4.184
Const. -2.792 0.068 -41.015 X21(4) 0.561 0.222 2.528
X14(1) 0.659 0.258 2.549 X21(5) 0.274 0.277 0.987
Const. -3.140 0.143 -21.952 X21(6) 0.667 0.294 2.271
X15(1) 0.174 0.329 0.528 Const. -3.133 0.255 -12.267
X15(2) 0.540 0.162 3.329 X22(1) -0.096 0.361 -0.266
Const. -2.771 0.077 -36.160 X22(2) 0.678 0.277 2.445
X16(1) 0.255 0.181 1.405 X22(3) -0.354 0.487 -0.726
X16(2) -0.192 0.215 -0.892 X22(4) 0.638 0.365 1.749
Const. -2.833 0.115 -24.628 X22(5) 0.262 0.357 0.735
X17(1) -0.034 0.318 -0.106 X22(6) 0.894 0.343 2.604
X17(2) -0.077 0.213 -0.363 X22(7) -1.590 0.755 -2.107
X17(3) 0.192 0.148 1.297 X22(8) 0.989 0.374 2.645
Const. -3.147 0.170 -18.492 X22(9) 0.255 0.299 0.854
X18(1) 0.568 0.219 2.596 Const. -2.561 0.232 -11.035
X18(2) 0.246 0.251 0.982 X23(1) -0.308 0.346 -0.890
X18(3) 0.891 0.281 3.168 X23(2) -0.811 0.290 -2.794
X18(4) 0.635 0.386 1.645 X23(3) -0.264 0.323 -0.816
X18(5) 0.416 0.201 2.065 X23(4) -0.622 0.412 -1.509
Const. -3.030 0.248 -12.204 X23(5) -0.425 0.514 -0.826
X19(1) 0.636 0.287 2.217 X23(6) 0.092 0.325 0.284
X19(2) 0.253 0.280 0.904 X23(7) -0.161 0.313 -0.513
X19(3) 0.670 0.334 2.009 X23(8) 0.342 0.272 1.257
X19(4) 0.025 0.285 0.086 X23(9) -1.598 1.034 -1.545
X19(5) 0.241 0.301 0.802 X23(10) 0.054 0.283 0.191
Table 4.2: Logistic regression coefficients for each categorical variable sepa-
rately. Bold parameter values are significant at 5%.
Figure 4.1: Predicted probabilities (GLM logistic fit, n = 4135; index η plotted against the link μ and the responses y).
the quantlet glmforward 3 to find a more appropriate model. The quantlet starts with a null model (containing the intercept only) and adds particular variables consecutively. The measure of goodness of the model is Akaike's criterion. In this way we can identify subsets of independent variables that are good predictors of the response. Note that this quantlet must evaluate up to p(p+1)/2 models, which in our case means 62·63/2 = 1953 models, and therefore the computation takes a while. The forward stepwise procedure suggests including the following variables in the model: X1, X2, X5, X7-X10, X12-X14, X16(2), X18(3), X18(5), X19(1), X19(3), X20(2), X20(4), X20(5), X21(3), X21(4), X21(6), X22(2), X22(4), X22(6)-X22(9), X23(2), X23(8), X23(9). The Akaike information criterion of this model is 1643.6. It is remarkable that the variable X23(3) is insignificant, even though it seemed to be quite predictive from Figure 3.13. Table 4.3 shows the estimated parameters, their standard errors and corresponding t-values in this model. Since by using dummy variables one considers all possible effects, the modelling of the categorical variables cannot be further improved, but the influence of the metric variables may be investigated better by using a semiparametric model. Figure 4.1 shows a plot of the index Xβ̂ vs. Y together with a plot of Xβ̂ vs. the link function G(Xβ̂).
3 Files o-logit*.* contain the complete computer output related to this section.
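A minimal sketch of such a forward search by Akaike's criterion in Python/statsmodels (an illustration of the idea, not the actual glmforward implementation):

    import numpy as np
    import statsmodels.api as sm

    def forward_aic(X, y):
        """Greedy forward selection of logit regressors by AIC."""
        remaining, chosen = list(range(X.shape[1])), []
        best = sm.Logit(y, np.ones((len(y), 1))).fit(disp=0).aic   # null model
        improved = True
        while improved and remaining:
            improved = False
            aic, j = min((sm.Logit(y, sm.add_constant(X[:, chosen + [k]]))
                          .fit(disp=0).aic, k) for k in remaining)
            if aic < best:                   # add the variable that lowers AIC most
                best, improved = aic, True
                chosen.append(j)
                remaining.remove(j)
        return chosen, best

    rng = np.random.default_rng(1)
    X = rng.normal(size=(300, 5))
    y = (rng.random(300) < 1 / (1 + np.exp(-(X[:, 0] - X[:, 2])))).astype(int)
    print(forward_aic(X, y))                 # typically selects columns 0 and 2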
Variable Coef. Std.Err. t-value
Const. -2.153 0.3653 -5.8936
X1 0.1952 0.1126 1.7338
X2 -0.3533 0.0967 -3.6518X5 0.1618 0.1047 1.546
X7 -0.8900 0.2198 -4.0488
X8 -0.958 0.4080 -2.3482
X9(1) -0.3269 0.1706 -1.9162
X10(1) -0.9780 0.1519 -6.4356
X12(1) 0.2551 0.1598 1.5961
X13(1) -0.511 0.2678 -1.9091
X14(1) 0.6736 0.2850 2.3636
X16(2) -0.3144 0.2351 -1.3372
X18(3) 0.4531 0.265 1.7056
X18(5) 0.5652 0.1940 2.913
X19(1) 0.4681 0.1768 2.6472
X19(3) 0.5394 0.2546 2.1181
X20(2) -0.8286 0.2982 -2.7789
X20(4) 1.262 0.362 3.4868
X20(5) -0.8635 0.1506 -5.7339
X21(3) 0.7086 0.1949 3.6355
X21(4) 0.2385 0.1687 1.4136
X21(6) 0.5004 0.2640 1.8953
X22(2) 0.6722 0.1839 3.6541
X22(4) 0.7068 0.3161 2.236
X22(6) 0.9538 0.2967 3.2146
X22(7) -1.341 0.6702 -2.0007
X22(8) 0.880 0.4261 2.0666
X22(9) 0.350 0.2182 1.6064
X23(2) -0.6462 0.2001 -3.2293
X23(8) 0.3820 0.1766 2.163
X23(9) -1.552 0.9915 -1.566
Table 4.3: Logistic regression coefficients of the model suggested by
glmforward. Bold parameter values are significant at 5%.
Prediction
We use the model described in the previous paragraph (suggested by the glmforward quantlet) to estimate the outcomes of the TEST data set and to compute the misclassification rate.
Firstly we check the prediction on the TRAIN data set which built the model. Table 4.4 shows the results when observations with a predicted probability higher than 0.5 are assigned to be non-creditworthy clients. For comparison we show the results using prior probabilities in Table 4.5. At first sight, using prior probabilities gives worse results: the overall misclassification rate of 8.95% is higher than the rate when using the 0.5 threshold (6.0%). But in fact, the 0.5 threshold misclassifies more than 93.9% of the bad clients, while prior probabilities misclassify 74.9% of them! The 0.5 threshold gains a better overall misclassification rate due to the low number of misclassified good clients (0.4%); using prior probabilities misclassifies 4.76% of them. From the bank's point of view it is worse to grant a loan to a non-creditworthy applicant than to reject a creditworthy applicant, and thus we will further use prior probabilities to decide on the creditworthiness of a client.
Then we use the model on the testing data, stored in the file data-test_C.dat. Table 4.6 shows the results. In the TEST data set 98 of the 1920 good applicants were denoted as bad, 101 of the 125 bad clients were assigned as good, and thus 199 applicants of the entire TEST sample were misclassified, which results in an overall misclassification rate of about 9.7%.
observed predicted misclass. overall
0 1 misclass.
0 3872 16 0.41% 248
1 232 15 93.93% (6.00%)
Table 4.4: Misclassification rates of the logit model for the TRAIN data set.
observed predicted misclass. overall
0 1 misclass.
0 3703 185 4.76% 370
1 185 62 74.90% (8.95%)
Table 4.5: Misclassification rates of the logit model for the TRAIN data set
(prior probabilities).
observed predicted misclass. overall
0 1 misclass.
0 1822 98 5.10% 199
1 101 24 80.80% (9.73%)
Table 4.6: Misclassification rates of the logit model for the TEST data set
(prior probabilities).
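For clarity, here is how such misclassification tables can be computed for a given probability cutoff (a sketch with simulated scores; the prior-probability rule uses the sample rate of bad clients as the threshold):

    import numpy as np

    def misclass_table(y, p_hat, cutoff):
        """Confusion counts and misclassification rates for one cutoff."""
        pred = (p_hat > cutoff).astype(int)
        n01 = np.sum((y == 0) & (pred == 1))      # good clients predicted bad
        n10 = np.sum((y == 1) & (pred == 0))      # bad clients predicted good
        return (n01 / np.sum(y == 0),             # misclass. rate, good clients
                n10 / np.sum(y == 1),             # misclass. rate, bad clients
                (n01 + n10) / len(y))             # overall misclassification rate

    rng = np.random.default_rng(2)
    y = (rng.random(2000) < 0.06).astype(int)     # about 6% bad, as in the sample
    p_hat = np.clip(0.05 + 0.25 * y + 0.1 * rng.normal(size=2000), 0.0, 1.0)
    print(misclass_table(y, p_hat, 0.5))          # fixed 0.5 threshold
    print(misclass_table(y, p_hat, y.mean()))     # prior-probability cutoff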
Discussion
Since the logit model belongs to the traditional techniques, one finds logistic regression in almost every paper treating credit scoring, at least as a reference method for comparison with other models. In general, logistic regression is easy to fit and works well in practice. However, binary output models (like logit or probit) describe best the output category which occurs most frequently. Therefore each output category should have at least 5% of the observations. If an output class has too small a number of observations, one should use so-called rare event models (Kaiser and Szczesny, 2000a).
4.2 Multi-layer Perceptron
The multi-layer perceptron (MLP) is a simple feed-forward neural network with an input layer, several hidden layers and one output layer. This means that information can only flow forward from the input units to the hidden layer and then to the output unit(s). The MLP network is the most often used architecture of neural networks, and there is a great deal of publications concerning it, see for example Bishop (1995). For the purpose of credit scoring an MLP with one hidden layer and only one or two output units is sufficient. Its basic structure is illustrated in Figure 4.2. The value of the
output unit can be expressed:
f(x) = F2
w(2)0 +
rj=1
w(2)j F1
w(1)j0 +
pi=1
w(1)ji xi
,
where xi are the the input units, w(1)ji and w
(2)j are the weights of the hidden and
output layer respectively and F1 and F2 are the transfer functions from the input to
the hidden layer and from the hidden to the output layer respectively. The transfer
function is usually sigmoid, e.g. logistic function. The parameters for the network are
determined iteratively, commonly via the backpropagation procedure.
Figure 4.2: A multi-layer perceptron network with one hidden layer and one output unit.
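The formula above translates directly into code; a minimal forward-pass sketch in Python/NumPy with logistic F_1 = F_2 (random, untrained weights, purely for illustration):

    import numpy as np

    def sigmoid(t):
        return 1.0 / (1.0 + np.exp(-t))

    def mlp_output(x, W1, b1, w2, b2):
        """Forward pass of a p-r-1 MLP, matching the formula above."""
        z = sigmoid(W1 @ x + b1)       # hidden units: F1(w_j0 + sum_i w_ji x_i)
        return sigmoid(w2 @ z + b2)    # output unit:  F2(w_0 + sum_j w_j z_j)

    rng = np.random.default_rng(3)
    p, r = 23, 12                      # the 23-12-1 architecture used later
    W1, b1 = rng.normal(size=(r, p)), rng.normal(size=r)
    w2, b2 = rng.normal(size=r), rng.normal()
    print(mlp_output(rng.normal(size=p), W1, b1, w2, b2))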
Results
For probabilistic models the softmax transfer function is intended for the outputs. However, we found a faulty usage of this function in XploRe; it is commented on in Appendix B.2. Therefore we used an MLP network with the logistic transfer function for the output, a quadratic least squares error function and no skip connections.
We split the TRAIN sample into two subsamples stored in the files data-nn-1.dat (with 2779 observations) and data-nn-2.dat (with 1356 observations) respectively. The first subsample is used for the actual network training; on it we look for an MLP network with minimal mean squared error (MSE). The latter subsample is used for validation and
observed predicted misclass. overall
0 1 misclass.
0 2579 50 1.9% 110
1 60 90 40.0% (4.0%)
Table 4.7: Misclassification rates of the 23-12-1 MLP network for the small
training sample.
observed predicted misclass. overall
0 1 misclass.
0 1197 62 4.9% 148
1 86 11 88.7 % (10.9%)
Table 4.8: Misclassification rates of the 23-12-1 MLP network for the small
validating sample.
observed predicted misclass. overall
0 1 misclass.
0 2569 60 2.3% 120
1 60 90 40.0% ( 4.3%)
Table 4.9: Misclassification rates of the 17-13-1 MLP network for the small
training sample.
observed predicted misclass. overall
0 1 misclass.
0 1205 54 4.3% 135
1 81 16 83.5% (10.0%)
Table 4.10: Misclassification rates of the 17-13-1 MLP network for the small
validating sample.
it should prevent overfitting, so that the model is not built for one particular sample only.
We computed many models and looked at the MSE and the misclassification rates on the validation data set. The misclassification rates are computed using prior probabilities gained from the small training sample. Finally we chose an MLP network with 12 units in the hidden layer, which reached an MSE of 243.58. The final 23-12-1 MLP network 4 is saved in the files mlp1.*.
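The model search itself can be sketched as follows (scikit-learn instead of XploRe, random stand-in data; only the sample sizes and the idea of comparing validation errors come from the text):

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(6)
    Xtr, ytr = rng.normal(size=(2779, 23)), (rng.random(2779) < 0.06).astype(float)
    Xva, yva = rng.normal(size=(1356, 23)), (rng.random(1356) < 0.06).astype(float)

    def val_sse(r):
        """Train an MLP with r hidden units, return its validation squared error."""
        net = MLPRegressor(hidden_layer_sizes=(r,), activation="logistic",
                           max_iter=500, random_state=0).fit(Xtr, ytr)
        return np.sum((net.predict(Xva) - yva) ** 2)

    print(min((val_sse(r), r) for r in (4, 8, 12, 16)))   # (error, hidden units)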
Tables 4.7 and 4.8 show the misclassification rates which the final MLP network
reaches in the small training and in the validating sample respectively. The resulting
MLP network predicts more than 98% of the good clients correctly. Out of the bad
4Here 23-12-1 denotes 23 input units, 12 units in the hidden layer and 1 output unit.
clients, 40% are misclassified. Altogether there are 4% misclassified clients. The results get slightly worse when using the MLP network on the small validation set: almost 5% of the good clients and more than 88% of the bad clients are misclassified!
A problem with neural networks is the missing procedure for choosing significant input units. Thus we let ourselves be inspired by the logistic regression and restrict the input layer to 17 units only, as suggested in Subsection 4.1 by the quantlet glmforward. That is, we built the network only on the knowledge of the variables X1, X2, X5, X7-X10, X12-X14, X16 and X18-X23. The resulting network has 13 hidden units and an MSE of 310.35. This network is stored in the files mlp2.*. Table 4.9 shows its misclassification rates. This restricted MLP network has a higher misclassification rate for the good clients from the small training sample (2.3%), but its performance on the small validating sample is a little bit better: 4.3% of the good clients and only 83.5% of the bad clients are misclassified (Table 4.10).
Prediction
Before we compute the misclassification rates for the TEST data set, we check the prediction on the whole TRAIN sample which built the model. The results for the 23-12-1 MLP and the 17-13-1 MLP network are shown in Table 4.11 and Table 4.12 respectively. The restricted network has a better prediction for the bad
observed predicted misclass. overall
0 1 misclass.
0 3771 117 3.0% 263
1 146 101 59.1% (6.4%)
Table 4.11: Misclassification rates of the 23-12-1 MLP network for the TRAIN
data set.
observed predicted misclass. overall
0 1 misclass.
0 3751 137 3.5% 275
1 138 109 55.9% (6.7%)
Table 4.12: Misclassification rates of the 17-13-1 MLP network for the TRAIN
data set.
observed predicted misclass. overall
0 1 misclass.
0 1815 105 5.5% 215
1 110 15 88.0% (10.5%)
Table 4.13: Misclassification rates of the 23-12-1 MLP network for the TEST
data set.
observed predicted misclass. overall
0 1 misclass.
0 1813 107 5.6% 217
1 110 15 88.0% (10.6%)
Table 4.14: Misclassification rates of the 17-13-1 MLP network for the TEST
data set.
clients: it misclassifies 8 clients (3.2%) fewer than the full network; on the other side, for the good clients it misclassifies 20 clients (0.5%) more than the full network. Altogether the restricted 17-13-1 MLP network describes the TRAIN data set slightly worse than the full 23-12-1 MLP network: it misclassifies 12 clients more (0.3%).
Note that the number of misclassified clients in the small training sample plus the number of misclassified clients in the small validation sample is not exactly equal to the number of misclassified clients in the TRAIN data set (for neither network: 110 + 148 ≠ 263 and 120 + 135 ≠ 275). This is due to the prior probabilities we used. We denote a client as bad if the neural network's output function is greater than the rate of good clients in the corresponding training sample, that is, 94.6% in the small training sample and 94.0% in the TRAIN data set.
Table 4.13 and Table 4.14 show the final results of predicting the TEST data set. There is no difference in predicting the bad clients. The restricted 17-13-1 MLP network misclassifies only 2 clients more than the full 23-12-1 MLP network. Thus these two neural networks seem to give almost the same results.
Discussion
Neural Networks represent very flexible models with good performance. Although
there are various architectures of neural networks, more than 50% of applications are
using the multi-layer perceptron (MLP) network, which is both simple and well known. Problematic is choosing the number of units in the hidden layer. West (2000) uses an analogy of the forward stepwise procedure in logistic regression: the so-called cascade learning starts with one neuron in the hidden layer and adds further neurons as long as the performance improves. As one can see, logistic regression may be classified as a simple MLP with one processing unit in one hidden layer and the logistic function as the sigmoid activation function.
4.3 Radial Basis Function Neural Network
The radial basis function (RBF) network is another architecture of feed-forward neural networks, which has in principle only one hidden layer. The hidden units are also called clusters, as the observations are clustered to one of the hidden units, which are represented by radially symmetric functions. The weights of the hidden layer then represent the centers of these clusters (the means of the radial functions). In the first stage these centers as well as the deviations of the radial basis functions are to be found (via so-called unsupervised learning). Next, the weights of the output layer are determined via supervised learning (similarly as in the MLP networks). RBF neural networks are explained comprehensively in Appendix A.
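A two-stage training sketch in Python (k-means for the unsupervised stage, least squares for the output weights; a simplification of the algorithm in Appendix A, with a single common width):

    import numpy as np
    from sklearn.cluster import KMeans

    def rbf_fit(X, y, r=20, seed=0):
        """Stage 1: cluster centers (unsupervised); stage 2: output weights."""
        km = KMeans(n_clusters=r, n_init=10, random_state=seed).fit(X)
        centers = km.cluster_centers_
        width = np.mean(np.linalg.norm(X - centers[km.labels_], axis=1)) + 1e-12
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        Phi = np.column_stack([np.ones(len(X)),
                               np.exp(-(dist / width) ** 2)])   # Gaussian units
        w, *_ = np.linalg.lstsq(Phi, y, rcond=None)             # supervised stage
        return centers, width, w

    rng = np.random.default_rng(4)
    X = rng.normal(size=(400, 5))
    y = (X[:, 0] + X[:, 1] ** 2 > 1).astype(float)
    centers, width, w = rbf_fit(X, y)
    print(w[:3])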
observed predicted misclass. overall
0 1 misclass.
0 2505 124 4.7% 248
1 124 26 82.7% (8.9%)
Table 4.15: Misclassification rates of the 23-100-1 RBF network for the small
training sample.
observed predicted misclass. overall
0 1 misclass.
0 1200 59 4.7% 142
1 83 14 85.6% (10.5%)
Table 4.16: Misclassification rates of the 23-100-1 RBF network for the small
validation sample.
Results
The RBF neural network is built in the same way as the MLP network. We trained the network on the small training sample with 2779 observations (data-nn-1.dat) in order to minimize the mean squared error and at the same time checked for overfitting by evaluating the model on the small validating sample (data-nn-2.dat).
In the same manner as with the MLP network we tried many different networks with various learning parameters and different numbers of hidden units, till we got optimal results. There are again two models, one with 23 input units (the full model) and another one with only 17 inputs, as suggested in Section 4.1.
The first model uses 100 clusters; we denote it the 23-100-1 RBF neural network and save it in rbf1.rbf. The latter contains only 80 clusters and we denote it the 17-80-1 RBF network (stored in rbf2.rbf). Misclassification rates for the 23-100-1 RBF network are given in Table 4.15 for the small training sample and in Table 4.16 for the validating sample. The 17-80-1 RBF neural network's misclassification rates are shown in Tables 4.17 and 4.18. In comparison with the MLP networks, both the 23-100-1 and the 17-80-1 RBF network misclassify almost twice as much in the small training sample as the MLP does. However, the misclassification rates in the validation sample are reasonable: 10.5% and 10.8% of overall misclassified observations.
observed predicted misclass. overall
0 1 misclass.
0 2510 119 4.5% 238
1 119 31 79.3% (8.6%)
Table 4.17: Misclassification rates of the 17-80-1 RBF network for the small
training sample.
observed predicted misclass. overall
0 1 misclass.
0 1198 61 4.8% 146
1 85 12 87.6% (10.8%)
Table 4.18: Misclassification rates of the 17-80-1 RBF network for the small
validation sample.
Prediction
Table 4.19 shows the results of the 23-100-1 RBF neural network. It misclassifies exactly the same number of observations as the 23-12-1 MLP network: 10.5%. However, the RBF network predicts the bad applicants slightly better: there are 109 misclassifications (87.2%). Of the good applicants, 106 (5.5%) were misclassified. Results for the 17-80-1 RBF network are shown in Table 4.20. It performs even a little bit better: 106 of the bad and 103 of the good applicants were misclassified, which is altogether 209 (10.2%) misclassified applicants.
observed predicted misclass. overall
0 1 misclass.
0 1814 106 5.5% 215
1 109 16 87.2% (10.5%)
Table 4.19: Misclassification rates of the 23-100-1 RBF network for the TEST
data set.
observed predicted misclass. overall
0 1 misclass.
0 1817 103 5.4% 209
1 106 19 84.8% (10.2%)
Table 4.20: Misclassification rates of the 17-80-1 RBF network for the TEST
data set.
Discussion
Radial basis function neural networks are supposed to give better predictions than the MLP networks. That is true indeed; however, in our case their performance is only slightly better (8 misclassified observations fewer than with the MLP). Due to the unsupervised learning, the computation of RBF networks proceeds very fast. One may notice the misclassification rates of the small training sample: while the MLP networks tend to overfit, the results from the RBF networks are more robust; the misclassification rates between the small training and the validation sample differ only by about 2 percentage points.
5 Other Methods
In the vast amount of publications treating credit scoring there are many other techniques which may also be used. This section gives a short overview of some of these methods and refers to further literature.
5.1 Probit Regression
The probit model, another variant of the generalized linear models, is sometimes used as an alternative to the logit model. It is derived by letting the link function be the standard normal distribution function. Since the logistic link function closely approximates that of a normal random variable, the results are similar.
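This closeness is easy to verify numerically: after rescaling the argument by about 1.70, the logistic distribution function nearly coincides with the standard normal one (a small check in Python/SciPy):

    import numpy as np
    from scipy.stats import norm

    x = np.linspace(-4, 4, 801)
    logistic = 1.0 / (1.0 + np.exp(-1.702 * x))    # rescaled logistic cdf
    print(np.max(np.abs(logistic - norm.cdf(x))))  # about 0.0095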
5.2 Semiparametric Regression
In order to give more attention to the metric variables, one may estimate their influence
nonparametrically. Hardle et al. (2000a) show that semiparametric methods perform better
than the logistic regression. The resulting model then consists of a linear and a
nonlinear part:

    E[Y|X] = E[Y|(V, W)] = G(βᵀV + m(W)),

where β = (β1, . . . , βv)ᵀ is the parameter vector for the categorical variables and m(·) is
a smooth real function which can be estimated nonparametrically. In practice one includes
in the nonparametric part only those continuous explanatory variables which
have the most influence on the dependent variable Y. Estimators for β and m(·) are
computed by semiparametric maximum likelihood (this is reviewed in Hardle et al.
(2000b)).
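As a rough illustration of this model structure only (not the semiparametric maximum
likelihood of Hardle et al. (2000a), which is more refined), one can approximate m(W)
by a step function over quantile bins of a single continuous variable W and estimate it
jointly with β by ordinary logit maximum likelihood; the sketch below uses Python with
the statsmodels package, and all names are hypothetical:

import numpy as np
import statsmodels.api as sm

def fit_partial_logit(V, W, y, n_bins=10):
    # crude stand-in for G(beta'V + m(W)): m is approximated by a step
    # function over quantile bins of W, so its values enter the logit
    # fit as coefficients of bin dummies
    edges = np.quantile(W, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.searchsorted(edges, W)          # bin index of each observation
    D = np.eye(n_bins)[bins]                  # dummy matrix representing m(W)
    X = np.hstack([V, D])                     # no intercept: absorbed into m
    res = sm.Logit(y, X).fit(disp=0)
    beta = res.params[:V.shape[1]]            # linear (categorical) part
    m_values = res.params[V.shape[1]:]        # step-function estimate of m
    return beta, m_values, edges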
5.3 Classification Trees
Classification trees (usually subsumed under the more general name classification and
regression trees, CART) are a nonparametric method to analyze a categorical dependent
variable as a function of metric and/or categorical explanatory variables. They have
become a standard tool for developing credit scoring systems, since they are easily
interpretable and may be presented graphically.
CART uses a recursive partitioning (divide-and-conquer) algorithm: the basic idea is
to split the sample into two subsamples, each of which contains as far as possible
only cases from one response category, and then to repeatedly split each of the
subsamples.
The subsamples are called nodes; the entire sample is called the root node.
First, one looks for the explanatory variable which splits the sample in a node into
two subgroups in such a way that these child nodes are internally as homogeneous as
possible and differ from each other as much as possible. The splitting rules are chosen
according to some statistical criterion. Usually the tree is grown until it overfits and
is afterward pruned back to get a more robust model. Once the final tree is
determined, each terminal node is classified as either "good" or "bad" depending
on the majority of observations in that node. Figure 5.1 shows an example of a clas-
sification tree used for credit scoring. There are 1000 clients in the beginning (140 of
them bad). The procedure splits this sample according to particular values of some
variables until seven final groups are determined. The upper number in each box rep-
resents the number of observations, the lower number stands for the number of bad
clients in this node. A 0 or 1 above a box denotes a terminal node of good or bad
clients, respectively. The basic monograph explaining CART is Breiman
et al. (1983). Thomas (2000) mentions that in comparison with linear discriminant
analysis (LDA), CART is better for predictors with interactions, while LDA is better
for predictors with intercorrelations. A small illustrative sketch of growing and
pruning such a tree follows after Figure 5.1.
[Figure 5.1: Example of a fictive classification tree. The root node of 1000 clients
(140 bad) is split on previous default, marriage, income (cut-offs $2,680, $2,820 and
$2,970) and another credit into seven terminal nodes, each labeled 0 (good) or 1 (bad).]
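The following sketch (Python with scikit-learn, used here only for illustration; the
thesis works in XploRe, and the toy data and feature names are invented) grows a tree
and prunes it back via cost-complexity pruning, the grow-then-prune strategy described
above:

import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                      # toy applicant data
y = (X[:, 0] + X[:, 1] * X[:, 2] + rng.normal(size=1000) > 1.0).astype(int)

# grow a large tree, then prune it back with cost-complexity pruning;
# a larger ccp_alpha means heavier pruning (a tuning parameter)
tree = DecisionTreeClassifier(ccp_alpha=0.005, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["income", "age", "amount"]))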
5.4 Linear Discriminant Analysis
Discriminant analysis is a set of methods for separating subgroups in data using
discriminant rules; it is well explained in Hardle and Simar (2002). Many papers
treating credit scoring also use discriminant analysis to distinguish "good" and "bad"
clients in the sample, e.g. Back et al. (1996) or Desai et al. (1996). However, the
linear discriminant approach assumes metric variables which are normally distributed
and a covariance matrix that is the same in each of the groups. In credit scoring most
of the variables are categorical, the metric variables are usually not normally distributed
(see Section 3), and moreover there is no reason to assume that the good and
bad clients have the same covariance matrices (the case of unequal matrices may be
handled by a nonlinear, quadratic rule). From this point of view any usage of linear
discriminant analysis in the credit scoring framework is questionable; however, Desai
et al. (1996) state that the empirical
performance of linear discriminant analysis is relatively good.
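A minimal sketch of both variants (Python with scikit-learn, for illustration only;
the toy data, with deliberately unequal group covariances, are not the thesis data set)
shows how relaxing the equal-covariance assumption turns the linear rule into a
quadratic one:

import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis)

rng = np.random.default_rng(0)
# two groups with deliberately unequal covariance matrices
X_good = rng.multivariate_normal([0, 0], [[1.0, 0.3], [0.3, 1.0]], 500)
X_bad = rng.multivariate_normal([1, 1], [[2.0, -0.5], [-0.5, 0.5]], 100)
X = np.vstack([X_good, X_bad])
y = np.r_[np.zeros(500), np.ones(100)]

lda = LinearDiscriminantAnalysis().fit(X, y)     # pooled covariance assumed
qda = QuadraticDiscriminantAnalysis().fit(X, y)  # group-specific covariances
print(lda.score(X, y), qda.score(X, y))          # in-sample accuracy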
5.5 Panel Data Analysis
Logit and probit models may be extended to panel data analysis. Panel data are data
which have not only a cross-section but also a time dimension. In credit scoring this
means that for every client we also have observations over time t. As a rule, panel
data contain more information than pure cross-section or time-series data, and we may
therefore get more precise results. Panel data analysis for credit scoring is
explained in Kaiser and Szczesny (2000a).
5.6 Hazard Regression
Hazard regression models analyze survival data. In particular, they estimate the
probability that the observed unit stays in its current state, and they solve the
problem of censored data. In credit scoring this corresponds to the fact that we either
know that the client became delinquent at some time t, or that the client is still
creditworthy, which does not necessarily mean that he will not become delinquent at a
future time. Kaiser and Szczesny (2000b) describe hazard regression in credit scoring
and other related topics in detail.
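A minimal sketch of one such model, a Cox proportional hazards regression, is given
below (fitted with the third-party Python package lifelines purely for illustration;
the toy loan histories are invented):

import pandas as pd
from lifelines import CoxPHFitter

# T: months on book; E: 1 = default observed, 0 = still creditworthy
# at the end of the observation window (a censored observation)
df = pd.DataFrame({
    "T": [12, 30, 7, 24, 18, 36],
    "E": [1, 0, 1, 0, 1, 0],
    "income": [2.1, 3.4, 1.8, 2.9, 3.0, 2.5],
})
cph = CoxPHFitter().fit(df, duration_col="T", event_col="E")
cph.print_summary()   # hazard ratios, confidence intervals, etc.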
5.7 Genetic Algorithms
Genetic algorithms are another general optimization scheme based on biological analo-
gies (as artificial neural networks are), described in the early 1990s. They simulate
Darwinian evolution. In essence they recombine possible solution vectors (so-called
chromosomes) from a space of candidate solutions (a population of chromosomes) and
try to maximize some fitness function (which determines how good a chromosome is). The
recombination is done via three operators whose names originate from the biological
background: reproduction, cross-over and mutation. Genetic algorithms are mentioned
in the analysis of Back et al. (1996) and in Thomas (2000).
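A toy sketch of the three operators on binary chromosomes (plain Python with NumPy,
for illustration only; a serious implementation would add elitism and tuning of the
rates, and the fitness function is assumed to return positive values):

import numpy as np

def genetic_maximize(fitness, n_genes, pop_size=50, generations=100,
                     p_mut=0.01, seed=0):
    rng = np.random.default_rng(seed)
    pop = rng.integers(0, 2, size=(pop_size, n_genes))
    for _ in range(generations):
        fit = np.array([fitness(c) for c in pop])
        # reproduction: fitness-proportional selection of parents
        parents = pop[rng.choice(pop_size, pop_size, p=fit / fit.sum())]
        # cross-over: swap the tails of consecutive parent pairs
        for i in range(0, pop_size - 1, 2):
            cut = rng.integers(1, n_genes)
            tail = parents[i, cut:].copy()
            parents[i, cut:] = parents[i + 1, cut:]
            parents[i + 1, cut:] = tail
        # mutation: flip each gene with a small probability
        flip = rng.random(parents.shape) < p_mut
        pop = np.where(flip, 1 - parents, parents)
    return pop[np.argmax([fitness(c) for c in pop])]

# example: maximize the number of ones in a 20-gene chromosome
print(genetic_maximize(lambda c: 1 + c.sum(), n_genes=20))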
5.8 Linear Programming
Linear programming searches for a cut-off point c and a weight vector w = (w1, . . . , wp)
such that the scalar product wᵀxi of the weight vector and the vector of observations
lies above this cut-off point for the non-creditworthy clients and below it for the
creditworthy clients. The sum of errors Σ ei is minimized with respect to the unknown
parameters:

    min  e1 + e2 + · · · + e_{nG+nB}

subject to

    w1 xi1 + w2 xi2 + · · · + wp xip ≥ c − ei ,   1 ≤ i ≤ nB ,
    w1 xi1 + w2 xi2 + · · · + wp xip ≤ c + ei ,   nB + 1 ≤ i ≤ nG + nB = N ,
    ei ≥ 0 ,   1 ≤ i ≤ nG + nB ,

where nB is the number of bad and nG the number of good clients. Linear pro-
gramming for credit scoring is described in Thomas (2000). A sketch of this program in
a modern solver follows below.
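The following sketch sets up exactly this linear program with SciPy (Python, used here
only for illustration; note that the cut-off is fixed at c = 1 as one common
normalization, since otherwise w = 0, c = 0, e = 0 would be a trivial optimum):

import numpy as np
from scipy.optimize import linprog

def lp_scorecard(X_bad, X_good, c=1.0):
    # decision variables: p weights w followed by N slack errors e
    nB, p = X_bad.shape
    nG = X_good.shape[0]
    N = nB + nG
    obj = np.concatenate([np.zeros(p), np.ones(N)])    # min sum of e_i
    # bad clients:  w'x_i >= c - e_i   <=>  -w'x_i - e_i <= -c
    A_bad = np.hstack([-X_bad, -np.eye(N)[:nB]])
    # good clients: w'x_i <= c + e_i   <=>   w'x_i - e_i <=  c
    A_good = np.hstack([X_good, -np.eye(N)[nB:]])
    A_ub = np.vstack([A_bad, A_good])
    b_ub = np.concatenate([-c * np.ones(nB), c * np.ones(nG)])
    bounds = [(None, None)] * p + [(0, None)] * N      # w free, e_i >= 0
    res = linprog(obj, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:p], res.x[p:]                        # weights, errors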
5.9 Treed Logits
Chipman et al. (2001) studied the prediction problem in direct marketing. They gen-
eralized and combined logistic regression and the CART technique. The space of all
observations is partitioned by a binary tree, and in each bottom node a different logit
model is fitted (instead of simply using the observed response frequencies). Treed
logits are interpretable, small, and prevent overfitting.
6 Summary and Conclusion
Credit scoring represents a set of common techniques used to decide whether a bank
should grant a loan (issue a credit card) to an applicant or not. We have presented
several methods and shown their usage in XploRe. The results of these methods are
summarized in Table 6.1.
Artificial neural networks are very flexible models; however, they provide slightly
worse performance than the traditional logistic regression, which misclassifies 10
observations (0.5 percentage points) fewer than the best of the neural network models.
Logit misclassifies 98 good clients, denoting them as bad, while 101 of the bad clients
are granted the loan, as they are supposed to be creditworthy. Additionally, logistic
regression provides statistical tests to identify how important each of the predictor
variables is. Our analysis thus showed that the neural network models did not manage
to beat logit in prediction. However, due to the flat-maximum effect (Lovie and Lovie,
1986) one is unlikely to achieve a great deal of improvement from better statistical
modelling on the same set of matching variables.
misclass.       logit    MLP          MLP          RBF           RBF
                         (23-12-1)    (17-13-1)    (23-100-1)    (17-80-1)
good              98     105          107          106           103
bad              101     110          110          109           106
overall          199     215          217          215           209
overall in %    9.7%     10.5%        10.6%        10.5%         10.2%

Table 6.1: Misclassification rates of the methods tested.
7 Extensions
The objective of most credit scoring models is to minimize the misclassification rate
or the expected default rate. However, one should pay more attention to the term
misclassification rate itself. Usually the overall misclassification rates are compared
to judge the predictions of various models, but there are two types of misclas-
sification. First, one may denote a non-creditworthy client as creditworthy; the
other way round, it is possible to denote a creditworthy client as non-creditworthy. The
latter is a loss of profit, but it is not as bad as the former mistake, which means a direct
loss for the bank. Therefore the bank is not trying to minimize the misclassification
rate, but to maximize its profit. One possible solution would be the implementation of a
cost matrix. The preferred method then minimizes the term L = r1 w1 + r2 w2, where r1
and r2 are the numbers of bad applicants classified as creditworthy and of good
applicants classified as non-creditworthy, respectively, and w1, w2 are weights mirroring
the loss and the lost profit. These weights have to be estimated. Unfortunately, there
are not many papers on this topic yet. Tam and Kiang (1992) show that incorporating the
cost matrix into neural networks and discriminant analysis is possible.
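Once w1 and w2 have been chosen, applying the criterion is straightforward: for a set
of estimated default probabilities one picks the cut-off minimizing L on a validation
sample. A small sketch (Python, illustrative only; the weights 5 and 1 and the names
p_hat and y_val are placeholders, not estimates from the thesis):

import numpy as np

def cost_weighted_loss(p_hat, y, cutoff, w1=5.0, w2=1.0):
    # accept an applicant when the estimated default probability is low
    accept = p_hat < cutoff
    r1 = np.sum(accept & (y == 1))    # bad applicants accepted (direct loss)
    r2 = np.sum(~accept & (y == 0))   # good applicants rejected (lost profit)
    return r1 * w1 + r2 * w2

# choose the cut-off minimizing L over a grid, on a validation sample:
# best = min(np.linspace(0, 1, 101),
#            key=lambda t: cost_weighted_loss(p_hat, y_val, t))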
At present the emphasis is shifting from trying to minimize the chance that a client
will default on any particular product to looking at how the firm can maximize the
profit it makes from that client. As an example we can mention insurance sold on loans
(Stanghellini, 1999): a non-creditworthy client may still be profitable if he buys the
insurance on his loan and becomes delinquent relatively late.
Credit scoring performs better than decisions made by loan officers. However, one bad
property of scoring methods is that they are static, estimated at a particular point in
time, and other tools should be used alongside them. Jacobson and Roszbach
(1998) showed that banks using credit scoring grant loans in a way that is inconsistent
with default risk minimization. They suggest value at risk (VaR) as a more adequate
measure of losses than the default risk, so future research should concentrate more on
such topics. We will study methods of credit scoring again in the near future, and in
more detail, in a diploma thesis at Charles University in Prague.
Appendix
A Radial Basis Function Neural Networks
This section describes radial basis function (RBF) neural networks and explains
their implementation in XploRe. In general, artificial neural networks (ANN)
appeared in the 1940s, but the progress and widespread usage of these theories became
possible only in the last decade, especially with the development of personal computers.
Nowadays they are very widely used in many research and commercial fields and may
be successfully applied in credit scoring as well. Since neural networks are fashioned
after brain nerve cells, a new terminology has grown up around them, although their
roots stretch back to much older techniques. Table A.1 shows the basic terminology,
with different names for the same concepts in the artificial neural network and
statistical frameworks:
statistics                neural networks
model                     network
estimation                learning
regression                supervised learning
interpolation             generalization
observations              training set
parameters                weights
independent variables     inputs
dependent variables       outputs

Table A.1: Comparison of the neural network and statistical terminology.
The simplest example of an ANN is the perceptron. Multi-layer perceptrons were
described in Section 4.2; for a thorough explanation see Bishop (1995). Radial
basis function neural networks are, like the MLPs, feed-forward networks, which
means that the signal in the network is passed forward only. But unlike
the MLPs, RBF networks stand for a class of neural network models in which
a hidden unit is activated according to the distance between the input vector and the
unit's prototype (cluster center). RBF networks combine two different types of
learning, supervised and unsupervised: first, during the hidden layer training, the
input vectors are joined into several clusters (unsupervised learning); afterward,
during the output layer training, the output of the RBF network is determined by
supervised learning.
[Figure A.1: RBF scheme for p = 4, q = 1, r = 3: input units x1, . . . , x4, hidden
units z1, z2, z3 with a bias term, and output y = f(x). Dashed lines show the various
weights w of the connections.]
While for supervised learning we have both the independent variables and the response
variable(s), unsupervised learning must work without knowledge of the response. This
concept will become clearer in the next subsection.
Training an RBF network can be essentially faster than the methods used to
train MLP networks. Furthermore, a multi-layer feed-forward network trained with
backpropagation does not match the approximation capabilities of RBF networks.
The theory of RBF neural networks is therefore still the subject of extensive
ongoing research (Orr, 1999). One remarkable feature of RBF neural networks
is:
RBF networks possess the property of best approximation. An ap-
proximation scheme has this property if, in the set of approximating
functions (i.e. the set of functions corresponding to all possible choices
of the adjustable parameters), there is one function which has minimum
approximating error for any given function to be approximated. This
property is not shared by MLPs. (Bishop, 1995)
A.1 The Model
Radial basis function neural networks have one hidden layer only. Each of the
hidden units (the so-called clusters) implements a radial function. The output
units are weighted sums of the clusters' outputs. This is illustrated in Figure A.1.
We suppose that we are given a set {xi, i = 1, . . . , N} of N observations from a p-
dimensional space. That is, we are given p input variables X1, . . . , Xp and, in
general, q output variables Y1, . . . , Yq. Let f : Rᵖ → Rᑫ denote the function we
want to approximate via the RBF neural network. The output of an RBF neural
network with p input units, r clusters and one output unit is

    f(x) = Ψ( w0 + Σ_{j=1}^{r} wj φj(x) ),

where w0 is the weight of the bias, wj, j = 1, . . . , r, are the output weights, φj is a
radially symmetric function with two parameters cj, σj, and Ψ is the output transfer
function. During training, the observation points are first joined into r clusters,
for instance by a K-means clustering algorithm, and each cluster is represented
by a radial function. A radially symmetric function φ(·) is required to fulfill the
condition that if ‖xi‖ = ‖xj‖ then φ(xi) = φ(xj); the norm ‖·‖ is usually
taken to be the L2-norm. The best-known radially symmetric function is the p-variate
Gaussian function:

    φj(x) = exp( −‖x − cj‖² / (2σj²) ),   σj > 0 .

Another popular activation function is the generalized inverse multi-quadric func-
tion:

    φj(x) = (‖x − cj‖² + σj²)^(−β),   σj > 0, β > 0 .

Both of them have the property that φ → 0 as ‖x‖ → ∞. Other possible choices
of radially symmetric functions are the thin-plate spline function:

    φj(x) = ‖x − cj‖² ln ‖x − cj‖ ,

and the generalized multi-quadric function:

    φj(x) = (‖x − cj‖² + σj²)^β,   σj > 0, 1 > β > 0 .

The last two functions have the property that φ → ∞ as ‖x‖ → ∞. All four
radially symmetric functions mentioned above are plotted in Figure A.2.
[Figure A.2: The radially symmetric activation functions (Gaussian, thin-plate spline,
generalized inverse multi-quadric, generalized multi-quadric), each plotted for x in
(-5, 5).]
However, both theoretical and empirical studies show that the estimation results are
relatively insensitive to the exact form of the radially symmetric activation function.
The most commonly used radial function is the p-variate Gaussian function: it has
not only the attractive property of product separability (it can be rewritten as
a product of p univariate functions), but also other useful analytical properties.
We must estimate the centers cj and widths σj of the clusters; the cluster
weights are in fact the coordinates of the cluster centers. To get a proper number
of clusters one may grow the network to a suitable size starting from one cluster
or, alternatively, start with as many clusters as observations and trim the
clusters as desired. In any case, the number of clusters (the number of units in the
hidden layer) is typically much smaller than N. After the initial training, when the
clusters are already known, one may apply supervised learning to estimate the
weights of the output units. The determination of the cluster weights (i.e. the centers
of the clusters) and the widths at the first stage may be seen as a parametric approach.
The second stage of training may be seen as a nonparametric approach, and
RBF networks therefore take the place of a semiparametric procedure.
Finally, let us mention that the radially symmetric constraint on the activa-
tion function is sometimes violated in order to decrease the number of units in
the hidden layer or to improve the performance of the network. For instance, the
multivariate Gaussian function may be generalized to an elliptically symmetric
function by replacing the L2 norm with the Mahalanobis distance (Hardle and
Simar, 2002).
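To make the two-stage procedure concrete, here is a minimal sketch in Python/NumPy
(for illustration only; it is not the XploRe implementation described below, since the
output weights here are fitted by a direct least-squares step with an identity transfer
function Ψ, whereas the quantlets train them iteratively with learning rates):

import numpy as np
from scipy.cluster.vq import kmeans2

def gaussian_design(X, centers, sigmas):
    # phi_j(x) = exp(-||x - c_j||^2 / (2 sigma_j^2)), plus a constant
    # column whose coefficient plays the role of the bias weight w_0
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    Phi = np.exp(-d2 / (2.0 * sigmas ** 2))
    return np.hstack([np.ones((len(X), 1)), Phi])

def rbf_fit(X, y, r=10, seed=0):
    # stage 1 (unsupervised): K-means gives the cluster centers c_j;
    # the width sigma_j is set to the mean distance of a cluster's
    # points from its center (one simple heuristic among many)
    centers, labels = kmeans2(X, r, minit="++", seed=seed)
    sigmas = np.empty(r)
    for j in range(r):
        d = np.linalg.norm(X[labels == j] - centers[j], axis=1)
        sigmas[j] = d.mean() if d.size else 1.0   # guard: empty cluster
    sigmas = np.maximum(sigmas, 1e-6)             # guard: zero width
    # stage 2 (supervised): output weights w_0, ..., w_r by least squares
    w, *_ = np.linalg.lstsq(gaussian_design(X, centers, sigmas), y,
                            rcond=None)
    return centers, sigmas, w

def rbf_predict(X, centers, sigmas, w):
    return gaussian_design(X, centers, sigmas) @ w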
A.2 RBF Neural Networks in XploRe
In this section we briefly describe how to run an RBF neural network in XploRe.
{inp,net,err} = rbftrain(x,y,clust,learn,epochs,mMSE,activ)
trains a radial basis function neural network (slow)
{inp,net,err} = rbftrain2(x,y,clust,learn,epochs,mMSE,activ)
trains a radial basis function neural network (fast)
The quantlets rbftrain and rbftrain2 build a radial basis function neural net-
work. They use the same algorithm; the only difference is that the former is
written directly in XploRe while the latter uses a dynamically linked library writ-
ten in the C programming language. On account of this, rbftrain works slowly,
but allows the user to change the source code directly, to add new features and
to see exactly what is happening. On the other hand, rbftrain2 is a closed
product which cannot be changed, but it runs fast.
The input parameters x and y are the input and output variables, respectively.
We assume that x and y have dimensions N x p and N x q, respectively. The number of
units in the hidden layer is given by the parameter clust; it must be determined
by the user, and usually clust is much smaller than N. The parameter learn
contains the learning rates: one for training the hidden layer and one
for training output weights. Each of these learning rates must be from the range
(0, 1). The vector epochs has two rows: the first row is the number of training
epochs for the hidden layer and the second row contains the number of epochs to
train the output layer. The training is stopped either when the output units
have already been trained epochs[2] times or when the mean squared error reaches
the value given by the parameter mMSE. The optional input parameter activ
determines whether the bipolar sigmoid function (activ = 1),

    (1 − e^(−W)) / (1 + e^(−W)) ,

should be used instead of the default binary activation function (activ = 0):