Credit Risk Estimation



Master's thesis presented to obtain the Master of Science degree:

    On Credit Scoring Estimation

    by

Karel Komorád

    (180174)

    Submitted to:

Prof. Dr. Wolfgang Härdle

    Institute for Statistics and Econometrics

    Humboldt University

    Spandauer Str. 1

D-10178 Berlin

    Berlin, December 18, 2002


Acknowledgment

At this place I would like to express my gratitude to the persons who contributed to the rise of this thesis and who influenced my attitude to statistics and to study in general. In the first place I would like to thank my advisor, Professor Wolfgang Härdle, who inspired me to think on other than the purely mathematical level and gave me the key to connect the real world with the world of theoretical books. Further, my thanks go to Mgr. Zdeněk Hlávka, Ph.D. for his readiness to help at any time. Without his support this thesis would not exist at all.

The two years I spent at the Institute for Statistics and Econometrics at Humboldt University in Berlin gave me more experience than any years before, and I am grateful to all colleagues of this institute, especially to Prof. Dr. Bernd Rönz, Axel Werwatz, Ph.D. and Ing. Pavel Čížek, Ph.D., who taught me a lot about the mysteries of programming. And last, but not least, my gratitude goes to Julka Smoljaninova, who gave me the strength to finish the work once commenced and never stopped trusting me.

Declaration of Authorship

I hereby confirm that I have authored this master's thesis independently and without the use of resources other than those indicated. All passages which are taken literally or in substance from publications or other resources are marked as such.

Berlin, December 18, 2002

Karel Komorád


    Abstract

Credit scoring methods have become a standard tool of banks and other financial institutions, direct marketing retailers and advertising companies to estimate whether an applicant for credit/goods will pay back his liabilities. In this thesis we give a short overview of credit scoring and its methods. We investigate the usage of some of these methods and their performance on a data set from a French bank. Our results indicate that the examined methods, namely logistic regression, multi-layer perceptron (MLP) and radial basis function (RBF) neural networks, give very similar results; however, the traditional logit model seems to be the best one. We also describe the RBF architecture and a simple RBF program we implemented in the statistical computing environment XploRe.


    Contents

1 Introduction
2 Credit Scoring in Overview
   2.1 History
   2.2 Problems
   2.3 Credit Scoring Today
3 Data Set Description
4 Credit Scoring Methods
   4.1 Logistic Regression
   4.2 Multi-layer Perceptron
   4.3 Radial Basis Function Neural Network
5 Other Methods
   5.1 Probit Regression
   5.2 Semiparametric Regression
   5.3 Classification Trees
   5.4 Linear Discriminant Analysis
   5.5 Panel Data Analysis
   5.6 Hazard Regression
   5.7 Genetic Algorithms
   5.8 Linear Programming
   5.9 Treed Logits
6 Summary and Conclusion
7 Extensions
Appendix
A Radial Basis Function Neural Networks
   A.1 The Model
   A.2 RBF Neural Networks in XploRe
   A.3 An Example
   A.4 Detailed Description of the Algorithm
B Suggestions for Improvements in XploRe
   B.1 grdotd
   B.2 nnrpredict
   B.3 nnrnet
C CD contents


Notation

X               random variable
X               random vector
X               data matrix
x_i = (x_{i,1}, ..., x_{i,p})   i-th observation of a random vector X ∈ R^p
⊤               transposition
french.dat      file name
grdotd          quantlet name
rbfnet          parameter name

Abbreviations

ANN        artificial neural network
Basel II   the new Basel Capital Accord
CART       classification and regression trees
ECOA       Equal Credit Opportunity Act
GLM        generalized linear model
GPLM       generalized partially linear model
LDA        linear discriminant analysis
MSE        mean squared error
RBF        radial basis function
VaR        value at risk


    1 Introduction

One of the basic tasks which any financial institution must deal with is to minimize its credit risk. Scoring methods traditionally estimate the creditworthiness of a credit card applicant. They predict the probability that an applicant or existing borrower will default or become delinquent. Credit scoring studies the creditworthiness of:

any of the many forms of commerce under which an individual obtains money or goods or services on condition of a promise to repay the money or to pay for the goods or services, along with a fee (the interest), at some specific future date or dates. (Lewis, 1994)

The statistical methods we will present are based on a particular amount of historical data. Computers make it possible to treat large data sets and to come to a decision more quickly, more cheaply and more realistically than the primarily judgmental assessments historically made by credit experts or loan officers. Used correctly, these methods give even better predictions.

Bank credit card issuers in the U.S. lose about $1 billion each year in fraud, mostly from stolen cards. They also lose another $3 billion in fraudulent bankruptcy filings. Merchants, however, absorb more than $10 billion each year in credit card related fraud.¹

The founders of credit scoring, Bill Fair and Earl Isaac, designed a complete billing system for one of the first credit cards, Carte Blanche, in 1957 and built the first credit scoring system for American Investments one year later. Since then credit scoring has become a broad system for predicting consumer behaviour and has spread into many other areas, e.g. consumer lending (especially on credit cards), mortgage lending, small and medium business loan lending, direct marketing or advertising. Back et al. (1996) used the same procedures to predict the failure of companies.

Methods used for credit scoring include various statistical procedures, like the most commonly used logistic regression and its alternative probit regression, (linear) discriminant analysis, the fashionable artificial neural networks or genetic algorithms, and further linear programming, nonparametric classification trees or semiparametric regression.

The aim of this thesis is to give an overview of credit scoring and to compare various statistical methods by using them on a data set from a French bank. We want to compare their ability to distinguish between two subgroups of a highly complicated sample, and to see the performance and computational demands of these procedures.

¹ http://www.cardweb.com/cardlearn/stat.html


[Figure 1.1: Number of credit cards and number of transactions per credit card (both per 1000 inhabitants) in the European Union, 1995–1999 (http://www.eurofinas.org).]

Credit scoring gains new importance when thinking about the New Basel Capital Accord. The so-called Basel II replaces the current 1988 Capital Accord and focuses on techniques that allow banks and supervisors to evaluate properly the various risks that banks face. Thus credit scoring may contribute to the internal assessment process of an institution, which is desirable.

This master's thesis is organized as follows. First, we give a short overview of credit scoring, recall some of its history, discuss the accompanying problems and mention the direction credit scoring may take in the future. The data set for the analysis is described in the third section. Section 4 explains the statistical methods used (in particular logistic regression, semiparametric regression, the multi-layer perceptron neural network and the radial basis function neural network) and applies them to the data set. Other methods used for credit scoring are presented in Section 5. A summary of our analyses together with the conclusion is given in Section 6. Section 7 remarks on some possible extensions of this topic. The method of radial basis function neural networks, which we programmed in the C language for the purpose of this thesis, is described thoroughly in Appendix A. Appendix B gives some suggestions for improvements in XploRe that we discovered during the work on this thesis. Finally, Appendix C lists the files stored on the compact disk enclosed with this thesis.


    2 Credit Scoring in Overview

Risk forecasting is topic number one in modern finance. Apart from portfolio management, the pricing of options (and other financial instruments) or bond pricing, credit scoring represents another important set of procedures for estimating and reducing credit risk. It involves techniques that help financial organizations decide whether or not to grant credit to applicants. Basically, credit scoring tries to distinguish two different subgroups in the data sample. The aim is to choose a method which is computable in real time and predicts sufficiently precisely.

There is a vast number of articles treating credit scoring in recent issues of trade publications in the credit and banking area. An exhaustive overview of the literature on credit scoring can be found in Thomas (2000), Mester (1997) or in Kaiser and Szczesny (2000a). Professor D. J. Hand, head of the statistics section in the Department of Mathematics at Imperial College, has also published many books on this topic.

    2.1 History

The statistical techniques used for credit scoring are based on the idea of discrimination between several groups in a data sample. These procedures originate in the thirties and forties of the previous century (Fisher, 1936; Durand, 1941). At that time some of the finance houses and mail order firms were having difficulties with their credit management. Decisions whether to give loans or send merchandise to the applicants were made judgmentally by credit analysts. The decision procedure was nonuniform, subjective and opaque; it depended on the rules of each financial house and on the personal and empirical knowledge of each single clerk. With the rising number of people applying for a credit card in the late 1960s it was impossible to rely on credit analysts only; an automated system was necessary. The first consultancy was formed in San Francisco by Bill Fair and Earl Isaac in the late 1950s. Their system spread fast, as the financial institutions found out that using credit scoring was cheaper, faster, more objective, and mainly much more predictive than any judgmental scheme. It is estimated that default rates dropped by 50% (Thomas, 2000) after the implementation of credit scoring. Another advantage of the usage of credit scoring is that it allows lenders to underwrite and monitor loans without actually meeting the borrower.

The success of credit scoring in credit card issuing was a significant sign for the banks to apply scoring methods to other products like personal loans, mortgage loans, small business loans etc. However, commercial lending is more heterogeneous, its documentation is not standardized within or across institutions, and thus the results


are not so clear. The growth of direct marketing in the 1990s has led to the use of scorecards to improve the response rate to advertising campaigns.

    2.2 Problems

Most of the problems one must face when using credit scoring are of a rather technical than theoretical nature. First of all, one should think of the data necessary to implement the scoring. It should include as many relevant factors as possible; there is a trade-off between expensive data and low accuracy due to insufficient information. Banks collect the data from their internal sources (the applicant's previous credit history), from external sources (questionnaires, interviews with the applicants) and from third parties. From the applicant's background the following information is usually collected: age, gender, marital status, nationality, education, number of children, job, income, lease rental charges, etc. The following questions from the applicant's credit history are especially interesting: Does the applicant already have a credit? How much did he borrow? Has the applicant ever delayed his payments? Is he asking for another credit as well? Under third parties we understand special houses specialized in collecting credit information about potential clients.

[Figure 2.1: Number of credit cards per 1000 inhabitants in the years 1994 (thin line) and 1999 (thick line) for Belgium, France, Germany, Italy, Sweden and the U.K. (http://www.eurofinas.org).]

The variables entering the credit

scoring procedures should be chosen carefully, as the amount of the data may be vast indeed and thus computationally problematic. For instance, the German central bank (Deutsche Bundesbank) lists about 325,000 units. Since most of the attributes in credit scoring are categorical, imposing dummy variables gives a matrix with several million elements (Enache (1998) mentions 180 variables in his analysis).

Müller et al. (2002) treat a very important feature of credit scoring data: there is usually no information on the performance of rejected customers. This causes a bias in the sample. Hand and Henley (1993) concluded that it cannot be overcome unless one can assume a particular relationship between the distributions of the good and bad clients which holds for both the accepted and the rejected applicants. This problem may be solved by some organizations if they accept everybody for a short time; afterward they can build a scorecard based on the unbiased data sample. However, this is possible only for retailers, mail order firms or advertising companies, not for banks and financial institutions.

The American banks have another problem. The law does not allow the use of information about race, nationality, religion, gender or marital status to build a scorecard. This is stated in the Equal Credit Opportunity Act (ECOA) and in the Consumer Credit Protection Act. Moreover, the attribute age plays a special role: it can be used unless people older than 62 years are discriminated against. Legal fundamentals for the consumer credit business in Germany are given in Schnurr (1997).

    2.3 Credit Scoring Today

As mentioned above, credit scoring methods are widely used to estimate and to minimize credit risk. Mail order companies, advertising companies, banks and other financial institutions use these methods to score their clients, applicants and potential customers. There is an effort to make all procedures used to estimate and decrease credit risk more precise. Both the U.S. Federal Home Loan Mortgage Corporation and the U.S. Federal National Mortgage Association have encouraged mortgage lenders to use credit scoring, which should provide consistency across underwriters.

International banking supervision also appeals for more precise internal assessments by banks. The Basel Committee on Banking Supervision is an international organization which formulates broad supervisory standards and guidelines for banks. It encourages convergence toward common approaches and common standards. The Committee's members come from Belgium, Canada, France, Germany, Italy, Japan, Luxembourg,


the Netherlands, Spain, Sweden, Switzerland, the United Kingdom and the United States. In 1988, the Committee decided to introduce a capital measurement system (the Basel Capital Accord). This framework has been progressively introduced not only in member countries but also in other countries with active international banks. In June 1999, the Committee issued a proposal for a New Capital Adequacy Framework to replace the 1988 Accord (http://www.bis.org). The proposed capital framework consists of three pillars:

1. minimum capital requirements,

2. supervisory review of the internal assessment process and capital adequacy,

3. effective use of disclosure to strengthen market discipline.

The New Basel Capital Accord is to be implemented by 2004. Consequently, Basel II (the New Capital Accord) places more emphasis on banks' own internal methodologies. Therefore credit scoring and its methods may become a subject of banks' extensive interest, as they will try to make their internal assessments as precise and correct as possible.


    3 Data Set Description

In this section we briefly describe the data set used in our analysis and show some of its basic characteristics to give a better insight into the sample. The data set analyzed in this thesis stems from a French bank. However, the source is confidential and therefore the names of all variables have been removed, categorical values have been changed to meaningless symbols and the metric variables have been standardized to mean 0 and variance 1. The same data set was used in Müller and Rönz (1999) and Härdle et al. (2001). The original file, french.dat, contains 8830 observations with one response variable, 8 metric and 15 categorical predictor variables.

[Figure 3.1: Density dot plots for the original metric variables X1, ..., X8.]

    "=================================================="

    " Variable X5"

    "=================================================="

    " | Frequency Percent Cumulative "

    "--------------------------------------------------"

    " -0.537 | 3815 0.617 0.617"

    " 0.203 | 934 0.151 0.768"

    " 0.943 | 907 0.147 0.915"

    " 1.682 | 431 0.070 0.985"

    " 2.422 | 93 0.015 1.000"

    "--------------------------------------------------"

    " | 6180 1.000"

    "=================================================="

    Table 3.1: Frequency Table of the metric variable X5.

We have removed the observations with the response classified as 9, a class originally used for testing. Since we do not know their real classification, we cannot use this class for the purpose of our analysis. The remaining data contain 6672 observations. In addition we have changed the classes of the independent categorical variables from 1, ..., K to 0, ..., K − 1 and ordered them in accordance with the number of categories. Let Y denote the response variable, X1, ..., X8 (in sequence) the metric variables and X9, ..., X23 the categorical variables.

The density dot plots² in Figure 3.1 show estimated densities of the metric variables and indicate some suspicious outlying values. The problem is that the usual outlier tests assume a normal distribution and that testing for normality is itself affected by these outliers (Rönz, 1998). Note that the last observations (number 6662 and higher) take the lower extremes in almost all metric variables. Since the metric variables were already standardized, we decided to restrict them to the range [−3, 3] in order to get rid of the outliers. Thereby we get a new subsample containing only 6180 cases. Density dot plots of the metric variables for this data set are shown in Figure 3.2, and we can see that the shape of the variables' densities improved. Table 3.1 shows the frequencies of the variable X5, which is obviously discrete.

For the purpose of our analysis we have randomly divided the data sample into two subsamples. About two thirds of the whole data set (4135 observations) build the first subsample, the TRAIN sample; it will be used for model estimation. The second subsample, TEST, with 2045 observations, will be used in an overall procedure to compare the particular methods and to check the predictive power of the models used; this will be based on misclassification rates. The subsamples are stored in the files data-train.dat and data-test.dat.

² For the density dot plots in this section we used the quantlet myGrdotd.xpl, which is a modified version of the faulty original grdotd. For more information see Appendix B.1.

[Figure 3.2: Density dot plots for all metric variables as used in the analysis.]
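As a rough illustration of this preprocessing step, a minimal Python sketch (our own, with assumed array layout and column indices; the thesis itself works in XploRe) could perform the trimming to [−3, 3] and the random TRAIN/TEST split as follows:

```python
import numpy as np

def trim_and_split(data, metric_cols, train_frac=2/3, seed=42):
    """Drop rows where any standardized metric variable falls outside
    [-3, 3], then split the remainder randomly into TRAIN and TEST."""
    keep = np.all(np.abs(data[:, metric_cols]) <= 3, axis=1)
    data = data[keep]
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(data))
    n_train = int(round(train_frac * len(data)))
    return data[idx[:n_train]], data[idx[n_train:]]

# For the French data: 6672 cases shrink to 6180 after the trimming; the
# split then yields roughly the 4135 TRAIN and 2045 TEST observations.
```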

At this stage it is worth recalling that we do not know anything about the economic interpretation of the predictors. We also do not know what the response variable means (is it a credit card, loan or mortgage application?). And even the coding is primarily unclear: does 1 stand for "the client is creditworthy" or for "the client has some problems with repaying the debt"? The frequencies of the response variable, summarized in Table 3.2, tell us more. Since only about 6% of the data set is classified as 1, we will call this class "clients that have some problems with repaying their liability".


        TRAIN sample    TEST sample     total
0       3888 (94.0%)    1920 (93.9%)    5808 (94.0%)
1        247 (6.0%)      125 (6.1%)      372 (6.0%)
total   4135            2045            6180

Table 3.2: Frequencies of the two response outcomes.

Our sample is in this sense unbiased, because the percentage of faulty loans in consumer credit commonly varies between 1% and 7% (Arminger et al., 1997). Note that many credit scoring analyses use data sets with an overrepresented rate of bad loans (West, 2000). However, Desai et al. (1996) use 3 data sets from 3 credit unions in the Southeastern US which consist of 81.58%, 74.02% and 78.85% good loans, respectively. The data set of Fahrmeir et al. (1984) contains 70% good credits. Enache (1998) analyzed 38,000 applications, 16.8% of which were rejected. Cardweb, the U.S. payment card information network (http://www.cardweb.com), mentions that about 78% of U.S. households are considered creditworthy.

Let us now examine the variables in detail. All descriptive statistics and graphics used in this section are computed by the quantlet fr descr.data.xpl. We start with the metric variables. Tables 3.3 and 3.4 show their basic characteristics for the TRAIN and TEST data sets: minimum, first quartile, median, third quartile, maximum, mean and standard error. These statistics express in numbers our first finding from the density dot plots, namely that the data are extremely right-skewed. Figures 3.3 and 3.4 show box plots for the TRAIN and TEST data sets. The left box plot in each display stands for the good clients and the right one for the bad clients. From the box plots we cannot see any substantial differences between these two groups.

Next, we examine the categorical variables. Frequencies for the dichotomous categorical variables (with two outcomes only) are shown in Table 3.5. Figures 3.5–3.13 show bar charts for the variables X15–X23. The upper displays correspond to the TRAIN data set, the lower displays to the TEST data set. The displays on the left reveal the outcomes when the response is 0, and the displays on the right side when Y = 1. Remarkable is the change in variable X23: while the third category is the most plentiful among the creditworthy clients and the ninth category has only about half as many observations, in the case of the non-creditworthy clients the ninth category increases its relative number and the third category drops to about two thirds of the ninth category. Characteristics for the variables X15–X23 are summarized in Tables 3.6–3.14.


[Figure 3.3: Box plots of metric variables in the TRAIN subsample.]

        Min.     25% Q.   Median   75% Q.   Max.    Mean    Std.Err.
X1     -1.519   -0.766   -0.349    0.403    2.994   0.119   0.892
X2     -1.188   -0.810   -0.307    0.323    2.968   0.122   0.847
X3     -0.830   -0.695   -0.426    0.113    2.940   0.083   0.851
X4     -0.825   -0.694   -0.432    0.223    2.973   0.113   0.816
X5     -0.537   -0.537   -0.537    0.203    2.422   0.019   0.772
X6     -0.626   -0.363   -0.167    0.117    2.962   0.069   0.492
X7     -0.302   -0.302   -0.302    0.138    2.924   0.030   0.408
X8     -0.346   -0.211   -0.211   -0.211    2.835   0.106   0.340

Table 3.3: Minimum, 1st quartile, median, 3rd quartile, maximum, mean and standard error of the metric variables from the TRAIN subsample.


[Figure 3.4: Box plots of metric variables in the TEST subsample.]

        Min.     25% Q.   Median   75% Q.   Max.    Mean    Std.Err.
X1     -1.519   -0.766   -0.182    0.487    2.994   0.062   0.892
X2     -1.188   -0.810   -0.307    0.449    2.968   0.072   0.880
X3     -0.830   -0.695   -0.291    0.247    2.940   0.037   0.889
X4     -0.825   -0.694   -0.432    0.223    2.973   0.086   0.832
X5     -0.537   -0.537   -0.537    0.203    2.422   0.012   0.780
X6     -0.626   -0.375   -0.180    0.109    2.858   0.083   0.473
X7     -0.302   -0.302   -0.302    0.138    2.631   0.029   0.404
X8     -0.211   -0.211   -0.211   -0.211    2.631   0.099   0.367

Table 3.4: Minimum, 1st quartile, median, 3rd quartile, maximum, mean and standard error of the metric variables from the TEST subsample.


           TRAIN sample    TEST sample     total
X9:  0     1145 (27.7%)     620 (30.3%)    1765 (28.6%)
X9:  1     2990 (72.3%)    1425 (69.7%)    4415 (71.4%)
X10: 0      708 (17.1%)     361 (17.7%)    1069 (17.3%)
X10: 1     3427 (82.9%)    1684 (82.3%)    5111 (82.7%)
X11: 0      569 (13.8%)     308 (15.1%)     877 (14.2%)
X11: 1     3566 (86.2%)    1737 (84.9%)    5303 (85.8%)
X12: 0     2521 (61.0%)    1286 (62.9%)    3807 (61.6%)
X12: 1     1614 (39.0%)     759 (37.1%)    2373 (38.4%)
X13: 0      217 (5.2%)      100 (4.9%)      317 (5.1%)
X13: 1     3918 (94.8%)    1945 (95.1%)    5863 (94.9%)
X14: 0     3965 (95.9%)    1959 (95.8%)    5924 (95.9%)
X14: 1      170 (4.1%)       86 (4.2%)      256 (4.1%)

Table 3.5: Outcome frequencies of the dichotomous categorical variables X9–X14.

[Figure 3.5: Bar charts of the categorical variable X15.]


      TRAIN sample    TEST sample     total
0     1229 (29.7%)     590 (28.9%)    1819 (29.4%)
1      245 (5.9%)      131 (6.4%)      376 (6.1%)
2     2661 (64.4%)    1324 (64.7%)    3985 (64.5%)

Table 3.6: Outcome frequencies of the categorical variable X15.

[Figure 3.6: Bar charts of the categorical variable X16.]

      TRAIN sample    TEST sample     total
0     3071 (74.3%)    1513 (74.0%)    4584 (74.2%)
1      535 (12.9%)     293 (14.3%)     828 (13.4%)
2      529 (12.8%)     239 (11.7%)     768 (12.4%)

Table 3.7: Outcome frequencies of the categorical variable X16.


[Figure 3.7: Bar charts of the categorical variable X17.]

      TRAIN sample    TEST sample     total
0     1440 (34.8%)     728 (35.6%)    2168 (35.1%)
1      223 (5.4%)      118 (5.8%)      341 (5.5%)
2      639 (15.5%)     311 (15.2%)     950 (15.4%)
3     1833 (44.3%)     888 (43.4%)    2721 (44.0%)

Table 3.8: Outcome frequencies of the categorical variable X17.


[Figure 3.8: Bar charts of the categorical variable X18.]

      TRAIN sample    TEST sample     total
0      874 (21.1%)     441 (21.6%)    1315 (21.3%)
1      809 (19.6%)     378 (18.5%)    1187 (19.2%)
2      595 (14.4%)     319 (15.6%)     914 (14.8%)
3      232 (5.6%)      125 (6.1%)      357 (5.8%)
4      120 (2.9%)       77 (3.8%)      197 (3.2%)
5     1505 (36.4%)     705 (34.5%)    2210 (35.8%)

Table 3.9: Outcome frequencies of the categorical variable X18.


[Figure 3.9: Bar charts of the categorical variable X19.]

      TRAIN sample    TEST sample     total
0      369 (8.9%)      209 (10.2%)     578 (9.4%)
1      634 (15.3%)     295 (14.4%)     929 (15.0%)
2     1093 (26.4%)     548 (26.8%)    1641 (26.6%)
3      255 (6.2%)      132 (6.5%)      387 (6.3%)
4     1145 (27.7%)     552 (27.0%)    1697 (27.5%)
5      639 (15.5%)     309 (15.1%)     948 (15.3%)

Table 3.10: Outcome frequencies of the categorical variable X19.


[Figure 3.10: Bar charts of the categorical variable X20.]

      TRAIN sample    TEST sample     total
0      710 (17.2%)     370 (18.1%)    1080 (17.5%)
1      267 (6.5%)      151 (7.4%)      418 (6.8%)
2      372 (9.0%)      198 (9.7%)      570 (9.2%)
3      255 (6.2%)      121 (5.9%)      376 (6.1%)
4       60 (1.5%)       25 (1.2%)       85 (1.4%)
5     2471 (59.8%)    1180 (57.7%)    3651 (59.1%)

Table 3.11: Outcome frequencies of the categorical variable X20.


[Figure 3.11: Bar charts of the categorical variable X21.]

      TRAIN sample    TEST sample     total
0      737 (17.8%)     387 (18.9%)    1124 (18.2%)
1      412 (10.0%)     183 (8.9%)      595 (9.6%)
2      708 (17.1%)     318 (15.6%)    1026 (16.6%)
3      461 (11.1%)     214 (10.5%)     675 (10.9%)
4     1069 (25.9%)     554 (27.1%)    1623 (26.3%)
5      473 (11.4%)     237 (11.6%)     710 (11.5%)
6      275 (6.7%)      152 (7.4%)      427 (6.9%)

Table 3.12: Outcome frequencies of the categorical variable X21.


[Figure 3.12: Bar charts of the categorical variable X22.]

      TRAIN sample    TEST sample     total
0      383 (9.3%)      198 (9.7%)      581 (9.4%)
1      420 (10.2%)     228 (11.1%)     648 (10.5%)
2     1189 (28.8%)     572 (28.0%)    1761 (28.5%)
3      202 (4.9%)       78 (3.8%)      280 (4.5%)
4      210 (5.1%)       95 (4.6%)      305 (4.9%)
5      317 (7.7%)      175 (8.6%)      492 (8.0%)
6      218 (5.3%)       97 (4.7%)      315 (5.1%)
7      227 (5.5%)      112 (5.5%)      339 (5.5%)
8      143 (3.5%)       69 (3.4%)      212 (3.4%)
9      826 (20.0%)     421 (20.6%)    1247 (20.2%)

Table 3.13: Outcome frequencies of the categorical variable X22.


[Figure 3.13: Bar charts of the categorical variable X23.]

      TRAIN sample    TEST sample     total
0      279 (6.7%)      145 (7.1%)      424 (6.9%)
1      298 (7.2%)      167 (8.2%)      465 (7.5%)
2     1025 (24.8%)     510 (24.9%)    1535 (24.8%)
3      375 (9.1%)      180 (8.8%)      555 (9.0%)
4      226 (5.5%)       99 (4.8%)      325 (5.3%)
5      104 (2.5%)       59 (2.9%)      163 (2.6%)
6      269 (6.5%)      132 (6.5%)      401 (6.5%)
7      389 (9.4%)      194 (9.5%)      583 (9.4%)
8      561 (13.6%)     263 (12.9%)     824 (13.3%)
9       65 (1.6%)       25 (1.2%)       90 (1.5%)
10     544 (13.2%)     271 (13.3%)     815 (13.2%)

Table 3.14: Outcome frequencies of the categorical variable X23.


    4 Credit Scoring Methods

We suppose that we are given a set $\{x_i\}_{i=1}^{N}$ of N observations of a random vector X in $\mathbb{R}^p$. That is, there are p independent variables (predictors) X1, ..., Xp and one dependent variable (response) Y. Each observation is thus a row vector $(x_i, y_i)$, $i = 1, 2, \ldots, N$, and the whole data set can be written in matrix form:

$$ X = \begin{pmatrix} x_{1,1} & \cdots & x_{1,p} \\ \vdots & & \vdots \\ x_{N,1} & \cdots & x_{N,p} \end{pmatrix}, \qquad Y = \begin{pmatrix} y_1 \\ \vdots \\ y_N \end{pmatrix}. $$

In our particular case we have two data sets with $N_{TRAIN} = 4135$ and $N_{TEST} = 2045$. The dimension of the predictor space is p = 23 (or p = 61 after introducing dummy variables).

    4.1 Logistic Regression

Logistic regression stems from a wider class of models: it is a special case of the generalized linear model (GLM). The logistic regression model assumes that the conditional expectation $E[Y \mid X = x]$ of Y is given in the following way:

$$ E[Y \mid x] = G(\beta_0 + x^\top \beta) = \frac{e^{\beta_0 + x^\top \beta}}{1 + e^{\beta_0 + x^\top \beta}}, $$

where $\beta_0$ is the regression constant, $\beta = (\beta_1, \ldots, \beta_p)^\top$ are the regression coefficients, and $G(\cdot)$ is the logistic cumulative distribution function (also called the link function). The distribution of Y (Bernoulli distribution) belongs to the exponential family of distributions:

$$ f(y_i, \pi) = P(Y = y_i \mid X = x_i) = \pi^{y_i} (1 - \pi)^{1 - y_i}, \qquad i = 1, 2, \ldots, N, $$

where $\pi = P(Y = 1 \mid x)$ is the probability that the applicant will not be able to pay his liabilities and $1 - \pi = P(Y = 0 \mid x)$ is the probability of a creditworthy applicant. The parameters of a GLM are, in general, estimated by the maximum likelihood method, maximizing

$$ L(\beta_0, \beta_1, \ldots, \beta_p) = \prod_{i=1}^{N} \pi_i^{y_i} (1 - \pi_i)^{1 - y_i} = \prod_{i=1}^{N} G(\beta_0 + x_i^\top \beta)^{y_i} \bigl(1 - G(\beta_0 + x_i^\top \beta)\bigr)^{1 - y_i}. $$

The estimation is carried out by the Newton–Raphson algorithm. Computational details are given in Härdle et al. (2000b).
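To make the estimation step concrete, here is a minimal Python sketch of the Newton–Raphson (equivalently, iteratively reweighted least squares) iteration for the logit model; it is our own illustration under assumed variable names, not the XploRe code used in the thesis:

```python
import numpy as np

def fit_logit(X, y, max_iter=25, tol=1e-8):
    """Maximize the Bernoulli likelihood of the logit model
    E[Y|x] = G(beta0 + x'beta) by Newton-Raphson."""
    Z = np.column_stack([np.ones(len(X)), X])   # prepend intercept column
    beta = np.zeros(Z.shape[1])
    for _ in range(max_iter):
        pi = 1.0 / (1.0 + np.exp(-Z @ beta))    # fitted probabilities G(.)
        W = pi * (1.0 - pi)                     # Bernoulli variances
        score = Z.T @ (y - pi)                  # gradient of the log-likelihood
        info = Z.T @ (Z * W[:, None])           # Fisher information matrix
        step = np.linalg.solve(info, score)
        beta += step
        if np.max(np.abs(step)) < tol:          # converged
            break
    return beta
```

Standard errors and t-values of the kind reported in Tables 4.1–4.3 then come from the inverse of the information matrix at the optimum.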


Variable    Coeff.   Std.Err.  t-value     Variable    Coeff.   Std.Err.  t-value
Const.      -2.747    0.066   -41.811      Const.      -2.757    0.066   -41.976
X1           0.191    0.069     2.771      X5           0.093    0.082     1.135
Const.      -2.823    0.071   -39.873      Const.      -2.754    0.066   -41.861
X2          -0.313    0.088    -3.539      X6           0.035    0.131     0.267
Const.      -2.767    0.067   -41.518      Const.      -2.830    0.071   -39.635
X3          -0.097    0.082    -1.189      X7          -0.872    0.217    -4.013
Const.      -2.752    0.066   -41.721      Const.      -2.948    0.099   -29.807
X4           0.047    0.078     0.596      X8          -1.306    0.431    -3.032

Table 4.1: Logistic regression coefficients for each metric variable separately. Bold parameter values are significant at 5%.

    Results

The estimations are based on the TRAIN data set. Since the logistic regression cannot deal with categorical variables directly, we must recode each of the variables X9–X23 with a set of new contrast variables (the first category, 0, is taken as the reference category for each variable). In this manner we get a larger data set with one response, 8 metric and 53 dummy variables (stored in the files fr train C.dat and fr test C.dat). Let Xa(b) denote the b-th contrast variable for the original variable Xa; e.g. X23 is then recoded into the contrast variables X23(1), X23(2), ..., X23(10). First we tried to fit the response on each variable separately. Table 4.1 shows the estimation results: the parameters, their standard errors and t-values for the metric variables. X1, X2, X7 and X8 are significant at 5% (and thus emphasized in bold), X3–X6 are insignificant.
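The recoding described above is mechanical; a hedged Python sketch (using pandas as a stand-in for the XploRe data preparation, with invented toy values) shows the idea of dropping the reference category 0:

```python
import pandas as pd

# toy stand-in for two of the categorical predictors
df = pd.DataFrame({"X15": [0, 2, 1, 2, 0], "X23": [0, 3, 10, 2, 8]})

# one dummy per non-reference category; category 0 is the reference
dummies = pd.get_dummies(df.astype("category"), drop_first=True)
print(list(dummies.columns))   # e.g. ['X15_1', 'X15_2', 'X23_2', 'X23_3', ...]
```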

Results for the categorical variables estimated separately are given in Table 4.2. Parameters significant at 5% are again emphasized in bold. The significant binary variables are X9, X10 and X14. No variable with more than two categories has significant coefficients for all of its dummy variables. All coefficients for X16 and X17 are insignificant.

Of the significant variables, the categorical variable X10 has the lowest deviance (1802.6). The sequence of the significant variables sorted with respect to the deviance (in decreasing order) is: X10, X20, X22, X23, X21, and the corresponding R² values decrease from 3.7% to 1.5%. This is really a very poor performance. Hence we use


Variable    Coeff.   Std.Err.  t-value     Variable    Coeff.   Std.Err.  t-value
Const.      -2.403    0.107   -22.426      Const.      -2.121    0.121   -17.475
X9(1)       -0.524    0.136    -3.864      X20(1)      -0.392    0.262    -1.496
Const.      -1.864    0.110   -16.910      X20(2)      -1.048    0.290    -3.613
X10(1)      -1.206    0.138    -8.737      X20(3)      -0.343    0.263    -1.304
Const.      -2.609    0.166   -15.727      X20(4)       0.932    0.328     2.836
X11(1)      -0.172    0.181    -0.954      X20(5)      -1.024    0.158    -6.481
Const.      -2.841    0.087   -32.563      Const.      -3.160    0.186   -16.951
X12(1)       0.206    0.132     1.557      X21(1)      -0.049    0.316    -0.155
Const.      -2.344    0.240    -9.759      X21(2)       0.142    0.258     0.549
X13(1)      -0.440    0.250    -1.763      X21(3)       1.008    0.241     4.184
Const.      -2.792    0.068   -41.015      X21(4)       0.561    0.222     2.528
X14(1)       0.659    0.258     2.549      X21(5)       0.274    0.277     0.987
Const.      -3.140    0.143   -21.952      X21(6)       0.667    0.294     2.271
X15(1)       0.174    0.329     0.528      Const.      -3.133    0.255   -12.267
X15(2)       0.540    0.162     3.329      X22(1)      -0.096    0.361    -0.266
Const.      -2.771    0.077   -36.160      X22(2)       0.678    0.277     2.445
X16(1)       0.255    0.181     1.405      X22(3)      -0.354    0.487    -0.726
X16(2)      -0.192    0.215    -0.892      X22(4)       0.638    0.365     1.749
Const.      -2.833    0.115   -24.628      X22(5)       0.262    0.357     0.735
X17(1)      -0.034    0.318    -0.106      X22(6)       0.894    0.343     2.604
X17(2)      -0.077    0.213    -0.363      X22(7)      -1.590    0.755    -2.107
X17(3)       0.192    0.148     1.297      X22(8)       0.989    0.374     2.645
Const.      -3.147    0.170   -18.492      X22(9)       0.255    0.299     0.854
X18(1)       0.568    0.219     2.596      Const.      -2.561    0.232   -11.035
X18(2)       0.246    0.251     0.982      X23(1)      -0.308    0.346    -0.890
X18(3)       0.891    0.281     3.168      X23(2)      -0.811    0.290    -2.794
X18(4)       0.635    0.386     1.645      X23(3)      -0.264    0.323    -0.816
X18(5)       0.416    0.201     2.065      X23(4)      -0.622    0.412    -1.509
Const.      -3.030    0.248   -12.204      X23(5)      -0.425    0.514    -0.826
X19(1)       0.636    0.287     2.217      X23(6)       0.092    0.325     0.284
X19(2)       0.253    0.280     0.904      X23(7)      -0.161    0.313    -0.513
X19(3)       0.670    0.334     2.009      X23(8)       0.342    0.272     1.257
X19(4)       0.025    0.285     0.086      X23(9)      -1.598    1.034    -1.545
X19(5)       0.241    0.301     0.802      X23(10)      0.054    0.283     0.191

Table 4.2: Logistic regression coefficients for each categorical variable separately. Bold parameter values are significant at 5%.


[Figure 4.1: Predicted probabilities. GLM logistic fit, n = 4135: index η plotted against the link μ and the responses y.]

the quantlet glmforward³ to find a more appropriate model. The quantlet starts with a null model (containing the intercept only) and adds particular variables consecutively. The measure of goodness of the model is Akaike's criterion. In this way we can identify subsets of independent variables that are good predictors of the response. Note that this quantlet must evaluate up to p(p+1)/2 models, which in our case means 62·63/2 = 1953 models, and therefore the computation takes a while. The forward stepwise procedure suggests including the following variables in the model: X1, X2, X5, X7–X10, X12–X14, X16(2), X18(3), X18(5), X19(1), X19(3), X20(2), X20(4), X20(5), X21(3), X21(4), X21(6), X22(2), X22(4), X22(6)–X22(9), X23(2), X23(8), X23(9). The Akaike information criterion of this model is 1643.6. It is remarkable that the variable X23(3) is insignificant, even though it seemed to be quite predictive from Figure 3.13. Table 4.3 shows the estimated parameters, their standard errors and corresponding t-values in this model. Since by using dummy variables one considers all possible effects, the modelling of the categorical variables cannot be further improved, but the influence of the metric variables may be investigated better by using a semiparametric model. Figure 4.1 shows a plot of the index η against Y together with a plot of the index against the link function G(η).

³ Files o-logit*.* contain the complete computer output related to this section.
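The following Python sketch mimics what glmforward does, as we understand it: greedy forward selection of logit terms by Akaike's criterion (a statsmodels-based illustration with our own function names, not the XploRe quantlet itself):

```python
import numpy as np
import statsmodels.api as sm

def forward_aic(X, y):
    """Greedy forward selection for a logit model: start from the
    intercept-only null model and add the column that lowers the AIC
    most; stop when no remaining column improves it.  In the worst
    case this evaluates on the order of p(p+1)/2 candidate models."""
    def aic_of(cols):
        Z = sm.add_constant(X[:, cols]) if cols else np.ones((len(y), 1))
        return sm.Logit(y, Z).fit(disp=0).aic

    selected, remaining = [], list(range(X.shape[1]))
    best_aic = aic_of(selected)
    while remaining:
        trials = [(aic_of(selected + [j]), j) for j in remaining]
        trial_aic, j_best = min(trials)
        if trial_aic >= best_aic:
            break                      # no further improvement
        best_aic = trial_aic
        selected.append(j_best)
        remaining.remove(j_best)
    return selected, best_aic
```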


Variable    Coeff.    Std.Err.   t-value
Const.     -2.153     0.3653    -5.8936
X1          0.1952    0.1126     1.7338
X2         -0.3533    0.0967    -3.6518
X5          0.1618    0.1047     1.546
X7         -0.8900    0.2198    -4.0488
X8         -0.958     0.4080    -2.3482
X9(1)      -0.3269    0.1706    -1.9162
X10(1)     -0.9780    0.1519    -6.4356
X12(1)      0.2551    0.1598     1.5961
X13(1)     -0.511     0.2678    -1.9091
X14(1)      0.6736    0.2850     2.3636
X16(2)     -0.3144    0.2351    -1.3372
X18(3)      0.4531    0.265      1.7056
X18(5)      0.5652    0.1940     2.913
X19(1)      0.4681    0.1768     2.6472
X19(3)      0.5394    0.2546     2.1181
X20(2)     -0.8286    0.2982    -2.7789
X20(4)      1.262     0.362      3.4868
X20(5)     -0.8635    0.1506    -5.7339
X21(3)      0.7086    0.1949     3.6355
X21(4)      0.2385    0.1687     1.4136
X21(6)      0.5004    0.2640     1.8953
X22(2)      0.6722    0.1839     3.6541
X22(4)      0.7068    0.3161     2.236
X22(6)      0.9538    0.2967     3.2146
X22(7)     -1.341     0.6702    -2.0007
X22(8)      0.880     0.4261     2.0666
X22(9)      0.350     0.2182     1.6064
X23(2)     -0.6462    0.2001    -3.2293
X23(8)      0.3820    0.1766     2.163
X23(9)     -1.552     0.9915    -1.566

Table 4.3: Logistic regression coefficients of the model suggested by glmforward. Bold parameter values are significant at 5%.


    Prediction

We use the model described in the previous paragraph (suggested by the glmforward quantlet) to estimate the outcomes of the TEST data set and to compute the misclassification rate.

First we check the prediction on the TRAIN data set, which built the model. Table 4.4 shows the results when observations with a probability higher than 0.5 are assigned to be non-creditworthy clients. For comparison we show the results using prior probabilities in Table 4.5. At first sight, using prior probabilities gives worse results: the overall misclassification rate of 8.95% is higher than the rate when using the 0.5 threshold (6.0%). But in fact, the 0.5 threshold misclassifies 93.9% of the bad clients, while prior probabilities misclassify only 74.9%! The 0.5 threshold gains its better overall misclassification rate from the low number of misclassified good clients (0.4%), whereas using prior probabilities misclassifies 4.76% of them. From the bank's point of view it is worse to grant a loan to a non-creditworthy applicant than to reject a creditworthy one, and thus we will further use prior probabilities to decide on the creditworthiness of a client.

Then we use the model on the testing data, stored in the file data-test C.dat. Table 4.6 shows the results. In the TEST data set, 98 of the 1920 good applicants were denoted as bad, 101 of the 125 bad clients were assigned as good, and thus 199 applicants of the entire TEST sample were misclassified, which results in an overall misclassification rate of about 9.7%.
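For clarity, a small Python sketch of how such a misclassification table can be computed from predicted probabilities (our own illustration; the cutoff is either 0.5 or the prior rate of bad clients, roughly 0.06 in the TRAIN sample):

```python
import numpy as np

def misclass_table(y, p_hat, cutoff):
    """Confusion counts and misclassification rates for a given cutoff;
    a client is predicted 'bad' (1) when p_hat exceeds the cutoff."""
    pred = (p_hat > cutoff).astype(int)
    n = {(a, b): int(np.sum((y == a) & (pred == b)))
         for a in (0, 1) for b in (0, 1)}
    good_rate = n[0, 1] / (n[0, 0] + n[0, 1])   # misclassified good clients
    bad_rate = n[1, 0] / (n[1, 0] + n[1, 1])    # misclassified bad clients
    overall = (n[0, 1] + n[1, 0]) / len(y)
    return n, good_rate, bad_rate, overall
```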

observed    predicted 0    predicted 1    misclass.
0                  3872             16       0.41%
1                   232             15      93.93%
overall misclassified: 248 (6.00%)

Table 4.4: Misclassification rates of the logit model for the TRAIN data set.

observed    predicted 0    predicted 1    misclass.
0                  3703            185       4.76%
1                   185             62      74.90%
overall misclassified: 370 (8.95%)

Table 4.5: Misclassification rates of the logit model for the TRAIN data set (prior probabilities).


observed    predicted 0    predicted 1    misclass.
0                  1822             98       5.10%
1                   101             24      80.80%
overall misclassified: 199 (9.73%)

Table 4.6: Misclassification rates of the logit model for the TEST data set (prior probabilities).

    Discussion

Since the logit model belongs to the traditional techniques, one may find logistic regression in almost every paper treating credit scoring, at least as a reference method for comparison with other models. In general, logistic regression is easy to fit and works well in practice. However, binary output models (like logit or probit) describe best the output category which occurs most frequently. Therefore each output category should have at least 5% of the observations. If an output class has too small a number of observations, one should use so-called rare event models (Kaiser and Szczesny, 2000a).

    4.2 Multi-layer Perceptron

The multi-layer perceptron (MLP) is a simple feed-forward neural network with an input layer, several hidden layers and one output layer. This means that information can only flow forward, from the input units to the hidden layer and then to the output unit(s). The MLP is the most often used neural network architecture, and there is a great deal of literature concerning it; see for example Bishop (1995). For the purpose of credit scoring an MLP with one hidden layer and only one or two output units is sufficient. Its basic structure is illustrated in Figure 4.2. The value of the output unit can be expressed as:

$$ f(x) = F_2\Biggl( w_0^{(2)} + \sum_{j=1}^{r} w_j^{(2)} \, F_1\Bigl( w_{j0}^{(1)} + \sum_{i=1}^{p} w_{ji}^{(1)} x_i \Bigr) \Biggr), $$

where the $x_i$ are the input units, $w_{ji}^{(1)}$ and $w_j^{(2)}$ are the weights of the hidden and output layer, respectively, and $F_1$ and $F_2$ are the transfer functions from the input to the hidden layer and from the hidden to the output layer, respectively. The transfer function is usually a sigmoid, e.g. the logistic function. The parameters of the network are determined iteratively, commonly via the backpropagation procedure.
    determined iteratively, commonly via the backpropagation procedure.


[Figure 4.2: A multi-layer perceptron network with one hidden layer (units z1, ..., zr, weights w(1) and w(2)) and one output unit y = f(x).]

    Results

For probabilistic models the softmax transfer function is intended for the outputs. However, we found a faulty usage of this function in XploRe; it is commented on in Appendix B.2. Therefore we used an MLP network with a logistic transfer function for the output, a quadratic least squares error function and no skip connections.

We split the TRAIN sample into two subsamples, stored in the files data-nn-1.dat (with 2779 observations) and data-nn-2.dat (with 1356 observations), respectively. The first subsample is used for the actual network training; on it we look for an MLP network with the minimal mean squared error (MSE). The second subsample is used for validation; it should prevent overfitting, so that the model is not built for one particular sample only.
observed    predicted 0    predicted 1    misclass.
0                  2579             50        1.9%
1                    60             90       40.0%
overall misclassified: 110 (4.0%)

Table 4.7: Misclassification rates of the 23-12-1 MLP network for the small training sample.


observed    predicted 0    predicted 1    misclass.
0                  1197             62        4.9%
1                    86             11       88.7%
overall misclassified: 148 (10.9%)

Table 4.8: Misclassification rates of the 23-12-1 MLP network for the small validation sample.

observed    predicted 0    predicted 1    misclass.
0                  2569             60        2.3%
1                    60             90       40.0%
overall misclassified: 120 (4.3%)

Table 4.9: Misclassification rates of the 17-13-1 MLP network for the small training sample.

observed    predicted 0    predicted 1    misclass.
0                  1205             54        4.3%
1                    81             16       83.5%
overall misclassified: 135 (10.0%)

Table 4.10: Misclassification rates of the 17-13-1 MLP network for the small validation sample.


We computed many models and looked at the MSE and the misclassification rates on the validation data set. The misclassification rates are computed using prior probabilities obtained from the small training sample. Finally we chose an MLP network with 12 units in the hidden layer, which reached an MSE of 243.58. The final 23-12-1 MLP network⁴ is saved in the files mlp1.*.
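A hedged sketch of this search over network sizes, written here with scikit-learn's MLPClassifier instead of the XploRe trainer; the selection criterion (validation squared error of the predicted probabilities) follows the text, while the other settings are our assumptions:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def select_mlp(X_tr, y_tr, X_val, y_val, sizes=range(4, 16)):
    """Train one-hidden-layer MLPs of several sizes on the small
    training sample; keep the net with the lowest validation error."""
    best_net, best_err = None, np.inf
    for r in sizes:
        net = MLPClassifier(hidden_layer_sizes=(r,), activation="logistic",
                            max_iter=2000, random_state=0).fit(X_tr, y_tr)
        p_val = net.predict_proba(X_val)[:, 1]
        err = np.sum((y_val - p_val) ** 2)   # squared-error criterion
        if err < best_err:
            best_net, best_err = net, err
    return best_net, best_err
```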

Tables 4.7 and 4.8 show the misclassification rates which the final MLP network reaches in the small training and in the validation sample, respectively. The resulting MLP network predicts more than 98% of the good clients correctly.

⁴ Here 23-12-1 denotes 23 input units, 12 units in the hidden layer and 1 output unit.


Of the bad clients, 40% are misclassified; altogether there are 4% misclassified clients. The results get slightly worse when the MLP network is used on the small validation set: almost 5% of the good clients and more than 88% of the bad clients are misclassified!

A problem with neural networks is the missing procedure for choosing significant input units. Thus we let ourselves be inspired by the logistic regression and restricted the input layer to 17 units only, as suggested in Subsection 4.1 by the quantlet glmforward. That is, we built the network only on the knowledge of the variables X1, X2, X5, X7–X10, X12–X14, X16 and X18–X23. The resulting network has 13 hidden units and an MSE of 310.35. It is stored in the files mlp2.*. Table 4.9 shows its misclassification rates. This restricted MLP network has a higher misclassification rate for the good clients from the small training sample (2.3%), but its performance on the small validation sample is a little bit better: 4.3% of the good clients and only 83.5% of the bad clients are misclassified (Table 4.10).

    Prediction

Before we compute the misclassification rates for the TEST data set, we check the prediction on the whole TRAIN sample which built the model. The results for the 23-12-1 MLP and the 17-13-1 MLP network are shown in Tables 4.11 and 4.12, respectively.
observed    predicted 0    predicted 1    misclass.
0                  3771            117        3.0%
1                   146            101       59.1%
overall misclassified: 263 (6.4%)

Table 4.11: Misclassification rates of the 23-12-1 MLP network for the TRAIN data set.

observed    predicted 0    predicted 1    misclass.
0                  3751            137        3.5%
1                   138            109       55.9%
overall misclassified: 275 (6.7%)

Table 4.12: Misclassification rates of the 17-13-1 MLP network for the TRAIN data set.


observed    predicted 0    predicted 1    misclass.
0                  1815            105        5.5%
1                   110             15       88.0%
overall misclassified: 215 (10.5%)

Table 4.13: Misclassification rates of the 23-12-1 MLP network for the TEST data set.

observed    predicted 0    predicted 1    misclass.
0                  1813            107        5.6%
1                   110             15       88.0%
overall misclassified: 217 (10.6%)

Table 4.14: Misclassification rates of the 17-13-1 MLP network for the TEST data set.

    clientsit misclassified 8 clients (3.2%) less than the full network, on the other side

    by the good clients it misclassifies 20 clients (0.5%) more than the full network.

    Altogether the restricted 17-13-1 MLP network describes the TRAIN data set slightly

    worse than the full 23-12-1 MLP network. It misclassifies 12 clients more (0.3%).

Note that the number of misclassified clients in the small training sample plus the number of misclassified clients in the small validation sample is not exactly equal to the number of misclassified clients in the TRAIN data set, no matter which network is used (110 + 148 ≠ 263 and 120 + 135 ≠ 275). This is due to the prior probabilities we used: we denote a client as bad if the neural network's output is greater than the rate of good clients in the corresponding training sample, that is, 94.6% in the small training sample and 94.0% in the TRAIN data set.
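As a minimal illustration of this cutoff rule, sketched in Python (which is not the software used in this thesis; only the rates 0.946 and 0.940 are taken from the text above, everything else is hypothetical):

    import numpy as np

    # Sketch of the prior-probability cutoff described above: a client is
    # labelled bad (1) when the network output exceeds the share of good
    # clients in the sample the network was trained on.
    def classify(network_outputs, good_rate=0.946):
        return (np.asarray(network_outputs) > good_rate).astype(int)

With good_rate=0.946 this mimics the rule for the small training sample; for the TRAIN data set one would use 0.940.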

Table 4.13 and Table 4.14 show the final results of the prediction for the TEST data set. There is no difference in predicting the bad clients, and the restricted 17-13-1 MLP network misclassifies only 2 clients more than the full 23-12-1 MLP network. Thus these two neural networks give almost the same results.

    Discussion

Neural networks are very flexible models with good performance. Although there are various architectures of neural networks, more than 50% of applications use the multi-layer perceptron (MLP) network, which is both simple and well known.

Choosing the number of units in the hidden layer is problematic. West (2000) uses an analogy of the forward stepwise procedure in logistic regression: the so-called cascade learning starts with one neuron in the hidden layer and adds further neurons as long as the performance improves. Seen this way, logistic regression may be viewed as a simple MLP with one processing unit in one hidden layer and the logistic function as the sigmoid activation function.
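A minimal sketch of such a cascade search, written here with scikit-learn in Python rather than with the tools used in this thesis (MLPRegressor and the simple stopping rule are illustrative assumptions, not West's exact procedure):

    from sklearn.neural_network import MLPRegressor
    from sklearn.metrics import mean_squared_error

    # Cascade learning sketch: grow the hidden layer one unit at a time
    # and keep the network as long as the validation MSE still improves.
    def cascade_search(X_tr, y_tr, X_va, y_va, max_units=30):
        best_net, best_mse = None, float("inf")
        for h in range(1, max_units + 1):
            net = MLPRegressor(hidden_layer_sizes=(h,), max_iter=2000,
                               random_state=0).fit(X_tr, y_tr)
            mse = mean_squared_error(y_va, net.predict(X_va))
            if mse >= best_mse:      # performance stopped improving
                break
            best_net, best_mse = net, mse
        return best_net, best_mse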

    4.3 Radial Basis Function Neural Network

The radial basis function (RBF) network is another architecture of feed-forward neural networks, which has in principle only one hidden layer. The hidden units are also called clusters, as the observations are clustered to one of the hidden units, which are represented by radially symmetric functions. The weights of the hidden layer then represent the centers of these clusters (the means of the radial functions). At the first stage, these centers as well as the widths of the radial basis functions are found (via so-called unsupervised learning). Next, the weights of the output layer are determined via supervised learning (as in the MLP networks). RBF neural networks are explained comprehensively in Appendix A.

                 predicted
    observed      0      1    misclass.   overall misclass.
       0        2505    124     4.7%          248
       1         124     26    82.7%         (8.9%)

    Table 4.15: Misclassification rates of the 23-100-1 RBF network for the small training sample.

                 predicted
    observed      0      1    misclass.   overall misclass.
       0        1200     59     4.7%          142
       1          83     14    85.6%        (10.5%)

    Table 4.16: Misclassification rates of the 23-100-1 RBF network for the small validation sample.


    Results

The RBF neural network is built in the same way as the MLP network. We trained the network on the small training sample with 2779 observations (data-nn-1.dat) in order to minimize the mean squared error, and at the same time checked for overfitting by evaluating the model on the small validation sample (data-nn-2.dat).

In the same manner as with the MLP network, we tried many different networks with various learning parameters and different numbers of hidden units until we obtained optimal results. There are again two models, one with 23 input units (the full model) and another one with only 17 inputs, as suggested in Section 4.1.

The first model uses 100 clusters; we denote it the 23-100-1 RBF neural network and save it in rbf1.rbf. The latter contains only 80 clusters and we denote it the 17-80-1 RBF network (stored in rbf2.rbf). Misclassification rates for the 23-100-1 RBF network are given in Table 4.15 for the small training sample and in Table 4.16 for the validation sample. The 17-80-1 RBF neural network's misclassification rates are shown in Tables 4.17 and 4.18. In comparison with the MLP networks, both the 23-100-1 and the 17-80-1 RBF networks misclassify almost twice as many observations in the small training sample as the MLP does. However, the misclassification rates in the validation sample are reasonable: 10.5% and 10.8% of observations misclassified overall.

                 predicted
    observed      0      1    misclass.   overall misclass.
       0        2510    119     4.5%          238
       1         119     31    79.3%         (8.6%)

    Table 4.17: Misclassification rates of the 17-80-1 RBF network for the small training sample.

                 predicted
    observed      0      1    misclass.   overall misclass.
       0        1198     61     4.8%          146
       1          85     12    87.6%        (10.8%)

    Table 4.18: Misclassification rates of the 17-80-1 RBF network for the small validation sample.


    Prediction

Table 4.19 shows the results of the 23-100-1 RBF neural network. It misclassifies exactly the same number of observations as the 23-12-1 MLP network: 10.5%. However, the RBF network predicts the bad applicants better: there are 109 misclassifications (87.2%). Of the good applicants, 106 (5.5%) were misclassified. Results for the 17-80-1 RBF network are shown in Table 4.20. It performs even a little better: 106 of the bad and 103 of the good applicants were misclassified, which is altogether 209 (10.2%) misclassified applicants.

                 predicted
    observed      0      1    misclass.   overall misclass.
       0        1814    106     5.5%          215
       1         109     16    87.2%        (10.5%)

    Table 4.19: Misclassification rates of the 23-100-1 RBF network for the TEST data set.

                 predicted
    observed      0      1    misclass.   overall misclass.
       0        1817    103     5.4%          209
       1         106     19    84.8%        (10.2%)

    Table 4.20: Misclassification rates of the 17-80-1 RBF network for the TEST data set.

    Discussion

Radial basis function neural networks are supposed to give better predictions than MLP networks. That is true indeed; however, their performance is only slightly better in our case (8 misclassified observations fewer than the MLP). Due to the unsupervised learning, the computation of RBF networks proceeds very fast. One may also notice the misclassification rates of the small training sample: while MLP networks tend to overfit, the results from the RBF networks are more robust, as the misclassification rates between the small training and the validation sample differ by only about 2 percentage points.


    5 Other Methods

In the vast literature on credit scoring there are many other techniques which may also be used. This section gives a short overview of some of these methods and refers to further literature.

    5.1 Probit Regression

Sometimes an alternative to the logit model is used. The probit model is another variant of generalized linear models. It is derived by letting the link function be the standard normal distribution function. Since the logistic link function closely approximates that of a normal random variable, the results are similar.
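This similarity can be checked with a small Python sketch using statsmodels on synthetic data (not the thesis data; the coefficients below are arbitrary):

    import numpy as np
    import statsmodels.api as sm

    # Fit logit and probit to the same synthetic data; the predicted
    # probabilities are typically almost identical.
    rng = np.random.default_rng(0)
    X = sm.add_constant(rng.normal(size=(500, 3)))
    beta = np.array([0.2, 1.0, -0.5, 0.8])
    y = (X @ beta + rng.logistic(size=500) > 0).astype(int)

    logit  = sm.Logit(y, X).fit(disp=0)
    probit = sm.Probit(y, X).fit(disp=0)
    print(np.corrcoef(logit.predict(X), probit.predict(X))[0, 1])  # close to 1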

    5.2 Semiparametric Regression

In order to give more attention to metric variables, one may estimate them nonparametrically. Härdle et al. (2000a) show that semiparametric methods perform better than the logistic regression. The resulting model then consists of a linear and a nonlinear part:

    E[Y | X] = E[Y | (V, W)] = G(V⊤β + m(W)),

where β = (β1, . . . , βv)⊤ is the parameter vector for the categorical variables and m(·) is a smooth real function which can be estimated nonparametrically. In practice, one chooses for the nonparametric part only those continuous explanatory variables which have the most influence on the dependent variable Y. Estimators for β and m(·) are computed by semiparametric maximum likelihood (this is reviewed in Härdle et al. (2000b)).

    5.3 Classification Trees

A classification tree (usually subsumed under the more general name classification and regression trees, CART) is a nonparametric method to analyze categorical dependent variables as a function of metric and/or categorical explanatory variables. Classification trees have become a standard tool for developing credit scoring systems, since they are easily interpretable and may be presented graphically.

Classification and regression trees use a recursive partitioning algorithm (split-and-conquer): the basic idea is to split the sample into two subsamples, each of which contains, as far as possible, only cases from one response category, and then to repeatedly split each of the subsamples. The subsamples are called nodes; the entire sample is called the root node.

First, one looks for the explanatory variable which splits the sample in a node into two subgroups in such a way that these child nodes are internally as homogeneous as possible and differ from each other as much as possible. The splits are chosen according to some statistical criterion. Usually the tree is grown until it overfits and is afterwards pruned back to obtain a more robust model. Once the final tree is determined, each terminal node is classified as either good or bad depending on the majority of observations in that node. Figure 5.1 shows an example of a classification tree used for credit scoring. There are 1000 clients in the beginning (140 of them bad). The procedure splits this sample on particular values of some variables until seven final groups are determined. The upper number in each box represents the number of observations, the lower number stands for the number of bad clients in that node. A 0 or 1 above a box denotes whether the terminal node is classified as one of good or bad clients. The basic monograph explaining CART is Breimann et al. (1983). Thomas (2000) mentions that, in comparison with linear discriminant analysis (LDA), CART is better for predictors with interactions, while LDA is better for predictors with intercorrelations.
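The grow-then-prune procedure can be sketched in a few lines of Python with scikit-learn (synthetic data; neither the software nor the data of this thesis):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier, export_text

    # Grow a tree and prune it back via cost-complexity pruning; each
    # terminal node then classifies by the majority of its observations.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 4))                  # e.g. income, age, ...
    y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=1000) > 1.1).astype(int)

    tree = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X, y)
    print(export_text(tree, feature_names=["x1", "x2", "x3", "x4"]))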

[Figure 5.1 shows a fictive tree: the root node of 1000 clients (140 bad) is split on previous default, marriage, income (with cut-offs such as > $2,970, > $2,680 and > $2,820) and another credit into seven terminal nodes labelled good (0) or bad (1).]

Figure 5.1: Example of a fictive classification tree.


    5.4 Linear Discriminant Analysis

Discriminant analysis is a set of methods for separating subgroups in data using discriminant rules; it is well explained in Härdle and Simar (2002). Many papers on credit scoring also use discriminant analysis to distinguish good and bad clients in the sample, e.g. Back et al. (1996) or Desai et al. (1996). However, the linear discriminant approach assumes metric variables which are normally distributed and a variance matrix that is the same in each of the groups. In credit scoring most of the variables are categorical, the metric variables are usually not normally distributed (see Section 3) and, moreover, there is no reason to assume that the good and bad clients have the same variance matrices (the case of unequal matrices may be handled nonlinearly). From this point of view any usage of discriminant analysis in the credit scoring framework is faulty; however, Desai et al. (1996) state that the empirical performance of linear discriminant analysis is relatively good.
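For illustration, both the linear and the quadratic (unequal covariance) variants can be compared in Python with scikit-learn; the following sketch uses synthetic data with deliberately unequal group covariances (an assumption of the sketch, not the thesis data):

    import numpy as np
    from sklearn.discriminant_analysis import (
        LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis)

    # LDA assumes a common covariance matrix; QDA drops that assumption.
    rng = np.random.default_rng(0)
    good = rng.multivariate_normal([0, 0], [[1.0, 0.3], [0.3, 1.0]], 900)
    bad  = rng.multivariate_normal([1, 1], [[2.0, 0.0], [0.0, 0.5]], 100)
    X = np.vstack([good, bad])
    y = np.r_[np.zeros(900), np.ones(100)]

    print(LinearDiscriminantAnalysis().fit(X, y).score(X, y))
    print(QuadraticDiscriminantAnalysis().fit(X, y).score(X, y))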

    5.5 Panel Data Analysis

Logit and probit models may be extended to panel data analysis. Panel data denote data which have not only a cross-section but also a time dimension. In credit scoring this means that for every client we also have observations over time t. As a rule, panel data contain more information than cross-section or time-series data alone, and therefore we may get more precise results. Panel data analysis for credit scoring is explained in Kaiser and Szczesny (2000a).

    5.6 Hazard Regression

Hazard regression models analyze survival data. In particular, they estimate probabilities that the observed unit stays in the current state, and they solve the problem of censored data. In credit scoring this corresponds to the fact that we either know that the client became delinquent at some time t, or that the client is still creditworthy, which does not necessarily mean that he will not become delinquent at a future time. Kaiser and Szczesny (2000b) describe hazard regression in credit scoring and other related topics in detail.

    5.7 Genetic Algorithms

Genetic algorithms are another class of general optimization schemes based on biological analogies (as artificial neural networks are), described in the early 1990s. They simulate Darwinian evolution: they recombine possible vector solutions (so-called chromosomes) from a space of candidate solutions (a population of chromosomes) and try to maximize some fitness function (which determines how good a chromosome is). The recombination is done via three operators whose names originate from the biological background: reproduction, cross-over and mutation; a toy example is sketched below. Genetic algorithms are mentioned in the analysis of Back et al. (1996) and in Thomas (2000).
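The following Python sketch illustrates the three operators on binary chromosomes whose fitness is simply the number of ones (a deliberately artificial fitness function, not a credit scoring model):

    import numpy as np

    rng = np.random.default_rng(0)
    pop = rng.integers(0, 2, size=(20, 12))          # population of chromosomes

    for generation in range(40):
        fitness = pop.sum(axis=1)                    # fitness function
        parents = pop[np.argsort(fitness)[-10:]]     # reproduction: keep fittest
        pairs = rng.permutation(10).reshape(5, 2)
        children = []
        for a, b in pairs:
            cut = rng.integers(1, 12)                # cross-over point
            children.append(np.r_[parents[a, :cut], parents[b, cut:]])
            children.append(np.r_[parents[b, :cut], parents[a, cut:]])
        children = np.array(children)
        mutate = rng.random(children.shape) < 0.01   # mutation
        children[mutate] ^= 1
        pop = np.vstack([parents, children])

    print(pop.sum(axis=1).max())                     # best fitness found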

    5.8 Linear Programming

Linear programming searches for a cut-off point c and a weight vector (w1, . . . , wp) such that the scalar product w·xi of the weight vector and the vector of observations is above this cut-off point for the non-creditworthy clients and below this point for the creditworthy clients. The sum of errors Σ ei is minimized with respect to the unknown parameters:

    min  e1 + e2 + . . . + e_{nG+nB}

subject to

    w1 xi1 + w2 xi2 + . . . + wp xip ≥ c − ei ,   1 ≤ i ≤ nB ,
    w1 xi1 + w2 xi2 + . . . + wp xip ≤ c + ei ,   nB + 1 ≤ i ≤ nG + nB = N ,
    ei ≥ 0 ,   1 ≤ i ≤ nG + nB ,

where nB is the number of bad and nG the number of good clients. Linear programming for credit scoring is described in Thomas (2000).
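This program can be transcribed into Python with scipy (synthetic data; to exclude the trivial solution w = 0, c = 0, the cut-off is normalized to c = 1, which is one common convention and an assumption of this sketch):

    import numpy as np
    from scipy.optimize import linprog

    rng = np.random.default_rng(0)
    nB, nG, p = 40, 160, 3
    Xb = rng.normal( 1.0, 1.0, size=(nB, p))   # bad clients: scores above c
    Xg = rng.normal(-1.0, 1.0, size=(nG, p))   # good clients: scores below c
    N, c = nB + nG, 1.0

    # decision variables: (w_1..w_p, e_1..e_N); minimize the sum of the e_i
    # bad:  -x_i'w - e_i <= -c   <=>   x_i'w >= c - e_i
    # good:  x_i'w - e_i <=  c   <=>   x_i'w <= c + e_i
    A = np.block([[-Xb, -np.eye(nB), np.zeros((nB, nG))],
                  [ Xg,  np.zeros((nG, nB)), -np.eye(nG)]])
    b = np.r_[-c * np.ones(nB), c * np.ones(nG)]
    obj = np.r_[np.zeros(p), np.ones(N)]
    bounds = [(None, None)] * p + [(0, None)] * N

    res = linprog(obj, A_ub=A, b_ub=b, bounds=bounds)
    print(res.x[:p], res.fun)                  # weight vector, total error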

    5.9 Treed Logits

Chipman et al. (2001) studied the prediction problem in direct marketing. They generalized and conjoined logistic regression and the CART techniques: the space of all observations is partitioned through a binary tree, and in each bottom node a different logit model is fitted (instead of simply using the observed response frequencies). Treed logits are interpretable, small and prevent overfitting.


    6 Summary and Conclusion

Credit scoring represents a set of common techniques used to decide whether a bank should grant a loan (or issue a credit card) to an applicant or not. We have presented several methods and shown their usage in XploRe. The results of these methods are summarized in Table 6.1.

Artificial neural networks are very flexible models; however, they show slightly worse performance than the traditional logistic regression, which misclassifies 10 observations (0.5%) fewer than the best of the neural network models. The logit model misclassifies 98 good clients and denotes them as bad, while 101 of the bad clients are granted the loan, as they are supposed to be creditworthy. Additionally, logistic regression provides statistical tests to identify how important each of the predictor variables is. Our analysis thus showed that the neural network models did not manage to beat the logit model in prediction. However, due to the flat-maximum effect (Lovie and Lovie, 1986), one is unlikely to achieve a great deal of improvement from better statistical modelling on the same set of matching variables.

    misclass.       logit    MLP          MLP          RBF           RBF
                            (23-12-1)    (17-13-1)    (23-100-1)    (17-80-1)
    good              98     105          107          106           103
    bad              101     110          110          109           106
    overall          199     215          217          215           209
    overall in %    9.7%    10.5%        10.6%        10.5%         10.2%

    Table 6.1: Misclassification rates of the methods tested.


    7 Extensions

The objective of most credit scoring models is to minimize the misclassification rate or the expected default rate. However, one should pay more attention to the term misclassification rate itself. Usually, the overall misclassification rates are compared to judge the prediction of various models, but there are two types of misclassification: one may denote a non-creditworthy client as creditworthy or, the other way round, denote a creditworthy client as non-creditworthy. The latter is a loss of profit, but it is not as bad as the former mistake, which means a direct loss for the bank. Therefore the bank is not trying to minimize the misclassification rate, but to maximize its profit. One possible solution would be the implementation of a cost matrix. The preferred method then minimizes the term L = r1 w1 + r2 w2, where r1 and r2 are the numbers of bad applicants classified as creditworthy and of good applicants classified as not creditworthy respectively, and w1, w2 are weights mirroring the loss and the lost profit. These weights have to be estimated. Unfortunately, there are not many papers on this topic yet. Tam and Kiang (1992) show that incorporating the cost matrix into neural networks and discriminant analysis is possible.
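A simple way to sketch this idea in Python is to choose the cut-off on the score scale that minimizes L rather than the raw error count (the weights w1 > w2 below are illustrative placeholders, not estimated values):

    import numpy as np

    # y: 0 = good, 1 = bad; a client is rejected when score > cutoff.
    # r1 = bad accepted (weight w1), r2 = good rejected (weight w2).
    def best_cutoff(scores, y, w1=5.0, w2=1.0):
        scores, y = np.asarray(scores), np.asarray(y)
        grid = np.unique(scores)
        losses = [((y == 1) & (scores <= t)).sum() * w1 +
                  ((y == 0) & (scores >  t)).sum() * w2
                  for t in grid]
        return grid[int(np.argmin(losses))]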

At present the emphasis is shifting from trying to minimize the chance that a client will default on a particular product to looking at how the firm can maximize the profit it can make from that client. As an example we can mention insurance sold on loans (Stanghellini, 1999): a non-creditworthy client may be profitable if he buys the insurance on his loan and becomes delinquent relatively late.

Credit scoring performs better than decisions made by loan officers. However, one bad property of scoring methods is that they are static and estimated at a particular time, so other tools should also be used. Jacobson and Roszbach (1998) showed that banks using credit scoring grant loans inconsistently with default risk minimization; they suggest value at risk (VaR) as a more adequate measure of losses than default risk. Future research should therefore concentrate more on such topics. We will study methods of credit scoring again in the near future, and in more detail, in a diploma thesis at the Charles University in Prague.


    Appendix

    A Radial Basis Function Neural Networks

This section describes radial basis function (RBF) neural networks and explains their implementation in XploRe. In general, artificial neural networks (ANN) appeared in the 1940s, but the progress and practical usage of these theories were enabled only in the last decade, especially with the development of personal computers. Nowadays they are very widely used in many research and commercial fields and may be successfully applied in credit scoring as well. The fashion of neural networks, originating in the analogy with brain nerve cells, gave rise to a new terminology, although the roots of neural networks lie in much older techniques. Table A.1 compares the different names for the same concepts in the artificial neural networks and statistics frameworks:

    statistics                 neural networks
    model                      network
    estimation                 learning
    regression                 supervised learning
    interpolation              generalization
    observations               training set
    parameters                 weights
    independent variables      inputs
    dependent variables        outputs

    Table A.1: Comparison of the neural networks and statistician terminology.

The simplest example of an ANN is the perceptron. Multi-layer perceptrons were described in Section 4.2; for a thorough explanation see Bishop (1995). Radial basis function neural networks are, like the MLPs, feed-forward networks, which means that the signal in the network is passed forward only. But unlike the MLPs, RBF networks stand for a class of neural network models in which the hidden units are activated according to the distance between the input vector and the unit's center. RBF networks combine two different types of learning: supervised and unsupervised. First, during the hidden layer training, the input vectors are joined into several clusters (unsupervised learning); afterwards, during the output layer training, the output of the RBF network is determined by supervised learning.

[Figure A.1 depicts the layout: an input layer x1, . . . , x4, a hidden layer z1, z2, z3 with a bias unit, and an output layer computing y = f(x).]

Figure A.1: RBF scheme for p = 4, q = 1, r = 3. Dashed lines show various weights of the connections.

While for the supervised learning we have both the independent variables and the response variable(s), the unsupervised learning must work without the knowledge of the response variable. This concept will become clearer in the next subsection.

Training an RBF network can be essentially faster than the methods used to train MLP networks. Furthermore, a multi-layer feed-forward network trained with backpropagation does not match the approximating capabilities of RBF networks. The theory of RBF neural networks is therefore still the subject of extensive ongoing research (Orr, 1999). One remarkable feature of RBF neural networks is:

    RBF networks possess the property of best approximation. An approximation scheme has this property if, in the set of approximating functions (i.e. the set of functions corresponding to all possible choices of the adjustable parameters), there is one function which has minimum approximating error for any given function to be approximated. This property is not shared by MLPs. (Bishop, 1995)


    A.1 The Model

Radial basis function neural networks have only one hidden layer. Each of the hidden units (the so-called clusters) implements a radial function, and the output units are weighted sums of the clusters' outputs. This is illustrated in Figure A.1. We suppose that we are given a set {xi}, i = 1, . . . , N, of N observations from a p-dimensional space. That is, we are given p input variables X1, . . . , Xp and, in general, q output variables Y1, . . . , Yq. Let f : R^p → R^q denote the function we want to approximate via the RBF neural network. The output of an RBF neural network with p input units, r clusters and one output unit is

    f(x) = Φ( w0 + Σ_{j=1}^{r} wj φj(x) ),

where w0 is the weight of the bias, wj, j = 1, . . . , r, are the output weights, φj is a radially symmetric function with two parameters cj and σj, and Φ is the output transfer function. During training, the observation points are first joined into r clusters, for instance by a K-means clustering algorithm. Each cluster is represented by a radial function. A radially symmetric function φ(·) is required to fulfill the condition that if ‖xi − c‖ = ‖xj − c‖ then φ(xi) = φ(xj). The norm ‖·‖ is usually taken to be the L2-norm. The best known radially symmetric function is the p-variate Gaussian function:

    φj(x) = exp( −‖x − cj‖² / (2σj²) ) ,   σj > 0 .

Another popular activation function is the generalized inverse multi-quadric function:

    φj(x) = (‖x − cj‖² + σj²)^(−α) ,   σj > 0, α > 0 .

Both of them have the property that φ → 0 as ‖x‖ → ∞. Other possible choices of radially symmetric functions are the thin-plate spline function:

    φj(x) = ‖x − cj‖² ln ‖x − cj‖ ,

or the generalized multi-quadric function:

    φj(x) = (‖x − cj‖² + σj²)^β ,   σj > 0, 1 > β > 0 .

The last two functions have the property that φ → ∞ as ‖x‖ → ∞. The above mentioned radially symmetric functions are plotted in Figure A.2.


[Figure A.2 shows four panels plotting the Gaussian, thin-plate spline, generalized inverse multi-quadric and generalized multi-quadric functions over x ∈ [−5, 5].]

    Figure A.2: Radially symmetric activation functions.

However, both theoretical and empirical studies show that estimation results are relatively insensitive to the exact form of the radially symmetric activation function. The most commonly used radial function is the p-variate Gaussian function. It has not only the attractive property of product separability (it can be rewritten as a product of p univariate functions), but also other useful analytical properties.
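For concreteness, the Gaussian basis defined above can be evaluated for a whole sample in a few lines of Python (a sketch, not the XploRe implementation):

    import numpy as np

    # phi_j(x) = exp(-||x - c_j||^2 / (2 sigma_j^2)) for all inputs at once.
    def gaussian_rbf(X, centers, sigmas):
        # X: (N, p); centers: (r, p); sigmas: (r,) -> activations (N, r)
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-d2 / (2.0 * np.asarray(sigmas) ** 2))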

We must estimate the centers (cj) and widths (σj) of these clusters. The cluster weights are actually the coordinates of the cluster centers. To get the proper number of clusters one may grow the network to a suitable size starting from one cluster, or, alternatively, start with as many clusters as observations and trim the clusters as desired. However, the number of clusters (the number of units in the hidden layer) is typically much smaller than N. After the initial training, when the clusters are already known, one may apply supervised learning to estimate the weights of the output units. The determination of the cluster weights (i.e. the centers of the clusters) and widths at the first stage may be seen as a parametric approach.


The second stage of training may be seen as a nonparametric approach, so that RBF networks constitute a semiparametric procedure.
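The two training stages can be sketched in Python as follows (K-means for the centers, a plain least-squares fit for the output weights; the width heuristic is an assumption of this sketch, and gaussian_rbf refers to the function sketched above):

    import numpy as np
    from scipy.cluster.vq import kmeans2

    def train_rbf(X, y, r=10):
        centers, _ = kmeans2(X, r, minit="++")                  # stage 1
        sigmas = np.full(r, np.ptp(centers) / np.sqrt(2 * r))   # heuristic widths
        Phi = np.c_[np.ones(len(X)), gaussian_rbf(X, centers, sigmas)]
        w, *_ = np.linalg.lstsq(Phi, y, rcond=None)             # stage 2
        return centers, sigmas, w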

Finally, let us mention that the radially symmetric constraint on the activation function is sometimes violated in order to decrease the number of units in the hidden layer or to improve the performance of the network. For instance, the multivariate Gaussian function may be generalized to an elliptically symmetric function by replacing the L2 norm by the Mahalanobis distance (Härdle and Simar, 2002).

    A.2 RBF Neural Networks in XploRe

In this section we briefly describe how to run a RBF neural network in XploRe.

    {inp,net,err} = rbftrain(x,y,clust,learn,epochs,mMSE,activ)

    trains a radial basis function neural network (slow)

    {inp,net,err} = rbftrain2(x,y,clust,learn,epochs,mMSE,activ)

    trains a radial basis function neural network (fast)

The quantlets rbftrain and rbftrain2 build a radial basis function neural network. They use the same algorithm; the only difference is that the former is written directly in XploRe while the latter uses a dynamically linked library written in the C programming language. On account of this, rbftrain works slowly, but allows the user to change the source code directly, to add new features and to see exactly what is happening. On the other hand, rbftrain2 is a closed product, which cannot be changed, but runs fast.

The input parameters x and y are the input and output variables respectively. We assume that x and y have dimensions N × p and N × q respectively. The number of units in the hidden layer is given by the parameter clust; it must be determined by the user, usually with clust much smaller than N. The parameter learn contains the learning rates, one for training the hidden layer and one for training the output weights. Each of these learning rates must be from the range (0, 1). The vector epochs has two rows: the first row is the number of training epochs for the hidden layer and the second row contains the number of epochs to train the output layer. The training is stopped either when the output units have already been trained epochs[2] times or when the mean squared error reaches the value given by the parameter mMSE. The optional input parameter activ

determines whether the bipolar sigmoid function (activ = 1):

    (1 − e^(−W)) / (1 + e^(−W)) ,

should be used instead of the default binary activation function (activ = 0):