
Advances in Credit Scoring: combining performance and interpretation in Kernel Discriminant Analysis

Caterina Liberati

DEMS Università degli Studi di Milano-Bicocca, Milan, Italy

[email protected]


Outline

1 Motivation

2 Kernel-Induced Feature Space

3 Our Proposal

4 Examples


Motivation

Credit Scoring: Performance vs Interpretation

Learning Task with Standard Techniques

The objective of quantitative Credit Scoring (CS) is to develop accurate models that can distinguish between good and bad applicants (Baesens et al., 2003).

CS is typically cast as a supervised classification problem: linear discriminant analysis (Mays, 2004; Duda et al., 2000), logistic regression, and their variations (Wiginton, 1980; Hosmer and Lemeshow, 1989; Back et al., 1996)

Modeling CS with Machine Learning Algorithms

A variety of machine learning techniques have been applied to CS modeling: Neural Networks (Malhotra and Malhotra, 2003; West, 2000), Decision Trees (Huang et al., 2006), k-Nearest Neighbor classifiers (Henley and Hand, 1996; Piramuthu, 1999)

Comparisons with standard data mining tools have highlighted the superiority of these algorithms over the standard classification tools


Motivation

Credit Scoring: Performance vs Interpretation

Kernel-based Discriminants

Significant theoretical advances in Machine Learning produced a new category of algorithms based on the work of Vapnik (1995-1998). He points out that learning can be simpler if one uses low-complexity classifiers in a high-dimensional space (F).

The usage of kernel mappings makes it possible to project data implicitly into the Feature Space (F) through the inner product operator

Due to their flexibility and remarkably good performance, the popularity of such algorithms grew quickly.

Performance vs Interpretation

Kernel-based classifiers are able to capture non-linearities in the data; at the same time, they are unable to provide an explanation, or a comprehensible justification, for the solutions they reach (Barakat and Bradley, 2010).


Kernel-Induced Feature Space

Complex Classification Tasks

[Scatter plot of two classes of points (o and x) in the (x1, x2) plane.]

Figure: Examples of complex data structures.


Kernel-Induced Feature Space

Do we need Kernels? The complexity of the target function to be learned depends on the way it is represented, and the difficulty of the learning task can vary accordingly (figure from Schölkopf and Smola (2002)).

\phi : \mathbb{R}^2 \to \mathbb{R}^3, \qquad (x_1, x_2) \mapsto (z_1, z_2, z_3) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)

(\phi(x), \phi(z)) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)(z_1^2, \sqrt{2}\, z_1 z_2, z_2^2)^T = ((x_1, x_2)(z_1, z_2)^T)^2 = (x \cdot z)^2
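A quick numerical check of this identity (a minimal NumPy sketch; the feature map and the random test points are the only ingredients, both taken from the mapping above):

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map R^2 -> R^3 from the slide."""
    x1, x2 = v
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

rng = np.random.default_rng(0)
x, z = rng.normal(size=2), rng.normal(size=2)

lhs = phi(x) @ phi(z)        # inner product computed in the feature space F
rhs = (x @ z) ** 2           # degree-2 polynomial kernel evaluated in input space
print(np.isclose(lhs, rhs))  # True: the kernel computes <phi(x), phi(z)> implicitly
```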


Kernel-Induced Feature Space

Making Kernels

A kernel converts a non-linear problem into a linear one by projecting the data onto a high-dimensional Feature Space F without knowing the mapping function explicitly.

k : \mathcal{X}^2 \to \mathbb{R} is a function which, for all pattern sets \{x_1, x_2, \ldots, x_n\} \subset \mathcal{X} with \mathcal{X} \subset \mathbb{R}^p, gives rise to positive (semi-)definite matrices K_{ij} = k(x_i, x_j)

If Mercer's theorem is satisfied (Mercer, 1909), the kernel k corresponds to mapping the data into a possibly high-dimensional dot product space F by a (usually nonlinear) map \phi : \mathbb{R}^p \to F and taking the dot product there (Vapnik, 1995), i.e.

k(x, z) = (\phi(x) \cdot \phi(z))     (1)

If Mercer's theorem is satisfied, k is the reproducing kernel of a Reproducing Kernel Hilbert Space (RKHS).
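As an illustration of the Mercer condition, the sketch below builds the Gram matrix of a Gaussian kernel on random patterns and verifies that it is positive semi-definite; the kernel choice, the parameter c, and the random data are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(x, z, c=1.0):
    """k(x, z) = exp(-||x - z||^2 / (2 c^2)), a Mercer kernel."""
    return np.exp(-np.sum((x - z) ** 2) / (2 * c ** 2))

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))                     # 50 patterns in R^4

K = np.array([[gaussian_kernel(xi, xj) for xj in X] for xi in X])
eigvals = np.linalg.eigvalsh(K)
print(eigvals.min() >= -1e-10)   # non-negative spectrum: the Gram matrix is PSD
```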


Kernel-Induced Feature Space

Advantages of Learning with Kernels

Among others, an RKHS has a nice property:

K(x, z)^2 \le K(x, x) \cdot K(z, z) \qquad \forall x, z \in \mathcal{X}     (2)

The Cauchy-Schwarz inequality allows us to view K as a measure of similarity between inputs: if x, z \in \mathcal{X} are similar then K(x, z) will be closer to 1, while if x, z \in \mathcal{X} are dissimilar then K(x, z) will be closer to 0.

K(x, z) encodes the similarities among instances.

The freedom to choose the kernel k enables us to design a large variety of learning algorithms.

If the map is chosen suitably, complex relations can be simplified and easily detected.


Kernel-Induced Feature Space

Kernel Discriminant Analysis

Assume that we are given the input data set I_{XY} = \{(x_1, y_1), \ldots, (x_n, y_n)\} of training vectors x_i \in \mathcal{X} with corresponding labels y_i \in \mathcal{Y} = \{1, 2\}.

The class separability in the direction of the weights \omega \in F is obtained by maximizing the Rayleigh coefficient (Baudat and Anouar, 2000):

J(\omega) = \frac{\omega' S_B^{\phi} \omega}{\omega' S_W^{\phi} \omega}     (3)

From the theory of reproducing kernels, the solution \omega \in F must lie in the span of all the training samples in F.

We can notice that \omega can be formed by a linear expansion of the training samples as follows:

\omega = \sum_{i=1}^{n} \alpha_i \phi(x_i)     (4)


Kernel-Induced Feature Space

Kernel Discriminant Analysis

As already shown by Mika et al. (2003), \omega' S_B^{\phi} \omega and \omega' S_W^{\phi} \omega can easily be written as

\omega' S_B^{\phi} \omega = \alpha' M \alpha     (5)

where

M = (m_1 - m_2)(m_1 - m_2)'

(m_g)_i = \frac{1}{n_g} \sum_{k=1}^{n_g} k(x_i, x_k^g), \quad i = 1, \ldots, n, \quad g = 1, 2.

\omega' S_W^{\phi} \omega = \alpha' N \alpha     (6)

N = \sum_{g=1}^{2} K_g (I - L_g) K_g'

K_g is the kernel matrix with generic element (i, k) equal to k(x_i, x_k^g)

I is the identity matrix

L_g is a matrix with all entries equal to n_g^{-1}


Kernel-Induced Feature Space

Kernel Discriminant Analysis

These results allow the optimization problem of eq. (3) to boil down to finding the class separability directions \alpha of the following maximization criterion:

J(\alpha) = \frac{\alpha' M \alpha}{\alpha' N \alpha}     (7)

This problem can be solved by finding the leading eigenvectors of N^{-1} M. Since this setting is ill-posed, because N is at most of rank n-1, we employed a regularization method. The classifier is:

f(x) = \sum_{i=1}^{n} \alpha_i k(x_i, x) + b     (8)

b = \frac{1}{2} \alpha' (m_1 + m_2)     (9)
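A compact sketch of the computation just described: Gram matrix, M and N as in eqs. (5)-(6), a regularized eigenproblem for the coefficients α, and the classifier of eqs. (8)-(9). The kernel function, the labels coded as {1, 2}, and the regularization constant `reg` are illustrative assumptions, not the author's exact settings.

```python
import numpy as np

def kda_fit(X, y, kernel, reg=1e-3):
    """Two-class Kernel Discriminant Analysis following eqs. (5)-(9)."""
    n = X.shape[0]
    K = np.array([[kernel(a, b) for b in X] for a in X])    # n x n Gram matrix

    idx = [np.where(y == g)[0] for g in (1, 2)]              # group indices
    m = [K[:, ig].mean(axis=1) for ig in idx]                # vectors m_g
    M = np.outer(m[0] - m[1], m[0] - m[1])                   # between-class term

    N = np.zeros((n, n))
    for ig in idx:
        Kg = K[:, ig]                                        # n x n_g block
        Lg = np.full((len(ig), len(ig)), 1.0 / len(ig))
        N += Kg @ (np.eye(len(ig)) - Lg) @ Kg.T              # within-class term
    N += reg * np.eye(n)                                     # regularization (N is singular)

    # leading eigenvector of N^{-1} M gives the expansion coefficients alpha
    evals, evecs = np.linalg.eig(np.linalg.solve(N, M))
    alpha = np.real(evecs[:, np.argmax(np.real(evals))])
    b = 0.5 * alpha @ (m[0] + m[1])                          # bias as in eq. (9)
    return alpha, b

def kda_score(x_new, X, alpha, b, kernel):
    """Classifier of eq. (8): f(x) = sum_i alpha_i k(x_i, x) + b."""
    return sum(a * kernel(xi, x_new) for a, xi in zip(alpha, X)) + b
```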


Kernel-Induced Feature Space

KDA into SVM formulation

The kernel discriminant can be recast in an SVM formulation as an LS-SVM. Consider a binary classification model in the Reproducing Kernel Hilbert space:

f(x) = \omega' \phi(x) + b     (10)

where \omega is the weight vector in the RKHS and b \in \mathbb{R} is the bias term. The discriminant function of LS-SVM classifiers (Suykens and Vandewalle, 1999) is constructed by minimizing the following problem:

\min J(\omega, e) = \frac{1}{2} \omega' \omega + \frac{1}{2} \gamma \sum_{i=1}^{n} e_i^2     (11)

subject to: y_i = \omega' \phi(x_i) + b + e_i, \quad i = 1, 2, \ldots, n


Kernel-Induced Feature Space

KDA into SVM formulation

The Lagrangian of problem (11) is expressed by:

L(\omega, b, e; \alpha) = J(\omega, e) - \sum_{i=1}^{n} \alpha_i \left( \omega' \phi(x_i) + b - y_i + e_i \right)     (12)

where \alpha_i \in \mathbb{R} are the Lagrange multipliers, which can be positive or negative in this formulation. The conditions for optimality yield:

\partial L / \partial \omega = 0 \Rightarrow \omega = \sum_{i=1}^{n} \alpha_i \phi(x_i)
\partial L / \partial b = 0 \Rightarrow \sum_{i=1}^{n} \alpha_i = 0
\partial L / \partial e_i = 0 \Rightarrow \alpha_i = \gamma e_i
\partial L / \partial \alpha_i = 0 \Rightarrow y_i (\omega' \phi(x_i) + b) - 1 + e_i = 0, \quad \forall i = 1, 2, \ldots, n
     (13)

The solution is found by solving the system of linear equations in eq. (13) (Kuhn and Tucker, 1951). The fitting function, namely the output of the LS-SVM, is:

f(x) = \sum_{i=1}^{n} \alpha_i k(x_i, x) + b     (14)
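In practice the optimality conditions above collapse into one linear system. The sketch below solves the LS-SVM in the function-estimation form stated earlier (y_i = ω'φ(x_i) + b + e_i); the kernel and γ are illustrative choices rather than the author's settings.

```python
import numpy as np

def lssvm_fit(X, y, kernel, gamma=1.0):
    """Solve the LS-SVM KKT conditions as one linear system.

    With alpha_i = gamma * e_i and y_i = w'phi(x_i) + b + e_i, the conditions
    reduce to:
        [ 0     1_n'       ] [ b     ]   [ 0 ]
        [ 1_n   K + I/gamma] [ alpha ] = [ y ]
    """
    n = X.shape[0]
    K = np.array([[kernel(a, c) for c in X] for a in X])
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = K + np.eye(n) / gamma
    rhs = np.concatenate(([0.0], y.astype(float)))
    sol = np.linalg.solve(A, rhs)
    b, alpha = sol[0], sol[1:]
    return alpha, b

def lssvm_predict(x_new, X, alpha, b, kernel):
    """f(x) = sum_i alpha_i k(x_i, x) + b, as in eq. (14)."""
    return sum(a * kernel(xi, x_new) for a, xi in zip(alpha, X)) + b
```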


Kernel-Induced Feature Space

LS-SVM vs SVM

The major drawback of the SVM lies in its estimation procedure, which is based on constrained optimization programming (Wang and Hu, 2015); the computational burden therefore becomes particularly heavy for large-scale problems.

In such cases the LS-SVM is preferred, because its solution is based on solving a linear set of equations (Suykens and Vandewalle, 1999).

KDA vs SVM

SVMs do not deal with multi-class problems directly when the data present more than 2 groups, unless one-against-all (OAA) or one-against-one (OAO) classification schemes are used.


Kernel-Induced Feature Space

Kernel Settings

The most common kernel mappings:

Kernel                Mapping k(x, z)
Cauchy                1 / (1 + ||x - z||^2 / c)
Laplace               exp(-||x - z|| / c^2)
Multi-quadric         sqrt(||x - z||^2 + c^2)
Polynomial degree 2   (x · z)^2
Gaussian (RBF)        exp(-||x - z||^2 / (2 c^2))
Sigmoidal (SIG)       tanh[c (x · z) + 1]

The tuning parameter is set through a grid search algorithm.

Regularization methods are used to overcome the singularity of S_W^{\phi} (Friedman, 1989; Mika, 1999).

Model selection criteria are used to choose the best kernel function (error rate, AUC, information criteria).
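For concreteness, a small sketch of the kernel mappings listed above together with the outline of a grid search over the tuning constant c; the grid values are illustrative, and the actual fitting and scoring step is only indicated.

```python
import numpy as np

def make_kernels(c):
    """A few of the mappings listed above, parameterized by the tuning constant c."""
    return {
        "cauchy":       lambda x, z: 1.0 / (1.0 + np.sum((x - z) ** 2) / c),
        "laplace":      lambda x, z: np.exp(-np.linalg.norm(x - z) / c),
        "multiquadric": lambda x, z: np.sqrt(np.sum((x - z) ** 2) + c ** 2),
        "poly2":        lambda x, z: (x @ z) ** 2,
        "rbf":          lambda x, z: np.exp(-np.sum((x - z) ** 2) / (2 * c ** 2)),
        "sigmoid":      lambda x, z: np.tanh(c * (x @ z) + 1),
    }

# A grid search loops over candidate values of c (and over the kernels),
# fits the discriminant for each setting, and keeps the one with the lowest
# validation error rate or the highest AUC.
for c in [0.5, 1.0, 2.0, 5.0, 10.0]:     # illustrative grid
    kernels = make_kernels(c)
    # ... fit the kernel discriminant with each kernel and score on a validation split ...
```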


Our Proposal

Our Proposal: An operative strategy

Our goal is NOT to derive a new classification model that discriminates better than others previously published.

I Selection of the best Kernel Discriminant function
  (a) Computing the kernel matrix using the original variables as inputs
  (b) Performing Kernel Discriminant Analysis (KDA) with different kernel maps
  (c) Selecting the best kernel discriminant f(x) via the minimum misclassification error rate or the maximum AUC

II Reconstruction of the Kernel Discriminant function through a linear regression (see the sketch after this list)
  (a) Performing a linear regression where f(x) is the target and the original variables are the predictors
  (b) Studying the goodness of fit of the linear reconstruction
  (c) If II.(b) is satisfactory, using the regression estimate f̂(x) as the new classifier

III Application of the rule to test data
  (a) Applying to the test set the regression parameters estimated at II.(a)
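A minimal sketch of steps II-III, under the assumption that `f_train` holds the scores of the kernel discriminant selected in step I (for instance from the KDA sketch given earlier); the helper name `reconstruct_score` is illustrative.

```python
import numpy as np

def reconstruct_score(X_train, f_train, X_test):
    """Reconstruct the kernel score by a linear regression on the original variables."""
    Z = np.column_stack([np.ones(len(X_train)), X_train])    # add intercept
    coef, *_ = np.linalg.lstsq(Z, f_train, rcond=None)        # II.(a) linear regression

    fitted = Z @ coef
    ss_res = np.sum((f_train - fitted) ** 2)
    ss_tot = np.sum((f_train - f_train.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot                                # II.(b) goodness of fit

    Z_test = np.column_stack([np.ones(len(X_test)), X_test])
    f_hat_test = Z_test @ coef                                # III.(a) apply the rule to the test set
    return coef, r2, f_hat_test
```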


Our Proposal

Our Proposal: An operative strategy

One may be surprised that the linear approximation of the kernel rule by the input variables does not coincide with the direct regression of the response variable on the input variables.

Theorem

Let A and B be two pre-Hilbertian subspaces such that B ⊂ A, and let P_A and P_B be the orthogonal projectors onto A and B. Then, for any y, P_B(P_A(y)) = P_B(y).

Since the vector space linearly spanned by the input variables is embedded in the Feature Space, there should be no gain in approximating y by the kernel classifier (14) and then approximating f(x) by a linear combination of the x_i, instead of projecting directly onto the x_i.

This paradox disappears if we notice that the LS-SVM classifier (14) does not correspond to the orthogonal projection onto the Feature Space, or, in other words, to the least squares approximation of the binary response.


Examples

Example 1: psycho credit scoring

The total sample is composed of 7699 self-employed customers (entrepreneurs, artisans, freelancers) of an Italian bank, distributed into two classes (6160 "good", 1539 "bad").

Training set → 4619 instances, Test set → 3080 instances

Our database is composed of 4 sets of quantitative variables and 1 target variable:

1 y: target variable which identifies the trespassing of the credit limit by the clients.

2 CREDIT: set of 1 scale variable provided by the credit bureau which measures the solvency statement of the subjects.

3 MANAGE: set of 25 dichotomous variables (yes/no) related to the customers' usage of the banking services (e.g. bank account, credit card, payment of utilities, accrual of salary, etc.), synthesized via MCA.

4 ECO: set of 7 scale variables related to the cash flow and the economic returns of the financial activities operated by the users (e.g. monthly revenue produced by the customers, financial assets held by the customers, etc.).

5 SEMIO: set of 5 scale variables that synthesize the psychological traits of the subjects.


Examples

Example 1: data preparation

Sémiométrie (Lebart et al., 2003) is a list of 210 graphical forms (nouns, adjectives, or verbs) rated by respondents in terms of sensation (pleasant = +3 to unpleasant = -3).

Sample: 16,582 individuals aged 18 and over, interviewed between 1990 and 2002. The ratings were synthesized via Principal Component Analysis.

According to the results, only the first 6 factorial axes were interpreted:
1 Pc1 - named axis of Participation. It is not a psychological trait.
2 Pc2 - named Duty (-)/Pleasure (+)
3 Pc3 - named Attachment (-)/Detachment (+)
4 Pc4 - named Sublimation (-)/Materialism (+)
5 Pc5 - named Idealization (-)/Pragmatism (+)
6 Pc6 - named Humility (-)/Sovereignty (+)

SEMIO is obtained by a supplementary projection of the points onto the subspace spanned by the 5 semiometric factors (Pc2-Pc6):

f_{sup} = X^{+} U     (15)

where X^{+} is our standardized data matrix (supplementary observations) and U are the original eigenvectors obtained by the spectral decomposition of the Sémiométrie data.
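A minimal sketch of the supplementary projection of eq. (15), assuming the new observations are standardized with the means and standard deviations of the original Sémiométrie sample; function and argument names are illustrative.

```python
import numpy as np

def semio_projection(X_new, mean, std, U):
    """Supplementary projection of eq. (15): f_sup = X^+ U.

    X_new     : raw answers of the supplementary observations (bank customers)
    mean, std : column means / standard deviations from the original Semiometrie sample
    U         : eigenvectors of the original spectral decomposition, restricted to Pc2-Pc6
    """
    X_std = (X_new - mean) / std      # standardize with the ORIGINAL sample parameters
    return X_std @ U                  # coordinates on the retained semiometric factors
```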


Examples

Example 1: classification results

Table: Average classification performance statistics on test sets (50 runs)

Classifier  Parameter  Variables set             Error Rate  AUC    Class Accuracy (good)  Class Accuracy (bad)
CAU         3.6786     CREDIT+ECO+MANAGE+SEMIO   0.186       0.850  0.863                  0.619
LAP         3.6786     CREDIT+ECO+MANAGE+SEMIO   0.199       0.852  0.831                  0.678
MUL         5.8893     CREDIT+ECO+MANAGE+SEMIO   0.220       0.873  0.769                  0.826
RBF         3.6786     CREDIT+ECO+MANAGE+SEMIO   0.210       0.856  0.801                  0.748
POLY        2          CREDIT+ECO+MANAGE+SEMIO   0.333       0.566  0.733                  0.398
LDA         -          CREDIT+ECO+MANAGE+SEMIO   0.368       0.522  0.713                  0.300
LR          -          CREDIT+ECO+MANAGE+SEMIO   0.159       0.522  0.936                  0.458


Examples

Example 1: variables importance

Figure: Score values of the discriminant, by class (Bad vs Good)


Examples

Example 1: variables importance

Score  Variable  ΔR²  rWR  b  p-value

MUL (R² = 0.986 on the training sample with CREDIT+ECO+MANAGE+SEMIO sets)
  Pc2     0.177  18.40%  0.924  0.000
  Pc3     0.701  71.50%  1.836  0.000
  Bureau  0.080   8.90%  7.426  0.000

RBF (R² = 0.869 on the training sample with CREDIT+ECO+MANAGE+SEMIO sets)
  Pc2     0.160  18.90%  0.020  0.000
  Pc3     0.614  70.90%  0.040  0.000
  Bureau  0.069   8.70%  0.160  0.000

POLY (R² = 0.682 on the training sample with CREDIT+ECO+MANAGE+SEMIO sets)
  Interests on financial asset (F.)  0.018   3.30%    0.766  0.000
  Total financial assets managed     0.059  11.20%    0.001  0.000
  Factor 3                           0.040   7.80%   62.885  0.000
  Factor 4                           0.009  15.90%  100.860  0.000
  Factor 13                          0.009  14.00%   43.619  0.000

LDA (R² = 1 on the training sample with CREDIT+ECO+MANAGE+SEMIO sets)
  Pc2     0.176  18.10%  0.095  0.000
  Pc3     0.712  71.60%  0.190  0.000
  Bureau  0.082   9.00%  0.774  0.000

LR (R² = 0.394 on the training sample with CREDIT+ECO+MANAGE+SEMIO sets)
  Pc2     0.093  16.60%  1.031  0.000
  Pc3     0.298  63.10%  2.006  0.000
  Bureau  0.039   6.80%  7.888  0.000


Examples

Example 1: ROC of the reconstructed multiquadric score (AUC = 0.893)

score = 0.924·Pc2 + 1.836·Pc3 + 0.088·Bureau

Pc2 = Duty/Pleasure
Pc3 = Attachment/Detachment
Bureau = measure of solvency

[ROC curve: true positive rate (sensitivity) vs false positive rate (1-specificity).]

Figure: ROC of the reconstructed score


Examples

Example 2: SMEs Data

A real dataset provided by an Italian bank

y → default probability over the next 12 months

Our database is composed of 10 qualitative variables:

1 4 variables collected by a questionnaire administered to corporate customers. DOM1-DOM4 investigate some aspects of the corporate clients: the seniority of the company, the skills present in the company, the past experience in the market, and the personal assets of the owners.

2 4 default indicators provided by the Central Bureau of risk: CB1az and CB2az measure the risk related to the companies; CB1coll and CB2coll measure the risk related to the natural persons (collaborators) involved in the enterprises' ownership.

3 2 variables: CERI is a proxy of the non-standard behavior of firms estimated by the central risk of the bank; ANAG indicates the length of the relationship between the business and the bank.


Examples

Example 2: preprocessing

We randomly selected a large sample composed of 8700 instances.

The group distribution was: y = 1 "bad" (29% of the total sample) and y = 2 "good" (71% of the total sample).

We split the sample into a training set (4703 instances) and a test set (3997 instances).

Data were synthesized via Multiple Correspondence Analysis into 37 factorial axes.

The allocation rule of the units to one of the two groups is based on k-Nearest Neighbor with a window width δ = 10.


Examples

Example 2: classification results

Figure: Confusion matrices of different classifiers on training set


Examples

Example 2: classification results

Discriminant  AUC
CAUCHY        0.956
LAPLACE       0.915
RBF           0.890
LOGISTIC      0.842
LINEAR        0.713

Table: Area Under the Curve (AUC) on training sets

[ROC curves on the training set (sensitivity vs 1-specificity) for the Cauchy, Laplace, RBF, FLDA, and Logistic classifiers.]


Examples

Example 2: Assessing the reconstruction process

The figure below shows the relationship between the Cauchy kernel discriminant function and its linear reconstruction.


Examples

Example 2: classification accuracy

Table: Correct Classification Rates for different methods on the test data.

Correct Classification Rates
Discriminant Rules    class 1  class 2  overall
CAUCHY                71.48    74.78    73.82
LOGISTIC REGRESSION   54.91    89.88    79.90
FLDA                  61.88    56.57    58.08
RECONSTRUCTED         73.80    74.51    74.30

The results highlight the very good performance of the Cauchy kernel discriminant with respect to the other classifiers. Logistic regression is the best in terms of overall accuracy, but if we compare the rules in terms of correct predictions in both classes, the reconstructed Cauchy kernel discriminant is more balanced and effective.


Examples

Example 2: Characterization of the test partition

The characterization of the test partition has been carried out by ranking the characterizing variables of a group by means of a probabilistic criterion, the value-test, based on V ~ Hyp(n, n_ν, n_q), where
n = sample size
n_q = number of instances sampled without replacement belonging to the q-th group
n_ν = number of instances with the ν-th category.

Table: Categories characterizing the group of the bad instances classified as bad.

Characteristic   % of category   % of category   % of group    V-Test   P-value
categories       in group        in sample       in category
                 n_νq/n_q        n_ν/n           n_νq/n_ν
CB2_az=1         76.08           28.52           59.37         38.42    0.000
CB1_az=1         55.31           23.83           51.64         26.39    0.000
CERI=1           40.02           17.31           51.45         21.09    0.000
CB2_coll=1       18.62            8.89           46.62         11.92    0.000
DOM4=1           46.04           31.60           32.43         11.47    0.000
DOM3=1           13.22            5.70           51.58         11.13    0.000
DOM2=2           23.56           13.91           37.70          9.98    0.000
DOM1=1           30.67           20.23           33.73          9.44    0.000
CB1_coll=1       14.30            7.96           39.95          8.26    0.000
ANAG=1           35.34           28.08           28.01          5.98    0.000
DOM2=1            4.41            2.22           44.14          5.09    0.000
ANAG=2           32.01           26.24           27.15          4.86    0.000
DOM1=2           29.59           24.96           26.38          3.96    0.000
DOM2=3           26.71           22.99           25.85          3.26    0.001

n_νq = number of instances with the ν-th category in group q
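A sketch of the value-test used in the tables above, written with SciPy's hypergeometric distribution; the upper-tail (over-representation) convention shown here is an assumption, since the slide does not state the exact one- or two-sided choice.

```python
from scipy.stats import hypergeom, norm

def value_test(n, n_nu, n_q, n_nuq):
    """Value-test for one category.

    n      : sample size
    n_nu   : instances carrying the nu-th category in the whole sample
    n_q    : instances in group q
    n_nuq  : instances of group q carrying the nu-th category

    Under H0 the count follows Hyp(n, n_nu, n_q); the tail probability is
    converted to a standard normal quantile, so large positive values flag
    categories that are over-represented in the group.
    """
    p_value = hypergeom.sf(n_nuq - 1, n, n_nu, n_q)   # P(X >= n_nuq)
    return norm.isf(p_value), p_value

# Usage (hypothetical counts): v, p = value_test(n=3997, n_nu=..., n_q=..., n_nuq=...)
```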


Examples

Table: Categories characterizing the group of the good instances classified as good.

Characteristic   % of category   % of category   % of group    V-Test   P-value
categories       in group        in set          in category
CB2_az=4         26.09           16.49           89.08         22.19    0.000
CB2_az=3         27.91           18.41           85.33         20.66    0.000
ANAG=4           28.87           21.15           76.82         15.53    0.000
DOM2=4           70.21           60.88           64.92         15.33    0.000
CERI=5           56.45           47.13           67.43         15.03    0.000
CB1_az=5         28.47           21.59           74.24         13.67    0.000
CERI=4           14.50            9.97           81.93         12.67    0.000
CB2_coll=4        8.53            5.48           87.59         11.45    0.000
CB1_az=4         22.11           16.91           73.61         11.34    0.000
DOM3=5           21.54           16.77           72.32         10.41    0.000
DOM4=5           21.01           16.35           72.34         10.27    0.000
CB2_az=5         22.64           17.85           71.41         10.18    0.000
DOM1=4           28.94           23.81           68.40          9.72    0.000
CB2_coll=3        8.67            6.14           79.48          8.73    0.000
CB1_coll=4        7.68            5.48           78.83          7.97    0.000
DOM4=4           17.60           14.23           69.62          7.81    0.000
DOM1=3           35.30           31.00           64.11          7.47    0.000
CB1_az=3         18.81           15.59           67.91          7.16    0.000
CB2_coll=2        7.93            6.54           68.20          4.49    0.000
DOM3=4            4.98            3.96           70.71          4.17    0.000
CB1_coll=2        7.15            5.96           67.45          3.99    0.000
DOM4=3           26.20           24.13           61.11          3.85    0.000
CB1_coll=3        6.40            5.40           66.67          3.51    0.000
CB2_az=2         20.33           18.73           61.11          3.27    0.001
ANAG=3           25.84           24.53           59.30          2.41    0.008


Examples

References

Akaike H (1973) Information theory and an extension of the maximum likelihood principle. In: Petrov BN, Csaki F (eds) Information Theory: Proceedings of the 2nd International Symposium, Akademiai Kiado, Budapest, Hungary, pp 267-281

Baesens B, Van Gestel T, Viaene S, Stepanova M, Suykens J, Vanthienen J (2003) Benchmarking state-of-the-art classification algorithms for credit scoring. Journal of the Operational Research Society 54:627-635

Barakat N, Bradley AP (2010) Rule extraction from support vector machines: a review. Neurocomputing 74:178-190

Baudat G, Anouar F (2000) Generalized discriminant analysis using a kernel approach. Neural Computation 12:2385-2404

Bozdogan H, Sclove LS (1984) Multi-sample cluster analysis using Akaike's Information Criterion. Annals of the Institute of Statistical Mathematics 36(1):163-180

Haff LR (1980) Empirical Bayes estimation of the multivariate normal covariance matrix. The Annals of Statistics 8(3):586-597

Huang YM, Hung C, Jiau HC (2006) Evaluation of neural networks and data mining methods on a credit assessment task for class imbalance problem. Nonlinear Analysis: Real World Applications 7:720-747

James W, Stein C (1961) Estimation with quadratic loss. In: Proceedings of the 4th Berkeley Symposium on Mathematical Statistics and Probability, vol 1, pp 361-379, Berkeley, CA, USA

Johnson RM (1966) The minimal transformation to orthonormality. Psychometrika 31:61-66

Johnson J (2000) A heuristic method for estimating the relative weight of predictor variables in multiple regression. Multivariate Behavioral Research 35(1):1-19

Lebart L, Piron M, Steiner JF (2003) La Sémiométrie. Dunod

Ledoit O, Wolf M (2004) A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis 88(2):365-411

Liberati C, Camillo F, Saporta G (2017) Advances in credit scoring: combining performance and interpretation in kernel discriminant analysis. Advances in Data Analysis and Classification 11(1):121-138


Examples

Malhotra R, Malhotra DK (2003) Evaluating consumer loans using neural networks. Omega 31:83-96

Mercer J (1909) Functions of positive and negative type and their connection with the theory of integral equations. Philos Trans R Soc Lond 209:415-446

Mika S, Rätsch G, Weston J, Schölkopf B, Müller KR (2003) Constructing descriptive and discriminative nonlinear features: Rayleigh coefficients in kernel feature spaces. IEEE Transactions on Pattern Analysis and Machine Intelligence 25(5):623-628

Schölkopf B, Smola AJ (2002) Learning with Kernels. MIT Press, Cambridge, MA

Suykens J, Vandewalle J (1999) Least squares support vector machine classifiers. Neural Processing Letters 9(3):293-300

Shurygin A (1983) The linear combination of the simplest discriminator and Fisher's one. In: Applied Statistics, Nauka, Moscow, pp 144-158 (in Russian)

Stein C (1975) Estimation of a covariance matrix. Rietz Lecture, Proceedings of the 39th Annual Meeting IMS, Atlanta, GA, USA

Vapnik V (1995) The Nature of Statistical Learning Theory. Springer, New York


Examples

Choice of ridge parameter

Naïve Ridge Estimators of the Covariance Matrix:

\Sigma_R = \hat{\Sigma}_{MLE} + \gamma \cdot I

Smoothed Covariance Estimators:

\Sigma_S = \hat{\Sigma}(1 - \rho) + \rho D     (16)

with 0 < \rho < 1 and D = \frac{tr(\hat{\Sigma})}{p} I_p

The structure minimizes the mean squared error (MSE): ||\hat{\Sigma} - \Sigma||_F^2

Maximum Likelihood/Empirical Bayes (MLE/EB) (Haff, 1980): \hat{\Sigma}_{MLE/EB} = \hat{\Sigma}_{MLE} + \frac{p-1}{n \cdot tr(\hat{\Sigma}_{MLE})} I_p

Stipulated Ridge (Shurygin, 1983): \hat{\Sigma}_{SRE} = \hat{\Sigma}_{MLE} + p(p-1)[2n \cdot tr(\hat{\Sigma}_{MLE})]^{-1} I_p

Convex Sum (Ledoit and Wolf, 2004): \hat{\Sigma}_{CSE} = \frac{n}{n+m}\hat{\Sigma} + \left(1 - \frac{n}{n+m}\right)\hat{D}
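A minimal sketch of the first estimators listed above (the naive ridge, the smoothed convex combination of eq. (16), and Shurygin's stipulated ridge); `S` stands for the sample/MLE covariance matrix, and the choice of γ or ρ is left to the user.

```python
import numpy as np

def ridge_covariance(S, gamma):
    """Naive ridge estimator: Sigma_R = Sigma_MLE + gamma * I."""
    return S + gamma * np.eye(S.shape[0])

def smoothed_covariance(S, rho):
    """Smoothed estimator of eq. (16): (1 - rho) * Sigma + rho * D,
    with D = (tr(Sigma) / p) * I and 0 < rho < 1."""
    p = S.shape[0]
    D = (np.trace(S) / p) * np.eye(p)
    return (1.0 - rho) * S + rho * D

def stipulated_ridge(S, n):
    """Shurygin (1983): Sigma_MLE + p(p-1) / (2 n tr(Sigma_MLE)) * I."""
    p = S.shape[0]
    return S + (p * (p - 1)) / (2.0 * n * np.trace(S)) * np.eye(p)
```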


Examples

Model Selection Criterion

The best kernel function, selected among all the competing models, is the one that:

minimizes the total error rate

maximizes the Area Under the ROC Curve

minimizes the Akaike criterion (in case a probabilistic discriminant is used). The computation of the information criteria (Akaike, 1973; Bozdogan and Sclove, 1984) is done under the normality assumption for each group, where n is the number of sample instances, p the number of variables, and X_g ~ N_p(\mu_g, \Sigma) for g = 1, 2:

AIC = np \log(2\pi) + n \log|n^{-1}\Sigma_W| + np + 2\left(2p + \frac{2p(p+1)}{2}\right)     (17)


Examples

Relative Weight of Predictor Variables

Assume X is an n × p matrix of variables and y is an n × 1 vector of scores. It is possible to find the singular value decomposition of X:

X = P \Delta Q'

Johnson (1966) showed that the best-fitting orthogonal approximation of X is obtained by:

\Lambda = Q \Delta Q' \quad \text{and} \quad \beta = P Q' y

The \epsilon measure (Johnson, 2000) quantifies the relative importance of the variables:

\epsilon = \Lambda^2 \beta^2

The rescaled Relative Weights (rWR) represent the proportion of predictable variance in y explained by the variables:

rWR = \epsilon / R^2

where R^2 is the R-squared of the estimated regression.
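A minimal sketch of the relative-weights computation, written in the correlation metric so that the raw weights ε sum to R² and the rescaled weights rWR sum to 1; this follows the standard Johnson (2000) recipe and is an illustration, not the author's exact code.

```python
import numpy as np

def relative_weights(X, y):
    """Johnson's (2000) relative weights for the predictors in X."""
    # Work in the correlation metric
    Rxx = np.corrcoef(X, rowvar=False)                                   # p x p predictor correlations
    Rxy = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])

    # Lambda = Rxx^(1/2): correlations between X and its orthogonal counterpart Z
    evals, V = np.linalg.eigh(Rxx)
    Lam = V @ np.diag(np.sqrt(evals)) @ V.T

    # beta: coefficients of y regressed on the orthogonal variables Z
    beta = np.linalg.solve(Lam, Rxy)

    eps = (Lam ** 2) @ (beta ** 2)     # raw relative weights, sum equals R^2
    r2 = eps.sum()
    return eps, eps / r2               # rWR = eps / R^2
```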
