
Statistics in Retail Finance Chapter 9: Fraud Detection


Statistics in Retail Finance

Chapter 9: Fraud Detection in Retail Credit



Overview >

Detection of fraud remains an important issue in retail credit.

Methods similar to scorecard development may be employed, but there are

some problems specific to this application area.

In this chapter we discuss:-

Types of fraud and size of the problem.

Automated fraud detection.

Two-class and one-class classifiers for fraud detection.

Parzen density estimation.

Evaluation issues for fraud detection.



References >

There is not too much material on fraud detection in retail finance.

The following sources should be useful.

Fraud The Facts (2012) Financial Fraud Action UK report (http://www.financialfraudaction.org.uk/download.asp?file=2699)

Anderson R (2007) The Credit Scoring Toolkit: theory and practice for retail credit risk management and decision automation. NY: OUP.

Hit ‘em where it hurts: Using analytics to lock up fraudsters. SAS white paper 2012.

Dorronsoro JR, Ginel F, Sanchez C and Santa Cruz C (1997) Neural fraud detection in credit card operations, IEEE Transactions on Neural Networks, Vol. 8, no. 4, July 1997.

Juszczak P, Adams NM, Hand DJ, Whitrow C, Weston DJ (2008) Off-the-peg and bespoke classifiers for fraud detection, Computational Statistics and Data Analysis 52, 4521-4532.



Types of fraud >

Theft fraud. A credit card is physically stolen or lost and used by someone

other than the card holder.

Card mail non-receipt fraud. A type of theft, but before the genuine

card holder gets the card.

Counterfeit fraud. A credit card is physically faked and used.

Application fraud. An individual applies for credit deliberately using false

information.

Bankruptcy fraud. A person receives and uses credit knowing that

they will be personally bankrupt in future.



Behavioural fraud / Card-not-present (CNP) fraud. Credit card details are taken and used remotely by someone other than the card holder. Common in telephone sales, internet commerce and mail order.

Example of real fraud

http://www.bbc.co.uk/news/uk-england-somerset-20505489



Cost and detection of fraud >

The loss due to credit card fraud increases strongly with the length of time between the start of the fraud and the point at which it is detected and the credit is stopped.

When is fraud detected?

For stolen or lost cards, a card can be stopped as soon as it is reported

missing.

For application and bankruptcy fraud, a problem may only become

apparent when payments become due and are not met. For a personal

loan, the whole amount could be lost.

Counterfeit and behavioural fraud may only be detected when a

customer spots an anomalous transaction on his/her account statement

and reports this to the bank.

Analytic methods in banks can be used to detect fraudulent behaviour.



Size of the fraud problem >

Cost of retail credit fraud in UK (2001 to 2011).

Source: FFA UK (2012)

Note: In 2004, chip-and-pin was introduced and this has been quoted as part of the

reason for reduction in fraud losses from 2008.

[Figure: annual fraud losses in £ million (vertical axis, 0 to 700) for each year from 2001 to 2011, broken down by type: Mail non-receipt, Card ID theft, Lost/stolen, Counterfeit, Card-not-present.]



Automated fraud detection >

Automated methods are applied to detect behavioural fraud.

The main issue here is the timeliness of the detection, to shorten the

amount of time the fraud is operating.

Usually automated methods generate fraud alerts that are followed up

manually.

Note, not all fraud alerts will turn out to be genuine fraud; many will be

false alarms.

This is a type of classification problem, to distinguish between legitimate transactions ($y = 1$) and fraudulent transactions ($y = 0$).



Special considerations for fraud detection >

There are some special problems for fraud detection:

1. Need to process millions of transactions in real time.

2. Highly imbalanced classification problem. Ratio of fraudulent to legitimate transactions is typically less than 1:1000.

3. Nature of fraud is reflexive. That is, fraudsters adapt to the detection methods applied by banks to stop them.

However, unlike application model development, there is less need to build an explanatory model, so complex structured non-linear models can be considered.



Automated fraud detection methods >

There are four categories of methods:-

1. Business rules

2. Predictive models

3. Anomaly detection

4. Social network analysis



Method 1: Business rules >

The simplest approach is to encode expert business knowledge of fraudulent behaviour as rules in a computer-based expert system.

A typical rule is:-

Generate a fraud alert if

a credit card is used abroad

and it has not been used in that country in the past year

and the credit card holder has not told the bank they will be visiting

that country.
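The rule above can be written directly as code. The following R sketch is illustrative only: the function name fraud_alert_abroad and its arguments (txn_country, home_country, countries_used_past_year, notified_countries) are hypothetical and are not taken from any real bank system.

```r
# Hypothetical helper: returns TRUE when the example business rule fires.
# All argument names are illustrative assumptions, not a real system's schema.
fraud_alert_abroad <- function(txn_country, home_country,
                               countries_used_past_year,
                               notified_countries) {
  used_abroad  <- txn_country != home_country
  not_recent   <- !(txn_country %in% countries_used_past_year)
  not_notified <- !(txn_country %in% notified_countries)
  used_abroad && not_recent && not_notified
}

# Card held in GB, used in FR, no FR use in the past year, no travel notice
fraud_alert_abroad("FR", "GB", c("GB", "ES"), character(0))   # TRUE

# Same transaction, but the holder told the bank about the trip
fraud_alert_abroad("FR", "GB", c("GB", "ES"), "FR")           # FALSE
```

Each condition of the rule maps to one logical value, so further conditions can be added without changing the structure.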



Method 2: Predictive models >

We treat fraud detection as a classification problem and use a two-class

classifier. The result is a fraud scorecard.

Usually the fraud score is defined so that low scores indicate a higher level of fraud risk and high scores a lower level of fraud risk.

Choose a classifier based on a model with functional form $f$, such that $\hat{y} = f(\mathbf{x}; \boldsymbol{\theta})$ for a transaction $\mathbf{x}$ and some model parameters $\boldsymbol{\theta}$.

Estimate $\boldsymbol{\theta}$ from training data of past transactions that include fraud.



To deal with the high imbalance between classes, a simple filter can be

applied first to detect and remove obviously legitimate transactions and

so increase the ratio of fraudulent to legitimate transactions in the

training data.

o For example, inactive accounts and low value or repeated

transactions could be removed.
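A minimal R sketch of such a pre-filter follows. The data frame, its column names (amount, account_active, fraud) and the cut-off of 5 are all assumptions made for illustration; a real filter would be chosen from the bank's own data.

```r
# Toy transactions; columns and values are invented for illustration.
txns <- data.frame(
  amount         = c(2.5, 120, 3.0, 950, 1.2, 40, 4.9, 300),
  account_active = c(TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE),
  fraud          = c(0, 0, 0, 1, 0, 0, 0, 1)
)

# Assumed filter: drop inactive accounts and low-value transactions.
min_amount <- 5
kept <- txns[txns$account_active & txns$amount >= min_amount, ]

# Fraction of fraudulent transactions before and after filtering
ratio_before <- mean(txns$fraud == 1)
ratio_after  <- mean(kept$fraud == 1)
c(before = ratio_before, after = ratio_after)
```

In this toy data no fraudulent transaction is removed by the filter. In practice that would need to be checked, since any fraud discarded at this stage can never be detected by the model.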

Research results and past experience show that models based on linear

combinations of predictor variables such as OLS and logistic regression

are not sufficient.

Non-linear classifiers such as artificial neural networks (ANN) are

effective and used in practice (eg SAS fraud tools).

We do not have the scope to present ANNs in this course.



We can expect to have good results for types of fraud that are the same

as the ones in the training data. This is because the two-class classifier

is a model of the fraudulent behaviour observed.

However, it is not expected to perform well if new types of fraud emerge

over time. They will not have been modelled.



Method 3: Anomaly detection >

An alternative to predictive modelling is to model only the legitimate transactions and then report anomalies in new cases as potentially fraudulent transactions.

This method has the advantage that fraud is not explicitly modelled, so

in principle it should be adaptable to new types of fraud that emerge.

Additionally, the highly unbalanced nature of the data is not a problem since the model is based only on the legitimate transactions.

The one major problem is that it will not be sensitive to frauds which

appear very similar to legitimate ones.

One-class classifiers are used to build a model of legitimate transactions.

Typically these work by modelling the probability density function (PDF)

over the predictor variables for legitimate transactions.

In this chapter we will use the common Parzen density estimator.



Anomaly detection process >

A typical anomaly detection process is given as follows:-

1. Use an outlier detector to remove extreme cases from the training data (these may be errors, genuine outliers or fraudulent transactions).

2. Let $(\mathbf{x}_1, \ldots, \mathbf{x}_n)$ be a training sequence of legitimate transactions (with outliers removed).

3. Denote outcome by $y \in \{0, 1\}$ where 1 denotes a legitimate transaction and 0 a fraudulent one.

4. Estimate the PDF $\hat{p}(\mathbf{x}; h)$, where $h$ is an estimation parameter.

5. A classification decision on a new observation $\mathbf{x}$ is made as

$$\hat{y} = I\left(\hat{p}(\mathbf{x}; h) \ge t\right)$$

for some threshold on the density, $t$.



The threshold can be set based on the (sensible) strategy of controlling

the fraction of legitimate cases to be classified as anomalous, based on

training data.

This controls the false alert rate and also can be informed by how many

alerts can be followed-up manually, which is constrained by business

resources (eg how many staff are employed to do follow-up).

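We write this as the optimization task

$$\min t \quad \text{subject to} \quad \sum_{i=1}^{n} I\left(\hat{p}(\mathbf{x}_i; h) \ge t\right) \le (1 - \alpha)\, n$$

where $\alpha$ is the chosen fraction of legitimate cases to be classified as anomalous.

Note: The inequality "$\le$" is used here only for cases where the sum does not give an exact value of $(1 - \alpha)\, n$. Because $t$ is minimized, the sum always gives a value as close to $(1 - \alpha)\, n$ as possible.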
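As a rough illustration of this threshold strategy, the R sketch below estimates a density from simulated legitimate transactions, then picks a threshold so that a chosen fraction of the training cases fall below it. Everything here is an assumption for illustration: the simulated data, the single predictor, the bandwidth h, the target fraction alpha, and the helper names parzen, t_thr and classify.

```r
set.seed(1)
# Simulated legitimate training transactions (one predictor, for simplicity)
x_train <- rnorm(500)
h     <- 0.3    # assumed bandwidth
alpha <- 0.05   # assumed target fraction of legitimate cases flagged

# Gaussian-kernel Parzen estimate of the density at each point in x_new
parzen <- function(x_new, x_train, h) {
  sapply(x_new, function(x) mean(dnorm((x - x_train) / h)) / h)
}

p_train <- parzen(x_train, x_train, h)

# Flag the k lowest-density training cases, with k the smallest
# whole number that is at least alpha * n
n <- length(x_train)
k <- ceiling(alpha * n)
t_thr <- sort(p_train)[k + 1]   # cases with density below t_thr are flagged

# Decision rule: 1 = legitimate, 0 = anomalous (potential fraud)
classify <- function(x_new) as.integer(parzen(x_new, x_train, h) >= t_thr)

mean(p_train < t_thr)   # false alert rate on the training data
classify(c(0, 6))       # a typical value and an extreme value
```

The value of alpha would in practice be set from the number of alerts that staff can follow up, as discussed above.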



Parzen density estimator >

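We could base the estimate on just the empirical frequency, but

1. This only works for univariate data and

2. It is a somewhat crude estimator of the underlying PDF:

$$\hat{p}(x) = \frac{1}{n} \sum_{i=1}^{n} I(x_i = x)$$

Instead we use a Parzen estimator that smooths over a multivariate sample to generate a distribution.

$$\hat{p}(\mathbf{x}; h) = \frac{1}{n h^m} \sum_{i=1}^{n} K\left(\frac{\mathbf{x} - \mathbf{x}_i}{h}\right)$$

where $K$ is some kernel which is symmetric, $K(-\mathbf{u}) = K(\mathbf{u})$, and integrates to 1, $\int K(\mathbf{u})\, d\mathbf{u} = 1$,

$h$ is a bandwidth parameter and

$m$ is the dimensionality of $\mathbf{x}$ (ie the number of predictor variables).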



For any point in the variable space, $\mathbf{x}$, each value $\mathbf{x}_i$ in the training sequence contributes to the estimate, but its contribution is weighted by its distance from $\mathbf{x}$, given by $\mathbf{x} - \mathbf{x}_i$.

The bandwidth $h$ controls the scaling of that distance within the kernel function.

A typical kernel function is the multivariate normal distribution:

$$K(\mathbf{u}) = (2\pi)^{-m/2} \exp\left(-\tfrac{1}{2} \mathbf{u}^{\top} \mathbf{u}\right)$$

In the R statistical language, the function density implements Parzen (kernel) density estimation for univariate data.
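To make the estimator concrete, the sketch below evaluates a Gaussian-kernel Parzen estimate by hand for univariate data and compares it with the output of density. The helper name parzen_gauss, the simulated data and the bandwidth of 0.5 are assumptions for illustration. Because density works on a grid with a binned approximation, the two results should agree closely but not exactly.

```r
set.seed(42)
x <- c(rnorm(100, -2, 1), rnorm(100, 2, 1))
h <- 0.5   # assumed bandwidth

# Univariate Parzen estimate with a standard normal kernel:
# p(x0) = (1 / (n * h)) * sum_i K((x0 - x_i) / h)
parzen_gauss <- function(x0, x, h) {
  sapply(x0, function(u) sum(dnorm((u - x) / h)) / (length(x) * h))
}

grid    <- seq(-6, 6, by = 0.5)
by_hand <- parzen_gauss(grid, x, h)

# Built-in estimate, interpolated onto the same grid
d <- density(x, bw = h, kernel = "gaussian", from = -8, to = 8, n = 2048)
built_in <- approx(d$x, d$y, xout = grid)$y

max(abs(by_hand - built_in))   # small, but not exactly zero

# Numerical check that the hand-written estimate integrates to about 1
integrate(function(u) parzen_gauss(u, x, h), lower = -10, upper = 10)$value
```

For the Gaussian kernel, the bw argument of density is the standard deviation of the kernel, which plays the role of the bandwidth in the formula.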



Exercise 9.1

Prove that

$$\int \hat{p}(\mathbf{x}; h)\, d\mathbf{x} = 1$$



Example 9.1.

This R code demonstrates Parzen density estimation and the use of

bandwidth.

The example simulates 200 observations from a mixture of two normal

distributions.

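# Simulate 200 observations from a mixture of N(-2,1) and N(2,1)
x <- c(rnorm(100, -2, 1), rnorm(100, 2, 1))
# Arrange plots on a 2 x 2 grid
par(mfrow = c(2, 2))
# Histogram of the simulated sample
hist(x)
# Parzen (kernel) density estimate with bandwidth 0.1
plot(density(x, bw = 0.1), main = "Density estimate")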





The following output is produced:



Method 4: Social network analysis >

Very recently banks have been accessing publicly available social

network data.

This allows them to determine transactions that have some association

with other individuals or accounts that are known to be fraudulent or

suspect.

This would reduce the fraud score of such transactions.

Statistical methods that are evolving to deal with this data:-

o Social network analysis,

o Dynamic network analysis.

This is a very new area and we will not investigate these topics further

in this course.



Available data for fraud detection >

Accounts data

Including type of account, application details and aggregate behavioural

characteristics.

Transaction data

Including spending and repayment patterns.

Personal data

Data the bank has about the person holding the account, some of which may have been provided by a credit bureau.

Location data

Information about where the transaction was performed and where the borrower lives.



Evaluation >

Although essentially a classification problem, fraud detection has some characteristics that make evaluation of performance slightly different:

1. The timeliness of detection has an effect on the cost of the fraud.

2. The cost of monitoring automated fraud alerts is important.

3. It is necessary to ensure false alerts are kept to a minimum in order to not upset/alienate legitimate customers.

At the moment there is no clear agreement about the best performance

measure.

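As with scorecard development, we typically base measures on the two CDFs:

$$F_y(c) = P(S \le c \mid Y = y)$$

where $S$ is the score assigned to a transaction, for some fraud score $c$ (remember lower value means more risk of fraud), and for each outcome $y \in \{0, 1\}$ (remember $y = 1$ means legitimate).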



Thus, plotting $F_0(c)$ against $F_1(c)$ gives the receiver operating characteristic (ROC) curve, and the area under the ROC curve (AUC) is a classification performance measure:

$$\text{AUC} = \int F_0(c)\, dF_1(c)$$

However, the ROC curve and AUC do not take into account the special points (1) to (3) given above.

We consider a measure based on these terms:

The false alarm rate is given by $F_1(c)$.

The undetected fraud rate is given by $1 - F_0(c)$.

The alert rate, which is linked to the monitoring cost, is $F(c) = P(S \le c)$.

Notice that $F(c) = P(y = 0)\, F_0(c) + P(y = 1)\, F_1(c)$.
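The three rates can be computed empirically from scored transactions. The R sketch below uses simulated scores; the class sizes, score distributions and threshold c_thr are assumptions for illustration. Here F0, F1 and Fc stand for the score CDFs of fraudulent, legitimate and all transactions evaluated at the threshold, with an alert raised when the score is at or below c_thr.

```r
set.seed(7)
# Simulated fraud scores: low score = higher fraud risk.
# y = 1 legitimate, y = 0 fraudulent; sizes and distributions are invented.
y <- c(rep(1, 9900), rep(0, 100))
s <- c(rnorm(9900, mean = 2, sd = 1), rnorm(100, mean = 0, sd = 1))

c_thr <- 0.5                     # assumed threshold: alert if s <= c_thr

F1 <- mean(s[y == 1] <= c_thr)   # false alarm rate
F0 <- mean(s[y == 0] <= c_thr)   # proportion of frauds detected
Fc <- mean(s <= c_thr)           # alert rate
p1 <- mean(y == 1)
p0 <- mean(y == 0)

c(false_alarm = F1, undetected_fraud = 1 - F0, alert_rate = Fc)

# The alert rate is the class-weighted mix of the two conditional rates
all.equal(Fc, p0 * F0 + p1 * F1)
```

Because fraud is rare, the alert rate is dominated by the false alarm term, which is why monitoring cost and false alarms are closely linked.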



Performance curve >

The performance curve is an alternative to the ROC curve.

Plot $F(c)$ against $1 - F_0(c)$.

o This plots monitoring cost (point 2) against proportion of frauds not detected.

o Also, since $F(c) \ge P(y = 1)\, F_1(c)$ and $P(y = 1) \approx 1$ this also shows some control on false alarms (point 3).

The point $(0, P(y = 0))$ is the perfect performance: all detected at minimal possible cost.

The line must pass through $(1, 0)$ when no frauds are detected since no detection is performed.

The performance given by a random classifier is where $F(c) = F_0(c)$.

Hence this is the diagonal from (0,1) to (1,0).



Best performance is given by curves below this line, so the area under the performance curve is a penalty measure (smaller is better):

$$\int F(c)\, dF_0(c)$$

The x-axis is called a timeline since it captures an aspect of detection

over time (point 1).

o Basically as frauds are detected this increases the proportion of

undetected frauds left in the data, so over time we expect to move

along the x-axis.

o This is similar to performance curves in engineering (eg stress

versus performance curves).
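The performance curve and its area can be sketched in R as follows. The simulated scores are an assumption for illustration, and the area is approximated with the trapezium rule. The alert rate is plotted against the undetected fraud rate, with a dashed diagonal from (0,1) to (1,0) for a random classifier.

```r
set.seed(7)
y <- c(rep(1, 9900), rep(0, 100))             # 1 = legitimate, 0 = fraud
s <- c(rnorm(9900, 2, 1), rnorm(100, 0, 1))   # low score = higher risk

F_all <- ecdf(s)            # CDF of all scores (alert rate)
F_0   <- ecdf(s[y == 0])    # CDF of fraud scores

# Thresholds: one below every score (no alerts), then each observed score
thr <- c(min(s) - 1, sort(s))
alert_rate <- F_all(thr)
undetected <- 1 - F_0(thr)

plot(undetected, alert_rate, type = "l",
     xlab = "1 - F0(c)", ylab = "F(c)", main = "Performance curve")
abline(a = 1, b = -1, lty = 2)   # random classifier

# Area under the curve; reverse so the x values are increasing
xr <- rev(undetected)
yr <- rev(alert_rate)
area <- sum(diff(xr) * (head(yr, -1) + tail(yr, -1)) / 2)
area   # below 0.5, the area under the random-classifier diagonal
```

The curve starts at (1, 0), where no alerts are raised, and ends at (0, 1), where every transaction is flagged.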



Cost-based evaluation >

The financial cost of fraud can be estimated directly, based on the history of past fraud or on the total exposure of the account at the time of the fraud.

The estimate relies on past accounting data for those cases that were correctly detected.



Example 9.2

This is an example of a comparison between a one-class classifier, using a Parzen density estimator, and a two-class classifier.

Uses the performance curve as an evaluation method.

Based on Juszczak et al (2008).

Data set:

11,383 accounts with 646,729 transactions with

3,217 (28.3%) fraudulent accounts and 18,501 (2.9%) fraudulent transactions.

Transaction records over a 6 month period.

Use Parzen density estimator as one-class classifier.



Outcome of model build and test on hold-out sample:-

[Figure: performance curve, plotting $F(c)$ (vertical axis, 0 to 0.5) against $1 - F_0(c)$ (horizontal axis, -0.1 to 0.7).]

Now consider forecasts over time, in comparison with a comparable two-class classifier (in this case a density-based Parzen classifier).



Fixing $1 - F_0(c) = 0.2$ and plotting cost $F(c)$ against the number of months forecast ahead.

This shows that initially the two-class classifier gives slightly better

performance.

However, its performance deteriorates over time in comparison to the one-class classifier, which is more robust.

Our hypothesis is that the two-class classifier is not sensitive to new

types of fraud.

[Figure: cost $F(c)$ (vertical axis, 0 to 0.2) against months ahead (2 to 6) for the one-class and two-class classifiers.]



Exercise 9.2

Suppose and ( ) {( )

for {

} .

Let ( ) be a sequence of instances of , which

correspond to legitimate transactions.

1.Show that is a kernel function for Parzen density estimation for random

variable with bandwidth .

2.Using , compute the threshold that gives a false positive rate up to

.



Review of Chapter 9 >

In this chapter we have investigated:-

Types of fraud and size of the problem.

Automated fraud detection.

Two-class and one-class classifiers for fraud detection.

Parzen density estimation.

Evaluation issues for fraud detection.
