Credit Card Transaction Fraud Detection
Niall Adams
Department of Mathematics, Imperial College London
January 2009
Obligatory contents slide
My objective is to report some of our recent work on fraud detection, mostly sponsored under the EPSRC ThinkCrime initiative.

1. Transaction fraud problem, process, challenge
2. Supervised and unsupervised approaches (and data manipulation)
3. Combining methods
4. Streaming approach to handle change

Collaborators: David Hand, Dave Weston, Chris Whitrow, Piotr Juszczak, Dimitris Tasoulis, Christoforos Anagnostopoulos.
Collaborating banks: Abbey National, Alliance and Leicester, Capital One, Lloyds TSB
Another plastic card fraud anecdote...
Fraud is a piece of cake?
Two very hungry German couriers ate a fruit cake destined for a German newspaper and in its place mailed a box of credit card data. The data, including names, addresses and card transactions, ended up at the Frankfurter Rundschau daily. The mix-up triggered an alarm, and police advised credit card customers with Landesbank Berlin to check their accounts for inconsistencies. Fruitcake must be different in Germany for people to want to use it as something other than a paperweight. (from slashdot)
Transaction fraud
Plastic-card fraud is a serious problem: losses to fraud in the UK in the first half (FH) of 2008 amounted to 307 million pounds. ...and once upon a time, this sounded like a lot of money... The problem is getting worse. (Source: www.cardwatch.org.uk)

There seems to be a perception that there is a fundamental background level of fraud. Losses to fraud are absorbed by customers, merchants and lenders. Thus the industry is very concerned with determining chargeback responsibility.
Types of fraud
Transaction fraud is often grouped into a number of categories, including:

- Counterfeit. Creating duplicate cards. Skimming refers to reading a card's magnetic strip to create a duplicate card.
- Mail non-receipt fraud. Cards are intercepted in the post and used without permission. Extra effort may be required by the fraudster to get other information, perhaps by phishing.
- Card not present fraud. Includes phone, internet and mail order fraud. Chip & PIN simply shifted the problem.
The nature of fraud attacks is changing
Online banking fraud is dramatically increasing: FH 08 saw losses of 21.4 million pounds, up 185%! (source: APACS). Mostly phishing incidents, but an increasing number of money mule adverts. (Source: www.cardwatch.org.uk)

Fraudsters change tactics to adapt to bank security measures (e.g. HSBC now checking all transactions): an arms race. The fraud population's behaviour is changing, but not the legitimate population's?
Transaction stream
Plastic card transaction processing uses a very complicated IS infrastructure (e.g. Visa Europe processes 6000 transactions a second) to connect banks and merchants. Processing requirements include:

- speed
- fraud filtering while minimizing false positives
Schematic processing path
Challenges I
In this talk we will explore methods that could stand in for the fraud filter, or operate immediately after it. We will have to use a modified performance metric, and note that our data is subject to selection bias due to the fraud filter.

Temporal aspects:
- each account consists of an irregularly spaced sequence of (complicated) transaction records
- need for rapid processing
- shifting fraud tactics
- fraud identification is delayed
Challenges II
Population and system factors:
- imbalanced data sets (P(fraud) ≪ P(non-fraud))
Approaches
Most existing fraud filters are relatively simple supervised predictive models, based on very carefully selected variables. For example, FALCON is (essentially) a logistic regression on a large set of variables.

Fundamentally, we can consider two approaches:
- supervised learning, using the transaction fraud labels: a population approach
- unsupervised learning: has a customer departed from normal behavior? An account-level approach.

Will consider these approaches, and hybrids. Taking a tool-based approach means that different approaches need different features.
Superficially:
Supervised:
- use known fraud labels, so possibly resistant to unusual non-fraud transactions
- implemented on a window, so using older fraud transactions, and not immediately responsive to each account
- decision threshold specification straightforward (in principle)

Unsupervised:
- respond to every transaction on an account
- capacity to respond to new types of fraud, since not modeling known frauds
- risk of higher false positive rate
- setting some parameters less straightforward (in principle)
Data
Typical transaction record has more than 70 fields, including:

- transaction value
- transaction time and date
- transaction category (payment, refund, ATM, mobile top-up etc.)
- ATM/POS indicator
- merchant category code: a large set, ranging from specific airlines to massage parlours
- card reader response codes

Fundamental problem is to select which data to extract. Moreover, different (supervised/unsupervised) tools will handle transactions differently.
Performance assessment
Superficially, fraud detection looks like a two-class classification problem, fraud versus non-fraud, for which a suitable measure is AUC (area under the ROC curve).

However, AUC integrates over allocation costs. Moreover, there is a temporal aspect related to timeliness of detection.

Suppose that the cost of investigating a case is 1 unit. Both TP and FP incur this. Estimates from a collaborating bank suggest a missed (FN) fraud costs 100 such units. We construct a measure, TC, that accounts for the number of fraud and non-fraud transactions on an account, while deploying this cost information.

Subtle arguments in Hand et al. (2008) show that this exact summary can be derived from an operating characteristic curve modified to account for temporal ordering of transactions.
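A minimal sketch of a cost-weighted loss in the spirit of TC, assuming the unit costs quoted above; the actual TC of Hand et al. (2008) also accounts for the temporal ordering of transactions on an account, which this toy version omits.

```python
import numpy as np

def cost_weighted_loss(y_true, y_flagged, c_investigate=1.0, c_miss=100.0):
    """y_true: 1 = fraud, 0 = legitimate; y_flagged: 1 = flagged for investigation."""
    y_true = np.asarray(y_true)
    y_flagged = np.asarray(y_flagged)
    n_investigated = np.sum(y_flagged == 1)              # every flag (TP or FP) costs one unit
    n_missed = np.sum((y_true == 1) & (y_flagged == 0))  # FN: fraud not flagged
    return c_investigate * n_investigated + c_miss * n_missed

# Example: 6 transactions, one fraud caught, one missed, one false alarm.
print(cost_weighted_loss([0, 1, 0, 1, 0, 0], [0, 1, 1, 0, 0, 0]))  # 2*1 + 1*100 = 102
```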
Supervised methods
Perhaps the most natural approach - transactions ultimately labeled as fraud or non-fraud - is two-class classification. Many possible methods for this, ranging from logistic regression to support vector machines. The question is how to pre-process (e.g. alignment issues) the transaction database for presentation to the supervised learner.
We explored the approach of transaction aggregation: transforming transaction-level data to account-level data. x_i is a fixed-length vector extracted from an account i transaction, and

    y_i = φ(x_i^(1), ..., x_i^(n))

This is the activity record for account i, based on n sequential transactions. φ is the transformation, which we restrict to be insensitive to the order of the arguments.
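A minimal sketch of transaction aggregation, assuming each transaction is already a fixed-length numeric vector; the counts, sums and means below are illustrative order-insensitive summaries, not the variables actually used (those follow on the next slide).

```python
import numpy as np

def phi(transactions):
    """Map n transaction vectors to one order-insensitive activity record."""
    X = np.asarray(transactions, dtype=float)   # shape (n, d)
    return np.concatenate([
        [len(X)],         # number of transactions in the window
        X.sum(axis=0),    # per-field totals (e.g. total POS value)
        X.mean(axis=0),   # per-field averages
        X.max(axis=0),    # per-field maxima
    ])

# Five 2-field transactions (value, POS indicator) -> one activity record.
window = [[12.5, 1], [3.0, 0], [40.0, 1], [7.5, 1], [2.0, 0]]
print(phi(window))
```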
Selected variables for x using expert advice and extensive exploratory data analysis, to explore the relationship between variables and the fraud label. Variables included:

- number of POS transactions
- value of POS transactions
- transactions identified by magnetic strip
- simplified merchant category codes

The function φ was tailored to compute various counts and averages (again, using extensive exploratory analysis - which seems hard to escape).
To illustrate, using 5 transactions:

Various tricks to handle time-of-day data; one possibility is sketched below. Note these types of approaches are reasonably standard in the industry. We end up with a maximum of 67 variables.
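The slides do not specify the time-of-day tricks used; as an assumed example, one standard option is a circular (sin/cos) encoding, so that times just before and after midnight end up close in feature space rather than at opposite ends of a linear scale.

```python
import numpy as np

def encode_time_of_day(seconds_since_midnight):
    # Map seconds in [0, 86400) onto the unit circle.
    theta = 2 * np.pi * np.asarray(seconds_since_midnight) / 86400.0
    return np.column_stack([np.sin(theta), np.cos(theta)])

# One minute after and one minute before midnight: nearly identical features.
print(encode_time_of_day([60, 86340]))
```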
If any transaction in an activity record is labeled as fraud, then we deem all transactions in the record as fraud. We fix the number of days in the activity record across the population, thereby inducing variable numbers of transactions per account. We experimented with the following classifiers to explore the impact of this length, considering activity records of 7 days, 3 days, 1 day, and 1 transaction (see the sketch after this list):

- logistic regression
- naive Bayes (all variables binned)
- QDA (with some covariance regularization)
- SVM with Gaussian RBF kernels, kernel width and regularization parameter set by experimentation
- random forests, using 200 bootstrap samples, and 10 variables set at each split
- CART, K-NN (both with some further tinkering)
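A minimal sketch of the comparison, using scikit-learn stand-ins for two of the classifiers above and synthetic data purely for illustration; the forest settings follow the slide (200 bootstrap samples, 10 variables per split).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 67))             # 67 variables, as in the slides
y = (rng.random(1000) < 0.1).astype(int)    # synthetic record-level fraud labels

split = 700                                 # train on early data, test on later
models = {
    "logistic": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=200, max_features=10),
}
for name, model in models.items():
    model.fit(X[:split], y[:split])
    scores = model.predict_proba(X[split:])[:, 1]
    print(name, roc_auc_score(y[split:], scores))
```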
To recap, we occupy a feature space with activity records, of length 1, 3, or 7 days, built using consecutive windows. Each object in this space has a fraud label, and we use a variety of classifiers, of various expressive power, to make predictions.

These methods are deployed on real data samples from commercial collaborators, consisting of tens or hundreds of millions of transactions.

We try to use the data fairly, so quote out-of-sample predictions respecting the temporal ordering of the data.
Bank A
TC = 0.09 corresponds to guessing. Standard error approx. 0.001 (bootstrap). In general, longer records are better. Best performance from the random forest.
Bank B
Mostly, longer activity records are better. Random forests again the best method. Note the different performance on different banks - different customer bases. Mixing different length records would be of interest, and remains for future work.
Unsupervised Methods: Account level anomaly detection
Attempt to detect departure from normal behavior. Construct a density estimate over suitably defined transaction features, then flag new transactions with low estimated density.
Three accounts, difference in time of consecutive transactions:
[Figure: estimated densities (pdf against time in seconds) of the time between consecutive transactions for the three accounts, based on 30, 63 and 53 transactions respectively.]
All these accounts are (?) legitimate. An account-level approach avoids the need to handle this source of heterogeneity.
We consider a two-stage approach.

1. Estimation stage: accumulate enough transactions to construct a model of normal behaviour. We use a fixed number, but this is a free parameter.
2. Operational stage: use the model of behaviour to flag transactions as normal or abnormal. Treat abnormal as fraud.

Generic issues to handle: choice of model, choice of threshold, method of handling the temporal nature of the data.
For account i, we have the transaction sequence

    X_i = { x_t | x_t ∈ R^N, t = 1, 2, ... }

Here, we have chosen to represent the transaction record as a collection of continuous variables. Of course, other options are possible. Some trickery is required to handle categorical variables like merchant category codes.
For a specific account, suppose we have legitimate transaction data, X; then our detector for a new transaction x is

    h(x | X, θ) = I(p(x | X, θ) > η) = 1 if x is classified as legitimate, 0 if x is classified as fraud

Here p(·) is a density estimate (the model), and θ refers to control parameters for the model. η is the alert threshold. Difficult to set without context, but one possibility relates η to the maximum proportion of flagged cases that we can afford to investigate.
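A minimal sketch of the two-stage detector, assuming a Parzen (kernel) density estimate for p(· | X, θ), and setting η from the fraction of training transactions we can afford to flag; model choice and threshold setting are exactly the generic issues noted above.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

def fit_detector(X_train, bandwidth=0.5, investigate_fraction=0.05):
    kde = KernelDensity(bandwidth=bandwidth).fit(X_train)   # estimation stage
    train_logp = kde.score_samples(X_train)
    eta = np.quantile(train_logp, investigate_fraction)     # alert threshold
    return kde, eta

def h(x, kde, eta):
    """1 = classified legitimate, 0 = classified (potential) fraud."""
    return int(kde.score_samples(np.atleast_2d(x))[0] > eta)

rng = np.random.default_rng(1)
X_train = rng.normal(size=(200, 4))        # normal-behaviour transactions
kde, eta = fit_detector(X_train)
print(h(np.zeros(4), kde, eta))            # typical transaction -> 1
print(h(np.full(4, 8.0), kde, eta))        # far from normal behaviour -> 0
```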
[Figure: two scatter plots of money [p] (0 to 20000) against time of day [s], illustrating transaction patterns on two accounts.]
Models
We explored many possibilities, including
Kernel density estimate (Parzen)
Naive Parzen (NParzen)
Mixture of Gaussians (MoG)
Gaussian (Gauss)
nearest neighbour (1-NN)
etc, etc...
Control parameters are difficult: various procedures, or arbitrarily fixed. ATM and merchant type are represented as distances and modelled with a linear programming data description. Essentially, find the distance of each point from a representative plane, and transform to have the character of a probability.
Features
We represent the jth transaction as:

- amount
- amount difference
- time
- time difference (crude method of incorporating some temporal structure)
- merchant type *
- ATM-id *

* categorical variables. Of course, selection of variables could be optimised, but this might be impractical in the streaming context.
Some results
Same data sets as before, but different features, so avoid direct comparison with the supervised results. Performance order, two banks, two measures (TC and AUC):

D1
  performance curve: SVDD, MST, 1-NN, Gauss, NParzen, SOM, MoG, MPM, Parzen
  ROC: SVDD, MST, NParzen, SOM, 1-NN, MPM, Gauss, MoG, Parzen

D2
  performance curve: SVDD, MST, 1-NN, NParzen, SOM, Gauss, MoG, Parzen, MPM
  ROC: SVDD, MST, NParzen, SOM, 1-NN, Gauss, MoG, Parzen, MPM

Supervised classifiers built on this data exhibit similar performance.
Of interest to examine performance into the future. Build a supervised classifier on the same data as the unsupervised, then examine performance into the future (fixed costs, false positive rates).
[Figure: three panels (Parzen, 1-NN, SVM) plotting false positive rate (FP) against month (2 to 6), comparing a one-class classifier with a two-class classifier.]
Evidence that the account-level approach degrades more gracefully
over time.
Unsupervised Methods: Peer group analysis
Peer group analysis (PGA) is a new method, attempting to use more than just a single account's data for anomaly detection.

Premise: some accounts exhibit similar behaviour (i.e. follow similar trajectories through some feature space). Use the anomaly concept, but incorporating the behaviour of similar accounts.

Two-stage process: (1) learn the identity of similar accounts (temporal clustering); (2) anomaly detection over the similar accounts.

Lots of implementation issues! One instantiation is not competitive with the previous approaches, but does identify objects that are not simply population outliers.
Combination

Cannot practically run different detectors in parallel. Combination is essential, and perhaps yields improved performance?

Again, different approaches to combination are possible. Perhaps most elegant is to incorporate unsupervised scores into the supervised method. But this is technically and practically difficult. Instead, we consider the output of each detector, and consider how to combine them. For each transaction we have a score from each of a random forest, an SVM-based anomaly detector, and an instantiation of PGA. Normalize all scores to have the character of P(fraud).
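A minimal sketch of one way to give raw detector outputs the character of a probability; the slides do not specify the normalization used, so the percentile-rank transform here is an assumption.

```python
import numpy as np
from scipy.stats import rankdata

def to_probability_scale(raw_scores):
    r = rankdata(raw_scores)             # ranks 1..n (ties averaged)
    return r / (len(raw_scores) + 1)     # map into (0, 1)

rf_scores = np.array([0.2, 9.5, 3.3, 0.7])   # e.g. raw random-forest votes
print(to_probability_scale(rf_scores))       # [0.2, 0.8, 0.6, 0.4]
```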
With each transaction represented by three scores (three variables), one from each detection sub-system, we can consider different sorts of combiner (a sketch follows below):

Ad-hoc:
- max

Supervised:
- logistic regression, naive Bayes, K-NN
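A minimal sketch of the supervised combiner idea on the three scores, with a K-NN combiner as a stand-in (K was chosen by cross-validation in the study; the value here is arbitrary) and synthetic scores purely for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(2)
scores = rng.random((500, 3))             # columns: RF, SVM-anomaly, PGA scores
y = (rng.random(500) < 0.1).astype(int)   # synthetic fraud labels

combiner = KNeighborsClassifier(n_neighbors=25)
combiner.fit(scores[:400], y[:400])       # build on the first part of the data
combined = combiner.predict_proba(scores[400:])[:, 1]   # predict on the second
print(combined[:5])

# The ad-hoc max rule needs no training at all:
max_rule = scores[400:].max(axis=1)
print(max_rule[:5])
```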
Build all sub-systems on the first part of the data, and predict on the second. Note: no model updating.
Method           Loss % (±0.1)   AUC % (±0.1)
Random forest    8.63            68.5
SVM anomaly      9.08            54.8
PGA              9.08            32.8
logistic         8.14            67.9
NB               7.23            87.9
133-NN (2 var)   7.22            88.2
123-NN (3 var)   7.04            88.5
(2 var): not using PGA; K selected by a CV study. Strikingly, PGA, which has no standalone merit, may add a little to the performance of a suitably constructed combiner.
Combination strategies can certainly provide improved performance. Still working out why: one point is that PGA works on histories with frequent transactions, while account-level detection is better for infrequent transactions.

So, tools can be put together, but we have ignored the issue of change over time. All these methods have been built on static windows of data. This is consistent with the industry norm: build the detector, monitor performance, rebuild when performance is deemed to have degraded too far.

Clearly, a static window gives some capacity to handle changing populations (old data not relevant). But there may be a better way to do it...
Temporal adaptation - current work
Consider the problem of computing the mean vector and covariance matrix of a sequence of n multivariate vectors. Standard results say this computation can be implemented as a recursion

    m_t = m_{t-1} + x_t,    μ̂_t = m_t / t,    m_0 = 0    (1)
    S_t = S_{t-1} + (x_t − μ̂_t)(x_t − μ̂_t)^T,    Σ̂_t = S_t / t,    S_0 = 0    (2)

After n steps, this would give the equivalent offline result. If we are monitoring vectors coming from a non-stationary system, simple averaging of this type is biased. If we knew the precise dynamics of the system, we would have a chance to construct an optimal filter. However, we do not.
One approach to tracking the mean value would be to run with a window. Alternatively, we can use ideas from adaptive filter theory, and incorporate a forgetting factor, λ ∈ (0, 1], in the previous recursion:

    n_t = λ n_{t-1} + 1,    n_0 = 0    (3)
    m_t = λ m_{t-1} + x_t,    μ̂_t = m_t / n_t    (4)
    S_t = λ S_{t-1} + (x_t − μ̂_t)(x_t − μ̂_t)^T,    Σ̂_t = S_t / n_t    (5)

λ down-weights old information more smoothly than a window. n_t is the effective sample size, or memory. λ = 1 gives the offline solutions, with n_t = n after n steps. For fixed λ < 1, the memory size tends to 1/(1 − λ) from below.
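A minimal sketch of recursions (3)-(5) with fixed λ, tracking a drifting mean vector and covariance matrix.

```python
import numpy as np

class ForgettingMoments:
    """Forgetting-factor tracker for a mean vector and covariance matrix."""

    def __init__(self, dim, lam=0.99):
        self.lam = lam
        self.n = 0.0                      # effective sample size n_t
        self.m = np.zeros(dim)            # weighted sum m_t
        self.S = np.zeros((dim, dim))     # weighted scatter S_t
        self.mean = np.zeros(dim)
        self.cov = np.zeros((dim, dim))

    def update(self, x):
        x = np.asarray(x, dtype=float)
        self.n = self.lam * self.n + 1.0           # recursion (3)
        self.m = self.lam * self.m + x             # recursion (4)
        self.mean = self.m / self.n
        d = x - self.mean
        self.S = self.lam * self.S + np.outer(d, d)  # recursion (5)
        self.cov = self.S / self.n

# Track a 2-d stream whose mean shifts half way through.
rng = np.random.default_rng(3)
tracker = ForgettingMoments(dim=2, lam=0.95)
for t in range(2000):
    shift = 0.0 if t < 1000 else 5.0
    tracker.update(rng.normal(loc=shift, size=2))
print(tracker.mean)   # close to (5, 5): old information has been forgotten
```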
Setting λ
Two choices for λ: a fixed value, or variable forgetting, λ_t. Fixed forgetting: set by trial and error. Variable forgetting: results from Haykin (1997) (from adaptive filter theory) say to tune λ_t according to a local gradient descent rule

    λ_t = λ_{t-1} − α ∇_λ ε_t²,    ε_t: residual error at time t, α small    (6)

Amazingly, using results from numerical linear algebra, this framework can still yield efficient updating rules. Performance is very sensitive to α. Very careful implementation required.

We are exploring extending this idea, to construct a framework for sequential likelihood estimation with forgetting.
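A heavily simplified, scalar sketch of variable forgetting in the spirit of rule (6): the derivative of the running mean with respect to λ is propagated recursively (treating λ as locally constant), and λ descends the gradient of the squared one-step prediction error. The constants and the clipping range are assumptions; as noted above, a careful implementation (and the multivariate case) requires much more work.

```python
import numpy as np

lam, alpha = 0.99, 1e-4
n = m = mu = dn = dm = dmu = 0.0     # state and its derivatives w.r.t. lambda
rng = np.random.default_rng(4)

for t in range(3000):
    x = rng.normal(loc=0.0 if t < 1500 else 4.0)
    eps = x - mu                                  # one-step residual error
    # d(eps^2)/dlambda = -2*eps*dmu, so descending the gradient adds
    # alpha*2*eps*dmu to lambda; clip to keep lambda in a sensible range.
    lam = float(np.clip(lam + 2.0 * alpha * eps * dmu, 0.5, 1.0))
    dn = n + lam * dn                             # derivative of n = lam*n + 1
    dm = m + lam * dm                             # derivative of m = lam*m + x
    n = lam * n + 1.0
    m = lam * m + x
    dmu = (dm * n - m * dn) / n ** 2              # derivative of mu = m/n
    mu = m / n

print(mu)   # tracks the post-change mean, near 4
```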
Illustration
Tracking mean and covariance in 2d
Change detection properties: two fixed values of λ, 5D data, abrupt change.

[Figure: "Reaction of Gradient at Abrupt Change at t = 1000"; value of gradient against time (0 to 2000), for λ fixed at 0.99 and at 0.95.]
Streaming classifier
Since we have a method for adaptively and incrementally estimating mean vectors and covariance matrices, we can now consider an adaptive version of Gaussian-based classification (since these methods only require means and covariances).

Recall

    P(c | x) = f(x | c) P(c) / f(x)

Change can happen in various ways, but population drift usually refers to the class prior P(c) and/or the class-conditional densities f(x | c).

Linear/quadratic discriminant analysis (LDA/QDA) is motivated by reasoning that f(x | c) ~ N(μ_c, Σ_c). LDA: assume the covariance matrix is common across classes. QDA: different covariances.
Now, instead of using static estimates of μ and Σ, use the adaptive estimates. We can use the same adaptive forgetting factor to handle changing prior probabilities also. This leads to a number of ways of constructing a stream classifier; one is sketched below.

This requires some amount of hack, because the theory requires regularly-spaced data in time. To handle this, we simply update every time an observation arrives.

Also, to test the idea, we provide the fraud flag immediately after classification (unrealistic, but we have some ideas for the real problem).
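A minimal sketch of one such stream classifier: a streaming QDA with one forgetting-factor mean/covariance tracker per class, plus forgetting counts for the class priors, updated whenever a labelled observation arrives. The ridge term stands in for the regularisation mentioned on the next slide; the class layout and constants here are illustrative assumptions.

```python
import numpy as np

class StreamingQDA:
    """QDA from adaptive estimates: one forgetting tracker per class (sketch)."""

    def __init__(self, dim, n_classes=2, lam=0.99, ridge=1e-3):
        self.lam, self.ridge, self.dim = lam, ridge, dim
        self.prior_n = np.ones(n_classes)         # forgetting counts for P(c)
        self.n = np.zeros(n_classes)              # per-class effective sizes
        self.m = np.zeros((n_classes, dim))       # per-class weighted sums
        self.S = np.zeros((n_classes, dim, dim))  # per-class weighted scatters

    def update(self, x, c):
        x = np.asarray(x, dtype=float)
        self.prior_n *= self.lam                  # forget old prior evidence...
        self.prior_n[c] += 1.0                    # ...and count the new label
        self.n[c] = self.lam * self.n[c] + 1.0    # recursions (3)-(5), class c
        self.m[c] = self.lam * self.m[c] + x
        d = x - self.m[c] / self.n[c]
        self.S[c] = self.lam * self.S[c] + np.outer(d, d)

    def log_posterior(self, x):
        x = np.asarray(x, dtype=float)
        out = np.empty(len(self.n))
        for c in range(len(self.n)):
            mu = self.m[c] / self.n[c]
            cov = self.S[c] / self.n[c] + self.ridge * np.eye(self.dim)
            d = x - mu
            _, logdet = np.linalg.slogdet(cov)
            out[c] = (np.log(self.prior_n[c] / self.prior_n.sum())
                      - 0.5 * logdet - 0.5 * d @ np.linalg.solve(cov, d))
        return out                                # unnormalised log P(c|x)

rng = np.random.default_rng(5)
clf = StreamingQDA(dim=3)
for _ in range(500):
    c = int(rng.random() < 0.1)                   # imbalanced class labels
    clf.update(rng.normal(loc=3.0 * c, size=3), c)
print(np.argmax(clf.log_posterior(np.zeros(3))))      # 0: typical region
print(np.argmax(clf.log_posterior(np.full(3, 3.0))))  # 1: shifted class region
```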
Using 18 variables, based on numerical elements of the transaction record, and a means of coding merchant category codes, we have the following performance over 5 years (AUC measured once a month). (Note: a little regularisation is required.)

Performance of static versions (adaptive windows) is very poor. This model can be implemented very efficiently.
What does the adaptive forgetting factor show?

While rebuilding the detector will always be needed, this approach might help mitigate some losses due to population drift.
In the context of the whole problem, we have the opportunity to place the time-adaptation in different places. This suggests the following intriguing idea for fraud detection between system rebuilds:
Conclusions
Transaction fraud detection is an important, but hard, real-world problem. A significant amount of engineering is required to produce effective solutions.

Different modeling approaches and tools can have merit, and it appears they can be effectively combined. This suggests that the different tools are capturing the fraud signal in non-overlapping ways.

We have the problem of handling changing populations (arms race, economic drift etc.). Preliminary results suggest that temporally adaptive methods may have some utility in this context.
Future Work
- Explore continuous updating and adaptation of the subsystems and the combiner.
- Extend the adaptive classifier to finite mixture models (more flexible), approximate logistic regression and RBF networks.
- More realistically handle the delayed fraud label.