Data mining in support of fraud management

DESCRIPTION

Data Mining framework and techniques to support Fraud intelligence and early warning by Marco Scattareggia and Stefano M. de' Rossi

TRANSCRIPT

The techniques of data mining in support of Fraud Management

How to design a Predictive Model, by Marco Scattareggia, HP EMEA Fraud Center of Excellence Manager

Published in the May/June 2011 issue of Information Security magazine

>> Data mining sits at the intersection of several rapidly evolving disciplines, of which statistics and artificial intelligence are the two most prominent. This article clarifies the meaning of the main technical terms that can make it harder to understand the methods of analysis, in particular those used to predict phenomena of interest and to build appropriate predictive models. Fraud management, like other industrial applications, relies on data mining techniques to make fast decisions according to the scoring of fraud risks. The concepts in this article come from the work done by the author while preparing the workshop on data mining and fraud management held in Rome, in the auditorium of Telecom Italia, on September 13, 2011, thanks to a worthy initiative of Stefano Maria de' Rossi, to whom the author gives his thanks.

About the author

Marco Scattareggia, a graduate in Electronic Engineering and Computer Science, works in Rome at Hewlett-Packard Italia, where he directs the HP EMEA Center of Excellence dedicated to the design and implementation of fraud management solutions for telecom operators.

Inductive reasoning, data mining and fraud management

Data mining, or digging for gold in large amounts of data, is the combination of several disciplines, including statistical inference, the management of computer databases, and machine learning, the branch of artificial intelligence research that studies self-learning systems. Literally, data mining means extracting knowledge from a mass of data in order to acquire rules that provide decision support and determine what action should be taken. This concept is effectively captured by the term actionable insight, and the benefits to a business process such as fraud management are drawn from forecasting techniques. In data mining, this predictive analytics activity rests on three elements:

1. Large amounts of available data to be analyzed and to provide representative samples for training, verification and validation of predictive models.

2. Analytical techniques for understanding the data, their structures and their significance.

3. Forecasting models, articulated, as in every computer process, in terms of input, process, and output; in other words, predictors (the input), algorithms (the process), and the target of the forecast (the output), as sketched below.
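
To make this decomposition concrete, here is a minimal sketch in Python; the predictor names, weights, and the toy weighted-sum "algorithm" are invented purely for illustration and are not taken from the article.

    # A minimal sketch of a predictive model reduced to input -> process -> output.
    # Predictor names, weights, and the toy weighted-sum "algorithm" are invented.

    def extract_predictors(case):
        """Input: turn a raw case into the predictors fed to the model."""
        return {
            "calls_per_hour": case["calls_per_hour"],
            "intl_ratio": case["international_minutes"] / max(case["total_minutes"], 1),
            "new_account": 1 if case["account_age_days"] < 30 else 0,
        }

    def score_algorithm(predictors):
        """Process: any algorithm mapping predictors to a number (here a toy weighted sum)."""
        weights = {"calls_per_hour": 0.5, "intl_ratio": 30.0, "new_account": 20.0}
        return min(100.0, sum(weights[name] * value for name, value in predictors.items()))

    def predict(case):
        """Output: the target of the forecast, here a fraud-risk score from 0 to 100."""
        return score_algorithm(extract_predictors(case))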

In addition to the techniques of analysis, adequate tools and methods for data collection, normalization, and loading are also needed. These preliminary activities are highlighted in the early stages of the KDD (Knowledge Discovery in Databases) paradigm and are generally found in products known as ETL (Extract, Transform, Load). By visiting the site www.kdd.org, you can see how data mining actually constitutes the analysis phase of the interactive process of extracting knowledge from data shown in Figure 1.
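
Purely as an illustration of these preliminary ETL steps, the short sketch below extracts raw call records from a CSV export, normalizes them, and loads them into a table ready for analysis; the file name, field layout, and the "+39" prefix rule are assumptions made for the example.

    import csv

    def extract(path):
        """Extract: read raw call records from a CSV export (layout assumed for the example)."""
        with open(path, newline="") as source:
            return list(csv.DictReader(source))

    def transform(rows):
        """Transform: normalize types and units so every record is directly comparable."""
        cleaned = []
        for row in rows:
            cleaned.append({
                "subscriber": row["subscriber"].strip(),
                "duration_min": float(row["duration_sec"]) / 60.0,
                # Toy rule for the example: any non-Italian "+" prefix counts as international.
                "international": row["destination"].startswith("+")
                                 and not row["destination"].startswith("+39"),
            })
        return cleaned

    def load(records, table):
        """Load: append the cleaned records to the analysis table (here simply a list)."""
        table.extend(records)
        return table

    # analysis_table = load(transform(extract("cdr_export.csv")), [])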

Figure 1

Besides the practical applications of data mining in an industrial context, it is also useful to examine Figure 2, which sets out the evolution of business analytics techniques. It starts with simple reporting, which provides a graphical summary of data grouped along their different dimensions and highlights the main differences and elements of interest. The second phase corresponds to analysis, aimed at understanding why a specific phenomenon occurred. Monitoring then corresponds to the use of tools that let you keep what is happening under control and, finally, predictive analytics allows you to determine what could or should happen in the future.
Obviously, the future can be predicted only in probabilistic terms, and nobody can be one hundred percent sure about what will really happen. The result of this process is an ordering, a probabilistic ranking of the possible events based on previously accumulated experience. This activity, known as scoring, assigns a value in percentage terms, the score, which expresses the confidence we may have in the forecast itself and allows us to act consistently according to the score values. For example, in fraud management, a high score corresponds to a high risk of fraud, and the consequent action could be to stop the service (e.g., the loan from a bank, the telephone line, the insurance cover, etc.), while a more moderate score may only require an additional investigation by the analyst. This article will show how a fraud management application, designed as a business process, can benefit from data mining techniques and the practical use of predictive models.
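
To make the link between score and action concrete, a minimal sketch of such a triage policy follows; the thresholds and actions are illustrative choices, not values prescribed by the article.

    def action_for_score(score):
        """Map a fraud-risk score (0-100) to an operational action.
        The 80/50 thresholds are illustrative; real values are set by the fraud manager."""
        if score >= 80:
            return "suspend the service and open a fraud case"
        if score >= 50:
            return "queue for additional investigation by an analyst"
        return "no action, keep monitoring"

    for case_id, score in [("A-101", 92), ("A-102", 63), ("A-103", 12)]:
        print(case_id, score, "->", action_for_score(score))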

Figure 2

It is interesting to note that the techniques of business analytics derive from inferential statistics and, more specifically, from Bayesian probabilistic reasoning. Thomas Bayes' theorem on conditional probability answers the question "Knowing that there was the effect B, what is the probability that A is the cause?" In a nutshell, it gives the probability of a cause once its effect is known. The article "How to build a predictive model", published in the May/June 2011 issue of the Italian Information Security magazine, explained how to calculate the probability of a purchase given the customer's gender (man or woman) and observed dressing style:

- During the construction of the model, the outcome or effect, which in the example is the positive or negative result of a purchase, is known, while the cause requires a probabilistic assessment and is the object of the analysis. The roles are reversed: knowing the effect, we look for the cause.

- When forecasting, the roles of cause and effect return to their natural sequence: given the causes, the model predicts the resulting effect. The person's gender and dressing style are the predictors, while the purchase decision, whether positive or negative, becomes the target to predict.

The analysis phase, during which the roles of cause and effect (i.e., the predictors and the target) are reversed, is referred to in predictive analytics as supervised training of the model.
Figure 3 below shows the contingency table with example values of the probabilities to be used in Bayes' theorem to calculate the probability of purchase for a man or a woman. In other words, having analyzed the purchase history and estimated the probabilities of the causes (predictors) conditioned on a specific effect (target), we can use a forecasting model based on Bayes' theorem to predict the likelihood of a future purchase once we know the person's gender and dressing style.
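
Since the numerical values of Figure 3 are not reproduced in this transcript, the small computation below uses invented probabilities purely to show the mechanics: the predictors' conditional probabilities estimated during training are combined with Bayes' theorem (assuming, naive-Bayes style, that the two predictors are independent given the target) to obtain the posterior probability of a purchase.

    # Illustrative numbers only (not those of Figure 3): the prior probability of a
    # purchase and the predictor probabilities conditioned on the target.
    p_buy = 0.30                    # P(purchase)
    p_not_buy = 1.0 - p_buy         # P(no purchase)
    p_woman_given_buy = 0.60        # P(woman | purchase)
    p_woman_given_not_buy = 0.45    # P(woman | no purchase)
    p_casual_given_buy = 0.70       # P(casual dress | purchase)
    p_casual_given_not_buy = 0.50   # P(casual dress | no purchase)

    # Bayes' theorem for a casually dressed woman, combining the two predictors as if
    # they were independent given the target (the naive Bayes assumption):
    numerator = p_woman_given_buy * p_casual_given_buy * p_buy
    evidence = numerator + p_woman_given_not_buy * p_casual_given_not_buy * p_not_buy
    print("P(purchase | woman, casual dress) =", round(numerator / evidence, 3))  # 0.444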

Figure 3

Bayes' theorem of the probability of causes is widely used to predict which causes are most likely to have produced an observed event. However, it was Pierre-Simon Laplace who consolidated, in his Essai philosophique sur les probabilités (1814), the logical system that underpins inductive reasoning and is now referred to as Bayesian reasoning. The formula that follows is Laplace's rule of succession. Assuming that the results of a phenomenon have only two possible outcomes, "success" and "failure", and that a priori we know little or nothing about how the outcome is determined, Laplace derived the way to calculate the probability that the next result is a success:

P = (s + 1) / (n + 2)

where "s" is the number of previously observed successes and "n" the total number of known instances. Laplace went on to use his rule of succession to calculate the probability of the rising sun each new day, based on the fact that, to date, this event has never failed and, obviously, he was strongly criticized by his contemporaries for his irreverent extrapolation. The goal of inferential statistics is to provide methods that are used to learn from experience, that is to build models to move from a set of particular cases to the general case. However, Laplace’s rule of succession, as well as the whole system of Bayesian inductive reasoning, can lead to blatant errors. The pitfalls inherent in the reasoning about the probabilities are highlighted by the so-called paradoxes that pose questions whose correct answers are highly illogical. The philosopher Bertrand Russell, for example, pointed out that falling from the roof of a twenty floor building, when arriving at the first floor you may incorrectly infer from the Laplace’s rule of succession that, because nothing bad happened during the fall for 19 of 20 floors, there is no danger in the last twentieth part of the fall too. Russell concluded pragmatically that an inductive reasoning can be accepted if it not only

Page 6: Data mining in support of fraud management

  

leads to a high probability prediction, but is also reasonably credible.
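
Translated directly into code, the rule of succession given above becomes a one-line function; the sunrise figures below are just an example.

    def rule_of_succession(successes, trials):
        """Laplace's rule: P(next success) = (s + 1) / (n + 2)."""
        return (successes + 1) / (trials + 2)

    # Laplace's provocative sunrise example: if the sun has risen on every one of,
    # say, 1,000,000 recorded days, the rule gives a probability just below 1.
    print(rule_of_succession(1_000_000, 1_000_000))  # 0.999999...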

Figure 4

Another example often used to demonstrate the limits of inductive logic is the paradox of the black ravens, developed by Carl Gustav Hempel. Examining a million ravens, one by one, we note that they are all black. After each observation, therefore, the theory that all ravens are black becomes increasingly likely to be true, consistent with the inductive principle. But the hypothesis "all ravens are black", taken in isolation, is logically equivalent to the hypothesis "all things that are not black are not ravens." The second statement becomes more likely even after the observation of a red apple: we would indeed have observed something "not black" that "is not a raven." Obviously, taking the observation of a red apple as confirmation of the proposition that all ravens are black is neither consistent nor reasonably credible. Bertrand Russell would argue that if the population of ravens in the world totals a million plus one, then the inference "all ravens are black", after examining a million black ravens, could be considered reasonably correct. But if you were to estimate that a hundred million ravens exist, a sample of only one million black ravens would no longer be sufficient.

The forecasts provided by inductive models, and their practical use in business decisions, rest on this pragmatic criterion of Russell's. When selecting data samples for the training, testing, and validation of a predictive model, you need to ask two fundamental questions:

a) Are the rules that constitute the algorithm of the model consistent with the characteristics of the individual entities that make up the sample?

b) Are the sample data really representative of the whole population of entities about which inferences will be drawn?

The answers to these questions derive respectively from the concepts of internal validity and external validity of an inferential statistical analysis, as shown in Figure 5. Internal validity measures how correct the results of the analysis are for the sample of entities that have been studied, and it may be undermined by a sampling procedure that is not perfectly random and therefore introduces noise and systematic distortion (bias). Good internal validity is necessary but not sufficient: we should also check the external validity, that is, the degree of generalization acquired by the predictive model. When the model has not acquired sufficiently general rules, it is likely that we have merely memorized most of the data present in the sample used for training (we have overfitted the model) rather than effectively learned from the data (we
didn’t extract the knowledge hidden behind the data). In this situation, the model will not be able to successfully process new cases from other samples.
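
A common practical check of external validity is to compare a model's accuracy on its training sample with its accuracy on data held out from training: a large gap is the signature of overfitting. The sketch below assumes scikit-learn is available and uses synthetic data purely for illustration.

    from sklearn.datasets import make_classification
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # Synthetic two-class data standing in for labelled fraud / non-fraud cases.
    X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    # An unconstrained tree can memorize the training sample (overfitting),
    # while a depth-limited tree is forced to learn more general rules.
    models = {
        "deep tree": DecisionTreeClassifier(random_state=0),
        "pruned tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    }
    for name, model in models.items():
        model.fit(X_train, y_train)
        train_acc = accuracy_score(y_train, model.predict(X_train))
        test_acc = accuracy_score(y_test, model.predict(X_test))
        print(f"{name}: training accuracy {train_acc:.2f}, held-out accuracy {test_acc:.2f}")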

Figure 5

The techniques of predictive analytics help you make decisions once the data have been classified and characterized with respect to a certain phenomenon. Other techniques, such as OLAP (On-Line Analytical Processing), also support decision making, because they allow you to see what happened. A predictive model, however, directly provides the prediction of a phenomenon, estimates its size, and allows you to take the right actions. A further possibility offered by the techniques of predictive analytics is the separation and classification of the elements of a non-homogeneous set. The most common example of this type of application is selecting which customers to address in a marketing campaign, that is, to whom to send a business proposal with a reasonable chance of getting a positive response; in such cases one can rightly speak of business intelligence. This technique, known as clustering, is also useful in fraud management because it allows you to better target the action of a predictive model.

It improves the internal validity of the training sample by dividing the mass of available data into homogeneous subsets. It may also uncover new patterns of fraud and help you generate new detection rules. Moreover, the identification of values very distant from the average, called outliers, leads directly to cases that have a high probability of fraud and therefore require more thorough investigation.

The dilemma of the fraud manager

The desire of every organization that is aware of the revenues lost to fraud is obviously to achieve zero losses. Unfortunately this is not possible, both because of the rapid reaction of the criminal organizations that profit from fraud, which quickly find new attack patterns and discover new weaknesses in the defense systems, and because fighting fraud has a cost that grows in proportion to the level of defense put in place. Figure 6 shows graphically that, without enforcement systems, losses to fraud can reach very high levels, over 30% of total revenues, and may even threaten the very survival of the company. By putting in place an appropriate organization to manage fraud, together with an appropriate technology infrastructure, losses can very quickly be brought down to acceptable levels, on the order of a few percentage points. The competence of the fraud manager lies in identifying the optimal compromise between the cost of managing fraud and the residual losses due to fraud. This tradeoff is indicated by the red point in Figure 6. Going further could significantly increase the cost of personnel
and instruments to achieve only tiny incremental loss reductions.

Figure 6

The main difficulty, however, lies not in demonstrating the value of the residual fraud but in estimating the losses actually prevented by the regular activities of the fraud management team. In other words, it is not easy to estimate the size and consequences of the losses that would theoretically have been caused by frauds that were not perpetrated thanks to the daily prevention work. For more details, and to understand how to calculate the ROI of an FMS, you can refer to the article Return on Investment of an FMS, published in the March/April 2011 issue of the Italian Information Security magazine. Technically, you must choose appropriate KPIs (Key Performance Indicators) and measure both the value of the fraud detected in a given period and the value of the fraud remaining in the same period. For example, the trends of two popular KPIs, precision (the percentage of detected fraud among the total cases analyzed) and recall (the percentage of detected fraud out of the total existing fraud), are shown in Figure 7.
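
Following the definitions just given, both KPIs can be computed from a few counts; the sketch below uses invented monthly figures purely for illustration.

    def precision(confirmed_frauds, cases_analyzed):
        """Share of the analyzed cases that turn out to be actual fraud."""
        return confirmed_frauds / cases_analyzed if cases_analyzed else 0.0

    def recall(confirmed_frauds, total_existing_frauds):
        """Share of all existing fraud that the analysts actually detected."""
        return confirmed_frauds / total_existing_frauds if total_existing_frauds else 0.0

    # Invented monthly figures: 400 alarms analyzed, 120 confirmed as fraud,
    # while an estimated 300 fraud cases existed in the same period.
    print("precision:", precision(120, 400))  # 0.30
    print("recall:   ", recall(120, 300))     # 0.40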

Figure 7

Wishing to reach the ideal point at which you would have, at the same time, a precision and a recall of 100%, one can make several attempts to improve one KPI or the other. For example, you could increase the number of cases of suspected fraud examined daily (increasing recall), which of course increases the number of working hours too. Conversely, one may attempt to configure the FMS better and reduce the number of cases to be analyzed in a day by eliminating the false alarms that needlessly consume the analysts' time (increasing precision). However, if you do not really increase the information given to the system, by adding new rules or better search keywords, improving precision will worsen recall, and vice versa. This problem leads to the dilemma that afflicts every fraud manager: you cannot improve the results of the fight against fraud without increasing either the cost of the structure (i.e., its capacity) or the information provided to the FMS. It is therefore necessary to act on at least one of the two levers, costs or
information, and if possible on both.

Figure 8

Predictive models lend themselves to improving both the effectiveness and the efficiency of a fraud management department. For example, the inductive techniques of decision trees can be used to extract new rules from the data to better identify cases of fraud, while the scoring technique makes it easier to organize human resources on a risk-priority basis and, eventually, to enable automatic mechanisms to be used at night or in the absence of personnel. Figure 8 represents the gain chart for three different scoring models. The productivity gain consists of the analyst's time saving compared with a non-guided processing of the cases in a random sequence, indicated by the red diagonal. The solid blue line indicates the ideal path, practically unattainable but the goal to aim for: all cases of outright fraud, the true positives, are discovered immediately, without losing time on false alarms. It is interesting to note that this ideal situation occurs when both the precision and recall KPIs are equal to 100%, that is, when the model has reached the ideal point shown in Figure 7.

For a comprehensive evaluation of a predictive model, the reader may refer to the article Evaluation of the predictive capabilities of an FMS, published in the February/March issue of the Italian Information Security magazine.

Construction of a model to score cases of fraud in telecommunications

Figure 9 shows the conceptual scheme of a predictive model to score cases of fraud in a telecommunications company. In this representation the algorithm that forms the core of the model is a neural network. However, the model as a whole would not change if you chose a different algorithm such as, for example, a decision tree, a Bayesian network, and so on.

Figure 9

The alarms and cases generated by the FMS are derived from aggregations, or other information processing, of the elementary data coming from the telecommunication traffic. In fact, all input data to a
predictive model can be elaborated and replaced with other derived parameters. All input data and derived parameters compete, in a sort of analytic game, to be elected as predictors, that is, as the proper input to the core forecasting algorithm highlighted in the blue box of Figure 9. The output of the predictive model is simply the score value associated with the case. This value is a percentage, varying between zero and one hundred (or between zero and one), and expresses the probability that the case represents an outright fraud (when the score is close to 100) or a false alarm (when the score is close to 0).

The inclusion of a predictive model in the operational context of the company has a significant impact on its existing information technology (IT) structure, and it can take many months to develop dedicated custom software and the associated operating procedures. Recently, however, the growth of wide data-transfer capacity over the Internet, web service technology, and the emerging paradigms of cloud computing and SaaS (Software as a Service) have paved the way for an easier transition of predictive models into production. The data mining community, represented by the Data Mining Group (DMG), has recently developed a language, PMML (Predictive Model Markup Language), that is destined to become the lingua franca, spoken by many vendors and systems, for the standard definition and immediate use of predictive models. PMML, which is based on XML, provides all the methods and tools needed to define, verify, and then put predictive models into practice. By adopting PMML, it is no longer necessary for the model to be developed and run with software products from the same vendor. All the definitions and descriptions needed to understand PMML can be found on the DMG website, http://www.dmg.org/. In conclusion, PMML, being an open standard, when combined with a cloud computing offering can dramatically lower the TCO (Total Cost of Ownership) by breaking down the barriers of incompatibility between the different systems of the IT infrastructure already in place in the company. Furthermore, putting the model into operation in the application context can be handled directly by the same people who developed it, that is, without involving the highly technical IT department. For more on the creation of predictive models, see the article How to design a Predictive Model, published in the May/June 2011 issue of the Italian Information Security magazine.
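
As a purely illustrative sketch of how such a scoring model might be built and exported in PMML form, the code below assumes scikit-learn and the open-source sklearn2pmml package (which relies on a Java runtime for the conversion step); the data are synthetic, the file name is invented, and the exact API should be checked against the current DMG and package documentation.

    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier
    from sklearn2pmml import sklearn2pmml
    from sklearn2pmml.pipeline import PMMLPipeline

    # Synthetic labelled cases standing in for FMS alarms (fraud = 1, false alarm = 0).
    X, y = make_classification(n_samples=1000, n_features=8, random_state=0)

    # The core algorithm (here a decision tree) is interchangeable, as noted above.
    pipeline = PMMLPipeline([
        ("classifier", DecisionTreeClassifier(max_depth=5, random_state=0)),
    ])
    pipeline.fit(X, y)

    # Score a new case as a percentage between 0 and 100.
    score = 100 * pipeline.predict_proba(X[:1])[0, 1]
    print(f"fraud-risk score: {score:.0f}")

    # Export the trained model to a PMML file that any PMML-aware engine can run.
    sklearn2pmml(pipeline, "fraud_scoring_model.pmml")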