

An expert system for detecting automobile insurance fraud using social network analysis

Lovro Šubelj∗, Štefan Furlan, Marko Bajec

Faculty of Computer and Information Science, University of Ljubljana, Tržaška 25, SI-1001 Ljubljana, Slovenia

Abstract

The article proposes an expert system for detection, and subsequent investigation, of groups of collaborating automobile insurance fraudsters. The system is described and examined in great detail, and several technical difficulties in detecting fraud are also considered, for it to be applicable in practice. As opposed to many other approaches, the system uses networks for the representation of data. Networks are the most natural representation of such a relational domain, allowing formulation and analysis of complex relations between entities. Fraudulent entities are found by employing a novel assessment algorithm, Iterative Assessment Algorithm (IAA), also presented in the article. Besides intrinsic attributes of entities, the algorithm also explores the relations between entities. The prototype was evaluated and rigorously analyzed on real world data. Results show that automobile insurance fraud can be efficiently detected with the proposed system and that appropriate data representation is vital.

Key words: Fraud detection, Automobile insurance, Social network analysis, Link analysis, Assessment propagation

1. Introduction

Fraud is encountered in a variety of domains. It comes in all different shapes and sizes, from traditional fraud, e.g. (simple) tax cheating, to more sophisticated forms, where entire groups of individuals collaborate in order to commit fraud. Such groups can be found in the automobile insurance domain.

Here fraudsters stage traffic accidents and issue fake insurance claims to gain (unjustified) funds from their general or vehicle insurance. There are also cases where an accident has never occurred, and the vehicles have only been placed onto the road. Still, the majority of such fraud is not planned (opportunistic fraud) – an individual only seizes the opportunity arising from the accident and issues exaggerated insurance claims or claims for past damages.

Staged accidents have several common characteristics. They occur in late hours and non-urban areas in order to reduce the probability of witnesses. Drivers are usually younger males, there are many passengers in the vehicles, but never children or elders. The police are always called to the scene to make the subsequent acquisition of means easier. It is also not uncommon that all of the participants have multiple (serious) injuries, when there is almost no damage on the vehicles. Many other suspicious characteristics exist, not mentioned here.

The insurance companies place the most interest in organized groups of fraudsters consisting of drivers, chiropractors, garage mechanics, lawyers, police officers, insurance workers and others. Such groups represent the majority of revenue leakage.

∗Corresponding author. Tel.: +386 1 4768 186.
Email addresses: [email protected] (Lovro Šubelj), [email protected] (Štefan Furlan), [email protected] (Marko Bajec)

Most of the analyses agree that approximately 20% of all insurance claims are in some way fraudulent (various resources). But most of these claims go unnoticed, as fraud investigation is usually done by hand by the domain expert or investigator and is only rarely computer supported. Inappropriate representation of data is also common, making the detection of groups of fraudsters extremely difficult. An expert system approach is thus needed.

Jensen (1997) has observed several technical difficulties in detecting fraud (various domains). Most hold for (automobile) insurance fraud as well. Firstly, only a small portion of accidents or participants is fraudulent (skewed class distribution), making them extremely difficult to detect. Next, there is a severe lack of labeled data sets, as labeling is expensive and time consuming. Besides, due to the sensitivity of the domain, there is even a lack of unlabeled data sets. Any approach for detecting such fraud should thus be founded on moderate resources (data sets) in order to be applicable in practice. Fraudsters are very innovative and new types of fraud emerge constantly. Hence, the approach must also be highly adaptable, detecting new types of fraud as soon as they are noticed. Lastly, it holds that fully autonomous detection of automobile insurance fraud is not possible in practice. Final assessment of potential fraud can only be made by the domain expert or investigator, who also determines further actions in resolving it. The approach should also support this investigation process.

Due to everything mentioned above, the set of approaches for detecting such fraud is extremely limited. We propose a novel expert system approach for detection and subsequent investigation of automobile insurance fraud. The system is focused on detection of groups of collaborating fraudsters, and their connecting accidents (non-opportunistic fraud), and not some isolated fraudulent entities. The latter should be done independently for each particular entity, while in our system, the entities are assessed in a way that also considers the relations between them. This is done with appropriate representation of the domain – networks.

Networks are the most natural representation of any relational domain, allowing formulation of complex relations between entities. They also present the main advantage of our system against other approaches that use a standard flat data form. As collaborating fraudsters are usually related to each other in various ways, detection of groups of fraudsters is only possible with appropriate representation of data. Networks also provide clear visualization of the assessment, crucial for the subsequent investigation process.

The system assesses the entities using a novel Iterative Assessment Algorithm (IAA algorithm), presented in this article. No learning from an initial labeled data set is done; the system rather allows simple incorporation of the domain knowledge. This makes it applicable in practice and allows detection of new types of fraud as soon as they are encountered. The system can be used with poor data sets, which is often the case in practice. To simulate realistic conditions, the discussion in the article and the evaluation with the prototype system rely only on the data and entities found in the police record of the accident (main entities are participant, vehicle, collision¹, police officer).

The article makes an in depth description, evaluation and analysis of the proposed system. We pursue the hypothesis that automobile insurance fraud can be detected with such a system and that proper data representation is vital. The main contributions of our work are: (1) a novel expert system approach for the detection of automobile insurance fraud with networks; (2) a benchmarking study, as no expert system approach for detection of groups of automobile insurance fraudsters has yet been reported (to our knowledge); (3) an algorithm for assessment of entities in a relational domain, demanding no labeled data set (IAA algorithm); and (4) a framework for detection of groups of fraudsters with networks (applicable in other relational domains).

The rest of the article is organized as follows. In section 2 we discuss related work and emphasize weaknesses of other proposed approaches. Section 3 presents formal grounds of (social) networks. Next, in section 4, we introduce the proposed expert system for detecting automobile insurance fraud. The prototype system was evaluated and rigorously analyzed on real world data; a description of the data set and the obtained results are given in section 5. Discussion of the results is conducted in section 6, followed by the conclusion in section 7.

2. Related work

Our work places in the wide field of fraud detection. Fraud appears in many domains including telecommunications, banking, medicine, e-commerce, and general and automobile insurance. Thus a number of expert system approaches for preventing, detecting and investigating fraud have been developed in the past. Researchers have proposed using some standard methods of data mining and machine learning, neural networks, fuzzy logic, genetic algorithms, support vector machines, (logistic) regression, consolidated (classification) trees, approaches over red-flags or profiles, various statistical methods and other methods and approaches (Artis et al., 2002; Brockett et al., 2002; Bolton & Hand, 2002; Estevez et al., 2006; Furlan & Bajec, 2008; Ghosh & Schwartzbard, 1999; Hu et al., 2007; Kirkos et al., 2007; Perez et al., 2005; Rupnik et al., 2007; Quah & Sriganesh, 2008; Sanchez et al., 2009; Viaene et al., 2002, 2005; Weisberg & Derrig, 1998; Yang & Hwang, 2006). Analyses show that in practice none is significantly better than the others (Bolton & Hand, 2002; Viaene et al., 2005). Furthermore, they mainly have three weaknesses. They (1) use inappropriate or inexpressive representation of data; (2) demand a labeled (initial) data set; and (3) are only suitable for larger, richer data sets. It turns out that these are generally a problem when dealing with fraud detection (Jensen, 1997; Phua et al., 2005).

¹Throughout the article the term collision is used instead of (traffic) accident. The word accident implies there is no one to blame, which contradicts the subject of the article.

In the narrower sense, our work comes near the approaches from the field of network analysis that combine intrinsic attributes of entities with their relational attributes. Noble & Cook (2003) proposed detecting anomalies in networks with various types of vertices, but they focus on detecting suspicious structures in the network, not vertices (i.e. entities). Besides that, the approach is more appropriate for larger networks. Researchers also proposed detecting anomalies using measures of centrality (Freeman, 1977, 1979), random walks (Sun et al., 2005) and others (Holder & Cook, 2003; Maxion & Tan, 2000), but these approaches mainly rely only on the relational attributes of entities.

Many researchers have investigated the problem of classification in the relational context, following the hypothesis that classification of an entity can be improved by also considering its related entities (inference). Thus many approaches formulating inference, spread or propagation on networks have been developed in various fields of research (Brin & Page, 1998; Domingos & Richardson, 2001; Kleinberg, 1999; Kschischang & Frey, 1998; Lu & Getoor, 2003b; Minka, 2001; Neville & Jensen, 2000). Most of them are based on one of the three most popular (approximate) inference algorithms: Relaxation Labeling (RL) (Hummel & Zucker, 1983) from the computer vision community, Loopy Belief Propagation (LBP) on loopy (Bayesian) graphical models (Kschischang & Frey, 1998) and the Iterative Classification Algorithm (ICA) from the data mining community (Neville & Jensen, 2000). For the analyses and comparison see (Kempe et al., 2003; Sen & Getoor, 2007).

Researchers have reported good results with these algorithms (Brin & Page, 1998; Kschischang & Frey, 1998; Lu & Getoor, 2003b; Neville & Jensen, 2000), however they mainly address the problem of learning from an (initial) labeled data set (supervised learning), or a partially labeled one (semi-supervised learning) (Lu & Getoor, 2003a), therefore the approaches are generally inappropriate for fraud detection. The algorithm we introduce here, the IAA algorithm, is almost identical to the ICA algorithm, however it was developed with different intentions in mind – to assess the entities when no labeled data set is at hand (and not for improving classification with inference). Furthermore, IAA does not address the problem of classification, but ranking. Thus, in this way, it is actually a simplification of the RL algorithm, or even Google's PageRank (Brin & Page, 1998), though it is not founded on probability theory like the latter.

We conclude that due to the weaknesses mentioned, most of the proposed approaches are inappropriate for detection of (automobile) insurance fraud. Our approach differs, as it does not demand a labeled data set and is also appropriate for smaller data sets. It represents data with networks, which are among the most natural representations and allow complex analysis without simplification of data. It should be pointed out that networks, despite their strong foundations and expressive power, have not yet been used for detecting (automobile) insurance fraud (at least according to our knowledge).

3. (Social) networks

Networks are based upon mathematical objects called graphs. Informally speaking, a graph consists of a collection of points, called vertices, and links between these points, called edges (Fig. 1). Let $V_G$, $E_G$ be a set of vertices, edges for some graph $G$ respectively. We define $G$ as $G = (V_G, E_G)$ where

$$V_G = \{v_1, v_2, \ldots, v_n\}, \qquad (1)$$

$$E_G \subseteq \{\{v_i, v_j\} \mid v_i, v_j \in V_G \wedge i \neq j\}. \qquad (2)$$

Note that edges are sets of vertices, hence they are not directed (undirected graph). In the case of directed graphs equation (2) rewrites to

$$E_G \subseteq \{(v_i, v_j) \mid v_i, v_j \in V_G \wedge i \neq j\}, \qquad (3)$$

where edges are ordered pairs of vertices – $(v_i, v_j)$ is an edge from $v_i$ to $v_j$. The definition can be further generalized by allowing multiple edges between two vertices and loops (edges that connect vertices with themselves). Such graphs are called multigraphs. Examples of some simple (multi)graphs can be seen in Fig. 1.

In practical applications we usually strive to store some extra information along with the vertices and edges. Formally, we can define two labeling functions

$$l_{V_G} : V_G \to \Sigma_{V_G}, \qquad (4)$$

$$l_{E_G} : E_G \to \Sigma_{E_G}, \qquad (5)$$

where $\Sigma_{V_G}$, $\Sigma_{E_G}$ are (finite) alphabets of all possible vertex, edge labels respectively. A labeled graph can be seen in Fig. 1 (b).

Fig. 1: (a) simple graph with directed edges; (b) undirected multigraph with labeled vertices and edges (labels are represented graphically); (c) network representing collisions where round vertices correspond to participants and cornered vertices correspond to vehicles. Collisions are represented with directed edges between vehicles.

We proceed by introducing some terms used later on. Let $G$ be some undirected multigraph or an underlying graph of some directed multigraph – an underlying graph consists of the same vertices and edges as the original directed (multi)graph, only that all of its edges are set to be undirected. $G$ naturally partitions into a set of (connected) components denoted $C(G)$. E.g. all three graphs in Fig. 1 have one connected component, while the graphs in Fig. 2 consist of several connected components. From here on, we assume that $G$ consists of a single connected component.

Let $v_i$ be some vertex in graph $G$, $v_i \in V_G$. The degree of the vertex $v_i$, denoted $d(v_i)$, is the number of edges incident to it. Formally,

$$d(v_i) = |\{e \mid e \in E_G \wedge v_i \in e\}|. \qquad (6)$$

Let $v_j$ be some other vertex in graph $G$, $v_j \in V_G$, and let $p(v_i, v_j)$ be a path between $v_i$ and $v_j$. A path is a sequence of vertices on a way that leads from one vertex to another (including $v_i$ and $v_j$). There can be many paths between two vertices. A geodesic $g(v_i, v_j)$ is a path that has the minimum size – it consists of the least number of vertices. Again, there can also be many geodesics between two vertices.

We can now define the distance between two vertices, i.e. $v_i$ and $v_j$, as

$$d(v_i, v_j) = |g(v_i, v_j)| - 1. \qquad (7)$$

The distance between $v_i$ and $v_j$ is the number of edges visited when going from $v_i$ to $v_j$ (or vice versa). The diameter of some graph $G$, denoted $d(G)$, is a measure for the "width" of the graph. Formally, it is defined as the maximum distance between any two vertices in the graph,

$$d(G) = \max\{d(v_i, v_j) \mid v_i, v_j \in V_G\}. \qquad (8)$$

All graphs can be divided into two classes. First are cyclic graphs, having a path $p(v_i, v_i)$ that contains at least two other vertices (besides $v_i$) and has no repeated vertices. Such a path is called a cycle. Graphs in Fig. 1 (a) and (b) are both cyclic. The second class of graphs consists of acyclic graphs, more commonly known as trees. These are graphs that contain no cycle (see Fig. 1 (c)). Note that a simple undirected graph is a tree if and only if $|E_G| = |V_G| - 1$.

Finally, we introduce the vertex cover of a graph $G$. Let $S$ be a subset of vertices, $S \subseteq V_G$, with the property that each edge in $E_G$ has at least one of its incident vertices in $S$ (covered by $S$). Such an $S$ is called a vertex cover. It can be shown that finding a minimum vertex cover is NP-hard in general.

Graphs have been studied and investigated for almost 300 years, thus a strong theory has been developed until today. There are also numerous practical problems and applications where graphs have shown their usefulness (e.g. Brin & Page, 1998) – they are the most natural representation of many domains and are indispensable whenever we are interested in relations between entities or in patterns in these relations. We emphasize this only to show that networks have a strong mathematical, and also practical, foundation – networks² are usually seen as labeled, or weighted, multigraphs with both directed and undirected edges (see Fig. 1 (c)). Furthermore, vertices of a network usually represent some entities, and edges represent some relations between them. When vertices correspond to people, or groups of people, such networks are called social networks.

Networks often consist of densely connected subsets of vertices called communities. Formally, communities are subsets of vertices with many edges between the vertices within some community and only a few edges between the vertices of different communities. Girvan & Newman (2002) suggested identifying communities by recursively removing the edges between them – between edges. As it holds that many geodesics run along such edges, where only few geodesics run along edges within communities, between edges can be removed by using edge betweenness (Girvan & Newman, 2002). It is defined as

$$Bet(e_i) = |\{g(v_i, v_j) \mid v_i, v_j \in V_G \,\wedge\, g(v_i, v_j) \text{ goes along } e_i\}|, \qquad (9)$$

where $e_i \in E_G$. The edge betweenness $Bet(e_i)$ is thus the number of all geodesics that run along edge $e_i$.

For more details on (social) networks see e.g. (Newman, 2003, 2008).

4. Expert system for detecting automobile insurance fraud

As mentioned above, the proposed expert system uses (primarily constructed) networks of collisions to assign a suspicion score to each entity. These scores are used for the detection of groups of fraudsters and their corresponding collisions. The framework of the system is structured into four modules (Fig. 2).

In the first module, different types of networks are constructed from the given data set. When necessary, the networks are also simplified – divided into natural communities that appear inside them. The latter is done without any loss of generality.

Networks from the first module naturally partition into several connected components. In the second module we investigate these components and output the suspicious ones, focusing mainly on their structural properties such as diameter, cycles, etc. Other components are discarded at the end of this module.

²Throughout the article the terms graph and network are used as synonyms.

Fig. 2: Framework of the proposed expert system for detecting (automobile insurance) fraud.

Not all entities in some suspicious component are necessarily suspicious. In the third module components are thus further analyzed in order to detect key entities inside them. They are found by employing the Iterative Assessment Algorithm (IAA), presented in this article. The algorithm assigns a suspicion score to each entity, which can be used for subsequent assessment and analysis – to identify suspicious groups of entities and their connecting collisions. In general, suspicious groups are subsets of suspicious components.

Note that detection of suspicious entities is done in two stages (second and third module). In the first stage, or the second module, we focus only on detecting suspicious components, and in the second stage, the third module, we also locate the suspicious entities within them. Hence the detection in the first, second stage is done at the level of components, entities respectively. The reason for this hierarchical investigation is that early stages simplify assessment in the later stages, possibly without any loss for detection (for further implications see section 6).

It holds that fully autonomous detection of automobile insurance fraud is not possible in practice. The obtained results should always be investigated by the domain expert or investigator, who determines further actions for resolving potential fraud. The purpose of the last, fourth, module of the system is thus to appropriately assess and visualize the obtained results, allowing the domain expert or investigator to conduct the subsequent analysis.

The first three modules of the system are presented in sections 4.1, 4.2, 4.3 respectively, while the last module is only briefly discussed in section 4.4.


4.1. Representation with networks

Every entity's attribute is either intrinsic or relational. Intrinsic attributes are those that are independent of the entity's surroundings (e.g. a person's age), while the relational attributes represent, or are dependent on, relations between entities (e.g. the relation between two colliding drivers). Relational attributes can be naturally represented with the edges of a network. Thus we get networks where vertices correspond to entities and edges correspond to relations between them. Numerous different networks can be constructed, depending on which entities we use and how we connect them to each other.

The purpose of this first module of the system is to construct different types of networks, used later on. It is not immediately clear how to construct networks that describe the domain in the best possible way and are most appropriate for our intentions. This problem arises as networks, despite their high expressive power, are destined to represent relations between only two entities (i.e. binary relations). As collisions are actually relations between multiple entities, some sort of projection of the data set must be made (for other suggestions see section 7).

Collisions can thus be represented with various types of networks, not all equally suitable for fraud detection. In our opinion, there are some guidelines that should be considered when constructing networks from any relational domain data (the guidelines are given approximately in the order of their importance):

1. Intention: networks should be constructed so that they are most appropriate for our intentions (e.g. fraud detection)

2. Domain: networks should be constructed in a way that describes the domain as it is (e.g. connected vertices should represent some entities, also directly connected in the data set)

3. Expressiveness: the expressive power of the constructed networks should be as high as possible

4. Structure: the structure of the networks should not be used for describing some specific domain characteristics (e.g. there should be no cycles in the networks when there are no actual cycles in the data set). Structural properties of networks are a strong tool that can be used in the subsequent (investigation) process, but only when these properties were not artificially incorporated into the network during the construction process

5. Simplicity: networks should be kept as simple and sparse as possible (e.g. not all entities need to be represented by their own vertices). The hypothesis here is that simple networks also allow simpler subsequent analysis and clearer final visualization (principle of Occam's razor³)

6. Uniqueness: every network should uniquely describe the data set being represented (i.e. there should be a bijection between different data sets and corresponding networks)

Frequently not all guidelines can be met and some trade-offs have to be made.

³The principle states that the explanation of any phenomenon should make as few assumptions as possible, eliminating those making no difference in the assessment – entities should not be multiplied beyond necessity.

In general there are $\binom{3}{1} + \binom{3}{2} + (\binom{3}{2} + \binom{3}{3}) = 10$ possible ways to connect the three entities (i.e. collision, participant and vehicle), depending on which entities we represent with their own vertices. 7 of these represent participants with vertices and in 4 cases all entities are represented by their own vertices. For the reason of simplicity, we focus on the remaining 3 cases. In the following we introduce four different types of such networks, as an example and for later use. All can be seen in Fig. 3.

Fig. 3: Four types of networks representing the same two collisions – (a) drivers network, (b) participants network, (c) COPTA network and (d) vehicles network. Rounded vertices correspond to participants, hexagons correspond to collisions and irregular cornered vertices correspond to vehicles. Solid directed edges represent involvement in some collision, solid undirected edges represent drivers (only for the vehicles network) and dashed edges represent passengers. Guilt in the collision is formulated with the edge's direction.

The simplest way is to only connect the drivers who were involved in the same collision – drivers networks. Guilt in the collision is formulated with the edge's direction. Note that drivers networks severely lack expressive power (guideline 3). We can therefore add the passengers and get participants networks, where passengers are connected with the corresponding drivers. Such networks are already much richer, but they have one major weakness – passengers "group" on the driver, i.e. it is generally not clear which passengers were involved in the same collision, and not even how many passengers were involved in some particular collision (guidelines 3, 6).

This weakness is partially eliminated by COnnect Passengers Through Accidents networks (COPTA networks). We add special vertices representing collisions, and all participants in some collision are now connected through these vertices. Passengers no longer group on the drivers but on the collisions, thus the problem is partially eliminated. We also add special edges between the drivers and the collisions, to indicate the number of passengers in the vehicle. This type of networks could be adequate for many practical applications, but it should be mentioned that the distance between two colliding drivers is now twice as large as before – the drivers are those that were directly related in the collision (guidelines 2, 5).

The last type of networks are vehicles networks, where special vertices are added to represent vehicles. Collisions are now represented by edges between vehicles, and drivers and passengers are connected through them. Such networks provide good visualization of the collisions and also incorporate another entity, but they have many weaknesses as well. Two colliding drivers are very far apart and the (included) vehicles are not actually of our interest (guideline 5). Such networks also seem to suggest that the vehicles are the ones responsible for the collision (guideline 2). Vehicles networks are also much larger than the previous ones.

A better way to incorporate vehicles into networks is simply to connect collisions in which the same vehicle was involved. Similar holds for other entities like police officers, chiropractors, lawyers, etc. Using special vertices for these entities would only unnecessarily enlarge the networks and consequently make subsequent detection harder (guidelines 1, 5). It is also true that these entities usually aren't available in practice (sensitivity of the domain).
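As an illustration of the construction in this module, the sketch below builds a COPTA-style network with networkx. The input record fields (id, drivers, passengers) are hypothetical stand-ins for what a police record might actually provide, and the direction encoding guilt is omitted for brevity:

    import networkx as nx

    def copta_network(collisions):
        """Sketch of COPTA network construction. `collisions` is assumed
        to be a list of dicts with hypothetical fields: 'id', 'drivers'
        (list of participant ids) and 'passengers' (dict mapping a driver
        id to a list of passenger ids)."""
        G = nx.Graph()
        for col in collisions:
            c = ('collision', col['id'])      # special collision vertex
            G.add_node(c, kind='collision')
            for drv in col['drivers']:
                G.add_node(drv, kind='participant')
                # The driver edge also records the number of passengers carried.
                G.add_edge(drv, c, label='Driver',
                           n_passengers=len(col['passengers'].get(drv, [])))
                for pas in col['passengers'].get(drv, []):
                    G.add_node(pas, kind='participant')
                    G.add_edge(pas, c, label='Passenger')
        return G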

A summary of the analysis of the different types of networks is given in table 1.

Guidelines and networks

Guideline (weight)           drivers   particip.   COPTA   vehicles
Intention, 2nd module (5)       +         ++         0        0
Intention, 3rd module (5)       0         +          ++       0
Domain (4)                      0         0          −        −
Expressiveness (4)             −−         −          0        +
Structure (4)                   0         0          0        0
Simplicity (3)                  +         0          −        −−
Uniqueness (2)                  0         −          −        −
Total, 2nd module               0         4         −9       −8
Total, 3rd module              −5        −1          1       −8

Table 1: Comparison of different types of networks due to the proposed guidelines. Scores assigned to the guidelines are a choice made by the authors. The analysis for Intention (guideline 1), and the total score, is given separately for the second, third module respectively.

There is of course no need to use the same type of networks in every stage of the detection process (guideline 1). In the prototype system we thus use participants networks in the second module (section 4.2), as they provide enough information for the initial detection of suspicious components, and COPTA networks in the third module (section 4.3), whose adequacy will become clearer later. Other types of networks are used only for visualization purposes. The network scores, given in table 1, confirm this choice.

After the construction of networks is done, the resulting connected components can be quite large (depending on the type of networks used). As it is expected that groups of fraudsters are relatively small, the components should in this case be simplified. We suggest using edge betweenness (Girvan & Newman, 2002) to detect communities in the network (i.e. supersets of groups of fraudsters) by recursively removing the edges until the resulting components are small enough. As using edge betweenness assures that we would be removing only the edges between the communities, and not the edges within communities, the simplification is done without any loss of generality.
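One possible reading of this simplification step, sketched with networkx (max_size is a hypothetical expert-chosen limit on component size):

    import networkx as nx

    def simplify(component, max_size):
        """Recursively remove the highest-betweenness (between) edges until
        every resulting connected component has at most max_size vertices."""
        done, stack = [], [component.copy()]
        while stack:
            sub = stack.pop()
            if sub.number_of_nodes() <= max_size:
                done.append(sub)
                continue
            bet = nx.edge_betweenness_centrality(sub)
            sub.remove_edge(*max(bet, key=bet.get))
            stack.extend(sub.subgraph(c).copy()
                         for c in nx.connected_components(sub))
        return done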

4.2. Suspicious components detection

The networks from the first module consist of several connected components. Each component describes a group of related entities (i.e. participants, due to the type of networks used), where some of these groups contain fraudulent entities. Within this module of the system we want to detect such groups (i.e. fraudulent components) and discard all others, in order to simplify the subsequent detection process in the third module. Not all entities in some fraudulent component are necessarily fraudulent. The purpose of the third module is to identify only those that are.

Analyses, conducted with the help of a domain expert, showed that fraudulent components share several structural characteristics. Such components are usually much larger than other, non-fraudulent components, and are also denser. The underlying collisions often happened in suspicious circumstances, and the ratio between the number of different drivers and the number of collisions is usually close to 1 (for reference, the ratio for completely independent collisions is 2). There are vertices with extremely high degree and centrality. Components have a small diameter, (short) cycles appear, and the size of the minimum vertex cover is also very small (all relative to the size of the component). There are also other characteristics, all implying that the entities represented by such components are unusually closely related to each other. An example of a fraudulent component with many of the mentioned characteristics is shown in Fig. 4.

Fig. 4: Example of a component of a participants network with many of the suspicious characteristics shared by fraudulent components.

We have thus identified several indicators of the likelihood that some component is fraudulent (i.e. a suspicious component). The detection of suspicious components is done by assessing these indicators. Only simple indicators are used (no combinations of indicators).

Formally, we define an ensemble of $n$ indicators as $I = [I_1, I_2 \ldots I_n]^T$. Let $c$ be some connected component in network $G$, $c \in C(G)$, and let $H_i(c)$ be the value for $c$ of the characteristic measured by indicator $I_i$. Then

$$I_i(c) = \begin{cases} 1 & c \text{ has a suspicious value of } H_i \\ 0 & \text{otherwise.} \end{cases} \qquad (10)$$


For the reason of simplicity, all indicators are defined as binary attributes. For the indicators that measure a characteristic that is independent of the structure of the component (e.g. number of vertices, collisions, etc.), simple thresholds are defined in order to distinguish suspicious components from others (due to this characteristic). These thresholds are set by the domain expert.

Other characteristics are usually greatly dependent on the number of vertices and edges in the component. A simple threshold strategy thus does not work. Values of such $H_i$ could of course be "normalized" before the assessment (based on the number of vertices and edges), but it is often not clear how. Values could also be assessed using some (supervised) learning algorithm over a labeled data set, but a huge set would be needed, as the assessment should be done for each number of vertices and edges separately (owing to the dependence mentioned). What remains is to construct random networks of (presumably) honest behavior and assess the values of such characteristics using them.

No in-depth analysis of collisions networks has so far been reported, and it is thus not clear how to construct such random networks. General random network generators or models, e.g. (Barabási & Albert, 1999; Eppstein & Wang, 2002), mainly give results far away from the collisions networks (visually and by assessing different characteristics). Therefore a sort of rewiring algorithm is employed, initially proposed by Ball et al. (1997) and Watts & Strogatz (1998).

The algorithm iteratively rewires edges in some component $c$, meaning that we randomly choose two edges in $E_c$, $\{v_i, v_j\}$ and $\{v_k, v_l\}$, and switch one of their incident vertices. The resulting edges are e.g. $\{v_i, v_l\}$ and $\{v_k, v_j\}$ (see Fig. 5). The number of vertices and edges does not change during the rewiring process, and the values for some $H_i$ can thus be assessed by generating a sufficient number of such random networks (for each component).

Fig. 5: Example of a rewired network. Dashed edges are rewired, i.e. replaced by solid edges.

The details of the rewiring algorithm are omitted due to space limitations; we only discuss two aspects. First, the number of rewirings should be kept relatively small (e.g. $< |E_c|$), otherwise the constructed networks are completely random with no trace of the one we started with – (probably) not networks representing a set of collisions. We also want to compare components to other random ones, which are similar to them, at least in the aspect of these rewirings. If a component significantly differs even from these similar ones, there is probably some severe anomaly in it.

Second, one can notice that the algorithm never changes the degrees of the vertices. As we wish to assess the degrees as well, the algorithm can simply be adapted to the task in an ad hoc fashion. We add an extra vertex $v_e$ and connect all other vertices with it. As this vertex is removed at the end, rewiring one of the newly added edges with some other (old) edge changes the degrees of the vertices. Let $\{v_i, v_e\}$, $\{v_k, v_l\}$ be the edges being rewired and let $\{v_i, v_l\}$, $\{v_k, v_e\}$ be the edges after the rewiring. The (true) degree of vertex $v_i$, $v_k$ is increased, decreased by one respectively.
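A minimal sketch of the degree-preserving rewiring step over an edge list (the extra-vertex adaptation for assessing degrees would be layered on top of it):

    import random

    def rewire(edges, n_swaps):
        """Degree-preserving rewiring of an undirected simple component.
        `edges` is a list of 2-tuples; each successful swap replaces the
        edges {a,b}, {c,d} with {a,d}, {c,b} (as in Fig. 5)."""
        edges = [frozenset(e) for e in edges]
        present = set(edges)
        for _ in range(n_swaps):
            i, j = random.sample(range(len(edges)), 2)
            a, b = tuple(edges[i])
            c, d = tuple(edges[j])
            if len({a, b, c, d}) < 4:
                continue                      # swap would create a loop
            e1, e2 = frozenset((a, d)), frozenset((c, b))
            if e1 in present or e2 in present:
                continue                      # swap would duplicate an edge
            present -= {edges[i], edges[j]}
            present |= {e1, e2}
            edges[i], edges[j] = e1, e2
        return [tuple(e) for e in edges]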

To assess the values of the indicators we separately construct random components for each component $c \in C(G)$ and indicator $I_i \in I$, and approximate the distributions for the characteristics $H_i$ ($H_i$ are seen as random variables). A statistical test is employed to test the null hypothesis that the observed value $H_i(c)$ comes from the distribution for $H_i$. The test can be one- or two-tailed, based on the nature of the characteristic $H_i$. In the case of a one-tailed test, where large values of $H_i$ are suspicious, we get

$$I_i(c) = \begin{cases} 1 & P_c(H_i \geq H_i(c)) < t_i \\ 0 & \text{otherwise,} \end{cases} \qquad (11)$$

where the probability density function $P(H_i)$ is approximated with the generated distribution $P_c(H_i)$ and $t_i$ is a critical threshold or acceptable Type I error (e.g. set to 0.05). In the case of a two-tailed test equation (11) rewrites to

$$I_i(c) = \begin{cases} 1 & P_c(H_i \geq H_i(c)) < t_i/2 \;\vee\; P_c(H_i \leq H_i(c)) < t_i/2 \\ 0 & \text{otherwise.} \end{cases} \qquad (12)$$

Knowing the values of all indicators $I_i$ we can now indicate the suspicious components in $C(G)$. The simplest way to accomplish this is to use a majority classifier or voter, indicating as suspicious all the components for which at least half of the indicators are set to 1. Let $S(G)$ be the set of suspicious components in a network $G$, $S(G) \subseteq C(G)$, then

$$S(G) = \{c \mid c \in C(G) \wedge \sum_{i=1}^{n} I_i(c) \geq n/2\}. \qquad (13)$$

When fraudulent components share most of the characteristics measured by the indicators, we would clearly indicate them (they would have most, at least half, of the indicators set). Still, the approach is rather naive, having three major weaknesses (among others): (1) there is no guarantee that the threshold $n/2$ is the best choice; (2) we do not consider how many components have some particular indicator set; and (3) all indicators are treated as equally important. Normally, we would use some sort of supervised learning technique that eliminates these weaknesses (e.g. regression, neural networks, classification trees, etc.), but again, due to the lack of labeled data and the skewed class distribution in the collisions domain, this would only rarely be feasible (the size of $C(G)$ is even much smaller than the size of the actual data set).

To cope with the last two weaknesses mentioned, we suggest using principal component analysis of RIDITs (PRIDIT) proposed by Brockett & Levine (1977) (see (Brockett, 1981)), which has already been used for detecting fraudulent insurance claim files (Brockett et al., 2002), but not for detecting groups of fraudsters (i.e. fraudulent components). The RIDIT analysis was first introduced by Bross (1958).


RIDIT is basically a scoring method that transforms a set of categorical attribute values into a set of values from the interval $[-1, 1]$, so that they reflect the probability of an occurrence of some particular categorical value. Hence, an ordinal scale attribute is transformed into an interval scale attribute. In our case, all $I_i$ are simple binary attributes, and the RIDIT scores, denoted $R_i$, are then just

$$R_i(c) = \begin{cases} \hat{p}^0_i & I_i(c) = 1 \\ -\hat{p}^1_i & I_i(c) = 0, \end{cases} \qquad (14)$$

where $c \in C(G)$, $\hat{p}^1_i$ is the relative frequency of $I_i$ being equal to 1, computed from the entire data set, and $\hat{p}^0_i = 1 - \hat{p}^1_i$.

We demonstrate the technique with an example. Let $\hat{p}^1_i$ be equal to 0.95 – almost all of the components have the indicator $I_i$ set. The RIDIT score for some component $c$, with $I_i(c) = 1$, is then just 0.05, as the indicator clearly gives a poor indication of fraudulent components. On the other hand, for some component $c$, with $I_i(c) = 0$, the RIDIT score is $-0.95$, since the indicator very likely gives a good indication of the non-fraudulent components. A similar intuitive explanation can be made by setting $\hat{p}^1_i$ to 0.05. A full discussion of RIDIT scoring is omitted; for more details see (Brockett, 1981; Brockett & Levine, 1977).
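A sketch of equation (14) over a binary indicator matrix (rows are components, columns are indicators), assuming NumPy:

    import numpy as np

    def ridit_scores(I):
        """RIDIT scores (equation (14)) for a binary indicator matrix I of
        shape (n_components, n_indicators). Entry (c, i) becomes p0_i when
        I_i(c) = 1 and -p1_i when I_i(c) = 0."""
        I = np.asarray(I, dtype=float)
        p1 = I.mean(axis=0)   # relative frequency of I_i = 1 over the data set
        p0 = 1.0 - p1
        return np.where(I == 1, p0, -p1)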

The introduction of RIDIT scoring diminishes the previously mentioned second weakness. To also cope with the third, we make use of the PRIDIT technique. The intuition of this technique is that we can weight indicators in some ensemble by assessing the agreement of some particular indicator with the entire ensemble. We make a (probably incorrect) assumption that the indicators are independent.

Formally, let $W$ be a vector of weights for the ensemble of RIDIT scorers $R_i$ for indicators $I_i$, denoted $W = [w_1, w_2 \ldots w_n]^T$, and let $R$ be a matrix with $i,j$-th component equal to $R_j(c)$, where $c$ is the $i$-th component in $C(G)$. The matrix product $RW$ gives the ensemble's score for all the components, i.e. the $i$-th component of vector $RW$ is equal to the weighted linear combination of RIDIT scores for the $i$-th component in $C(G)$. Denote $S = RW$; we can then assess the indicators' agreement with the entire ensemble as (written in matrix form)

$$R^T S / \| R^T S \|. \qquad (15)$$

Equation (15) computes normalized scalar products of the columns of $R$, which correspond to the returned values of the RIDIT scorers, and $S$, which is the overall score of the entire ensemble (for each component in $C(G)$). When the returned values of some scorer are completely orthogonal to the ensemble's scores, the resulting normalized scalar product equals 0, and it reaches its maximum, or minimum, when they are perfectly aligned.

Equation (15) thus gives the scorers' (indicators') agreement with the ensemble and can be used to assign new weights, i.e. $W^1 = R^T S / \|R^T S\|$. Greater weights are assigned to the scorers that in a sense agree with the general belief of the ensemble. Denote $S^1 = R W^1$; then $S^1$ is a vector of overall scores using these newly determined weights. There is of course no reason to stop the process here, as we can iteratively get even better weights. We can write

$$W^i = \frac{R^T S^{i-1}}{\|R^T S^{i-1}\|} = \frac{R^T R W^{i-1}}{\|R^T R W^{i-1}\|} \qquad (16)$$

for $i \geq 1$, which can be used to iteratively compute better and better weights for an ensemble of RIDIT scorers $R_i$, starting with some weights, e.g. $W^0 = [1, 1 \ldots 1]$ – the process converges to some fixed point no matter the starting weights (due to some assumptions). It can be shown that the fixed point is actually the first principal component of the matrix $R^T R$, denoted $W^\infty$. For more details on the PRIDIT technique see (Brockett et al., 2002).
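Since equation (16) is a power iteration on $R^T R$, the weights can be sketched directly; the function below approximates $W^\infty$ and the overall scores $S$ (n_iter and tol are assumed, practical cut-offs):

    import numpy as np

    def pridit(R, n_iter=100, tol=1e-9):
        """Iterate W^i = R^T R W^{i-1} / ||R^T R W^{i-1}|| (equation (16)).
        Returns the converged weights W and the overall scores S = R W."""
        R = np.asarray(R, dtype=float)
        W = np.ones(R.shape[1])               # W^0 = [1, 1 ... 1]
        for _ in range(n_iter):
            W_new = R.T @ (R @ W)
            W_new /= np.linalg.norm(W_new)
            if np.linalg.norm(W_new - W) < tol:
                break
            W = W_new
        return W, R @ W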

We can now score each component in $C(G)$ using the PRIDIT technique for indicators $I_i$ and output as suspicious all the components with a score of at least 0. Thus

$$S(G) = \{c \mid c \in C(G) \wedge R(c) W^\infty \geq 0\}, \qquad (17)$$

where $R(c)$ is the row of matrix $R$ that corresponds to component $c$. Again there is no guarantee that the threshold 0 is the best choice. Still, if we know the expected proportion of fraudulent components in the data set (or e.g. the expected number of fraudulent collisions), we can first rank the components using the PRIDIT technique and then output only the appropriate proportion of most highly ranked components.

4.3. Suspicious entities detection

In the third module of the system key entities are detected inside each previously identified suspicious component. We focus on identifying key participants, which can later be used for the identification of other key entities (collisions, vehicles, etc.). Key participants are identified by employing the Iterative Assessment Algorithm (IAA), which uses intrinsic and relational attributes of the entities. The algorithm assigns a suspicion score to each participant, which corresponds to the likelihood of it being fraudulent.

In classical approaches over flat data, entities are assessed using only their intrinsic attributes, thus they are assessed in complete isolation from other entities. It has been empirically shown that the assessment can be improved by also considering the related entities, more precisely, by considering the assessment of the related entities (Chakrabarti et al., 1998; Domingos & Richardson, 2001; Lu & Getoor, 2003a,b; Neville & Jensen, 2000). The assessment of an entity is inferred from the assessments of the related entities and propagated onward. Still, incorporating only the intrinsic attributes of the related entities generally doesn't improve, or even deteriorates, the assessment (Chakrabarti et al., 1998; Oh et al., 2000).

The proposed IAA algorithm thus assesses the entities by also considering the assessment of their related entities. As these related entities were also assessed using the assessments of their related entities, and so on, the entire network is used in the assessment of some particular entity. This could not be achieved otherwise, as the formulation would surely be too complex. We proceed by introducing IAA in a general form.

Let $c$ be some suspicious component in network $G$, $c \in S(G)$, and let $v_i$ be one of its vertices, $v_i \in V_c$. Furthermore, let $N(v_i)$ be the set of neighbor vertices of $v_i$ (i.e. vertices at distance 1) and $V(v_i) = N(v_i) \cup \{v_i\}$, and let $E(v_i)$ be the set of edges incident to $v_i$ (i.e. $E(v_i) = \{e \mid e \in E_c \wedge v_i \in e\}$). Let also $en_i$ be the entity corresponding to vertex $v_i$ and $N(en_i)$, $V(en_i)$ be the sets of entities that correspond to $N(v_i)$, $V(v_i)$ respectively. We define the suspicion score, $s(\cdot) \geq 0$, for the entity $en_i$ as

$$s(en_i) = AM(s(N(en_i)), V(en_i), V(v_i), E(v_i)) = AM(i, c), \qquad (18)$$

where $AM$ is some assessment model and $s(N(en_i))$ is the set of suspicion scores for the entities in $N(en_i)$. The suspicion of some entity is dependent on the assessment of the related entities (first argument in equation (18)), on the intrinsic attributes of the related entities and itself (second argument), and on the relational attributes of the entity (last two arguments). We assume that $AM$ is linear in the assessments of the related entities (i.e. $s(N(en_i))$) and that it returns higher values for fraudulent entities.

For some entity $en_i$, when the suspicion scores of the related entities are known, $en_i$ can be assessed using equation (18). Commonly, none of the suspicion scores are known beforehand (as the data set is unlabeled), and the equation thus cannot be used in the common manner. Still, one can incrementally assess the entities in an iterative fashion, similar to e.g. (Brin & Page, 1998; Kleinberg, 1999).

Let $s^0(\cdot)$ be some set of suspicion scores, e.g. $s^0(\cdot) = 1$. We can then assess the entities using scores $s^0(\cdot)$ and equation (18), and get better scores $s^1(\cdot)$. We proceed with this process, iteratively refining the scores until some stopping criterion is reached. Generally, on the $k$-th iteration, entities are assessed using

$$s^k(en_i) = AM(s^{k-1}(N(en_i)), V(en_i), V(v_i), E(v_i)) = AM(i, k, c). \qquad (19)$$

Note that the choice of $s^0(\cdot)$ is arbitrary – the process converges to some fixed point no matter the starting scores (due to some assumptions). Hence, the entities are assessed without preliminarily knowing any suspicion score to bootstrap the procedure.

We present the IAA algorithm below.

IAA algorithm

    s^0(·) = 1
    k = 1
    WHILE NOT stopping criteria DO
        FOR ∀ v_i, en_i DO
            s^k(en_i) = α s^{k-1}(en_i) + (1 − α) AM(i, k, c)
        FOR ∀ v_i, en_i : v_i non-bucket DO
            normalize s^k(en_i)
        k = k + 1
    RETURN s^k(·)
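A sketch of the algorithm in Python, under the simplifying assumptions that the component is given as an edge list, that the model only needs the neighbours' current scores, and that a fixed number of iterations (e.g. $d(c)$, as suggested below) serves as the stopping criterion:

    def iaa(vertices, edges, AM, buckets=frozenset(), alpha=0.75, n_iter=10):
        """Iterative Assessment Algorithm (sketch). `AM(v, nbrs, scores)` is
        any assessment model; `buckets` holds the bucket vertices (e.g. the
        collision vertices of a COPTA network), whose scores are stored but
        never normalized."""
        nbrs = {v: [] for v in vertices}
        for u, v in edges:
            nbrs[u].append(v)
            nbrs[v].append(u)
        scores = {v: 1.0 for v in vertices}               # s^0(.) = 1
        for _ in range(n_iter):
            new = {v: alpha * scores[v]
                      + (1 - alpha) * AM(v, nbrs[v], scores)
                   for v in vertices}
            norm = sum(s for v, s in new.items() if v not in buckets)
            for v in new:
                if v not in buckets and norm > 0:
                    new[v] /= norm                        # normalize non-buckets
            scores = new
        return scores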

Entities are iteratively assessed using model $AM$ ($\alpha$ is a smoothing parameter set to e.g. 0.75). In order for the process to converge, scores corresponding to non-bucket vertices are normalized at the end of each iteration. Due to the fact that relations represented by the networks are often not binary, there are usually some vertices only serving as buckets that store the suspicion assessed at this iteration to be propagated on the next. Non-bucket vertices correspond to entities that are actually being assessed and only these scores should be normalized (for binary relations all the vertices are of this kind). The structure of such bucket networks would typically correspond to bipartite graphs⁴ – bucket vertices would only be connected to non-bucket vertices (and vice versa). In the case of COPTA networks, used in this module of the (prototype) system, bucket vertices are those representing collisions.

One would intuitively run the algorithm until some fixed point is reached, i.e. when the scores no longer change. We empirically show that, although the iterative assessment does indeed increase the performance, running it all the way to the fixed point actually decreases it. The reason is that the scores over-fit the model. We also show that superior performance can be achieved with a dynamic approach – by running the algorithm for $d(c)$ iterations (the diameter of component $c$). For more see sections 5, 6.

Note that if each subsequent iteration of the algorithm actually increased the performance, one could assess the entities directly. When $AM$ is linear in the assessments of the related entities, the model could be written as a set of linear equations and solved exactly (analytically).

An arbitrary model can be used with the algorithm. We propose several linear models based on the observation that in many of these bucket networks the following holds: every entity is well defined with (only) the entities directly connected to it, considering the context observed. E.g. in the case of COPTA networks, every collision is connected to its participants, who are clearly the ones who "define" the collision, and every participant is connected with its collisions, which are the precise aspect of the participant we wish to investigate when dealing with fraud detection. A similar discussion could be made for movie-actor, corporate board-director and other well known collaboration networks. A model using no attributes of the entities is thus simply the sum of the suspicion scores of the related entities (we omit the arguments of the model)

$$AM_{raw} = \sum_{\{v_i, v_j\} \in E(v_i)} s(en_j). \qquad (20)$$

Our empirical evaluation shows that even such a simple model can achieve satisfactory performance.

To incorporate entities' attributes into the model, we introduce factors. These are based on intrinsic or relational attributes of entities. The intuition behind the first is that some intrinsic attributes' values are highly correlated with fraudulent activity. Suspicion scores of the corresponding entities should in this case be increased and also propagated onto the related entities. Moreover, many of the relational attributes (i.e. labels of the edges) increase the likelihood of fraudulent activity – the propagation of suspicion over such edges should also be increased.

Let $l_{E_G}$ be the edge labeling function and $\Sigma_{E_G}$ the alphabet of all possible edge labels, i.e. $\Sigma_{E_G} = \{Driver, Passenger \ldots\}$ (for COPTA networks). Furthermore, let $En$ be the set of all entities $en_i$. We define $F_{int}$, $F_{rel}$ to be the factors corresponding to intrinsic, relational attributes respectively, as

$$F_{int} : En \to [0, \infty), \qquad (21)$$

$$F_{rel} : \Sigma_{E_G} \to [0, \infty). \qquad (22)$$

⁴In the social science literature bipartite graphs are known as collaboration networks.


The improved model incorporating these factors is then

$$AM_{bas} = F_{int}(en_i) \sum_{e = \{v_i, v_j\} \in E(v_i)} F_{rel}(l_{E_G}(e)) \, s(en_j). \qquad (23)$$

The factors $F_{int}$ are computed from (similarly for $F_{rel}$)

$$F_{int}(en_i) = \prod_k F^k_{int}(en_i) \qquad (24)$$

where

$$F^k_{int}(en_i) = \begin{cases} 1/(1 - f^k_{int}(en_i)) & f^k_{int}(en_i) \geq 0 \\ 1 + f^k_{int}(en_i) & \text{otherwise} \end{cases} \qquad (25)$$

and

$$f^k_{int} : En \to (-1, 1). \qquad (26)$$

$f^k_{int}$ are virtual factors defined by the domain expert. The transformation with equation (25) is done only to define the factors on the interval $(-1, 1)$, rather than on $[0, \infty)$. The first is more intuitive as e.g. two "opposite" factors are now $f$ and $-f$, $f \in [0, 1)$, as opposed to $f$ and $1/f$, $f > 0$, before.

Some virtual factor $f^k_{int}$ can be an arbitrary function defined on a single attribute of some entity, or on several attributes, formulating correlations between the attributes. When the attributes' values correspond to some suspicious activity (e.g. a collision corresponds to some classical scheme), factors are set to be close to 1, and close to $-1$ when the values correspond to non-suspicious activity (e.g. children in the vehicle). Otherwise, they are set to 0.

Note that the assessment of some participant with models $AM_{raw}$ and $AM_{bas}$ is highly dependent on the number of collisions this participant was involved in. More precisely, on the number of terms in the sums in equations (20), (23) (which is exactly the degree of the corresponding vertex). Although this property is not vain, we still implicitly assume we possess all of the collisions a certain participant was involved in. This assumption is often not true (in practice).

We propose a third model diminishing the mentioned assumption. Let $d_G$ be the average degree of the vertices in network $G$, $d_G = ave\{d(v_k) \mid v_k \in V_G\}$. The model is

$$AM_{mean \cdot} = \frac{d_G + d(v_i)}{2} \cdot \frac{AM_{\cdot}}{d(v_i)} = \left(1 + \frac{d_G}{d(v_i)}\right) \frac{AM_{\cdot}}{2}, \qquad (27)$$

where $AM_{\cdot}$ can be any of the models $AM_{raw}$, $AM_{bas}$. $AM_{mean \cdot}$ averages the terms in the sum of the model $AM_{\cdot}$, and multiplies this average by the mean of the vertex's degree and the average degree over all the vertices in $V_G$. Thus a sort of Laplace smoothing is employed that pulls the vertex degree toward the average, in order to diminish the importance of this parameter in the final assessment. The empirical analysis in section 5 shows that such a model outperforms the other two.
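The simplest of these models then take only a few lines each; the sketch below is compatible with the iaa sketch above ($AM_{bas}$ would additionally weight each term with the factors of equations (21)-(25)), where avg_degree stands for $d_G$ and must be computed over the whole network:

    def am_raw(v, nbrs, scores):
        """AM_raw (equation (20)): sum of the neighbours' suspicion scores."""
        return sum(scores[u] for u in nbrs)

    def make_am_mean(avg_degree, base=am_raw):
        """AM_mean (equation (27)): Laplace-smoothed variant of a base model,
        pulling the vertex degree d(v_i) toward the network average d_G."""
        def am_mean(v, nbrs, scores):
            return (1 + avg_degree / len(nbrs)) * base(v, nbrs, scores) / 2
        return am_mean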

Knowing the scores $s(\cdot)$ for all the entities in some connected component $c \in G$, one can rank them according to the suspicion of their being fraudulent. In order to also compare the entities from various components, the scores must be normalized appropriately (e.g. multiplied by the number of collisions represented by component $c$).

4.4. Final remarks

In the previous section (third module of the system) we focused only on the detection of fraudulent participants. Their suspicion scores can now be used for the assessment of other entities (e.g. collisions, vehicles), using one of the assessment models proposed in section 4.3.

When all of the most highly ranked participants in some suspicious component are directly connected to each other (or through buckets), they are proclaimed to belong to the same group of fraudsters. Otherwise they belong to several groups. During the investigation process, the domain expert or investigator analyzes these groups and determines further actions for resolving potential fraud. Entities are investigated in the order induced by the scores s(·).

Networks also allow a neat visualization of the assessment (see Fig. 6).

Fig. 6: Four COPTA networks showing the same group of collisions. Sizes of the participants’ vertices correspond to their suspicion scores; only participants with a score above some threshold, and the connecting collisions, are shown in each network. The contour was drawn based on the harmonic mean distance to every vertex, weighted by the suspicion scores. (Blue) filled collision vertices in the first network correspond to collisions that happened at night.

5. Evaluation with the prototype system

We implemented a prototype system to empirically evaluate the performance of the proposition. Furthermore, various components of the system are analyzed and compared to other approaches. To simulate realistic conditions, the data set used for evaluation consisted only of data that can be easily (automatically) retrieved from police records (semistructured data). We report the results of the assessment of participants (not e.g. collisions).


5.1. Data

The data set consists of 3451 participants involved in 1561 collisions in Slovenia between the years 1999 and 2008. The set was made by merging two data sets, one labeled and one unlabeled.

The first, labeled, set consists of collisions corresponding to previously identified fraudsters and some other participants who were investigated in the past. In a few cases, when the class of a participant could not be determined, it was set according to the domain expert’s and investigator’s belief. As the purpose of our system is to identify groups of fraudsters, and not isolated fraudulent collisions, (almost) all isolated collisions were removed from this set. It is thus a bit smaller (i.e. 211 participants, 91 collisions), but still large enough to make the assessment.

To achieve a more realistic class distribution and better statistics for PRIDIT analysis, the second, larger data set was merged with the first. This set consists of various collisions chosen (almost) at random, although some of them are still related to others. Since random data sampling is not advised for relational data (Jensen, 1999), this set is used exclusively for PRIDIT analysis. Both data sets consist of only standard collisions (e.g. there are no chain collisions involving numerous vehicles or coaches with many passengers).

Class distribution for the data set can be seen in table 2.

Class distribution

                Count   Proportion (all)   Proportion (labeled)
Fraudster          46         1.3%               21.8%
Non-fraudster     165         4.8%               78.2%
Unlabeled        3240        93.9%                 –

Table 2: Class distribution for the data set used in the analysis of the proposed expert system.

The entire assessment was made using the merged data set, while the reported results naturally only correspond to the first (labeled) set. Note that the assessment of entities in some connected component is completely independent of the entities in other components (except for PRIDIT analysis).

5.2. Results

The performance of the system depends on the random generation of networks used for the detection of suspicious components (second module). We construct 200 random networks for each indicator and each component (equations (11), (12)); however, the results still vary a little. The entire assessment was thus repeated 20 times and the scores were averaged. To assess the ranking produced by the system, average AUC (Area Under Curve) scores were computed; the results given in tables 5, 6, 7 and 8 are all such averaged AUC scores.
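The following is a hedged sketch of this evaluation protocol, assuming scikit-learn for the AUC computation (the article does not name its tooling); run_assessment is a hypothetical stand-in for one full randomized run of the system.

    import numpy as np
    from sklearn.metrics import roc_auc_score  # assumed tooling

    def averaged_auc(labels, run_assessment, runs=20):
        """Average the AUC over repeated randomized runs; single runs
        vary slightly because the suspicious-component indicators are
        estimated from randomly generated networks. run_assessment is
        a hypothetical function returning one score per labeled entity."""
        return np.mean([roc_auc_score(labels, run_assessment())
                        for _ in range(runs)])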

In order to obtain a standard for the other analyses, we first report the performance of the system that uses PRIDIT analysis for fraudulent component detection, and the IAA algorithm with model AM^mean_bas for fraudulent entity detection, denoted IAA^mean_bas (see table 3). Various metrics are computed, i.e. classification accuracy (CA), recall (true positive rate), precision (positive predictive value), specificity (1 − false positive rate), F1 score (harmonic mean of recall and precision) and AUC. All but the last are metrics that assess the classification of some approach, thus a threshold for the suspicion scores must be defined. We report the results from the first run that minimize the total cost, assuming the cost of misclassified fraudsters and non-fraudsters is the same. The same holds for the confusion matrix seen in table 4.

Golden standard

CA           0.8720
Recall       0.8913
Precision    0.6508
Specificity  0.8667
F1 score     0.7523
AUC          0.9228

Table 3: Performance of the system that uses PRIDIT analysis with the IAA^mean_bas algorithm. Various metrics are reported; all except AUC are computed so that the total cost (on the first run) is minimal.

Confusion matrix

                Suspicious   Unsuspicious
Fraudster           41             5
Non-fraudster       22           143

Table 4: Confusion matrix for the system that uses PRIDIT analysis with the IAA^mean_bas algorithm (determined so that the total cost on the first run is minimal).
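The metrics of table 3 follow directly from this matrix; a short arithmetic check in Python:

    tp, fn = 41, 5      # fraudsters ranked suspicious / unsuspicious
    fp, tn = 22, 143    # non-fraudsters ranked suspicious / unsuspicious

    recall = tp / (tp + fn)                             # 0.8913
    precision = tp / (tp + fp)                          # 0.6508
    specificity = tn / (tn + fp)                        # 0.8667
    accuracy = (tp + tn) / (tp + fn + fp + tn)          # 0.8720
    f1 = 2 * precision * recall / (precision + recall)  # 0.7523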

We proceed with an in-depth analysis of the proposed IAA algorithm. Table 5 shows the results of the comparison of different assessment models, i.e. IAA_raw, IAA_bas, IAA^mean_raw and IAA^mean_bas. Factors for the models IAA_bas and IAA^mean_bas (equation (25)) were set by the domain expert, with the help of a statistical analysis of the collision data. To further analyze the impact of factors on the final assessment, an additional set of factors was defined by the authors. These values were set according to the authors’ intuition; the corresponding models are IAA_int and IAA^mean_int. Results of the analysis can be seen in table 6.

Assessment models

          IAA_raw   IAA_bas   IAA^mean_raw   IAA^mean_bas
PRIDIT     0.8872    0.9145       0.8942         0.9228

Table 5: Comparison of different assessment models for the IAA algorithm (after PRIDIT analysis).

Factors

          IAA^mean_raw   IAA^mean_int   IAA^mean_bas
ALL           0.8188         0.8435         0.8787
PRIDIT        0.8942         0.9086         0.9228

Table 6: Analysis of the impact of factors on the final assessment (on all the components and after PRIDIT analysis).

As already mentioned, the performance of the IAA algorithm depends on the number of iterations made in the assessment (see section 4.3). We have thus plotted the AUC scores with respect to the number of iterations made (for the first run), in order to clearly see the dependence; the plots for IAA^mean_raw and IAA^mean_bas can be seen in Fig. 7 and Fig. 8, respectively. We also show that superior performance can be achieved if the number of iterations

is set dynamically. More precisely, the number of iterations made for some component c ∈ C(G) is

$$\max\{\bar{d}_G, d(c)\}, \qquad (28)$$

where d(c) is the diameter of c and d̄_G the average diameter over all the components. All other results reported in this analysis used such a dynamic setting.
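A minimal sketch of the dynamic setting, assuming networkx for the diameter computation (the tooling is our assumption):

    import math
    import networkx as nx  # assumed tooling

    def dynamic_iterations(components):
        """Dynamic number of IAA iterations per component, equation (28):
        the larger of the component's diameter and the average diameter
        over all components, rounded up to an integral iteration count."""
        diameters = [nx.diameter(c) for c in components]
        avg = sum(diameters) / len(diameters)
        return [math.ceil(max(avg, d)) for d in diameters]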

Fig. 7: AUC scores with respect to the number of iterations made in the IAA algorithm. Solid curves correspond to the IAA^mean_raw algorithm after PRIDIT analysis and dashed curves to the IAA^mean_raw algorithm run on all the components. Straight line segments show the scores achieved with the dynamic setting of the number of iterations (see text).

Fig. 8: AUC scores with respect to the number of iterations made in the IAA algorithm. Solid curves correspond to the IAA^mean_bas algorithm after PRIDIT analysis and dashed curves to the IAA^mean_bas algorithm run on all the components. Straight line segments show the scores achieved with the dynamic setting of the number of iterations (see text).

Due to the lack of other expert system approaches for detecting groups of fraudsters, or even individual fraudsters (to the best of our knowledge), no comparative analysis of this kind could be made. The proposed IAA algorithm is thus compared against several well known measures for anomaly detection in networks – betweenness centrality (BetCen), closeness centrality (CloCen), degree centrality (DegCen) and eigenvector centrality (EigCen) (Freeman, 1977, 1979). They are defined as

$$BetCen(v_i) = \sum_{v_j, v_k \in V_c} \frac{g_{v_j v_k}(v_i)}{g_{v_j v_k}}, \qquad (29)$$

$$CloCen(v_i) = \frac{1}{n_c - 1} \sum_{v_j \in V_c \setminus \{v_i\}} d(v_i, v_j), \qquad (30)$$

$$DegCen(v_i) = \frac{d(v_i)}{n_c - 1}, \qquad (31)$$

$$EigCen(v_i) = \frac{1}{\lambda} \sum_{\{v_i, v_j\} \in E_c} EigCen(v_j), \qquad (32)$$

where n_c is the number of vertices in component c, n_c = |V_c|, λ is a constant, g_{v_j v_k} is the number of geodesics between vertices v_j and v_k, and g_{v_j v_k}(v_i) is the number of such geodesics that pass through vertex v_i, i ≠ j ≠ k. For further discussion see (Freeman, 1977, 1979; Newman, 2003).

These measures of centrality were used to assign a suspicion score to each participant; the scores were also appropriately normalized, as in the case of the IAA algorithm. For a fair comparison, the measures were compared against the model that uses no intrinsic attributes of entities, i.e. IAA^mean_raw. The results of the analysis are shown in table 7.

IAA algorithm

          BetCen   CloCen   DegCen   EigCen   IAA^mean_raw
ALL       0.6401   0.8138   0.7428   0.7300       0.8188
PRIDIT    0.6541   0.8158   0.8597   0.8581       0.8942

Table 7: Comparison of the IAA algorithm against several well known measures for anomaly detection in networks (on all the components and after PRIDIT analysis). For a fair comparison, no intrinsic attributes are used in the IAA algorithm (i.e. model AM^mean_raw).
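For reference, the four measures correspond to standard networkx functions; this pairing is our assumption (the article does not name tooling), and networkx normalizations differ slightly from equations (29)–(32) – e.g. its closeness centrality is the inverse of the mean-distance form in equation (30).

    import networkx as nx  # assumed tooling

    def centrality_scores(component):
        """Per-vertex scores from the four reference measures; networkx
        normalizations differ slightly from equations (29)-(32)."""
        return {
            "BetCen": nx.betweenness_centrality(component),
            "CloCen": nx.closeness_centrality(component),
            "DegCen": nx.degree_centrality(component),
            "EigCen": nx.eigenvector_centrality(component),
        }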

Next, we analyzed different approaches for the detection of fraudulent components (see table 8). The same set of 9 indicators was used for the majority voter (equation (13)) and for (P)RIDIT analysis (equation (17)). For the latter, we use a variant of random undersampling (RUS) to cope with the skewed class distribution. We output the most highly ranked components, so that the set of selected components contains 4% of all the collisions (in the merged data set); a sketch of this selection step is given below. Analyses of automobile insurance fraud mainly agree that up to 20% of all collisions are fraudulent, and up to 20% of the latter correspond to non-opportunistic fraud (various resources). However, for the majority voter such an approach actually decreases performance – we therefore report results where all components with at least half of the indicators set are selected.
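A sketch of the selection step under the stated 4% budget; the ranked input and the n_collisions helper are hypothetical names of ours.

    def select_components(ranked, total_collisions, share=0.04):
        """Select the most suspicious components until they cover roughly
        `share` of all collisions. `ranked` is assumed sorted by decreasing
        component suspicion; n_collisions() is a hypothetical helper
        counting the collisions a component represents."""
        selected, covered = [], 0
        for component in ranked:
            if covered >= share * total_collisions:
                break
            selected.append(component)
            covered += component.n_collisions()
        return selected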

Several individual indicators achieving superior performance are also reported. Indicator I_BetCen is based on betweenness centrality (equation (29)), I_MinCov on minimum vertex cover, and I_{l^{-1}} on the l^{-1} measure, defined as the harmonic mean distance between every pair of vertices in some component c,

$$l^{-1} = \frac{1}{\frac{1}{2}n_c(n_c+1)} \sum_{v_i,v_j\in V_c,\; i\geq j} d(v_i,v_j)^{-1}. \qquad (33)$$
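A minimal sketch of the l^{-1} computation for one component, assuming networkx and taking the i = j terms as zero (a convention we assume, since d(v, v) = 0):

    import networkx as nx  # assumed tooling

    def harmonic_mean_distance(component):
        """The l^-1 measure of equation (33); the i = j terms are
        taken to contribute zero by convention."""
        n = component.number_of_nodes()
        lengths = dict(nx.all_pairs_shortest_path_length(component))
        # Sum d(v_i, v_j)^-1 over unordered pairs (each counted twice).
        total = sum(1.0 / d for v, dist in lengths.items()
                    for u, d in dist.items() if u != v) / 2
        return total / (0.5 * n * (n + 1))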

Fraudulent components

                I_MinCov   I_l^-1   I_BetCen   MAJOR    RIDIT    PRIDIT
ALL               0.6019   0.6386     0.6774   0.7946   0.6843   0.7114
IAA^mean_bas      0.6119   0.8494     0.8549   0.8507   0.9221   0.9228

Table 8: Comparison of different approaches for the detection of fraudulent components (top: prior to fraudulent entity detection; bottom: followed by fraudulent entity detection with IAA^mean_bas).

Lastly, we analyzed the importance of proper data representation for the detection of groups of fraudsters – the use of networks. Networks were thus transformed into flat data and some standard unsupervised learning techniques were examined (e.g. k-means, hierarchical clustering). We obtained no results comparable to those given in table 3.

Furthermore, we tested nine standard supervised data mining techniques to analyze whether data labels can compensate for the inappropriate representation of the data. We used the (default) implementations of the classifiers in the Orange data mining software (Demsar et al., 2004), and 20-fold cross validation was employed as the validation technique. The best performance, up to AUC ≈ 0.86, was achieved with naive Bayes, support vector machines, random forest and, interestingly, also the k-nearest neighbors classifier. Scores for the other approaches were below AUC = 0.80 (e.g. logistic regression, classification trees, etc.).
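For illustration, a hedged sketch of the same protocol using scikit-learn instead of Orange (whose exact API we do not reproduce here), on placeholder data:

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB

    rng = np.random.default_rng(0)
    X = rng.random((211, 10))        # placeholder flattened attributes
    y = rng.integers(0, 2, 211)      # placeholder fraud labels

    # 20-fold cross-validated AUC, as in the evaluation protocol.
    auc = cross_val_score(GaussianNB(), X, y, cv=20, scoring="roc_auc").mean()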

6. Discussion

Empirical evaluation from the previous section shows that automobile insurance fraud can be detected using the proposition. Moreover, the results suggest that appropriate data representation is vital – even a simple approach over networks can detect a great deal of fraud. In the following we discuss the results in greater detail (in the order given).

Almost all of the metrics obtained with PRIDIT analysis and the IAA^mean_bas algorithm, the golden standard, are very high (table 3). Only precision appears low; still, this results (only) from the skewed class distribution in the domain. The F1 score is consequently also a bit lower; otherwise the performance of the system is more than satisfactory. The latter was confirmed by the experts and investigators from a Slovenian insurance company, who were also pleased with the visual representation of the results.

The confusion matrix given in table 4 shows that we correctly classified almost 90% of all fraudsters and over 85% of non-fraudsters. Only 5 fraudsters were not detected by the prototype system. We thus obtained a particularly high recall, which is essential for all fraud detection systems. The majority of unlabeled participants were classified as unsuspicious (not shown in table 4), but the corresponding collisions are mainly isolated, and the participants could have been trivially eliminated anyway (for our purposes).

We proceed with a discussion of the different assessment models (table 5). The performance of the simplest of the models, IAA_raw, which uses no domain expert’s knowledge, could already prove sufficient in many circumstances. It can still be significantly improved by also considering the factors set by the domain expert (model IAA_bas). Model IAA^mean_· further improves the assessment of both (simple) models, confirming the hypothesis behind it (see section 4.3). Although the models (probably incorrectly) assume that the fraudulence of an entity is linear (in the fraudulences of the related entities), they give a good approximation of fraudulent behavior.

The analysis of the factors used in the models confirms their importance for the final assessment. As expected, model IAA^mean_bas outperforms IAA^mean_int, and the latter outperforms IAA^mean_raw (table 6). First, this confirms the hypothesis that domain knowledge can be incorporated into the model using factors (as defined in section 4.3). Second, it shows that a better understanding of the domain can improve the assignment of factors. The combination of both makes the system extremely flexible, allowing for the detection of new types of fraud immediately after they have been noticed by the domain expert or investigator.

As already mentioned, running the IAA algorithm for too long over-fits the model and decreases the algorithm’s final performance (see Fig. 7 and Fig. 8; note the different scales used). Early iterations of the algorithm still increase the performance in all cases analyzed, which proves the importance of iterative assessment as opposed to a single-pass approach. However, after some particular number of iterations has been reached, performance decreases (at least slightly). Also note that the decrease is much larger in the case of the AM^mean_raw model than AM^mean_bas, indicating that the latter is superior to the former. We propose to use this decrease in performance as an additional evaluation of any model used with the IAA, or a similar, algorithm.

It is preferable to run the algorithm for only a few iterations for one more reason. Networks are often extremely large, especially when they describe many characteristics of entities. In this case, running the algorithm until some fixed point is simply not feasible. Since the prototype system uses only the basic attributes of the entities, the latter does not present a problem.

The number of iterations that achieves the best performance clearly depends on various factors (data set, model, etc.). Our evaluation shows that superior, or at least very good, performance (Fig. 7, Fig. 8) can be achieved with the use of a dynamic setting of the number of iterations (equation (28)).

When no detection of fraudulent components is done, the comparison between the IAA algorithm and the measures of centrality shows no significant difference (table 7). On the other hand, when we use PRIDIT analysis for fraudulent component detection, the IAA algorithm dominates the others. Still, the results obtained with DegCen and EigCen are comparable to those obtained with supervised approaches over flat data. This shows that even a simple approach can detect a reasonably large portion of fraud, if an appropriate representation of the data is used (networks).

The analysis of the different approaches for the detection of fraudulent components produces no major surprises (table 8) – the best results are obtained using (P)RIDIT analysis. Note that a single indicator can match the performance of the majority classifier MAJOR, confirming its naiveness (see section 4.2); the exceptionally high AUC score obtained by MAJOR, when no fraudulent entity detection follows, only results from the fact that the returned set of suspicious components is almost 10 times smaller than for the other approaches. The precision of the approach is thus much higher, but at the price of a lower recall (useless for fraud detection).

We have already discussed the purpose of hierarchical detection of groups of fraudsters – to simplify the detection of fraudulent entities with an appropriate detection of fraudulent components. Another implication of such an approach is a simpler, or in some cases even feasible, data collection process. As the detection of components is done using only the relations between entities (relational attributes), a large portion of the data can be discarded without knowing the values of any of the intrinsic attributes. This characteristic of the system is vital when deploying in practice – (complete) data often cannot be obtained for all the participants, due to the sensitivity of the domain.

Lastly, we briefly discuss the applicability of the proposition in other domains. The presented IAA algorithm can be used for arbitrary assessment of entities over some relational domain, exploring the relations between entities with no demand for an (initial) labeled data set. When every entity is well defined by (only) the entities directly related to it, considering the context observed, one of the proposed assessment models can also be used. Furthermore, the presented framework (the four modules of the system) could be employed for fraud detection in other domains, and more generally in domains where we are interested in groups of related entities sharing some particular characteristics. The framework exploits the relations between entities in order to improve the assessment, and is structured hierarchically to make it applicable in practice.

7. Conclusion

The article proposes a novel expert system approach for the detection of groups of automobile insurance fraudsters with networks. Empirical evaluation shows that such fraud can be efficiently detected using the proposition and, in particular, that proper representation of the data is vital. For the system to be applicable in practice, no labeled data set is used. The system rather allows for the imputation of the domain expert’s knowledge, and it can thus be adapted to new types of fraud as soon as they are noticed. The approach can aid the domain investigator to detect and investigate fraud much faster and more efficiently. Moreover, the employed framework is easy to implement and is also applicable for detection (of fraud) in other relational domains.

Future research will be focused on further analyses of different assessment models for the IAA algorithm, considering also nonlinear models. Moreover, the IAA will be altered into an unsupervised algorithm, learning the factors of the model in an unsupervised manner during the actual assessment. The factors would thus not have to be specified by the domain expert. Applications of the system in other domains will also be investigated.

Acknowledgements

The authors thank Matjaz Kukar, from the University of Ljubljana, and Jure Leskovec, (currently) from Cornell University, for all the effort and useful suggestions; Tjasa Krisper Kutin for proofreading the article; and Optilab d.o.o. for the data used in the study. This work has been supported by the Slovene Research Agency ARRS within the research program P2-0359.

References

Artis, M., Ayuso, M., & Guillen, M. (2002). Detection of automobile insurance fraud with discrete choice models and misclassified claims. Journal of Risk and Insurance, 69(3), 325–340.

Ball, F., Mollison, D., & Scalia-Tomba, G. (1997). Epidemics with two levels of mixing. Annals of Applied Probability, 7(1), 46–89.

Barabasi, A. L., & Albert, R. (1999). Emergence of scaling in random networks. Science, 286(5439), 509–512.

Bolton, R. J., & Hand, D. J. (2002). Statistical fraud detection: A review. Statistical Science, 17(3), 235–249.

Brin, S., & Page, L. (1998). The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems, 30(1-7), 107–117.

Brockett, P. L. (1981). A note on the numerical assignment of scores to ranked categorical data. Journal of Mathematical Sociology, 8(1), 91–101.

Brockett, P. L., Derrig, R. A., Golden, L. L., Levine, A., & Alpert, M. (2002). Fraud classification using principal component analysis of RIDITs. Journal of Risk and Insurance, 69(3), 341–371.

Brockett, P. L., & Levine, A. (1977). Characterization of RIDITs. Annals of Statistics, 5(6), 1245–1248.

Bross, I. (1958). How to use RIDIT analysis. Biometrics, 14(1), 18–38.

Chakrabarti, S., Dom, B., & Indyk, P. (1998). Enhanced hypertext categorization using hyperlinks. In Proceedings of the International Conference on Management of Data (pp. 307–318).

Demsar, J., Zupan, B., Leban, G., & Curk, T. (2004). Orange: From experimental machine learning to interactive data mining. In Knowledge Discovery in Databases: PKDD 2004 (pp. 537–539). Springer Berlin, Heidelberg.

Domingos, P., & Richardson, M. (2001). Mining the network value of customers. In Proceedings of the International Conference on Knowledge Discovery and Data Mining (pp. 57–66).

Eppstein, D., & Wang, J. (2002). A steady state model for graph power laws. In Proceedings of the International Workshop on Web Dynamics.

Estevez, P. A., Held, C. M., & Perez, C. A. (2006). Subscription fraud prevention in telecommunications using fuzzy rules and neural networks. Expert Systems with Applications, 31(2), 337–344.

Freeman, L. (1977). A set of measures of centrality based on betweenness. Sociometry, 40(1), 35–41.

Freeman, L. C. (1979). Centrality in social networks – conceptual clarification. Social Networks, 1(3), 215–239.

Furlan, S., & Bajec, M. (2008). Holistic approach to fraud management in health insurance. Journal of Information and Organizational Sciences, 32(2), 99–114.

Ghosh, A. K., & Schwartzbard, A. (1999). A study in using neural networks for anomaly and misuse detection. In Proceedings of the Conference on USENIX Security Symposium (pp. 12–12).

Girvan, M., & Newman, M. E. J. (2002). Community structure in social and biological networks. Proceedings of the National Academy of Sciences USA, 99(12), 7821–7826.

Holder, L. B., & Cook, D. J. (2003). Graph-based relational learning: current and future directions. SIGKDD Explorations, 5(1), 90–93.

Hu, W., Liao, Y., & Vemuri, V. R. (2007). Robust anomaly detection using support vector machines. In Proceedings of the International Conference on Machine Learning.

Hummel, R. A., & Zucker, S. W. (1983). On the foundations of relaxation labeling processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 5(3), 267–287.

Jensen, D. (1997). Prospective assessment of AI technologies for fraud detection and risk management. In Proceedings of the AAAI Workshop on AI Approaches to Fraud Detection and Risk Management (pp. 34–38).

Jensen, D. (1999). Statistical challenges to inductive inference in linked data. In Proceedings of the International Workshop on Artificial Intelligence and Statistics (pp. 59–62).

Kempe, D., Kleinberg, J., & Tardos, E. (2003). Maximizing the spread of influence through a social network. In Proceedings of the International Conference on Knowledge Discovery and Data Mining (pp. 137–146).

Kirkos, E., Spathis, C., & Manolopoulos, Y. (2007). Data mining techniques for the detection of fraudulent financial statements. Expert Systems with Applications, 32(4), 995–1003.

Kleinberg, J. M. (1999). Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5), 604–632.

Kschischang, F. R., & Frey, B. J. (1998). Iterative decoding of compound codes by probability propagation in graphical models. IEEE Journal on Selected Areas in Communications, 16(2), 219–230.

Lu, Q., & Getoor, L. (2003a). Link-based classification using labeled and unlabeled data. In Proceedings of the ICML Workshop on the Continuum from Labeled to Unlabeled Data in Machine Learning and Data Mining.

Lu, Q., & Getoor, L. (2003b). Link-based text classification. In Proceedings of the IJCAI Workshop on Text Mining and Link Analysis.

Maxion, R. A., & Tan, K. M. C. (2000). Benchmarking anomaly-based detection systems. In Proceedings of the International Conference on Dependable Systems and Networks (pp. 623–630).

Minka, T. (2001). Expectation propagation for approximate Bayesian inference. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (pp. 362–369).

Neville, J., & Jensen, D. (2000). Iterative classification in relational data. In Proceedings of the Workshop on Learning Statistical Models from Relational Data (pp. 13–20).

Newman, M. E. J. (2003). The structure and function of complex networks. SIAM Review, 45(2), 167–256.

Newman, M. E. J. (2008). Mathematics of networks. In The New Palgrave Encyclopedia of Economics. Palgrave Macmillan, Basingstoke.

Noble, C. C., & Cook, D. J. (2003). Graph-based anomaly detection. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 631–636).

Oh, H. J., Myaeng, S. H., & Lee, M. H. (2000). A practical hypertext categorization method using links and incrementally available class information. In Proceedings of the ACM SIGIR International Conference on Research and Development in Information Retrieval (pp. 264–271).

Perez, J. M., Muguerza, J., Arbelaitz, O., Gurrutxaga, I., & Martin, J. I. (2005). Consolidated tree classifier learning in a car insurance fraud detection domain with class imbalance. In Proceedings of the International Conference on Advances in Pattern Recognition (pp. 381–389).

Phua, C., Lee, V., Smith, K., & Gayler, R. (2005). A comprehensive survey of data mining-based fraud detection research. Artificial Intelligence Review.

Quah, J. T. S., & Sriganesh, M. (2008). Real-time credit card fraud detection using computational intelligence. Expert Systems with Applications, 35(4), 1721–1732.

Rupnik, R., Kukar, M., & Krisper, M. (2007). Integrating data mining and decision support through data mining based decision support system. Journal of Computer Information Systems, 47(3), 89–104.

Sanchez, D., Vila, M. A., Cerda, L., & Serrano, J. M. (2009). Association rules applied to credit card fraud detection. Expert Systems with Applications, 36(2), 3630–3640.

Sen, P., & Getoor, L. (2007). Link-based classification. Tech. Rep. CS-TR-4858, University of Maryland.

Sun, J., Qu, H., Chakravarti, D., & Faloutsos, C. (2005). Relevance search and anomaly detection in bipartite graphs. ACM SIGKDD Explorations Newsletter, 7(2), 48–55.

Viaene, S., Dedene, G., & Derrig, R. A. (2005). Auto claim fraud detection using Bayesian learning neural networks. Expert Systems with Applications, 29(3), 653–666.

Viaene, S., Derrig, R. A., Baesens, B., & Dedene, G. (2002). A comparison of state-of-the-art classification techniques for expert automobile insurance claim fraud detection. Journal of Risk and Insurance, 69(3), 373–421.

Watts, D. J., & Strogatz, S. H. (1998). Collective dynamics of ‘small-world’ networks. Nature, 393(6684), 440–442.

Weisberg, H. I., & Derrig, R. A. (1998). Quantitative methods for detecting fraudulent automobile bodily injury claims. Risques, 35, 75–101.

Yang, W. S., & Hwang, S. Y. (2006). A process-mining framework for the detection of healthcare fraud and abuse. Expert Systems with Applications, 31(1), 56–68.