a graph-based method to detect rare events: an application to...

A Graph-based Method to Detect Rare Events:

An Application to Identify Pathologic Cells

Eniko SzekelyLIRMM

Universite Montpellier 2France

Arnaud SallaberryLIRMM


Faraz ZaidiUniversity of Lausanne,

Switzerland, andKarachi Institute of Economics

and Technology, Pakistan

Pascal PonceletLIRMM


Abstract

Detection of outliers and anomalous behavior is a well known prob-lem in the field of data mining and statistics. Although the problem ofidentifying single outliers has been extensively studied in the literature,little or some effort has been consecrated to the detection of small groupsof outliers which are similar to each other but markedly different fromthe entire population. Many real world scenarios have small groups ofoutliers, for example, a group of students that excel in a classroom or agroup of spammers in an online social network. In this paper, we pro-pose a novel method to solve this challenging problem which lies at thefrontiers of outlier detection and clustering similar groups. The methodtransforms a multidimensional dataset as graph, applies a network met-ric to detect clusters and renders a representation for visual explorationto find rare events. We test the proposed method to detect pathologiccells (e.g. Cancer, HIV, CVA, etc.) in the domain of biomedical science.The results are very promising and conform the available ground truthprovided by the domain experts.

Keywords: Outlier detection, Rare Events, Group of Outliers, Visualization,Clustering, Pathologic Cells

1 Introduction

With the recent technological developments, acquisition and storage of largedatasets from various domains has become a common feature. Outliers in these

1

datasets can arise due to changes in system behaviour, instrument or humanerror, intentional fraudulent behaviour, or through natural deviations in pop-ulation due to epidemics and virus infections. Detection of these outliers hasfound many applications such as data cleansing, stopping fraudulent intentions,controlling disease outbreaks and detecting infected individuals.

A recent area in the study of outlier detection is the identification of agroup of outliers, called rare events. The members of this group are similar toeach other, and deviate from the general behavior of the dataset as to arousesuspicions that they were generated by a different mechanism. Furthermore,these groups are usually very small when compared to clusters or groupings ofthe entire data set, thus classifying them as outliers. Detection of such a groupbecomes more challenging when the objective is to detect only groups of suchoutliers which are similar to each other and ignore outliers which are differentamongst themselves. Examples of such rare events include identification ofstudents in a class that excel academically or a group of spammers or autobotsthat increase the popularity of an individual or an event in an online socialnetwork.

Another critical application of detecting rare events appears in biomedi-cal science and clinical research, the problem of extracting rare events fromFlow Cytometry Standard (FCS) data files [10]. Such a file consists of multi-parametric descriptions of thousands to millions of individual and rare cellscalled biological markers of interest and are used in the monitoring of vasculardiseases, oncology and infectious diseases. Detection of rare events with a highrecall, i.e. with no false negatives is critical in this domain as the cost of missingpathologic cells is significantly much higher than the cost of misclassifying ahealthy group of cells.

In this article, we address this challenging problem of detecting rare eventsin FCS files. The proposed approach is based on initial candidate selection usingkNN, filtering irrelevant candidates, applying a metric for detecting densely con-nected data items and rendering a representation for interactive visual detectionof the cluster of interest. Results demonstrate the accuracy and effectivenessof the proposed algorithm when compared with available ground truth. Theproposed approach is although generic, but we have tested it only to detectpathologic cells. We intend to extend this study to other data sets as part offuture work.

Rest of the paper is organized as follows: In section 3, we described the pro-posed method to detect group of outliers. Section 2 explains the real world dataset of cells used for experimentation and section 4 presents how the proposedmethod is used to detect pathologic cells. Finally in section 5, we present ourconclusion with possible future research prospects.

2

Defining an Outlier and a Group of OutliersAccording to [1], ‘An outlier is an observation which deviates so much from theother observations as to arouse suspicions that it was generated by a differentmechanism’. [2] defines an outlier as ‘An observation or subset of observationswhich appear to be inconsistent with the remainder of that dataset’.

Other terminologies commonly used for outliers are novelty detection,anomaly detection, one-class classification, noise detection, deviation detectionor exception mining [3]. Outlier detection has been found useful in numerousapplications such as intrusion detection in computer networks, fraudulent usageof credit cards, topic detection in news documents and web pages, discoveryof temporal changes in evolving online social networks, and identification ofinconsistent digital records.

Analogous to an outlier, a group of outliers can be defined as a sub-population of individuals deviating from the general behavior of the entire pop-ulation but similar to each other. The cardinality of the set of rare events isusually very small as compared to the general grouping of the data set, whichclassifies them as outliers. We use the term rare events in this text to refer tothis group of outliers whereas synonyms such as cluster of outliers [4], clusteredanomaly [5], anomaly collection [6] and micro-clusters [7] are also used in theliterature.

1. Douglas M Hawkins. Identification of outliers, volume 11. Chapman and Hall London, 1980.

2. Vic Barnett and Toby Lewis. Outliers in statistical data. Wiley Series in Probability andMathematical Statistics. Applied Probability and Statistics, Chichester: Wiley, 1984, 2nded., 1, 1984.

3. Victoria Hodge and Jim Austin. A survey of outlier detection methodologies. ArtifcialIntelligence Review, 22(2):85-126, 2004.

4. David M. Rocke and David L. Woodruff Identification of outliers in multivariate data. Jour-nal of the American Statistical Association,91(435):1047-1061, 1996.

5. Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. On detecting clustered anomalies us-ing sciforest. In Machine Learning and Knowledge Discovery in Databases, pages 274-290.Springer, 2010.

6. Hanbo Dai, Feida Zhu, Ee-Peng Lim, and HweeHwa Pang. Mining coherent anomaly collec-tions on web data. In Proceedings of CIKM, 1557-1561, 2012.

7. Duck-Ho Bae, Seo Jeong, Sang-Wook Kim, and Minsoo Lee. Outlier detection using central-ity and center-proximity. In Proceedings of CIKM, 2251-2254, 2012.

3

Outlier Detection ApproachesThere are different approaches to detect outliers in the literature [1]. We candetermine outliers without the availability of any prior knowledge. This ap-proach processes data and identifies the most distant points as outliers. If thedistribution of the data is known, we can identify data following the distribu-tion as normal and the remaining data points as outliers. There can be multiplepre-determined classes of normal data and deviation from these classes revealoutliers. Another perspective of this above mentioned classification is on thebasis of parametric and non-parametric methods. Statistical methods often takesome parameters as input whereas methods based on distance and density donot require any input parameters [1].

A completly different technique to detect group of outliers is based on clus-tering algorithms where the idea is to identify clusters which are smaller in sizeas outliers [2]. Clustering algorithms such as K-means are ideally suited to findconvex clusters but these algorithms are highly sensitive to input parametersand can result in misclassfication of clusters and outliers [3]. Usually clusteringalgorithms attempt to balance clusters of varying sizes. For example, Spec-tral clustering algorithms have gained lot of popularity due to there low timecomplexity and scalability, but they use RatioCut and Ncut to create balancedclusters [4] making them impractical to detect rare events. Some clustering al-gorithms allow the generation of different size clusters [5] but a priori knowledgeabout the size of clusters is required to detect rare events which is difficult inmost domains. Furthermore different clustering algorithms can result in differ-ent clusters for the same data set. With all these described inconsistencies inthe clustering algorithms, it is difficult to rely on them to detect true positivesespecially in critical applications.

Cluster based approaches use densities and distances to identify outliers. Adifferent approach is based on the notion of isolation [6]. These methods exploitthe fact that anomalies occur rarely in datasets. Based on training using sub-samplings and evaluation stages, these methods discover rare events by buildingforests of binary trees. These methods are effective in the discovery of globalrare events, but perform sub-optimally when rare events are close to the generalbehavior of the entire population.

Another approach to detecting outliers is the use of summary statistics andvisual representations. Boxplots, along with its several variations [7] have beencommonly used to compare univariate distributions as well as detection of out-liers. These visual representations are hard to read and are not scalable withmultivariate data.

Recent advances in visual analytics and visual data mining have also in-troduced approaches to detect outliers through visual representation and in-teractive exploration [8]. Visualization techniques exploit the human patternrecognition capacity to detect anomalies and are developed by building user in-terfaces and interactions to manipulate the graphical representation of data [9].Problem with these approaches is there limited application to large data sets

4

as it becomes difficult to interactively explore and find rare events in thousandsand millions of data items.

Extensive literature is available in the form of surveys and books to addressthe issue of outlier detection [10, 11, 12, 13, 14] but these works do not approachthe problem of detecting rare events nor they address it from an interactivevisualization perspective.

1. N. Suri, M Murty, and G Athithan. Data mining techniques for outlier detection. VisualAnalytics and Interactive Technologies: Data, Text, and Web Mining Applications, page 19,2011.

2. Chandola V, Banerjee A, Kumar V, Anomaly detection: a survey. ACM Computing Surveys41(15), 2009.

3. Shekhar R Gaddam, Vir V Phoha, and Kiran S Balagani. K-means+ id3: A novel method forsupervised anomaly detection by cascading k-means clustering and id3 decision tree learningmethods. Knowledge and Data Engineering, IEEE Transactions on, 19(3):345-354, 2007.

4. Ulrike Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395-416,2007.

5. Zhu S, Wang D, Li T. Data clustering with size constraints. Knowledge Based Systems,Elsevier 23:883-889, 2010.

6. Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. On detecting clustered anomalies us-ing sciforest. In Machine Learning and Knowledge Discovery in Databases, pages 274-290.Springer, 2010.

7. Robert McGill, John W. Tukey, and Wayne A. Larsen. Variations of box plots. The AmericanStatistician, 32(1):12-16, 1978.

8. Ming C. Hao, Daniel A Keim, Umeshwar Dayal, and Jorn Schneidewind. Business processimpact visualization and anomaly detection. Information Visualization, 5(1):15-27, 2006.

9. Yufeng Kou, Chang-Tien Lu, Sirirat Sirwongwattana, and Yo-Ping Huang. Survey of frauddetection techniques. In Networking, sensing and control, 2004 IEEE international confer-ence on, volume 2, pages 749-754. IEEE, 2004.

10. Vic Barnett and Toby Lewis. Outliers in statistical data. Wiley Series in Probability andMathematical Statistics. Applied Probability and Statistics, Chichester: Wiley, 1984, 2nded., 1, 1984.

11. Victoria Hodge and Jim Austin. A survey of outlier detection methodologies. ArtifcialIntelligence Review, 22(2):85-126, 2004.

12. Peter J Rousseeuw and Annick M Leroy. Robust regression and outlier detection, volume589. Wiley, 2005.

13. Malik Agyemang, Ken Barker, and Rada Alhajj. A comprehensive survey of numeric andsymbolic outlier mining techniques. Intelligent Data Analysis, 10(6):521-538, 2006.

14. Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly detection: A survey. ACMComputing Surveys (CSUR), 41(3):15, 2009.

5

Figure 1: Steps of the proposed method to detect Rare Events.

2 Data Sets and Problem Definition

In this section, we define the data sets used for experimentation purposes wherethe objective is to detect pathologic cells.

Flow cytometry is a laser based, biophysical technology used in cell counting,sorting, biomarker detection and protein engineering. Flow cytometers of thecurrent generation have a capacity of analyzing more than 1 × 105 cells persecond. It measures the characteristics of single cells, determined by visible andfluorescent light emissions from the markers on the cells. These labeled cellspass a laser that emits light at a particular wavelength according to the specificmarkers attached to the cell fluoresce. For each cell, a fluorescence intensityvalue is collected and stored in Flow Cytometry Standard (FCS) data files. FCSdata file thus consists of multi-parametric descriptions of thousands to millionsof individual cells. Analyzing and sorting subpopulations widely representing

6

Patient Disease Total Nodes Pathologic Cells

P1 Intracranial aneurysm 1,895,261 25P2 Intracranial aneurysm 2,524,916 15P3 Cancer 2,470,042 7

Table 1: Data sets used for experimentation with available ground truth.

immune cells (CD4 + T lymphocytes, CD8 +, B, or NK) is a common practice[8].

Recently biomedical science and clinical research have addressed the problemof extracting rare events from these data files [10] with 1× 102 to 1× 103 cellsper milliliter (ml) of blood cells for 20 ml of blood. FCS data files are then usedin the monitoring of vascular diseases, oncology and infectious diseases. Rareevents in these cells often occur with a very small frequency where researchershave cited this number in between 0.1% to 0.00001% of the total population [3].Recent advances in flow cytometry have enabled it to emerge as an importanttool in the systems biological approach to theoretical and clinical research [10](See [3] for more applications of detecting rare events in flow cytometry).

Methods for analyzing FCS consist in plotting different combinations of de-scriptors two at a time in a 2D scatter plot and then select subgroups of cellsusing gates. These gates are regions that can have any shape. The cells withinthe gates are selected for further analysis and plotted in another 2D scatter plotwith different axis. The main problem with this approach is that it is not onlytedious, it can also miss potential subgroup of rare cells [4].

3 Proposed Method

The proposed method takes as input, data in a tabular form where each rowcorrespondes to a data item (a blood cell) along with its numeric attributes andoutputs the cluster of interest which is a set of rare events as shown in figure 1.The processing steps are explained below.

For each pair of data item in the table, we calculate the euclidean distanceas the similarity metric among data items. The next step is to construct a graphbased on this similarity among data items. For each node, we find the nearestneighbors in terms of distances using K nearest neighbor algorithm (kNN) [5]where K is known apriori. For every node, directed edges are introduced wherethe target nodes are a node’s K nearest neighbors in terms of euclidean distancecalculated above. The choice of value K depends upon the estimated size of therare events that we are trying to detect. For problems where this size cannotbe estimated, we can interactively test different values to find an appropriatethreshold. In case of pathologic cells, since we know that the usual cluster sizeis 50, we consider K=100 to ensure that we do not miss any true positives.

As a result, we obtain a directed graph where each node is connected to theirK most similar data items. Since the maximum number of data items in a cluster

7

are known apriori, we apply a filter to remove all nodes with in-degree greaterthan the known cluster size. This is because all these nodes which are similar tomany nodes represent the regular data items which can be found readily in thegraph and thus, cannot belong to the group of rare events. After filtering nodeswith in-degree higher than the cluster size, we ignore the orientation of edgesobtaining a simple graph. This is because the direction of edges is no longerused in further processing. Figure 2 shows the graph obtained as a result of theabove steps for Patient 1.

Figure 2: Visual representation of the Simple graph generated in Step 4 of theProposed Method. A force directed algorithm FM3 [6] (see [7] for a comparisonof fast graph drawing algorithms) is used to render the graph for Patient 1 (seetable 1).

The next step is find clusters of nodes based on structural similarity. Weuse the strength metric to detect densely connected group of nodes within thegraph. Strength metric was introduced by Auber et al. [2] to quantify theneighborhood’s cohesion of a given edge and thus identifies if an edge is an intra-community or an inter-community edge within a network. It assigns nodes andedges a high value if they are connected densely to each other just as clusteringcoefficient but it takes into consideration cliques of size 4 as well.

The strength of an edge e given by ws(e) is defined as follows:

ws(e) =γ3,4(e)

γmax(e)

where γ3,4(e) is the number of cycles of size 3 or 4 the edge e belongs to,and γmax(e) is the maximum possible number of such cycles. Based on thisdefinition, we define the strength of a vertex as follows:

ws(u) =

∑e∈adj(u) ws(e)

deg(u)

8

where adj(u) is the set of edges adjacent to u and deg(u) is the degree of vertexu. The time complexity to calculate the strength metric over all vertices (V)and edges (E) is O(|E| · (degmax)2) where degmax is the maximum degree ofthe graph but in our case, since the maximum degree is bounded by a constantfactor, the calculation remains constant in linear time in terms of number ofedges in the graph.

Figure 3 shows the result of calculating strength metric and applying a colorencoding on nodes and vertices of the graph. The scale depends on the intervalof the node values. We use the tulip plugin for color mapping [1]. The userdefines a list of colors. The first one is given to nodes having the highest values,the last one to nodes having the lowest one. A linear interpolation betweenconsecutive colors of the list is used for the other values. In our example, nodesand edges in Red highlight the potential rare events in the figure (Figure 3 showsthe result and Figure 4 shows the color scale).

Figure 3: Computing Strength metric [2] in step 6 and visual encoding it onnodes/edges, from blue (low values) to red (high values). The histogram showsthe frequency distribution of different strength values. The rare events appearsin red at the bottom left of the graph.

We prefer strength metric over clustering coefficient because triads are morefrequently present in these data sets as compared to cliques of size 4. If we useclustering coefficient as a metric to detect densely connected group of nodes,we will detect a very large number of true negatives. This will slow down theinteractive detection process at the end as more manual verification would berequired by domain experts. We did not use metrics to calculate cliques of size5 and above because such a calculation would become too slow for large datasets as discussed in detail in the literature [9].

The final step is to visually detect the presence of rare events which form acluster in the graph. We plot the histogram of strength of nodes along with theirfrequencies (as shown in figure 3) which immediately reveals that a large numberof nodes have very low strength value. We interactively remove nodes having

9

Figure 4: Color mapping plugin of Tulip [1]: the user defines a list of colors.The first one is given to nodes having the highest values, the last one to nodeshaving the lowest one. A linear interpolation between consecutive colors of thelist is used for the other values.

Patient Total Nodes Nodes After Removal Pathologic FalseNodes after after of disconnected Cells Positives

Step 3 Step 6 nodes

P1 1,895,261 126,049 33 31 25 6P2 2,524,916 138,647 80 22 15 7P3 2,470,042 182,626 20 10 7 3

Table 2: Number of Nodes after Different Filtration Steps of Detection Process.

low strength values from the graph using histograms of frequency distribution(as shown in figure 3) eventually leaving us with only a few nodes with highstrength value as shown in figure 5.

Domain experts then manually remove true negatives and obtain the re-quired rare events which is a set of densely connected nodes having high sim-ilarity to each other as shown in figure 6. High similarity of nodes is depictedwith color encoding where color Red refers to high similarity and a gradualdegradation to Blue color which refers to low similarity among pair of nodes.

4 Case Studies and Prototype

The proposed method has been implemented on Tulip [1] (see video to view howthe software helps to detect rare events for Patient 1). A comparative study ofdifferent graph drawing softwares clearly shows that tulip scales well to rendergraphs and networks with hundreds of thousands of nodes and edges [11], thusmaking it the ideal platform to implement the proposed method.

Figures 7, 8 and 9 show the results obtained for the three data sets presentedin Table 1. The results for the three data sets when compared to the availableground truth, clearly demonstrate that the proposed method successfully found

10

Figure 5: Interactive and Manual Removal of nodes in step 7, selecting nodeswith low strength values in the histogram. Nodes with high strength could notbe removed using histograms.

Figure 6: Resultant set of rare events after step 7 of the proposed method.

the pathologic cells within the data set provided to us by doctors. In all thethree cases, 3 to 7 false positives were also identified along with the groundtruth, which is an acceptable result since none of the true positives were missedby the proposed method. Table 2 shows the the number of nodes filtered atdifferent stages of the process as the proposed method demonstrates a very highaccuracy. The number of false positives detected are negligible as compared tothe size of the dataset provided.

11

Figure 7: Patient 1: Highlighted region shows the pathologic cells found usingthe proposed method.


5 Conclusion

In this paper, we have proposed a novel method to detect group of rare events inlarge networks. The method is based on directed networks obtained using KNNfor filtering large datasets, strength metric on edges to detect close communitystructures and visual encoding of strength metric to detect densely connectedgroup of rare events. Results on real world biological dataset shows the perfor-mance of the proposed method and the results clearly exhibit the superiority inaccuracy of the proposed method. Further more, the algorithm is highly efficientin terms of time complexity once we have calculated the k nearest neighbors.

We intend to explore this method on other real world data sets, notably

12


social networks, where detection of group of outliers and rare events has foundmany applications.

Eniko Szekely is a Post doctoral researcher at LIRMM, University of Mont-pellier 2, France. She has a PhD from University of Geneva, Switzerland andher research interests include Data Mining, Machine Learning, Visualizationand Information Retrieval. She can be reached at [email protected].

Arnaud Sallaberry is an Assistant Professor at the University of Montpellier3, France. His research interests revolve around graph visualization, visualdata mining, interactive visualizations and network analysis. He has been apostdoctoral researcher at the University of California at Davis and holds aPhD in Computer Science from University of Bordeaux 1, France. He can bereached at [email protected].

Faraz Zaidi is currently a postdoctoral researcher at the University of Lau-sanne, Switzerland. He holds a permanent position as an Assistant Professorat the Karachi Institute of Economics and Technology, Karachi, Pakistan. Hisresearch interests include data mining, information visualization, social networkanalysis, graphs and algorithms. He has a PhD in Computer Science from Uni-versity of Bordeaux 1, France. He can be reached at [email protected].

Pascal Poncelet Pascal Poncelet is a Full Professor at University Montpellier2, France and head of the data mining research group in the LIRMM Laboratory.His research interest can be summarized as advanced data analysis techniquesfor emerging applications. He is currently interested in various techniques ofdata mining and on the definition of new algorithms for mining patterns. Hecan be reached at [email protected].

13

References

[1] David Auber. Tulip - a huge graph visualization framework. In PetraMutzel and Mickael Junger, editors, Graph Drawing Software, Mathematicsand Visualization Series. Springer Verlag, 2003.

[2] David Auber, Yves Chiricota, Fabien Jourdan, and Guy Melancon. Mul-tiscale visualization of small world networks. In Proceedings of the IEEESymposium on Information Visualization (InfoVis), pages 75–81, 2003.

[3] Albert D. Donnenberg and Vera S. Donnenberg. Rare-event analysis inflow cytometry. Clinics in laboratory medicine, 27(3):627–652, 2007.

[4] Albert D. Donnenberg, Vera S. Donnenberg, and Gillian Byrne. Rapiddata handling in flow cytometric rare event analysis. Biotech International,21(1), 2009.

[5] Evelyn Fix and J.L. Hodges. Discriminatory analysis, nonparametric dis-crimination: Consistency properties. Technical Report 4, USAF School ofAviation Medicine, Randolph Field, Texas, 1951.

[6] Stefan Hachul and Michael Junger. Drawing large graphs with a potential-field-based multilevel algorithm. In Proceedings of the 12th InternationalSymposium on Graph Drawing (GD 2004), volume 3383 of Lecture Notesin Computer Science, pages 285–295. Springer, 2005.

[7] Stefan Hachul and Michael Junger. An experimental comparison of fastalgorithms for drawing general large graphs. In Proceedings of the 13thInternational Symposium on Graph Drawing (GD 2005), volume 3843 ofLecture Notes in Computer Science, pages 235–250. Springer, 2006.

[8] Enrico Lugli, Mario Roederer, and Andrea Cossarizza. Data analysis inflow cytometry: the future just started. Cytometry Part A, 77(7):705–713,2010.

[9] Guy Melancon and Arnaud Sallaberry. Edge metrics for visual graph ana-lytics: A comparative study. In Proceedings of the 12th International Con-ference Information Visualisation (IV), pages 610–615. IEEE ComputerSociety, 2008.

[10] Erika A. O’Donnell, David N. Ernst, and Ravi Hingorani. Multiparameterflow cytometry: Advances in high resolution analysis. Immune network,13(2):43–54, 2013.

[11] Bruno Pinaud and Pascale Kuntz. GVSR: an on-line guide for choosinga graph visualization software. In Proceedings of the 18th InternationalSymposium on Graph Drawing (GD 2010), volume 6502 of Lecture Notesin Computer Science, pages 400–401. Springer, 2011.

14

a graph-based method to detect rare events: an application to...

Documents