


A general process mining framework for correlating, predicting and clustering dynamic behavior based on event logs

Massimiliano de Leoni a, Wil M.P. van der Aalst a, Marcus Dees b

a Department of Mathematics and Computer Science, Eindhoven University of Technology, Eindhoven, The Netherlands
b Uitvoeringsinstituut Werknemersverzekeringen (UWV), The Netherlands


Keywords: Process mining; Decision and regression trees; Event-log manipulation; Event-log clustering




Abstract

Process mining can be viewed as the missing link between model-based process analysis and data-oriented analysis techniques. The lion's share of process mining research has been focusing on process discovery (creating process models from raw data) and replay techniques to check conformance and analyze bottlenecks. These techniques have helped organizations to address compliance and performance problems. However, for a more refined analysis, it is essential to correlate different process characteristics. For example, do deviations from the normative process cause additional delays and costs? Are rejected cases handled differently in the initial phases of the process? What is the influence of a doctor's experience on the treatment process? These and other questions may involve process characteristics related to different perspectives (control-flow, data-flow, time, organization, cost, compliance, etc.). Specific questions (e.g., predicting the remaining processing time) have been investigated before, but a generic approach was missing thus far. The proposed framework unifies a number of approaches for correlation analysis proposed in the literature, providing a general solution that can perform those analyses and many more. The approach has been implemented in ProM and combines process and data mining techniques. In this paper, we also demonstrate the applicability using a case study conducted with the UWV (Employee Insurance Agency), one of the largest "administrative factories" in The Netherlands.


1. Introduction

Process Aware Information Systems (PAISs) are increasingly used by organizations to support their businesses. Some of these systems are driven by process models, e.g., Business Process Management (BPM) and Workflow Management (WFM) systems. However, in most PAISs only an implicit process notion exists. Consider, for example, enterprise software such as SAP and Oracle that is widely used to manage business operations and customer relations. Although these systems provide BPM/WFM functionality, most processes are partly hard-coded in application software and exist only in the minds of the people using the software. This provides flexibility and simplifies implementation efforts, but also results in poor management support. If there is no explicit process model that reflects reality, it is impossible to reason about compliance and performance in a precise and unified manner. Management dashboards provided by Business Intelligence (BI) software tend to oversimplify reality and do not use explicit process models. Fortunately, all these systems record the execution of process instances in so-called event logs. These logs thus capture information about the activities performed. Each event records the execution of an activity instance by a given resource at a certain point in time, along with the output produced.

Event logs are the key enabler for process mining, which is capable of extracting, from event logs, in-depth insights into the process-related problems that contemporary enterprises face [1]. Through the application of process mining, organizations can discover the processes as they were conducted in reality, check whether certain practices and regulations were really followed, and gain insight into bottlenecks, resource utilization, and other performance-related aspects of processes.

Process mining often starts with process discovery, i.e., automatically learning process models based on raw event data. Once there is a process model (discovered or made by hand), the events can be replayed on the model to check conformance and to uncover bottlenecks in the process [1]. However, such analyses are often only the starting point for providing initial insights. When discovering a bottleneck or a frequent deviation, one would like to understand why it exists. This requires the correlation of different process characteristics. These characteristics can be based on:


• the control-flow perspective (e.g., the next activity going to be performed);
• the data-flow perspective (e.g., the age of the patient or the amount of glucose in a blood sample);
• the time perspective (e.g., the activity duration or the remaining time to the end of the process);
• the resource/organization perspective (e.g., the resource going to perform a particular activity or the current workload); or,
• if a normative process model exists, the conformance perspective (e.g., the skipping of a mandatory activity or executing two activities in the wrong order).

1 See the FeaturePrediction package available in ProM 6.4 or newer, downloadable from http://www.promtools.org.

There are of course other perspectives possible (e.g., the cost perspective), but these are often not process-specific and can be easily encoded in the data-flow perspective. For example, there may be data attributes capturing the variable and fixed costs of an activity.

These problems are specific instances of a more general problem, which is concerned with relating any process or event characteristic to other characteristics associated with single events or the entire process. This paper proposes a framework to solve the more general correlation problem and provides a very powerful tool that unifies the numerous ad hoc approaches described in the literature. This is achieved by providing (1) a broad and extendable set of characteristics related to control-flow, data-flow, time, resources, organizations and conformance, and (2) a generic framework where any characteristic (dependent variable) can be explained in terms of correlations with any set of other characteristics (independent variables). For instance, the involvement of a particular resource or routing decision can be related to the elapsed time, but also the other way around: the elapsed time can be related to resource behavior or routing.

The approach is fully supported by a new package that has been added to the open-source process mining framework ProM.1 The evaluation of our approach is based on two case studies involving UWV, a Dutch governmental institute in charge of supplying benefits. For the first case study, the framework allows us to successfully answer process-related questions related to the causes of observed problems within UWV (e.g., reclamations of customers). For some problems, we could show surprising root causes. For other problems, we could only show that some suspected correlations were not present, thus providing novel insights. A second case study was concerned with discovering the process model that describes UWV's management of the provision of benefits for citizens who are unemployed and unable to look for a job for a relatively long period of time because of physical or mental illnesses. Analysis shows that there is a lot of variability in process execution, which is strongly related to the length of the benefit's provision. Therefore, the discovery of one single process model led to unsatisfactory results, as this variability cannot be captured in a single process model. After splitting the event log into clusters based on the distinguishing features discovered through our approach, the results significantly improved.

It is important to note that our framework does not enable analyses that previously were not possible. The novelty of our framework lies in providing a single environment where a broad range of process-centric analyses can be performed much more quickly, without requiring process analysts to have a solid technical background. Without the framework, for instance, process analysts are confronted with database systems where they need to perform tedious operations to import data from external sources and, later, to carefully design complex SQL queries to perform joins and even self-joins that involve different database tables and views.

The remainder of the paper is organized as follows. Section 2 presents the overall framework proposed in this paper. The section also discusses how the problem of relating business process characteristics can be formulated and solved as a data-mining problem. In particular, we leverage data-mining methods to construct a decision or regression tree. In the remainder, we refer to them as prediction trees if there is no reason to make a distinction. Our framework relies on the availability of key process characteristics. Therefore, Section 3 classifies and provides examples of these characteristics, along with showing how to extract them from raw event data. Section 4 presents event-selection filters to determine the instances of the learning problem. In Section 5 the filters and the process characteristics are used to provide an overview of the wide range of questions that can be answered using our framework. Several well-studied problems turn out to be specific instances of the more general problem considered in this paper. Section 6 discusses the notion of clustering parts of the log based on prediction trees. Section 7 illustrates the implementation of the framework as well as two case studies that have benefitted from the application of the framework. Finally, Section 8 positions this work with respect to the state of the art and Section 9 concludes the paper.


Fig. 1. The general framework proposed in this paper: based on an analysis use case, the event log is preprocessed and used as input for classification. Based on the analysis result, the use case can be adapted to gather additional insights. The part in red (or gray if printed in gray scale) is optional. The result of the analysis use case provides a classification that can be used to split the event log into clusters: each leaf of the tree becomes a different event log. Each sublog can be used by different process mining techniques (including, but not only, process discovery). The five steps depicted are: (1) Define Analysis Use Case, (2) Manipulate and Enrich Event Log, (3) Make Analysis, (4) Cluster, and (5) Process Discovery.

2 Several definitions of workload are possible. Here and later in the paper, the workload of a resource r at time t is interpreted as the amount of work that is assigned to r at time t [2]. This accounts for activities that r is executing or that have been scheduled to r for later performance. The workload of a set of resources (e.g., within a department) is the sum of the workload of all resources in the set. Other workload definitions are possible, as described in [2].



2. A framework for correlating, predicting and clustering dynamic behavior based on event logs

This section introduces the framework reported on in this paper. To facilitate the understandability of this framework, Section 2.1 first provides a general overview, followed by Section 2.2, which provides a formalization of the framework. Section 2.3 then shows how the problem of performing an analysis use case can be transformed into a data-mining problem.

2.1. Overview

A high-level overview of the framework proposed in this paper is shown in Fig. 1. The starting point is an event log. For each process instance (i.e., case) there is a trace, i.e., a sequence of events. Events have different characteristics. Mandatory characteristics are activity and timestamp. Other standard characteristics are the resource used to perform the activity, transactional information (start, complete, suspend, resume, etc.), and costs. However, any other characteristic can be associated with an activity (e.g., the age of a patient or the size of an order). Characteristics are attached to events as name-value pairs: (name_of_characteristic, value). As mentioned in Section 1 and illustrated in Fig. 2(a), these characteristics can focus on the control-flow perspective, the data-flow perspective, the resource/organization perspective, the time perspective, and the conformance perspective.

Starting from the event log, a process analyst can define a so-called analysis use case, which requires the selection of a dependent characteristic, the independent characteristics and an event filter describing which events to retain for the analysis.

Sometimes, such an analysis requires the incorporation of characteristics that are not (yet) available in the event log. These characteristics can be added by deriving their values through computations over the event log or from external information sources. This motivates the need for the second step of our framework, where the event log is manipulated and enriched with additional characteristics. As an example, the remaining time until case completion can be added to the event characteristics by simply comparing the time of the current event with the time of the last event for the same case. Another example is concerned with adding the workload of a resource to any event occurring at time t by scanning the event log and calculating the set of activity instances that are being executed by or are queued for the resource at time t.2


Fig. 2. Overview of event characteristics. (a) Perspectives to which event characteristics refer: control-flow, data-flow, resource/organization, time, conformance, and context. (b) Context of a selected event: an event belongs to a case, a case belongs to a process, an event was done by a resource, a resource belongs to an organization, and there is external context unrelated to the process and organizations (e.g., weather).


Fig. 2(b) provides an overview of the sources of information that can be used to add information to events. Event characteristics can be derived from other events of the same process instance, other instances of the same process, or instances of different processes, but also from events executed by the same resource or by resources belonging to the same organizational entity or to related organizational entities. All of the above classes of event characteristics are solely based on the event log. However, event characteristics can also be extracted through cross-correlations with external data sources, as shown in Fig. 1. We may use a process model (e.g., for conformance checking) or context data (such as the weather or the stock market). For example, we may compute conformance metrics like fitness or precision and attach these to events. For instance, certain resources may be "meteoropathic": the level of their performance is influenced by weather conditions. It is worth analyzing whether, indeed, certain resources are more efficient when the weather is better. If we augment events with the weather conditions when they occurred, we can use the augmented log and check whether the efficiency of single resources is somehow related to the weather conditions. This can be done by using the activity duration or resource workload as a dependent characteristic and the weather as one of the independent ones. If the characteristic occurs in the prediction tree, then there is some correlation between efficiency and weather. Some event characteristics (like weather information) cannot be derived from the event log. In general, the process context is acknowledged by many authors to be a huge source of valuable information to find correlations [3–5].

Once the analysis use case is defined and the event log is enriched, the analysis can be performed. The outcome of performing the analysis is the generation of a decision or a regression tree, which aims to explain the values of the dependent characteristic as a function of the independent characteristics.

The analysis result obtained after the third step of the framework can be used for multiple purposes. For instance, the results can be used to derive which conditions may have a negative impact on certain KPIs; as an example, when certain process participants or entire company branches are involved in the execution of a given process, the customer's satisfaction is lower. As a result, corrective actions can be put in place to avoid those conditions (e.g., the low-performing branches can be closed or the unsatisfactory process participants can be motivated, moved to different tasks or, as an extreme solution, fired). The results can also be used at run-time to provide decision support. For instance, suppose that certain conditions yield low values for certain KPIs. While executing the process, participants can be advised to avoid certain decisions that have been observed to likely lead to those conditions. However, the production of the results is always done a posteriori, using an event log that records past executions of process instances. If new process instances conclude, the result is not automatically updated. Instead, the analysis needs to be repeated from scratch.

In addition to the three steps mentioned earlier, the general framework also envisions an optional fourth step that enables the clustering of traces of event logs on the basis of the correlation analysis results; see the red part (or gray if printed in gray scale) in Fig. 1. In many settings, event-log clustering is an important preprocessing step for process mining. By grouping similar log traces together, it may be possible to construct partial process models that are easier to understand. If the event log shows a lot of behavioral variability, the process model discovered for all traces together is too complex to comprehend; in those cases, it makes sense to split the event log into clusters and apply discovery techniques to the log clusters, thus obtaining a simpler model for each cluster. This can be combined with process cubes where event data are stored in a multi-dimensional space [6]. Prediction trees provide a simple and visual way to perform such a clustering: each leaf is associated with a number of event-log traces. Therefore, a cluster is constructed for each leaf.

2.2. Formalization

As mentioned in Section 2.1, the main input of our framework is an event log. An event can have any number of attributes, here called characteristics. A trace is a sequence of events and an event log is a collection of traces.

Definition 1 (Events, traces and log). Let $\mathcal{C}$ and $\mathcal{U}$ be the universe of characteristics and the universe of possible values, respectively. An event $e$ is an assignment of values to characteristics, i.e., $e \in \mathcal{C} \not\to \mathcal{U}$. In the remainder, $\mathcal{E} = \mathcal{C} \not\to \mathcal{U}$ is the universe of events. A trace $t \in \mathcal{E}^*$ is a sequence of events. Let $\mathcal{T} = \mathcal{E}^*$ be the universe of traces. An event log $L$ is a multiset of traces, i.e., $L \in \mathbb{B}(\mathcal{T})$.3

For each characteristic $c \in \mathcal{C}$, $type(c) \subseteq \mathcal{U}$ denotes the set of possible values. We use a special value $\bot$ for any characteristic $c$ to which an event $e$ does not assign a value, i.e., $e(c) = \bot$ if $c \notin dom(e)$. This way, $e \in \mathcal{E}$ can be used as a total function rather than a partial one. Note that $\bot \notin \mathcal{U}$.

Typically, an event refers to an activity that is performed within a certain case (i.e., process instance) by a resource at a given timestamp. In our framework, these are treated as ordinary event characteristics. We assume characteristics such as Activity, Case, Resource, and Timestamp but do not treat them differently. No two events can have exactly the same values for all characteristics. This is not a limitation: one can always introduce an event identifier as an additional characteristic to ensure uniqueness.
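To make the formalization concrete, the following minimal Python sketch shows one possible in-memory representation of events, traces and logs (an illustration under our own assumptions, not the ProM data model; the characteristic names and values are taken from the hospital example used later in Table 1):

```python
from datetime import datetime

# An event is a (partial) assignment of values to characteristics, here a dict.
# A trace is a sequence (list) of events; an event log is a collection of traces.
log = [
    [  # one trace = one case
        {"Activity": "Preoperative Screening", "Timestamp": datetime(2011, 12, 1, 11),
         "Resource": "Giuseppe", "Cost": 350},
        {"Activity": "Laparoscopic Gastrectomy", "Timestamp": datetime(2011, 12, 2, 15),
         "Resource": "Simon", "Cost": 500},
    ],
]

def value(event, characteristic):
    """Total-function view of an event: None plays the role of the special value ⊥
    for characteristics to which the event does not assign a value."""
    return event.get(characteristic)
```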

Our framework aims to support so-called analysis use cases.

Definition 2 (Analysis use case). An analysis use case is a triple $A = (c_r, C_d, F)$ consisting of:

• a dependent characteristic $c_r \in \mathcal{C} \setminus C_d$,
• a set $C_d \subseteq \mathcal{C} \setminus \{c_r\}$ of independent characteristics, and
• an event-selection filter $F \subseteq \mathcal{E}$, which selects the events that are retained for the analysis.

The definition of analysis use cases specializes standard definitions of supervised learning [7] for the process-mining case. The specialization lies in the fact that the instances that are used for learning are a subset of the events in an event log. As a consequence, the dependent and independent characteristics are linked to the process attributes that are recorded in the event log.

The result of performing an analysis use case is a decision or a regression tree that relates the independent characteristics (also called the predictor or explanatory variables) to the dependent characteristic (also called the response variable). F selects the events for which we would like to make a prediction. Sometimes we select all events, sometimes only the last one in each case, and at other times all events corresponding to a particular activity or resource. The events selected through F correspond to the instances used to construct the decision or regression tree. If we would like to predict the remaining processing time (the dependent characteristic), then we select all events and may use independent characteristics such as the resource that worked on it, the last activity, the current workload, and the type of customer.
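As an illustration only (using the representation sketched above and assuming the filter is encoded as a predicate over events rather than an explicit subset of $\mathcal{E}$), an analysis use case can be captured as a simple triple:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class AnalysisUseCase:
    dependent: str                         # c_r, e.g. the remaining processing time
    independents: List[str]                # C_d, the explanatory characteristics
    event_filter: Callable[[dict], bool]   # F, encoded here as a predicate over events

# Hypothetical example: predict the remaining time using all events (filter EF1).
use_case = AnalysisUseCase(
    dependent="RemainingTime",
    independents=["Resource", "Activity", "Workload", "CustomerType"],
    event_filter=lambda event: True,
)
```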

3 Given any set $X$, $\mathbb{B}(X)$ denotes the set of all possible multisets over $X$.


As discussed in Section 2.1, many analysis use cases require the presence of dependent and/or independent characteristics that are not readily available in the event log. Similarly, using business domain knowledge, an analyst may want to verify a reasonable hypothesis about the existence of a correlation with a given set of independent characteristics, which may not be explicitly available in the event log.

Step 2 in Fig. 1 aims to enrich the event log with valuable information required for the analysis use cases we are interested in. We provide a powerful framework to manipulate event logs and obtain a new event log that suits the specific analysis use case, e.g., events are enriched with additional characteristics.

Definition 3 (Trace and log manipulation). Let $\mathcal{T}$ be the universe of traces and let $L \in \mathbb{B}(\mathcal{T})$ be an event log. A trace manipulation is a function $\delta_L \in \mathcal{T} \to \mathcal{T}$ such that for any $t \in L$: $|t| = |\delta_L(t)|$.

Hence, given a trace $t \in L$, $\delta_L(t)$ yields a new trace of the same length. A trace manipulation function can only add event characteristics, without removing or adding events.

$\delta_L(t)$ not only depends on $t$ but may also depend on other traces in $L$. For example, to compute a resource's workload, we need to look at all traces. $\delta_L(t)$ may even depend on any other contextual information (although this is not formalized here). For example, we may consider multiple processes or multiple organizations at the same time and even include weather information.

By applying $\delta_L$ to all traces in $L$, we obtain a new log $L' = [\delta_L(t) \mid t \in L]$. Recall that, technically, a log is a multiset of traces.
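A minimal sketch of such a manipulation, using the list-based representation introduced earlier (the added characteristic is a hypothetical example; real manipulations such as workload would inspect the whole log, which is why the log is passed along):

```python
def delta_position(trace, log):
    """A trace manipulation: return a trace of the same length in which every
    event is copied and extended with one extra characteristic."""
    return [{**event, "PositionInTrace": i + 1} for i, event in enumerate(trace)]

def enrich_log(log, manipulation):
    """Apply a trace manipulation to every trace: L' = [delta_L(t) | t in L]."""
    return [manipulation(trace, log) for trace in log]

enriched = enrich_log(log, delta_position)  # |t| == |delta_L(t)| for every trace t
```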

After applying various log manipulations, we obtain an enriched event log where an event may have any number of event characteristics. Then we apply the desired analysis use case $(c_r, C_d, F)$. Each selected event corresponds to an instance of the supervised learning problem where the dependent characteristic $c_r$ is explained in terms of the independent characteristics $C_d$. Many analysis use cases require considering only a subset of all the log's events. For example, if we aim at relating the total case duration to the number of times a particular resource was involved in it, we want to consider each case as a whole and, hence, select the last event of each trace of the enriched event log. In this case the number of instances used to learn a tree classifier is the same as the number of cases in the event log. If we would like to analyze a decision in the process, then the number of instances should equal the number of times the decision is made. Hence, F should select the events corresponding to the decision we are interested in.

The resulting tree produced by Step 3 in Fig. 1 may provide valuable information regarding performance (e.g., time, costs, and quality) and conformance. However, as Fig. 1 shows, it is also possible to use the analysis result to cluster traces and create smaller event logs. These so-called sublogs can be used for any type of process mining. For example, we may compare process models learned for the different classes of traces identified in the decision tree.

In addition, the analysis result can be given as input to split the traces of the event log into groups in order to, e.g., construct partial process models that are easier to understand. Trace clustering can be defined as follows4:

Definition 4 (Trace clustering). Let $\mathcal{T}$ be the universe of traces. A trace clustering based on an analysis use case $A = (c_r, C_d, F)$ is a function $\lambda_A \in \mathbb{B}(\mathcal{T}) \to \mathbb{B}(\mathbb{B}(\mathcal{T}))$ such that for any $L \in \mathbb{B}(\mathcal{T})$: $L \supseteq \biguplus_{L' \in \lambda_A(L)} L'$ and, for any $L', L'' \in \lambda_A(L)$, if $t \in L'$ and $t \in L''$, then $L' = L''$.

Function $\lambda_A$ associates a multiset of trace clusters to an event log, where each cluster is again a multiset of traces (i.e., a sublog). Trace clustering cannot generate new traces and a trace can only belong to one cluster.

Note that $\lambda_A$ does not necessarily partition the traces in the log, e.g., traces can be considered as outliers and, hence, discarded from all of the clusters.

Step 4 in Fig. 1 provides the possibility to conduct comparative process mining [6], i.e., comparing different variants of the process, different groups of cases, or different organizational units.
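A sketch of how such a clustering can be obtained from a prediction tree, assuming a scikit-learn DecisionTreeClassifier as a stand-in for the tree learner and a user-supplied function that returns the feature vector of the event selected per case (e.g., the last event under filter EF3); the names are illustrative, not the ProM API:

```python
from collections import defaultdict
from sklearn.tree import DecisionTreeClassifier

def cluster_by_leaf(log, tree: DecisionTreeClassifier, instance_of_case):
    """Group traces by the tree leaf reached by the instance selected per case.
    instance_of_case(trace) returns the feature vector of the selected event,
    or None to leave the trace unclustered (e.g., treated as an outlier)."""
    clusters = defaultdict(list)
    for trace in log:
        features = instance_of_case(trace)
        if features is None:
            continue                          # trace is not assigned to any cluster
        leaf = tree.apply([features])[0]      # index of the leaf node reached
        clusters[leaf].append(trace)
    return list(clusters.values())            # each sublog can be mined separately
```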

In the remainder of this paper, we provide further information on how to make the concepts discussed operational. Note that the approach described using Fig. 1 is very generic and also fully implemented, as extensively discussed in Section 7.

2.3. Use of tree-learning methods to perform analysis use cases

The result of performing an analysis use case is a prediction tree: a decision or regression tree. Decision and regression trees classify instances (in our case the events selected through F) by sorting them down the tree from the root to some leaf node. Each non-leaf node specifies a test of some attribute (in our case, an independent characteristic) and each branch descending from that node corresponds to a range of possible values for this attribute. Each leaf node is associated with a value of a class attribute (in our case, the dependent characteristic). A path from the root to a leaf represents a rule for classification or regression. There exist numerous algorithms to build a decision or a regression tree starting from a training set [7]. Every algorithm for constructing a prediction tree takes three input parameters: (1) a dependent variable (a.k.a. class variable) $c_r$, (2) a set of independent variables $C_d$, and (3) a multiset $I \in \mathbb{B}(\mathcal{E})$ of instances for constructing the tree. The multiset of instances $I$ is constructed starting from an event log $L$. Given the multiset $E$ of all events in $L$, i.e., $E = \biguplus_{t \in L} \biguplus_{e \in t} e$, the multiset of instances for training is computed as follows: $I = E|_F$.5 Multiset $I$ contains every instance that can be used to train and test the tree classifier; instances that have been filtered out are discarded. If cross-validation is employed, any instance is going to be used for both training and testing. Obviously, it is also possible to clearly split $I$ into a training and a test set. We do not prevent any possibility from being used, as, for a specific case, it may be worthwhile exploring both.

4 The union operator $\uplus$ is defined over multisets: given two multisets $A$ and $B$, $A \uplus B$ is a multiset that contains all elements of $A$ and $B$; the cardinality of each element $e \in A \uplus B$ is the sum of the cardinalities of $e$ in $A$ and $B$.

5 The operator $\cdot|_F$ is defined as follows: given a multiset $E$ and a set $F$, $E|_F$ contains all elements that occur in both $E$ and $F$, and the cardinality of each element $e \in E|_F$ is equal to the cardinality of $e$ in $E$.



Our framework is agnostic with respect to the specific algorithms used to construct prediction trees. However, for a concrete implementation, we opted for the C4.5 algorithm to construct decision trees [7] and the REPTree algorithm to construct regression trees [8]. Decision trees require the dependent variable to be nominal. If it is not, the domain of the class attribute needs to be discretized. The literature provides several ways to discretize dependent variables. While any discretization technique can be employed, our implementation provides two specific ones: equal-width binning and equal-frequency binning [9].
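As an illustration of this step (a sketch under our own assumptions: scikit-learn's CART classifier is used as a stand-in for C4.5/REPTree, the instances are given as toy rows, and the numeric dependent characteristic is discretized with equal-width binning):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Instances I = E|_F: one row per selected event of the enriched log (toy data).
frame = pd.DataFrame([
    {"Activity": "Screening",   "Resource": "Giuseppe", "RemainingTime": 3.9},
    {"Activity": "Gastrectomy", "Resource": "Simon",    "RemainingTime": 2.7},
    {"Activity": "Nursing",     "Resource": "Clare",    "RemainingTime": 2.1},
    {"Activity": "Nursing",     "Resource": "Paul",     "RemainingTime": 0.0},
])

# Dependent characteristic c_r: equal-width binning into two classes.
y = pd.cut(frame["RemainingTime"], bins=2, labels=["short", "long"])

# Independent characteristics C_d: one-hot encode the nominal ones.
X = pd.get_dummies(frame[["Activity", "Resource"]])

# Learn the prediction tree.
tree = DecisionTreeClassifier(min_samples_leaf=1).fit(X, y)
```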

3. Extracting relevant process characteristics

This section focuses on Step 2 in Fig. 1 and provides an overview of different manipulations to enrich the event log such that the required dependent and independent variables are present for building the desired decision or regression tree.

Remember that, formally, a trace manipulation is a function $\delta_L \in \mathcal{T} \to \mathcal{T}$ transforming one trace into another while possibly using the broader context shown in Fig. 2(b). The trace manipulations described in this section always add one new event characteristic to each event in the event log. Recall that for any $t \in L$: $|t| = |\delta_L(t)|$, i.e., no events are added or removed.

The remainder of this section focuses on showing examples of manipulations for the different perspectives of Fig. 2(a). The lists of manipulations are far from complete since the number of potential manipulations is infinite. We limit ourselves to discussing the manipulations that are available in the implementation at the time of writing. However, the framework is generic: one can easily plug in new manipulations. In fact, the implementation is also such that one can easily encode new manipulations, as will be discussed in Section 7.

In the remainder of Section 3, we also provide formal definitions for some of the provided examples of the manipulations. For this aim, we need to introduce some helper functions and operators that are going to be used in the definitions.

Helper functions and operators: As mentioned in Section 2.2, each event $e$ is considered unique. Therefore, given an event $e$, we can define functions $prefix(e)$ and $postfix(e) \in \mathcal{E}^*$ that return the sequence of events that respectively precede and follow event $e$ in the same trace. Events are aggregations of process characteristics. Given an event $e$, $e(c)$ returns the value of characteristic $c$ for event $e$ if it exists. If no value exists, the special value $\bot \notin \mathcal{U}$ is returned. Given an event $e$ and a process characteristic as a pair $(c, u) \in \mathcal{C} \times (\mathcal{U} \cup \{\bot\})$, we introduce the overriding operator $e' = e \oplus (c, u) \in \mathcal{E}$ such that $e'(c) = u$ and, for each $c' \in \mathcal{C} \setminus \{c\}$, $e'(c') = e(c')$. We also introduce the trace concatenation operator $\cdot$ as follows: given two traces $t' = \langle e'_1, \ldots, e'_n \rangle$ and $t'' = \langle e''_1, \ldots, e''_m \rangle$, $t' \cdot t'' = \langle e'_1, \ldots, e'_n, e''_1, \ldots, e''_m \rangle$. Given a trace $t \in \mathcal{E}^*$, a characteristic $c \in \mathcal{C}$ and a set of values $U \subseteq \mathcal{U}$, function $firstOcc(t, c, U)$ returns the first occurrence of any value $u \in U$ in trace $t$ for characteristic $c$.


6 To keep the notation simple, we assume here that the functions avg, min and max and the sum operator $\sum$ automatically exclude all occurrences of the value $\bot$. Also, when the functions and the operator are applied over an empty set, they return $\bot$.


It can be recursively defined as follows:

$firstOcc(\langle\rangle, c, U) = \bot$

$firstOcc(\langle e \rangle \cdot t', c, U) = \begin{cases} e(c) & \text{if } e(c) \in U \\ firstOcc(t', c, U) & \text{if } e(c) \notin U \end{cases}$

Similarly, one can define the last occurrence: $lastOcc(\langle e_1, \ldots, e_n \rangle, c, U) = firstOcc(\langle e_n, \ldots, e_1 \rangle, c, U)$.

3.1. Control-flow perspective

The control-flow perspective refers to the occurrence and ordering of activities.

Trace Manipulation CFP1x: Number of Executions of Activity x Until the Current Event. This operation counts the number of times x was executed until the current event in the trace. This event characteristic is added to all traces. Formally, given a trace $t = \langle e_1, \ldots, e_n \rangle \in \mathcal{E}^*$, $CFP1_x(t) = \langle e'_1, \ldots, e'_n \rangle \in \mathcal{E}^*$ where, for $1 \le i \le n$, $e'_i = e_i \oplus (CFP1_x, n_i)$ with $n_i = |\{e'' \in prefix(e_i) \mid e''(Activity) = x\}|$.

Trace Manipulation CFP2X: First Occurrence of Any Activity in Set X After the Current Event. This operation augments every event e with the name of the activity that belongs to set X and occurs first in the trace after e. If there is no next event or no activity in X occurs in the trace after e, the special value $\bot$ is used. Set X can technically be empty, but this would be meaningless: any event would always be augmented with the $\bot$ value. If set X contains all activities to which the log's events refer, the event is augmented with the name of the activity that follows e in the trace, or $\bot$ if the event is the last in the trace. Formally, given a trace $t = \langle e_1, \ldots, e_n \rangle \in \mathcal{E}^*$, $CFP2_X(t) = \langle e'_1, \ldots, e'_n \rangle \in \mathcal{E}^*$ where, for $1 \le i \le n$, $e'_i = e_i \oplus (CFP2_X, u)$ with $u = firstOcc(postfix(e_i), Activity, X)$.

Trace Manipulation CFP3: Previous Activity in the Trace. This operation determines the activity directly preceding the current event in the same case (i.e., the same log trace). If there is no previous event, the $\bot$ value is used. Formally, given a trace $t = \langle e_1, \ldots, e_n \rangle \in \mathcal{E}^*$, $CFP3(t) = \langle e'_1, \ldots, e'_n \rangle \in \mathcal{E}^*$ where $e'_1 = e_1 \oplus (CFP3, \bot)$ and, for $1 < i \le n$, $e'_i = e_i \oplus (CFP3, e_{i-1}(Activity))$.

Trace Manipulation CFP4: Current Activity. This operation determines the current event's activity name. Strictly speaking no manipulation is needed since this is already an event attribute. Nevertheless, we list it to be able to refer to it later.

One can think of many other control-flow trace manipulations, e.g., counting the number of times a given activity is followed by another activity.
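The following sketch illustrates CFP1x and CFP3 over the list-based representation assumed earlier (an illustration only, not the ProM implementation):

```python
def cfp1(trace, activity_x):
    """CFP1_x: number of executions of activity x before each event in the trace."""
    enriched, count = [], 0
    for event in trace:
        enriched.append({**event, f"CFP1_{activity_x}": count})
        if event.get("Activity") == activity_x:
            count += 1
    return enriched

def cfp3(trace):
    """CFP3: activity of the directly preceding event (None, i.e. ⊥, for the first event)."""
    return [{**event, "CFP3": trace[i - 1].get("Activity") if i > 0 else None}
            for i, event in enumerate(trace)]
```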

3.2. Data-flow perspective

Next we consider the data-flow perspective. Here we focus on the data attributes attached to events; these may refer to the event itself or to properties of the case the event belongs to. Let us recall that $\mathcal{U}$ indicates the universe of possible values which a characteristic can take on.


Trace Manipulation DFP1c: Latest Recorded Value of Characteristic c Before the Current Event. This operation determines the latest value of some attribute c, not considering the current event. Note that, when enriching a certain event e, there may be multiple events before e in the same trace that assign a value to c. Then, we consider the latest of these events. If the attribute was not present in any of the preceding events, the $\bot$ value is returned. As before, all events are preserved in the trace; this characteristic is added to each event. Formally, given a trace $t = \langle e_1, \ldots, e_n \rangle \in \mathcal{E}^*$, $DFP1_c(t) = \langle e'_1, \ldots, e'_n \rangle \in \mathcal{E}^*$ where, for $1 \le i \le n$, $e'_i = e_i \oplus (DFP1_c, u)$ with $u = lastOcc(prefix(e_i), c, \mathcal{U})$.

Trace Manipulation DFP2c: Latest Recorded Value of Characteristic c Until the Current Event. This operation determines the latest value of c, also considering the current event. If the current event has attribute c, then the corresponding value is the latest value. If not, the previous events for the same case are considered. Formally, given a trace $t = \langle e_1, \ldots, e_n \rangle \in \mathcal{E}^*$, $DFP2_c(t) = \langle e'_1, \ldots, e'_n \rangle \in \mathcal{E}^*$ where, for $1 \le i \le n$, $e'_i = e_i \oplus (DFP2_c, u)$ with $u = lastOcc(prefix(e_i) \cdot \langle e_i \rangle, c, \mathcal{U})$.

Trace Manipulation DFP3c: Average Value of Characteristic c Until the Current Event. This operation augments an event e with the average of the values assigned to characteristic c by e and by those events that precede e in the trace. Formally, given a trace $t = \langle e_1, \ldots, e_n \rangle \in \mathcal{E}^*$, $DFP3_c(t) = \langle e'_1, \ldots, e'_n \rangle \in \mathcal{E}^*$ where, for $1 \le i \le n$, $e'_i = e_i \oplus (DFP3_c, u)$ with $u = \mathrm{avg}_{e'' \in prefix(e_i) \cdot \langle e_i \rangle}\, e''(c)$.6

Trace Manipulation DFP4c: Maximum Value of Characteristic c Until the Current Event. This operation augments an event e with the maximum value assigned to characteristic c by e and by those events that precede e in the trace. Formally, given a trace $t = \langle e_1, \ldots, e_n \rangle \in \mathcal{E}^*$, $DFP4_c(t) = \langle e'_1, \ldots, e'_n \rangle \in \mathcal{E}^*$ where, for $1 \le i \le n$, $e'_i = e_i \oplus (DFP4_c, u)$ with $u = \max_{e'' \in prefix(e_i) \cdot \langle e_i \rangle}\, e''(c)$ (footnote 6).

Trace Manipulation DFP5c: Minimum Value of Characteristic c Until the Current Event. This operation augments an event e with the minimum value assigned to characteristic c by e and by those events that precede e in the trace. Formally, given a trace $t = \langle e_1, \ldots, e_n \rangle \in \mathcal{E}^*$, $DFP5_c(t) = \langle e'_1, \ldots, e'_n \rangle \in \mathcal{E}^*$ where, for $1 \le i \le n$, $e'_i = e_i \oplus (DFP5_c, u)$ with $u = \min_{e'' \in prefix(e_i) \cdot \langle e_i \rangle}\, e''(c)$ (footnote 6).

Trace Manipulation DFP6c: Sum of the Values of Characteristic c Until the Current Event. This operation augments an event e with the sum of the values assigned to characteristic c by e and by those events that precede e in the trace. Formally, given a trace $t = \langle e_1, \ldots, e_n \rangle \in \mathcal{E}^*$, $DFP6_c(t) = \langle e'_1, \ldots, e'_n \rangle \in \mathcal{E}^*$ where, for $1 \le i \le n$, $e'_i = e_i \oplus (DFP6_c, u)$ with $u = \sum_{e'' \in prefix(e_i) \cdot \langle e_i \rangle} e''(c)$ (footnote 6).

Note that, of course, manipulations $DFP3_c, \ldots, DFP6_c$ are only applicable if c is numerical, such as a timestamp, a cost or the customer's age.
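A sketch of two of these data-flow manipulations (DFP2c and DFP6c), using the same assumed representation; as in footnote 6, missing values (⊥, here None) are skipped and ⊥ is returned when no value has been observed yet:

```python
def dfp2(trace, c):
    """DFP2_c: latest recorded value of characteristic c up to and including the current event."""
    enriched, latest = [], None
    for event in trace:
        if event.get(c) is not None:
            latest = event[c]
        enriched.append({**event, f"DFP2_{c}": latest})
    return enriched

def dfp6(trace, c):
    """DFP6_c: running sum of the values of numeric characteristic c up to and including
    the current event (None if no value has been observed yet)."""
    enriched, total = [], None
    for event in trace:
        value = event.get(c)
        if value is not None:
            total = value if total is None else total + value
        enriched.append({**event, f"DFP6_{c}": total})
    return enriched
```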


7 Experience shows this assumption holds in the large majority of processes due to the fact that information systems rarely support the possibility of concurrently starting multiple instances of the same activity within the same process instance.


3.3. Resource and organization perspective

The resource and organization perspective focuses on the resource executing the event or the corresponding activity instance. Resources may be grouped in different organizational entities (groups, roles, positions, departments, etc.). For the manipulations listed below we do not provide a formal definition because it would not add much to the textual description, unless the workload calculations were explained in detail, which is out of the scope of this paper. To compute the total workload or that of a single resource, we employ the approach discussed in [2].

Trace Manipulation ROP1: Workload per Resource. Events may be related to resources. By scanning the event log one can get an indication of the amount of work performed in a particular period. Moreover, one can also see waiting work items that are later executed by a particular resource. The resource's workload at the time the event occurs can be attached to the event.

Trace Manipulation ROP2: Total Workload. The previous event characteristic is related to the particular resource performing the event. However, we can also take the sum over all running or queuing work items. For example, we can simply count the number of activities in-between a start and a complete event.

Trace Manipulation ROP3: Current Resource. This operation determines the current event's resource (if any). Strictly speaking no manipulation is needed, because it is already part of the event. However, we list it to be able to refer to it later.

Trace Manipulation ROP4: Current Role. This operation determines the role of the current event. This can already be part of the event; in that case, no manipulation is needed. If not, it can be derived by cross-correlating organizational information, stored, e.g., in a database, with the current resource.

Many other resource-related event characteristics are possible: many ways exist to measure workload, utilization, queue lengths, response times, and other performance-related properties. Information may be derived from a single process or from multiple processes.

3.4. Time perspective

The time perspective refers to various types of durations (service times, flow times, waiting times, etc.) in processes. Clearly the resource/organization perspective and the time perspective are closely related.

Some of the manipulations for the resource/organization perspective have already exploited transactional information in the event log. For example, to measure the number of running activity instances, we need to correlate the start events and the complete events. Schedule, start, complete, suspend, resume, etc. are examples of possible transaction types [1]. Each event has such a transaction type (the default is complete). An activity instance corresponds to a set of events. For one activity instance there may be a schedule event, a start event, a suspend event, a resume event, and finally a complete event. The duration of an activity instance is the time between the start event and the complete event.


Trace Manipulation TIP1: Activity Duration. This characteristic adds information on activity instance durations to complete events. We can measure the duration of an activity instance by subtracting the timestamp of the start event from that of the complete event, under the assumption that never more than one instance of the given activity is concurrently being executed.7 Formally, given a trace $t = \langle e_1, \ldots, e_n \rangle \in \mathcal{E}^*$, $TIP1(t) = \langle e'_1, \ldots, e'_n \rangle \in \mathcal{E}^*$ where, for $1 \le i \le n$:

• if $e_i(Transition) \ne$ "Complete", then $e'_i = e_i$;
• if $e_i(Transition) =$ "Complete", then $e'_i = e_i \oplus (TIP1, e_i(Timestamp) - e_j(Timestamp))$ such that $j < i$, $e_j(Activity) = e_i(Activity)$ and, for all $j < k < i$, $e_k(Activity) \ne e_i(Activity)$.

Note that start or complete events may be missing and that correlation may be non-trivial [1]. Hence, there are multiple ways to measure activity durations in these cases. We support the approach in [2], where the start time is approximated as follows: it is either the time of completion of the previous activity within the same process instance or the time of completion of the previous activity by the same resource (possibly in a different process instance), depending on which one comes latest. This is based on the assumption that the resource could have started to work on the activity as soon as the previous activity completed and the resource was not working on other activities. The full formalization is beyond the scope of this paper.

Trace Manipulation TIP2: Time Elapsed since the Start of the Case. Each non-empty case has a first event. The timestamp of this event marks the start of the case. Events corresponding to the same case also have a timestamp. The difference between both corresponds to the time elapsed since the start of the case. This information is added to all events. Formally, given a trace $t = \langle e_1, \ldots, e_n \rangle \in \mathcal{E}^*$, $TIP2(t) = \langle e'_1, \ldots, e'_n \rangle \in \mathcal{E}^*$ where, for $1 \le i \le n$, $e'_i = e_i \oplus (TIP2, e_i(Timestamp) - e_1(Timestamp))$.

Trace Manipulation TIP3: Remaining Time Until the End of the Case. Each non-empty case has a last event. The timestamp of this event marks the end of the case. Events corresponding to the same case also have a timestamp. The difference between both corresponds to the remaining flow time. This information is added to all events. Formally, given a trace $t = \langle e_1, \ldots, e_n \rangle \in \mathcal{E}^*$, $TIP3(t) = \langle e'_1, \ldots, e'_n \rangle \in \mathcal{E}^*$ where, for $1 \le i \le n$, $e'_i = e_i \oplus (TIP3, e_n(Timestamp) - e_i(Timestamp))$.

Trace Manipulation TIP4: Case Duration. This event characteristic measures the difference between the last event and the first event of the case the current event belongs to. Formally, given a trace $t = \langle e_1, \ldots, e_n \rangle \in \mathcal{E}^*$, $TIP4(t) = \langle e'_1, \ldots, e'_n \rangle \in \mathcal{E}^*$ where, for $1 \le i \le n$, $e'_i = e_i \oplus (TIP4, e_n(Timestamp) - e_1(Timestamp))$.

Trace Manipulation TIP5: Current Timestamp. This information corresponds to the event's timestamp. Strictly speaking no manipulation is needed since this is already an event attribute. Nevertheless, we list it to be able to refer to it later.

Other time-related event characteristics are also possible. For example, if schedule events are present, one can measure waiting times and attach this information to events.
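A sketch of TIP2 and TIP3 under the assumed representation (every event is expected to carry a datetime Timestamp; this is an illustration, not the ProM implementation):

```python
def tip2_tip3(trace):
    """TIP2/TIP3: elapsed time since the start of the case and remaining time until its end."""
    start = trace[0]["Timestamp"]
    end = trace[-1]["Timestamp"]
    return [{**event,
             "TIP2": event["Timestamp"] - start,   # elapsed time
             "TIP3": end - event["Timestamp"]}     # remaining time
            for event in trace]
```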

3.5. Conformance perspective

The time perspective is important for performance-related questions. However, process mining also includes conformance-related questions in the context of compliance management, auditing, security, etc. To be able to answer conformance-related questions we need to have a specification of the desired behavior. This may be a complete process model or just a set of simple rules. Here we consider two forms of specified normative behavior: (1) a process model (typically a Petri net) and (2) a constraint expressed in terms of LTL (Linear Temporal Logic). The LTL-based checking is based on the technique described in [10]. The model-based checks rely on the ProM implementation of the techniques discussed in [11], which are concerned with finding an alignment of traces in the log with the control flow of process models. An extension of the technique is provided in [12], where the other perspectives are also considered. Note that these are just examples; any evaluation of the observed behavior with respect to the normative behavior can be used (including declarative languages like Declare).

To understand such trace manipulations, it is important to understand the seminal notion of alignments [11]. To establish an alignment between a process model and an event log we need to relate "moves" in the log to "moves" in the model. However, it may be the case that some of the moves in the log cannot be mimicked by the model and vice versa. Basically, there are three kinds of moves in an alignment:


• $(e, \gg)$ is a "move on log", i.e., in reality an event e occurred that could not be related to a move of the model;
• $(\gg, a)$ is a "move on model", i.e., in the process model activity a needs to be executed, but in reality there was no related event that actually occurred; and
• $(e, a)$ is a "synchronous move" if in reality an event e occurred that corresponds to an activity a executed in the model and vice versa.

In an optimal alignment the number of synchronous moves is maximized and the number of moves on log and moves on model is minimized. The optimal alignment of a trace with respect to a given model is also used to compute the fitness of the trace. Fitness is measured as a value between 0 and 1; the value 1 indicates perfect fitness, which means that the alignment only contains synchronous moves (thus, no deviations). As the percentage of moves on log and model over all moves increases, the fitness value decreases. The value 0 indicates a very poor fitness. Interested readers are referred to [12,13].


Trace Manipulation COP1P: Trace Fitness. Given a process model P, this manipulation augments each event with the fitness value of the trace to which the event belongs.

Trace Manipulation COP2P,a: Number of Not Allowed Executions of Activity a Thus Far. Using process model P, an optimal alignment is created per trace. This trace manipulation augments each event e with an integer characteristic that denotes the number of "moves on log". It counts the number of times something happened in reality before e in the corresponding trace that was not possible or allowed according to the process model P [11].

Trace Manipulation COP3P,a: Number of Missing Executions of Activity a Thus Far. This trace manipulation augments each event with the number of "moves on model". It counts the number of times something had to happen in model P before e in the corresponding trace although it did not happen in reality [11].

Trace Manipulation COP4P,a: Number of Correct Executions of Activity a Thus Far. This trace manipulation augments each event with the number of "synchronous moves", i.e., the number of times that, before e in the corresponding trace, an event actually occurred when it was expected according to the model P [11].

Trace Manipulation COP5F: Satisfaction of Formula F Considering the Whole Trace. This trace manipulation augments each event with an additional boolean characteristic and assigns a true value if and only if formula F is satisfied. At the time of writing, formula F can be defined through LTL and checked using the LTL checker discussed in [10].
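For illustration, the following sketch attaches running counts of the three move types to log events, assuming the alignments have already been computed (e.g., by an alignment-based conformance checker) and are given as a list of (event, model activity) steps in which None plays the role of ≫; unlike COP2–COP4, the counts here are taken over all activities rather than per activity a, for brevity:

```python
def add_move_counts(alignment):
    """Attach to each log event the number of moves on log, moves on model and
    synchronous moves observed before that event in the aligned trace.
    `alignment` is a list of (event, model_activity) steps: event is None for a
    "move on model", model_activity is None for a "move on log"."""
    counts = {"MovesOnLogSoFar": 0, "MovesOnModelSoFar": 0, "SynchronousMovesSoFar": 0}
    enriched = []
    for event, model_activity in alignment:
        if event is not None:
            enriched.append({**event, **counts})   # counts of moves before this event
        if event is None:
            counts["MovesOnModelSoFar"] += 1
        elif model_activity is None:
            counts["MovesOnLogSoFar"] += 1
        else:
            counts["SynchronousMovesSoFar"] += 1
    return enriched
```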

3.6. Trace manipulation: a concrete example

The trace manipulations described thus far have all been implemented in ProM (see the FeaturePrediction package). Note that the set of manipulations is not fixed and can be extended. Also note that any context information can be used to enrich the event log with further information, i.e., it is not limited to the perspectives in Fig. 2(a) and may include truly external things like weather or traffic information.

Table 1 illustrates the application of two trace manipulation functions to a fragment of an event log. As a result, two event characteristics have been added to each event; characteristic NextActivityInTrace is nominal whereas ElapsedTime is numerical. This shows that the log manipulations are applied to create both nominal and numerical characteristics. Let us denote the set of all activity names seen in the event log by $X_L$, i.e., $X_L$ = {Preoperative Screening, Laparoscopic Gastrectomy, Nursing, First Hospital Admission}. Trace manipulation First Occurrence of Any Activity in Set $X_L$ After the Current Event ($CFP2_{X_L}$) was used to add the next activity to each event. Note that the last event of each case has value $\bot$ because there is no next activity. Trace manipulation Time Elapsed since the Start of the Case (TIP2) adds durations. Note that the last event of each case now has a characteristic capturing the case's flow time.

4. Event selection filter

As described in Definition 2, an analysis use case is a triple $(c_r, C_d, F)$. The trace manipulations discussed in Section 3 can be used to create the dependent characteristic $c_r$ and the independent characteristics $C_d$. In this section we elaborate on the choices for the event-selection filter $F \subseteq \mathcal{E}$ that selects the events that are retained for the analysis. Recall that each of the retained events corresponds to an instance of the learning problem that aims to explain $c_r$ in terms of $C_d$.


Table 1. Fragment of a hospital's event log with four traces. Let us denote the set of all activity names seen in the event log by $X_L$. The gray columns have been added after applying two trace manipulations: First Occurrence of Any Activity in Set $X_L$ After the Current Event ($CFP2_{X_L}$) and Time Elapsed since the Start of the Case (TIP2). NextActivityInTrace and ElapsedTime are the names of the characteristics that are added as a result of these manipulations.

Case | Timestamp | Activity | Resource | Cost | NextActivityInTrace | ElapsedTime
1 | 1-12-2011:11.00 | Preoperative Screening | Giuseppe | 350 | Laparoscopic Gastrectomy | 0 days
1 | 2-12-2011:15.00 | Laparoscopic Gastrectomy | Simon | 500 | Nursing | 1.16 days
1 | 2-12-2011:16.00 | Nursing | Clare | 250 | Laparoscopic Gastrectomy | 1.20 days
1 | 3-12-2011:13.00 | Laparoscopic Gastrectomy | Paul | 500 | Nursing | 2.08 days
1 | 3-12-2011:15.00 | Nursing | Andrew | 250 | First Hospital Admission | 2.16 days
1 | 4-12-2011:9.00 | First Hospital Admission | Victor | 90 | ⊥ | 3.92 days
2 | 7-12-2011:10.00 | First Hospital Admission | Jane | 90 | Laparoscopic Gastrectomy | 0 days
2 | 8-12-2011:13.00 | Laparoscopic Gastrectomy | Giulia | 500 | Nursing | 1.08 days
2 | 9-12-2011:16.00 | Nursing | Paul | 250 | ⊥ | 2.16 days
3 | 6-12-2011:14.00 | First Hospital Admission | Gianluca | 90 | Preoperative Screening | 0 days
3 | 8-12-2011:13.00 | Preoperative Screening | Robert | 350 | Preoperative Screening | 1.96 days
3 | 10-12-2011:16.00 | Preoperative Screening | Giuseppe | 350 | Laparoscopic Gastrectomy | 4.08 days
3 | 13-12-2011:11.00 | Laparoscopic Gastrectomy | Simon | 500 | First Hospital Admission | 6.88 days
3 | 13-12-2011:16.00 | First Hospital Admission | Jane | 90 | ⊥ | 7.02 days
4 | 7-12-2011:15.00 | First Hospital Admission | Carol | 90 | Preoperative Screening | 0 days
4 | 9-12-2011:7.00 | Preoperative Screening | Susanne | 350 | Laparoscopic Gastrectomy | 0.66 days
4 | 13-12-2011:11.00 | Laparoscopic Gastrectomy | Simon | 500 | Nursing | 5.84 days
4 | 13-12-2011:13.00 | Nursing | Clare | 250 | Nursing | 5.92 days
4 | 13-12-2011:19.00 | Nursing | Vivianne | 250 | ⊥ | 6.16 days


The trace manipulations discussed in Section 3 can be used to create the dependent characteristic c_r and the independent characteristics C_d. In this section we elaborate on the choices for the event-selection filter F ⊆ E that selects the events that are retained for the analysis. Recall that each of the retained events corresponds to an instance of the learning problem that aims to explain c_r in terms of C_d.
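A minimal sketch of how an analysis use case (c_r, C_d, F) could be represented as a data structure is given below; the record and field names are ours and do not correspond to the ProM API. Characteristics are identified by name and the event-selection filter is modeled as a predicate over events.

  import java.util.Map;
  import java.util.Set;
  import java.util.function.Predicate;

  // The triple (c_r, C_d, F) of Definition 2 as a plain Java record (hypothetical names).
  record AnalysisUseCase(
          String dependentCharacteristic,                      // c_r
          Set<String> independentCharacteristics,              // C_d
          Predicate<Map<String, Object>> eventSelectionFilter  // F, as a predicate over events
  ) { }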

Recall that we assume that events are uniquely identifiable, i.e., no two events have identical attributes. There are two extremes: we can retain all events or just one event per case. However, we can also retain selected events, e.g., all complete events.

Event Filter EF1: Keep All Events. No events are removed and all are kept. Formally, EF1 = E.

Event Filter EF2: Keep First Event Only. Per case only the first event is retained. Given a trace t = ⟨e1, e2, …, en⟩, only e1 is kept and the rest is removed. Formally, EF2 = {e ∈ E : prefix(e) = ⟨⟩}.

Event Filter EF3: Keep Last Event Only. Per case only the last event is retained. Given a trace t = ⟨e1, e2, …, en⟩, only en is kept and the rest is removed. Formally, EF3 = {e ∈ E : postfix(e) = ⟨⟩}.

Event Filter EF4: Keep All Complete Events. Only events that have transaction type complete are retained. All other events (e.g., start events) are removed. Formally, EF4 = {e ∈ E : e(Transition) = "Complete"}.

Event Filter EF5: Keep All Start Events. Only events that have transaction type start are retained. Formally, EF5 = {e ∈ E : e(Transition) = "Start"}.

Event Filter EF6_a: Keep All Events Corresponding to Activity a. Only events that correspond to activity a are retained. Formally, EF6_a = {e ∈ E : e(Activity) = a}.

Event Filter EF7_r: Keep All Events Corresponding to Resource r. Only events that were performed by resource r are retained. Formally, EF7_r = {e ∈ E : e(Resource) = r}.

Event Filter EF8_F: Keep All Events Satisfying a Given Formula F. F is a user-provided boolean expression defined over the set C of names of process characteristics.


F can make use of the relational operators <, >, = and the logical operators conjunction (∧), disjunction (∨), and negation (¬). This filter retains all events for which F evaluates to true. This generalizes filters EF4, …, EF7. For instance, filter EF4 = EF8_{Transition = "Complete"}. Filter EF1 is also a specialization: EF1 = EF8_{true}.

It is also possible to combine a set of filters F_1, …, F_n where, for all 1 ≤ i ≤ n, F_i ⊆ E. The combined filter F̂ is equal to ⋂_{1 ≤ i ≤ n} F_i. For instance, if one wants to keep all complete events corresponding to activity a, the filter is EF4 ∩ EF6_a.
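The following Java sketch shows how some of the filters above can be expressed as predicates over events and how filters are combined: set intersection corresponds to the conjunction of predicates. The class and method names are ours; positional filters such as EF2 and EF3 would additionally need the position of the event within its trace, which is omitted here.

  import java.util.Map;
  import java.util.function.Predicate;

  class EventFilterSketch {
      // EF4: keep all complete events.
      static final Predicate<Map<String, Object>> EF4 =
              e -> "Complete".equals(e.get("Transition"));

      // EF6_a: keep all events corresponding to activity a.
      static Predicate<Map<String, Object>> ef6(String activity) {
          return e -> activity.equals(e.get("Activity"));
      }

      // EF8_F: keep all events satisfying a user-provided boolean expression F.
      static Predicate<Map<String, Object>> ef8(Predicate<Map<String, Object>> formula) {
          return formula;
      }

      // Combined filter: intersection of filters = conjunction of the predicates.
      // Example: keep all complete events of activity a -> combine(EF4, ef6("a")).
      static Predicate<Map<String, Object>> combine(Predicate<Map<String, Object>> f1,
                                                    Predicate<Map<String, Object>> f2) {
          return f1.and(f2);
      }
  }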

The event-selection filter F selects the instances used to learn the desired tree. Some analysis use cases require one instance per case. Table 2 shows the application of event filter Keep Last Event Only (EF3) to Table 1.

5. Overview of process-related questions that can be answered

The literature proposes several research works that correlate specific process characteristics with each other. This section aims to illustrate that those works propose ad hoc solutions to a more general problem: finding any type of correlation among arbitrary process characteristics at any level (event, case, process, resource, etc.). Our framework addresses this more general problem. In this way, we can certainly perform the same correlation analyses as those research works. However, by applying the appropriate trace manipulations and event-filter functions described earlier and by choosing the correct dependent and independent characteristics, we can answer many more correlation questions.

Tables 3 and 4 illustrate concrete analysis use cases that allow for performing the same correlation analyses as many earlier research works. This illustrates that it is possible to unify existing approaches. For the sake of simplifying the definition of the analysis use cases, we assume that all trace manipulations described in Section 3 are applied.


Table 2. The retained events after applying event filter Keep Last Event Only (EF3) to the event log shown in Table 1.

Case | Timestamp        | Activity                 | Resource | Cost | NextActivityInTrace | ElapsedTime
1    | 4-12-2011:9.00   | First Hospital Admission | Victor   | 90   | ?                   | 3.92 days
2    | 9-12-2011:16.00  | Nursing                  | Paul     | 250  | ?                   | 2.16 days
3    | 13-12-2011:16.00 | First Hospital Admission | Jane     | 90   | ?                   | 7.02 days
4    | 13-12-2011:19.00 | Nursing                  | Vivianne | 250  | ?                   | 6.16 days

Table 3. Three analysis use cases that illustrate how our framework works to unify existing approaches to correlate specific process characteristics.

#1. Analysis Use Case (COP5_F, {CFP4, ROP3, ∀ c ∈ C_Avail: DFP1_c}, EF1): Run-time prediction of violations of formula F. The aim of this use case is to predict, given the current status of the process instances, the next activities to work on to maximize the chances of achieving a given business goal expressed using some LTL formula F. The selected dependent characteristic is COP5_F ("Satisfaction of Formula F Considering the Prefix Trace Until Current Event e"); the selected independent characteristics are CFP4 ("Current Activity"), ROP3 ("Current Resource"), and DFP1_c ("Latest Recorded Value of Characteristic c Before Current Event") for all recorded characteristics c. The selected event-selection filter is EF1 ("Keep All Events"). In [14], an ad hoc solution is proposed for this problem where formulas are expressed in LTL.

#2. Analysis Use Case (DFP2_Outcome, {∀ c ∈ C_Avail \ {Outcome}: DFP2_c}, EF3): Prediction of the outcome of the execution of a process instance, which is stored as characteristic Outcome. The goal of this use case is to predict the outcome of a case. Predictions are computed using a set of complete process instances that are recorded in the event log. The last event of each trace is associated with a characteristic Outcome, which may be numerical or nominal (including boolean) depending on the specific setting. The prediction is done at the case level: one learning instance is created for each trace in the event log. The dependent characteristic is DFP2_Outcome ("Latest Recorded Value of Characteristic Outcome After Current Event") for a specific Outcome of interest. The selected independent characteristics are DFP2_c for all event characteristics c except Outcome. The chosen event-selection filter is EF3 ("Keep Last Event Only"). In [15], an ad hoc solution is proposed for this problem. In fact, this analysis use case is very close to traditional data mining, where the case is considered as a whole rather than zooming in on the individual events and behavior. Conforti et al. also propose an ad hoc solution to predict whether or not a process instance is going to complete with a fault; if it completes with a fault, its magnitude is also predicted [16]. In that case, the outcome is a numerical value that ranges between 0 and 1, where 0 indicates no fault and 1 the highest level of fault.

#3. Analysis Use Case (CFP2_X, {∀ c ∈ C_Avail: DFP2_c}, EF4 ∩ EF6_x): Mining of the conditions at a decision point that determine the activity to execute within a set X ⊆ X_L after execution of an activity x ∈ X_L. This analysis use case aims to determine the conditions that enable the execution of one of multiple activities when the execution flow reaches a decision point. Using the BPMN notation, a decision point is represented through the Exclusive Choice construct. An example of decision point analysis is shown in Fig. 3 using the BPMN notation, where we aim to discover which conditions enable a, b or c for execution. For the analysis use case, we retain every event of type complete (i.e. filter EF4 is applied) that refers to executions of activity x (i.e. EF6_x). For our example, x = z, which is the activity before the decision point. The conditions are defined over the process variables, which are the characteristics C_Avail present in the original event log. Therefore, characteristic DFP2_c ("Latest Recorded Value of Characteristic c Until Current Event") is used as an independent characteristic for all c ∈ C_Avail. The dependent characteristic is CFP2_X ("First Occurrence of Any Activity in Set X After Current Event"), where X is the set of activities that are at the decision point; for our example, X = {a, b, c}. In [17], an ad hoc solution is proposed for this problem.


Of course, in practice, one only needs to apply the subset of log manipulations that generate the characteristics used in the definition of the analysis use case.

Let C_Avail be the set of characteristics available in the original event log L, which are not obtained through trace manipulations. Every event of a log always contains the activity name; hence, the Activity characteristic is always in C_Avail. Let X_L denote the set of all values observed for the Activity characteristic in an event log L.

6. Trace clustering based on analysis use cases

As already discussed, performing an analysis use case produces a decision or a regression tree. These trees can be used to cluster event logs. Our clustering algorithm requires that each trace is associated with one single event; for instance, this is guaranteed by the event filters EF2 and EF3 in Section 4. As discussed in Section 2.3, each event becomes a training/test instance for learning decision or regression trees. Therefore, each trace is linked to one single instance that is exactly associated with one leaf of the tree.


Our basic idea for clustering is that all log traces associated with the same leaf are grouped within the same cluster; therefore, there are as many clusters as there are leaves in the decision tree.

For the sake of explanation, let us consider the second analysis use case in Table 3, "Prediction of the outcomes of the executions of process instances", and assume that variable Outcome is numerical. The tree that results from this analysis use case can be used to cluster traces according to the outcome's value. Therefore, each cluster groups traces that record executions of process instances with similar outcomes. However, traces with similar outcomes may still end up in different clusters if they have different values for other process characteristics. This happens when the tree contains two different leaves with similar values for the Outcome variable.
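The clustering step itself amounts to grouping traces by the leaf reached by their (single) learning instance. The sketch below conveys this idea; the Tree interface is a hypothetical placeholder for whatever tree learner is actually used (the implementation relies on Weka), and the generic type parameters stand for traces, instances and leaves.

  import java.util.ArrayList;
  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;

  class LeafClusteringSketch {
      // Assumed capability of the learned tree: map a learning instance to its leaf.
      interface Tree<I, L> { L leafOf(I instance); }

      // One cluster per leaf: traces whose instance reaches the same leaf are grouped.
      static <T, I, L> Map<L, List<T>> cluster(Map<T, I> instancePerTrace, Tree<I, L> tree) {
          Map<L, List<T>> clusters = new HashMap<>();
          for (Map.Entry<T, I> entry : instancePerTrace.entrySet()) {
              L leaf = tree.leafOf(entry.getValue());
              clusters.computeIfAbsent(leaf, k -> new ArrayList<>()).add(entry.getKey());
          }
          return clusters;
      }
  }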

The significant intervals of values of the process characteristics that determine the cluster to which a given instance is added are not known in advance; rather, they are computed as a result of the construction of the tree. If they were known in advance, one could easily define clusters by adding traces to clusters according to the values of such a variable.


Table 4. Additional analysis use cases that further illustrate how our framework unifies existing approaches to correlate specific process characteristics.

#4. Analysis Use Case (ROP3, {∀ c ∈ C_Avail: DFP1_c, …}, EF6_x): Prediction of the executor of a certain activity x. The goal is to discover the conditions that determine which resource is going to work on a given activity x during the process execution. The dependent characteristic is ROP3 ("Current Resource"). The set of independent characteristics contains, at least, a characteristic DFP1_c ("Latest Recorded Value of Characteristic c Before Current Event") for each c ∈ C_Avail. However, potentially any characteristic obtained through trace manipulation can be used to predict the executor; this explains the presence of "…" in the definition of the set of independent characteristics. Since the analysis use case is restricted to a single activity, the selected event-selection filter is EF6_x ("Keep All Events Corresponding to Activity x").

#5. Analysis Use Case (TIP3, {∀ x ∈ X_L: CFP1_x, ∀ c ∈ C_Avail: DFP2_c, …}, EF1): Prediction of the remaining time to the end of the process instance. The goal is, given a running process instance, to predict how long it will take for that instance to conclude. All the events are used to train the tree-based classifier and, hence, the event filter is EF1 ("Keep All Events"). The analysis use case aims to predict the remaining time; hence, the dependent characteristic is TIP3 ("Remaining Time Until the End of the Case"). As far as the independent characteristics are concerned, any characteristic that can be obtained through trace manipulation may be suitable; this explains the presence of "…" in the definition of the set of independent characteristics. Certainly, the following characteristics should be included: DFP2_c ("Latest Recorded Value of Characteristic c Until Current Event") for each c ∈ C_Avail as well as CFP1_x ("The Number of Executions of Activity x Until the Current Event") for each activity x observed in the event log (i.e. for all x ∈ X_L). This is similar to [18] when a multi-set abstraction is used; additionally, the current values of the process variables are also taken into account.

Fig. 3. Example of decision point analysis. Our framework can find the conditions that determine which activity is enabled at a decision point. In this figure, the decision point is highlighted through a circle: the analysis aims to discover the disjoint conditions that enable the three activities a, b and c.


However, even in that case, traces with similar values for the dependent characteristic but very different values for other relevant characteristics would be grouped in the same cluster, which would hence still contain traces with heterogeneous behavior.

When using any classifier, it is extremely common to have outliers, which in our framework translates into traces being filtered out and included in no cluster.

According to our clustering criteria, one trace can belong to one cluster only. In the following, we discuss whether a trace should be retained in its own cluster or filtered out.

This depends on whether the result of performing an analysis use case is a decision or a regression tree. Let C and U be the universe of characteristics and the universe of possible values, respectively; also, E = C ↛ U is the universe of events.

Decision Tree: Let us suppose we have a decision tree for a dependent characteristic c ∈ C. This decision tree is learned from a set I ∈ B(E) of instances, each of which is associated with a different trace. Let us suppose we have a log trace t that corresponds to an instance i ∈ I. According to the decision tree, this instance is classified as belonging to a tree leaf l with value v for characteristic c. Trace t is retained in the cluster for l if i(c) = v, i.e. if it is correctly classified.

Regression Tree: Let us suppose we have a regression tree tr for a dependent characteristic c. This regression tree is learned from a multi-set I of instances. Let us suppose we have a log trace t that corresponds to an instance i ∈ I. According to regression tree tr, i(c) should be v^exp_{tr,i,c} ∈ ℝ. However, differences can occur; therefore, we introduce the mean absolute percentage error (MAPE) of instance i when predicting through regression tree tr as follows:

MAPE_{tr,i,c} = 100% · |i(c) − v^exp_{tr,i,c}| / i(c)    (1)

The trace associated with an instance i is retained if MAPE_{tr,i,c} < X%, where X is a cut-off threshold defined by the process analyst. Clearly, a larger value for X filters out fewer instances. If X = 0, only instances with exactly the same value for the dependent characteristic are retained. As said, this extreme is undesirable: it would probably remove most of the traces, since it is unlikely to have exactly the same value for the dependent characteristic.
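A small sketch of the retention test of Eq. (1) is given below: the percentage error of an instance is computed from the actual and predicted values of the dependent characteristic, and the corresponding trace is kept only if the error stays below the analyst-chosen cut-off X (for example, X = 60 is used in Section 7.1). Method names are ours.

  class MapeRetentionSketch {
      // Eq. (1): MAPE_{tr,i,c} = 100% * |i(c) - v_exp| / i(c)
      static double mape(double actual, double predicted) {
          return 100.0 * Math.abs(actual - predicted) / actual;
      }

      // Retain the trace of instance i only if its error is below the cut-off X (in percent).
      static boolean retain(double actual, double predicted, double cutoffPercent) {
          return mape(actual, predicted) < cutoffPercent;
      }
  }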

In sum, our approach has the advantage of dynamically computing the intervals of the dependent characteristic that determine the clusters to which instances are added. Furthermore, it can determine whether individual traces are outliers and, thus, should be excluded from further analysis. When employing regression trees, determining the value to assign to the cut-off threshold is not always easy.

Section 7.1 discusses an example where it was crucial to discard a number of traces from a cluster because they were considered noise.



The example also shows how, sometimes, one wants to discard an entire cluster, as each of the comprised traces is, in fact, noise.

7. Implementation and evaluation

The general framework discussed above is implemented as a plug-in of ProM, an open-source "pluggable" framework for the implementation of process mining tools in a standardized environment (see http://www.promtools.org). The ProM framework is based on the concept of packages, each of which is an aggregation of several plug-ins that are conceptually related. Our new plug-in is available in a new package named FeaturePrediction, which is available in ProM version 6.4.

The main input object of our plug-in is an event log, whereas the output is a decision or a regression tree. The construction of decision and regression trees relies on the implementations of the algorithms C4.5 and RepTree developed in the Weka toolkit.8 As mentioned before, our framework envisions the possibility of augmenting/manipulating the event logs with additional features. In this respect, the tool is easily extensible: a new log manipulation can be plugged in by (1) implementing three methods in a Java class that inherits from an abstract class and (2) programmatically adding it to a given Java set of available log manipulations.
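The following is a hypothetical sketch of what such an extension could look like; the actual abstract class and method signatures of the FeaturePrediction package differ, and all names below are ours. It only conveys the shape of the contract: a characteristic name, its type, and a per-event computation over the enclosing trace.

  import java.util.List;
  import java.util.Map;

  // Hypothetical stand-in for the package's abstract log-manipulation class.
  abstract class LogManipulationSketch {
      abstract String characteristicName();   // e.g. "ElapsedTime"
      abstract boolean isNumeric();           // numeric vs. nominal characteristic
      abstract Object computeForEvent(List<Map<String, Object>> trace, int eventIndex);
  }

  // Example extension: a TIP2-style "time elapsed since the start of the case" in days.
  class ElapsedTimeManipulation extends LogManipulationSketch {
      String characteristicName() { return "ElapsedTime"; }
      boolean isNumeric() { return true; }
      Object computeForEvent(List<Map<String, Object>> trace, int eventIndex) {
          long start = (Long) trace.get(0).get("Timestamp");
          long now = (Long) trace.get(eventIndex).get("Timestamp");
          return (now - start) / (1000.0 * 60 * 60 * 24);
      }
  }
  // Registration would then amount to adding an instance to the set of available
  // manipulations, e.g. availableManipulations.add(new ElapsedTimeManipulation());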

This section illustrates how our framework can be used to help UWV. UWV (Employee Insurance Agency) is an autonomous administrative authority that implements employee insurances and provides labor market and data services. Within UWV, several processes are carried out to provide different types of social benefits for residents of the Netherlands. This paper analyzes two of these processes.

The first process is concerned with providing support for Dutch residents when they develop disabilities whilst entitled to unemployment benefits; later on, we refer to this process as the illness-management process. UWV provides illness benefits for these ill people, hereafter referred to as customers, and provides customized paths for them to return to work. The customer is supported and assessed by a team that consists of a reintegration supervisor, an occupational health expert and a doctor. They cooperate to assess the remaining labor capacity of the employee during the illness. If the person has not completely recovered after two years, the employee can apply for the Work and Income according to Labor Capacity Act (WIA).

The second process is about ensuring that benefits are provided quickly and correctly when a Dutch resident's work contract ends and he or she cannot immediately find new employment; later, we refer to this as the unemployment process. An instance of this process starts when a person, hereafter referred to as customer, applies. Subsequently, checks are performed to verify the entitlement conditions. If the checks are positive, the instance is executed for the entire period in which the customer receives the monetary benefits, which are paid in monthly installments. Entitled customers receive as many monthly installments as the number of years for which they were working.

8 http://weka.sourceforge.net/


Therefore, an instance can potentially be executed for more than one year.

UWV is facing various undesired executions of these two processes. For the unemployment process, UWV is interested in discovering the root causes of a variety of problems identified by UWV's management; for the illness-management process, UWV is interested in gaining insight into how process instances are actually being carried out.

For the illness-management process, UWV is interested in discovering whether there is any correlation between the characteristics of the process instances and the duration of their execution. The reason is that UWV is trying to reduce the duration of long instances with the purpose of speeding up the reintegration of customers into the job market. In Section 7.1, we illustrate one analysis use case that aims to discover the causes that determine whether a process execution is longer or shorter. In addition, we show that it is useful to cluster the event log according to this analysis use case. Without clustering the event log, we discover a single process model that does not provide insights into how process instances are carried out: the resulting model is too imprecise as it allows for too much behavior. Conversely, by splitting the event log into clusters according to the case duration, we can discover insightful process models.

For the unemployment process, during the entire period, customers must comply with certain duties; otherwise the customer is sanctioned and a reclamation is opened. When a reclamation occurs, this directly impacts the customer, who will receive lower benefits than expected or has to return part of the benefits. It also has a negative impact from UWV's viewpoint, as it tends to consume a lot of resources and time. Therefore, UWV is interested in knowing the root causes of opening reclamations in order to reduce their number. If the root causes are known, UWV can predict when a reclamation is likely to be opened and, hence, can enact appropriate actions to prevent it beforehand. Section 7.2 discusses a number of analysis use cases that aim to predict when reclamations are likely to happen.

7.1. Discovery of the illness management process

For the illness-management process, UWV is interested in discovering how process instances are being carried out. In order to answer this question through the application of process mining, we used an event log L containing 1000 traces.

Fig. 4 shows the BPMN process model that has been discovered from the event log with 1000 traces mentioned above. In particular, we employ the so-called Inductive Miner [19] using a noise threshold of 30%. The Inductive Miner generates a Petri net, which has subsequently been converted into a BPMN model [20]. The resulting model contains a loop that internally comprises an exclusive choice between 19 activities. This means that these 19 activities can occur an arbitrary number of times in any order, interleaved with a few additional activities.


Fig. 4. The BPMN process model discovered for the illness-management process when using every trace in the log.


Therefore, the model is clearly unsatisfactory because it is not sufficiently precise: an excessively large variety of behavior is possible.9

This result is due to the fact that the miner tries to discover one single process model out of an event log that records heterogeneous process behavior. In these situations, as mentioned above, it is advisable to split the event log into clusters, each comprising traces with similar behavior, and to discover one model per cluster.

UWV was of the opinion that the behavioral heterogeneity could be tackled by splitting the event log based on the process-instance duration. If instances of the same duration have similar behavior, then this can lead to discovering a set of more precise process models. For this purpose, we defined and performed an analysis use case where:

9 There are various ways to quantify the notion of precision. Here, we employ the notion presented in [11,1]. A model is precise if it does not allow for "too much" behavior. A model that is not precise is "underfitting". Underfitting is the problem that the model allows for behavior unrelated to the example behavior seen in the log. Intuitively, precision is the ratio between the amount of behavior observed in the event log and allowed by the model and the total amount of behavior allowed by the model. Precision is 1 if all modeled behavior was actually observed in the event log. Values close to 0 suggest that the model is underfitting.


Dependent characteristic: Case duration (TIP4).
Independent characteristics: The number of executions of each activity x observed in the log (CFP1_x) and the latest value observed for each characteristic c present in the log.
Filter: The last event of each trace (EF3): each process instance needs to be considered as a whole.

Since the case duration is a numerical value, we opted for generating a regression tree.
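A minimal sketch of the instance selection for this use case: filter EF3 keeps only the last event of each trace, and that event (enriched by the trace manipulations with the case duration and the other characteristics listed above) becomes one learning instance for the regression tree. The class and method names are ours.

  import java.util.ArrayList;
  import java.util.List;
  import java.util.Map;

  class DurationUseCaseSketch {
      // EF3: one learning instance per trace, namely its last (enriched) event.
      static List<Map<String, Object>> selectInstances(List<List<Map<String, Object>>> log) {
          List<Map<String, Object>> instances = new ArrayList<>();
          for (List<Map<String, Object>> trace : log) {
              if (!trace.isEmpty()) {
                  instances.add(trace.get(trace.size() - 1));
              }
          }
          return instances;
      }
  }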

7.1.1. Visual interface

Fig. 5 illustrates a screenshot of the tool showing the regression tree that provides an answer to the analysis use case in question. Before continuing to report on the application of our framework, it is worthwhile to briefly discuss how regression and decision trees are visualized in the tool.

In particular, each internal node refers to a different process characteristic, which concerns one of the five process perspectives considered in Section 3: control-flow, data-flow, resource, time and conformance. Compared with the implementation reported in [21], the nodes of the trees are now filled with a color. The color of an internal node depends on the perspective of the referred process characteristic.


Fig. 5. A screenshot of the implementation for ProM. The screenshot shows the regression tree that correlates several process characteristics with the duration of process instances. (For interpretation of the references to colors in this figure caption, the reader is referred to the web version of this paper.)

11 In particular, these activities concern reassessments of customers to establish the causes why they have not been reintegrated into the job market yet and to determine whether they will ever be reintegrated. In fact, if the illness turns out to be permanent, it is possible that the customer cannot restart his/her previous employment and, hence, UWV helps find a new job that suits the new physical situation.


For instance, looking at Fig. 5, one can observe that the internal nodes about the number of executions of activities are colored purple (control-flow perspective), whereas the internal node referring to the timestamp of the process-instance completion is colored blue (time perspective). Leaves are filled with a color that varies from white to black: the leaf associated with the lowest value for the dependent characteristic is colored white and the leaf associated with the highest value is colored black. The color shades from white to black, going through different intensities of yellow, red and brown, as the value associated with the leaf increases. When using a regression tree, each leaf is labeled with the average value of the instances associated with the leaf, followed in brackets by the number of instances for that leaf and the absolute error averaged over all instances associated with that leaf (in milliseconds).

Before concluding, it is good to stress that end users can click on any tree node and save the associated instances in a file in Attribute-Relation File Format (ARFF)10 to be loaded externally into data-mining tools, such as Weka, in order to employ different data-mining techniques.

7.1.2. Log cluster

The tree in Fig. 5 contains five leaves and, hence, five clusters are going to be created. However, one cluster needs to be filtered out because it contains traces that clearly have issues, as discussed in the following. The tree illustrates the relation between the duration of process instances and the number of executions of two activities, namely 26e_week_herijkingmoment (in English: 26th week recalibration time) and 78e_week_herijkingmoment (in English: 78th week recalibration time).

10 http://weka.wikispaces.com/ARFF


As the names suggest, these activities are executed when the customer has been handled for 26 and 78 weeks, respectively.11 Therefore, it is normal that, if these activities are executed, the duration of the process instance is longer than 26 or 78 weeks (around 0.5 and 1.5 years), respectively.

From the discussion above, it is clear that, if activity 78e_week_herijkingmoment is executed, activity 26e_week_herijkingmoment should also be in the log trace. Nonetheless, in 10% of the process instances, activity 26e_week_herijkingmoment is not performed whereas activity 78e_week_herijkingmoment is executed. This clearly shows that a number of process instances have issues because the assessment after 26 weeks was skipped, even though it is mandatory according to UWV's internal procedures. This means that this entire cluster should be discarded as it only contains outlier log traces.

We conclude this discussion by observing that, when both 26e_week_herijkingmoment and 78e_week_herijkingmoment are performed, the process-instance duration depends on when the instance concluded. Process instances that concluded before December 2013 ran for a shorter time than those that concluded after that date: 541 versus 683 days. Therefore, it is worth splitting the cluster of the traces containing executions of both activities according to the time of process-instance completion.

In the remainder of Section 7.1, we illustrate how the set of models that can be discovered from these clusters allows stakeholders to gain more insight into the process structure.

7.1.3. Discovery of models from log clusters

The first cluster refers to those traces where activities 26e_week_herijkingmoment and 78e_week_herijkingmoment were never executed and the process-instance duration was around 83.04 days ± 60%. This cluster contains 179 traces, which were used to discover a process model through the Inductive Miner, using the same noise threshold as previously, followed by the BPMN conversion. The resulting model is depicted in Fig. 6; compared with the model in Fig. 4, this model is much more desirable. First, the model is precise: it does not contain a large exclusive choice. Second, it does not contain arbitrary loops allowing for arbitrary executions of activities: each activity can be executed at most once.

It is worth observing that, even though Fig. 5 shows that there are 482 traces in which 26e_week_herijkingmoment and 78e_week_herijkingmoment were never executed, the corresponding cluster only contains 179 traces. This indicates that the choice of only retaining traces with an error of less than 60% has caused around 62% of the traces to be filtered out. Although this may not seem to be a good choice, it can easily be motivated. Many of the traces that have been filtered out contain fewer than 5 events, which might indicate that they are incomplete cases.

Fig. 6. Discovery of the BPMN process model using the cluster of the traces where activities 26e_week_herijkingmoment and 78e_week_herijkingmoment were never executed and the process-instance duration was around 83.04 days ± 60%. The model is significantly more precise than the model discovered when all traces are considered together.

Fig. 7. Discovery of the BPMN process model using the cluster of the traces where activities 26e_week_herijkingmoment and 78e_week_herijkingmoment were both executed at least once, the timestamp when the case concluded was before 10 December 2013 and the process-instance duration was around 541.75 days ± 60%. The model is again quite well structured and precise: there are no exclusive choices among several alternatives.


One might object that, if only 179 out of 1000 traces are retained, less behavior is observed in the event log and, thus, it is natural to obtain more precise models. However, this does not apply to this case study, as discussed below. First, we sampled 200 random traces four times, thus generating 4 event logs. Afterwards, we mined 4 process models: all models were very imprecise, similar to the model in Fig. 4 and definitely different from Fig. 6.

Fig. 7 illustrates the process model discovered for the long cases, where activities 26e_week_herijkingmoment and 78e_week_herijkingmoment were both executed at least once, the timestamp when the case concluded was before 10 December 2013 and the process-instance duration was around 541.75 days ± 60%. The log cluster contained around 100 traces. The model is highly precise: no exclusive choice is present among several alternative activities. The inclusion of the timestamp as an independent characteristic in the analysis use case is very important to discover a precise model. If we exclude the timestamp, we create one single cluster for every trace in which both 26e_week_herijkingmoment and 78e_week_herijkingmoment occur, irrespective of when each trace concludes. The discovered model is then similar to the model in Fig. 4 and, hence, insufficiently precise. This clearly illustrates that during 2013, the structure of the process and the order in which activities are performed went through a radical change. In other words, using consolidated process-mining terminology, the process experienced a so-called concept drift. This is also confirmed by UWV, which was aware of a concept drift at the beginning of 2014, even though it did not expect to observe such radical changes.




Space limitations prevent us from showing the model derived from the cases where activity 26e_week_herijkingmoment was executed at least once, activity 78e_week_herijkingmoment was never executed and the process-instance duration was around 304.75 days ± 60%. As mentioned above, we excluded from the analysis the cluster of the traces where activity 26e_week_herijkingmoment was never executed whereas activity 78e_week_herijkingmoment was executed at least once. Those traces show undesirable behavior: the customer assessment did not occur during the 26th week, even though UWV's procedure prescribes it.

7.2. Analysis of the unemployment process

This section looks at the process that deals with requests for unemployment benefits. As mentioned at the beginning of this section, UWV wants to investigate what factors cause reclamations to occur, with the aim of preventing them from happening. In order to discover the root causes, UWV formulated four questions:

Q1. Are customer characteristics linked to the occurrence of reclamations? And if so, which characteristics are most prominent?

Q2. Are characteristics concerned with how process instances are executed linked to the occurrence of reclamations? And if any, which characteristics matter most?

Q3. If the prescribed process flow is not followed, will this influence whether or not a reclamation occurs?

Q4. When an instance of the unemployment-benefit payment process is being handled, is there any characteristic that may trigger whether a reclamation is going to occur?

12 Customers are recurrent if they apply for monetary benefits multiple times because they find multiple temporary jobs and, hence, become unemployed multiple times.

Table 5 enumerates some of the analysis use cases that have been performed to answer the questions above. The analyses have been performed using a UWV event log containing 2232 process instances and 77,551 events. The remainder of this section details how the analysis use cases have been used to answer the four questions above.

The use cases U1, U2 and U3 use the number of executions of Reclamation as the dependent characteristic. For those analysis use cases, we opted to use decision trees and to discretize the number of executions. This is because UWV is interested in clearly distinguishing cases in which no reclamation occurs from those in which reclamations occur. The number of executions of Reclamation is discretized into two values: (0.0,0.0) and (0.0,5.0). When the number of executions of Reclamation is 0, this is shown as (0.0,0.0); conversely, any value greater than 0 is discretized as (0.0,5.0). We used the same discretization for U1, U2 and U3.
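The discretization itself is a simple two-way mapping, sketched below with names of our own choosing: a count of zero is mapped to the first bin and any positive count to the second.

  class ReclamationDiscretizationSketch {
      // Map the number of executions of Reclamation to the two bins used in U1-U3.
      static String discretize(int numberOfReclamations) {
          return numberOfReclamations == 0 ? "(0.0,0.0)" : "(0.0,5.0)";
      }
  }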

Question Q1: To answer this question, we performed use case U1 in Table 5. The results of this analysis are represented through the decision tree in Fig. 8. In particular, the screenshot refers to our implementation in ProM. The implementation allows the end user to configure a number of parameters, such as the level of decision-tree pruning, the minimum number of instances per leaf or the discretization method. In this way, the user can try several configurations, thus, e.g., balancing between over- and underfitting. In particular, the screenshot refers to the configuration in which the minimum number of instances per leaf is set to 100.

Looking at the tree in Fig. 8, some business rules can seemingly be derived. For instance, if the customer is a recurrent customer (WW_IND_HERLEVING > 0), a reclamation occurs, i.e. the leaf is labeled as (0.0,5.0).12 If this correlation really held, it would be quite unexpected: recurrent customers would tend to disregard their duties. Nonetheless, the label is also annotated with 318.0/126.0, which indicates that a reclamation is not opened for 126 out of the 318 recurrent customers (39%). Though not very strong, a correlation seems to exist between being a recurrent customer and incurring reclamations. Further investigation is certainly needed; perhaps additional customer characteristics might be needed to better discriminate, but they are currently not present in the event log used for the analysis.

Question Q2: First, we performed analysis use case U2. We obtained a decision tree that showed correlations between the number of reclamations and certain characteristics that were judged as trivial by UWV. For instance, there was a correlation of the number of reclamations with (1) the method of payment of the benefit installments to customers and (2) the number of executions of activity Call Contact door HH deskundige, which is executed to push customers to perform their duties. Since UWV considered these correlations trivial, the respective characteristics should be left out of the analysis. So, we excluded these characteristics from the set of independent characteristics and repeated the analysis. We refined the analysis use case multiple times by removing more and more independent characteristics. After 9 iterations, we performed an analysis use case that led to satisfactory results. The results of this analysis are represented through the decision tree in Fig. 9, which classifies 77% of the instances correctly.

This tree illustrates interesting correlation rules. Reclamations are usually not opened in those process instances in which (1) UWV never informs (or has to inform) a customer about changes in his/her benefits (the number of executions of Brief Uitkering gewijzigd WW is 0), (2) UWV's employees do not hand over work to each other (the number of executions of Brief Interne Memo is 0) and (3) either of the following conditions holds:


- no letter is sent to the customers (the number of executions of Brief van UWV aan Klant is 0);

- at least one letter is sent but UWV never calls the customer (the number of executions of Call Telefoonnotitie is equal to 0) and, also, the number of months for which the customer is entitled to receive a benefit is more than 12.


Table 5. The analysis use cases performed to answer questions Q1-Q4 about the UWV unemployment process.

U1. Analysis Use Case (CFP1_Reclamation, {∀ customer characteristic c: DFP1_c}, EF3): Are customer characteristics linked to the occurrence of reclamations? We aim to correlate the number of executions of activity Reclamation with the customer characteristics. Customer characteristics are a subset of the entire set of characteristics available in the original log, i.e. before applying any trace manipulation. For this analysis use case, we aim to analyze each trace as a whole and, hence, we use event filter EF3 to only retain the last event of each trace. The dependent characteristic is CFP1_Reclamation ("The Number of Executions of Activity Reclamation") and the set of independent characteristics includes DFP1_c ("Latest Recorded Value of Characteristic c Until Current Event") for each customer characteristic c of the original event log.

U2. Analysis Use Case (CFP1_Reclamation, {∀ process characteristic c: DFP1_c, TIP2, ∀ activity x different from Reclamation: CFP1_x}, EF3): Are characteristics concerned with how process instances are executed linked to the occurrence of reclamations? We aim to correlate the number of executions of activity Reclamation with the number of executions of every activity, the process-related characteristics and the elapsed time, i.e. the time since the process instance started. Process-related characteristics are a subset of the entire set of characteristics available in the original log, i.e. before applying any trace manipulation. They differ from the customer characteristics because they are related to the outcome of the management of the provision of unemployment benefits. For this analysis use case, we aim to analyze each trace as a whole and, hence, use event filter EF3 to only retain the last event of each trace. The dependent characteristic is CFP1_Reclamation ("The Number of Executions of Activity Reclamation") and the set of independent characteristics includes DFP1_c ("Latest Recorded Value of Characteristic c Until Current Event") for each process-related characteristic c of the original event log, TIP2 ("Elapsed Time since the Start of the Case") and CFP1_x ("Number of Executions of Activity x") for each process activity x.

U3. Analysis Use Case (CFP1_Reclamation, {COP1_P, ∀ activity x different from Reclamation: COP2_{P,x}, COP3_{P,x}, COP4_{P,x}}, EF3): If the prescribed process flow is not followed, will this influence whether or not a reclamation occurs? We aim to correlate the number of executions of activity Reclamation with the trace fitness and with the number of non-allowed, missing and correct executions of any other activity. For this analysis use case, a process model P is provided. Since we again aim to analyze each trace as a whole, we use event filter EF3 to only retain the last event of each trace. The dependent characteristic is CFP1_Reclamation ("The Number of Executions of Activity Reclamation") and the set of independent characteristics includes the fitness of each trace with respect to model P (manipulation COP1_P) as well as, for each activity x different from Reclamation, COP2_{P,x} ("Number of Not Allowed Executions of Activity x Thus Far"), COP3_{P,x} ("Number of Missing Executions of Activity x Thus Far") and COP4_{P,x} ("Number of Correct Executions of Activity x Thus Far").

U4. Analysis Use Case (CFP2_All, {TIP2, ∀ process characteristic c: DFP1_c, ∀ activity x different from Call Contact door HH deskundige: CFP1_x}, EF4): When an instance of the unemployment process is handled, is there any characteristic that may trigger whether a reclamation is going to occur next? We aim to predict when a reclamation is going to occur as the next activity during the execution of a process instance. For this purpose, we predict which activity is going to follow for any process activity and, then, we focus on the paths leading to predicting that the following activity is, indeed, Reclamation. For this analysis use case, differently from the others in the table, we use every complete event; thus, the event filter is EF4 ("Keep All Complete Events"). The dependent characteristic is CFP2_All, indicating the first occurrence of any activity after the current event, i.e. the next activity in the trace. The set of independent characteristics includes TIP2 ("Time Elapsed since the Start of the Case") as well as DFP1_c ("Latest Recorded Value of Characteristic c Until Current Event") for each process-related characteristic c and CFP1_x ("Number of Executions of Activity x") for each process activity x different from Call Contact door HH deskundige.

Fig. 8. A screenshot of the framework's implementation in ProM that shows the decision tree used to answer question Q1.



Fig. 9. The decision tree used to answer question Q2.


From this analysis, we can conclude that UWV should reduce the hand-over of work. Moreover, it should pay more attention to customers when their situation changes, e.g. when they find a new job. When customers find a job, they again start having a monetary income. The problem seems to be related to customers who often do not provide information about the new job on time. In these cases, their benefits are not stopped or reduced when they should be. Consequently, a reclamation needs to be opened because these customers need to return the amount that was overpaid to them. Conversely, if a customer has already received benefits for 12 months, it is unlikely that a reclamation is going to occur. This can be explained quite clearly and is again related to changes in the customer's job situation. If benefits are received for more than 12 months, the customer has not found a job in the last 12 months and, thus, it is probably going to be hard for him or her to find one. So, UWV does not have to pay much attention to customers entitled to long benefits when aiming to limit the number of reclamations.

Question Q3: The answer to this question is given by performing analysis use case U3 in Table 5. This use case relies on a process model that describes the normal execution flow. This model was designed by hand, using knowledge of the UWV domain. The results of performing U3 are represented by the decision tree in Fig. 10. Analyzing the decision tree, a correlation is clear between trace fitness and the number of reclamations: 610 out of the 826 process executions (nearly 70%) with fitness higher than 0.89 do not comprise any reclamation. Therefore, it seems crucial for UWV to do its best to follow the normal flow, although this is often made difficult by hasty behavior of customers. This rule seems quite reliable and is also confirmed by the fact that 70% of the executions with fitness lower than 0.83 incur reclamations.

The decision tree contains an intermediate node labeled MREAL for IKF van Klant aan UWV. This characteristic refers to the number of missing executions of activity IKF van Klant aan UWV. This activity is executed in a process instance every time that UWV receives a declaration form from the customer. UWV requests customers to send a form every month to declare whether or not their condition has changed in the last month, e.g. whether they found a job.


The decision tree states that, when an execution deviates moderately, i.e. when the fitness is roughly between 0.83 and 0.89, a reclamation is still unlikely to be opened if the customer forgets to send the declaration form for at most 3 months (not necessarily in a row). Note that, since traces are quite long, and considering how fitness is computed, a difference of 0.06 in fitness can be quite remarkable. This rule is quite reliable since it holds in 79% of cases. Therefore, it is worthwhile for UWV to enact appropriate actions (such as calling by phone) to increase the chances that customers send the declaration form every month.

Question Q4: The answer to this question is given by performing analysis use case U4 in Table 5. We built a decision tree for this use case by limiting the minimum number of instances per leaf to 50. We are interested in tree paths that lead to Reclamation as the next activity in the trace. Unfortunately, the F-measure for Reclamation was very low (0.349), which indicates that it is not possible to reliably estimate whether a reclamation is going to occur at a certain moment of the execution of a process instance. We also tried to reduce the limit on the minimum number of instances per leaf. Unfortunately, the resulting decision tree was not valuable since it overfitted the instance set: the majority of the leaves were associated with less than 1% of the total number of instances. Conversely, the decision tree with 50 as the minimum number of instances per leaf could be useful to predict when a payment is sent out to a customer: the F-measure for the payment activity is nearly 0.735. Unfortunately, finding this correlation does not answer question Q4.

7.3. Final remarks

As mentioned in Section 1, we do not claim that our framework is able to perform analyses that were not possible before. For some types of analysis use cases, dedicated solutions were available and, for other types, the analyst had to manually go through several transformation and selection steps. The novelty of our framework is that we provide a single environment where analyses can be performed much more quickly and do not require process analysts to have a solid technical background.

In the remainder of this section, we illustrate the complexity and time consumption of performing analysis use cases without the framework.


Fig. 10. The decision tree used to answer question Q3.

Table 6. Creation of the set of learning instances using SQL in place of our framework.

CREATE FUNCTION NumReclamation (@cId)
  AS (select count(*) from event_table as et
      where et.cId = @cId and et.activity = 'Reclamation')

CREATE FUNCTION LatestValue (@cId, @Attr)
  AS (select @Attr from event_table as et1
      where et1.cId = @cId and et1.@Attr is not null
        and 0 = (select count(*) from event_table as et2
                 where et2.cId = et1.cId and et2.@Attr is not null and et2.timestamp > et1.timestamp))

select et1.*, NumReclamation(et1.cId), LatestValue(et1.cId, Attr_1), …, LatestValue(et1.cId, Attr_n)
from event_table et1
where et1.timestamp = (select max(timestamp) from event_table et2 where et2.cId = et1.cId)


For this purpose, we consider Question Q1 introduced in Section 7.2, whose corresponding analysis use case is in Table 5. For simplicity, let us suppose that the event log is stored in a database table event_table. Each row of the table is a different event of the log. This table contains columns cId, timestamp and activity, which represent the identifier of the case (i.e., of the process instance), the event's timestamp and the activity name, respectively. In addition, each additional process attribute is represented by a respective column Attr_1, …, Attr_n. In order to perform the analysis use case, we aim to write a SQL query that returns a tabular result that can be exported to a CSV file and loaded into data-mining tools. Table 6 shows the corresponding SQL query; in particular, to increase readability, we introduce two SQL functions that compute the number of occurrences of activity Reclamation in a process instance with identifier cId and the latest value written for a certain process attribute Attr for a process instance with identifier cId.

Space limitations prevent us from detailing how the SQL query and the helper functions are structured. However, we believe that this shows how long, error-prone and time-consuming the task of manually defining such SQL queries is for each analysis use case. Furthermore, process analysts are required to possess extensive knowledge of SQL, which is not always the case. If we replaced SQL and databases with alternative languages, the complexity would certainly be comparable. Here we have showcased an analysis use case for Q1 that is relatively simple compared with other analysis use cases discussed in this paper.


For instance, without our framework, the analysis use case for Q3 would require writing complex code or scripts to import the conformance results.

Conversely, using our framework and the corresponding ProM implementation, the task of defining analysis use cases is simple while remaining fairly generic. This can be done in a matter of minutes by simply selecting the appropriate dependent and independent characteristics from a provided list. The complexity only increases if one needs to incorporate new log-manipulation functions.

8. Related work

This paper has clearly shown that the study of process characteristics and how they influence each other is of crucial importance when an organization aims to improve and redesign its own processes. Many authors have proposed techniques to relate specific characteristics in an ad hoc manner. For example, several approaches have been proposed to predict the remaining processing time of a case depending on characteristics of the partial trace executed [22,18,23]. Other approaches are only targeted at correlating certain predefined characteristics with the process outcome [15,24,16], the process service time [25] or the violations of business rules [14]. Senderovich et al. [26] leverage queueing theory to correlate service time with the arrival of new activities to perform; the main drawback is that the approach can only be applied to single activities and assumes a certain arrival distribution.

The general framework proposed in this paper is in line with typical methodologies to discover knowledge, intended as a set of interesting and nontrivial patterns, from data sets with some degree of certainty.



from data sets with some degree of certainty. In [27],Fayyad et al. propose a methodology and guidelines toexecute the process of knowledge discovery. A number ofabstract steps are proposed: selection, pre-processing,transformation, data mining and interpretation. The simi-larity between these steps and those proposed by ourframework is quite evident. The selection and pre-processing steps match the definition of the analysis usecase, whereas the transformation step coincides with themanipulation of event logs as discussed in Section 3.Finally, the data mining step in the methodology of Fayyadet al. corresponds to the actual construction of decisionand regression trees to provide answers to analysis usecases. While our framework shares commonalities withthe methodology of Fayyad et al., several major differencesexist. It is not in the scope of the methodology of Fayyadet al. to provide a concrete framework and operationaliza-tion that allow non-experts to easily conduct a process-related analysis.

Furthermore, our paper adapts the methodology ofFayyad et al. for the domain of process mining. The funda-mental difference between data and process mining is thatprocess mining is concerned with processes. As extensivelydiscussed, processes take several perspectives into account.Therefore, these perspectives need to be elected as first-classcitizens to obtain reliable correlations. The outcomes, thetime for executing a certain activity or any other character-istic in a process instance may depend on several factors thatare outside the activity in question. For instance, resourceand group workloads, delays, deadlines, the customer type(e.g., silver, gold or platinum) or the number of undergonechecks may have strong influence. Factors can be evenoutside the specific process instance; namely, these factorsare part of the external context around which the activity isbeing executed. Without taking them into consideration, thecorrelation is certainly weaker. Therefore, each event cannotbe considered in isolation but should consider the otherevents and, even, aspects outside the process in question.Somehow, Data Mining tends to take events, i.e. the learninginstances, in isolation. This also motivates the importance ofaugmenting and manipulating the event log.

In addition to decision and regression trees, many other machine-learning techniques exist and some have already been applied in BPM, such as Bayesian Networks [28], Case-Based Reasoning [29], Markov Models [23], Naive Bayes Classifiers and Support Vector Machines [30]. These are certainly valuable, but they are only able to make correlations for single instances of interest or to return significant examples of relevant instances. Conversely, we aim to aggregate knowledge extracted from the event logs and return it as decision rules. Association rules [3] could have been an alternative, but decision and regression trees have the advantage of clearly highlighting the characteristics that are most discriminating. Also, linear regression [7] could be applied; however, it is only applicable when both the dependent and independent characteristics are numerical, thus excluding many valuable process characteristics. Moreover, we believe that linear formulas would not be understood by non-expert users as well as decision or regression trees would. Last but not least, stochastic analysis has been used to predict the remaining service time [31]; the approach requires a process model and it is not clear how it can be generalized to any process characteristics.

There is also a body of work concerned with providing intuitive visual representations of the root causes of problems or undesirable events, such as Cause–Effect or Why-Why diagrams (see, e.g., [32, pp. 190–198]). We envision these intuitive visualizations as complementary to our framework and its visualization. Furthermore, before drawing such diagrams, one needs to gain the necessary insights, which our framework helps to obtain.

This paper extends the framework initially proposed in [21] to allow grouping the traces of an event log into clusters. The literature proposes a large number of approaches to address this topic, e.g. [33–38]. All existing approaches share the same idea: similar traces are grouped together and placed in the same cluster; the difference among these approaches resides in the use of different notions of trace similarity. However, existing approaches offer a predetermined repertoire of similarity notions, whereas this paper proposes an approach to trace clustering that is based on a configurable notion of trace similarity. We believe that there is no concept of trace similarity that is generally applicable in every situation; in fact, the similarity among traces depends on the specific process domain. Furthermore, the predetermined repertoire is mostly based on the control-flow perspective (e.g. the order in which activities are performed). Conversely, our framework makes it possible to define the similarity concept as a function of characteristics related to several perspectives: the duration of the process-instance execution, the workload of the process participants when the instance was carried out, the compliance of the execution and many others.

In addition to using different concepts of trace similarity, existing approaches also use different clustering techniques, such as agglomerative hierarchical clustering, k-means or DBSCAN [7]. When using these techniques for trace clustering, process analysts do not gain a clear insight into the process characteristics that are common to all traces within the same cluster. Conversely, decision and regression trees clearly highlight the characteristics that discriminate. This means that, when a tree is used for trace clustering, it becomes clear why a certain log trace belongs to a given cluster and not to another.
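As a minimal illustration of this point, the sketch below (not the ProM implementation; the trace-level features and the dependent characteristic are invented for the example) trains a shallow decision tree on multi-perspective trace characteristics and then treats each leaf as a cluster, so that the conditions on the path to a leaf explain why a trace belongs to that cluster.

```python
# Hypothetical sketch: decision-tree leaves as trace clusters.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

traces = pd.DataFrame({
    "duration_days":     [3, 40, 5, 42, 4, 38],
    "resource_workload": [2, 9, 3, 8, 2, 7],
    "num_deviations":    [0, 4, 1, 5, 0, 3],
    "outcome":           ["accepted", "rejected", "accepted",
                          "rejected", "accepted", "rejected"],
})

X = traces[["duration_days", "resource_workload", "num_deviations"]]
y = traces["outcome"]  # dependent characteristic of the analysis use case

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Each leaf id becomes a cluster label; the printed rules explain why a trace
# ended up in its cluster.
traces["cluster"] = tree.apply(X)
print(export_text(tree, feature_names=list(X.columns)))
print(traces.groupby("cluster").size())
```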

9. Conclusion

Process mining can be viewed as the missing link between model-based process analysis and data-oriented analysis techniques. The lion's share of process mining research has been focusing on process discovery (creating process models from raw data) and on replay techniques to check conformance and analyze bottlenecks. It is crucial that certain phenomena can be explained, e.g., "Why are these cases delayed at this point?", "Why do these deviations take place?", "What kind of cases are more costly due to following this undesirable route?", and "Why is the distribution of work so unbalanced?". These and other questions may involve process characteristics related to different perspectives (control-flow, data-flow, time, organization, cost, compliance, etc.). Specific questions (e.g., predicting the remaining processing time) have been investigated before, but a generic framework for correlating business process characteristics was missing.

In this paper, we presented such a framework and its implementation in ProM. By defining an analysis use case composed of three elements (one dependent characteristic, multiple independent characteristics and a filter), we can create a classification or regression problem. The result of performing an analysis use case is a decision or regression tree that describes the dependent characteristic in terms of the independent characteristics. As discussed, the resulting decision or regression tree can also be used to split event logs into clusters of traces with similar behavior. Clustering is a necessary step when the event log shows such heterogeneous behavior that many process-mining techniques, such as process discovery, cannot return insightful results. The advantage of the proposed clustering technique is that the notion of trace similarity is configurable according to the specific application domain and can leverage many process perspectives, instead of only the control-flow perspective.
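Purely as an illustration of how such an analysis use case maps onto a learning problem, the following sketch (ours, not the ProM implementation; the column names, the filter and the data are assumptions) applies a filter, selects one dependent and several independent characteristics, and learns a regression tree.

```python
# Hypothetical sketch of an analysis use case: filter + one dependent
# characteristic + independent characteristics -> regression tree.
import pandas as pd
from sklearn.tree import DecisionTreeRegressor, export_text

cases = pd.DataFrame({
    "customer_type":     ["silver", "gold", "platinum", "silver", "gold", "silver"],
    "num_deviations":    [0, 2, 0, 3, 1, 4],
    "resource_workload": [2, 7, 3, 9, 5, 8],
    "remaining_days":    [4.0, 21.0, 5.0, 30.0, 12.0, 35.0],
})

# Filter: restrict the analysis to a subset of cases (here, non-platinum ones).
subset = cases[cases["customer_type"] != "platinum"]

# Independent characteristics (categorical ones are one-hot encoded) and the
# dependent characteristic (remaining processing time in days).
X = pd.get_dummies(subset[["customer_type", "num_deviations", "resource_workload"]])
y = subset["remaining_days"]

tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))
```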

The approach has been evaluated using a case study within the UWV, one of the largest “administrative factories” in the Netherlands. The evaluation has demonstrated the usefulness of performing correlation analyses to gain insight into processes, as well as of clustering event logs according to the results of performing analysis use cases.

Acknowledgments

The work of Dr. de Leoni has received funding from the European Community's Seventh Framework Program FP7 under Grant agreement num. 603993 (CORE).

References

[1] W.M.P. van der Aalst, Process Mining: Discovery, Conformance and Enhancement of Business Processes, 1st edition, Springer Publishing Company, Incorporated, 2011.

[2] J. Nakatumba, Resource-aware business process management: analysis and support (Ph.D. thesis), Eindhoven University of Technology, ISBN: 978-90-386-3472-2, 2014.

[3] A. Dohmen, J. Moormann, Identifying drivers of inefficiency in business processes: a DEA and data mining perspective, in: Enterprise, Business-Process and Information Systems Modeling, LNBIP, vol. 50, Springer, Berlin, Heidelberg, 2010, pp. 120–132.

[4] L. Zeng, C. Lingenfelder, H. Lei, H. Chang, Event-driven quality of service prediction, in: Proceedings of the 8th International Conference on Service-Oriented Computing (ICSOC 2008), Lecture Notes in Computer Science, vol. 5364, Springer, Berlin, Heidelberg, 2008, pp. 147–161.

[5] W.M.P. van der Aalst, S. Dustdar, Process mining put into context, IEEE Internet Comput. 16 (1) (2012) 82–86.

[6] W.M.P. van der Aalst, Process cubes: slicing, dicing, rolling up and drilling down event data for process mining, in: M. Song, M. Wynn, J. Liu (Eds.), Asia Pacific Conference on Business Process Management (AP-BPM 2013), LNBIP, vol. 159, Springer-Verlag, Beijing, China, 2013, pp. 1–22.

[7] T.M. Mitchell, Machine Learning, 1st edition, McGraw-Hill, Inc., New York, NY, USA, 1997.

[8] I.H. Witten, E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, 3rd edition, Morgan Kaufmann Series in Data Management Systems, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2011.

[9] J. Dougherty, R. Kohavi, M. Sahami, Supervised and unsupervised discretization of continuous features, in: Proceedings of the Twelfth International Conference on Machine Learning (ICML'95), Morgan Kaufmann, Tahoe City, California, USA, 1995, pp. 194–202.

[10] W.M.P. van der Aalst, H.T. Beer, B.F. van Dongen, Process mining and verification of properties: an approach based on temporal logic, in: Conference on the Move to Meaningful Internet Systems 2005: CoopIS, DOA, and ODBASE, Lecture Notes in Computer Science, vol. 3760, Springer, Berlin, Heidelberg, 2005, pp. 130–147.

[11] W.M.P. van der Aalst, A. Adriansyah, B.F. van Dongen, Replaying history on process models for conformance checking and performance analysis, WIREs Data Min. Knowl. Discov. 2 (2) (2012) 182–192.

[12] M. de Leoni, W.M.P. van der Aalst, Aligning event logs and process models for multi-perspective conformance checking: an approach based on integer linear programming, in: Proceedings of the 11th International Conference on Business Process Management (BPM'13), Lecture Notes in Computer Science, vol. 8094, Springer-Verlag, Beijing, China, 2013, pp. 113–129.

[13] M. de Leoni, F.M. Maggi, W.M.P. van der Aalst, An alignment-based framework to check the conformance of declarative process models and to preprocess event-log data, Inf. Syst. 47 (2015) 258–277.

[14] F.M. Maggi, C.D. Francescomarino, M. Dumas, C. Ghidini, Predictive monitoring of business processes, in: Proceedings of the 26th International Conference on Advanced Information Systems Engineering (CAiSE 2014), Lecture Notes in Computer Science, vol. 8484, 2014, pp. 457–472.

[15] J. Ghattas, P. Soffer, M. Peleg, Improving business process decision making based on past experience, Decis. Support Syst. 59 (2014) 93–107.

[16] R. Conforti, M. de Leoni, M. La Rosa, W.M.P. van der Aalst, Supporting risk-informed decisions during business process execution, in: Proceedings of the 25th International Conference on Advanced Information Systems Engineering (CAiSE'13), Lecture Notes in Computer Science, vol. 7908, Springer-Verlag, Valencia, Spain, 2013, pp. 116–132.

[17] A. Rozinat, W.M.P. van der Aalst, Decision mining in ProM, in: Proceedings of the 4th International Conference on Business Process Management (BPM'06), Lecture Notes in Computer Science, Springer-Verlag, Vienna, Austria, 2006, pp. 420–425.

[18] W.M.P. van der Aalst, M.H. Schonenberg, M. Song, Time prediction based on process mining, Inf. Syst. 36 (2) (2011) 450–475.

[19] S.J. Leemans, D. Fahland, W.M.P. van der Aalst, Discovering block-structured process models from incomplete event logs, in: Proceedings of the 35th International Conference on Application and Theory of Petri Nets and Concurrency (Petri Nets 2014), vol. 8489, Springer International Publishing, Tunis, Tunisia, 2014, pp. 91–110.

[20] A. Kalenkova, M. de Leoni, W.M.P. van der Aalst, Discovering, analyzing and enhancing BPMN models using ProM, in: Proceedings of the Demo Sessions of the 12th International Conference on Business Process Management (BPM 2014), CEUR Workshop Proceedings, vol. 1295, CEUR-WS.org, 2014, p. 36.

[21] M. de Leoni, W.M.P. van der Aalst, M. Dees, A general framework for correlating business process characteristics, in: Proceedings of the 11th Business Process Management (BPM'14), Lecture Notes in Computer Science, vol. 8659, Springer International Publishing, Eindhoven, The Netherlands, 2014, pp. 250–266.

[22] F. Folino, M. Guarascio, L. Pontieri, Discovering context-aware models for predicting business process performances, in: On the Move to Meaningful Internet Systems: OTM 2012, Lecture Notes in Computer Science, vol. 7565, Springer, Berlin, Heidelberg, 2012, pp. 287–304.

[23] G. Lakshmanan, D. Shamsi, Y. Doganata, M. Unuvar, R. Khalaf, A Markov prediction model for data-driven semi-structured business processes, Knowl. Inf. Syst. (2013) 1–30.

[24] A. Kim, J. Obregon, J.-Y. Jung, Constructing decision trees from process logs for performer recommendation, in: Proceedings of the 2013 Business Process Management Workshops, LNBIP, vol. 171, Springer, Beijing, China, 2014, pp. 224–236.

[25] A. Senderovich, M. Weidlich, A. Gal, A. Mandelbaum, Mining resource scheduling protocols, in: Proceedings of the 11th Business Process Management (BPM'14), Lecture Notes in Computer Science, vol. 8659, Springer International Publishing, Eindhoven, The Netherlands, 2014, pp. 200–216.

[26] A. Senderovich, M. Weidlich, A. Gal, A. Mandelbaum, Queue mining – predicting delays in service processes, in: Proceedings of CAiSE, Lecture Notes in Computer Science, vol. 7565, Springer, Thessaloniki, Greece, 2014, pp. 42–57.

[27] U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, From data mining to knowledge discovery: an overview, in: U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy (Eds.), Advances in Knowledge Discovery and Data Mining, American Association for Artificial Intelligence, 1996, pp. 37–54.

[28] R.A. Sutrisnowati, H. Bae, J. Park, B.-H. Ha, Learning Bayesian network from event logs using mutual information test, in: Proceedings of the 6th International Conference on Service-Oriented Computing and Applications (SOCA 2013), 2013, pp. 356–360.

[29] A. Aamodt, E. Plaza, Case-based reasoning: foundational issues, methodological variations, and system approaches, AI Commun. 7 (1) (1994) 39–59.

[30] M. Polato, A. Sperduti, A. Burattin, M. de Leoni, Data-aware remaining time prediction of business process instances, in: Proceedings of the 2014 International Joint Conference on Neural Networks (IJCNN), 2014.

[31] A. Rogge-Solti, M. Weske, Prediction of remaining service execution time using stochastic Petri nets with arbitrary firing delays, in: Proceedings of ICSOC, Lecture Notes in Computer Science, vol. 8274, Springer, Berlin, Germany, 2013.

[32] M. Dumas, M. La Rosa, J. Mendling, H.A. Reijers, Fundamentals of Business Process Management, Springer, Berlin, Heidelberg, 2013.

[33] C.C. Ekanayake, M. Dumas, L. García-Bañuelos, M. La Rosa, Slice, mine and dice: complexity-aware automated discovery of business process models, in: Proceedings of the 10th Business Process Management (BPM'13), Lecture Notes in Computer Science, vol. 8094, Springer, Berlin, Heidelberg, 2013, pp. 49–64.

[34] R.P.J.C. Bose, Process mining in the large: preprocessing, discovery, and diagnostics (Ph.D. thesis), Eindhoven University of Technology, Eindhoven, 2012.

[35] G.M. Veiga, D.R. Ferreira, Understanding spaghetti models with sequence clustering for ProM, in: Business Process Management Workshops, Lecture Notes in Business Information Processing, vol. 43, Springer, Berlin, Heidelberg, 2010, pp. 92–103.

[36] G. Greco, A. Guzzo, L. Pontieri, Mining taxonomies of process models, Data Knowl. Eng. 67 (1) (2008) 74–102.

[37] J. de Weerdt, S. vanden Broucke, J. Vanthienen, B. Baesens, Active trace clustering for improved process discovery, IEEE Trans. Knowl. Data Eng. 25 (12) (2013) 2708–2720.

[38] M. Sole, J. Carmona, A high-level strategy for C-net discovery, in: Proceedings of the 12th International Conference on Application of Concurrency to System Design (ACSD), 2012, pp. 102–111.
