java 2013 ieee datamining project region based foldings in process discovery

Region-Based Foldings in Process Discovery

ABSTRACT

A central problem in Process Mining is to obtain a formal model that represents the processes
conducted in a system. If realized, this simple motivation enables powerful techniques for formally
analyzing and optimizing a system, without resorting to its semiformal and sometimes inaccurate
specification. The problem addressed in this paper is known as Process Discovery: obtaining a formal
model from a set of system executions. The theory of regions is a valuable tool in process discovery: it
aims at learning a formal model (a Petri net) from a set of traces. In its original form, the theory is
applied to an automaton, so the traces must first be converted into an acyclic automaton before these
techniques can be applied. Given that the complexity of region-based techniques depends on the size of
the input automaton, revealing the underlying cycles and folding the initial automaton can significantly
reduce that complexity. In this paper, we follow this idea by incorporating region information into the
cycle detection algorithm, enabling the identification of complex cycles that cannot be obtained
efficiently with state-of-the-art techniques. The experimental results obtained with the devised tool
suggest that the techniques presented in this paper are a big step toward widening the application of the
theory of regions in Process Mining for industrial scenarios.

Existing System

The discovery of global patterns that can be used to make predictions about the future has been one of
the key elements that have made Data Mining one of the most relevant research areas in recent decades. Data
mining techniques can be applied naturally to large amounts of data, such as databases or even the
Internet, and, with the help of other disciplines like statistics or machine learning, can effectively
reveal important patterns in many scenarios such as health care, business, or transportation. Like data
mining, Process Discovery tries to reveal patterns. However, the patterns targeted by Process Discovery
techniques are process models, i.e., formal representations of the processes of a system. Because of this
different focus, Process Discovery techniques draw on disciplines different from the ones used in data
mining, so that both the statics and the dynamics of a system process can be derived. Depending on the
emphasis, different dimensions can be considered, ranging from social (the identification of communities)
to control-flow (the identification of the complex interplay between a system's tasks). In this work we
consider the latter: discovering a Petri net from a log, that is, from a set of traces corresponding to
executions of a system. The first method to obtain a Petri net from a log was presented.

Disadvantages

To overcome this limitation, several extensions have been presented in the literature to widen the class

of Petri nets that the algorithm can discover.

The theory of regions was initially proposed to solve the synthesis problem: obtain a Petri net that has a

behavior equivalent to a given transition system.

Proposed System

The theory of regions was initially proposed to solve the synthesis problem: obtain a Petri net whose
behavior is equivalent to that of a given transition system (TS). Three conversions from a language to a
TS were proposed, namely sequence, multiset, and set. The main difference between them is how it is
decided whether the occurrence of an event in a trace produces a new state in the TS or just introduces
an arc to an existing state. Besides the sequence and multiset conversions, other conversions have been
proposed in the literature that yield smaller TSs by means of abstractions, at the cost of sacrificing
regions. We use the term abstraction techniques to refer to them. The fundamental difference between all
these methods and our proposal is that, in our case, the set of sacrificed regions is controlled by
considering bounds that are already used by process discovery tools, so the compression of the TS does
not involve a quality reduction.
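
The three conversions named above can be illustrated with a minimal Python sketch. It is not the tool described in the paper: the function names `state_key` and `log_to_ts` and the tuple encoding of states are assumptions made only for this example. The sketch shows how the choice of conversion decides whether an event creates a new state or an arc to an existing one.

```python
from collections import Counter

def state_key(prefix, mode):
    """Map a trace prefix to a state identifier under the chosen conversion."""
    if mode == "sequence":
        return tuple(prefix)                           # the full history identifies the state
    if mode == "multiset":
        return tuple(sorted(Counter(prefix).items()))  # only how often each event occurred
    if mode == "set":
        return tuple(sorted(set(prefix)))              # only which events occurred
    raise ValueError(f"unknown mode: {mode}")

def log_to_ts(log, mode="multiset"):
    """Build a transition system (states, arcs) from a list of traces."""
    states = {state_key([], mode)}                     # initial state: empty prefix
    arcs = set()
    for trace in log:
        for i, event in enumerate(trace):
            src = state_key(trace[:i], mode)
            dst = state_key(trace[:i + 1], mode)
            states.add(dst)                            # a new state only if the key is new
            arcs.add((src, event, dst))
    return states, arcs

if __name__ == "__main__":
    log = [["a", "b", "c"], ["b", "a", "c"]]
    for mode in ("sequence", "multiset", "set"):
        states, arcs = log_to_ts(log, mode)
        print(mode, len(states), "states,", len(arcs), "arcs")
```

For this two-trace log the sequence conversion keeps the two interleavings apart (7 states), while the multiset conversion merges them into a diamond (5 states). The set conversion differs from the multiset one only when an event repeats within a trace.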

Advantages

An advantage of region theory for process discovery is that it allows label splitting.

Despite the advantages offered by the theory of regions, there are two main reasons that hamper a wider
adoption of region-based Process Discovery methodologies in an industrial setting. One is their
sensitivity to noise.

On the other hand, the benefits for rbminer are twofold, since a smaller region basis reduces the number
of regions to explore. In this case, both advantages (state and basis reduction) combine to achieve
speedups of orders of magnitude.

Module

1. Get Input Text File

2. Discovery Sentence Word

3. Decided Sentence

4. Tandem Repeats

5. Sequence And Multiset Conversions

6. Counting Data

Module Description

Get Input Text File

Process Discovery differs from synthesis in the knowledge assumption: while synthesis assumes a complete
description of the system, Process Discovery assumes only a partial one. Therefore, equivalence or
bisimulation is no longer a goal to achieve. Instead, obtaining approximations that succinctly represent
the log under consideration is more valuable.

Discovery Sentence Word

When a discovery algorithm returns a PN (Petri net) whose language is smaller than desired, this is
referred to as overfitting. A classical strategy to avoid overfitting is to let the algorithms restrict
their output to k-bounded PNs (k-bounded discovery), usually for small values of k, as nets with high
numbers of tokens are considered harder for humans to understand than nets with fewer tokens. The
particular k used in each case can be determined either from the desired level of complexity of the
resulting PN or from the number of available resources in the system (since places can represent
resources).
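
To make the role of the bound k concrete, the following Python sketch checks whether a candidate place is admissible in k-bounded discovery. The text above does not spell out the region condition, so the sketch assumes the usual multiset-region definition from the theory of regions: a region assigns a token count to every TS state such that all arcs with the same event label change that count by the same amount. The names `is_k_bounded_region`, `arcs`, and `r` are hypothetical.

```python
def is_k_bounded_region(arcs, r, k):
    """Return True if r (state -> token count) is a region with all values in 0..k.

    arcs: iterable of (source, event, target) triples of the TS.
    r:    dict giving the assumed token count of the candidate place in each state.
    """
    if any(v < 0 or v > k for v in r.values()):
        return False                      # violates the bound k
    gradient = {}
    for src, event, dst in arcs:
        delta = r[dst] - r[src]
        # Every arc labeled with the same event must change the count equally.
        if gradient.setdefault(event, delta) != delta:
            return False
    return True

if __name__ == "__main__":
    # Diamond TS for two concurrent events a and b.
    arcs = [("s0", "a", "s1"), ("s1", "b", "s2"),
            ("s0", "b", "s3"), ("s3", "a", "s2")]
    place_before_a = {"s0": 1, "s1": 0, "s2": 0, "s3": 1}
    print(is_k_bounded_region(arcs, place_before_a, k=1))
```

Restricting the search to small k shrinks the set of candidate assignments, which is one reason k-bounded discovery tends to be cheaper and to yield simpler nets.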

Decided Sentence

Three conversions from a language to a TS were proposed, namely sequence, multiset, and set. The main
difference between them is how it is decided whether the occurrence of an event in a trace produces a new
state in the TS or just introduces an arc to an existing state.

Tandem Repeats

The detection of unfolded cycles in an acyclic TS is a problem related to finding consecutively repeated
patterns in a string. The latter problem has been studied in several fields with many variations and
under different names, although it is often referred to as the problem of finding tandem repeats.
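
As an illustration of the string problem mentioned above, here is a deliberately naive Python sketch that scans a trace for consecutively repeated patterns. It is not the region-aware cycle detection of the paper and makes no attempt at efficiency. The function name `tandem_repeats` and its return format are assumptions made for this example.

```python
def tandem_repeats(seq, min_copies=2):
    """Find runs where a pattern is repeated consecutively in seq.

    Greedy left-to-right scan for each period length.
    Returns a list of (start, period, copies) triples.
    """
    n = len(seq)
    found = []
    for period in range(1, n // min_copies + 1):
        start = 0
        while start + period * min_copies <= n:
            pattern = seq[start:start + period]
            copies = 1
            # Count how many times the pattern follows itself directly.
            while seq[start + copies * period:start + (copies + 1) * period] == pattern:
                copies += 1
            if copies >= min_copies:
                found.append((start, period, copies))
                start += copies * period   # continue after the whole run
            else:
                start += 1
    return found

if __name__ == "__main__":
    # The pattern "b c" occurs twice in a row, hinting at an unfolded loop.
    print(tandem_repeats(["a", "b", "c", "b", "c", "d"]))
```

A repeat such as `(1, 2, 2)` marks a candidate cycle: the states reached before and after each copy of the pattern are the ones a folding step would consider merging.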

Sequence And Multiset Conversions

Besides the sequence and multiset conversions, other conversions have been proposed that can yield smaller
TSs at the cost of sacrificing regions. We use the term abstraction techniques to refer to them. The
fundamental difference between all these methods and our proposal is that, in our case, the set of
sacrificed regions is controlled by considering bounds that are already used by process discovery tools,
so the compression of the TS does not involve a quality reduction.

Counting Data

Since region-based approaches yield PNs that never reject a trace of the log, they are extremely sensitive
to noise. Hence, to be applicable, the approach presented in this paper must be preceded by a noise
filtering phase. The filtering can be done by clustering techniques or by outlier detection. Considering
the frequencies of the states is also a possibility in our approach to distinguish between real and noisy
states, because the latter often have low frequency. For instance, only Parikh vector differences between
frequent states could be taken into account to differentiate real folding opportunities from spurious
cycle unfoldings caused by noise. An advantage of region theory for process discovery is that it allows
label splitting (i.e., changing the label of some arcs in the TS so that an event is actually represented
by a set of different events). Label splitting is a technique that can help improve the visualization of
the PN, but also avoid generalizing too much. This technique can also be used with the TSs produced by our
approach. However, the splitting options might be reduced, because arcs with the same label in the
original TS may have been merged into one arc in the folded TS.
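
The frequency idea mentioned above (using only Parikh vector differences between frequent states) could look roughly like the following Python sketch. It is one possible reading, not the procedure used by the authors: the threshold `min_freq`, the fixed `alphabet` ordering, and the function names are all assumptions made for this example.

```python
from collections import Counter

def parikh(prefix, alphabet):
    """Parikh vector of a prefix: number of occurrences of each event."""
    counts = Counter(prefix)
    return tuple(counts[e] for e in alphabet)

def state_frequencies(log, alphabet):
    """Count how often each multiset state is visited across the log."""
    freq = Counter()
    for trace in log:
        for i in range(len(trace) + 1):
            freq[parikh(trace[:i], alphabet)] += 1
    return freq

def frequent_differences(log, alphabet, min_freq):
    """Count Parikh vector differences between frequent states of each trace."""
    freq = state_frequencies(log, alphabet)
    diffs = Counter()
    for trace in log:
        visited = [parikh(trace[:i], alphabet) for i in range(len(trace) + 1)]
        frequent = [s for s in visited if freq[s] >= min_freq]  # drop rare states
        for i, earlier in enumerate(frequent):
            for later in frequent[i + 1:]:
                diffs[tuple(b - a for a, b in zip(earlier, later))] += 1
    return diffs

if __name__ == "__main__":
    alphabet = ("a", "b", "x")
    log = [["a", "b", "a", "b"]] * 3 + [["a", "x"]]   # last trace plays the noisy one
    print(frequent_differences(log, alphabet, min_freq=2).most_common(3))
```

In this toy log the state reached through the rare event `x` is visited only once, so no difference involving `x` is counted, while the difference for one full `a b` iteration is the most frequent.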

FLOW CHART

Region-Based Process Discovery

Get The Input Text File

Discovery Sentence Word

Sequence and Multiset | Tandem Repeats | Counting Data

CONCLUSION

This paper presents a novel technique for compacting a TS, one of the objects typically used in process
discovery algorithms. Two main characteristics make this technique very attractive in the context of
region-based k-bounded process discovery: first, it is one of the most aggressive folding techniques in
the literature, and second, it preserves the important regions that are crucial for PN derivation. The use
of region-aware folding techniques like the one presented in this paper may be a crucial step toward
using region-based algorithms for process discovery in industrial scenarios.



Technology