TRANSCRIPT
Enabling Large-Scale Process Discovery
Sergio Hernández de Mesa { [email protected], [email protected] }
Eindhoven, The Netherlands
9th July, 2015
Outline
• Motivation
• MapReduce-based distributed process discovery
• Integration within ProM
• Evaluation
• Summary and Future Work
Motivation
Motivation: Data explosion phenomenon
Motivation: Big Data era
Motivation: Big Data and Process Mining
MapReduce-based distributed process discovery: MapReduce
• Programming model for data-oriented applications
• Proposed by Google
• Inspired by functional programming
• Scalable and easy to use
• Map: (k1, v1) → list(k2, v2)
• Reduce: (k2, list(v2)) → list(v3)
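The Map and Reduce signatures above can be illustrated with a minimal in-memory sketch (plain Python for illustration only, not actual Hadoop code), here counting words:

```python
from itertools import groupby
from operator import itemgetter

def map_fn(doc_id, text):
    # Map: (k1, v1) -> list(k2, v2); here: emit (word, 1) per word
    return [(word, 1) for word in text.split()]

def reduce_fn(key, values):
    # Reduce: (k2, list(v2)) -> list(v3); here: sum the counts per word
    return [(key, sum(values))]

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Shuffle/sort stage: group all intermediate pairs by key k2
    intermediate = sorted(
        (pair for k, v in inputs for pair in map_fn(k, v)),
        key=itemgetter(0),
    )
    output = []
    for key, group in groupby(intermediate, key=itemgetter(0)):
        output.extend(reduce_fn(key, [v for _, v in group]))
    return output

counts = dict(run_mapreduce([(1, "a b a"), (2, "b c")], map_fn, reduce_fn))
# counts == {"a": 2, "b": 2, "c": 1}
```

In a real Hadoop job the shuffle/sort stage and the distribution of map and reduce tasks over the cluster are handled by the framework; only the two functions are user code.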
MapReduce-based distributed process discovery: Hadoop
• Framework for reliable and scalable distributed computing
• Developed by Apache
• Core components:
  - Hadoop Distributed File System (HDFS)
  - Hadoop YARN (Yet Another Resource Negotiator)
[Figures: HDFS overview, YARN overview]
MapReduce-based distributed process discovery: Performance improvement opportunities
• Distribute/parallelize process discovery techniques
  - Step 1: consider relations at the trace level
  - Step 2: aggregate the information
  - Step 3: apply some magic (the discovery algorithm)
• HPC infrastructures
• Parallel programming models and technologies
  - MapReduce
  - Hadoop
MapReduce-based distributed process discovery: Highlights
• XES as input format
• Split the log into smaller sublogs
  - Horizontal partitioning
  - Automatically managed by HDFS
• MapReduce-based approach
  - Map: analyse event relation data (directly-follows, long-distance, splits/joins, etc.)
  - Reduce: aggregate data and apply simple transformations
• Process model built inside ProM
  - Reuse existing algorithms and representations
  - Visualize results
MapReduce-based distributed process discovery: Overview of process discovery techniques
• Alpha Miner
• Inductive Miner
• Flexible Heuristics Miner
MapReduce-based distributed process discovery: Computing the DFG (Hadoop/MapReduce approach)
[Figure: XES logs are stored in HDFS as blocks (Block 1, Block 2, …, Block N). In the split phase, each Map task reads the traces of one block and computes a partial directly-follows graph (DFG 1, …, DFG N). A Reduce task then aggregates the partial graphs into the final DFG.]
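A minimal sketch of this pipeline, assuming each trace is simply a list of activity names (plain Python for illustration, not the actual Hadoop job):

```python
from collections import Counter

def map_dfg(traces):
    # Map task: one HDFS block of traces -> partial directly-follows counts
    partial = Counter()
    for trace in traces:
        for a, b in zip(trace, trace[1:]):  # each adjacent pair a -> b
            partial[(a, b)] += 1
    return partial

def reduce_dfg(partials):
    # Reduce task: sum the partial DFGs into the final DFG
    final = Counter()
    for partial in partials:
        final.update(partial)
    return final

# Two "blocks" of traces, as HDFS would split them
block1 = [["a", "b", "c"], ["a", "c"]]
block2 = [["a", "b", "b", "c"]]
dfg = reduce_dfg([map_dfg(block1), map_dfg(block2)])
# dfg[("a", "b")] == 2, dfg[("b", "b")] == 1
```

Because directly-follows counts are computed per trace and summed, the partition of traces over blocks does not affect the final DFG, which is what makes the horizontal split safe.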
Integration within ProM: Core Concepts
• Hadoop Cluster Parameters
  - Connection with a Hadoop cluster
  - Verifies that the user has access to the cluster and that HDFS is accessible
• Hadoop XLog
  - Extends the XLog interface
  - Only a reference to the file is kept when it is imported
  - Actually loaded into memory only when a plugin requests some information
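The lazy-loading behaviour of the Hadoop XLog can be sketched as a simple proxy (illustrative Python; the real class extends the XLog Java interface in ProM, and all names here are hypothetical):

```python
class HadoopXLog:
    """Illustrative proxy: importing keeps only the HDFS path;
    the traces are materialized on first access."""

    def __init__(self, hdfs_path, loader):
        self.hdfs_path = hdfs_path   # reference kept at import time
        self._loader = loader        # function that actually reads the log
        self._traces = None          # nothing loaded into memory yet

    @property
    def traces(self):
        if self._traces is None:     # load only when a plugin asks
            self._traces = self._loader(self.hdfs_path)
        return self._traces

# Record every actual read to show that importing alone loads nothing
loads = []
log = HadoopXLog("/logs/log1.xes", lambda p: loads.append(p) or [["a", "b"]])
assert loads == []                   # import: only the reference exists
_ = log.traces                       # first request triggers the real load
assert loads == ["/logs/log1.xes"]
```

The same idea keeps huge logs usable in ProM: plugins that only forward the reference to the cluster never pay the cost of loading the log locally.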
Integration within ProM: Basic Operation
1. Connect to the Hadoop cluster
2. Virtually import the log
3. Send the executable jar file
4. Execute the MapReduce job
5. Retrieve the result
6. Get the final process model

Integration within ProM: Screenshot
[Figure: screenshot of the ProM plugin]
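The six steps can be sketched as one orchestration routine (all callables and names here are hypothetical stand-ins, not the plugin's real API, which drives the cluster through Hadoop's Java client):

```python
def discover_process_model(connect, upload, submit, build_model,
                           local_jar, hdfs_log_path, miner):
    """Illustrative orchestration of the plugin's six basic steps."""
    session = connect()                               # 1. connect to the Hadoop cluster
    log_ref = hdfs_log_path                           # 2. virtual import: keep only the HDFS path
    upload(session, local_jar)                        # 3. send the executable jar file
    job = submit(session, local_jar, log_ref, miner)  # 4. execute the MapReduce job
    result = job()                                    # 5. retrieve the result
    return build_model(result, miner)                 # 6. build the final process model

# Dummy callables standing in for a real cluster, for illustration only
steps = []
model = discover_process_model(
    connect=lambda: steps.append("connect") or "session",
    upload=lambda s, jar: steps.append("upload"),
    submit=lambda s, jar, log, m: lambda: steps.append("run") or "dfg",
    build_model=lambda result, m: (m, result),
    local_jar="discovery.jar",
    hdfs_log_path="/logs/log1.xes",
    miner="alpha",
)
# model == ("alpha", "dfg"); steps == ["connect", "upload", "run"]
```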
Evaluation: Experimental setup: Hardware configuration
• AIS Hadoop cluster
• 1 master node
  - 8 Intel Xeon E5430 CPU cores at 2.66 GHz
  - 32 GB of RAM
  - 5 × 300 GB hard disks
• 4 worker nodes
  - 8 Intel Xeon E5430 CPU cores at 2.66 GHz
  - 64 GB of RAM
  - 8 × 1 TB hard disks
Evaluation: Experimental setup: Hadoop configuration
• Apache Hadoop 2.6.0
• Up to 16 tasks (virtual cores) per worker node
• Up to 56 GB of RAM per worker node
• HDFS block size: 256 MB
• 2 replicas per block
• Master node: NameNode and ResourceManager services
• Worker nodes: DataNode and NodeManager services
Evaluation: Experimental setup: Process mining techniques
• Alpha Miner
  - No configuration parameters
• Inductive Miner
  - Inductive Miner infrequent
  - Noise threshold: 0.2
• Flexible Heuristics Miner
  - Heuristics: all tasks connected and long-distance dependency
  - Dependency threshold: 90.0
  - Relative-to-best threshold: 5.0
Evaluation: Experimental setup: Datasets
• Synthetic datasets
  - Process tree of 40 activities
  - Random generation
• Log 1
  - Average: 35 events per trace
• Log 2
  - 2 iterations of the synthetic dataset
  - 40 activities
  - Average: 70 events per trace
• Log 3
  - 2nd iteration with renamed activities
  - 80 activities
  - Average: 70 events per trace
Evaluation: Experimentation: Putting logs in HDFS
[Figure: time (minutes, 0–70) to put the logs in HDFS versus log size (0–416 GB), for Log 1, Log 2 and Log 3]
Evaluation: Experimentation: Scalability with log size (Log 1)
[Figure: time (minutes, 0–25) versus log size (0–256 GB) for the Alpha Miner and the Inductive Miner]
Evaluation: Experimentation: Scalability with log size (Log 1)
[Figure: time (minutes, 0–200) versus log size (0–256 GB) for the Flexible Heuristics Miner, split into its XES-to-DGraph and DGraph-to-AnnotatedDGraph phases]
Evaluation: Experimentation: Scalability with worker nodes (Log 1)
[Figure: speed-up (0–4) when computing the directly-follows graph versus log size (0–256 GB), with 1, 2, 3 and 4 workers]
Evaluation: Experimentation: Scalability with worker nodes (Log 1)
[Figure: speed-up (0–4) for the Flexible Heuristics Miner versus log size (0–256 GB), with 1, 2, 3 and 4 workers]
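Speed-up in these plots is presumably the usual ratio of the 1-worker runtime to the n-worker runtime (the slides do not state the formula); with made-up timings for illustration:

```python
def speedup(t_one_worker, t_n_workers):
    # speed-up = T(1 worker) / T(n workers); the ideal value on n workers is n
    return t_one_worker / t_n_workers

# Hypothetical timings (minutes), not taken from the experiments
s = speedup(120.0, 40.0)       # 3.0x on, say, 4 workers
efficiency = s / 4             # 0.75: fraction of the ideal 4x speed-up
```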
Evaluation: Experimentation: Comparison of the 3 datasets
[Figure: time (minutes, 0–50) versus log size (0–416 GB) for the Inductive Miner on Log 1, Log 2 and Log 3]
Evaluation: Experimentation: Comparison of the 3 datasets
[Figure: time (minutes, 0–800) versus log size (0–416 GB) for the Flexible Heuristics Miner on Log 1, Log 2 and Log 3]
Summary and Future Work: Conclusions
• MapReduce-based approach for process discovery
  - Split the input event log into smaller sublogs
  - Map phase: compute intermediate data from the sublogs
  - Reduce phase: aggregate all the data
  - Build the process model in ProM
• Results show the approach is scalable with respect to
  - Log size (number of events and traces)
  - Computing resources
• Integration of Apache Hadoop within ProM
  - Reusing already implemented algorithms
  - Developing new Hadoop-based techniques
Summary and Future Work: Future work
• Develop new process discovery algorithms
  - ILP Miner, Social Network Miner, etc.
• Extend to other process mining dimensions
  - Computing alignments
• Explore other input formats
  - CSV
  - Apache Avro
• Explore other distributed computing approaches
  - Apache Spark
  - Cloud computing
Enabling Large-Scale Process Discovery
Sergio Hernández de Mesa