TRANSCRIPT
Enabling Large-Scale Process Discovery
Sergio Hernández de Mesa { [email protected], [email protected] }
Eindhoven, The Netherlands
9th July, 2015
Outline
• Motivation
• MapReduce-based distributed process discovery
• Integration within ProM
• Evaluation
• Summary and Future Work
Motivation
Motivation: Data explosion phenomenon
Motivation: Big Data era
Motivation: Big Data and Process Mining
MapReduce-based distributed process discovery: MapReduce
• Programming model for data-oriented applications
• Proposed by Google
• Inspired by functional programming
• Scalable and easy to use
• Map: (k1, v1) → list(k2, v2)
• Reduce: (k2, list(v2)) → list(v3)
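The Map and Reduce signatures above can be illustrated with a minimal in-memory sketch (plain Python for illustration only, not actual Hadoop code), here counting words:

```python
from itertools import groupby
from operator import itemgetter

def map_fn(doc_id, text):
    # Map: (k1, v1) -> list(k2, v2); here: emit (word, 1) per word
    return [(word, 1) for word in text.split()]

def reduce_fn(key, values):
    # Reduce: (k2, list(v2)) -> list(v3); here: sum the counts per word
    return [(key, sum(values))]

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Shuffle/sort stage: group all intermediate pairs by key k2
    intermediate = sorted(
        (pair for k, v in inputs for pair in map_fn(k, v)),
        key=itemgetter(0),
    )
    output = []
    for key, group in groupby(intermediate, key=itemgetter(0)):
        output.extend(reduce_fn(key, [v for _, v in group]))
    return output

counts = dict(run_mapreduce([(1, "a b a"), (2, "b c")], map_fn, reduce_fn))
# counts == {"a": 2, "b": 2, "c": 1}
```

In a real Hadoop job the shuffle/sort stage and the distribution of map and reduce tasks over the cluster are handled by the framework; only the two functions are user code.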
MapReduce-based distributed process discovery: Hadoop
• Framework for reliable and scalable distributed computing
• Developed by Apache
• Core components:
  - Hadoop Distributed File System (HDFS)
  - Hadoop YARN (Yet Another Resource Negotiator)
[Figures: HDFS overview, YARN overview]
MapReduce-based distributed process discovery: Performance improvement opportunities
• Distribute/parallelize process discovery techniques
  - Step 1: consider relations at the trace level
  - Step 2: aggregate the information
  - Step 3: apply some magic (the discovery algorithm)
• HPC infrastructures
• Parallel programming models and technologies
  - MapReduce
  - Hadoop
MapReduce-based distributed process discovery: Highlights
• XES as input format
• Split the log into smaller sublogs
  - Horizontal partitioning
  - Automatically managed by HDFS
• MapReduce-based approach
  - Map: analyse event relation data (directly-follows, long-distance, splits/joins, etc.)
  - Reduce: aggregate data and apply simple transformations
• Process model built inside ProM
  - Reuse existing algorithms and representations
  - Visualize results
MapReduce-based distributed process discovery: Overview of process discovery techniques
• Alpha Miner
• Inductive Miner
• Flexible Heuristics Miner
MapReduce-based distributed process discovery: Computing the DFG (Hadoop/MapReduce approach)
[Figure: XES logs are stored in HDFS as blocks (Block 1, Block 2, …, Block N). In the split phase, each Map task reads the traces of one block and computes a partial directly-follows graph (DFG 1, …, DFG N). A Reduce task then aggregates the partial graphs into the final DFG.]
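A minimal sketch of this pipeline, assuming each trace is simply a list of activity names (plain Python for illustration, not the actual Hadoop job):

```python
from collections import Counter

def map_dfg(traces):
    # Map task: one HDFS block of traces -> partial directly-follows counts
    partial = Counter()
    for trace in traces:
        for a, b in zip(trace, trace[1:]):  # each adjacent pair a -> b
            partial[(a, b)] += 1
    return partial

def reduce_dfg(partials):
    # Reduce task: sum the partial DFGs into the final DFG
    final = Counter()
    for partial in partials:
        final.update(partial)
    return final

# Two "blocks" of traces, as HDFS would split them
block1 = [["a", "b", "c"], ["a", "c"]]
block2 = [["a", "b", "b", "c"]]
dfg = reduce_dfg([map_dfg(block1), map_dfg(block2)])
# dfg[("a", "b")] == 2, dfg[("b", "b")] == 1
```

Because directly-follows counts are computed per trace and summed, the partition of traces over blocks does not affect the final DFG, which is what makes the horizontal split safe.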
Integration within ProM: Core Concepts
• Hadoop Cluster Parameters
  - Connection with a Hadoop cluster
  - Verifies that the user has access to the cluster and that HDFS is accessible
• Hadoop XLog
  - Extends the XLog interface
  - Only a reference to the file is kept when it is imported
  - Actually loaded into memory only when a plugin requests some information
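The lazy-loading behaviour of the Hadoop XLog can be sketched as a simple proxy (illustrative Python; the real class extends the XLog Java interface in ProM, and all names here are hypothetical):

```python
class HadoopXLog:
    """Illustrative proxy: importing keeps only the HDFS path;
    the traces are materialized on first access."""

    def __init__(self, hdfs_path, loader):
        self.hdfs_path = hdfs_path   # reference kept at import time
        self._loader = loader        # function that actually reads the log
        self._traces = None          # nothing loaded into memory yet

    @property
    def traces(self):
        if self._traces is None:     # load only when a plugin asks
            self._traces = self._loader(self.hdfs_path)
        return self._traces

# Record every actual read to show that importing alone loads nothing
loads = []
log = HadoopXLog("/logs/log1.xes", lambda p: loads.append(p) or [["a", "b"]])
assert loads == []                   # import: only the reference exists
_ = log.traces                       # first request triggers the real load
assert loads == ["/logs/log1.xes"]
```

The same idea keeps huge logs usable in ProM: plugins that only forward the reference to the cluster never pay the cost of loading the log locally.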
Integration within ProM: Basic Operation
1. Connect to the Hadoop cluster
2. Virtually import the log
3. Send the executable jar file
4. Execute the MapReduce job
5. Retrieve the result
6. Get the final process model

Integration within ProM: Screenshot
[Figure: screenshot of the ProM plugin]
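The six steps can be sketched as one orchestration routine (all callables and names here are hypothetical stand-ins, not the plugin's real API, which drives the cluster through Hadoop's Java client):

```python
def discover_process_model(connect, upload, submit, build_model,
                           local_jar, hdfs_log_path, miner):
    """Illustrative orchestration of the plugin's six basic steps."""
    session = connect()                               # 1. connect to the Hadoop cluster
    log_ref = hdfs_log_path                           # 2. virtual import: keep only the HDFS path
    upload(session, local_jar)                        # 3. send the executable jar file
    job = submit(session, local_jar, log_ref, miner)  # 4. execute the MapReduce job
    result = job()                                    # 5. retrieve the result
    return build_model(result, miner)                 # 6. build the final process model

# Dummy callables standing in for a real cluster, for illustration only
steps = []
model = discover_process_model(
    connect=lambda: steps.append("connect") or "session",
    upload=lambda s, jar: steps.append("upload"),
    submit=lambda s, jar, log, m: lambda: steps.append("run") or "dfg",
    build_model=lambda result, m: (m, result),
    local_jar="discovery.jar",
    hdfs_log_path="/logs/log1.xes",
    miner="alpha",
)
# model == ("alpha", "dfg"); steps == ["connect", "upload", "run"]
```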
Evaluation: Experimental setup: Hardware configuration
• AIS Hadoop cluster
• 1 master node
  - 8 Intel Xeon E5430 CPU cores at 2.66 GHz
  - 32 GB of RAM
  - 5 × 300 GB hard disks
• 4 worker nodes
  - 8 Intel Xeon E5430 CPU cores at 2.66 GHz
  - 64 GB of RAM
  - 8 × 1 TB hard disks
Evaluation: Experimental setup: Hadoop configuration
• Apache Hadoop 2.6.0
• Up to 16 tasks (virtual cores) per worker node
• Up to 56 GB of RAM per worker node
• HDFS block size: 256 MB
• 2 replicas per block
• Master node: NameNode and ResourceManager services
• Worker nodes: DataNode and NodeManager services
Evaluation: Experimental setup: Process mining techniques
• Alpha Miner
  - No configuration parameters
• Inductive Miner
  - Inductive Miner infrequent
  - Noise threshold: 0.2
• Flexible Heuristics Miner
  - Heuristics: all tasks connected and long-distance dependency
  - Dependency threshold: 90.0
  - Relative-to-best threshold: 5.0
Evaluation: Experimental setup: Datasets
• Synthetic datasets
  - Process tree of 40 activities
  - Random generation
• Log 1
  - Average: 35 events per trace
• Log 2
  - 2 iterations of the synthetic dataset
  - 40 activities
  - Average: 70 events per trace
• Log 3
  - 2nd iteration with renamed activities
  - 80 activities
  - Average: 70 events per trace
Evaluation: Experimentation: Putting logs in HDFS
[Figure: time (minutes, 0–70) to put the logs in HDFS versus log size (0–416 GB), for Log 1, Log 2 and Log 3]
Evaluation: Experimentation: Scalability with log size (Log 1)
[Figure: time (minutes, 0–25) versus log size (0–256 GB) for the Alpha Miner and the Inductive Miner]
Evaluation: Experimentation: Scalability with log size (Log 1)
[Figure: time (minutes, 0–200) versus log size (0–256 GB) for the Flexible Heuristics Miner, split into its XES-to-DGraph and DGraph-to-AnnotatedDGraph phases]
Evaluation: Experimentation: Scalability with worker nodes (Log 1)
[Figure: speed-up (0–4) when computing the directly-follows graph versus log size (0–256 GB), with 1, 2, 3 and 4 workers]
Evaluation: Experimentation: Scalability with worker nodes (Log 1)
[Figure: speed-up (0–4) for the Flexible Heuristics Miner versus log size (0–256 GB), with 1, 2, 3 and 4 workers]
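Speed-up in these plots is presumably the usual ratio of the 1-worker runtime to the n-worker runtime (the slides do not state the formula); with made-up timings for illustration:

```python
def speedup(t_one_worker, t_n_workers):
    # speed-up = T(1 worker) / T(n workers); the ideal value on n workers is n
    return t_one_worker / t_n_workers

# Hypothetical timings (minutes), not taken from the experiments
s = speedup(120.0, 40.0)       # 3.0x on, say, 4 workers
efficiency = s / 4             # 0.75: fraction of the ideal 4x speed-up
```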
Evaluation: Experimentation: Comparison of the 3 datasets
[Figure: time (minutes, 0–50) versus log size (0–416 GB) for the Inductive Miner on Log 1, Log 2 and Log 3]
Evaluation: Experimentation: Comparison of the 3 datasets
[Figure: time (minutes, 0–800) versus log size (0–416 GB) for the Flexible Heuristics Miner on Log 1, Log 2 and Log 3]
Summary and Future Work: Conclusions
• MapReduce-based approach for process discovery
  - Split the input event log into smaller sublogs
  - Map phase: compute intermediate data from the sublogs
  - Reduce phase: aggregate all the data
  - Build the process model in ProM
• Results show the approach is scalable with respect to
  - Log size (number of events and traces)
  - Computing resources
• Integration of Apache Hadoop within ProM
  - Reusing already implemented algorithms
  - Developing new Hadoop-based techniques
Summary and Future Work: Future work
• Develop new process discovery algorithms
  - ILP Miner, Social Network Miner, etc.
• Extend to other process mining dimensions
  - Computing alignments
• Explore other input formats
  - CSV
  - Apache Avro
• Explore other distributed computing approaches
  - Apache Spark
  - Cloud computing
Enabling Large-Scale Process Discovery
Sergio Hernández de Mesa