
Laboratoire d’Informatique de Grenoble

A Parallel Pattern Mining Algorithm for Multi-Core Architectures

Benjamin Negrevergne

Jury:

• Marie-Christine Rousset, LIG/Grenoble University

• Alexandre Termier, LIG/Grenoble University

• Jean-François Méhaut, LIG/Grenoble University

• Hiroki Arimura, Graduate School of Science and Technology/Hokkaido University

• Bruno Crémilleux, GREYC/University of Caen

• Anne Laurent, LIRMM/University of Montpellier


Pattern mining

Pattern mining is a sub-topic of data mining

Knowledge discovery in databases [Fayyad, 96]:

“Identifying valid, novel, potentially useful, and ultimately understandable patterns in data”

Pattern mining:

• Extract knowledge as patterns representing regularities (or irregularities) in data.


Broad range of applications

Mining frequent sets of items purchased together

Possible frequent pattern: {milk, cereals}

• Dataset: set of receipts

• Patterns: sets of items

Mining frequent sub-molecules in a molecular database

Krammer [2001] was able to identify the following molecular pattern in a set of anti-HIV molecules

[Figure: molecular pattern drawn as a graph of O, N and H atoms]

• Dataset: set of molecules (represented as graphs)

• Patterns: graphs

Mining correlations in sensor data

Records from meteorological sensors

Frequent pattern: Temperature ↓ Pressure ↓ Windspeed ↑

• Dataset: list of numerical sensor records

• Patterns: sets of variations


The broad range of solutions

Nowadays: many different algorithms for each pattern mining problem

• Frequent itemset mining: Apriori [Agrawal 94], Eclat [Zaki 1997], FP-Growth [Han 2004], DCI-Closed [Lucchese 2004], LCM [Uno 2004]

• Graph mining: AGM [Inokuchi 2000], gSpan [Yan 2002], FSG [Kuramochi 2001], CloseGraph [Yan 2003], Gaston [Nijssen 2004]

• Tree mining: FreqT [Asai 2002], TreeMiner [Zaki 2005], Dryade [Termier 2004], CMTreeMiner [Chi 2004]

• Gradual itemset mining: Grite [Di Jorio 2009], PGP-mc [Laurent 2010], PGLCM [Do 2010]

• Sequence mining: Spade [Zaki 2000], PrefixSpan [Pei 2000], CloSpan [Yan 2003], LCM seq [Uno]


Challenge #1: Genericity

This algorithmic diversity leads data owners to:

1. refrain from using pattern mining techniques

2. use inadequate algorithms

Challenge #1:

Use a generic approach to address different pattern mining problems with a single algorithm.


Challenge #2: Build an efficient and generic pattern mining algorithm

Pattern mining is inherently combinatorial: ▸ very large number of possible patterns

Example

Market basket analysis: 1000 items → 2^1000 (∼ 10^301) possible patterns.
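The size of this search space can be checked with a few lines of Python. This is only an arithmetic illustration (the variable names are not from the slides): each of the 1000 items is either in a candidate pattern or not, so there are 2^1000 candidates.

```python
# Number of candidate patterns over a ground set of 1000 items:
# each item is either in the pattern or not, hence 2**1000 subsets.
n_items = 1000
n_candidates = 2 ** n_items

# Python integers have arbitrary precision, so the count is exact.
n_digits = len(str(n_candidates))
print(n_digits)  # number of decimal digits of 2**1000
```

A count with this many digits rules out any exhaustive enumeration, which is what motivates the search-space reductions discussed next.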

To be efficient, algorithms exploit problem properties to reduce the search space ▸ efficient but non-generic

Challenge #2

Design an efficient and generic pattern mining algorithm


Contributions of this thesis

We proposed ParaMiner: an algorithm that is both generic and efficient

How:

• Extend theoretical work on pattern enumeration [Boley 2007] and [Arimura 2009]

• Tackle large real-world datasets through dataset reduction

• Exploit parallelism for multi-core architectures

Results

ParaMiner:

• solves various pattern mining problems

• is time-efficient (competes with ad-hoc algorithms)


Outline

Generic framework and problem statement

ParaMiner
  • Efficient exploration of the set of candidate patterns
  • Speeding up candidate pattern testing
  • Parallel execution of ParaMiner

Experiments
  • Parallel performance evaluation
  • Comparative experiments

Conclusion and future work


Illustration: frequent itemset mining


[Figure: three receipts over the items milk, cereals, butter, eggs, bread and beer. Each receipt is a transaction (t1, t2, t3). Milk appears in all three receipts; cereals appear in receipts 1 and 2.]

Minimum frequency threshold = 50%

• P1 = {milk} is frequent (100%): t1, t2, t3 → closed

• P2 = {cereals} is frequent (66%): t1, t2

• P3 = {milk, cereals} is frequent (66%): t1, t2 → closed

Closed pattern: maximal pattern in a set of transactions
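The illustration can be reproduced with a minimal Python sketch. Only the placement of milk and cereals is known from the slide; the receipts assigned to butter, eggs, bread and beer, and the helper names `support`, `frequency` and `is_closed`, are assumptions made for this example.

```python
# Toy dataset: three receipts. Only milk and cereals are placed as on the
# slide; the placement of the other items is an assumption.
receipts = {
    "t1": {"milk", "cereals", "butter"},
    "t2": {"milk", "cereals", "eggs"},
    "t3": {"milk", "bread", "beer"},
}
MIN_FREQ = 0.5  # minimum frequency threshold = 50%


def support(pattern):
    """Names of the receipts that contain every item of the pattern."""
    return sorted(t for t, items in receipts.items() if pattern <= items)


def frequency(pattern):
    """Fraction of the receipts that contain the pattern."""
    return len(support(pattern)) / len(receipts)


def is_closed(pattern):
    """A pattern is closed if it equals the intersection of its supporting
    receipts, i.e. no item can be added without losing a receipt."""
    supporting = [receipts[t] for t in support(pattern)]
    return bool(supporting) and set.intersection(*supporting) == pattern


for p in [{"milk"}, {"cereals"}, {"milk", "cereals"}]:
    print(sorted(p), support(p), frequency(p) >= MIN_FREQ, is_closed(p))
```

{cereals} is frequent but not closed because milk can be added without losing any supporting receipt, which is why only P1 and P3 are marked as closed.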


Generic framework

▸ Every pattern is represented as a set

▸ A pattern mining problem is defined by:

• A ground set E

• A dataset D_E

• A selection criterion Select


Ground set

Definition

• Set of all possible elements

• Every candidate pattern is a subset of the ground set

Examples

Frequent itemsets ▸ E: all possible items
E = {milk, cereals, eggs, ...}, e.g. candidate: {milk, eggs}

Connected relational graphs ▸ E: all the possible arcs
E = {(G1,G2), (G1,G3), ..., (G5,G4)}, where (Gx,Gy) represents the arc Gx → Gy
e.g. candidate: [Figure: a graph over the nodes Gx, Gy and Gz]

Gradual itemsets ▸ E: all the possible variations
E = {T↑, T↓, P↑, ..., W↓}, e.g. candidate: {T↑, W↑}


Dataset

Definition

• Sequence of transactions D_E

• Each transaction is a subset of E

Examples

Frequent itemsets ▸ in D_E each transaction = receipt

Relational graphs ▸ in D_E each transaction = input graph

[Figure: two input graphs over the nodes G1, G2, G3 and G4] ⇒
[{(G1,G2), (G3,G1), (G3,G2), (G4,G1)},
 {(G1,G2), (G3,G2), (G4,G1)}]


Dataset (2)

Gradual itemsets ▸ in D_E each transaction = variation of attributes between pairs of records

Date   T     P        W
Day 1  17.6  1021.20  57
Day 2  18.5  1021.30  56
Day 3  20.4  1018.20  51

Pair      Transaction
(d1, d2)  {T↑, P↑, W↓}
(d1, d3)  {T↑, P↓, W↓}
(d2, d1)  {T↓, P↓, W↑}
(d2, d3)  {T↑, P↓, W↓}
(d3, d1)  {T↓, P↑, W↑}
(d3, d2)  {T↓, P↑, W↑}
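The encoding of the sensor records can be sketched in Python as follows. The function name `variations` and the handling of equal values are assumptions for illustration; the record values are those of the table.

```python
from itertools import permutations

# Sensor records from the slide: temperature T, pressure P, wind speed W.
records = {
    "d1": {"T": 17.6, "P": 1021.20, "W": 57},
    "d2": {"T": 18.5, "P": 1021.30, "W": 56},
    "d3": {"T": 20.4, "P": 1018.20, "W": 51},
}


def variations(a, b):
    """Variations of each attribute when going from record a to record b.
    Equal values are skipped (an assumption; the slide has no such case)."""
    out = set()
    for attr in a:
        if b[attr] > a[attr]:
            out.add(attr + "↑")
        elif b[attr] < a[attr]:
            out.add(attr + "↓")
    return out


# One transaction per ordered pair of distinct records.
dataset = {
    (x, y): variations(records[x], records[y])
    for x, y in permutations(records, 2)
}

for pair, transaction in dataset.items():
    print(pair, sorted(transaction))
```

With n records this encoding yields n(n-1) transactions, so the dataset grows quadratically with the number of records.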


Selection criterion

Definition

• User-defined predicate 2^E → {true, false}

• Select(P, D_E) ≡ P is a pattern of interest in D_E

Examples

Frequent itemsets ▸ Select(P, D_E) ≡ P is frequent in D_E

Connected relational graphs ▸ Select(P, D_E) ≡ P is a connected graph and P is frequent in D_E

Gradual itemsets ▸ Select(P, D_E) ≡ there exists a path [(d_x1, d_x2), ..., (d_x(n−1), d_xn)] with n ≥ minsup
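As an illustration, the first two selection criteria could be written as plain Python predicates. This is a sketch under assumptions: the function names are hypothetical, `minsup` is taken as an absolute number of transactions, connectivity ignores arc direction, and the empty pattern is treated as not connected.

```python
def frequency(pattern, dataset):
    """Number of transactions that contain the whole pattern."""
    return sum(1 for t in dataset if pattern <= t)


def select_frequent(pattern, dataset, minsup):
    """Select for frequent itemsets: P occurs in at least minsup transactions."""
    return frequency(pattern, dataset) >= minsup


def is_connected(arcs):
    """True if the arcs form a single connected graph (direction ignored)."""
    if not arcs:
        return False
    adjacent = {}
    for a, b in arcs:
        adjacent.setdefault(a, set()).add(b)
        adjacent.setdefault(b, set()).add(a)
    start = next(iter(adjacent))
    seen, stack = {start}, [start]
    while stack:
        for n in adjacent[stack.pop()] - seen:
            seen.add(n)
            stack.append(n)
    return seen == set(adjacent)


def select_connected_frequent(pattern, dataset, minsup):
    """Select for connected relational graphs: connected and frequent."""
    return is_connected(pattern) and select_frequent(pattern, dataset, minsup)


# The two input graphs of the relational graph example, as sets of arcs.
graphs = [
    {("G1", "G2"), ("G3", "G1"), ("G3", "G2"), ("G4", "G1")},
    {("G1", "G2"), ("G3", "G2"), ("G4", "G1")},
]
print(select_connected_frequent({("G1", "G2"), ("G4", "G1")}, graphs, 2))
print(select_connected_frequent({("G3", "G2"), ("G4", "G1")}, graphs, 2))
```

Both candidate arc sets occur in the two input graphs, but only the first one is connected (through G1), so only the first one is selected.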


General pattern mining problem

Closed patterns

Closed pattern: any subset P of E such that

• P occurs in D_E (D_E[P] ≠ ∅, where D_E[P] denotes the transactions of D_E that contain P)

• P satisfies the selection criterion

• P is the maximal pattern in D_E[P]

Problem statement

Given a ground set E, a dataset D_E and a selection criterion ▸ extract all the closed patterns


The generate and test principle

The standard approach: generate and test.

1. Generate candidate patterns

2. Test whether each candidate pattern is a pattern

... is not naively applicable

• Too many candidate patterns

• Each test requires costly database accesses (e.g. to count frequency)

▸ Challenges

• Generic strategy to explore the set of candidate patterns

• Generic method to simplify the testing
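To make the cost of the naive approach concrete, here is a minimal generate-and-test baseline in Python. It is not ParaMiner: the function names and the toy dataset are hypothetical, and the selection criterion is a simple frequency test with an absolute threshold.

```python
from itertools import combinations


def naive_generate_and_test(ground_set, dataset, select):
    """Enumerate every non-empty subset of the ground set and keep those
    accepted by the selection criterion. Exponential in |ground_set|."""
    patterns = []
    tested = 0
    elements = sorted(ground_set)
    for size in range(1, len(elements) + 1):
        for candidate in combinations(elements, size):
            tested += 1                      # one full dataset scan each
            if select(frozenset(candidate), dataset):
                patterns.append(frozenset(candidate))
    return patterns, tested


# Hypothetical toy data: frequent itemsets with an absolute threshold of 2.
dataset = [{"a", "b", "c"}, {"a", "b"}, {"a", "d"}]
ground_set = set().union(*dataset)


def is_frequent(pattern, transactions, minsup=2):
    return sum(1 for t in transactions if pattern <= t) >= minsup


patterns, tested = naive_generate_and_test(ground_set, dataset, is_frequent)
print(tested, [sorted(p) for p in patterns])
```

Even on four items, most of the tested candidates are rejected, and each test scans the whole dataset; both costs are what the two challenges above aim to reduce.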


Efficient exploration of the set of candidate patterns

Structured search space

Most algorithms define a pattern augmentation relation ▸ the search space is explored by repeatedly augmenting patterns

ParaMiner’s augmentation relation

P and Q, two patterns:

▸ Q is an augmentation of P ⇔ Q = P ∪ {e} for some e ∈ E \ P

e.g. ACE → ABCE

Set of patterns + augmentation relation = DAG (directed acyclic graph) structure


Structured search space

DAG-structure of a set of patterns

A B C D E

AB AC AD AE BC BD BE CD CE DE

ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE

ABCD ABCE ABDE ACDE BCDE

ABCDE

[Legend: each node is marked as either a pattern or a non-pattern]

▸ Several arcs lead to the same pattern
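The fact that several arcs lead to the same node follows directly from the augmentation relation, as this small Python sketch shows (the helper name `parents` is an assumption for illustration).

```python
def parents(pattern):
    """All sets P such that pattern = P ∪ {e} for a single element e."""
    return [pattern - {e} for e in sorted(pattern)]


abc = frozenset("ABC")
for p in parents(abc):
    print("".join(sorted(p)), "->", "".join(sorted(abc)))
```

A set of size k has k parents in the DAG, so a traversal that follows every arc would reach it k times.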


Discover the structured search space

An enumeration strategy

Is required to:

• discover all the patterns

• avoid generating duplicates

Which augmentation path must be followed?

[Figure: ABC can be reached from both AB and BC; only one of the two augmentations is followed, the other is cut (×)]

▸ BC is the first parent of ABC


Enumeration strategy

Follow the augmentation P → Q if:

first_parent(Q) = P

DAG-structure following a tree search

A B C D E

AB AC AD AE BC BD BE CD CE DE

ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE

ABCD ABCE ABDE ACDE BCDE

ABCDE

[Legend: each node is marked as either a pattern or a non-pattern]
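The effect of this rule can be sketched in Python on the five-element lattice. The first parent used here (drop the smallest element, so that BC is the first parent of ABC) is only an illustrative convention, and every subset is treated as a pattern; ParaMiner's actual first-parent test is the one discussed in the following slides.

```python
def first_parent(q):
    """Illustrative choice: drop the smallest element of q."""
    return q - {min(q)}


def enumerate_tree(ground_set):
    """Depth-first search that follows P -> Q only if first_parent(Q) == P."""
    visited = []

    def explore(p):
        visited.append(p)
        for e in sorted(ground_set - p):
            q = p | {e}
            if first_parent(q) == p:   # otherwise q is reached from elsewhere
                explore(q)

    for e in sorted(ground_set):
        explore(frozenset({e}))
    return visited


visited = enumerate_tree(frozenset("ABCDE"))
print(len(visited), len(set(visited)))
```

Because each node has exactly one first parent, the arcs that are followed form a tree, and every subset is visited exactly once.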


How to compute the first parent of a pattern

Requirement

A method that is:

• Generic ▸ sound for all our pattern mining problems

• Suited to parallel exploration of the search space ▸ computable without global computations

State of the art

Theoretical work on closed pattern enumeration strategies:

• [Arimura et al. 2009]

• [Boley et al. 2010]

▸ Poly-space enumeration strategies

▸ The first parent does not rely on a global representation of the search space


Structural properties of the set of patterns

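[Boley 2010] Strong accessibility

Idea: every pattern can be reached by augmenting any of its subsets that is itself a pattern

The first parent can be detected by memorizing only the roots of the forking branches, in an exclusion list (EL):

[Figure: enumeration tree rooted at D, with children DA and DE; DE has children DEB and DEF. The exclusion lists shown along the branches are EL = ∅, EL = {A} and EL = {A, B}.]

• Branches can be explored independently → Parallel

• FIM, CRG and GRI (frequent itemsets, connected relational graphs, gradual itemsets) satisfy this property (proved) → Generic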
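For frequent itemsets, the role of the exclusion list can be sketched in Python as below. This is a simplified illustration in the spirit of the closure-based enumeration of [Boley et al. 2010], not ParaMiner's implementation; the function names and the toy data are assumptions, the dataset is assumed non-empty, and `minsup` is an absolute count.

```python
def closure(pattern, dataset):
    """Intersection of the transactions containing the pattern
    (None if the pattern does not occur)."""
    supporting = [t for t in dataset if pattern <= t]
    if not supporting:
        return None
    return frozenset.intersection(*supporting)


def mine_closed(transactions, minsup):
    """Closed frequent itemsets, each generated once thanks to the EL."""
    dataset = [frozenset(t) for t in transactions]
    ground_set = frozenset().union(*dataset)
    found = []

    def expand(pattern, exclusion_list):
        found.append(pattern)
        el = set(exclusion_list)          # local copy: no global state
        for e in sorted(ground_set - pattern):
            if e in el:
                continue
            candidate = pattern | {e}
            support = sum(1 for t in dataset if candidate <= t)
            if support >= minsup:
                closed = closure(candidate, dataset)
                # first-parent test: reject if an excluded element reappears
                if not (closed & el):
                    expand(closed, el)
            el.add(e)                     # e is now the root of a tried branch

    # The root is the closure of the empty set (it may itself be empty).
    expand(closure(frozenset(), dataset), set())
    return found


# Hypothetical toy data.
data = [{"a", "b", "c"}, {"a", "b"}, {"a", "d"}]
result = mine_closed(data, 1)
print([sorted(p) for p in result])
```

Each recursive call only needs its pattern and its own copy of the exclusion list, which is what allows branches to be handed to different cores without shared state.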


Conclusion on pattern enumeration

• Pattern augmentation

• Structured search space

• Enumeration strategy required

• Problem of detecting the first parent

• Strong accessibility property

• Exclusion list

▸ Generic and parallelizable enumeration strategy

Next: Identifying patterns


Simplifying candidate pattern testing

Candidate pattern testing

• Checking candidate pattern occurrence in the dataset

• Computing the selection criterion

• Computing pattern closure

▸ Intensive access to the dataset

Problem: how to reduce dataset accesses? ▸ dataset reduction [Han 2000, Uno 2004]


Dataset reduction: principle

Motivation: not every element of the dataset is required to test each candidate pattern

Principle:

1. build a reduced dataset for each new pattern P

2. filter out the elements that are unnecessary for P and its descendants

3. use the reduced dataset to test P and its descendants

In ParaMiner: EL-based filter ▸ generic


EL-based reduction (Proved)


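Dataset: D_E

Pattern: P = {D, E}

EL: EL = {A, B, C}

▸ D_P^reduced ?

In each transaction of D_E, the elements A, B, C belong to EL, D and E form P, and the remaining elements are augmentations of P (∉ EL):

t1  A, B, D, E, F, G   } group P1
t2  A, D, D, E, F, G   } group P1
t3  A, B, D, E, F, H   } group P2

⇓ D_P^reduced (within a group, the EL elements are intersected; the other elements are identical)

t′1 ⊕ t′2  A, D, E, F, G
t′3        A, B, D, E, F, H

• Element required: any e that can belong to a closed pattern including P ▸ ensures that first parent is sound

How to detect them:

• Group transactions by the augmentations they support

• Apply reduction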

Laboratoire d’Informatique de Grenoble 29

Page 55: A Parallel Pattern Mining Algorithm for Multi-Core …...Frequent itemset mining Apriori Eclat FP-Growth DCI-Closed LCM [Agrawal 94] [Zaki 1997] [Han 2004] [Lucchese 2004] [Uno 2004]

EL-based reduction (Proved)

Dataset: D_E

Pattern: P = {D, E}    EL: EL = {A, B, C}

▶ D_P^reduced ?

Transactions of D_E (items ∈ EL, then P, then augmentations ∉ EL):

t1  A, B, D, E, F, G   } P1
t2  A, D, D, E, F, G   } P1
t3  A, B, D, E, F, H   } P2

⇓ D_P^reduced (within a group, the EL items are intersected (⋂); the P items and the augmentations are equal (=))

t′1 ⊕ t′2   A, D, E, F, G
t′3         A, B, D, E, F, H

• Elements required: any e that can belong to a closed pattern including P ▶ ensures that the first-parent test is sound

How to detect them:

• Group transactions by the augmentations they support

• Apply the reduction
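
The grouping and merging steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not ParaMiner's implementation: the function name `el_reduce` and the per-transaction weight (kept so that support counts could still be recovered after merging) are hypothetical, and t2 is read as the set {A, D, E, F, G}.

```python
def el_reduce(dataset, pattern, el):
    """Merge the transactions that support the same augmentations of `pattern`.

    Returns a list of (transaction, weight) pairs. The weight is an
    assumption of this sketch: it records how many original transactions
    were merged into one.
    """
    groups = {}  # non-EL part of a transaction -> (intersected EL part, weight)
    for t in dataset:
        if not pattern <= t:
            continue  # t does not support the pattern
        key = frozenset(t - el)      # P plus the augmentations outside EL
        el_part = frozenset(t & el)  # items of t that belong to EL
        if key in groups:
            merged_el, weight = groups[key]
            groups[key] = (merged_el & el_part, weight + 1)
        else:
            groups[key] = (el_part, 1)
    return [(key | merged_el, weight) for key, (merged_el, weight) in groups.items()]


P = frozenset("DE")
EL = frozenset("ABC")
D_E = [frozenset("ABDEFG"), frozenset("ADEFG"), frozenset("ABDEFH")]
reduced = el_reduce(D_E, P, EL)
# t1 and t2 share the augmentations {F, G}: they are merged and keep only A from EL.
# t3 is alone in its group and is left unchanged.
```

Grouping on the non-EL part is what makes the merge safe in this sketch: two transactions in the same group differ only in their EL items, and intersecting those items keeps exactly the EL items common to the whole group.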

Evaluation of dataset reduction

■ Metric

$$\text{reduction factor} = \frac{|D_E|}{|D_P^{reduced}|}$$

$$\text{average reduction factor} = \frac{\sum_{P \in C} |D_E| / |D_P^{reduced}|}{|C|}$$

■ Frequent itemset mining, Mushroom dataset

dataset name   ground set size   # transactions   dataset size
Mushroom       119               8,124            186,852
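
The reduction factor of a pattern is the ratio between the size of the full dataset and the size of its reduced dataset, and the average is taken over all patterns in C. A small sketch of the computation, assuming that the size of a dataset is its total number of item occurrences (consistent with the Mushroom row, where 186,852 = 8,124 × 23). The function names are illustrative only.

```python
def dataset_size(dataset):
    # Assumption: |D| is the total number of item occurrences in D.
    return sum(len(t) for t in dataset)


def reduction_factor(full, reduced):
    return dataset_size(full) / dataset_size(reduced)


def average_reduction_factor(full, reduced_datasets):
    # `reduced_datasets` holds one reduced dataset per pattern P in C.
    factors = [reduction_factor(full, r) for r in reduced_datasets]
    return sum(factors) / len(factors)


full = [{"a", "b", "c", "d"}, {"a", "b", "c", "d"}, {"a", "b", "c", "d"}]  # size 12
reduced_for_p1 = [{"a", "b", "c"}, {"a", "b", "c"}]                        # size 6
reduced_for_p2 = [{"a", "b", "c"}]                                         # size 3
avg = average_reduction_factor(full, [reduced_for_p1, reduced_for_p2])
```

Here the two factors are 12/6 and 12/3, so the average reduction factor is 3.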

[Figure: reduction factor (0 to 350) vs. relative frequency threshold (%), 20 to 90. Problem: FIM, dataset: Mushroom.]

[Figure: time (s, log scale) vs. relative frequency threshold (%). Problem: FIM, dataset: Mushroom. Series: PARAMINERNODSR, PARAMINER.]

ParaMiner: algorithm

procedure expand(P, D_P^reduced, EL)
    for all e such that e occurs in D_P^reduced do
        if Select(P ∪ {e}, D_P^reduced) then
            Q ← Clo(P ∪ {e}, D_P^reduced)
            if is_first_parent(P, EL, Q) then
                output Q
                D_Q^reduced ← reduce(D_P^reduced, e, EL)
                spawn expand(Q, D_Q^reduced, EL)
                EL ← EL ∪ {e}
            end if
        end if
    end for
end procedure
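
To make the control flow of expand concrete, here is a sequential Python sketch instantiated for frequent itemset mining. Several simplifications are assumptions of this sketch and not statements about ParaMiner itself: Select is a minimum-support test, Clo is the intersection of the supporting transactions, the first-parent test is reduced to "the closure Q contains no element of EL", the dataset is never reduced, and `spawn` is replaced by a plain recursive call.

```python
def support(itemset, dataset):
    """Transactions of `dataset` that contain `itemset`."""
    return [t for t in dataset if itemset <= t]


def closure(itemset, dataset):
    """Clo: items shared by every transaction that contains `itemset`."""
    return frozenset.intersection(*support(itemset, dataset))


def expand(P, dataset, el, min_support, out):
    el = set(el)  # each call works on its own copy of the exclusion list
    candidates = sorted(set().union(*support(P, dataset)) - P)
    for e in candidates:
        cand = P | {e}
        if len(support(cand, dataset)) >= min_support:   # Select
            Q = closure(cand, dataset)                    # Clo
            if not (Q & el):                              # simplified first-parent test
                out.append(Q)
                # "spawn": a sequential recursive call in this sketch
                expand(Q, dataset, frozenset(el), min_support, out)
                el.add(e)


def mine(dataset, min_support):
    out = []
    root = closure(frozenset(), dataset)
    if root:
        out.append(root)
    expand(root, dataset, frozenset(), min_support, out)
    return out


data = [frozenset("abc"), frozenset("ab"), frozenset("ac"), frozenset("bc")]
patterns = mine(data, 2)
```

On this toy dataset the exclusion list prevents {a, b} from being produced a second time from {b}: by then a is in EL, so the closure {a, b} is rejected by the first-parent test.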

ParaMiner: main achievements

• Proved sound and complete under some constraints:
  ▶ the set of patterns satisfies the strong accessibility property
  ▶ the selection criterion is decomposable

  Problems FIM, CRG and GRI satisfy these properties

• Parallelized: the exploration of the search space is divided into independent tasks
  ▶ non-trivial for closed pattern mining

• Melinda: a parallel execution engine for data-driven algorithms
  ▶ can be extended with task scheduling strategies
  ▶ executes the tasks spawned by ParaMiner
  ▶ also used in PLCM [Negrevergne 2010] and PGLCM [Do 2010]

Outline

Generic framework and problem statement

ParaMiner
  Efficient exploration of the set of candidate patterns
  Speeding up candidate pattern testing
  Parallel execution of ParaMiner

Experiments
  Parallel performance evaluation
  Comparative experiments

Conclusion and future work

Melinda

■ Melinda: main concepts

• Tuples (A, B, C) ▶ blocks of data with a fixed structure

• Tuple space ▶ synchronized memory space
  ▶ can contain tuples
  ▶ supports put() and get()

■ Parallelizing ParaMiner with Melinda

ParaMiner parallelization scheme: a new tuple for each recursive call
Tuple structure: the parameters of the recursive call (pattern, reduced dataset, exclusion list)

• Tuples are produced when a pattern is discovered

• Tuples are consumed when a core is idle
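
The produce/consume scheme above can be illustrated with a toy tuple space in Python. Everything here is a sketch under stated assumptions: `TupleSpace`, `worker` and `run` are hypothetical names, Python's `queue.Queue` stands in for Melinda's synchronized memory space (the slides only state that put() and get() exist), and a simple subset enumeration stands in for ParaMiner's recursive calls.

```python
import queue
import threading


class TupleSpace:
    """Toy synchronized tuple space backed by queue.Queue (an assumption)."""

    def __init__(self):
        self._q = queue.Queue()

    def put(self, tup):
        self._q.put(tup)

    def get(self):
        return self._q.get()  # blocks while the space is empty (idle worker)

    def done(self):
        self._q.task_done()

    def wait_until_processed(self):
        self._q.join()  # returns once every tuple put so far has been processed


def worker(space, results, lock):
    while True:
        pattern, candidates = space.get()       # consumed when the worker is idle
        for i, e in enumerate(candidates):
            child = pattern + (e,)
            with lock:
                results.append(child)           # "a pattern is discovered"
            space.put((child, candidates[i + 1:]))  # one tuple per recursive call
        space.done()


def run(items, n_workers=3):
    space, results, lock = TupleSpace(), [], threading.Lock()
    for _ in range(n_workers):
        threading.Thread(target=worker, args=(space, results, lock), daemon=True).start()
    space.put(((), tuple(items)))               # root of the exploration
    space.wait_until_processed()
    return sorted(results)


subsets = run("abc")
```

Children are put into the space before the parent tuple is marked as done, so the count of unprocessed tuples cannot reach zero while work remains. The order in which workers pick up tuples varies between runs, which is why the results are sorted before being returned.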

Parallel execution

[Figure (animation): patterns a, ab, abd, abe, ad, bcd, df, def and e, exchanged as tuples between the Tuple Space and Core #1, Core #2 and Core #3.]

Parallel performance evaluation

■ ParaMiner instances

• frequent itemset mining
• frequent relational graph mining
• gradual itemset mining

■ Computing platforms

Laptop:

• 4 cores (4-core i7)

• 8 GB memory

Server:

• 32 cores (4 core i7 with each core each)

• 64 GB memory

■ Metric

$$\text{speedup} = \frac{\text{time using one core}}{\text{time using } n \text{ cores}}$$

Speedup measurements

■ 1. Laptop

▶ speedup on 4 cores: 3 to 4

■ 2. Server

Performance analysis

In [Buehrer 2006, Tatikonda 2008], the authors identify two important issues:

• load imbalance

• memory issues

■ Bus contention

Memory bus: connects the cores to memory
Too many memory accesses ▶ the bus is subject to contention

[Figure: Core #1, Core #2 and Core #3 connected to Memory through the memory bus.]

Bus contention in ParaMiner

■ % of high-latency memory operations vs. number of cores

[Figure: gradual itemset mining (good speedup). y-axis: % of memory operations (85 to 100); x-axis: #cores (1, 2, 4, 8, 16, 32).]

[Figure: frequent itemset mining (bounded speedup). y-axis: % of memory operations (85 to 100); x-axis: #cores (1, 2, 4, 8, 16, 32).]

Dark slices = high-latency memory operations

• GRI: more cores ▶ the % of high-latency memory operations remains constant

• FIM: more cores ▶ the % of high-latency memory operations increases

Melinda’s approach

[Tatikonda 2008] proposed an architecture-conscious pattern mining algorithm

Requires deep modifications of the algorithm

Inadequate for ParaMiner: ▶ could penalize executions that already perform well

■ Melinda’s tuple distribution strategies

• distribute tuples to the available cores

• user-defined
  ▶ can exploit algorithm-level information (in the tuples)
  ▶ can exploit architecture-level information (available in Melinda)

Tuple distribution strategies

■ Structured tuple space

The tuple space is divided into internals
▶ Internals can be used to classify tuples in the tuple space

■ Defining new strategies

• distribute() defines in which internal a tuple is put

• retrieve() defines from which internal a tuple is taken

▶ supports heavy usage

Architecture-conscious tuple distribution strategy

1. Distribute: tuples sharing data structures go into the same internal

2. Retrieve: cores of the same processor retrieve tuples from the same internal

▶ Improves cache usage and reduces bus contention

■ Frequent itemset mining, BMS-WebView-2 and Accidents datasets

dataset          # items   # transactions   dataset size   density (%)
BMS-WebView-2    3,340     77,512           320,601        0.14
Accidents        468       340,184          11,500,870     7.22

[Figure: speedup (1 to 10) vs. #cores (0 to 32), two panels. Series: Sdefault, Sarch, and the ideal speedup y=x.]

No source code modification ▶ 30% performance gain
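
The idea of a tuple space split into internals, with user-defined distribute() and retrieve() functions, can be sketched as follows. This is an illustration only, not Melinda's API: the class `StructuredTupleSpace`, the constants, the `dataset_id` field used to recognise tuples that share data structures, the modulo mapping, and the fallback to another internal when the preferred one is empty are all assumptions of this sketch.

```python
import threading
from collections import deque

N_PROCESSORS = 2         # hypothetical machine: 2 processors
CORES_PER_PROCESSOR = 4  # with 4 cores each


class StructuredTupleSpace:
    def __init__(self, n_internals, distribute, retrieve):
        self._internals = [deque() for _ in range(n_internals)]
        self._lock = threading.Lock()
        self._distribute = distribute  # tuple -> index of the internal to put it in
        self._retrieve = retrieve      # core id -> index of the internal to read from

    def put(self, tup):
        with self._lock:
            self._internals[self._distribute(tup)].append(tup)

    def get(self, core_id):
        """Return a tuple for `core_id`, or None if the whole space is empty."""
        with self._lock:
            preferred = self._retrieve(core_id)
            others = [i for i in range(len(self._internals)) if i != preferred]
            for i in [preferred] + others:  # fallback keeps idle cores busy
                if self._internals[i]:
                    return self._internals[i].popleft()
        return None


def distribute(tup):
    # Tuples built on the same dataset (same dataset_id) share an internal.
    pattern, dataset_id, exclusion_list = tup
    return dataset_id % N_PROCESSORS


def retrieve(core_id):
    # Cores of the same processor read from the same internal.
    return core_id // CORES_PER_PROCESSOR


space = StructuredTupleSpace(N_PROCESSORS, distribute, retrieve)
space.put(("ab", 0, ()))
space.put(("ad", 1, ()))
space.put(("abd", 0, ()))
```

With this setup, cores 0 to 3 are served from internal 0 first and cores 4 to 7 from internal 1 first, so tuples that work on the same dataset tend to be processed by cores that share a cache. The mining code itself does not change; only the two strategy functions do.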

Comparative experiments: frequent itemset mining

■ Algorithms

• PLCM [Negrevergne 2010]: parallel implementation of LCM [Uno 2004] (FIMI’04 award)

• MT-Closed [Lucchese 2007]: parallel implementation of DCI-Closed [Lucchese 2004]

■ Datasets

dataset          # items   # transactions   dataset size   density (%)
BMS-WebView-2    3,340     77,512           320,601        0.14
Accidents        468       340,184          11,500,870     7.22

$$\text{density}(D_E) = \frac{|D_E|}{\#\text{items} \times \#\text{transactions}} \times 100$$
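
A short check of the density values, reading density as the dataset size divided by (#items × #transactions), times 100. This reading is the one that reproduces the Accidents row above and the I4408 row used in the gradual itemset mining experiments. The function name is illustrative.

```python
def density(dataset_size, n_items, n_transactions):
    """Percentage of filled cells in the items x transactions matrix."""
    return 100 * dataset_size / (n_items * n_transactions)


accidents = density(11_500_870, 468, 340_184)
i4408 = density(50_985_072, 8_824, 11_556)
print(round(accidents, 2), round(i4408, 2))
```

A dense dataset such as Accidents fills a much larger share of the matrix than a sparse one such as BMS-WebView-2, which is why the two are used as contrasting test cases.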

Comparative experiments: frequent itemset mining (2)

■ BMS-WebView-2 (sparse)

[Figure: time (s, log scale) vs. relative frequency threshold (%), 0.01 to 0.09. Problem: FIM, dataset: BMS-WebView-2. Series: ParaMiner 32 cores, PLCM 32 cores, MT-Closed 32 cores.]

■ Accidents (dense)

[Figure: time (s, log scale) vs. relative frequency threshold (%), 10 to 90. Problem: FIM, dataset: Accidents. Series: ParaMiner 32 cores, PLCM 32 cores, MT-Closed 32 cores.]

Comparative experiments: gradual itemset mining

■ Algorithms

• PGLCM [Do 2010]

• PGP-mc (non-closed) [Laurent 2010]

■ Datasets

dataset name   ground set size   # transactions   dataset size   density (%)
I4408          8,824             11,556           50,985,072     50

Comparative experiments: gradual itemset mining (2)

■ I4408 (real)

[Figure: time (s, log scale) vs. relative frequency threshold (%), 70 to 95. Gradual itemset mining, dataset: I4408. Series: ParaMiner 32 cores, PGLCM.]

Conclusion

• ParaMiner
  ▶ based on the state of the art in pattern enumeration
  ▶ useful for parallel pattern mining
  ▶ introduces an efficient generic dataset reduction
  ▶ exploits multi-core architectures

• Melinda
  ▶ parallel engine of ParaMiner (and of other algorithms)
  ▶ extensible through strategies

■ Novelty

• ParaMiner: an efficient solution for new pattern mining problems
  New pattern mining problems can benefit from 15 years of research

• ParaMiner/Melinda: a tool to learn about parallel pattern mining

Future work

■ Extend ParaMiner’s genericity

• Generalize the strong accessibility property to other structures
  ▶ handle sequences and general graphs

• Study partial strong accessibility of sets of patterns
  ▶ bridge with constraint-based pattern mining

■ Extend ParaMiner/Melinda’s efficiency

• Improve Melinda’s strategies ▶ more flexible and expressive tuple querying

• Target larger computing platforms ▶ clusters of computers

Thank you !
