atrie-basedaprioriimplementationformining ...bodon/kozos/papers/bodon_osdm05_slide.pdf · some...

29
A Trie-based APRIORI Implementation for Mining Frequent Item Sequences Ferenc Bodon [email protected] Department of Computer Science and Information Theory, Budapest University of Technology and Economics supervisor: Lajos Rónyai Open Source Data Mining Workshop on Frequent Pattern Mining Implementations – p.1/14

Upload: others

Post on 07-Nov-2019

8 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: ATrie-basedAPRIORIImplementationforMining ...bodon/kozos/papers/bodon_osdm05_slide.pdf · some basic functions (e.g. fast subset enumerator), modularized Apriori, eclat, fp-growth,

A Trie-based APRIORI Implementation for MiningFrequent Item Sequences

Ferenc Bodon

[email protected]

Department of Computer Science and Information Theory,

Budapest University of Technology and Economics

supervisor: Lajos Rónyai

Open Source Data Mining Workshop on Frequent Pattern Mining Implementations – p.1/14

Page 2: ATrie-basedAPRIORIImplementationforMining ...bodon/kozos/papers/bodon_osdm05_slide.pdf · some basic functions (e.g. fast subset enumerator), modularized Apriori, eclat, fp-growth,

The hierarchy of the pattern types

itemsetitem sequence

without duplicatesitem sequencewith duplicates

rooted tree

sequence of itemsets

tree graph

Open Source Data Mining Workshop on Frequent Pattern Mining Implementations – p.2/14

Page 3: ATrie-basedAPRIORIImplementationforMining ...bodon/kozos/papers/bodon_osdm05_slide.pdf · some basic functions (e.g. fast subset enumerator), modularized Apriori, eclat, fp-growth,

The hierarchy of the pattern types

itemsetitem sequence

without duplicatesitem sequencewith duplicates

rooted tree

sequence of itemsets

tree graph

ordered/unordered

Open Source Data Mining Workshop on Frequent Pattern Mining Implementations – p.2/14

Page 4: ATrie-basedAPRIORIImplementationforMining ...bodon/kozos/papers/bodon_osdm05_slide.pdf · some basic functions (e.g. fast subset enumerator), modularized Apriori, eclat, fp-growth,

The hierarchy of the pattern types

itemsetitem sequence

without duplicatesitem sequencewith duplicates

rooted tree

sequence of itemsets

tree graph

ordered/unordered directed/undirected

Open Source Data Mining Workshop on Frequent Pattern Mining Implementations – p.2/14

Page 5: ATrie-basedAPRIORIImplementationforMining ...bodon/kozos/papers/bodon_osdm05_slide.pdf · some basic functions (e.g. fast subset enumerator), modularized Apriori, eclat, fp-growth,

The hierarchy of the pattern types

itemsetitem sequence

without duplicatesitem sequencewith duplicates

rooted tree

sequence of itemsets

tree graph

ordered/unordered directed/undirected

vertex labeled/unlabelededge labeled/unlabeled

Open Source Data Mining Workshop on Frequent Pattern Mining Implementations – p.2/14

Page 6: ATrie-basedAPRIORIImplementationforMining ...bodon/kozos/papers/bodon_osdm05_slide.pdf · some basic functions (e.g. fast subset enumerator), modularized Apriori, eclat, fp-growth,

The hierarchy of the pattern types

itemsetitem sequence

without duplicatesitem sequencewith duplicates

rooted tree

sequence of itemsets

tree graph

ordered/unordered directed/undirected

vertex labeled/unlabelededge labeled/unlabeled

At each generalization step we have to examine• new problems, that come into play,• applicability of the existing techniques at previous level.

Open Source Data Mining Workshop on Frequent Pattern Mining Implementations – p.2/14

Page 7: ATrie-basedAPRIORIImplementationforMining ...bodon/kozos/papers/bodon_osdm05_slide.pdf · some basic functions (e.g. fast subset enumerator), modularized Apriori, eclat, fp-growth,

The hierarchy of the pattern types

itemsetitem sequence

without duplicatesitem sequencewith duplicates

rooted tree

sequence of itemsets

tree graph

ordered/unordered directed/undirected

vertex labeled/unlabelededge labeled/unlabeled

We provide an efficient, open-source, trie-based APRIORI imple-

mentation for mining frequent sequences of items in a transac-

tional database

Open Source Data Mining Workshop on Frequent Pattern Mining Implementations – p.2/14

Page 8: ATrie-basedAPRIORIImplementationforMining ...bodon/kozos/papers/bodon_osdm05_slide.pdf · some basic functions (e.g. fast subset enumerator), modularized Apriori, eclat, fp-growth,

The background

We developed a FIM template library that includes• a fast IO framework,• some basic functions (e.g. fast subset enumerator),• modularized Apriori, eclat, fp-growth, nonord-fp algorithms,• tester classes,• a benchmark environment.

In Apriori of FIM itemset are converted to item sequences, hence

it is natural to extend FIM approach.

Open Source Data Mining Workshop on Frequent Pattern Mining Implementations – p.3/14

Page 9: ATrie-basedAPRIORIImplementationforMining ...bodon/kozos/papers/bodon_osdm05_slide.pdf · some basic functions (e.g. fast subset enumerator), modularized Apriori, eclat, fp-growth,

General Properties of the Trie

• The representation of the list of edges◦ label ordered list of (label, pointer) pairs,◦ tabular representation (also with offset-index trick),◦ hybrid solution

is set by a template class• no parent pointers are stored,• dead-end branches are removed asap,• classes that do◦ support counting,◦ candidate generation

are set by template parameters.

Open Source Data Mining Workshop on Frequent Pattern Mining Implementations – p.4/14

Page 10: ATrie-basedAPRIORIImplementationforMining ...bodon/kozos/papers/bodon_osdm05_slide.pdf · some basic functions (e.g. fast subset enumerator), modularized Apriori, eclat, fp-growth,

Investigated techniques

1. Routing strategy at the nodes,

2. candidate generation,

3. equisupport extension,

4. filtered transaction caching.

Our environment makes it possible to examine each technique

separately, and also to examine the effect on each other.

Open Source Data Mining Workshop on Frequent Pattern Mining Implementations – p.5/14

Page 11: ATrie-basedAPRIORIImplementationforMining ...bodon/kozos/papers/bodon_osdm05_slide.pdf · some basic functions (e.g. fast subset enumerator), modularized Apriori, eclat, fp-growth,

1/4. Routing Strategy

How to find the edge to follow in Apriori? Given a node with a listof n edges and a part of the filtered transaction (t′), findmatching labels.

• search for corresponding item◦ if transaction is stored in a list: nt′ comparisons,◦ indexarray solution,

• search for corresponding label◦ tabular representation of edge: t′ comparisons,◦ binary search: t′ log n comparisons,◦ linear search: t′n comparisons,◦ intelligent linear search: t′n/2 comparisons,

• simultaneous traversal (merge)◦ first sort, then remove duplicates,◦ first remove duplicates, then sort.

Open Source Data Mining Workshop on Frequent Pattern Mining Implementations – p.6/14

Page 12: ATrie-basedAPRIORIImplementationforMining ...bodon/kozos/papers/bodon_osdm05_slide.pdf · some basic functions (e.g. fast subset enumerator), modularized Apriori, eclat, fp-growth,

1/4. Routing Strategy

• No single best routing strategy.• The best routing strategy depends on◦ size of transaction,◦ the number of duplicates,◦ min_supp◦ · · ·

• Strategies with large overhead are not competitive at largesupport thresholds.

• Search corresponding item performed always good.• It is more important how does the strategy suit to the

features (prefetching, pipelining, etc.) of the modernprocessors than the worst case run-times.

Open Source Data Mining Workshop on Frequent Pattern Mining Implementations – p.7/14

Page 13: ATrie-basedAPRIORIImplementationforMining ...bodon/kozos/papers/bodon_osdm05_slide.pdf · some basic functions (e.g. fast subset enumerator), modularized Apriori, eclat, fp-growth,

2/4. Candidate Generation

Same as in FIM, there solutions can be applied

1. complete pruning with subset checks,

2. complete pruning with an intersection-based solution,

3. omit complete pruning.

Open Source Data Mining Workshop on Frequent Pattern Mining Implementations – p.8/14

Page 14: ATrie-basedAPRIORIImplementationforMining ...bodon/kozos/papers/bodon_osdm05_slide.pdf · some basic functions (e.g. fast subset enumerator), modularized Apriori, eclat, fp-growth,

2/4. Candidate Generation

Same as in FIM, there solutions can be applied

1. complete pruning with subset checks,

2. complete pruning with an intersection-based solution,

3. omit complete pruning.

Our expectation:Complete pruning in frequent item sequence mining ismore important than in FIM.

Open Source Data Mining Workshop on Frequent Pattern Mining Implementations – p.8/14

Page 15: ATrie-basedAPRIORIImplementationforMining ...bodon/kozos/papers/bodon_osdm05_slide.pdf · some basic functions (e.g. fast subset enumerator), modularized Apriori, eclat, fp-growth,

2/4. Candidate Generation

Same as in FIM, there solutions can be applied

1. complete pruning with subset checks,

2. complete pruning with an intersection-based solution,

3. omit complete pruning.

Our expectation:Complete pruning in frequent item sequence mining ismore important than in FIM.Reasoning:Support counting is slower, while determining if a candi-date’s subsequence is contained in a trie is achieved asfast as in FIM.

Open Source Data Mining Workshop on Frequent Pattern Mining Implementations – p.8/14

Page 16: ATrie-basedAPRIORIImplementationforMining ...bodon/kozos/papers/bodon_osdm05_slide.pdf · some basic functions (e.g. fast subset enumerator), modularized Apriori, eclat, fp-growth,

2/4. Candidate Generation

Same as in FIM, there solutions can be applied

1. complete pruning with subset checks,

2. complete pruning with an intersection-based solution,

3. omit complete pruning.

Our expectation:Complete pruning in frequent item sequence mining ismore important than in FIM.Reasoning:Support counting is slower, while determining if a candi-date’s subsequence is contained in a trie is achieved asfast as in FIM.Experiments:Support our hypothesis. Omitting complete pruning neverresulted in a faster Apriori than intersection-based Apriori.

Open Source Data Mining Workshop on Frequent Pattern Mining Implementations – p.8/14

Page 17: ATrie-basedAPRIORIImplementationforMining ...bodon/kozos/papers/bodon_osdm05_slide.pdf · some basic functions (e.g. fast subset enumerator), modularized Apriori, eclat, fp-growth,

3/4. Equisupport Extension

Omitting equisupport extension is the most widely usedtechnique in FIM.Property 1. Let X ⊂ Y ⊆ I. If supp(X) = supp(Y ) thensupp(Y ∪ Z) = supp(X ∪ Z) for any Z ⊆ I \ Y .

Usage: It is not necessary to determine the support of Y ∪Z.

Open Source Data Mining Workshop on Frequent Pattern Mining Implementations – p.9/14

Page 18: ATrie-basedAPRIORIImplementationforMining ...bodon/kozos/papers/bodon_osdm05_slide.pdf · some basic functions (e.g. fast subset enumerator), modularized Apriori, eclat, fp-growth,

3/4. Equisupport Extension

Omitting equisupport extension is the most widely usedtechnique in FIM.Property 1. Let X ⊂ Y ⊆ I. If supp(X) = supp(Y ) thensupp(Y ∪ Z) = supp(X ∪ Z) for any Z ⊆ I \ Y .

Usage: It is not necessary to determine the support of Y ∪ Z.

0

100

200

300

400

500

600

30000 35000 40000 45000 50000 55000

time

(sec

)

Support threshold

Database: connect

NOPRUNEINTERSECT-PRUNE

NOPRUNE-NEE

Open Source Data Mining Workshop on Frequent Pattern Mining Implementations – p.9/14

Page 19: ATrie-basedAPRIORIImplementationforMining ...bodon/kozos/papers/bodon_osdm05_slide.pdf · some basic functions (e.g. fast subset enumerator), modularized Apriori, eclat, fp-growth,

3/4. Equisupport Extension

Omitting equisupport extension is the most widely usedtechnique in FIM.Property 1. Let X ⊂ Y ⊆ I. If supp(X) = supp(Y ) thensupp(Y ∪ Z) = supp(X ∪ Z) for any Z ⊆ I \ Y .

Usage: It is not necessary to determine the support of Y ∪ Z.Equisupport pruning in Apriori for FIM:

1. in infrequent candidate removal phase: |X| = |Y | − 1 and Xis prefix of Y , then do not extend Y ,

2. in candidate generation’s subset check phase: Y ∪ Z ispotential candidate, Y is non-prefix of Y ∪ Z, X is prefix ofY then the support of Y ∪ Z is not determined.

Open Source Data Mining Workshop on Frequent Pattern Mining Implementations – p.9/14

Page 20: ATrie-basedAPRIORIImplementationforMining ...bodon/kozos/papers/bodon_osdm05_slide.pdf · some basic functions (e.g. fast subset enumerator), modularized Apriori, eclat, fp-growth,

3/4. Equisupport Extension

Omitting equisupport extension is the most widely usedtechnique in FIM.Property 1. Let X ⊂ Y ⊆ I. If supp(X) = supp(Y ) thensupp(Y ∪ Z) = supp(X ∪ Z) for any Z ⊆ I \ Y .

Usage: It is not necessary to determine the support of Y ∪ Z.Equisupport pruning in Apriori for FIM:

1. in infrequent candidate removal phase: |X| = |Y | − 1 and Xis prefix of Y , then do not extend Y ,

2. in candidate generation’s subset check phase: Y ∪ Z ispotential candidate, Y is non-prefix of Y ∪ Z, X is prefix ofY then the support of Y ∪ Z is not determined.

Unfortunately, no equisupport extension pruning can be applied

in frequent item sequence mining.

Open Source Data Mining Workshop on Frequent Pattern Mining Implementations – p.9/14

Page 21: ATrie-basedAPRIORIImplementationforMining ...bodon/kozos/papers/bodon_osdm05_slide.pdf · some basic functions (e.g. fast subset enumerator), modularized Apriori, eclat, fp-growth,

3/4. Equisupport Extension

Omitting equisupport extension is the most widely usedtechnique in FIM.Property 1. Let X ⊂ Y ⊆ I. If supp(X) = supp(Y ) thensupp(Y ∪ Z) = supp(X ∪ Z) for any Z ⊆ I \ Y .

Usage: It is not necessary to determine the support of Y ∪ Z.Equisupport pruning in Apriori for FIM:

1. in infrequent candidate removal phase: |X| = |Y | − 1 and Xis prefix of Y , then do not extend Y ,

Proof. By contradiction. Let the database T be {〈A〉, 〈B,A〉}.Then supp(〈〉) = supp(〈A〉) = 2 butsupp(〈B〉) = 1 6= 0 = supp(〈A,B〉).

Unfortunately, no equisupport extension pruning can be applied

in frequent item sequence mining.

Open Source Data Mining Workshop on Frequent Pattern Mining Implementations – p.9/14

Page 22: ATrie-basedAPRIORIImplementationforMining ...bodon/kozos/papers/bodon_osdm05_slide.pdf · some basic functions (e.g. fast subset enumerator), modularized Apriori, eclat, fp-growth,

4/4. Filtered transaction caching

Let t be a transaction.filtered t: infrequent items are removed from t.Collect the same filtered transaction and store them in memory.

Advantages:• IO cost is reduced,

• parsing costs are reduced,• the number of support count method calls is reduced.

Disadvantage:• needs extra memory.• requires CPU time to collect the same filtered transactions

Possible data structures: ordered list, trie, red-black tree, patri-

cia tree

Open Source Data Mining Workshop on Frequent Pattern Mining Implementations – p.10/14

Page 23: ATrie-basedAPRIORIImplementationforMining ...bodon/kozos/papers/bodon_osdm05_slide.pdf · some basic functions (e.g. fast subset enumerator), modularized Apriori, eclat, fp-growth,

4/4. Filtered transaction caching

Let t be a transaction.filtered t: infrequent items are removed from t.Collect the same filtered transaction and store them in memory.

Advantages:• IO cost is reduced,• parsing costs are reduced,

• the number of support count method calls is reduced.

Disadvantage:• needs extra memory.• requires CPU time to collect the same filtered transactions

Possible data structures: ordered list, trie, red-black tree, patri-

cia tree

Open Source Data Mining Workshop on Frequent Pattern Mining Implementations – p.10/14

Page 24: ATrie-basedAPRIORIImplementationforMining ...bodon/kozos/papers/bodon_osdm05_slide.pdf · some basic functions (e.g. fast subset enumerator), modularized Apriori, eclat, fp-growth,

4/4. Filtered transaction caching

Let t be a transaction.filtered t: infrequent items are removed from t.Collect the same filtered transaction and store them in memory.

Advantages:• IO cost is reduced,• parsing costs are reduced,• the number of support count method calls is reduced.

Disadvantage:• needs extra memory.• requires CPU time to collect the same filtered transactions

Possible data structures: ordered list, trie, red-black tree, patri-

cia tree

Open Source Data Mining Workshop on Frequent Pattern Mining Implementations – p.10/14

Page 25: ATrie-basedAPRIORIImplementationforMining ...bodon/kozos/papers/bodon_osdm05_slide.pdf · some basic functions (e.g. fast subset enumerator), modularized Apriori, eclat, fp-growth,

4/4. Filtered transaction caching

Let t be a transaction.filtered t: infrequent items are removed from t.Collect the same filtered transaction and store them in memory.

Advantages:• IO cost is reduced,• parsing costs are reduced,• the number of support count method calls is reduced.

Disadvantage:• needs extra memory.• requires CPU time to collect the same filtered transactions

Possible data structures: ordered list, trie, red-black tree, patri-

cia treeOpen Source Data Mining Workshop on Frequent Pattern Mining Implementations – p.10/14

Page 26: ATrie-basedAPRIORIImplementationforMining ...bodon/kozos/papers/bodon_osdm05_slide.pdf · some basic functions (e.g. fast subset enumerator), modularized Apriori, eclat, fp-growth,

4/4. Filtered transaction caching

Expectation:• Transaction caching is not such an effective technique as it

is in FIM.

Reasoning:• For two item sequences to be equal, not just the items, but

the indices of the same items have to be equal as well.

Experiments:• never resulted in a faster algorithm,• sometimes it significantly increases memory need.

Open Source Data Mining Workshop on Frequent Pattern Mining Implementations – p.11/14

Page 27: ATrie-basedAPRIORIImplementationforMining ...bodon/kozos/papers/bodon_osdm05_slide.pdf · some basic functions (e.g. fast subset enumerator), modularized Apriori, eclat, fp-growth,

Is our implementation competitive?

0 100 200 300 400 500 600

1000 10000

time

(sec

)

Support threshold

Database: BMS-POS

AprioriPrefixSpan

0 100 200 300 400 500 600 700

100000

time

(sec

)

Support threshold

Database: kosarak100_infty

AprioriPrefixSpan

0 20 40 60 80

100 120

1 10 100

time

(sec

)

Support threshold

Database: kosarak2_10_2

AprioriPrefixSpan

0 100 200 300 400 500 600 700 800

10000tim

e (s

ec)

Support threshold

Database: kosarak2_100_infty

AprioriPrefixSpan

Run-timesApriori vs. prefixSpan by Taku Kudo

Open Source Data Mining Workshop on Frequent Pattern Mining Implementations – p.12/14

Page 28: ATrie-basedAPRIORIImplementationforMining ...bodon/kozos/papers/bodon_osdm05_slide.pdf · some basic functions (e.g. fast subset enumerator), modularized Apriori, eclat, fp-growth,

Is our implementation competitive?

0 10 20 30 40 50 60 70 80 90

1000 10000

mem

(MB

)

Support threshold

Database: BMS-POS

AprioriPrefixSpan

0 20 40 60 80

100 120 140

100000

mem

(MB

)

Support threshold

Database: kosarak100_infty

AprioriPrefixSpan

0 50

100 150 200 250 300

1 10 100

mem

(MB

)

Support threshold

Database: kosarak2_10_2

AprioriPrefixSpan

0 20 40 60 80

100 120 140 160

10000m

em (M

B)

Support threshold

Database: kosarak2_100_infty

AprioriPrefixSpan

Memory needsApriori vs. prefixSpan by Taku Kudo

Open Source Data Mining Workshop on Frequent Pattern Mining Implementations – p.13/14

Page 29: ATrie-basedAPRIORIImplementationforMining ...bodon/kozos/papers/bodon_osdm05_slide.pdf · some basic functions (e.g. fast subset enumerator), modularized Apriori, eclat, fp-growth,

Conclusion

When develping solutions for frequent item sequence mining it isuseful to understand techniques used in FIM.

Thank you for your attention!

Open Source Data Mining Workshop on Frequent Pattern Mining Implementations – p.14/14