Page 1: LCM ver.2: Efficient Mining Algorithms for Frequent/Closed/Maximal Itemsets Takeaki Uno Masashi Kiyomi Hiroki Arimura National Institute of Informatics,

LCM ver.2: Efficient Mining Algorithms for Frequent/Closed/Maximal Itemsets

Takeaki Uno (National Institute of Informatics, JAPAN)
Masashi Kiyomi (National Institute of Informatics, JAPAN)
Hiroki Arimura (Hokkaido University, JAPAN)

1/Nov/2004 Frequent Itemset Mining Implementations ’04

Page 2: Summary

Our approach vs. typical approach

FI mining
  Our approach: backtracking with hypercube decomposition (few frequency countings)
  Typical approach: backtracking

CI mining
  Our approach: backtracking with PPC-extension (complete enumeration) (small memory)
  Typical approach: Apriori with pruning

MFI mining
  Our approach: backtracking with pruning (small memory)
  Typical approach: Apriori with pruning

Frequency counting
  Our approach: occurrence deliver (linear time computation)
  Typical approach: down project

Database maintenance
  Our approach: array with anytime database reduction (simple) (fast initialization)
  Typical approach: trie (FP-tree)

Maximality check
  Our approach: more database reductions (small memory)
  Typical approach: store all itemsets

Page 3: Frequent Itemset Mining

• Almost all computation time is spent on frequency counting

⇒ How to reduce
  • the number of FIs to be checked
  • the cost of frequency counting

Page 4: Hypercube Decomposition [from ver.1]

• Reduce the number of FIs to be checked

1. Decompose the set of all FIs into hypercubes, each of which is included in an equivalence class
2. Enumerate the maximal and the minimal itemsets of each hypercube (with frequency counting)
3. Generate the other FIs between the maximal and the minimal (without frequency counting)

Efficient when the support is small
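Step 3 can be illustrated with a minimal Python sketch. It assumes that the minimal and the maximal itemset of one hypercube are already known and share the same frequency, so every itemset between them inherits that frequency with no further counting. The function name `expand_hypercube` and the sample values are hypothetical and are not taken from the LCM code.

```python
from itertools import combinations

def expand_hypercube(minimal, maximal, frequency):
    """Yield (itemset, frequency) for every X with minimal <= X <= maximal.

    Assumption: minimal is a subset of maximal and both have the same
    frequency, so no itemset in between needs to be counted again.
    """
    minimal, maximal = frozenset(minimal), frozenset(maximal)
    free = sorted(maximal - minimal)          # items that may be added freely
    for r in range(len(free) + 1):
        for extra in combinations(free, r):
            yield minimal | frozenset(extra), frequency

# Made-up hypercube: minimal {1}, maximal {1, 2, 3}, frequency 4.
cube = list(expand_hypercube({1}, {1, 2, 3}, frequency=4))
```

A hypercube with k free items yields 2^k itemsets from only two frequency counts, which is why the saving grows when the support is small and the equivalence classes are large.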

Page 5: Occurrence Deliver [ver.1]

• Compute the denotations of P ∪ {i} for all i's at once, by transposing the trimmed database

• The trimmed database is composed of
  - items to be added
  - transactions including P

⇒ linear time in the size of the trimmed database

Efficient for sparse datasets

[Figure: for the itemset {1,2} with denotation {A,B,C}, the trimmed database over the items 3, 4, 5 is transposed to obtain the denotations of {1,2,3}, {1,2,4} and {1,2,5}]
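The idea of occurrence deliver can be sketched in a few lines of Python. This is a simplified illustration, not the LCM implementation: the function name, the dictionary-based database layout and the sample transactions are assumptions made for the example (the transactions only mimic the shape of the figure, with itemset {1,2} and denotation A, B, C).

```python
from collections import defaultdict

def occurrence_deliver(database, occ_P, candidates):
    """Return {item: denotation of P + {item}} for all candidate items at once.

    database:   dict mapping transaction id -> set of items (assumed layout)
    occ_P:      ids of the transactions including P (the denotation of P)
    candidates: items that may be added to P
    The scan touches each cell of the trimmed database once, hence linear time.
    """
    buckets = defaultdict(list)
    for tid in occ_P:                      # transactions including P
        for item in database[tid]:
            if item in candidates:         # items to be added
                buckets[item].append(tid)  # "deliver" the occurrence
    return dict(buckets)

# Made-up data: P = {1, 2} occurs in A, B and C.
database = {
    "A": {1, 2, 3, 4},
    "B": {1, 2, 3, 5},
    "C": {1, 2, 4, 5},
    "D": {2, 3},
}
denotations = occurrence_deliver(database, ["A", "B", "C"], {3, 4, 5})
```

The length of each bucket is the frequency of the corresponding extension, so one pass gives both the denotations and the counts.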

Page 6: Loss of Occurrence Deliver [new]

• Avoiding the frequency counting of infrequent itemsets P ∪ {e} has been considered important

• However, in our experiments, the computation time for such itemsets is 1/3 of the total computation cost on average
   (if we sort the items by their frequency (size of tuple list))

[Figure: tuple lists of P ∪ {e} for the items e = 3, …, 9, compared with the threshold θ]

Occurrence deliver has the advantage of a simple structure

Page 7: Anytime Database Reduction [new]

• Database reduction: reduce the database [fp-growth, etc.] by
  ◆ removing item e if e is included in fewer than θ transactions, or included in all transactions
  ◆ merging identical transactions into one

• Anytime database reduction: recursively apply trimming and this reduction in the recursion
   ⇒ the database becomes small in the lower levels of the recursion

In the recursion tree, lower-level iterations are exponentially more numerous than upper-level iterations ⇒ very efficient
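A single reduction step, as described by the two rules above, might look like the following Python sketch. The weighted-transaction representation (a list of `(items, weight)` pairs) and the name `reduce_database` are assumptions for illustration; in the anytime variant this function would be called again on the trimmed database at each level of the recursion.

```python
from collections import Counter

def reduce_database(transactions, theta):
    """One database reduction step on a weighted transaction list.

    transactions: list of (items, weight) pairs (assumed representation)
    theta:        minimum support
    Removes items with frequency below theta or contained in every
    transaction, then merges transactions that became identical.
    """
    total = sum(w for _, w in transactions)
    freq = Counter()
    for items, w in transactions:
        for e in items:
            freq[e] += w
    keep = {e for e, f in freq.items() if theta <= f < total}
    merged = Counter()
    for items, w in transactions:
        merged[frozenset(items) & keep] += w   # identical remainders are merged
    return list(merged.items())

# Made-up example: item 1 is in all transactions, item 4 is infrequent.
db = [({1, 2, 3}, 1), ({1, 2, 3}, 1), ({1, 2}, 1), ({1, 4}, 1)]
reduced = reduce_database(db, theta=2)
```

Here four transactions with nine cells shrink to three weighted transactions with three cells, and the effect compounds when the step is repeated in deeper iterations.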

Page 8: Example of Anytime Database Reduction [new]

[Figure: trimming (with respect to items i and j) alternates with anytime database reduction, repeated down the recursion]

Page 9: Array (reduced) vs. Trie (FP-tree) [new]

• A trie can compress the trimmed database [fp-growth, etc.]

• In experiments on FIMI instances, we computed the average compression ratio achieved by a trie for the trimmed database over all iterations

• #items (cells) in tries: 1/2 on average, 1/6 at minimum (dense case)

• If a trie is built as a binary tree, it needs at least 3 pointers for each item
   ⇒ memory use (computation time): twice on average, 2/3 at minimum

Array initialization is fast (LCM: O(||T||); trie: O(|T| log |T| + ||T||))
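One way to read the "twice" and "2/3" figures is as the cell ratio multiplied by the per-cell cost. The short Python sketch below works this out under the assumption that an array cell costs one word (the item) and a trie cell costs four words (the item plus 3 pointers); these word counts are an interpretation of the slide, not numbers stated by the authors.

```python
from fractions import Fraction

WORDS_PER_ARRAY_CELL = 1      # assumption: the array stores only the item
WORDS_PER_TRIE_CELL = 1 + 3   # assumption: the item plus 3 pointers

def trie_to_array_memory_ratio(cell_ratio):
    """Memory of the trie relative to the array, given (#trie cells / #array cells)."""
    return cell_ratio * WORDS_PER_TRIE_CELL / WORDS_PER_ARRAY_CELL

average_case = trie_to_array_memory_ratio(Fraction(1, 2))   # half as many cells
dense_case = trie_to_array_memory_ratio(Fraction(1, 6))     # one sixth as many cells
```

Under these assumptions the trie pays off in memory only when it compresses the cells to less than a quarter of the array.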

Page 10: Results

Page 11: Closed Itemset Mining

• How to
  - avoid (prune) non-closed itemsets? (existing pruning is not complete)
  - compute the closure quickly?
  - save memory? (existing approaches use much memory)

Page 12: Prefix Preserving Closure Extension [ver.1]

• Prefix preserving closure extension (PPC-extension) is a variation of closure extension

Def. closure tail of a closed itemset P
   ⇔ the minimum j s.t. closure(P ∩ {1,…,j}) = P

Def. H = closure(P ∪ {i}) (closure extension of P) is a PPC-extension of P
   ⇔ i > closure tail and H ∩ {1,…,i-1} = P ∩ {1,…,i-1}

⇒ no duplication occurs in depth-first search

“Any” closed itemset H is generated from another “unique” closed itemset by PPC-extension (i.e., from closure(H ∩ {1,…,i-1}))

Page 13: Example of PPC-extension [ver.1]

T = { {1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2} }

[Figure: graph over the itemsets φ, {1,7,9}, {2,7,9}, {1,2,7,9}, {7,9}, {2,5}, {2}, {2,3,4,5}, {1,2,7,8,9}, {1,2,5,6,7,9}, with closure extension edges and ppc extension edges]

• closure extension ⇒ acyclic
• ppc extension ⇒ tree
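The PPC-extension rule can be tried out on the six transactions of the example database T with a short Python sketch. It is a naive illustration under stated assumptions: the function name `ppc_enumerate` is hypothetical, occurrences are recomputed by scanning all transactions (no occurrence deliver, no database reduction), and `min_support` must be at least 1.

```python
def ppc_enumerate(transactions, min_support=1):
    """Enumerate closed itemsets by depth-first search over PPC-extensions."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted(set().union(*transactions))
    closed = []

    def prefix(itemset, i):
        return {x for x in itemset if x < i}          # itemset ∩ {1,…,i-1}

    def recurse(P, tail):
        closed.append(P)
        for i in items:
            if i <= tail or i in P:                   # need i > closure tail
                continue
            occ = [t for t in transactions if P <= t and i in t]
            if len(occ) < min_support:
                continue
            H = frozenset.intersection(*occ)          # closure(P ∪ {i})
            if prefix(H, i) == prefix(P, i):          # prefix is preserved
                recurse(H, i)                         # i is the closure tail of H

    recurse(frozenset.intersection(*transactions), 0)  # closure of the empty set
    return closed

T = [{1, 2, 5, 6, 7, 9}, {2, 3, 4, 5}, {1, 2, 7, 8, 9}, {1, 7, 9}, {2, 7, 9}, {2}]
closed_itemsets = ppc_enumerate(T)
```

Each closed itemset is appended exactly once, so no table of already found itemsets is needed to detect duplicates; this is where the memory saving comes from.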

Page 14: Results

Page 15: Maximal Frequent Itemset Mining

• How to
  - avoid (prune) non-maximal itemsets?
  - check maximality quickly?
  - save memory? (existing maximality check and pruning use much memory)

Page 16: Backtracking-based Pruning [new]

• During the backtracking algorithm for FI, let
   K: the current itemset    H: an MFI including K

• Re-sort the items so that the items of H are located at the end

• Then, a new MFI is NEVER found in the recursive calls w.r.t. the items in H
   ⇒ omit such recursive calls

We can avoid many non-MFIs

[Figure: the items 4, …, 10 are re-sorted; the items before those of H get a recursive call, the items of H get no recursive call]
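The pruning rule can be sketched in Python as follows. This is a simplified variant, not the LCM procedure: here H is built greedily from the candidate items (so it is frequent and cannot be extended by any candidate) instead of being obtained from the first recursive call, the support is counted by a full scan, and the names `maximal_frequent_itemsets`, `support` and `is_maximal` are hypothetical. The sketch also assumes the empty itemset is frequent.

```python
def maximal_frequent_itemsets(transactions, min_support):
    """Backtracking for MFIs that omits recursive calls w.r.t. items of H."""
    transactions = [frozenset(t) for t in transactions]
    all_items = sorted(set().union(*transactions))
    found = []

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    def is_maximal(itemset):
        return all(support(itemset | {e}) < min_support
                   for e in all_items if e not in itemset)

    def recurse(K, candidates):
        cands = [e for e in candidates if support(K | {e}) >= min_support]
        H = K
        for e in cands:                       # greedy frequent superset of K
            if support(H | {e}) >= min_support:
                H = H | {e}
        if is_maximal(H):
            found.append(H)
        # Re-sort: items of H go to the end of the candidate order.
        order = [e for e in cands if e not in H] + [e for e in cands if e in H]
        for pos, e in enumerate(order):
            if e in H:
                break                         # only subsets of H remain: omit
            recurse(K | {e}, order[pos + 1:])

    recurse(frozenset(), all_items)
    return found

T = [{1, 2, 5, 6, 7, 9}, {2, 3, 4, 5}, {1, 2, 7, 8, 9}, {1, 7, 9}, {2, 7, 9}, {2}]
mfis = maximal_frequent_itemsets(T, min_support=2)
```

On this database only two iterations are executed, because every recursive call for an item of H would only generate subsets of H. No previously found itemsets are consulted for the pruning, which is the memory saving the slide refers to.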

Page 17: Fast Maximality Check (CI, MFI) [new]

• To reduce the computation cost of the maximality check and the closedness check, we use more database reduction

• In anytime database reduction, we keep
  ◆ the intersection of the merged transactions, for the closure operation
  ◆ the sum of the merged transactions as a weighted transaction database, for the maximality check

• The closure is the intersection of the transactions

• The frequencies of the itemsets larger by one item are the sums of the transactions in the trimmed database

By using these reduced databases, the computation time becomes short (no more than that of frequency counting)
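The two kept quantities can be illustrated with a small Python sketch. It assumes a hypothetical representation in which each merged group of transactions is stored as a pair (intersection, per-item sum), and that all merged transactions include the current itemset K; the names `merge` and `closure_and_maximality` are made up for the example.

```python
from collections import Counter

def merge(transactions):
    """Merge a group of transactions, keeping their intersection and their sum."""
    sets = [frozenset(t) for t in transactions]
    common = frozenset.intersection(*sets)   # used for the closure operation
    total = Counter()
    for t in sets:
        total.update(t)                      # per-item weight, for maximality
    return common, total

def closure_and_maximality(merged, K, theta):
    """merged: list of (intersection, sum) pairs of transactions including K."""
    closure = frozenset.intersection(*(common for common, _ in merged))
    counts = Counter()
    for _, total in merged:
        counts.update(total)                 # frequency of K ∪ {e} for each e
    is_maximal = all(counts[e] < theta for e in counts if e not in K)
    return closure, is_maximal

# Made-up occurrences of K = {1}, stored as two merged groups.
merged = [merge([{1, 2, 3}, {1, 2, 4}]), merge([{1, 2}])]
closure_1, maximal_1 = closure_and_maximality(merged, {1}, theta=2)
closure_12, maximal_12 = closure_and_maximality(merged, {1, 2}, theta=2)
```

Both answers come from one pass over the merged records, which matches the claim that the check costs no more than frequency counting.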

Page 18: Results

Page 19: Experiments

CPU, memory, OS: AMD Athlon XP 1600+, 224MB, Linux

Compared with: FP-growth, afopt, Mafia, Patriciamine, kDCI
(all of these scored highly at the FIMI03 competition)

13 datasets of the FIMI repository

Result

• Fast at large supports for all instances of FI, CI, MFI
• Fast for all instances of CI (except for accidents)
• Fast for all sparse datasets of FI, CI, MFI
• Slow only for accidents and T40I10D100K of FI and MFI, and pumsbstar of MFI

Page 20: Summary of Results

Large supports

             FI      CI      MFI
sparse(7)    LCM     LCM     LCM
middle(5)    LCM     LCM     LCM
dense(1)     LCM     LCM     LCM

Small supports

             FI      CI      MFI
sparse(7)    LCM     LCM     LCM
middle(5)    Both    LCM     Both
dense(1)     Others  LCM     Others

Page 21: Results

Page 22: Conclusion

• When equivalence classes are large, PPC-extension and hypercube decomposition work well

• Anytime database reduction and occurrence deliver have advantages in initialization, sparse cases and simplicity, compared to trie and down project

• Backtracking-based pruning saves memory usage

• More database reduction works as well as memory storage approaches

Page 23: Future Work

• LCM is weak at MFI mining and dense datasets

• More efficient pruning for MFI
• Some new data structures for dense cases
• Fast radix sort for anytime database reduction
• I/O optimization?

Page 24: List of Datasets

Real datasets
     ・ BMS-WebView-1
     ・ BMS-WebView-2
     ・ BMS-POS
     ・ Retail
     ・ Kosarak
     ・ Accidents

Machine learning benchmark
     ・ Chess
     ・ Mushroom
     ・ Pumsb
     ・ Pumsb*
     ・ Connect

Artificial datasets
     ・ T10I4D100K
     ・ T40I10D100K