M.Sc. Jury Defense

Page 1: M.Sc. Jury Defense

Outline: Motivations, State of the Art, Parallel CLOSET+, Ending Remarks

Parallel CLOSET+ Algorithm for Finding Frequent Closed Itemsets

Tayfun Şen

M.Sc. Thesis Defense
Department of Computer Engineering
Middle East Technical University
June 29, 2009

Tayfun Şen Parallel CLOSET+ for Finding Frequent Closed Itemsets

Page 2: M.Sc. Jury Defense

Motivations
  Problem: Information Pollution
  Recent Advancements in Data and Computing

State of the Art
  Data Mining
  Parallel Computing

Parallel CLOSET+
  The CLOSET+ Algorithm
  Parallelization
  Test Results
  Conclusion

Ending Remarks
  Demo
  References
  Q&A

Page 3: M.Sc. Jury Defense

Problem: Information Pollution

• Computerization
• Internetization (is that a word?)

As a result, the amount of data accessible to ordinary people through everyday devices grows exponentially.

Image titled “Listening Post” from flickr, licensed CC BY-NC-ND 2.0. Source: http://www.flickr.com/photos/fenchurch/427814801/

Page 6: M.Sc. Jury Defense

Recent Advancements in Data and Computing

• Newer sources providing huge amounts of data (open governance, open APIs, crowdsourcing, ...)
• Increased computing power (grids, cheaper clusters, cloud computing, ...)

Page 8: M.Sc. Jury Defense

Google servers circa 1996

Page 9: M.Sc. Jury Defense

Google servers circa 1999. [a]

[a] Image taken from flickr, licensed CC BY-2.0. Source: http://en.wikipedia.org/wiki/File:Google’s First Production Server.jpg

Page 10: M.Sc. Jury Defense

Image taken from Wikipedia, licensed CC BY-3.0. Source: http://en.wikipedia.org/wiki/File:Athlon64x2-6400plus.jpg

Page 11: M.Sc. Jury Defense

Data Mining

• Data mining enables the transition from data to knowledge: Data ⇒ Information ⇒ Knowledge
• A relatively young research area, but quite active nonetheless.
• Many popular applications in the wild (e-commerce, finance, biotechnology, counter-terrorism, etc.)
• flickr.com interestingness, amazon.com suggestions, and Google Flu Trends are some of the well-known implementations.

Page 15: M.Sc. Jury Defense

Data Mining - A Top Down Approach

• KDD: Knowledge Discovery in Databases (data preprocessing and result interpretation are included)
• Data Mining (sometimes used interchangeably with KDD)
• Association Rule Mining (beer and baby diapers?)
• Frequent Itemset Mining (self-explanatory)
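
Since the talk targets frequent closed itemsets, a small sketch of the two underlying notions may help. The support of an itemset is the number of baskets that contain all of its items, and an itemset is closed when no larger itemset has the same support. The code below is a brute-force illustration only, not CLOSET+ and not the thesis code; the names `Database`, `support`, `is_closed` and `sample_db` are assumptions, and the five baskets mirror the example database shown later in the talk.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// A basket is a string of single-character items, e.g. "acfmp".
using Database = std::vector<std::string>;

// Number of baskets that contain every item of `itemset`.
int support(const Database& db, std::string itemset) {
    std::sort(itemset.begin(), itemset.end());
    int count = 0;
    for (std::string basket : db) {  // copied so that it can be sorted
        std::sort(basket.begin(), basket.end());
        if (std::includes(basket.begin(), basket.end(),
                          itemset.begin(), itemset.end())) {
            ++count;
        }
    }
    return count;
}

// An itemset is closed if adding any further item lowers its support.
// Checking single-item extensions is enough, because support can only
// shrink as an itemset grows.
bool is_closed(const Database& db, const std::string& itemset) {
    const int base = support(db, itemset);
    assert(base > 0 && "closedness is only asked of itemsets that occur");
    for (const std::string& basket : db) {
        for (char item : basket) {
            if (itemset.find(item) != std::string::npos) continue;
            if (support(db, itemset + item) == base) return false;
        }
    }
    return true;
}

// Sample baskets used only for illustration.
Database sample_db() {
    return {"acfmp", "acdfmp", "abcfgm", "bfi", "bcnp"};
}
```

With these baskets, {c, p} has support 3 and is closed, whereas {p} also has support 3 but is not closed, because every basket that contains p also contains c.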

Page 19: M.Sc. Jury Defense

Parallel Computing

• To program multi-core CPUs (whatever happened to frequency increases?)
• To program systems with multiple computing resources (shared memory or shared nothing)
• Becoming really important, as it is needed to benefit from the ever-increasing computing resources.

Page 22: M.Sc. Jury Defense

Parallel Programming Methods

• Automatic parallelization (a lost war?)
• Threads (too hard to manage?)
• OpenMP (easier to use, but does not work on shared-nothing architectures)
• MPI (flexible, but low-level and harder)

Page 26: M.Sc. Jury Defense

The CLOSET+ Algorithm

• Developed by Wang, Han, and Pei [1]
• A natural step in the evolution of data mining algorithms (sets ⇒ trees ⇒ graphs)
• Uses a data structure called the FP-tree

Page 29: M.Sc. Jury Defense

Building FP-tree

Table: An Example Database

TID   Basket contents
001   a, c, f, m, p
002   a, c, d, f, m, p
003   a, b, c, f, g, m
004   b, f, i
005   b, c, n, p

Table: Pruned and Ordered DB

TID   Pruned & ordered items
001   f, c, a, m, p
002   f, c, a, m, p
003   f, c, a, b, m
004   f, b
005   c, b, p
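
The pruned and ordered transactions can be turned into a prefix tree roughly as sketched below. This is a minimal sketch, not the thesis implementation: the names `FPNode`, `insert_transaction`, `build_example_tree` and `count_at` are assumptions made for illustration, and the header table and side links of a full FP-tree are left out.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <vector>

// One node of the prefix tree: an item label, a support count,
// and children keyed by item.
struct FPNode {
    char item = '\0';
    int count = 0;
    std::map<char, std::unique_ptr<FPNode>> children;
};

// Insert one pruned and ordered transaction, sharing existing prefixes
// and incrementing the count of every node on the path.
void insert_transaction(FPNode& root, const std::string& items) {
    FPNode* node = &root;
    for (char item : items) {
        std::unique_ptr<FPNode>& child = node->children[item];
        if (!child) {
            child = std::make_unique<FPNode>();
            child->item = item;
        }
        ++child->count;
        node = child.get();
    }
}

// Build the tree for the five rows of the "Pruned and Ordered DB" table.
FPNode build_example_tree() {
    const std::vector<std::string> ordered = {
        "fcamp", "fcamp", "fcabm", "fb", "cbp"};
    FPNode root;
    for (const std::string& t : ordered) insert_transaction(root, t);
    return root;
}

// Count stored at the node reached by following `path` from the root.
int count_at(const FPNode& root, const std::string& path) {
    const FPNode* node = &root;
    for (char item : path) {
        auto it = node->children.find(item);
        assert(it != node->children.end() && "path not in tree");
        node = it->second.get();
    }
    return node->count;
}
```

In this sketch the prefix f, c, a is shared by three transactions, so `count_at(root, "fca")` is 3, while the separate branch c, b, p has a count of 1 on each node.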

Page 30: M.Sc. Jury Defense

Figure: Building of FP-tree as each transaction is processed

Page 31: M.Sc. Jury Defense

Mining FP-tree

Figure: FP-tree with side links

Figure: Projected FP-tree for item p:3
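
The idea behind a projected tree can be sketched on the ordered transactions directly: for a given item, keep only the prefix that precedes it in each transaction that contains it. In an actual FP-tree the same prefixes would be collected by following the side links of the item; the version below works on plain strings to stay short. The names `OrderedDb`, `project` and `count_items` are assumptions for illustration.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

using OrderedDb = std::vector<std::string>;

// Prefix of every ordered transaction that contains `item`,
// i.e. the items that precede `item` in the global order.
OrderedDb project(const OrderedDb& db, char item) {
    OrderedDb projected;
    for (const std::string& t : db) {
        const std::string::size_type pos = t.find(item);
        if (pos != std::string::npos) projected.push_back(t.substr(0, pos));
    }
    return projected;
}

// Support of each item inside a (projected) database.
std::map<char, int> count_items(const OrderedDb& db) {
    std::map<char, int> counts;
    for (const std::string& t : db) {
        for (char item : t) ++counts[item];
    }
    return counts;
}
```

For item p, the ordered example transactions give the prefixes f, c, a, m (twice) and c, b (once). Assuming a minimum support of 3, which is what the pruning in the example tables suggests, only c stays frequent inside this projection.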

Page 32: M.Sc. Jury Defense

Parallelization

• Used OpenMPI and Boost libraries.
• Developed in C++.
• Debugging is particularly tricky (new types of bugs, huge number of interleavings, ...)

Page 35: M.Sc. Jury Defense

Item Count Merging

Figure: Merging of the local item counts

Simple addition of support counts. Next up: FP-tree and result tree merging.
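
The count merging step can be sketched as follows. Each process counts the items of its own baskets, and the local tables are then added together. The message passing itself is omitted here, so the sketch runs in a single process; the names `ItemCounts`, `count_local`, `merge_counts` and `frequent_items`, as well as the way the baskets are split between two processes in the usage note, are assumptions for illustration.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

using ItemCounts = std::map<char, int>;

// Local pass: each process counts the items of its own baskets.
ItemCounts count_local(const std::vector<std::string>& baskets) {
    ItemCounts counts;
    for (const std::string& basket : baskets) {
        for (char item : basket) ++counts[item];
    }
    return counts;
}

// Merge step: add the counts of one process into the running total.
void merge_counts(ItemCounts& total, const ItemCounts& local) {
    for (const auto& entry : local) total[entry.first] += entry.second;
}

// Items whose merged count reaches `min_support`.
ItemCounts frequent_items(const ItemCounts& total, int min_support) {
    assert(min_support > 0);
    ItemCounts kept;
    for (const auto& entry : total) {
        if (entry.second >= min_support) kept.insert(entry);
    }
    return kept;
}
```

If one process holds baskets 001 to 003 of the example database and another holds 004 and 005, merging their local counts gives f:4, c:4, a:3, b:3, m:3, p:3 and a count of 1 for d, g, i and n, so a minimum support of 3 keeps exactly the six items that appear in the pruned table.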

Page 36: M.Sc. Jury Defense

Figure: Merging of two FP-trees
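
One way to merge two prefix trees is sketched below: children with the same item add their counts and are merged recursively, and children that exist in only one tree are moved over as whole subtrees. This is an illustrative sketch and not necessarily how the thesis implementation merges trees. It assumes both trees were built with the same global item order, and the names `FPNode`, `insert_transaction`, `merge_into` and `count_at` are hypothetical.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>

struct FPNode {
    int count = 0;
    std::map<char, std::unique_ptr<FPNode>> children;
};

// Insert one ordered transaction, incrementing counts along its path.
void insert_transaction(FPNode& root, const std::string& items) {
    FPNode* node = &root;
    for (char item : items) {
        std::unique_ptr<FPNode>& child = node->children[item];
        if (!child) child = std::make_unique<FPNode>();
        ++child->count;
        node = child.get();
    }
}

// Fold `source` into `target`. Matching children add their counts and are
// merged recursively; children missing in `target` are moved over whole.
void merge_into(FPNode& target, FPNode& source) {
    assert(&target != &source && "cannot merge a tree into itself");
    for (auto& entry : source.children) {
        std::unique_ptr<FPNode>& mine = target.children[entry.first];
        if (!mine) {
            mine = std::move(entry.second);  // adopt the whole subtree
        } else {
            mine->count += entry.second->count;
            merge_into(*mine, *entry.second);
        }
    }
    source.children.clear();  // the source tree is consumed by the merge
}

// Count stored at the node reached by following `path` from the root.
int count_at(const FPNode& root, const std::string& path) {
    const FPNode* node = &root;
    for (char item : path) {
        auto it = node->children.find(item);
        assert(it != node->children.end() && "path not in tree");
        node = it->second.get();
    }
    return node->count;
}
```

Merging a tree built from the first three ordered example transactions with one built from the last two yields the same counts as a tree built from all five at once, which is the property the merge relies on.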

Page 37: M.Sc. Jury Defense

Result tree merging

Page 38: M.Sc. Jury Defense

Test Results

• Tested with two types of data
• A real dataset and a synthetic one generated using IBM’s Quest dataset generator
• No over-subscription (each core executes a single thread)

Page 41: M.Sc. Jury Defense

Figure: Execution on 1-4 Cores, Retail dataset (x-axis: Support value (%); y-axis: Time (sec); series: 1 core, 2 cores, 3 cores, 4 cores)

Figure: Execution on 4-16 Cores, Retail dataset (x-axis: Support value (%); y-axis: Time (sec); series: 4 cores, 8 cores, 12 cores, 16 cores)

Page 42: M.Sc. Jury Defense

Conclusion

• High speedup and efficiency for high to medium support values
• The main determinant of performance is communication overhead
• The FP-tree provides a compressed form for communication, which increases the usefulness of parallel execution
• As the support threshold is lowered and the number of processors is increased, efficiency drops
• It is left to the application owner to find the optimum numbers for execution (in terms of support values and number of processors)
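
For reference, speedup and efficiency are usually defined as in the sketch below: speedup is the single-processor run time divided by the parallel run time, and efficiency is the speedup divided by the number of processors. The function names are assumptions, and the timings in the usage note are hypothetical values, not measurements from the thesis.

```cpp
#include <cassert>

// Speedup: how many times faster the parallel run is
// compared with the run on one processor.
double speedup(double t_serial, double t_parallel) {
    assert(t_serial > 0.0 && t_parallel > 0.0);
    return t_serial / t_parallel;
}

// Efficiency: speedup divided by the number of processors (1.0 is ideal).
double efficiency(double t_serial, double t_parallel, int processors) {
    assert(processors > 0);
    return speedup(t_serial, t_parallel) / processors;
}
```

As a hypothetical illustration, a job that takes 60 seconds on one core and 20 seconds on four cores has a speedup of 3 and an efficiency of 0.75. Communication overhead is what keeps the efficiency below 1.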

Page 47: M.Sc. Jury Defense

Demo

A real life demo on Nar

Page 48: M.Sc. Jury Defense

References

[1] Jianyong Wang, Jiawei Han, and Jian Pei. CLOSET+: searching for the best strategies for mining frequent closed itemsets. In KDD ’03: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 236–245, New York, NY, USA, 2003.

Page 49: M.Sc. Jury Defense

Thanks for listening. Any questions?
