M.Sc. Jury Defense
Outline: Motivations, State of the Art, Parallel CLOSET+, Ending Remarks
Parallel CLOSET+ Algorithm for Finding Frequent Closed Itemsets
Tayfun Sen
M.Sc. Thesis Defense
Department of Computer Engineering
Middle East Technical University
June 29, 2009
Tayfun Sen Parallel CLOSET+ for Finding Frequent Closed Itemsets
Outline
- Motivations
  - Problem: Information Pollution
  - Recent Advancements in Data and Computing
- State of the Art
  - Data Mining
  - Parallel Computing
- Parallel CLOSET+
  - The CLOSET+ Algorithm
  - Parallelization
  - Test Results
  - Conclusion
- Ending Remarks
  - Demo
  - References
  - Q&A
Problem: Information Pollution
- Computerization
- Internetization (is that a word?)
As a result, the amount of data accessible to ordinary people through everyday devices is growing exponentially.
Image titled “Listening Post” from flickr, licensed CC BY-NC-ND 2.0. Source: http://www.flickr.com/photos/fenchurch/427814801/
Recent Advancements in Data and Computing
- New sources providing huge amounts of data (open governance, open APIs, crowdsourcing, ...)
- Increased computing power (grids, cheaper clusters, cloud computing, ...)
Google servers circa 1996
Google servers circa 1999. [a]
[a] Image taken from flickr, licensed CC BY-2.0. Source: http://en.wikipedia.org/wiki/File:Google’s First Production Server.jpg
Image taken from Wikipedia, licensed CC BY-3.0. Source: http://en.wikipedia.org/wiki/File:Athlon64x2-6400plus.jpg
Data Mining
- Data mining enables the transition from data to knowledge: Data ⇒ Information ⇒ Knowledge
- A relatively young but very active research area.
- Many popular applications in the wild (e-commerce, finance, biotechnology, counter-terrorism, etc.)
- Well-known implementations include flickr.com interestingness, amazon.com suggestions, and Google Flu Trends.
Data Mining - A Top Down Approach
- KDD: Knowledge Discovery in Databases (includes data preprocessing and result interpretation)
- Data Mining (sometimes used interchangeably with KDD)
- Association Rule Mining (beer and baby diapers?)
- Frequent Itemset Mining (self-explanatory)
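Since the talk is about frequent closed itemsets, a tiny brute-force sketch may help fix the terms. An itemset is frequent if its support (the number of baskets containing it) reaches a threshold, and closed if no proper superset has the same support. The baskets and the threshold below are hypothetical, invented only for illustration, and this exhaustive enumeration is not how CLOSET+ works.

```python
from itertools import combinations

# Hypothetical toy baskets, invented purely for illustration.
BASKETS = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"beer", "diapers", "milk"},
    {"milk", "chips"},
]


def support(itemset, baskets):
    """Number of baskets that contain every item of `itemset`."""
    return sum(1 for basket in baskets if itemset <= basket)


def frequent_itemsets(baskets, min_support):
    """Brute-force enumeration: map each frequent itemset to its support."""
    items = sorted(set().union(*baskets))
    result = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            count = support(set(combo), baskets)
            if count >= min_support:
                result[frozenset(combo)] = count
    return result


def closed_itemsets(frequent):
    """Keep itemsets that have no proper superset with the same support."""
    return {
        itemset: count
        for itemset, count in frequent.items()
        if not any(itemset < other and count == other_count
                   for other, other_count in frequent.items())
    }


frequent = frequent_itemsets(BASKETS, min_support=2)
closed = closed_itemsets(frequent)
for itemset, count in closed.items():
    print(sorted(itemset), count)
```

Here {beer} and {diapers} are frequent but not closed, because {beer, diapers} has the same support. Reporting only closed itemsets loses no information while shrinking the output.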
Parallel Computing
- To program multi-core CPUs (whatever happened to frequency increases?)
- To program systems with multiple computing resources (shared memory or shared nothing)
- Becoming really important, as it is needed to take advantage of ever-increasing computing resources.
Parallel Programming Methods
- Automatic parallelization (a lost war?)
- Threads (too hard to manage?)
- OpenMP (easier to use, but does not work on shared-nothing architectures)
- MPI (flexible, but low-level and harder to use)
The CLOSET+ Algorithm
- Developed by Wang, Han, and Pei [1]
- A natural step in the evolution of data mining algorithms (sets ⇒ trees ⇒ graphs)
- Uses a data structure called the FP-tree
Building FP-tree
Table: An Example Database
TID Basket contents
001 a, c, f, m, p
002 a, c, d, f, m, p
003 a, b, c, f, g, m
004 b, f, i
005 b, c, n, p
Table: Pruned and Ordered DB
TID Pruned & ordered items
001 f, c, a, m, p
002 f, c, a, m, p
003 f, c, a, b, m
004 f, b
005 c, b, p
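The two tables can be reproduced with a short sketch: count the items, drop the infrequent ones, sort each basket by descending support, and insert the result into a prefix tree. The slide does not state the support threshold or how ties are broken, so `MIN_SUPPORT` and `TIE_ORDER` are assumptions (a threshold of 2 or 3 gives the same table, and the tie order is read off the slide). The `Node` class is a hypothetical minimal structure without the side links a full FP-tree carries.

```python
from collections import Counter

# The example database from the slide.
DATABASE = {
    "001": ["a", "c", "f", "m", "p"],
    "002": ["a", "c", "d", "f", "m", "p"],
    "003": ["a", "b", "c", "f", "g", "m"],
    "004": ["b", "f", "i"],
    "005": ["b", "c", "n", "p"],
}
MIN_SUPPORT = 2       # assumption: the slide does not state the threshold
TIE_ORDER = "fcabmp"  # assumption: tie-break order read off the slide


def prune_and_order(database, min_support, tie_order):
    """Drop infrequent items and sort each basket by descending support."""
    counts = Counter(item for basket in database.values() for item in basket)
    frequent = [item for item in counts if counts[item] >= min_support]
    # Descending support first, then the assumed tie-break order.
    frequent.sort(key=lambda item: (-counts[item], tie_order.index(item)))
    rank = {item: position for position, item in enumerate(frequent)}
    return {
        tid: sorted((item for item in basket if item in rank), key=rank.get)
        for tid, basket in database.items()
    }


class Node:
    """Hypothetical minimal FP-tree node (no side links)."""

    def __init__(self, item):
        self.item = item
        self.count = 0
        self.children = {}


def build_fp_tree(transactions):
    """Insert each ordered transaction as a path, sharing common prefixes."""
    root = Node(None)
    for items in transactions:
        node = root
        for item in items:
            node = node.children.setdefault(item, Node(item))
            node.count += 1
    return root


ordered = prune_and_order(DATABASE, MIN_SUPPORT, TIE_ORDER)
tree = build_fp_tree(ordered.values())
for tid, items in ordered.items():
    print(tid, ", ".join(items))
```

Because frequent items come first in every transaction, the shared prefix f, c, a is stored once with a count of 3 instead of three times, which is where the compression of the FP-tree comes from.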
Figure: Building of FP-tree as each transaction is processed
Mining FP-tree
Figure: FP-tree with side links
Figure: Projected FP-tree for item p:3
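The projection for item p can be sketched without the tree itself, assuming we work directly on the pruned and ordered transactions from the earlier table: for every transaction containing p, keep the items that precede it. This is a simplification for illustration. An actual implementation would follow the side links of the FP-tree rather than rescan the transactions, and the function names are hypothetical.

```python
from collections import Counter

# Pruned and ordered transactions from the earlier table.
ORDERED = [
    ["f", "c", "a", "m", "p"],
    ["f", "c", "a", "m", "p"],
    ["f", "c", "a", "b", "m"],
    ["f", "b"],
    ["c", "b", "p"],
]


def project(transactions, item):
    """Return the prefix preceding `item` in each transaction that contains it."""
    return [t[:t.index(item)] for t in transactions if item in t]


def items_in_every_prefix(prefixes):
    """Items present in all projected transactions."""
    counts = Counter(item for prefix in prefixes for item in prefix)
    return sorted(item for item, n in counts.items() if n == len(prefixes))


prefixes = project(ORDERED, "p")
always_with_p = items_in_every_prefix(prefixes)
print(prefixes)
print(always_with_p)
```

Three transactions contain p, matching the p:3 label in the figure. Item c occurs in all three prefixes, so {p} alone has the same support as {c, p} and cannot be closed. The item-merging optimisation in CLOSET+ [1] exploits this kind of observation.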
Parallelization
- Used the Open MPI and Boost libraries.
- Developed in C++.
- Debugging is particularly tricky (new types of bugs, huge number of interleavings, ...)
Item Count Merging
Figure: Merging of the local item counts
Support counts are simply added together. Next up: FP-tree and result tree merging.
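As a rough single-process sketch of this step, assume each worker counts the items in its own slice of the database and the local counts are then summed. The two-way split of the example database is hypothetical. In an MPI program the summation would plausibly be a reduction across processes, but that detail is an assumption and not taken from the slides.

```python
from collections import Counter
from functools import reduce
from operator import add


def local_item_counts(transactions):
    """Support count of every item in one worker's share of the database."""
    return Counter(item for basket in transactions for item in basket)


def merge_item_counts(local_counts):
    """Global counts are the element-wise sum of the local counts."""
    return reduce(add, local_counts, Counter())


# Hypothetical split of the example database across two workers.
PART_0 = [
    ["a", "c", "f", "m", "p"],
    ["a", "c", "d", "f", "m", "p"],
    ["a", "b", "c", "f", "g", "m"],
]
PART_1 = [
    ["b", "f", "i"],
    ["b", "c", "n", "p"],
]

merged = merge_item_counts([local_item_counts(PART_0), local_item_counts(PART_1)])
print(merged.most_common())
```

Once every worker knows the global counts, all of them can prune and order their transactions identically, which is what makes the later tree merging possible.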
Figure: Merging of two FP-trees
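Merging two FP-trees can be sketched as a recursive walk that adds counts on shared paths and grafts paths that exist in only one tree. The nested-dict representation and the function names are hypothetical, chosen for brevity. The sketch also assumes both trees were built with the same global item order.

```python
def build_tree(transactions):
    """Prefix tree as nested dicts: item -> {"count": n, "children": {...}}."""
    root = {}
    for items in transactions:
        level = root
        for item in items:
            node = level.setdefault(item, {"count": 0, "children": {}})
            node["count"] += 1
            level = node["children"]
    return root


def merge_trees(target, source):
    """Merge `source` into `target` in place, adding counts on shared paths."""
    for item, src in source.items():
        dst = target.setdefault(item, {"count": 0, "children": {}})
        dst["count"] += src["count"]
        merge_trees(dst["children"], src["children"])
    return target


# Hypothetical split of the pruned and ordered transactions over two workers.
LOCAL_0 = [
    ["f", "c", "a", "m", "p"],
    ["f", "c", "a", "m", "p"],
    ["f", "c", "a", "b", "m"],
]
LOCAL_1 = [
    ["f", "b"],
    ["c", "b", "p"],
]

merged = merge_trees(build_tree(LOCAL_0), build_tree(LOCAL_1))
print(merged["f"]["count"], sorted(merged))
```

Under these assumptions the merged tree equals the tree built from all transactions at once, so merging local trees does not change the mining result.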
Result tree merging
Test Results
- Tested with two types of data
- A real dataset and a synthetic one generated using IBM’s Quest dataset generator
- No over-subscription (each core executes a single thread)
Figure: Execution on 1-4 Cores, Retail dataset (x-axis: support value (%), y-axis: time (sec); series: 1, 2, 3, and 4 cores)
Figure: Execution on 4-16 Cores, Retail dataset (x-axis: support value (%), y-axis: time (sec); series: 4, 8, 12, and 16 cores)
Conclusion
- High speedup and efficiency for high to medium support values
- The main determinant of performance is communication overhead
- The FP-tree provides a compressed representation for communication, which increases the usefulness of parallel execution
- As the support threshold is lowered and the number of processors is increased, efficiency drops
- It is left to the application owner to find the optimum settings for execution (in terms of support values and number of processors)
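The terms speedup and efficiency used above have standard definitions: speedup is the single-processor time divided by the parallel time, and efficiency is speedup divided by the number of processors. The sketch below applies them to placeholder timings that are not measurements from the thesis. They only illustrate the arithmetic.

```python
def speedup(t_serial, t_parallel):
    """Speedup S = T_1 / T_p."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, processors):
    """Efficiency E = S / p, where 1.0 means perfect scaling."""
    return speedup(t_serial, t_parallel) / processors


# Placeholder timings in seconds, NOT measurements from the thesis.
T_ONE_CORE = 60.0
T_FOUR_CORES = 20.0

s = speedup(T_ONE_CORE, T_FOUR_CORES)
e = efficiency(T_ONE_CORE, T_FOUR_CORES, 4)
print(f"speedup={s:.2f} efficiency={e:.2f}")
```

Communication overhead adds time that does not shrink as processors are added, so efficiency tends to fall with more processors, as the conclusions note.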
Demo
A real life demo on Nar
References
[1] Jianyong Wang, Jiawei Han, and Jian Pei. CLOSET+: searching for the best strategies for mining frequent closed itemsets. In KDD ’03: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 236–245, New York, NY, USA, 2003.
Thanks for listening. Any questions?