Association Rules Mining in Distributed Environments
By: Shamila Mafazi
Supervised by: Dr. Abrar Haider
Acknowledgment
First, I would like to thank my supervisor, Dr. Abrar Haider, for his valuable support throughout this thesis.
I would especially like to extend my thanks to Dr. Jiuyong Li, who introduced me to the world of Data Mining.
Introduction
Data Mining refers to the extraction, or mining, of knowledge from large amounts of data (Han & Kamber 2006).
Data Mining techniques include association rules mining, clustering, classification, prediction, and so on.
Association Rules Mining retrieves relations and correlations between the items in a database.
Research Question
How can a more efficient method for discovering association rules in a distributed environment be created?
Methodology
To address the research question, this research has reviewed literature relating to the following areas:
Data mining in centralised environments.
Data mining in distributed environments.
Association rules mining in distributed/centralised environments.
Comparing the existing algorithms in distributed/centralised environments and studying their advantages and disadvantages.
Developing a concise representation, in particular distributed deduction rules.
Designing the new algorithm based on DTFIM.
Thesis Contribution
The algorithm proposed by this research intends to resolve marketing problems in distributed environments. Additionally, it aims to profile the needs and preferences of customers in transaction-oriented systems such as credit cards. One of the most significant problems with market basket data is dealing with the large number of candidate itemsets; retrieving interesting and meaningful patterns from these candidates is extremely difficult. Thus, this research presents an algorithm which reduces the number of candidate itemsets and simplifies the process of producing interesting customer preferences and patterns.
The Importance of Association Rules
Better management of inventory.
Better arrangement of the shelves.
Enhanced sales of items.
These rules can reveal the effect of continuing or discontinuing the sale of an item on the sale of other items.
Two Common Measures of Rule Interestingness
The discovered rules are interesting if they satisfy the minimum support and confidence thresholds.
The support of an itemset is the percentage of transactions that contain that itemset.
The confidence of a rule A → B is the conditional probability that a transaction containing A also contains B (e.g. that the items are bought together).
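As an illustration (not from the thesis), here is a minimal Python sketch of the two measures on a hypothetical four-transaction basket; the item names and helper functions are invented for the example.

```python
# Hypothetical toy transactions for illustrating support and confidence.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """P(consequent | antecedent) = supp(antecedent ∪ consequent) / supp(antecedent)."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

print(support({"bread", "milk"}, transactions))       # 0.5: 2 of the 4 transactions
print(confidence({"bread"}, {"milk"}, transactions))  # 0.67: 2 of the 3 bread transactions
```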
Association Rules Mining (ARM)
ARM in centralised environments refers to mining association rules from a single, integrated database:
Apriori, AprioriTid, AprioriHybrid, FIM, Sampling, Partitioning algorithm, DIC algorithm, FP-Growth algorithm, NDI algorithm.
ARM in distributed environments refers to mining association rules from distributed databases:
CD, FDM, ODAM, DTFIM.
Apriori Algorithm (Agrawal & Srikant 1994)
(Kantardzic 2003, p.168)
Worked example with minimum support threshold = 2: the algorithm passes over candidate 1-itemsets, 2-itemsets, and 3-itemsets in turn, pruning those whose support falls below the threshold.
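The worked tables on this slide did not survive the transcript, so the following is a hedged Python sketch of the level-wise Apriori idea on a hypothetical five-transaction database with the same minimum support threshold of 2; it illustrates the classic algorithm, not the thesis implementation.

```python
from itertools import combinations

def apriori(transactions, min_support_count=2):
    """Minimal level-wise Apriori: join frequent k-itemsets into (k+1)-candidates,
    prune candidates with an infrequent subset, count, and repeat."""
    transactions = [frozenset(t) for t in transactions]
    level = {frozenset({i}) for t in transactions for i in t}   # candidate 1-itemsets
    frequent, k = {}, 1
    while level:
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        survivors = {c: n for c, n in counts.items() if n >= min_support_count}
        frequent.update(survivors)
        level = set()
        for a, b in combinations(survivors, 2):
            cand = a | b
            if len(cand) == k + 1 and all(frozenset(s) in survivors
                                          for s in combinations(cand, k)):
                level.add(cand)
        k += 1
    return frequent

# Hypothetical database, minimum support count = 2.
db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c", "d"}]
for itemset, count in apriori(db).items():
    print(sorted(itemset), count)   # e.g. ['a'] 4 ... ['a', 'b', 'c'] 2; 'd' is pruned
```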
Centralised ARM Algorithm
Frequent Itemsets Mining (FIM) algorithm (Bodon 2004):
Trie data structure (Ansari et al. 2008)
Uses Trie data structure
Fast construction and information retrieval
Memory efficient
Example trie counts: Supp(a) = 9, Supp(a, b) = 6.
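As a rough illustration of why a trie suits itemset counting, here is a hypothetical Python sketch (not Bodon's or Ansari et al.'s code): candidate itemsets share prefixes in the trie, and one recursive walk per transaction updates every candidate contained in it.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # item -> TrieNode
        self.count = 0        # support count of the itemset ending at this node
        self.is_itemset = False

class ItemsetTrie:
    """Hypothetical itemset trie: candidates share common prefixes, so storage
    and per-transaction counting stay compact."""
    def __init__(self):
        self.root = TrieNode()

    def add_candidate(self, itemset):
        node = self.root
        for item in sorted(itemset):
            node = node.children.setdefault(item, TrieNode())
        node.is_itemset = True

    def count_transaction(self, transaction):
        """Increment every stored candidate that is a subset of the transaction."""
        items = sorted(transaction)
        def walk(node, start):
            if node.is_itemset:
                node.count += 1
            for i in range(start, len(items)):
                child = node.children.get(items[i])
                if child is not None:
                    walk(child, i + 1)
        walk(self.root, 0)

    def support(self, itemset):
        node = self.root
        for item in sorted(itemset):
            node = node.children.get(item)
            if node is None:
                return 0
        return node.count

trie = ItemsetTrie()
for cand in [{"a"}, {"a", "b"}]:
    trie.add_candidate(cand)
for t in [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}]:
    trie.count_transaction(t)
print(trie.support({"a"}), trie.support({"a", "b"}))   # 3 2
```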
Centralised ARM Algorithm: Non-Derivable Itemsets (NDI) (Calders & Goethals 2002)
Deduction rules divide the itemsets of a database into derivable and non-derivable itemsets.
The deduction rules set tight bounds on the support of an itemset.
The least of the upper bounds and the greatest of the lower bounds are the tight bounds for an itemset.
Derivable itemsets have equal upper and lower bounds.
Derivable itemsets do not represent any new information beyond what is already carried by their subsets.
The non-derivable itemsets are therefore a concise representation of the frequent itemsets.
Deduction Rules:
An example database of transactions (Calders & Goethals 2007)
supp(a¬b¬c) = supp(a) − supp(ab) − supp(ac) + supp(abc)   (transactions containing a but neither b nor c)
supp(a) − supp(ab) − supp(ac) + supp(abc) ≥ 0   (a support count can never be negative)
supp(abc) ≥ supp(ab) + supp(ac) − supp(a)
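Since the slide's example database is not reproduced in this transcript, the tiny hypothetical database below checks the derived lower bound numerically.

```python
# Hypothetical four-transaction database; checks supp(abc) >= supp(ab) + supp(ac) - supp(a).
db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]

def supp(itemset):
    return sum(1 for t in db if itemset <= t)

lhs = supp({"a", "b", "c"})                              # 1
rhs = supp({"a", "b"}) + supp({"a", "c"}) - supp({"a"})  # 2 + 2 - 3 = 1
assert lhs >= rhs   # the lower bound holds, and here it happens to be tight
```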
Summary of Deduction Rules
If |I \ X| is odd (the odd rule):
supp(I) ≤ ∑_{X ⊆ J ⊂ I} (−1)^(|I \ J| + 1) supp(J)
If |I \ X| is even (the even rule):
supp(I) ≥ ∑_{X ⊆ J ⊂ I} (−1)^(|I \ J| + 1) supp(J)
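A hedged Python sketch of how these rules produce the [l, u] bounds for an itemset from the supports of its proper subsets; the function name tight_bounds and the example support counts are invented for illustration.

```python
from itertools import chain, combinations

def tight_bounds(I, supp):
    """Deduction-rule bounds on supp(I), given supp(J) for every proper subset J
    (supp is a dict mapping frozensets to counts).  For each X ⊆ I, X ≠ I:
    delta_X = sum over X ⊆ J ⊂ I of (-1)^(|I\\J|+1) * supp(J);
    odd |I\\X| gives an upper bound, even |I\\X| a lower bound."""
    I = frozenset(I)
    def subsets(s):
        s = list(s)
        return map(frozenset, chain.from_iterable(combinations(s, r) for r in range(len(s) + 1)))
    lower, upper = 0, float("inf")
    for X in subsets(I):
        if X == I:
            continue
        delta = sum((-1) ** (len(I - J) + 1) * supp[J]
                    for J in subsets(I) if X <= J and J != I)
        if len(I - X) % 2 == 1:
            upper = min(upper, delta)
        else:
            lower = max(lower, delta)
    return lower, upper

# Hypothetical subset supports for I = {a, b, c} in a 10-transaction database.
supports = {frozenset(): 10,
            frozenset("a"): 6, frozenset("b"): 7, frozenset("c"): 5,
            frozenset("ab"): 4, frozenset("ac"): 3, frozenset("bc"): 4}
print(tight_bounds("abc", supports))   # (2, 3): supp(abc) must lie in this interval;
                                       # if the bounds were equal, abc would be derivable
```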
The Importance of Distributed Data Mining
Data is often stored in inherently distributed databases.
Transferring data between sites is extremely costly and time consuming.
Due to security issues and data ownership, transferring the local data is often not permitted.
Distributed ARM Algorithms
Distributed Trie-based Frequent Itemset Mining (DTFIM) (Ansari et al. 2008):
Trie-based data structure.
Memory efficient.
Fast searching.
The more skewed the databases are, the more efficiently the algorithm performs.
My Contribution
Creating a concise representation by applying distributed deduction rules.
Applying the distributed deduction rules inside the DTFIM algorithm.
Distributed Deduction Rules
If |I \ X| is odd:
supp(I) ≤ ∑_{X ⊆ J ⊂ I} (−1)^(|I \ J| + 1) ∑_{1 ≤ i ≤ n} supp_i(J)
If |I \ X| is even:
supp(I) ≥ ∑_{X ⊆ J ⊂ I} (−1)^(|I \ J| + 1) ∑_{1 ≤ i ≤ n} supp_i(J)
where supp_i(J) is the local support count of J at site Si.
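A minimal sketch, assuming hypothetical per-site counts, of what the distributed form of the rules means in practice: each site contributes only its local subset counts supp_i(J), the counts are summed across the n sites, and the bounds are then evaluated on the global sums, so the raw transactions never leave their site.

```python
from collections import Counter

# Hypothetical local subset counts at n = 3 sites, for I = {a, b, c}.
local_supports = [
    {frozenset("a"): 3, frozenset("ab"): 2, frozenset("ac"): 1},
    {frozenset("a"): 2, frozenset("ab"): 1, frozenset("ac"): 1},
    {frozenset("a"): 1, frozenset("ab"): 1, frozenset("ac"): 1},
]

# Only these per-subset counts are exchanged; the global support is their sum.
supp = Counter()
for site_counts in local_supports:
    supp.update(site_counts)

# The rule with X = {a} (|I \ X| = 2, even) now gives a lower bound on the
# global support: supp(abc) >= supp(ab) + supp(ac) - supp(a).
lower = supp[frozenset("ab")] + supp[frozenset("ac")] - supp[frozenset("a")]
print(lower)   # 4 + 3 - 6 = 1
```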
The Proposed Algorithm (cont.)
Inputs:
DBi (i = 1, …, n): the database stored at each site Si.
iterationDepth: number of iterations
minSup: the support threshold
Output: The set of all globally large itemsets L.
Method: Execution of the following program fragment (for the k-th iteration) at the participating sites.
The Proposed Algorithm (cont.)
k := 1;
while k ≤ iterationDepth do {
    if k = 1 then
        TRi(1) := findLocalCandidate (DBi, 0, 1);
    else {
        candidateGen (TRi(k-1), NDL(k-1), CG(k), DL(k), DL(k-1));
        if DL(k-1) ≠ 0 then
            dFrequent (DL(k-1), NDL(k-1), DL(k));
        TRi(k) := findLocalCandidate (DBi, CG(k), k);
    }
    if CG(k) ≠ 0 then                              // if CG(k) is not empty
        TRi(k-1) := findNDFrequent (DBi, CG(k), k);
    passLocalCandidate (TRi(k));
    GLi(k) := getGlobalFrequent ();                // globally large k-itemsets
    updateLocalCandidates (TRi(k), GLi(k));        // prune the local candidates which are not globally large
    NDL(k) := ∪ⁿᵢ₌₁ GLi(k);
    k := k + 1;
}
L(k) := NDL(k) ∪ DL(k);
return L(k);
Candidate Set Generating Procedure
procedure candidateGen (TRi(k-1), NDL(k-1), CG(k), DL(k), DL(k-1))
    for all Z ∈ TRi(k-1) do {
        compute the [l, u] bounds of Z;
        if Z.sup = Z.l or Z.sup = Z.u then {
            prune Z from NDL(k-1) and TRi(k-1) and insert it into DL(k-1);
            if Z.sup = Z.l then
                Z.sup := Z.l;
            else
                Z.sup := Z.u;
        }
    }
    pCG(k) := ∪ⁿᵢ₌₁ CGi(k) = ∪ⁿᵢ₌₁ aprioriGen(NDLi(k-1));   // FDM candidate itemset generator
    for all Y ∈ pCG(k) do {
        compute the [l, u] bounds on the support of Y;
        if l ≠ u then {
            Y.l := l; Y.u := u; insert Y into CG(k);
        }
        else if u ≥ minSup then {
            insert Y into DL(k), delete it from NDLi(k-1) and TRi(k-1); Y.sup := u;
        }
    }
end procedure
Derivable Frequent Itemsets Procedure
procedure dFrequent (DL(k-1), NDL(k-1), DL(k))
    DCG(k) := aprioriGen2(DL(k-1), NDL(k-1));   // FDM Apriori candidate generator
    for all Z ∈ DCG(k) do {
        compute Z.sup;   // compute the support of Z
        if Z.sup ≥ minSup then
            insert Z into DL(k), delete it from NDLi(k-1) and TRi(k-1);
    }
end procedure
Explanation of the Proposed Algorithm
Developing the local 1-itemset vectors: the local DBs are scanned by their local sites independently; the 1-itemsets are determined and their local support counts are stored in the local vectors.
Global 1-itemsets: the support counts are exchanged among the sites and the globally large 1-itemsets are determined.
Initialising the local Tries: each local site initialises its local Trie based on the global 1-itemsets (see the sketch below).
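A minimal Python sketch of these first steps under assumed toy data (three hypothetical sites and an invented support-count threshold); it illustrates the idea rather than reproducing the thesis code.

```python
from collections import Counter

# Hypothetical local databases at three sites.
site_dbs = [
    [{"a", "b"}, {"a", "c"}, {"b", "c"}],
    [{"a", "b", "c"}, {"a", "d"}],
    [{"b", "c"}, {"a", "b"}],
]
min_sup = 4   # hypothetical global support-count threshold

# Step 1: each site scans only its own DB and builds its local 1-itemset count vector.
local_vectors = [Counter(item for t in db for item in t) for db in site_dbs]

# Step 2: only these count vectors are exchanged; the global counts are their sum.
global_counts = sum(local_vectors, Counter())

# Step 3: the globally large 1-itemsets seed each site's local Trie.
global_1_itemsets = {item for item, n in global_counts.items() if n >= min_sup}
print(global_1_itemsets)   # {'a', 'b', 'c'}: 'd' appears only once globally and is pruned
```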
Explanation of the Proposed Algorithm
Production of the global frequent k-itemsets (k ≥ 2): the local candidate 2-itemsets are generated from the local Tries and stored in a two-dimensional array.
Applying the deduction rules: each site applies the deduction rules to its local candidate 2-itemsets; the derivable itemsets are removed from the list of non-derivable local candidate 2-itemsets.
Globally large 2-itemsets: the support counts of the non-derivable locally frequent 2-itemsets are exchanged among the sites, and the global support counts of the 2-itemsets are determined.
Updating the local Tries: the local sites update their Tries by inserting the globally large 2-itemsets (a sketch of the candidate counting step follows).
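A small hypothetical sketch of the candidate 2-itemset counting step at one site: pairs of globally large 1-itemsets are counted in a two-dimensional (upper-triangular) array, which is the structure the deduction rules and the global exchange then operate on.

```python
from itertools import combinations

# Assumed output of the previous step, and one site's hypothetical local DB.
global_1_itemsets = ["a", "b", "c"]
index = {item: i for i, item in enumerate(global_1_itemsets)}
local_db = [{"a", "b"}, {"a", "c"}, {"b", "c"}]

# Two-dimensional array of local candidate 2-itemset counts.
counts = [[0] * len(global_1_itemsets) for _ in global_1_itemsets]
for t in local_db:
    present = sorted(item for item in t if item in index)
    for x, y in combinations(present, 2):
        counts[index[x]][index[y]] += 1   # upper-triangular count of {x, y}

print(counts)   # [[0, 1, 1], [0, 0, 1], [0, 0, 0]]
```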
Conclusion
This research addresses the problem of mining market basket data in distributed environments by presenting an algorithm that reduces the number of candidate itemsets and simplifies the process of producing interesting customer preferences and patterns.
References
Ansari, E, Dastghaibifard, GH, Keshtkaran, M & Kaabi, H 2008, 'Distributed Frequent Itemset Mining using Trie Data Structure', IAENG International Journal of Computer Science, vol. 35, no. 3, pp. 377-381.
Calders, T & Goethals, B 2007, 'Non-derivable itemset mining', Data Mining and Knowledge Discovery, vol. 14, pp. 171-206.
Han, J & Kamber, M 2006, Data Mining: Concepts and Techniques, 2nd edn, Morgan Kaufmann, United States of America.
Kantardzic, M 2003, Data Mining: Concepts, Models, Methods, and Algorithms, John Wiley & Sons, United States of America.
Thank you