

International Journal of Advanced Research in Engineering and Technology (IJARET), Volume 8, Issue 1, January–February 2017, pp. 78–85, Article ID: IJARET_08_01_008

Available online at http://www.iaeme.com/IJARET/issues.asp?JType=IJARET&VType=8&IType=1

ISSN Print: 0976-6480 and ISSN Online: 0976-6499

© IAEME Publication

COMPARATIVE STUDY OF DISTRIBUTED FREQUENT PATTERN MINING ALGORITHMS FOR BIG SALES DATA

Dinesh J. Prajapati

Research Scholar, Department of Computer Science & Engineering, Institute of Technology, Nirma University, Ahmedabad, India

ABSTRACT

Association rule mining plays an important role in decision support systems. In the era of the internet, online marketing sites and social networking sites generate an enormous amount of structured and semi-structured data in the form of sales records, tweets, emails, web pages and so on. This online-generated data is so large that it becomes very complex and time-consuming to process and analyze with traditional systems. This paper addresses the main-memory bottleneck of a single computing system and has two major goals. First, the big sales dataset of AMUL dairy is preprocessed using Hadoop MapReduce to convert it into a transactional dataset. Then, after removing the null transactions, the distributed frequent pattern mining algorithm MR-DARM (MapReduce based Distributed Association Rule Mining) is used to find the frequent itemsets, and strong association rules are generated from them. The paper also compares the time efficiency of the MR-DARM algorithm with the existing Count Distribution Algorithm (CDA) and Fast Distributed Mining (FDM) algorithm. The compared algorithms are presented together with experimental results that lead to the final conclusions.

Key words: Association rule, distributed frequent pattern mining, Hadoop, MapReduce.

Cite this Article: Dinesh J. Prajapati, Comparative Study of Distributed Frequent Pattern Mining Algorithms for Big Sales Data. International Journal of Advanced Research in Engineering and Technology, 8(1), 2017, pp. 78–85.

http://www.iaeme.com/IJARET/issues.asp?JType=IJARET&VType=8&IType=1

1. INTRODUCTION

The purpose of data mining is to extract useful information and patterns as part of the knowledge discovery process. One of the techniques used in data mining is association rule mining, the task of uncovering relationships in data. It is a popular model in the retail sales industry, where a company is interested in identifying items that are frequently purchased together. An association rule is expressed in the form X ⇒ Y, where X and Y are itemsets; the rule expresses the relationship between the itemset X and the itemset Y. The interestingness of the rule X ⇒ Y is measured by its support and confidence [1, 2]. The rule X ⇒ Y has minimum support min_sup if at least min_sup percent of the transactions contain X ∪ Y, and it holds with minimum confidence min_conf if at least min_conf percent of the transactions that contain X also contain Y [3, 4]. The association rule mining process basically consists of two steps: (i) finding all the frequent itemsets that satisfy the minimum support


threshold, and (ii) generating strong association rules from the derived frequent itemsets. Big data is the term for a collection of data sets so large and complex that they are difficult to process using traditional data processing tools [5].
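To make the support and confidence definitions above concrete, the following minimal Python sketch computes both measures for a candidate rule over a toy transaction list; the item names and transactions are hypothetical and are not taken from the paper's dataset.

# Toy transactional data (hypothetical items, for illustration only).
transactions = [
    {"milk", "butter", "cheese"},
    {"milk", "butter"},
    {"milk", "yogurt"},
    {"butter", "cheese"},
]

def support(itemset, transactions):
    # Fraction of transactions containing every item of `itemset`.
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y, transactions):
    # Confidence of the rule X => Y: support(X union Y) / support(X).
    return support(set(x) | set(y), transactions) / support(x, transactions)

# Rule {milk} => {butter}: support = 2/4 = 0.5, confidence = 2/3 ~ 0.67.
print(support({"milk", "butter"}, transactions))
print(confidence({"milk"}, {"butter"}, transactions))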

In brief, the contribution of this paper is summarized in three steps: (i) first, the distributed frequent itemset mining algorithms CDA, FDM and MR-DARM are used to generate the complete set of frequent itemsets and the results are compared; (ii) the proposed framework mines not only frequent itemsets but also distributor's sales association rules in the transactional dataset, so that total sales can be analyzed per distributor; and (iii) finally, based on user-defined thresholds, the complete set of strong distributor's sales association rules is generated along with the interesting patterns. The CDA, FDM and MR-DARM distributed frequent mining algorithms are tested on the sales dataset of AMUL Dairy.

The remainder of the paper is organized as follows. Related work is given in Section 2. Section 3 describes the proposed methodology. In Section 4, the performance of the CDA, FDM and MR-DARM algorithms is evaluated on the sales dataset of AMUL dairy. Finally, the conclusion and future scope are drawn in Section 5.

2. RELATED WORK

The authors in [6] examine big data characteristics such as heterogeneity and autonomy, and propose a theorem that characterizes the features of both the big data revolution and the big data processing model; they also analyze the challenging issues in the data mining model and in big data analysis. The authors in [7] give insight into big data mining infrastructures through an analysis of Twitter. Two major topics are discussed: first, schemas alone are insufficient for understanding petabytes or terabytes of data; second, a major challenge in analyzing such data is the heterogeneity of its various components. The objective of that paper is to share the authors' experience of analyzing Twitter data in a production environment. The authors in [8] propose an optimized distributed association rule mining approach that reduces the communication cost for geographically distributed data; both communication and computation time are considered to achieve an improved response time, and the performance analysis is based on the scalability of processors in a distributed environment. The authors in [9] propose a distributed trie-based algorithm (DTFIM) to find frequent itemsets, building on Bodon's algorithm for a distributed computing environment with no shared memory; the proposed algorithm is then refined with ideas from other frequent pattern mining algorithms. The authors in [10] propose a distributed system for mining transactional datasets using an improved MapReduce framework, implementing an "Associated-Correlated-Independent" algorithm to find the complete set of customers' purchase patterns along with the correlated, associated, associated-correlated, and independent purchase patterns.

The PARMA algorithm proposed in [11] greatly improves the runtime of finding association rules; it achieves this by using probabilistic results, so it only approximates the answers. Another statistical approach is presented in [12]. This solution uses clustering to create groups of transactions and chooses candidate sets from the representative itemsets in the clusters. The authors in [13] present an improved version of the frequent itemset mining algorithm as well as its generalized version. They introduce optimized formulas for generating valid candidates by reducing the number of invalid candidates, and by reusing the computations of previous steps performed on other processing nodes the algorithm avoids generating redundant candidates. The authors also suggest running the same algorithm on a parallel or distributed system. The Count Distribution Algorithm (CDA) [14] provides a fundamental distributed association rule mining algorithm. Each node counts the candidate itemsets locally; these count values are stored in the local database, and incoming count values are maintained there as well. All computing nodes execute the Apriori algorithm locally and, after reading the count values from the local database, broadcast their respective counts to the remaining nodes. Each node can then generate the new candidate itemsets based on the global counts. The FDM (Fast Distributed Mining) algorithm [15] generates candidate sets in a manner similar to Apriori. An interesting property of locally and globally frequent itemsets is used to generate a reduced set of candidates in each iteration, so the


number of messages exchanged between the nodes is reduced. Once the candidate sets are generated, local reduction and global reduction techniques are applied to eliminate some candidate sets at each site.

In big data analysis, mining long patterns is particularly important for transactional databases with a large number of unique items. However, none of the above-mentioned work deals with the problem of data transformation and elimination of null transactions using MapReduce. Therefore, transforming the data, finding the null transactions and eliminating them from further consideration is the initial part of the proposed methodology. After removing the null transactions, a distributed frequent mining algorithm is applied to generate useful patterns. The existing CDA and FDM algorithms generate large candidate sets, use more message passing, and have higher execution times when mining big data. The MR-DARM algorithm addresses these drawbacks of CDA and FDM and generates useful patterns. The objective of this work is to remove the drawbacks of the relational database and to use the existing MapReduce framework to generate the complete set of frequent itemsets with smaller candidate set generation, less message passing, and an improvement in the execution time of the system.
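As an illustration of this preprocessing idea (the paper does not list its code), here is a hedged Python sketch of a Hadoop Streaming reducer that groups items by transaction id and drops null (empty) transactions; the tab-separated "<transaction_id>\t<item>" input format emitted by an upstream mapper is an assumption.

#!/usr/bin/env python
# Reducer-side sketch: group items by transaction id and drop null transactions.
# Assumes the mapper emitted "<transaction_id>\t<item>" lines and that Hadoop
# Streaming has already sorted them by key; the field layout is hypothetical.
import sys

def emit(tid, items):
    if items:                        # skip null (empty) transactions
        print(f"{tid}\t{','.join(sorted(items))}")

current_tid, items = None, set()
for line in sys.stdin:
    tid, _, item = line.rstrip("\n").partition("\t")
    if tid != current_tid:
        if current_tid is not None:
            emit(current_tid, items)
        current_tid, items = tid, set()
    if item:                         # an empty value marks a record with no item sold
        items.add(item)

if current_tid is not None:
    emit(current_tid, items)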

3. PROPOSED METHODOLOGY

The CDA and FDM algorithms are data-parallel algorithms [15]. In the CDA algorithm, the dataset is divided into n partitions and each partition is given to a separate node. Each node counts the candidates and then broadcasts its counts to the remaining nodes, so every node can determine the global counts. The global counts are used to determine the large itemsets and to generate the candidates for the next iteration. In the FDM algorithm, the candidate set is generated as in the Apriori algorithm. To reduce the number of candidates at each iteration, locally and globally frequent itemsets are used, which reduces the number of messages exchanged between nodes. Once the candidate sets are generated, local reduction and global reduction techniques are applied at each site to eliminate redundant candidate sets. The main drawback of the CDA and FDM algorithms is that both generate large candidate sets, use more message passing, and have higher execution times when mining big data. These drawbacks can be addressed with MapReduce, so a new approach is developed.
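To make the count-distribution idea concrete, the following minimal single-process Python sketch mimics it; the partitions, items, and the shortcut of enumerating all item pairs as candidates are illustrative assumptions, since real CDA nodes run Apriori locally and exchange counts over the network.

from collections import Counter
from itertools import combinations

# Hypothetical partitions of a transactional dataset, one per node.
partitions = [
    [{"milk", "butter"}, {"milk", "cheese"}],
    [{"milk", "butter", "cheese"}, {"butter"}],
]

def local_counts(partition, candidates):
    # Each node counts every candidate itemset against its own partition.
    counts = Counter()
    for t in partition:
        for c in candidates:
            if c <= t:
                counts[frozenset(c)] += 1
    return counts

# Candidate 2-itemsets (here simply all pairs of known items).
items = sorted(set().union(*[t for p in partitions for t in p]))
candidates = [set(pair) for pair in combinations(items, 2)]

# "Broadcast" step: summing the local counts gives the global counts from
# which each node derives the frequent itemsets of this iteration.
global_counts = Counter()
for p in partitions:
    global_counts.update(local_counts(p, candidates))

min_sup_count = 2
frequent = [set(k) for k, v in global_counts.items() if v >= min_sup_count]
print(frequent)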

The MR-DARM algorithm is used to find frequent itemsets from the actual transactional dataset. Once the actual transactional dataset is stored in HDFS, the entire dataset is split into smaller segments and each segment is transferred to a data node. The map function is executed on each data segment and produces <key, value> pairs for each record of the database. The MapReduce framework groups all <key, value> pairs that have the same item and calls the reducer function with the corresponding value list to generate candidate itemsets. In each database scan, the map function generates local candidate itemsets, and the reduce function generates global counts by adding the local count values. The overall computation requires multiple MapReduce iterations; each iteration produces a set of frequent itemsets, and the iterations continue until no further frequent itemsets are found. The reduce function adds up all the values produced by the mapper and generates the count for each candidate itemset. The main advantage of this approach is that the nodes do not exchange data, but only the count values. The MR-DARM algorithm uses the notation Ck for the set of candidate k-itemsets and Lk for the set of frequent k-itemsets, as shown in Fig. 1.
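A minimal sketch of one counting pass of the Map and Reduce functions of Fig. 1 below, written as Hadoop Streaming scripts in Python; this is an illustration under assumed file formats, not the author's implementation, and the candidate file name candidates_k.txt is hypothetical.

#!/usr/bin/env python
# mapper.py -- one MR-DARM counting pass: emit <candidate itemset, 1> for every
# candidate contained in the incoming transaction (one transaction per line,
# items separated by commas).
import sys

# Candidates of the current iteration, distributed to every mapper
# (for example via the Hadoop distributed cache); the file name is hypothetical.
with open("candidates_k.txt") as f:
    candidates = [frozenset(line.split(",")) for line in f.read().splitlines() if line]

for line in sys.stdin:
    transaction = set(line.rstrip("\n").split(","))
    for c in candidates:
        if c <= transaction:
            print(f"{','.join(sorted(c))}\t1")

#!/usr/bin/env python
# reducer.py -- sum the 1s per candidate and keep only the itemsets whose global
# count reaches the minimum support count (passed via an environment variable).
import os
import sys

MIN_SUP = int(os.environ.get("MIN_SUP", "2"))

current_key, count = None, 0
for line in sys.stdin:
    key, _, value = line.rstrip("\n").partition("\t")
    if key != current_key:
        if current_key is not None and count >= MIN_SUP:
            print(f"{current_key}\t{count}")
        current_key, count = key, 0
    count += int(value)

if current_key is not None and count >= MIN_SUP:
    print(f"{current_key}\t{count}")

Each MR-DARM iteration would then correspond to one Hadoop Streaming job run with these two scripts, with a small driver feeding the surviving itemsets of round k back in as the candidate file of round k + 1.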

Input: Transactional Database in HDFS (D), Minimum Support Threshold (min_sup)
Output: Frequent Itemsets (L)
Method:
    L1 = find frequent 1-itemsets from D.
    For each frequent k-itemset do
        Ck = Lk-1 ⋈ Lk-1.    // generates the candidate itemsets
        Ct = Map().          // generates the itemset occurrences
        Lk = Reduce().       // gets the subset of frequent itemsets
    L = ∪k Lk.

Map Function:
Input: Set of transactions (Ti)
Output: <candidate itemset, value>
Method:
    For each transaction Ti ∈ D do
        For each itemset Ii in the candidate itemset Ck do
            If (Ii ∈ Ti) then
                generate the output <Ii, 1> as a <key, value> pair.

Reduce Function:
Input: <candidate itemset, list>
Output: <frequent itemset, support_count>
Method:
    count = 0.
    For each number in list do
        count += number.
    If (count >= min_sup) then
        generate the output <frequent itemset, count> as a <key, value> pair.

Figure 1 The MR-DARM Algorithm

The transactional data is given as input to the Mapper line by line. Each line is split into items, and the output <key, value> pair consists of the item and the value 1; this is the local frequency of the item. The reduce task starts with the itemsets of length 1 and generates candidates of length 2. During step k of the algorithm, it starts with itemsets of length k and generates candidate itemsets of length k + 1. If the reduce task cannot generate larger candidate itemsets, the whole computation stops. Frequent itemsets are calculated for different values of the minimum support threshold, and the support decision system checks for the appropriate support count value for generating strong association rules.

3.1. Association Rule Generation

The output of the distributed frequent mining algorithm is the set of frequent itemsets, which is given as input to the association rule generator module to generate the strong association rules that satisfy the minimum confidence threshold. Association rules are generated as follows [16]:

• For each frequent itemset l, generate all non-empty subsets of l.

• For every non-empty subset s of l, output the rule "s ⇒ (l − s)" if (support(l) / support(s)) >= min_conf, where min_conf is the minimum confidence threshold.

Since the rules are generated from frequent itemsets, each rule automatically satisfies minimum support.
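A compact Python sketch of this rule-generation step; it is an illustration only, and support_of is assumed to be a dictionary mapping each frequent itemset (and its subsets) to its support count, as produced by the mining phase.

from itertools import combinations

def generate_rules(support_of, min_conf):
    # Yield (antecedent, consequent, confidence) for every strong rule.
    for l, sup_l in support_of.items():
        if len(l) < 2:
            continue
        for r in range(1, len(l)):
            for s in map(frozenset, combinations(l, r)):
                conf = sup_l / support_of[s]      # confidence of s => (l - s)
                if conf >= min_conf:
                    yield s, l - s, conf

# Hypothetical output of the frequent itemset mining phase.
support_of = {
    frozenset({"milk"}): 3,
    frozenset({"butter"}): 3,
    frozenset({"milk", "butter"}): 2,
}
for s, t, conf in generate_rules(support_of, min_conf=0.6):
    print(f"{set(s)} => {set(t)}  (conf = {conf:.2f})")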

4. EXPERIMENTAL SETUP & RESULTS

For the experiments, a cluster of four desktop machines with i5 processors and 4 GB DDR-3 RAM is used. The Ubuntu 12.04 LTS operating system is installed on all four computers. Since a JVM is not usually part of Ubuntu 12.04, a JVM is also installed on all four machines. Using the Apache Hadoop packages, a multi-node cluster is configured on three of the computers and a single-node cluster is configured on the remaining computer.

For this experiment, the sales database of AMUL dairy, with more than 1500 different dairy products and a total size of 5 GB, is used. In the dairy dataset, sales of the dairy products follow a concept hierarchy: the product is first sent to the distributor, which in turn distributes the product to the retailers, and finally the retailers sell the dairy products to the customers.

4.1. Comparative Study of the CDA, FDM and MR-DARM Algorithms

After transforming the transactional dataset into the actual transactional dataset, the actual transaction file is given as input to the frequent pattern mining algorithms to find the frequent itemsets. The CDA, FDM and MR-DARM algorithms are run on the AMUL dataset for database sizes of 256 MB, 512 MB, 1 GB, 2 GB and 5 GB using single-node, two-node and three-node clusters with a minimum support threshold of 1%; the results are shown in Figures 2, 3 and 4, respectively. The results show that the performance of the algorithms depends on the number of nodes and the size of the dataset. For a 5 GB dataset on a single node, the execution times of the CDA, FDM and MR-DARM algorithms are 5670 seconds, 3680 seconds and 373 seconds respectively, while the same dataset distributed over a three-node cluster gives execution times of 3490 seconds, 2280 seconds and 269 seconds respectively. So, to obtain comparatively small execution times, the number of nodes must be increased as the database size grows. It is noticeable that the performance of the algorithms improves as the number of nodes increases, and the proposed algorithm performs much better than CDA as well as FDM when the dataset is large.

Figure 2 Dataset Size Vs Execution Time for Single Node Cluster with 1% Minimum Support


Figure 3 Dataset Size Vs Execution Time for Two Node Cluster with 1% Minimum Support

Figure 4 Dataset Size Vs Execution Time for Three Node Cluster with 1% Minimum Support

5. CONCLUSION AND FUTURE SCOPE

HDFS and MapReduce play a very important role in handling and analyzing large datasets. However, most algorithms are limited by processing speed. In this paper, a Hadoop-based distributed approach is presented which processes the transactional dataset in partitions and transfers the tasks to all participating nodes. The purpose is to reduce inter-node message passing in the cluster. In preprocessing with Hadoop MapReduce, it has been observed that as the number of reducers increases, the execution time decreases significantly. The experimental results show that the parallel processing task scales linearly with the number of nodes and the size of the dataset. In this paper, the MR-DARM algorithm is implemented to find distributed frequent itemsets. As the number of nodes is increased, the performance improves considerably, even with a lower minimum support threshold and a large database size. The proposed algorithm generates a smaller candidate set and uses less message passing than the CDA and FDM algorithms, so its execution time is lower than theirs. The proposed algorithm is thus a more flexible, scalable and efficient distributed frequent pattern mining algorithm for mining large data.



The time efficiency of the algorithm may be improved by using FP-tree based data structures for the candidate itemset generation.

6. ACKNOWLEDGEMENTS

The author takes this opportunity to thank all the researchers in the domain of big data analysis for their immense knowledge and kind support throughout the work, and the institute for its resources and constant inspiration. Special thanks go to the authorities of AMUL dairy, located in Anand district, for providing the sales dataset. Finally, heartfelt thanks to family and friends for their encouragement in making this work a success.

REFERENCES

[1] Srikumar, K. and Bhasker, B. 2005. Metamorphosis: Mining Maximal Frequent Sets in Dense Domains. Int. Journal of Artificial Intelligence Tools, Vol. 14, Issue 3, 491-506.

[2] Agrawal, R., Imielinski, T. and Swami, A. 1993. Mining association rules between sets of items in large databases. Proc. Int. Conf. of ACM-SIGMOD on Management of Data, 207-216.

[3] Olson, D. L. and Delen, D. 2008. Advanced Data Mining Techniques. Springer.

[4] Han, J. and Kamber, M. 2004. Data Mining Concepts & Techniques. San Francisco: Morgan Kaufmann Publishers.

[5] Agrawal, D., Das, S. and Abbadi, A. 2011. Big data and cloud computing: current state and future opportunities. Proc. 14th Int. Conf. Extending Database Technology, ACM, 530-533.

[6] Wu, X., Zhu, X., Wu, G. and Ding, W. 2013. Data Mining with Big Data. IEEE Transactions on Knowledge and Data Engineering, Vol. 26, Issue 1, 97-107.

[7] Lin, J. and Ryaboy, D. 2013. Scaling big data mining infrastructure: the Twitter experience. ACM SIGKDD Explorations Newsletter, 14, 6-19.

[8] Mottalib, M. A., Arefin, K. S., Islam, M. M., Rahman, M. A. and Abeer, S. A. 2011. Performance Analysis of Distributed Association Rule Mining with Apriori Algorithm. Int. Journal of Computer Theory and Engineering, Vol. 3, No. 4, 484-488.

[9] Ansari, E., Dastghaibifard, G. H., Keshtkaran, M. and Kaabi, H. 2008. Distributed Frequent Itemset Mining using Trie Data Structure. Int. Journal of Computer Science (IJCS).

[10] Karim, M. R., Ahmed, C. F., Jeong, B. and Choi, H. 2013. An Efficient Distributed Programming Model for Mining Useful Patterns in Big Datasets. IETE Technical Review, Vol. 30, Issue 1, 53-63.

[11] Riondato, M., DeBrabant, J. A., Fonseca, R. and Upfal, E. 2012. PARMA: A parallel randomized algorithm for approximate association rules mining in MapReduce. Proc. 21st Int. Conf. on Information and Knowledge Management (CIKM '12), ACM, USA, 85-94.

[12] Malek, M. and Kadima, H. 2013. Searching frequent itemsets by clustering data: Towards a parallel approach using MapReduce. Web Information Systems Engineering WISE 2011 and 2012, Springer, Berlin Heidelberg, 7652, 251-258.

[13] Butincu, C. N. and Craus, M. 2015. An improved version of the frequent itemset mining algorithm. Proc. 14th IEEE Int. Conf. Networking in Education and Research, Craiova, 184-189.

[14] Agrawal, R. and Shafer, J. C. 1996. Parallel mining of association rules. IEEE Trans. on Knowledge and Data Engineering, 8, 962-969.

[15] Cheung, D. W., Han, J., Ng, V. T. and Fu, A. W. 1996. A fast distributed algorithm for mining association rules. Proc. 4th IEEE Int. Conf. Parallel and Distributed Information Systems, 31-42.


[16] Ban, T., Eto, M., Guo, S., Inoue, D., Nakao, K. and Huang, R. 2015. A study on association rule mining of darknet big data. Proc. IEEE Int. Joint Conf. on Neural Networks (IJCNN), 1-7.

[17] Mudra Doshi and Bidisha Roy, Efficient Processing of Ajax Data Using Mining Algorithms. International Journal of Computer Engineering and Technology (IJCET), 5(8), 2014, pp. 48-54.

[18] Aruna J. Chamatkar and P. K. Butey, Performance Analysis of Data Mining Algorithms with Neural Network. International Journal of Computer Engineering and Technology, 6(1), 2015, pp. 01-11.