Journal of Advanced Computer Science and Technology Research, Vol.3 No.1, March 2013, 1-20
Large Dataset Compression Approach
Using Intelligent Technique
Ahmed Tariq Sadiq1,a, Mehdi G. Duaimi2,b, Rasha
Subhi Ali2,c 1Computer Science Department, University of Technology,
Baghdad, Iraq 2Computer Science Department, Baghdad University,
Baghdad, Iraq [email protected],
[email protected],[email protected]
Article Info
Received: 7th January 2013
Accepted: 28th January 2013
Published online: 1st March 2013
ISSN: 2231-8852 © 2013 Design for Scientific Renaissance All rights reserved
ABSTRACT Data clustering is a process of putting similar data into groups. A clustering algorithm partitions a data
set into several groups such that the similarity within a group is higher than the similarity among groups.
Association rule mining is one of the possible methods for the analysis of data. Association rule
algorithms generate a huge number of association rules, many of which are redundant. The main idea of this
paper is to compress large databases by using clustering techniques together with association rule algorithms. In
the first stage, the database is compressed by using a clustering technique, followed by an association rule
algorithm: an adaptive k-means clustering algorithm is proposed together with the apriori algorithm. In many
experiments, using the adaptive k-means algorithm and the apriori algorithm together gives a better
compression ratio and a smaller compressed file size than using either algorithm alone. Experiments
were made on databases of several different sizes. The apriori algorithm increases the compression ratio of the adaptive k-
means algorithm when they are used together, but it takes more compression time than the adaptive k-
means algorithm does. These algorithms are presented and their results are compared.
Keywords: Association rule, clustering techniques and compression algorithms.
1. Introduction
Compression is the art of representing the information in a compact form rather than its
original or uncompressed form (Pu, 2006). In other words, using the data compression, the
size of a particular file can be reduced. This is very useful when processing, storing or
transferring a huge file, which needs lots of resources (Kodituwakku,et al., 2007). Data
compression is widely used in data management to save storage space and network
bandwidth (Goetz, et al.,1991). In computer science and information theory, data
compression, source coding, or bit-rate reduction involve encoding information using fewer
bits than the original representation. Compression can be either lossy or lossless. Lossless
compression reduces bits by identifying and eliminating statistical redundancy (eliminate
unwanted redundancy). Compression is useful because it helps reduce the consumption of
resources such as storage space or transmission capacity. Because compressed data must be
decompressed to be used, this extra processing imposes computational or other costs during
decompression. In lossy data compression, some loss of information is acceptable: depending
upon the application, details can be dropped from the data to save storage space. Lossy data
compression is used in image, audio, and video compression (Graham, 1994). Lossless data compression is
contrasted with lossy data compression. Lossless data compression has been suggested for
many space science exploration mission applications either to increase the scientific return or
to reduce the requirement for on-board memory, station contact time, and data archival
volume. A lossless compression technique guarantees full reconstruction of the original data
without incurring any distortion in the process. A lossless data compression technique
preserves the source data accuracy by removing redundancy from the application source data.
In the decompression process the original source data are reconstructed from the compressed
data by restoring the removed redundancy. The reconstructed data is an exact replica of the
original source data. The amount of redundancy removed from the source data is variable and
is highly dependent on the source data statistics, which are often non-stationary (Report
Concerning Space Data System Standards, 2006). In this paper two intelligent techniques
(clustering techniques and association rules) are used to compress large data sets. Clustering
is a division of data into groups of similar objects. Each group, called a cluster, consists of
objects that are similar among themselves and dissimilar to the objects of other groups
(Berkhin, 2002). Association rule mining plays a major role in the process of mining data
for frequent patterns. It involves discovering the unknown interdependences of the data and finding rules among those
items (Neeraj, et al., 2012). The main objective of this work is to discuss intelligent
techniques to compress large data sets.
This paper is organized as follows. Section two reviews related work, Section three
gives some terminology on association rules and the apriori algorithm, Section four explains
major clustering techniques and k-means algorithm, Section five explains the methodology of
the compression algorithms and the decompression algorithms, Section six presents some
results from experiments, and Section seven concludes with a discussion.
2. Related Work
Sonia Dora (Jacob, et al., 2012) focuses on lossless compression for relational
databases at the attribute level: the proposed technique is applied at the attribute level by
compressing three types of attribute (string, integer and date type), and its most interesting
feature is that it automatically identifies the type of each attribute. I Made Agus Dwi Suarjaya
(Suarjaya, 2012) proposes a new algorithm for data compression, called j-bit encoding
(JBE). This algorithm manipulates each bit of data in a file to minimize the size without
losing any data after decoding, which classifies it as lossless compression. Heba Afify,
Muhammad Islam and Manal Abdel Wahed (Afify, et al., 2011) present a differential
compression algorithm that is based on the production of difference sequences according to an
op-code table in order to optimize the compression of homologous sequences in the dataset.
István Szépkúti (Szépkúti, 2004) introduces a new method called difference sequence
compression. Under some conditions, this technique is able to create a smaller size
multidimensional database than others like single count header compression, logical position
compression or base-offset compression.
3. Association Rules
Association rule mining is one of the most important and well researched techniques of
data mining. It was first introduced by Agrawal, Imielinski, and Swami (Agrawal, et
al., 1997). The discovery of "association rules" in databases may provide useful background
knowledge to decision support systems, selective marketing, financial forecasting, medical
diagnosis, and many other applications (Yijun, et al., 2000). Mining association rules is an
important data mining problem. Association rules are usually mined repeatedly in different
parts of a database. Current algorithms for mining association rules work in two steps.
1. Discover the large itemsets, i.e. the sets of items that have support above a
predetermined minimum support σ.
2. Use the large itemsets to generate the association rules for the database.
It is noted that the overall performance of mining association rules is determined by the first
step, which usually requires repeated passes over the analyzed database. After the large
itemsets are identified, the corresponding association rules can be derived in a
straightforward manner (Saad et al., 2010).
3.1 Association rules concept
An association rule is a simple probabilistic statement about the co-occurrence of certain
events in a database, and is particularly applicable to sparse transaction data sets (Hand et al.,
2001). An association rule is a rule which infers certain association relationships among a set
of objects (such as objects which occur together, or one object implying the other). In a database,
association rule mining works as follows (Adriaans et al., 1998):
Let I be a set of items and D a database of transactions, where each transaction has a
unique identifier (tid) and contains a set of items called an item set. An itemset with k items
is called a k-itemset. The support of an itemset X denoted S (X) is the number of transactions
in which that itemset occurs as a subset. A k-subset is a k- length subset of an itemset. An
itemset is frequent or large if its support is more than a user-specified minimum support
(min_sup) value. Fk is the set of frequent k-itemsets. A frequent itemset is maximal if it is not
a subset of any other frequent itemset. An association rule is an expression A → B, where A
and B are itemsets. The rule's support (S) is the joint probability of a transaction containing
both A and B, and is given as S(A ∪ B). The confidence of the rule is the conditional
probability that a transaction contains B, given that it contains A, and is given as S(A ∪ B)/S(A).
A rule is frequent if its support is greater than min_sup and strong if its confidence is
more than a user-specified minimum confidence (min_conf). Data mining involves
generating all association rules in the database that have a support greater than min_sup (the
rules are frequent) and that have a confidence greater than min_conf (the rules are strong)
(Saad et al., 2010). The important measures for association rules are support (S) and
confidence (C). They can be defined as:
Definition 1: Support (S)
Support(X → Y) = Pr(X ∪ Y) = count of (X ∪ Y) / Total transactions ……………… (1)
The support (S) of an association rule is the ratio (in percent) of the records that contain
(X ∪ Y) to the total number of records in the database. Therefore, if we say that the support of
a rule is 5%, it means that 5% of the total records contain (X ∪ Y) (Brin, et al., 1997).
Definition 2: Confidence (C)
Conf(X → Y) = Pr(X ∪ Y)/Pr(X) = support(X, Y)/support(X) ……………… (2)
For a given number of records, confidence (C) is the ratio (in percent) of the number of
records that contain (X ∪ Y) to the number of records that contain X. Thus, if we say that a
rule has a confidence of 85%, it means that 85% of the records containing X also contain Y.
The confidence of the rule refers to the degree of correlation in the database between X and
Y. Confidence is also a measure of a rule's strength. Mining consists of finding all rules that
meet the user-specified support and confidence thresholds (Brin, et al., 1997). As there are two
thresholds, we need two processes to mine the rules. The first step is to get the large itemsets.
It finds all the itemsets whose supports are larger than the support threshold. An itemset is the
set of the items. Based on the large itemsets, we can generate the rules from the large
itemsets, which is the second step. Rules that satisfy both a minimum support threshold
(min_sup) and a minimum confidence threshold (min_conf) are called strong (Han, et al.,
2001). An association rule-mining problem is broken down into two steps: 1) generate all
the item combinations (itemsets) whose support is greater than the user specified minimum
support. Such sets are called the frequent itemsets and 2) use the identified frequent itemsets
to generate the rules that satisfy a user-specified confidence. The frequent itemset generation
requires more effort, while the rule generation is straightforward (Kona, 2003).
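The support and confidence measures of Equations (1) and (2) can be sketched in a few lines of Python; the transactions below are hypothetical, chosen only to illustrate the two formulas.

```python
# Minimal sketch of support and confidence (Equations 1 and 2).
# The transaction data are hypothetical.
def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(A ∪ B) / support(A) for the rule A -> B."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
print(support({"bread", "milk"}, transactions))       # 0.5  (2 of 4 transactions)
print(confidence({"bread"}, {"milk"}, transactions))  # 0.5 / 0.75 ≈ 0.667
```

With min_sup = 40% and min_conf = 60%, the rule bread → milk would therefore be both frequent and strong in this toy data.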
4. Unsupervised learning
Unsupervised learning is a branch of machine learning where the aim is to
find patterns in raw, unlabeled data. This is unlike supervised learning, where the algorithm is
first trained on labeled data so that it can learn and adapt to a particular problem
(classification). In unsupervised learning, the general, inherent properties of the data are used
to find patterns and organize it (clustering) (Figueroa, 2011).
4.1 Clustering
Clustering is the process of examining a collection of “points,” and grouping the points
into “clusters” according to some distance measure (Rajaraman, et al., 2011). The goal of
clustering is to define clusters of elements in the dataset such that the elements in the same
cluster are similar to each other, and each cluster as a whole is distant from the others. Two
important classes of clustering can be distinguished. 1) Hierarchical clustering: these
techniques can be either agglomerative or divisive. In agglomerative clustering, we start by
assigning each element to a different class; successive iterations of the algorithm cluster
together the closest classes, until all the elements belong to the same, main class. Divisive
methods instead perform divisions of the classes into smaller ones. 2) Partitional
clustering: the concept of partitional clustering is to divide the data immediately into a certain
number of clusters. The most popular algorithm is k-means, which will be presented in the
next section (Figueroa, 2011).
4.1.1. Partitioning Methods
The partitioning methods generally result in a set of M clusters, with each object belonging
to one cluster. Each cluster may be represented by a centroid or a cluster representative; this
is a sort of summary description of all the objects contained in the cluster. The k-means
method is an example of partitional clustering (Rai, et al., 2010).
K-means
The k-means algorithm was proposed independently in various scientific fields over 50
years ago. MacQueen was the first, in 1967, to use the name k-means, for his one-pass version of the
algorithm, in which he defined the first k elements of the dataset as the k classes and
successively assigned each subsequent element to the closest class, updating the centroid after each
assignment (Figueroa, 2011). The standard k-means algorithm is considered to be a simple
but efficient partitioning algorithm. It divides the data into k clusters, minimizing the squared
distance between each element and the center of its cluster. The distance measure is a
parameter of the algorithm. The objective function, using the Euclidean distance, is defined
as:
J = \sum_{k=1}^{m} \sum_{x_i \in C_k} \| x_i - g_k \|^2 ……………… (3)
Where:
• m: is the total number of clusters.
• Ck: is the k-th cluster.
• xi: is the vector of the i-th element of the dataset.
• gk: is the vector of the center of the k-th cluster.
K-means then proposes the following iterative method to find a good solution.
1. Initialize the center of each cluster (also called centroids). For example, we can
arbitrarily choose some elements of the dataset to be the centroids.
2. Reassign each element to the closest centroid.
3. Recompute the center of each cluster.
4. Repeat steps 2 and 3 until the stopping criterion is satisfied.
(Figueroa , 2011).
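The four iterative steps above can be sketched as follows. This is a minimal illustration of standard k-means on points given as coordinate tuples, not the adaptive variant proposed later in this paper.

```python
import random

def kmeans(points, k, iters=20):
    """Minimal k-means following steps 1-4: init, assign, recompute, repeat."""
    centroids = random.sample(points, k)              # step 1: arbitrary initial centroids
    clusters = []
    for _ in range(iters):                            # step 4: fixed iteration budget
        clusters = [[] for _ in range(k)]
        for p in points:                              # step 2: assign to closest centroid
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[j].append(p)
        for i, c in enumerate(clusters):              # step 3: recompute each center
            if c:                                     # keep old centroid if cluster is empty
                centroids[i] = tuple(sum(x) / len(c) for x in zip(*c))
    return centroids, clusters

pts = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (10.0, 10.0), (10.1, 9.9), (9.9, 10.2)]
centroids, clusters = kmeans(pts, 2)
```

A fixed iteration count stands in for the stopping criterion of step 4; a real implementation would stop when the assignments no longer change.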
4.1.2. Hierarchical Methods
Hierarchical clustering is a method of cluster analysis which seeks to build a hierarchy of
clusters. The basics of hierarchical clustering include Lance-Williams formula, conceptual
clustering, classic algorithms SLINK, COBWEB, as well as newer algorithms CURE and
CHAMELEON. The hierarchical algorithms build clusters gradually (as crystals are grown).
Strategies for hierarchical clustering generally fall into two types. In hierarchical clustering
the data are not partitioned into a particular cluster in a single step. Instead, a series of
partitions takes place, which may run from a single cluster containing all objects to n clusters
each containing a single object. Hierarchical Clustering is subdivided into agglomerative
methods, which proceed by a series of fusions of the n objects into groups, and divisive
methods, which separate n objects successively into finer groupings. Agglomerative
techniques are more commonly used (Rai, et al., 2010).
A. Agglomerative technique
This is a "bottom up" approach: each observation starts in its own cluster, and pairs of
clusters are merged as one moves up the hierarchy. The algorithm forms clusters in a bottom-
up manner, as follows (Rai, et al., 2010):
1. Initially, put each article in its own cluster.
2. Among all current clusters, pick the two clusters with the smallest distance.
3. Replace these two clusters with a new cluster, formed by merging the two original ones.
4. Repeat the above two steps until there is only one remaining cluster in the pool.
Thus, the agglomerative clustering algorithm will result in a binary cluster tree with single
article clusters as its leaf nodes and a root node containing all the articles.
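The bottom-up merging loop above can be sketched as follows, using single-link distance between clusters of one-dimensional points; the data and the distance choice are illustrative assumptions.

```python
# Sketch of agglomerative clustering (steps 1-4 above) for 1-D points,
# with single-link (minimum pairwise) distance between clusters.
def agglomerative(points):
    clusters = [[p] for p in points]                  # step 1: one cluster per item
    merges = []                                       # record of each merge, in order
    while len(clusters) > 1:                          # step 4: repeat until one cluster
        # step 2: pick the pair of clusters with the smallest single-link distance
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: min(abs(x - y) for x in clusters[ab[0]] for y in clusters[ab[1]]),
        )
        merged = clusters[i] + clusters[j]            # step 3: merge the closest pair
        merges.append(merged)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

print(agglomerative([1, 2, 10]))  # [[1, 2], [10, 1, 2]]: 1 and 2 merge first
```

The sequence of merges corresponds to the binary cluster tree described above, read from the leaves up to the root.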
B. Divisive technique (Rai, et al., 2010).
This is a "top down" approach: all observations start in one cluster, and splits are performed
recursively as one moves down the hierarchy.
1. Put all objects in one cluster
2. Repeat until all clusters are singletons
a) Choose a cluster to split
b) Replace the chosen cluster with its sub-clusters.
C. Advantages of hierarchical clustering (Rai, et al., 2010).
1) Embedded flexibility regarding the level of granularity.
2) Ease of handling any forms of similarity or distance.
3) Applicability to any attributes type.
D. Disadvantages of hierarchical clustering (Rai, et al., 2010).
1. Ambiguity of termination criteria.
2. Most hierarchical algorithms do not revisit clusters once constructed with the purpose
of improving them.
5. Methodology
In the first part of this work the adaptive k-means and apriori algorithms are utilized. With
adaptive k-means, one can extract several clusters from each database table, and with the
apriori algorithm, relationships among the sets of items can be extracted from each cluster. In
the next part of this work the original data must be recovered from the compressed data.
To do that, the adaptive k-means decompression algorithm and the apriori decompression
algorithm are used to recover the original data from the compressed data.
5.1 Compression algorithms
5.1.1. Adaptive k-means algorithm
Adaptive k-means is a partitioning clustering algorithm used to extract all available
clusters in a selected database. In standard k-means the user must determine the number of
clusters and the center of each cluster, while in adaptive k-means the number of clusters and
the cluster centers are determined automatically, without intervention of the user. In this
algorithm the user selects a database file, and then the algorithm automatically selects two
attributes. The items that are available in these selected attributes represent the centers of the
clusters. This algorithm has several stages.
Algorithm: adaptive k-means.
Input: database file
Output: two text files, the first for saving the extracted clusters, and the second for saving
information about each extracted cluster (this information contains the number of items in each
cluster and the name of each cluster).
Begin
i. Let DB = the database file.
ii. Automatically select two attributes from the input database file. Let G and D be the
selected attributes.
iii. The center for each cluster is determined automatically by selecting the items from the
determined attributes without repetition of these items, and considering them as the centers of the
clusters. Let (G1, G2, …, Gn and D1, D2, …, Dm) be the centers of the clusters.
n represents the number of distinct items in the first selected attribute.
m represents the number of distinct items in the second selected attribute.
iv. Let U = the unselected attributes.
v. For each pair of closed items in G1, G2, …, Gn and D1, D2, …, Dm, select U.
vi. Print U in the first text file.
vii. Print the information of each cluster in the second file.
viii. Return the compressed files.
End
For example, if the dept table is selected for applying the compression algorithms, then the
gender and degree columns will be selected so that their items become the center of each cluster.
The items (male and female) are closed items belonging to the first selected attribute (gender),
while (secondary school, B.Sc., Diploma, PhD, MSc, Higher Diploma) are closed items
belonging to the second selected attribute. The center of each cluster is then formed as
follows: the center of the first cluster is (male, secondary school), the center of the second
cluster is (male, MSc), while the center of the third cluster is (female, MSc), and so on, as
determined by the adaptive k-means algorithm.
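The centre-selection idea of this example can be sketched as follows; the rows, attribute values, and the extra salary column are hypothetical, chosen to mirror the dept table described above.

```python
# Sketch of the centre-selection step: the distinct values of the two chosen
# attributes (here gender and degree) become the cluster centres, and every
# row is grouped under its (gender, degree) centre. The rows are hypothetical.
rows = [
    {"gender": "male",   "degree": "MSc", "salary": 500},
    {"gender": "female", "degree": "MSc", "salary": 520},
    {"gender": "male",   "degree": "PhD", "salary": 700},
    {"gender": "male",   "degree": "MSc", "salary": 510},
]

# distinct values without repetition (G1..Gn and D1..Dm in the algorithm)
genders = list(dict.fromkeys(r["gender"] for r in rows))
degrees = list(dict.fromkeys(r["degree"] for r in rows))

# distribute the remaining (unselected) attribute values under each centre
clusters = {}
for r in rows:
    clusters.setdefault((r["gender"], r["degree"]), []).append(r["salary"])
print(clusters)
# {('male', 'MSc'): [500, 510], ('female', 'MSc'): [520], ('male', 'PhD'): [700]}
```

Grouping the repeated (male, MSc) rows under a single centre is what removes redundancy: the centre values are stored once per cluster rather than once per row.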
5.1.2. Apriori Algorithm
The apriori algorithm can be used to generate all frequent itemsets. A frequent itemset is an
itemset whose support is greater than the user-specified minimum support (denoted Lk,
where k is the size of the itemset). A candidate itemset is a potentially frequent itemset
(denoted Ck, where k is the size of the itemset).
Algorithm: apriori
Input: adaptive k-means resulted files, min_sup, min_conf
Output: one text file for saving extracted rules and the remaining cluster's data on it.
Begin:
For each itemset l1 ∈ Lk-1
For each itemset l2 ∈ Lk-1
If (l1[1] = l2[1]) ∧ (l1[2] = l2[2]) ∧ … ∧ (l1[k-2] = l2[k-2]) ∧ (l1[k-1] < l2[k-1]) then
c = l1 join l2; // join step: generate candidates
If has_infrequent_subset(c, Lk-1) then
Delete c; // prune step: remove unfruitful candidate
Else add c to Ck;
Return Ck and the remaining items of the clusters and write them in a text file;
End
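The join and prune steps of the pseudocode above can be sketched as a candidate-generation function; the small set of frequent 2-itemsets used to exercise it is hypothetical.

```python
from itertools import combinations

# Sketch of apriori candidate generation: join frequent (k-1)-itemsets to form
# k-itemset candidates, then prune any candidate with an infrequent (k-1)-subset.
def apriori_gen(Lk_1):
    """Lk_1: set of frozensets, all of the same size k-1; returns candidate k-itemsets."""
    candidates = set()
    for l1 in Lk_1:
        for l2 in Lk_1:
            c = l1 | l2                                # join step
            if len(c) == len(l1) + 1:                  # keep only proper k-itemsets
                # prune step: every (k-1)-subset of c must itself be frequent
                if all(frozenset(s) in Lk_1 for s in combinations(c, len(l1))):
                    candidates.add(c)
    return candidates

L2 = {frozenset(p) for p in [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d")]}
print(apriori_gen(L2))
# only {'a','b','c'} survives: {'b','c','d'} is pruned because {'c','d'} is not frequent
```

Counting support for the surviving candidates, and deriving confidence for the rules, then proceeds as described in section 3.1.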
5.2 Decompression algorithms
5.2.1. Adaptive k-means Decompression Algorithm
To recover the original data from the compressed data, we reverse the operation of the
adaptive k-means compression algorithm. This operation is called the adaptive k-means
decompression algorithm. Each input compressed file is used to read out the original data
from the compressed file's data. To do this without loss of any information, it is necessary to keep the data sets
with an index used in both the adaptive k-means and the apriori decompression algorithms. Hence,
the data are distributed in the clusters along with each cluster's name, the number of items in
each cluster, and the selected attribute names. This information forms the current output
data and is the next entry to be inserted into the database file. The detailed operation of the
adaptive k-means decompression algorithm is described as follows.
Adaptive k-means decompression algorithm
Input: compressed files.
Output: original database file.
Begin:
1. Read the cluster items from the first compressed file.
2. Read the cluster information from the second compressed file.
3. Split the data that are available in the two compressed files.
4. Save the data obtained from the second compressed file into a one-dimensional
array.
5. Save the data obtained from the first compressed file into a buffer view.
6. Create a new database file.
7. Fill this new database file with the needed data from the array and the buffer
view.
8. Return the original database file.
End
5.2.2. Apriori Decompression Algorithm
To recover the original data from the compressed one, we reverse the operation of the
apriori compression algorithm. This operation is called the apriori decompression algorithm.
Each input compressed data file is used to read out the original data from the data set. To do
this without loss of any information, it is necessary to keep the data sets with an index used in
both the adaptive k-means and the apriori decompression algorithms. Hence, important
correlations are extracted by using the apriori compression algorithm, and the correlated data
and uncorrelated data are saved as clusters with their cluster names into a text file. The second
file is the same as the second file used by the adaptive k-means algorithm. This information
forms the current output data and is the next entry to be inserted into the database file. The
detailed operation of the apriori decompression algorithm is described as follows:
Apriori decompression algorithm
Input: compressed files.
Output: original database file.
Begin:
1. Read the cluster items from the apriori compressed file.
2. Read the cluster information from the second (adaptive k-means) compressed file.
3. Split the data that are available in the two compressed files.
4. Save the data obtained from the second compressed file into a one-dimensional
array.
5. Save the data obtained from the apriori compressed file into a buffer view, and
save the clusters given in the apriori compressed file into another buffer view.
6. Return the correlated data to the clusters they belong to by using their indexes.
7. Create a new database file.
8. Fill this new database file with the needed data from the array and the buffer
view.
9. Return the original database file.
End
6. Experimental Results
For an experimental evaluation of the proposed algorithms, several experiments were
performed on real databases. The proposed algorithms are implemented in the VB.net
environment. Many databases were used to test the performance of these algorithms. In
particular, several clusters are derived from each database file, and a large number of rules can
be derived from those clusters depending on the support and confidence thresholds. Some of
the results are given in Tables 1 to 9. Table 1 shows the time taken for
compressing several databases by using both the adaptive k-means and the apriori algorithm.
Table 1: adaptive k-means vs apriori compression time along with the database files size.
Original size kilobytes   AK comp time in sec   AP comp time in sec   No of clusters
128    1    2    6
168    1    2    11
176    2    2    6
356    2    10   14
372    11   147  697
640    3    31   6
704    4    63   43
852    7    18   107
872    3    82   4
1188   6    157  10
1444   11   81   150
2693   8    31   82
In the above table, the AK comp time in sec (adaptive k-means compression time in
seconds) represents the time taken to compress the database by using the adaptive k-means
compression algorithm. The AP comp time in sec (apriori compression time in seconds)
represents the time taken to compress the database by applying the apriori compression algorithm
to the adaptive k-means results. The original size in kilobytes represents the
original database size. From the above results we see that the apriori algorithm takes more
time than the adaptive k-means algorithm because the apriori algorithm performs two calculations.
First it calculates the support (the number of occurrences) of each item and compares this
support with min_sup; if support ≥ min_sup, the item is moved to the frequent item list.
Next it calculates the confidence of each set of two or more frequent items that appear together; if the
confidence ≥ min_conf, the rule is extracted. These calculations slow down the algorithm,
and the total time taken increases as the size of the test file increases. The adaptive
k-means algorithm, in contrast, does not need these operations: it only extracts the centers of the groups,
checks whether each of the remaining items in the database shares a center, and finally
distributes the items according to the shared centers.
Fig.1. Illustrations of the adaptive k-means vs apriori compression time along with the
database files size.
Table 2 shows the compressed database size obtained by applying adaptive k-means together with the
apriori algorithm, and by applying adaptive k-means alone, to the original database file; it also
shows the original database file size.
Table 2: Original database file size vs compressed file size.
Table name               Original size kilobytes   AK comp size in kilobytes   Apriori comp size in kilobytes
Dept                     128    12    12
niaid100                 168    16    16
dept200                  176    24    20
salary 653               356    52    40
medicin 831              372    140   120
dept3000                 640    236   148
niaid2248                704    280   220
2012ss 2733              852    232   204
DWC_admin 3697           872    448   324
dept10000                1188   636   364
niaid 2248&2012ss 2733   1444   512   424
bacteria 4894            2693   1069  1059
In the above table, the AK comp size in KB (adaptive k-means compressed file size)
represents the compressed database file size, in kilobytes, obtained by using the adaptive k-means
compression algorithm, and the apriori comp size in KB (apriori compressed file size)
represents the compressed database file size, also in kilobytes, obtained by applying the apriori
compression algorithm to the data resulting from the adaptive k-means compression
algorithm. Fig. 2 shows the compressed database size for each tested database using
both the adaptive k-means algorithm alone and apriori with adaptive k-means.
Fig. 2. Illustration of original database file size vs compressed database size.
The above figure shows that when the apriori algorithm is applied to the adaptive k-
means resulting data, it produces a compressed file size smaller than when the adaptive
k-means algorithm is applied alone, because the apriori algorithm saves each frequent item
with its support only once instead of saving these items several times; this process decreases
the amount of data. Table 3 shows the comparison between the compression ratios when
using adaptive k-means alone and when applying adaptive k-means together with the apriori
algorithm on the database files.
Table 3: The compression ratios for the adaptive k-means and apriori compression algorithm.
Original size in kilobytes   Comp ratio by using clustering   Ratio by using apriori with clustering technique   No of clusters
128 k 91% 91% 6
168 k 90% 90% 11
176 k 86% 89% 6
356 k 85% 89% 14
372 k 62% 68% 697
640 k 63% 77% 6
704 k 60% 70% 43
852 k 73% 76% 107
872 k 49% 63% 4
1188 k 46% 70% 10
1444 k 65% 71% 150
2693 k 60% 61% 82
In the above table, the comp ratio by using clustering (compression ratio by using a
clustering technique) represents the compression ratio obtained by using the adaptive k-means
compression algorithm, and the ratio by using apriori with clustering represents the
compression ratio obtained by applying the apriori compression algorithm to the adaptive
k-means compression algorithm results. Fig. 3 shows the compression ratio when applying the
adaptive k-means algorithm alone and together with the apriori algorithm on the original
database file.
Fig 3. Illustration of the compression ratios for the adaptive k-means and apriori compression
algorithm.
The compression ratio is calculated using the following equation:

Compression ratio = (1 - Compressed file size / Original file size) * 100 .................... (4)
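As a concrete illustration, Eq. (4) can be computed as follows. This is a minimal sketch: the function name is ours, and the example sizes are the 128 KB file from Table 3 and its 12 KB adaptive k-means output from Table 5.

```python
def compression_ratio(original_size_kb, compressed_size_kb):
    """Compression ratio as a percentage, per Eq. (4)."""
    return (1 - compressed_size_kb / original_size_kb) * 100

# 128 KB original compressed to 12 KB:
ratio = compression_ratio(128, 12)
print(round(ratio))  # 91, matching the first row of Table 3
```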
Table 3 shows that the apriori compression ratio exceeds the adaptive k-means compression ratio, except for small database files, where the two algorithms give the same compression ratio (as for the dept and niaid100 tables). Table 4 shows the decompression time taken by the adaptive k-means decompression algorithm and by the apriori decompression algorithm.
Table 4: Original database file size and the corresponding decompression time.

Original size (KB)   Clustering decomp time in sec   Apriori decomp time in sec   No of clusters
128                    2                               2                            6
168                    4                               4                           11
176                    4                               4                            6
356                   11                              12                           14
372                  157                              14                          697
640                   92                             114                            6
704                   73                              77                           43
852                   71                              74                          107
872                  123                             140                            4
1188                 549                             696                           10
1444                 144                             151                          150
2693                 185                             187                           82
The "clustering decomp time in sec" column gives the time taken to recover the original data from the compressed data using the adaptive k-means decompression algorithm, and the "apriori decomp time in sec" column gives the corresponding time for the apriori decompression algorithm. Tables 1 and 4 show that the compression time is less than the decompression time, and that the time increases as the number of clusters increases, because more loops are executed by the program. The adaptive k-means decompression algorithm also takes less time than the apriori decompression algorithm, except for the medicine table, where the apriori decompression algorithm is faster. Fig. 4 compares the decompression times of the adaptive k-means and apriori decompression algorithms on the compressed files.
Table 5 shows the compressed file sizes when the adaptive k-means algorithm is used alone, when the apriori algorithm is used alone, and when the adaptive k-means algorithm is combined with the apriori algorithm.
Fig 4. The decompression time for the adaptive k-means algorithm and the apriori algorithm.
Table 5: The compressed file sizes and the original file sizes.

Original size (KB)   AK comp size (KB)   Joint apriori with adaptive k-means size (KB)   Apriori compression size (KB)   No of clusters
128                    12                  12                                               8                             6
168                    16                  16                                              16                            11
356                    52                  40                                              44                            14
372                   140                 120                                             120                           697
704                   280                 220                                             240                            43
2693                 1069                1059                                            1157                            82
The table above shows that applying the adaptive k-means algorithm together with the apriori algorithm generally gives better results than applying either algorithm alone on the original database file. These results are also shown in Fig. 5.
Fig 5. The compressed file size vs original database files size.
Table 6 presents the compression ratio when using each algorithm separately and when using
an integration of these two algorithms.
Table 6: The resulting compression ratios of the three algorithms.

Original size (KB)   Comp ratio by using clustering   Ratio by using apriori with clustering   Apriori ratio   No of clusters
128                  91%                              91%                                      94%               6
168                  90%                              90%                                      90%              11
356                  85%                              89%                                      88%              14
372                  62%                              68%                                      68%             697
704                  60%                              70%                                      66%              43
2693                 60%                              61%                                      57%              82
The table above shows that applying the apriori algorithm to the output of the adaptive k-means algorithm gives a higher compression ratio than applying either algorithm alone to the original database, except when the number of records is small. In that case the apriori algorithm applied alone gives a better compression ratio, because the support of the left-hand side (LHS) increases while the confidence value fails to reach min_conf, so few rules are produced. With a large number of records, the number of LHS and RHS items that appear together increases, which may make the confidence values reach min_conf even as the support of the LHS items grows, so the number of extracted rules increases. This increase enlarges the compressed file size and lowers the compression ratio. Fig. 6 demonstrates the compression ratios after applying the apriori algorithm alone, the adaptive k-means algorithm alone, and the combination of the adaptive k-means algorithm with the apriori algorithm.
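The role of support and confidence in deciding whether a rule LHS -> RHS is extracted can be sketched with a toy example; the transactions, item names, and threshold below are our own illustration, not data from the paper.

```python
# Toy transaction database (hypothetical items, not the paper's data).
transactions = [
    {"dept", "salary"},
    {"dept", "salary"},
    {"dept"},
    {"dept", "salary"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """Confidence of the rule lhs -> rhs."""
    return support(lhs | rhs) / support(lhs)

min_conf = 0.8
conf = confidence({"dept"}, {"salary"})  # support 3/4 over support 4/4 = 0.75
print(conf >= min_conf)  # False: the rule is rejected, so fewer rules are stored
```

When the LHS support grows without the joint support keeping pace, confidence drops below min_conf, which is exactly the small-record behaviour described above.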
Fig 6. The compression ratios corresponding to the original database file size.
Combining the adaptive k-means and apriori algorithms means that the adaptive k-means algorithm is applied to the database file first, and the apriori algorithm is then applied to the files produced by the adaptive k-means algorithm.
Table 7: The compression time of the three discussed schemes, measured in seconds.

Original size (KB)   AK comp time in sec   Ap time in sec   Apriori comp time on original database in sec   No of clusters
128                   1                      2                 2                                               6
168                   1                      2                 4                                              11
356                   2                     10                38                                              14
372                  11                    147                23                                             697
704                   4                     63               431                                              43
2693                  8                     31               887                                              82
Fig. 7 shows the time taken to compress the database files using the adaptive k-means algorithm alone, the apriori algorithm alone, and the combination of the two algorithms.
Fig 7. The compression time when using each algorithm separately and when using the combination of these algorithms.
The figure shows that applying the apriori algorithm to the original database takes more time than applying it to the results of the adaptive k-means algorithm, except for the medicine table, where applying it to the adaptive k-means results takes more time because the number of clusters produced by the adaptive k-means algorithm on that table is very large. In other words, as the number of clusters increases, the compression time increases.
Table 8: The decompression time of the three algorithms.

Original size (KB)   Clustering decomp time in sec   Apriori decomp time in sec   Apriori on DB decomp time in sec   No of clusters
128                    2                               2                             2                                  6
168                    4                               4                             3                                 11
356                   11                              12                            15                                 14
372                  157                              14                            14                                697
704                   73                              77                            86                                 43
2693                 185                             187                           221                                 82
In the above table, the first column gives the original database size. The second column gives the time taken to recover the original database using the adaptive k-means decompression algorithm. The third column gives the time taken using the apriori decompression algorithm applied to the compressed files produced by running the apriori compression algorithm on the adaptive k-means results. The fourth column gives the time taken using the apriori decompression algorithm applied to the compressed files produced by running the apriori compression algorithm directly on the original database. The table shows that the apriori decompression algorithm takes more time on files compressed directly from the original databases than on files compressed from the results of the adaptive k-means compression algorithm.
Fig 8. Illustration of the decompression time for the three algorithms.
The following table and Fig. 9 show a comparison between the compression time and the decompression time.
Table 9: A comparison between the compression time and the decompression time.

AK comp time in sec   Ap time in sec   Clustering decomp time in sec   Apriori decomp time in sec   No of clusters
 1                      2                 2                               2                            6
 1                      2                 4                               4                           11
 2                      2                 4                               4                            6
 2                     10                11                              12                           14
11                    147               157                              14                          697
 3                     31                92                             114                            6
 4                     63                73                              77                           43
 7                     18                71                              74                          107
 3                     82               123                             140                            4
 6                    157               549                             696                           10
11                     81               144                             151                          150
 8                     31               185                             187                           82
The above table shows that the time taken to restore the original database exceeds the time taken to compress it. To restore the original database, the algorithm reads the data from the compressed file and appends it to a new database; this appending step, which is performed automatically, is what delays recovery of the original database.
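A minimal sketch of this recovery step, assuming a SQLite database and a hypothetical dept schema (neither is specified in the paper); the point is that each recovered record is appended row by row, which dominates the decompression time.

```python
import sqlite3

# Rebuild the database in memory (illustrative schema, not the paper's).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dept (id INTEGER, name TEXT)")

# Records as they would be recovered from the compressed file.
recovered_records = [(1, "niaid"), (2, "salary"), (3, "medicine")]
for rec in recovered_records:  # row-by-row append, as described above
    conn.execute("INSERT INTO dept VALUES (?, ?)", rec)
conn.commit()

print(conn.execute("SELECT COUNT(*) FROM dept").fetchone()[0])  # 3
```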
Fig 9. A comparison of the compression time and the decompression time.
7. Conclusions
In this work, database files are exploited to discover the grouping of objects and the correlated data in each group. A combined approach (adaptive k-means and the apriori algorithm) is proposed: the adaptive k-means algorithm discovers the grouping of objects, while the apriori algorithm extracts the correlated data within each group. The experimental results show that the proposed compression algorithms are effective in reducing the quantity of transmitted data and improving compressibility, and thus in reducing the energy consumed for data transfer. The apriori algorithm achieves a better compression ratio than the adaptive k-means algorithm, but both its compression and its decompression take more time than the corresponding adaptive k-means algorithms. The apriori algorithm is therefore considered better than the adaptive k-means algorithm at reducing the database file size. Even when the apriori algorithm is applied alone to the database file, without being combined with the adaptive k-means algorithm, it gives a better compression ratio and smaller files than the adaptive k-means algorithm applied alone. However, when the apriori algorithm is applied alone, the compression time exceeds both the time taken by the adaptive k-means algorithm alone and the time taken when the apriori algorithm is applied to the adaptive k-means results. The proposed adaptive k-means algorithm can handle any data type, such as text and dates, whereas the standard k-means algorithm handles numerical data only. To validate the results, we calculated several measures commonly used to evaluate the performance of compression algorithms: the compression ratio, the compression time, and the decompression time. The compression ratios of
joining the apriori algorithm with the adaptive k-means algorithm range between 61% and 91%, while the compression ratios of the adaptive k-means algorithm alone range between 46% and 91%. As shown in section 4, the compression time ranges between 2 sec and 157 sec for the combination of the adaptive k-means and apriori algorithms, and between 1 sec and 11 sec for the adaptive k-means algorithm alone. Finally, the decompression time of the apriori decompression algorithm ranges between 2 sec and 696 sec, and that of the adaptive k-means decompression algorithm between 2 sec and 549 sec.
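One way to sketch how k-means can be extended to the non-numerical attributes mentioned above is with a simple mismatch distance that counts differing attribute values; this is only an illustration of the idea, not a reproduction of the paper's adaptive k-means algorithm, and the records below are hypothetical.

```python
def distance(a, b):
    """Count the attribute positions where two records differ."""
    return sum(x != y for x, y in zip(a, b))

# Mixed-type records: (department, date, numeric value) -- illustrative only.
records = [
    ("dept", "2012-01-07", 200),
    ("dept", "2012-01-07", 210),
    ("medicine", "2011-05-01", 831),
]

def nearest(record, centers):
    """Index of the closest cluster center under the mismatch distance."""
    return min(range(len(centers)), key=lambda i: distance(record, centers[i]))

centers = [records[0], records[2]]  # two seed centers
labels = [nearest(r, centers) for r in records]
print(labels)  # [0, 0, 1]: the two dept records group together
```

Because the distance never converts attributes to numbers, text and date fields participate in clustering directly, which is the property claimed for the proposed adaptive k-means algorithm.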
References
Adriaans, P., & Zantinge, D. (1998). Data Mining. Addison-Wesley.
Afify, H., Islam, M., & Wahed, M. A. (2011). DNA lossless differential compression algorithm based on similarity of genomic sequence database. International Journal of Computer Science & Information Technology (IJCSIT), 3(4).
Agrawal, R., Imielinski, T., & Swami, A. (1997). Database mining: A performance perspective. IEEE Trans. Knowledge and Data Engineering.
Berkhin, P. (2002). Survey of clustering data mining techniques. Accrue Software, San Jose, CA.
Brin, S., Motwani, R., Ullman, J. D., & Tsur, S. (1997). Dynamic itemset counting and implication rules for market basket data. In Proceedings of SIGMOD '97, Tucson, Arizona, USA.
Figueroa, V. (2010-2011). Clustering methods applied to Wikipedia. Année académique.
Goetz, G., & Leonard, D. S. (1991). Data compression and database performance. Oregon Advanced Computing Institute (OACIS) and NSF awards IRI-8805200, IRI-8912618, and IRI-9006348.
Graham, W. (1994). Signal Coding and Processing (2nd ed.). Cambridge University Press, p. 34.
Han, J., & Kamber, M. (2001). Data Mining: Concepts and Techniques. Academic Press, USA.
Hand, D., Mannila, H., & Smyth, P. (2001). Principles of Data Mining. The MIT Press, Cambridge, London, England.
Jacob, N., Somvanshi, P., & Tornekar, R. (2012). Comparative analysis of lossless text compression techniques. International Journal of Computer Applications (0975-8887), 56(3).
Kodituwakku, S. R., & Amarasinghe, U. S. (2007). Comparison of lossless data compression algorithms for text data. Indian Journal of Computer Science and Engineering, 1(4), 416-425.
Kona, H. V. (2003). Association rule mining over multiple databases: Partitioned and incremental approaches. Master's thesis, The University of Texas at Arlington.
Neeraj, S., & Swati, L. S. (2012). Overview of non-redundant association rule mining. Research Journal of Recent Sciences (Res. J. Recent Sci.), ISSN 2277-2502, 1(2), 108-112.
Pu, I. M. (2006). Fundamental Data Compression. Elsevier, Britain.
Rai, P., & Singh, S. (2010). A survey of clustering techniques. International Journal of Computer Applications (0975-8887), 7(12).
Rajaraman, A., & Ullman, J. D. (2010, 2011). Mining of Massive Datasets.
Report Concerning Space Data System Standards. (2006). Lossless Data Compression. CCSDS Secretariat, Office of Space Communication (Code M-3), National Aeronautics and Space Administration, Washington, DC 20546, USA.
Saad K. Majeed, & Hussein K. Abbas. (2010). An improved distributed association rule algorithm. Eng. & Tech. Journal, 28(18).
Suarjaya, I. M. A. D. (2012). A new algorithm for data compression optimization. International Journal of Advanced Computer Science and Applications (IJACSA), 3(8).
Szépkúti, I. (2004). Difference sequence compression of multidimensional databases. Periodica Polytechnica Ser. El. Eng., 48(3-4), 197-218.
Yijun, Lin, X., & Tsang, C. (2000). An efficient distributed algorithm for computing association rules. Springer-Verlag Berlin Heidelberg.