Journal of Advanced Computer Science and Technology Research, Vol.3 No.1, March 2013, 1-20
Large Dataset Compression Approach
Using Intelligent Technique
Ahmed Tariq Sadiq1,a, Mehdi G. Duaimi2,b, Rasha
Subhi Ali2,c 1Computer Science Department, University of Technology,
Baghdad, Iraq 2Computer Science Department, Baghdad University,
Baghdad, Iraq [email protected],
[email protected],[email protected]
Article Info
Received: 7th January 2013
Accepted: 28th January 2013
Published online: 1st March 2013
ISSN: 2231-8852 © 2013 Design for Scientific Renaissance All rights reserved
ABSTRACT Data clustering is a process of putting similar data into groups. A clustering algorithm partitions a data
set into several groups such that the similarity within a group is higher than the similarity among groups.
Association rule mining is one of the possible methods for the analysis of data. Association rule
algorithms generate a huge number of association rules, many of which are redundant. The main idea of this
paper is to compress large databases by using clustering techniques together with association rule algorithms. In
the first stage, the database is compressed by using a clustering technique, followed by an association rule
algorithm: an adaptive k-means clustering algorithm is proposed together with the apriori algorithm. In many
experiments, using the adaptive k-means algorithm and the apriori algorithm together gives a better
compression ratio and a smaller compressed file size than using either algorithm alone. Experiments
were made on databases of several different sizes. The apriori algorithm increases the compression ratio of the adaptive k-
means algorithm when they are used together, but it takes more compression time than the adaptive k-
means algorithm does. These algorithms are presented and their results are compared.
Keywords: Association rule, clustering techniques and compression algorithms.
1. Introduction
Compression is the art of representing the information in a compact form rather than its
original or uncompressed form (Pu, 2006). In other words, using the data compression, the
size of a particular file can be reduced. This is very useful when processing, storing or
transferring a huge file, which needs lots of resources (Kodituwakku,et al., 2007). Data
compression is widely used in data management to save storage space and network
bandwidth (Goetz, et al.,1991). In computer science and information theory, data
compression, source coding, or bit-rate reduction involve encoding information using fewer
bits than the original representation. Compression can be either lossy or lossless. Lossless
compression reduces bits by identifying and eliminating statistical redundancy (eliminate
unwanted redundancy). Compression is useful because it helps reduce the consumption of
resources such as storage space or transmission capacity. Because compressed data must be
decompressed to be used, this extra processing imposes computational or other costs during
decompression. In lossy data compression, some loss of information is acceptable: depending
upon the application, details can be dropped from the data to save storage space. Lossy data
compression is used in image, audio, and video compression (Graham, 1994). Lossless data compression is
contrasted with lossy data compression. Lossless data compression has been suggested for
many space science exploration mission applications either to increase the scientific return or
to reduce the requirement for on-board memory, station contact time, and data archival
volume. A lossless compression technique guarantees full reconstruction of the original data
without incurring any distortion in the process. A lossless data compression technique
preserves the source data accuracy by removing redundancy from the application source data.
In the decompression process the original source data are reconstructed from the compressed
data by restoring the removed redundancy. The reconstructed data is an exact replica of the
original source data. The amount of redundancy removed from the source data is variable and
is highly dependent on the source data statistics, which are often non-stationary (Report
Concerning Space Data System Standards, 2006). In this paper two intelligent techniques
(clustering techniques and association rules) are used to compress large data sets. Clustering
is a division of data into groups of similar objects. Each group, called a cluster, consists of
objects that are similar among themselves and dissimilar to the objects of other groups
(Berkhin, 2002). Association rule mining plays a major role in the process of mining data
for frequent patterns. It involves discovering the unknown interdependences of the data and finding rules among those
items (Neeraj, et al., 2012). The main objective of this work is to discuss intelligent
techniques to compress large data sets.
This paper is organized as follows. Section two reviews related work, Section three
gives some terminology on association rules and the apriori algorithm, Section four explains
major clustering techniques and k-means algorithm, Section five explains the methodology of
the compression algorithms and the decompression algorithms, Section six presents some
results from experiments, and Section seven concludes with a discussion.
2. Related Work
Sonia Dora (Jacob, et al., 2012) focuses on lossless compression for relational
databases at the attribute level: the proposed technique is applied at the attribute level by
compressing three types of attribute (string, integer and date type), and its most interesting
feature is that it automatically identifies the type of each attribute. I Made Agus Dwi Suarjaya
(Suarjaya, 2012) proposes a new algorithm for data compression, called j-bit encoding
(JBE). This algorithm manipulates each bit of data in a file to minimize the size without
losing any data after decoding, which classifies it as lossless compression. Heba Afify,
Muhammad Islam and Manal Abdel Wahed (Afify, et al., 2011) present a differential
compression algorithm that is based on the production of difference sequences according to an
op-code table in order to optimize the compression of homologous sequences in the dataset.
István Szépkúti (Szépkúti, 2004) introduces a new method called difference sequence
compression. Under some conditions, this technique is able to create a smaller size
multidimensional database than others like single count header compression, logical position
compression or base-offset compression.
3. Association Rules
Association rule mining is one of the most important and well researched techniques of
data mining. It was first introduced by Agrawal, Imielinski, and Swami (Agrawal, et
al., 1997). The discovery of "association rules" in databases may provide useful background
knowledge to decision support systems, selective marketing, financial forecasting, medical
diagnosis, and many other applications (Yijun, et al., 2000). Mining association rules is an
important data mining problem. Association rules are usually mined repeatedly in different
parts of a database. Current algorithms for mining association rules work in two steps.
1. Discover the large itemsets, i.e. the sets of items that have support above a
predetermined minimum support σ.
2. Use the large itemsets to generate the association rules for the database.
It is noted that the overall performance of mining association rules is determined by the first
step, which usually requires repeated passes over the analyzed database. After the large
itemsets are identified, the corresponding association rules can be derived in a
straightforward manner (Saad et al., 2010).
3.1 Association rules concept
An association rule is a simple probabilistic statement about the co-occurrence of certain
events in a database, and is particularly applicable to sparse transaction data sets (Hand et al.,
2001). An association rule is a rule which infers certain association relationships among a set
of objects (such as objects which occur together, or one object implying the other). In a database,
association rule mining works as follows (Adriaans et al., 1998):
Let I be a set of items and D a database of transactions, where each transaction has a
unique identifier (tid) and contains a set of items called an item set. An itemset with k items
is called a k-itemset. The support of an itemset X denoted S (X) is the number of transactions
in which that itemset occurs as a subset. A k-subset is a k- length subset of an itemset. An
itemset is frequent or large if its support is more than a user-specified minimum support
(min_sup) value. Fk is the set of frequent k-itemsets. A frequent itemset is maximal if it is not
a subset of any other frequent itemset. An association rule is an expression A → B, where A
and B are itemsets. The rule's support (S) is the joint probability of a transaction containing
both A and B, and is given as S(A ∪ B). The confidence of the rule is the conditional
probability that a transaction contains B, given that it contains A, and is given as S(A ∪ B)/S(A).
A rule is frequent if its support is greater than min_sup and strong if its confidence is
more than a user-specified minimum confidence (min_conf). Data mining involves
generating all association rules in the database that have a support greater than min_sup (the
rules are frequent) and that have a confidence greater than min_conf (the rules are strong)
(Saad et al., 2010). The important measures for association rules are support (S) and
confidence (C). They can be defined as:
Definition 1: Support (S)
Support(X → Y) = Pr(X ∪ Y) = count of (X ∪ Y) / Total transactions ……………… (1)
The support (S) of an association rule is the ratio (in percent) of the records that contain
(X ∪ Y) to the total number of records in the database. Therefore, if we say that the support of
a rule is 5%, it means that 5% of the total records contain (X ∪ Y) (Brin, et al., 1997).
Definition 2: Confidence (C)
Conf(X → Y) = Pr(X ∪ Y)/Pr(X) = support(X, Y)/support(X) ……………… (2)
For a given number of records, confidence (C) is the ratio (in percent) of the number of
records that contain (X ∪ Y) to the number of records that contain X. Thus, if we say that a
rule has a confidence of 85%, it means that 85% of the records containing X also contain Y.
The confidence of the rule refers to the degree of correlation in the database between X and
Y. Confidence is also a measure of a rule's strength. Mining consists of finding all rules that
meet the user-specified support and confidence thresholds (Brin, et al., 1997). As there are two
thresholds, we need two processes to mine the rules. The first step is to get the large itemsets.
It finds all the itemsets whose supports are larger than the support threshold. An itemset is the
set of the items. Based on the large itemsets, we can generate the rules from the large
itemsets, which is the second step. Rules that satisfy both a minimum support threshold
(min_sup) and a minimum confidence threshold (min_conf) are called strong (Han, et al.,
2001). An association rule-mining problem is broken down into two steps: 1) generate all
the item combinations (itemsets) whose support is greater than the user specified minimum
support. Such sets are called the frequent itemsets and 2) use the identified frequent itemsets
to generate the rules that satisfy a user-specified confidence. The frequent itemset generation
requires more effort, while the rule generation is straightforward (Kona, 2003).
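The support and confidence measures of Equations (1) and (2) can be sketched in a few lines of Python; the transactions below are hypothetical, chosen only to illustrate the two formulas.

```python
# Minimal sketch of support and confidence (Equations 1 and 2).
# The transaction data are hypothetical.
def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(A ∪ B) / support(A) for the rule A -> B."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
print(support({"bread", "milk"}, transactions))       # 0.5  (2 of 4 transactions)
print(confidence({"bread"}, {"milk"}, transactions))  # 0.5 / 0.75 ≈ 0.667
```

With min_sup = 40% and min_conf = 60%, the rule bread → milk would therefore be both frequent and strong in this toy data.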
4. Unsupervised learning
Unsupervised learning is a branch of machine learning where the aim is to
find patterns in raw, unlabeled data. This is unlike supervised learning, where the algorithm is
first trained on labeled data so that it can learn and adapt to a particular problem
(classification). In unsupervised learning, the general, inherent properties of the data are used
to find patterns and organize it (clustering) (Figueroa, 2011).
4.1 Clustering
Clustering is the process of examining a collection of “points,” and grouping the points
into “clusters” according to some distance measure (Rajaraman, et al., 2011). The goal of
clustering is to define clusters of elements in the dataset such that the elements in the same
cluster are similar to each other, and each cluster as a whole is distant from the others. Two
important classes of clustering can be distinguished. 1) Hierarchical clustering: these
techniques can be either agglomerative or divisive. In agglomerative clustering, we start by
assigning each element to a different class; successive iterations of the algorithm cluster
together the closest classes, until all the elements belong to the same, main class. Divisive
methods instead perform divisions of the classes into smaller ones. 2) Partitional
clustering: the concept of partitional clustering is to divide the data immediately into a certain
number of clusters. The most popular algorithm is k-means, which will be presented in the
next section (Figueroa, 2011).
4.1.1. Partitioning Methods
The partitioning methods generally result in a set of M clusters, with each object belonging
to one cluster. Each cluster may be represented by a centroid or a cluster representative; this
is a sort of summary description of all the objects contained in the cluster. The k-means
method is an example of partitional clustering (Rai, et al., 2010).
K-means
The k-means algorithm was proposed independently in various scientific fields over 50
years ago. MacQueen was the first, in 1967, to use the name k-means, for his one-pass version of the
algorithm, in which he defined the first k elements of the dataset as the k classes and
successively assigned each subsequent element to the closest class, updating the centroid after each
assignment (Figueroa, 2011). The standard k-means algorithm is considered to be a simple
but efficient partitioning algorithm. It divides the data into k clusters, minimizing the squared
distance between each element and the center of its cluster. The distance measure is a
parameter of the algorithm. The objective function, using the Euclidean distance, is defined
as:
J = \sum_{k=1}^{m} \sum_{x_i \in C_k} \| x_i - g_k \|^2 ……………… (3)
Where:
• m: is the total number of clusters.
• Ck: is the k-th cluster.
• xi: is the vector of the i-th element of the dataset.
• gk: is the vector of the center of the k-th cluster.
K-means then proposes the following iterative method to find a good solution.
1. Initialize the center of each cluster (also called centroids). For example, we can
arbitrarily choose some elements of the dataset to be the centroids.
2. Reassign each element to the closest centroid.
3. Recompute the center of each cluster.
4. Repeat steps 2 and 3 until the stopping criterion is satisfied.
(Figueroa , 2011).
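The four iterative steps above can be sketched as follows. This is a minimal illustration of standard k-means on points given as coordinate tuples, not the adaptive variant proposed later in this paper.

```python
import random

def kmeans(points, k, iters=20):
    """Minimal k-means following steps 1-4: init, assign, recompute, repeat."""
    centroids = random.sample(points, k)              # step 1: arbitrary initial centroids
    clusters = []
    for _ in range(iters):                            # step 4: fixed iteration budget
        clusters = [[] for _ in range(k)]
        for p in points:                              # step 2: assign to closest centroid
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[j].append(p)
        for i, c in enumerate(clusters):              # step 3: recompute each center
            if c:                                     # keep old centroid if cluster is empty
                centroids[i] = tuple(sum(x) / len(c) for x in zip(*c))
    return centroids, clusters

pts = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (10.0, 10.0), (10.1, 9.9), (9.9, 10.2)]
centroids, clusters = kmeans(pts, 2)
```

A fixed iteration count stands in for the stopping criterion of step 4; a real implementation would stop when the assignments no longer change.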
4.1.2. Hierarchical Methods
Hierarchical clustering is a method of cluster analysis which seeks to build a hierarchy of
clusters. The basics of hierarchical clustering include Lance-Williams formula, conceptual
clustering, classic algorithms SLINK, COBWEB, as well as newer algorithms CURE and
CHAMELEON. The hierarchical algorithms build clusters gradually (as crystals are grown).
Strategies for hierarchical clustering generally fall into two types. In hierarchical clustering
the data are not partitioned into a particular cluster in a single step. Instead, a series of
partitions takes place, which may run from a single cluster containing all objects to n clusters
each containing a single object. Hierarchical Clustering is subdivided into agglomerative
methods, which proceed by a series of fusions of the n objects into groups, and divisive
methods, which separate n objects successively into finer groupings. Agglomerative
techniques are more commonly used (Rai, et al., 2010).
A. Agglomerative technique
This is a "bottom up" approach: each observation starts in its own cluster, and pairs of
clusters are merged as one moves up the hierarchy. The algorithm forms clusters in a bottom-
up manner, as follows (Rai, et al., 2010):
1. Initially, put each article in its own cluster.
2. Among all current clusters, pick the two clusters with the smallest distance.
3. Replace these two clusters with a new cluster, formed by merging the two original ones.
4. Repeat the above two steps until there is only one remaining cluster in the pool.
Thus, the agglomerative clustering algorithm will result in a binary cluster tree with single
article clusters as its leaf nodes and a root node containing all the articles.
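The bottom-up merging loop above can be sketched as follows, using single-link distance between clusters of one-dimensional points; the data and the distance choice are illustrative assumptions.

```python
# Sketch of agglomerative clustering (steps 1-4 above) for 1-D points,
# with single-link (minimum pairwise) distance between clusters.
def agglomerative(points):
    clusters = [[p] for p in points]                  # step 1: one cluster per item
    merges = []                                       # record of each merge, in order
    while len(clusters) > 1:                          # step 4: repeat until one cluster
        # step 2: pick the pair of clusters with the smallest single-link distance
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: min(abs(x - y) for x in clusters[ab[0]] for y in clusters[ab[1]]),
        )
        merged = clusters[i] + clusters[j]            # step 3: merge the closest pair
        merges.append(merged)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

print(agglomerative([1, 2, 10]))  # [[1, 2], [10, 1, 2]]: 1 and 2 merge first
```

The sequence of merges corresponds to the binary cluster tree described above, read from the leaves up to the root.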
B. Divisive technique (Rai, et al., 2010).
This is a "top down" approach: all observations start in one cluster, and splits are performed
recursively as one moves down the hierarchy.
1. Put all objects in one cluster
2. Repeat until all clusters are singletons
a) Choose a cluster to split
b) Replace the chosen cluster with its sub-clusters.
C. Advantages of hierarchical clustering (Rai, et al., 2010).
1) Embedded flexibility regarding the level of granularity.
2) Ease of handling any forms of similarity or distance.
3) Applicability to any attributes type.
D. Disadvantages of hierarchical clustering (Rai, et al., 2010).
1. Ambiguity of termination criteria.
2. Most hierarchical algorithms do not revisit clusters once constructed with the purpose
of improving them.
5. Methodology
In the first part of this work the adaptive k-means and apriori algorithms are utilized. With
adaptive k-means, one can extract several clusters from each database table, and with the
apriori algorithm, relationships among the sets of items can be extracted from each cluster. In
the next part of this work the original data must be recovered from the compressed data.
To do that, the adaptive k-means decompression algorithm and the apriori decompression
algorithm are used to recover the original data from the compressed data.
5.1 Compression algorithms
5.1.1. Adaptive k-means algorithm
Adaptive k-means is a partitioning clustering algorithm used to extract all available
clusters in a selected database. In standard k-means the user must determine the number of
clusters and the center of each cluster, while in adaptive k-means the number of clusters and
the cluster centers are determined automatically, without intervention of the user. In this
algorithm the user selects a database file, and then the algorithm automatically selects two
attributes. The items that are available in these selected attributes represent the centers of the
clusters. This algorithm has several stages.
Algorithm: adaptive k-means.
Input: database file
Output: two text files, the first for saving the extracted clusters, and the second for saving
information about each extracted cluster (this information contains the number of items in each
cluster and the name of each cluster).
Begin
i. Let DB = the database file.
ii. Automatically select two attributes from the input database file. Let G and D be the
selected attributes.
iii. The center for each cluster is determined automatically by selecting the items from the
determined attributes without repetition of these items, and considering them as the centers of the
clusters. Let (G1, G2, …, Gn and D1, D2, …, Dm) be the centers of the clusters.
n represents the number of distinct items in the first selected attribute.
m represents the number of distinct items in the second selected attribute.
iv. Let U = the unselected attributes.
v. For each pair of closed items in G1, G2, …, Gn and D1, D2, …, Dm, select U.
vi. Print U in the first text file.
vii. Print the information of each cluster in the second file.
viii. Return the compressed files.
End
For example, if the dept table is selected for applying the compression algorithms, then the
gender and degree columns will be selected so that their items become the center of each cluster.
The items (male and female) are closed items belonging to the first selected attribute (gender),
while (secondary school, B.Sc., Diploma, PhD, MSc, Higher Diploma) are closed items
belonging to the second selected attribute. The center of each cluster is then formed as
follows: the center of the first cluster is (male, secondary school), the center of the second
cluster is (male, MSc), while the center of the third cluster is (female, MSc), and so on, as
determined by the adaptive k-means algorithm.
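The centre-selection idea of this example can be sketched as follows; the rows, attribute values, and the extra salary column are hypothetical, chosen to mirror the dept table described above.

```python
# Sketch of the centre-selection step: the distinct values of the two chosen
# attributes (here gender and degree) become the cluster centres, and every
# row is grouped under its (gender, degree) centre. The rows are hypothetical.
rows = [
    {"gender": "male",   "degree": "MSc", "salary": 500},
    {"gender": "female", "degree": "MSc", "salary": 520},
    {"gender": "male",   "degree": "PhD", "salary": 700},
    {"gender": "male",   "degree": "MSc", "salary": 510},
]

# distinct values without repetition (G1..Gn and D1..Dm in the algorithm)
genders = list(dict.fromkeys(r["gender"] for r in rows))
degrees = list(dict.fromkeys(r["degree"] for r in rows))

# distribute the remaining (unselected) attribute values under each centre
clusters = {}
for r in rows:
    clusters.setdefault((r["gender"], r["degree"]), []).append(r["salary"])
print(clusters)
# {('male', 'MSc'): [500, 510], ('female', 'MSc'): [520], ('male', 'PhD'): [700]}
```

Grouping the repeated (male, MSc) rows under a single centre is what removes redundancy: the centre values are stored once per cluster rather than once per row.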
5.1.2. Apriori Algorithm
The apriori algorithm can be used to generate all frequent itemsets. A frequent itemset is an
itemset whose support is greater than the user-specified minimum support (denoted Lk,
where k is the size of the itemset). A candidate itemset is a potentially frequent itemset
(denoted Ck, where k is the size of the itemset).
Algorithm: apriori
Input: adaptive k-means resulted files, min_sup, min_conf
Output: one text file for saving extracted rules and the remaining cluster's data on it.
Begin:
For each itemset l1 ∈ Lk-1
For each itemset l2 ∈ Lk-1
If (l1[1] = l2[1]) ∧ (l1[2] = l2[2]) ∧ … ∧ (l1[k-2] = l2[k-2]) ∧ (l1[k-1] < l2[k-1]) then
c = l1 join l2; // join step: generate candidates
If has_infrequent_subset(c, Lk-1) then
Delete c; // prune step: remove unfruitful candidate
Else add c to Ck;
Return Ck and the remaining items of the clusters and write them in a text file;
End
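The join and prune steps of the pseudocode above can be sketched as a candidate-generation function; the small set of frequent 2-itemsets used to exercise it is hypothetical.

```python
from itertools import combinations

# Sketch of apriori candidate generation: join frequent (k-1)-itemsets to form
# k-itemset candidates, then prune any candidate with an infrequent (k-1)-subset.
def apriori_gen(Lk_1):
    """Lk_1: set of frozensets, all of the same size k-1; returns candidate k-itemsets."""
    candidates = set()
    for l1 in Lk_1:
        for l2 in Lk_1:
            c = l1 | l2                                # join step
            if len(c) == len(l1) + 1:                  # keep only proper k-itemsets
                # prune step: every (k-1)-subset of c must itself be frequent
                if all(frozenset(s) in Lk_1 for s in combinations(c, len(l1))):
                    candidates.add(c)
    return candidates

L2 = {frozenset(p) for p in [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d")]}
print(apriori_gen(L2))
# only {'a','b','c'} survives: {'b','c','d'} is pruned because {'c','d'} is not frequent
```

Counting support for the surviving candidates, and deriving confidence for the rules, then proceeds as described in section 3.1.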
5.2 Decompression algorithms
5.2.1. Adaptive k-means Decompression Algorithm
To recover the original data from the compressed data, we reverse the operation of the
adaptive k-means compression algorithm. This operation is called the adaptive k-means
decompression algorithm. Each input compressed file is used to read out the original data
from the compressed file's data. To do this without loss of any information, it is necessary to keep the data sets
with an index used in both the adaptive k-means and the apriori decompression algorithms. Hence,
the data are distributed in the clusters along with each cluster's name, the number of items in
each cluster, and the selected attribute names. This information forms the current output
data and is the next entry to be inserted into the database file. The detailed operation of the
adaptive k-means decompression algorithm is described as follows.
Adaptive k-means decompression algorithm
Input: compressed files.
Output: original database file.
Begin:
1. Read the cluster items from the first compressed file.
2. Read the cluster information from the second compressed file.
3. Split the data that are available in the two compressed files.
4. Save the data obtained from the second compressed file into a one-dimensional
array.
5. Save the data obtained from the first compressed file into a buffer view.
6. Create a new database file.
7. Fill this new database file with the needed data from the array and the buffer
view.
8. Return the original database file.
End
5.2.2. Apriori Decompression Algorithm
To recover the original data from the compressed one, we reverse the operation of the
apriori compression algorithm. This operation is called the apriori decompression algorithm.
Each input compressed data file is used to read out the original data from the data set. To do
this without loss of any information, it is necessary to keep the data sets with an index used in
both the adaptive k-means and the apriori decompression algorithms. Hence, important
correlations are extracted by using the apriori compression algorithm, and the correlated data
and uncorrelated data are saved as clusters with their cluster names into a text file. The second
file is the same as the second file used by the adaptive k-means algorithm. This information
forms the current output data and is the next entry to be inserted into the database file. The
detailed operation of the apriori decompression algorithm is described as follows:
Apriori decompression algorithm
Input: compressed files.
Output: original database file.
Begin:
1. Read the cluster items from the apriori compressed file.
2. Read the cluster information from the second (adaptive k-means) compressed file.
3. Split the data that are available in the two compressed files.
4. Save the data obtained from the second compressed file into a one-dimensional
array.
5. Save the data obtained from the apriori compressed file into a buffer view, and
save the clusters given in the apriori compressed file into another buffer view.
6. Return the correlated data to the clusters they belong to by using their indexes.
7. Create a new database file.
8. Fill this new database file with the needed data from the array and the buffer
view.
9. Return the original database file.
End
6. Experimental Results
For an experimental evaluation of the proposed algorithms, several experiments were
performed on real databases. The proposed algorithms are implemented in the VB.net
environment. Many databases were used to test the performance of these algorithms. In
particular, several clusters are derived from each database file, and a large number of rules can
be derived from those clusters depending on the support and confidence thresholds. Some of
the results are given in Tables 1 to 9. Table 1 shows the time taken for
compressing several databases by using both the adaptive k-means and the apriori algorithm.
Table 1: adaptive k-means vs apriori compression time along with the database files size.
Original size kilobytes   AK comp time in sec   AP comp time in sec   No of clusters
128    1    2    6
168    1    2    11
176    2    2    6
356    2    10   14
372    11   147  697
640    3    31   6
704    4    63   43
852    7    18   107
872    3    82   4
1188   6    157  10
1444   11   81   150
2693   8    31   82
In the above table, the AK comp time in sec (adaptive k-means compression time in
seconds) represents the time taken to compress the database by using the adaptive k-means
compression algorithm. The AP comp time in sec (apriori compression time in seconds)
represents the time taken to compress the database by applying the apriori compression algorithm
to the adaptive k-means results. The original size in kilobytes represents the
original database size. From the above results we see that the apriori algorithm takes more
time than the adaptive k-means algorithm because the apriori algorithm performs two calculations.
First it calculates the support (the number of occurrences) of each item and compares this
support with min_sup; if support ≥ min_sup, the item is moved to the frequent item list.
Next it calculates the confidence of each set of two or more frequent items that appear together; if the
confidence ≥ min_conf, the rule is extracted. These calculations slow down the algorithm,
and the total time taken increases as the size of the test file increases. The adaptive
k-means algorithm, in contrast, does not need these operations: it only extracts the centers of the groups,
checks whether each of the remaining items in the database shares a center, and finally
distributes the items according to the shared centers.
Fig.1. Illustrations of the adaptive k-means vs apriori compression time along with the
database files size.
Table 2 shows the compressed database size obtained by applying adaptive k-means together with the
apriori algorithm, and by applying adaptive k-means alone, to the original database file; it also
shows the original database file size.
Table 2: Original database file size vs compressed file size.
Table name               Original size kilobytes   AK comp size in kilobytes   Apriori comp size in kilobytes
Dept                     128    12    12
niaid100                 168    16    16
dept200                  176    24    20
salary 653               356    52    40
medicin 831              372    140   120
dept3000                 640    236   148
niaid2248                704    280   220
2012ss 2733              852    232   204
DWC_admin 3697           872    448   324
dept10000                1188   636   364
niaid 2248&2012ss 2733   1444   512   424
bacteria 4894            2693   1069  1059
In the above table, the AK comp size in KB (adaptive k-means compressed file size)
represents the compressed database file size, in kilobytes, obtained by using the adaptive k-means
compression algorithm, and the apriori comp size in KB (apriori compressed file size)
represents the compressed database file size, also in kilobytes, obtained by applying the apriori
compression algorithm to the data resulting from the adaptive k-means compression
algorithm. Fig. 2 shows the compressed database size for each tested database using
both the adaptive k-means algorithm alone and apriori with adaptive k-means.
Fig. 2. Illustration of original database file size vs compressed database size.
The above figure shows that when the apriori algorithm is applied to the adaptive k-
means resulting data, it produces a compressed file size smaller than when the adaptive
k-means algorithm is applied alone, because the apriori algorithm saves each frequent item
with its support only once instead of saving these items several times; this process decreases
the amount of data. Table 3 shows the comparison between the compression ratios when
using adaptive k-means alone and when applying adaptive k-means together with the apriori
algorithm on the database files.
Table 3: The compression ratios for the adaptive k-means and apriori compression algorithm.
Original size in kilobytes   Comp ratio by using clustering   Ratio by using apriori with clustering technique   No of clusters
128 k 91% 91% 6
168 k 90% 90% 11
176 k 86% 89% 6
356 k 85% 89% 14
372 k 62% 68% 697
640 k 63% 77% 6
704 k 60% 70% 43
852 k 73% 76% 107
872 k 49% 63% 4
1188 k 46% 70% 10
1444 k 65% 71% 150
2693 k 60% 61% 82
In the above table, the comp ratio by using clustering (compression ratio by using a
clustering technique) represents the compression ratio obtained by using the adaptive k-means
compression algorithm, and the ratio by using apriori with clustering represents the
compression ratio obtained by applying the apriori compression algorithm to the adaptive
k-means compression algorithm results. Fig. 3 shows the compression ratio when applying the
adaptive k-means algorithm alone and together with the apriori algorithm on the original
database file.
Fig 3. Illustration of the compression ratios for the adaptive k-means and apriori compression
algorithm.
The compression ratio is calculated using the following equation:

Compression ratio = (1 - Compressed file size / Original file size) * 100 .................... (4)
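As a concrete illustration, Eq. (4) can be computed as follows. This is a minimal sketch: the function name is ours, and the example sizes are the 128 KB file from Table 3 and its 12 KB adaptive k-means output from Table 5.

```python
def compression_ratio(original_size_kb, compressed_size_kb):
    """Compression ratio as a percentage, per Eq. (4)."""
    return (1 - compressed_size_kb / original_size_kb) * 100

# 128 KB original compressed to 12 KB:
ratio = compression_ratio(128, 12)
print(round(ratio))  # 91, matching the first row of Table 3
```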
Table 3 shows that the apriori compression ratio exceeds the adaptive k-means compression ratio, except for small database files, where the two algorithms give the same compression ratio (as for the dept and niaid100 tables). Table 4 shows the decompression time taken by the adaptive k-means decompression algorithm and by the apriori decompression algorithm.
Table 4: Original database file size and the corresponding decompression time.

Original size (KB)   Clustering decomp time in sec   Apriori decomp time in sec   No of clusters
128                    2                               2                            6
168                    4                               4                           11
176                    4                               4                            6
356                   11                              12                           14
372                  157                              14                          697
640                   92                             114                            6
704                   73                              77                           43
852                   71                              74                          107
872                  123                             140                            4
1188                 549                             696                           10
1444                 144                             151                          150
2693                 185                             187                           82
The "clustering decomp time in sec" column gives the time taken to recover the original data from the compressed data using the adaptive k-means decompression algorithm, and the "apriori decomp time in sec" column gives the corresponding time for the apriori decompression algorithm. Tables 1 and 4 show that the compression time is less than the decompression time, and that the time increases as the number of clusters increases, because more loops are executed by the program. The adaptive k-means decompression algorithm also takes less time than the apriori decompression algorithm, except for the medicine table, where the apriori decompression algorithm is faster. Fig. 4 compares the decompression times of the adaptive k-means and apriori decompression algorithms on the compressed files.
Table 5 shows the compressed file sizes when the adaptive k-means algorithm is used alone, when the apriori algorithm is used alone, and when the adaptive k-means algorithm is combined with the apriori algorithm.
Fig 4. The decompression time for the adaptive k-means algorithm and the apriori algorithm.
Table 5: The compressed file sizes and the original file sizes.

Original size (KB)   AK comp size (KB)   Joint apriori with adaptive k-means size (KB)   Apriori compression size (KB)   No of clusters
128                    12                  12                                               8                             6
168                    16                  16                                              16                            11
356                    52                  40                                              44                            14
372                   140                 120                                             120                           697
704                   280                 220                                             240                            43
2693                 1069                1059                                            1157                            82
The table above shows that applying the adaptive k-means algorithm together with the apriori algorithm generally gives better results than applying either algorithm alone on the original database file. These results are also shown in Fig. 5.
Fig 5. The compressed file size vs original database files size.
Table 6 presents the compression ratio when using each algorithm separately and when using
an integration of these two algorithms.
Table 6: The resulting compression ratios of the three algorithms.

Original size (KB)   Comp ratio by using clustering   Ratio by using apriori with clustering   Apriori ratio   No of clusters
128                  91%                              91%                                      94%               6
168                  90%                              90%                                      90%              11
356                  85%                              89%                                      88%              14
372                  62%                              68%                                      68%             697
704                  60%                              70%                                      66%              43
2693                 60%                              61%                                      57%              82
The table above shows that applying the apriori algorithm to the output of the adaptive k-means algorithm gives a higher compression ratio than applying either algorithm alone to the original database, except when the number of records is small. In that case the apriori algorithm applied alone gives a better compression ratio, because the support of the left-hand side (LHS) increases while the confidence value fails to reach min_conf, so few rules are produced. With a large number of records, the number of LHS and RHS items that appear together increases, which may make the confidence values reach min_conf even as the support of the LHS items grows, so the number of extracted rules increases. This increase enlarges the compressed file size and lowers the compression ratio. Fig. 6 demonstrates the compression ratios after applying the apriori algorithm alone, the adaptive k-means algorithm alone, and the combination of the adaptive k-means algorithm with the apriori algorithm.
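The role of support and confidence in deciding whether a rule LHS -> RHS is extracted can be sketched with a toy example; the transactions, item names, and threshold below are our own illustration, not data from the paper.

```python
# Toy transaction database (hypothetical items, not the paper's data).
transactions = [
    {"dept", "salary"},
    {"dept", "salary"},
    {"dept"},
    {"dept", "salary"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """Confidence of the rule lhs -> rhs."""
    return support(lhs | rhs) / support(lhs)

min_conf = 0.8
conf = confidence({"dept"}, {"salary"})  # support 3/4 over support 4/4 = 0.75
print(conf >= min_conf)  # False: the rule is rejected, so fewer rules are stored
```

When the LHS support grows without the joint support keeping pace, confidence drops below min_conf, which is exactly the small-record behaviour described above.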
Fig 6. The compression ratios corresponding to the original database file size.
Combining the adaptive k-means and apriori algorithms means that the adaptive k-means algorithm is applied to the database file first, and the apriori algorithm is then applied to the files produced by the adaptive k-means algorithm.
Table 7: The compression time of the three discussed schemes, measured in seconds.

Original size (KB)   AK comp time in sec   Ap time in sec   Apriori comp time on original database in sec   No of clusters
128                   1                      2                 2                                               6
168                   1                      2                 4                                              11
356                   2                     10                38                                              14
372                  11                    147                23                                             697
704                   4                     63               431                                              43
2693                  8                     31               887                                              82
Fig. 7 shows the time taken to compress the database files using the adaptive k-means algorithm alone, the apriori algorithm alone, and the combination of the two algorithms.
Fig 7. The compression time when using each algorithm separately and when using the combination of these algorithms.
The figure shows that applying the apriori algorithm to the original database takes more time than applying it to the results of the adaptive k-means algorithm, except for the medicine table, where applying it to the adaptive k-means results takes more time because the number of clusters produced by the adaptive k-means algorithm on that table is very large. In other words, as the number of clusters increases, the compression time increases.
Table 8: The decompression time of the three algorithms.

Original size (KB)   Clustering decomp time in sec   Apriori decomp time in sec   Apriori on DB decomp time in sec   No of clusters
128                    2                               2                             2                                  6
168                    4                               4                             3                                 11
356                   11                              12                            15                                 14
372                  157                              14                            14                                697
704                   73                              77                            86                                 43
2693                 185                             187                           221                                 82
In the above table, the first column gives the original database size. The second column gives the time taken to recover the original database using the adaptive k-means decompression algorithm. The third column gives the time taken using the apriori decompression algorithm applied to the compressed files produced by running the apriori compression algorithm on the adaptive k-means results. The fourth column gives the time taken using the apriori decompression algorithm applied to the compressed files produced by running the apriori compression algorithm directly on the original database. The table shows that the apriori decompression algorithm takes more time on files compressed directly from the original databases than on files compressed from the results of the adaptive k-means compression algorithm.
Fig 8. Illustration of the decompression time for the three algorithms.
The following table and Fig. 9 show a comparison between the compression time and the decompression time.
Table 9: A comparison between the compression time and the decompression time.

AK comp time in sec   Ap time in sec   Clustering decomp time in sec   Apriori decomp time in sec   No of clusters
 1                      2                 2                               2                            6
 1                      2                 4                               4                           11
 2                      2                 4                               4                            6
 2                     10                11                              12                           14
11                    147               157                              14                          697
 3                     31                92                             114                            6
 4                     63                73                              77                           43
 7                     18                71                              74                          107
 3                     82               123                             140                            4
 6                    157               549                             696                           10
11                     81               144                             151                          150
 8                     31               185                             187                           82
The above table shows that the time taken to restore the original database exceeds the time taken to compress it. To restore the original database, the algorithm reads the data from the compressed file and appends it to a new database; this appending step, which is performed automatically, is what delays recovery of the original database.
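A minimal sketch of this recovery step, assuming a SQLite database and a hypothetical dept schema (neither is specified in the paper); the point is that each recovered record is appended row by row, which dominates the decompression time.

```python
import sqlite3

# Rebuild the database in memory (illustrative schema, not the paper's).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dept (id INTEGER, name TEXT)")

# Records as they would be recovered from the compressed file.
recovered_records = [(1, "niaid"), (2, "salary"), (3, "medicine")]
for rec in recovered_records:  # row-by-row append, as described above
    conn.execute("INSERT INTO dept VALUES (?, ?)", rec)
conn.commit()

print(conn.execute("SELECT COUNT(*) FROM dept").fetchone()[0])  # 3
```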
Fig 9. A comparison of the compression time and the decompression time.
7. Conclusions
In this work, database files are exploited to discover the grouping of objects and the correlated data in each group. A combined approach (adaptive k-means and the apriori algorithm) is proposed: the adaptive k-means algorithm discovers the grouping of objects, while the apriori algorithm extracts the correlated data within each group. The experimental results show that the proposed compression algorithms are effective in reducing the quantity of transmitted data and improving compressibility, and thus in reducing the energy consumed for data transfer. The apriori algorithm achieves a better compression ratio than the adaptive k-means algorithm, but both its compression and its decompression take more time than the corresponding adaptive k-means algorithms. The apriori algorithm is therefore considered better than the adaptive k-means algorithm at reducing the database file size. Even when the apriori algorithm is applied alone to the database file, without being combined with the adaptive k-means algorithm, it gives a better compression ratio and smaller files than the adaptive k-means algorithm applied alone. However, when the apriori algorithm is applied alone, the compression time exceeds both the time taken by the adaptive k-means algorithm alone and the time taken when the apriori algorithm is applied to the adaptive k-means results. The proposed adaptive k-means algorithm can handle any data type, such as text and dates, whereas the standard k-means algorithm handles numerical data only. To validate the results, we calculated several measures commonly used to evaluate the performance of compression algorithms: the compression ratio, the compression time, and the decompression time. The compression ratios of
joining the apriori algorithm with the adaptive k-means algorithm range between 61% and 91%, while the compression ratios of the adaptive k-means algorithm alone range between 46% and 91%. As shown in section 4, the compression time ranges between 2 sec and 157 sec for the combination of the adaptive k-means and apriori algorithms, and between 1 sec and 11 sec for the adaptive k-means algorithm alone. Finally, the decompression time of the apriori decompression algorithm ranges between 2 sec and 696 sec, and that of the adaptive k-means decompression algorithm between 2 sec and 549 sec.
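One way to sketch how k-means can be extended to the non-numerical attributes mentioned above is with a simple mismatch distance that counts differing attribute values; this is only an illustration of the idea, not a reproduction of the paper's adaptive k-means algorithm, and the records below are hypothetical.

```python
def distance(a, b):
    """Count the attribute positions where two records differ."""
    return sum(x != y for x, y in zip(a, b))

# Mixed-type records: (department, date, numeric value) -- illustrative only.
records = [
    ("dept", "2012-01-07", 200),
    ("dept", "2012-01-07", 210),
    ("medicine", "2011-05-01", 831),
]

def nearest(record, centers):
    """Index of the closest cluster center under the mismatch distance."""
    return min(range(len(centers)), key=lambda i: distance(record, centers[i]))

centers = [records[0], records[2]]  # two seed centers
labels = [nearest(r, centers) for r in records]
print(labels)  # [0, 0, 1]: the two dept records group together
```

Because the distance never converts attributes to numbers, text and date fields participate in clustering directly, which is the property claimed for the proposed adaptive k-means algorithm.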
References
Adriaans, P., & Zantinge, D. (1998). Data Mining. Addison-Wesley.
Afify, H., Islam, M., & Wahed, M. A. (2011). DNA lossless differential compression algorithm based on similarity of genomic sequence database. International Journal of Computer Science & Information Technology (IJCSIT), 3(4).
Agrawal, R., Imielinski, T., & Swami, A. (1997). Database mining: A performance perspective. IEEE Trans. Knowledge and Data Engineering.
Berkhin, P. (2002). Survey of clustering data mining techniques. Accrue Software, San Jose, CA.
Brin, S., Motwani, R., Ullman, J. D., & Tsur, S. (1997). Dynamic itemset counting and implication rules for market basket data. In Proceedings of SIGMOD '97, Tucson, Arizona, USA.
Figueroa, V. (2010-2011). Clustering methods applied to Wikipedia. Année académique.
Goetz, G., & Leonard, D. S. (1991). Data compression and database performance. Oregon Advanced Computing Institute (OACIS) and NSF awards IRI-8805200, IRI-8912618, and IRI-9006348.
Graham, W. (1994). Signal Coding and Processing (2nd ed.). Cambridge University Press, p. 34.
Han, J., & Kamber, M. (2001). Data Mining: Concepts and Techniques. Academic Press, USA.
Hand, D., Mannila, H., & Smyth, P. (2001). Principles of Data Mining. The MIT Press, Cambridge, London, England.
Jacob, N., Somvanshi, P., & Tornekar, R. (2012). Comparative analysis of lossless text compression techniques. International Journal of Computer Applications (0975-8887), 56(3).
Kodituwakku, S. R., & Amarasinghe, U. S. (2007). Comparison of lossless data compression algorithms for text data. Indian Journal of Computer Science and Engineering, 1(4), 416-425.
Kona, H. V. (2003). Association rule mining over multiple databases: Partitioned and incremental approaches. Master's thesis, The University of Texas at Arlington.
Neeraj, S., & Swati, L. S. (2012). Overview of non-redundant association rule mining. Research Journal of Recent Sciences (Res. J. Recent Sci.), ISSN 2277-2502, 1(2), 108-112.
Pu, I. M. (2006). Fundamental Data Compression. Elsevier, Britain.
Rai, P., & Singh, S. (2010). A survey of clustering techniques. International Journal of Computer Applications (0975-8887), 7(12).
Rajaraman, A., & Ullman, J. D. (2010, 2011). Mining of Massive Datasets.
Report Concerning Space Data System Standards. (2006). Lossless Data Compression. CCSDS Secretariat, Office of Space Communication (Code M-3), National Aeronautics and Space Administration, Washington, DC 20546, USA.
Saad K. Majeed, & Hussein K. Abbas. (2010). An improved distributed association rule algorithm. Eng. & Tech. Journal, 28(18).
Suarjaya, I. M. A. D. (2012). A new algorithm for data compression optimization. International Journal of Advanced Computer Science and Applications (IJACSA), 3(8).
Szépkúti, I. (2004). Difference sequence compression of multidimensional databases. Periodica Polytechnica Ser. El. Eng., 48(3-4), 197-218.
Yijun, Lin, X., & Tsang, C. (2000). An efficient distributed algorithm for computing association rules. Springer-Verlag Berlin Heidelberg.