Integrating Compression Techniques for Data Mining

Dr. Manmohan Singh, Assistant Professor

ITM Universe, Vadodara, Gujarat, India

Presentation Outline

Introduction

Compression Technique

Association Rule Mining

Limitations of Apriori

Literature Survey

Problem Statement

Proposed Work

Implementation Environment

Conclusion

References

What Is Data Mining

Data mining helps users discover interesting and useful knowledge more easily.

Data compression is one good way to reduce data size.

Data pre-processing transforms the original database into a new data representation.

It generates a new transaction database at the end of the data pre-processing step.

What Is Data Mining

The figure shows data mining as one step in an iterative knowledge discovery process.

Why Data Mining?

Data is scattered across networks, so the relevant data is difficult to find. Data mining helps to locate it.

A business owner who wants to grow a business needs smart data, techniques, models, tools, and so on.

Data mining helps us obtain, use, and understand that data.

There is a need to extract useful information from the data and to interpret the data.

Applications

Financial Data Analysis

Retail Industry

Telecommunication Industry

Biological Data Analysis

Other Scientific Applications

Intrusion Detection

Issues

Mining Methodology

User Interaction

Performance Issues

Diverse Data Types Issues

Compression Technique

It makes optimal use of limited storage space.

It reduces the size of the data and improves I/O performance.

Compression has also recently been applied to reading large scientific files in parallel file systems.

Compression decreases bandwidth consumption on networks and reduces energy consumption in hardware.

Compression has been used extensively in wireless networks.

Types of Compression Techniques

Null Compression: Replaces a series of blank spaces with a compression code.

Run-Length Compression: Expands on null compression by compressing a series of four or more repeating characters.

Keyword Encoding: Creates a table of values that represent common sets of characters.

Adaptive Huffman Coding: Assigns fewer bits to symbols that occur more frequently and more bits to symbols that appear less often.

Lempel-Ziv Compression:

Building an indexed dictionary

Compressing a string of symbols
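As a small illustration of the run-length idea listed above, the sketch below collapses every run of identical characters into a (character, length) pair and then restores the original string. The function names `rle_encode` and `rle_decode` are hypothetical, and encoding every run (rather than only runs of four or more) is a simplifying assumption made for this example.

```python
from itertools import groupby


def rle_encode(text):
    """Collapse each run of identical characters into a (char, length) pair."""
    return [(ch, len(list(run))) for ch, run in groupby(text)]


def rle_decode(pairs):
    """Expand (char, length) pairs back into the original string."""
    return "".join(ch * n for ch, n in pairs)


encoded = rle_encode("AAAABBBCCD")
print(encoded)               # [('A', 4), ('B', 3), ('C', 2), ('D', 1)]
print(rle_decode(encoded))   # AAAABBBCCD
```

The round trip shows that the scheme is lossless: decoding always returns the exact input, which is the property the proposed work also requires of the compressed database.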

Association Rule Mining

It is a method for discovering interesting relations between variables in large databases.

It is intended to identify strong rules in databases using different measures of interestingness.

Many algorithms have been proposed for finding strong associations among data sets.

Of these, Apriori, developed in 1994, is the best-known association rule algorithm, but it has some major issues.
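The slides do not name specific interestingness measures; support and confidence are the two most commonly used, so the sketch below assumes them. The toy transactions and the helper names `support` and `confidence` are illustrative only.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0  # the rule never fires, so report zero confidence
    both = support(set(antecedent) | set(consequent), transactions)
    return both / base


# Toy market-basket data (an assumption for this example).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

print(support({"bread", "milk"}, transactions))       # 0.5
print(confidence({"bread"}, {"milk"}, transactions))  # about 0.67
```

A rule is usually called strong when both values exceed user-chosen minimum thresholds.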

Limitations of Apriori

It needs several iterations, each of which scans the data.

It has difficulty finding rarely occurring events.

It works well only for small data sets.

It is costly, wasting time to hold a vast number of candidate sets.
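To make the scanning cost concrete, here is a minimal level-wise Apriori sketch that also counts how many full passes over the data it performs. It is a textbook-style illustration, not the implementation used in this work; the function name `apriori` and the `min_count` parameter are assumptions.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return (frequent itemsets with their counts, number of database scans)."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})
    candidates = [frozenset([i]) for i in items]
    frequent, scans, k = {}, 0, 1
    while candidates:
        scans += 1  # every level needs one full pass over the data
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: keep a candidate only if all its (k-1)-subsets are frequent.
        candidates = [
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return frequent, scans


data = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
frequent, scans = apriori(data, min_count=3)
print(len(frequent), scans)  # 6 3
```

Even on five transactions the algorithm makes three passes, and the candidate list at each level can grow combinatorially with the number of frequent items.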

Literature Survey

Sr No | Reference Paper | Methodology Used | Future Work
1 | "Integrating Compression and Execution in Column-Oriented Database Systems" by Daniel J. Abadi, Samuel R. Madden, and Miguel C. Ferreira | Column-oriented database system architecture | NIL
2 | "Integrating Online Compression to Accelerate Large-Scale Data Analytics Application" by Tekin Bicer, Jian Yin, David Chiu, Gagan Agrawal, and Karen Schuchardt | Chunk resource allocation, parallel compression engine | NIL
3 | "Efficient Mining Frequent Itemsets Algorithms" by Marghny H. Mohamed and Mohammed M. Darwieesh | Count table, binary count table | Extend the algorithms to mine other kinds of patterns, such as the sequential pattern mining problem
4 | "A Transaction Mapping Algorithm for Frequent Itemsets Mining" by Mingjun Song and Sanguthevar Rajasekaran | Transaction mapping (TM) algorithm | Improve the implementation of the TM algorithm and make a fair comparison with FP-growth
5 | "Compact Transaction Database for Efficient Frequent Pattern Mining" by Qian Wan and Aijun An | Compact tree structure called CT-tree | NIL
6 | "A New Association Rules Mining Algorithm Based on Vector" by Xin Zhang, Pin Liao, and Huiyong Wang | Association rule mining algorithm based on vector | NIL

Problem Statement

The existing approaches all lack the ability to decompress the data to its original state and to improve data mining performance.

Maintaining the compressed database over time is an even bigger challenge.

Too much time is spent checking candidate itemsets in the data mining step.

The data set cannot be entered at runtime.

Flow diagram of the proposed approach:

1. Original database

2. Sorted database

3. Sorted database, Group 1

4. Compress the dataset and generate the merged group

5. Compressed transaction dataset

6. Generate frequent itemsets with the simple Apriori algorithm

7. Generate association rules and the uncompressed dataset

Proposed Work

The main criteria of the research are as follows:

(a) The compressed database can be decompressed to the original form.

(b) Reduce the process time of association rule mining by using a quantification table.

(c) Reduce I/O time by using only the compressed database to do data mining.

(d) Allow incremental data mining.
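Criteria (a), (c), and (d) can be illustrated with a small sketch. It assumes one possible compression scheme, which is not necessarily the one used in this work: identical transactions are merged into a single stored itemset together with the list of positions where they occurred. The class `CompressedTransactions` and its methods are hypothetical, and the quantification table of criterion (b) is not modeled here. Python is used for illustration only.

```python
from collections import defaultdict


class CompressedTransactions:
    """Stores each distinct itemset once, with the ids of the transactions holding it."""

    def __init__(self):
        self.groups = defaultdict(list)  # sorted item tuple -> transaction ids
        self.size = 0

    def add(self, transaction):
        """Incrementally add one transaction (criterion d)."""
        key = tuple(sorted(set(transaction)))
        self.groups[key].append(self.size)
        self.size += 1

    def support_count(self, itemset):
        """Count supporting transactions using only the compressed form (criterion c)."""
        itemset = set(itemset)
        return sum(len(tids) for key, tids in self.groups.items()
                   if itemset.issubset(key))

    def decompress(self):
        """Rebuild the transactions in their original order (criterion a)."""
        rows = [None] * self.size
        for key, tids in self.groups.items():
            for tid in tids:
                rows[tid] = key
        return rows


original = [("a", "b"), ("a", "c"), ("a", "b"), ("b",), ("a", "b")]
db = CompressedTransactions()
for t in original:
    db.add(t)

print(len(db.groups))                  # 3
print(db.support_count({"a", "b"}))    # 3
print(db.decompress() == original)     # True
```

Each distinct itemset is tested once per support query regardless of how many times it occurs, so the saving grows with the amount of duplication in the data. Items inside a transaction are kept in sorted order, so the round trip is exact only for transactions that are already sorted.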

Implementation Environment

Minimum Hardware Requirements:

1. 3 GHz Pentium PC.

2. 512 MB main memory.

3. Screen resolution between 800×600 and 1200×800.

Minimum Software Requirements:

1. Operating system: Microsoft Windows XP.

2. Microsoft Visual Studio .NET (C#).

Conclusion

The rapid growth of large data sets has become a point of concern, particularly the time required for data pre-processing.

Hence, the proposed algorithm can be beneficial when dealing with such large data.

It can also decompress the data after compression.

It can also reduce I/O time by using only the compressed database.

References

1. Xin Zhang, Pin Liao, and Huiyong Wang, "A New Association Rules Mining Algorithm Based on Vector," 2009 Third International Conference on Genetic and Evolutionary Computing.

2. Qian Wan and Aijun An, "Compact Transaction Database for Efficient Frequent Pattern Mining," Department of Computer Science and Engineering, York University, Toronto, Ontario, M3J 1P3, Canada.

3. Jia-Yu Dai, Don-Lin Yang, Jungpin Wu, and Ming-Chuan Hung, "An Efficient Data Mining Approach on Compressed Transactions," International Journal of Electrical and Computer Engineering 3:2, 2008.

4. Wael Ahmad AlZoubi, Khairuddin Omar, and Azuraliza Abu Bakar, "An Efficient Mining of Transactional Data Using Graph-Based Technique," 2011 3rd Conference on Data Mining and Optimization (DMO), 28-29 June 2011, Selangor, Malaysia.

5. Mingjun Song and Sanguthevar Rajasekaran, "A Transaction Mapping Algorithm for Frequent Itemsets Mining," IEEE Transactions on Knowledge and Data Engineering, October 2005.

6. Marghny H. Mohamed and Mohammed M. Darwieesh, "Efficient Mining Frequent Itemsets Algorithm," Revised: 7 March 2012 / Accepted: 29 April 2013, Springer-Verlag Berlin Heidelberg 2013.

7. Fan Zhang, Yan Zhang, and Jason Bakos, "GP Apriori: GPU-Accelerated Frequent Itemset Mining," 2011 IEEE International Conference on Cluster Computing.

8. Tekin Bicer, Jian Yin, David Chiu, Gagan Agrawal, and Karen Schuchardt, "Integrating Online Compression to Accelerate Large-Scale Data Analytics Application," 2013 IEEE 27th International Symposium on Parallel & Distributed Processing.

9. Daniel J. Abadi, Samuel R. Madden, and Miguel C. Ferreira, "Integrating Compression and Execution in Column-Oriented Database Systems," SIGMOD 2006, June 27-29, 2006, Chicago, Illinois, USA. Copyright 2006 ACM 1595932569/06/0006.

10. Shalini Dutt, Naveen Choudhary, and Dharm Singh, "An Improved Apriori Algorithm Based on Matrix Data Structure," Global Journal of Computer Science and Technology: C Software & Data Engineering, Vol. 14, Issue 5, Version 1.0, 2014.

11. Wael A. AlZoubi, Azuraliza Abu Bakar, and Khairuddin Omar, "Scalable and Efficient Method for Mining Association Rules," 2009 International Conference on Electrical Engineering and Informatics, 5-7 August 2009, Selangor, Malaysia.

12. Loan T. T. Nguyen, Bay Vo, Tzung-Pei Hong, and Hoang Chi Thanh, "CAR-Miner: An Efficient Algorithm for Mining Class-Association Rules," Expert Systems with Applications 40 (2013) 2305-2311, © 2012 Elsevier Ltd. All rights reserved.

13. Mohammed Al-Maolegi and Bassam Arkok, "An Improved Apriori Algorithm for Association Rules," International Journal on Natural Language Computing (IJNLC), Vol. 3, No. 1, February 2014.

ANY QUERY?
