Integrating Compression Technique for Data Mining
Dr. Manmohan Singh, Assistant Professor
ITM Universe, Vadodara, Gujarat, India
Presentation Outline
Introduction
Compression Technique
Association Rule Mining
Limitation Of Apriori
Literature Survey
Problem Statement
Proposed Work
Implementation Environment
Conclusion
References
What Is Data Mining
Data mining helps users discover interesting and useful knowledge more easily.
Data compression is one good way to reduce data size.
Data preprocessing transforms the original database into a new data representation.
It generates a new transaction database at the end of the data preprocessing step.
What Is Data Mining
The figure shows data mining as a step in an iterative knowledge discovery process.
Why Data Mining?
Data is scattered across the network, so it is difficult to find the data that is actually needed. Data mining helps to find that data.
A businessperson who wants to grow a business needs smart data, techniques, models, tools, etc.
Data mining helps us obtain, use, and understand that data.
There is a need to extract useful information from the data and to interpret the data.
Application
Financial Data Analysis
Retail Industry
Telecommunication Industry
Biological Data Analysis
Other Scientific Applications
Intrusion Detection
Issues
Mining Methodology
User Interaction
Performance Issues
Diverse Data Types Issues
Compression Technique
Makes optimal use of limited storage space.
Reduces the size of the data and improves I/O performance.
Compression has also recently been applied to reading large scientific files in parallel file systems.
Compression decreases bandwidth consumption on networks and reduces energy consumption in hardware.
Compression has been used extensively in wireless networks.
Types Of Compression Techniques
Null Compression: Replaces a series of blank spaces with a compression code.
Run-Length Compression: Expands on null compression by compressing any series of four or more repeating characters.
Keyword Encoding: Creates a table with values that represent common sets of characters.
Adaptive Huffman Coding: Assigns fewer bits to symbols that occur more frequently and more bits to symbols that appear less often.
Lempel-Ziv Compression:
Building an indexed dictionary
Compressing a string of symbols
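The run-length and Lempel-Ziv techniques listed above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the presentation: the function names, the `min_run=4` threshold (taken from the "four or more repeating characters" rule), and the choice of the LZ78 variant of Lempel-Ziv are all assumptions.

```python
from itertools import groupby

def rle_encode(text, min_run=4):
    """Run-length encode: runs of min_run or more become one (char, count) pair."""
    out = []
    for ch, group in groupby(text):
        n = len(list(group))
        if n >= min_run:
            out.append((ch, n))                      # long run: character stored once with its count
        else:
            out.extend((ch, 1) for _ in range(n))    # short run: kept as literal characters
    return out

def rle_decode(pairs):
    """Expand (char, count) pairs back into the original string."""
    return "".join(ch * n for ch, n in pairs)

def lz78_encode(text):
    """LZ78 (assumed variant): build an indexed dictionary while emitting (index, next_char)."""
    dictionary = {"": 0}
    out, phrase = [], ""
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch                             # keep extending the longest known phrase
        else:
            out.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)
            phrase = ""
    if phrase:                                       # flush a trailing phrase that is already known
        out.append((dictionary[phrase[:-1]], phrase[-1]))
    return out

def lz78_decode(pairs):
    """Rebuild the string by replaying the dictionary construction."""
    phrases, out = [""], []
    for index, ch in pairs:
        phrase = phrases[index] + ch
        phrases.append(phrase)
        out.append(phrase)
    return "".join(out)
```

Both encoders are lossless: decoding the output of either one returns the input string unchanged.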
Association Rule Mining
It is a method for discovering interesting relations between variables in large databases.
It is intended to identify strong rules discovered in databases using different measures of interestingness.
Many algorithms have been proposed for finding strong associations in data sets.
Among them, Apriori, developed in 1994, is the best-known association rule algorithm, but it has some major issues.
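Two common measures of interestingness are support and confidence. The sketch below shows how they could be computed; the sample transactions and the function names are hypothetical and only serve as an illustration.

```python
# Hypothetical market-basket data, one set of items per transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    items = set(itemset)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Support of (antecedent and consequent together) divided by support of antecedent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

bread_milk_support = support({"bread", "milk"}, transactions)           # 2 of 4 transactions
bread_to_milk_conf = confidence({"bread"}, {"milk"}, transactions)      # 2 of the 3 bread transactions
```

A rule such as bread => milk is usually called strong when both values exceed user-chosen minimum thresholds.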
Limitations of Apriori
Needs several scans of the data.
Has difficulty finding rarely occurring events.
Works well only for small data sets.
Spends costly time handling a vast number of candidate sets.
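A compact Apriori sketch makes the first and last limitations concrete: each itemset size needs another full pass over the database, and every candidate must be counted on that pass. This is a simplified textbook-style version written for illustration; the function name, the `min_count` parameter, and the returned scan counter are assumptions.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return ({itemset: count} for frequent itemsets, number of database scans)."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})
    candidates = [frozenset([i]) for i in items]
    frequent, scans, k = {}, 0, 1
    while candidates:
        scans += 1                                   # one full pass over the data per itemset size
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        k += 1
        # Join step: unions of frequent (k-1)-itemsets that contain exactly k items.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: keep a candidate only if all of its (k-1)-subsets are frequent.
        candidates = [c for c in joined
                      if all(frozenset(s) in level for s in combinations(c, k - 1))]
    return frequent, scans

db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
frequent, scans = apriori(db, min_count=3)
```

On this five-transaction example the algorithm scans the database three times, and the third scan is spent on a candidate that turns out not to be frequent.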
Literature Survey
Sr No | Reference Paper | Methodology Used | Future Work
1 | Integrating Compression and Execution in Column-Oriented Database Systems, by Daniel J. Abadi, Samuel R. Madden, and Miguel C. Ferreira | Column-oriented database system architecture | NIL
2 | Integrating Online Compression to Accelerate Large-Scale Data Analytics Applications, by Tekin Bicer, Jian Yin, David Chiu, Gagan Agrawal, and Karen Schuchardt | Chunk resource allocation, parallel compression engine | NIL
3 | Efficient Mining Frequent Itemsets Algorithms, by Marghny H. Mohamed and Mohammed M. Darwieesh | Count table, binary count table | Extend the algorithms to mine other kinds of patterns, such as sequential patterns
4 | A Transaction Mapping Algorithm for Frequent Itemsets Mining, by Mingjun Song and Sanguthevar Rajasekaran | Transaction mapping (TM) algorithm | Improve the implementation of the TM algorithm and make a fair comparison with FP-growth
5 | Compact Transaction Database for Efficient Frequent Pattern Mining, by Qian Wan and Aijun An | Compact tree structure called CT-tree | NIL
6 | A New Association Rules Mining Algorithm Based on Vector, by Xin Zhang, Pin Liao, and Huiyong Wang | Association rule mining algorithm based on vector | NIL
Problem Statement
The surveyed approaches all lack the ability to decompress the data to its original state while also improving data mining performance.
Maintaining the compressed database in the future is an even bigger challenge.
Too much time is spent checking candidate itemsets in the data mining step.
The data set cannot be entered at runtime.
Proposed workflow (figure): original database -> sorted database -> sorted database groups (Group 1, Group 2, Group 3) -> compress the dataset and generate merged groups -> compressed transaction dataset -> generate frequent itemsets with the simple Apriori algorithm -> generate association rules and the uncompressed dataset.
Proposed Work
The main research criteria are the following:
(a) The compressed database can be decompressed to the original form.
(b) Reduce the processing time of association rule mining by using a quantification table.
(c) Reduce I/O time by using only the compressed database to do data mining.
(d) Allow incremental data mining.
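Criteria (a), (c), and (d) can be illustrated with a deliberately simple stand-in for the proposed method: identical transactions are merged into one group that remembers their original positions. This is an assumed technique chosen for illustration, not the algorithm from the presentation, and it does not model the quantification table of criterion (b). All function names are hypothetical.

```python
def compress(transactions):
    """Merge identical transactions; keep original positions so nothing is lost."""
    groups = {}
    for pos, t in enumerate(transactions):
        key = tuple(sorted(t))                # sorted items give equal transactions the same key
        groups.setdefault(key, []).append(pos)
    return groups

def decompress(groups):
    """Rebuild the transaction list in its original order (items come back sorted)."""
    size = sum(len(positions) for positions in groups.values())
    out = [None] * size
    for key, positions in groups.items():
        for pos in positions:
            out[pos] = list(key)
    return out

def support_count(itemset, groups):
    """Count supporting transactions by reading only the compressed form."""
    items = set(itemset)
    return sum(len(positions) for key, positions in groups.items() if items <= set(key))

def add_transaction(groups, t):
    """Incremental update: append one transaction without recompressing everything."""
    pos = sum(len(positions) for positions in groups.values())
    groups.setdefault(tuple(sorted(t)), []).append(pos)

db = [["a", "b"], ["a", "c"], ["a", "b"], ["b"]]
groups = compress(db)                         # 3 groups represent 4 transactions
```

Support counting touches each distinct transaction once, so the saving grows with the number of duplicate transactions in the database.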
Implementation Environment
Minimum Hardware Requirement:
1. 3 GHz Pentium PC.
2. 512 megabytes of main memory.
3. Screen resolution between 800 x 600 and 1200 x 800.
Minimum Software Requirement:
1. Operating system: Microsoft Windows XP.
2. Microsoft Visual Studio .NET (C#).
Conclusion
The rapid growth of large data sets has become a point of concern, particularly the time required for data preprocessing.
Hence, the proposed algorithm can be beneficial when dealing with such large data, as it can also decompress the data after compression.
It can also reduce I/O time by using only the compressed database.
References
1. Xin Zhang, Pin Liao, and Huiyong Wang, "A New Association Rules Mining Algorithm Based on Vector," 2009 Third International Conference on Genetic and Evolutionary Computing.
2. Qian Wan and Aijun An, "Compact Transaction Database for Efficient Frequent Pattern Mining," Department of Computer Science and Engineering, York University, Toronto, Ontario, M3J 1P3, Canada.
3. Jia-Yu Dai, Don-Lin Yang, Jungpin Wu, and Ming-Chuan Hung, "An Efficient Data Mining Approach on Compressed Transactions," International Journal of Electrical and Computer Engineering 3:2, 2008.
4. Wael Ahmad AlZoubi, Khairuddin Omar, and Azuraliza Abu Bakar, "An Efficient Mining of Transactional Data Using Graph-Based Technique," 2011 3rd Conference on Data Mining and Optimization (DMO), 28-29 June 2011, Selangor, Malaysia.
5. Mingjun Song and Sanguthevar Rajasekaran, "A Transaction Mapping Algorithm for Frequent Itemsets Mining," IEEE Transactions on Knowledge and Data Engineering, October 2005.
6. Marghny H. Mohamed and Mohammed M. Darwieesh, "Efficient Mining Frequent Itemsets Algorithms," Revised: 7 March 2012 / Accepted: 29 April 2013, Springer-Verlag Berlin Heidelberg 2013.
7. Fan Zhang, Yan Zhang, and Jason Bakos, "GPApriori: GPU-Accelerated Frequent Itemset Mining," 2011 IEEE International Conference on Cluster Computing.
8. Tekin Bicer, Jian Yin, David Chiu, Gagan Agrawal, and Karen Schuchardt, "Integrating Online Compression to Accelerate Large-Scale Data Analytics Applications," 2013 IEEE 27th International Symposium on Parallel & Distributed Processing.
9. Daniel J. Abadi, Samuel R. Madden, and Miguel C. Ferreira, "Integrating Compression and Execution in Column-Oriented Database Systems," SIGMOD 2006, June 27–29, 2006, Chicago, Illinois, USA. Copyright 2006 ACM 1595932569/06/0006.
10. Shalini Dutt, Naveen Choudhary, and Dharm Singh, "An Improved Apriori Algorithm Based on Matrix Data Structure," Global Journal of Computer Science and Technology: C Software & Data Engineering, Vol. 14, Issue 5, Version 1.0, 2014.
11. Wael A. AlZoubi, Azuraliza Abu Bakar, and Khairuddin Omar, "Scalable and Efficient Method for Mining Association Rules," 2009 International Conference on Electrical Engineering and Informatics, 5-7 August 2009, Selangor, Malaysia.
12. Loan T. T. Nguyen, Bay Vo, Tzung-Pei Hong, and Hoang Chi Thanh, "CAR-Miner: An Efficient Algorithm for Mining Class-Association Rules," Expert Systems with Applications 40 (2013) 2305-2311, © 2012 Elsevier Ltd. All rights reserved.
13. Mohammed Al-Maolegi and Bassam Arkok, "An Improved Apriori Algorithm for Association Rules," International Journal on Natural Language Computing (IJNLC), Vol. 3, No. 1, February 2014.
ANY QUERY?