fundamentals of data mining and its applications

International Journal of Conceptions on Computing & Information Technology Vol. 1, Issue. 1, November 2013; ISSN: 2345 - 9808

Fundamentals of Data Mining and Its Applications

Sourav Sarangi and Subrat Swain

Dept. of Biotechnology, MITS Engineering College,

Rayagada, Odisha [email protected] and [email protected]

Abstract— This paper focuses on data mining and its uses in everyday life. Put simply, data mining is the process in which intelligent methods are applied to extract patterns from data. It is the process of discovering new patterns in large data sets using methods at the intersection of artificial intelligence, machine learning, statistics and database systems. The overall goal of the data mining process is to extract knowledge from a data set and present it in a human-understandable structure; besides the raw analysis step, the process also involves database and data management aspects. The actual data mining task is the automatic or semi-automatic analysis of large quantities of data to extract previously unknown, interesting patterns such as groups of data records, unusual records and dependencies. For example, the data mining step might identify multiple groups in the data, which a decision support system can then use to obtain more accurate predictions. Data mining has many applications: it is now used in business, science, visualization, music, telecommunications and many other fields. This paper discusses data mining and its applications.

Keywords- Process, Software, Privacy Concerns & Ethics, Applications

I. INTRODUCTION

Data mining techniques are the result of a long process of research and product development. Data mining takes this evolutionary process beyond retrospective data access and navigation to prospective and proactive information delivery. Data mining is the extraction of hidden predictive information from large databases. It derives its name from the similarities between searching for valuable business information in a large database and mining a mountain for valuable ore. It is a powerful technology with great potential to help companies focus on the most important information in their data warehouses. Data mining techniques can be implemented rapidly on existing software and hardware platforms to enhance the value of existing information resources, and can be integrated with new products and systems.

Data mining is a promising and relatively new technology, defined as the process of discovering hidden, valuable and useful knowledge by analyzing large amounts of data stored in databases or data warehouses, using techniques from machine learning, artificial intelligence (AI) and statistics. Data mining is an iterative process that typically involves the following phases:

1) Problem Definition

2) Data Exploration

3) Data Preparation

4) Modeling

5) Evaluation

6) Deployment

A. Problem Definition

A data mining project starts with the understanding of the business problem. Data mining experts, business experts, and domain experts work closely together to define the project objectives and the requirements from a business perspective. The project objective is then translated into a data mining problem definition.

II. PROCESS

Fig.2 Data mining Process

A. Data Exploration

Domain experts understand the meaning of the metadata. They collect, describe, and explore the data. Data exploration is a common process in data warehouses which are characterized by large bulks of data coming from disparate systems. Data exploration helps a data consumer focus an information search on the pertinent aspect of relevant data before true analysis can be achieved.



B. Data Preparation

Data preparation normally consumes most of the project time; it has been estimated to account for 60%-80% of the time spent on a data mining project. Once the available data sources are identified, the data need to be selected, cleaned, constructed and formatted into the desired form. Data preparation thus means manipulating data into a form suitable for further analysis and processing, and the outcome of this phase is the final data set. It involves many different tasks and cannot be fully automated. Many of the data preparation activities are routine, tedious, and time consuming.
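The select, clean, construct and format steps can be illustrated with a minimal Python sketch. The record fields (`age`, `income`, `city`), the sample values and the helper name `prepare` are hypothetical and chosen only for illustration; a real project would apply many more rules.

```python
# Hypothetical raw records; field names and values are assumptions for illustration.
RAW_RECORDS = [
    {"id": 1, "age": "34", "income": "52000", "city": " Rayagada "},
    {"id": 2, "age": "", "income": "61000", "city": "Cuttack"},          # missing age
    {"id": 3, "age": "45", "income": "not available", "city": "Puri"},   # unparsable income
    {"id": 4, "age": "29", "income": "48000", "city": "puri"},
]


def prepare(records):
    """Select, clean, construct and format records for the modeling phase."""
    prepared = []
    for rec in records:
        # Select: keep only the attributes needed for the analysis.
        age, income, city = rec["age"], rec["income"], rec["city"]
        # Clean: skip records whose numeric fields cannot be parsed.
        if not (age.strip().isdigit() and income.strip().isdigit()):
            continue
        age, income = int(age), int(income)
        # Construct: derive a new attribute from existing ones.
        income_per_year_of_age = round(income / age, 2)
        # Format: normalize text and emit a uniform structure.
        prepared.append({
            "age": age,
            "income": income,
            "city": city.strip().title(),
            "income_per_year_of_age": income_per_year_of_age,
        })
    return prepared


FINAL_DATA_SET = prepare(RAW_RECORDS)
```

In this sketch, records 2 and 3 are dropped during cleaning, so the final data set contains two uniformly formatted records.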

C. Modeling

Modeling techniques have to be selected and applied to the prepared data set. In the modeling phase, a frequent exchange with the domain experts from the data preparation phase is required. The modeling phase and the evaluation phase are coupled. They can be repeated several times to change parameters until optimal values are achieved.

D. Evaluation

The modeling and evaluation phases are interrelated. Data mining experts evaluate the model. If the model does not satisfy their expectations, they go back to the modeling phase and rebuild the model by changing its parameters until optimal values are achieved. In this phase, new business requirements may be raised because of new patterns discovered in the model results or because of other factors.
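The coupling between modeling and evaluation can be sketched as a loop that tries candidate parameter values and keeps the one that scores best on evaluation data. The one-feature threshold model, the data values and the candidate list below are assumptions made for illustration, not a method taken from this paper.

```python
# Toy evaluation data: (feature value, class label). Values are invented for illustration.
EVALUATION_DATA = [(2.5, 0), (3.5, 0), (5.5, 1), (7.5, 1)]


def predict(threshold, x):
    """Threshold model: predict class 1 when the feature exceeds the threshold."""
    return 1 if x > threshold else 0


def accuracy(threshold, data):
    """Evaluation step: fraction of records the model classifies correctly."""
    correct = sum(1 for x, label in data if predict(threshold, x) == label)
    return correct / len(data)


def model_and_evaluate(candidates, data):
    """Repeat modeling and evaluation for each candidate parameter; keep the best."""
    best_threshold, best_score = None, -1.0
    for threshold in candidates:
        score = accuracy(threshold, data)
        if score > best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score


BEST_THRESHOLD, BEST_SCORE = model_and_evaluate([1.5, 3.0, 4.5, 6.5], EVALUATION_DATA)
```

Here the threshold 4.5 separates the two classes perfectly, so the loop settles on it. In practice the evaluation data should be kept separate from the data used to build the model.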

E. Deployment

The knowledge or information gained through the data mining process needs to be presented in such a way that stakeholders can use it when they want it. Deployment is the application of a model for prediction or classification to new data. After a satisfactory model or set of models has been identified for a particular application, we usually want to deploy those models so that predictions or predicted classifications can quickly be obtained for new data. For example, a credit card company may want to deploy a trained model or set of models to quickly identify transactions which have a high probability of being fraudulent.
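The credit card example can be sketched as follows. The logistic model, its weights, the feature names (`amount_thousands`, `foreign`) and the 0.5 cutoff are all hypothetical stand-ins for whatever an earlier modeling phase would actually produce.

```python
import math

# Assumed output of an earlier modeling phase: weights of a logistic model.
# Feature names and weight values are hypothetical.
TRAINED_WEIGHTS = {"bias": -4.0, "amount_thousands": 0.8, "foreign": 2.0}


def fraud_probability(transaction):
    """Score one new transaction with the already trained model."""
    z = (TRAINED_WEIGHTS["bias"]
         + TRAINED_WEIGHTS["amount_thousands"] * transaction["amount_thousands"]
         + TRAINED_WEIGHTS["foreign"] * transaction["foreign"])
    return 1.0 / (1.0 + math.exp(-z))


def flag_transactions(transactions, cutoff=0.5):
    """Return the ids of transactions whose fraud probability exceeds the cutoff."""
    return [t["id"] for t in transactions if fraud_probability(t) > cutoff]


# New, previously unseen transactions arriving after deployment.
NEW_TRANSACTIONS = [
    {"id": "t1", "amount_thousands": 0.2, "foreign": 0},
    {"id": "t2", "amount_thousands": 4.0, "foreign": 1},
    {"id": "t3", "amount_thousands": 1.0, "foreign": 1},
]
FLAGGED = flag_transactions(NEW_TRANSACTIONS)
```

Only the large foreign transaction `t2` is flagged under these assumed weights. The point of the sketch is that deployment reuses a fixed model and does no further training.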

III. SOFTWARE

Data mining software is a fairly new phrase that refers to the procedure by which predictive patterns are extracted from data. It describes a set of tools used to analyze vast amounts of data in order to discover and understand specific patterns. Data mining software originated in the scientific community, where it was used to discern patterns in data from scientific studies. It quickly found a strong foothold in the business community as large businesses began to amass vast amounts of data. Some data mining software packages are listed below:

1) Carrot2 – Text and search results clustering framework.

2) Chemicalize.org – A chemical structure miner and web search engine.

3) GATE – Natural language processing and language engineering tool.

4) KNIME – The Konstanz Information Miner, a user friendly and comprehensive data analytics framework.

5) Orange – A component-based data mining and machine learning software suite written in the Python language.

6) UIMA – The Unstructured Information Management Architecture (UIMA) is a component framework for analyzing unstructured content such as text, audio and video, originally developed by IBM.

7) jHepWork – A Java cross-platform data analysis framework developed at ANL.

8) Weka – A suite of machine learning software written in the Java language.

IV. PRIVACY CONCERNS & ETHICS

Privacy is a loaded issue. In data mining, the privacy and legal issues that may ensue are key to the conflict. In recent years privacy concerns have taken on a more significant role in American society as merchants, insurance companies, and government agencies amass warehouses containing personal data. Some people believe that data mining itself is ethically neutral, and the term data mining indeed has no ethical implications. In many cases, however, the results of data mining applications such as association rule or classification rule mining can compromise the privacy of the data. The problem of privacy-preserving data mining has become more important in recent years because of the increasing ability to store personal data about users, and the increasing sophistication of data mining algorithms that leverage this information. A number of techniques such as randomization and k-anonymity have been suggested in recent years in order to perform privacy-preserving data mining.

A. Randomization

The randomization technique uses data distortion methods in order to create private representations of the records. In most cases, the individual records cannot be recovered, but only aggregate distributions can be recovered. These aggregate distributions can be used for data mining purposes. Two kinds of perturbation are possible with the randomization method: additive perturbation and multiplicative perturbation.
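A minimal sketch of additive perturbation is given below. It assumes zero-mean Gaussian noise, a noise level of 10, and synthetic ages generated for the example; none of these choices come from this paper. It shows that individual values are distorted while an aggregate such as the mean stays close to the original.

```python
import random
import statistics


def additive_perturbation(values, noise_std, seed=0):
    """Add independent zero-mean Gaussian noise to every value."""
    rng = random.Random(seed)
    return [v + rng.gauss(0.0, noise_std) for v in values]


# Synthetic "private" attribute: 5000 ages, generated only for this example.
_data_rng = random.Random(42)
ORIGINAL_AGES = [_data_rng.randint(20, 60) for _ in range(5000)]

# Published version: each age is distorted, so single records are not reliable.
PERTURBED_AGES = additive_perturbation(ORIGINAL_AGES, noise_std=10.0)

# Aggregate statistics remain usable because the noise averages out.
ORIGINAL_MEAN = statistics.mean(ORIGINAL_AGES)
PERTURBED_MEAN = statistics.mean(PERTURBED_AGES)
```

Multiplicative perturbation would instead multiply each value by random noise; it is not shown here.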

B. The K-anonymity method

An important method for privacy de-identification is k-anonymity. The motivation behind the k-anonymity technique is that many attributes in the data can often be considered pseudo-identifiers, which can be used in conjunction with public records to uniquely identify the records. For example, even if the explicit identifiers are removed from the records, attributes such as the birth date and zip code can still be used to uniquely identify the individuals behind the underlying records. The idea in k-anonymity is to reduce the granularity of representation of the data in such a way that a given record cannot be distinguished from at least (k − 1) other records.
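The idea of reducing granularity can be sketched in Python. The records, the generalization rule (birth year plus a three-digit zip prefix) and the function names are hypothetical; practical k-anonymity algorithms search for a suitable generalization rather than fixing one in advance.

```python
from collections import Counter

# Hypothetical records with explicit identifiers already removed.
RECORDS = [
    {"birth_date": "1980-03-14", "zip": "765001", "diagnosis": "flu"},
    {"birth_date": "1980-11-02", "zip": "765017", "diagnosis": "asthma"},
    {"birth_date": "1975-06-30", "zip": "753004", "diagnosis": "flu"},
    {"birth_date": "1975-01-21", "zip": "753012", "diagnosis": "diabetes"},
]


def raw_key(record):
    """Pseudo-identifiers at full granularity."""
    return (record["birth_date"], record["zip"])


def generalized_key(record):
    """Coarsened pseudo-identifiers: birth year and a 3-digit zip prefix."""
    return (record["birth_date"][:4], record["zip"][:3] + "***")


def is_k_anonymous(records, k, key):
    """True if every pseudo-identifier combination occurs in at least k records."""
    counts = Counter(key(r) for r in records)
    return all(count >= k for count in counts.values())


RAW_IS_2_ANONYMOUS = is_k_anonymous(RECORDS, 2, raw_key)
GENERALIZED_IS_2_ANONYMOUS = is_k_anonymous(RECORDS, 2, generalized_key)
```

At full granularity every record is unique, so the data are not 2-anonymous. After generalization each record shares its pseudo-identifier values with one other record, which satisfies k = 2.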

V. APPLICATIONS

Data mining is now used in industry, visualization, music, science and engineering, and many other fields. This section discusses the most effective applications of data mining in these fields.

Fig.3 Purpose Of Data Mining

A. Industry

Data mining has been used extensively in the banking and financial markets. In the banking industry, data mining is heavily used to model and predict credit fraud, to evaluate risk, to perform trend analysis, and to analyze profitability, as well as to help with direct marketing campaigns. In the financial markets, neural networks have been used in stock-price forecasting, in option trading, in bond rating, in portfolio management, in commodity price prediction, in mergers and acquisitions, as well as in forecasting financial disasters.

Slim margins have pushed retailers into embracing data mining earlier than other industries. Retailers have seen improved decision-support processes lead directly to improved efficiency in inventory management and financial forecasting. Large retail chains and grocery stores store vast amounts of point-of-sale data that is information rich. One application of data mining in real estate is the AREAS Property Valuation product from HNC Software, which performs property valuation. Data mining has also been used extensively in the medical industry. For example, NeuroMedical Systems used neural networks to build a pap smear diagnostic aid. Vysis uses neural networks to perform protein analysis for drug development.

Fig.4 Customers Details In Bank

B. Science And Engineering

In recent years, data mining has been used widely in areas of science and engineering such as bioinformatics, genetics, medicine, education and electrical power engineering. In the study of human genetics, sequence mining helps address the important goal of understanding the mapping relationship between the inter-individual variation in human DNA sequences and the variability in disease susceptibility. In the area of electrical power engineering, data mining methods have been widely used for condition monitoring of high voltage electrical equipment. The purpose of condition monitoring is to obtain valuable information on the health status of the equipment's insulation. Data clustering methods such as the self-organizing map (SOM) have been applied to the vibration monitoring and analysis of transformer on-load tap-changers (OLTCs). Using vibration monitoring, it can be observed that each tap change operation generates a signal that contains information about the condition of the tap changer contacts and the drive mechanisms.

A further area of application in science and engineering is educational research, where data mining has been used to study the factors leading students to choose to engage in behaviors which reduce their learning, and to understand the factors influencing university student retention. A similar example of the social application of data mining is its use in expertise finding systems, whereby descriptors of human expertise are extracted, normalized and classified so as to facilitate the finding of experts, particularly in scientific and technical fields. In this way, data mining can facilitate institutional memory.

C. Visual

There is a large number of visualization techniques which can be used for visualizing the data. In addition to standard 2D/3D techniques such as x-y (x-y-z) plots, bar charts, line graphs, etc., there are a number of more sophisticated visualization techniques. The standard techniques are useful for data exploration but are limited to relatively small and low-dimensional data sets. In the last decade, a large number of novel information visualization techniques have been developed, allowing visualizations of multidimensional data sets without inherent two- or three-dimensional semantics.

D. Music

Data mining techniques, and in particular co-occurrence analysis, have been used to discover relevant similarities among music corpora (radio lists, CD databases) for the purpose of classifying music into genres in an objective manner.
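A minimal sketch of co-occurrence analysis over radio playlists follows. The playlists and artist names are placeholders, and a raw pair count is only the simplest possible similarity measure; the systems referred to above may use more refined statistics.

```python
from collections import Counter
from itertools import combinations

# Hypothetical radio playlists; artist names are placeholders.
PLAYLISTS = [
    ["artist_a", "artist_b", "artist_c"],
    ["artist_a", "artist_b"],
    ["artist_c", "artist_d"],
    ["artist_a", "artist_b", "artist_d"],
]


def co_occurrence_counts(playlists):
    """Count how often each pair of artists appears in the same playlist."""
    counts = Counter()
    for playlist in playlists:
        # Sort and de-duplicate so each unordered pair is counted once per playlist.
        for pair in combinations(sorted(set(playlist)), 2):
            counts[pair] += 1
    return counts


CO_OCCURRENCES = co_occurrence_counts(PLAYLISTS)
MOST_SIMILAR_PAIR, MOST_SIMILAR_COUNT = CO_OCCURRENCES.most_common(1)[0]
```

Artists that frequently share a playlist, here `artist_a` and `artist_b`, are treated as similar. Grouping such pairs is one way to arrive at genre-like clusters.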

VI. CONCLUSION

In this paper we have given an overview of data mining and its applications. We have not described all aspects of data mining, only its process, its uses, and some of its applications in different fields. Data mining is a tool that is used by governments and corporations to predict and establish trends with specific purposes in mind. Corporations use data mining to examine buying patterns and predict future trends.

REFERENCES

[1] Data Mining: Concepts and Techniques by J. Han & Micheline Kamber

[2] Introduction to Data Mining by M. Steinbach & V. Kumar

[3] Discovering Data Mining: From Concept to Implementation by P. Hadjinian, J. Verhees & R. Stadler

[4] Wikipedia, The Free Encyclopedia - www.wikipedia.com

[5] Britannica, The Encyclopedia - www.britanica.com

[6] Data Mining and Privacy - www.thearling.com

[7] Data Mining Software - www.emanio.com
