data mining and data warehousing (makaut)

Data Mining & Data Ware Housing(PGCSE302 C)

The New unified Syllabus for both CSE & IT followed from the session 2013-14

by

Maulana Abul Kalam Azad University of Technology, West Bengal(formerly West Bengal University of Technology)

Presented by

Dr. Bikramjit Sarkar
Assistant Professor

Dept. of Computer Science and Engineering

Dr. B. C. Roy Engineering College
Jemua Road, Fuljhore, Durgapur – 713206 (W. B.)

[www.bcrec.ac.in]

Prescribed Curriculum (MAKAUT)

Data Mining & Data Ware Housing (PGCS302C): 36L

UNIT-I: 4 L
Introduction: Basics of Data Mining. Data Mining Functionalities, Classification of Data Mining Systems, Data Mining Issues, Data Mining Goals. Stages of the Data Mining Process.

UNIT-II: 5 L
Data Warehouse and OLAP: Data Warehouse concepts, Data Warehouse Architecture, OLAP technology, DBMS, OLTP vs. Data Warehouse Environment, Multidimensional data model, Data marts.

UNIT-III: 6 L
Data Mining Techniques: Statistics, Similarity Measures, Decision Trees, Neural Networks, Genetic Algorithms.

UNIT-IV: 9 L
Mining Association Rules: Basic Algorithms, Parallel and Distributed algorithms, Comparative study, Incremental Rules, Advanced Association Rule Technique, Apriori Algorithm, Partition Algorithm, Dynamic Itemset Counting Algorithm, FP-tree growth Algorithm, Border Algorithm.

UNIT-V: 5 L
Clustering Techniques: Partitioning Algorithms: K-means Algorithm, CLARA, CLARANS; Hierarchical Algorithms: DBSCAN, ROCK.

UNIT-VI: 4 L
Classification Techniques: Statistical-based, Distance-based, Decision Tree-based.

UNIT-VII: 3 L
Applications and Trends in Data Mining: Applications, Advanced Techniques - Web Mining, Web Content Mining, Structure Mining.

- - -

UNIT-I

Data, Information, Knowledge, Understanding, Wisdom

Data

Data are any facts, numbers, or text that can be processed by a computer. Today, organizations are accumulating vast and growing amounts of data in different formats and different databases. This includes:

(A) Operational or transactional data, such as sales, cost, inventory, payroll, and accounting

(B) Non-operational data, such as industry sales, forecast data, and macro-economic data

(C) Meta data - data about the data itself, such as logical database design or data dictionary definitions


Information

The patterns, associations, or relationships among all this data can provide information. For example, analysis of retail point of sale transaction data can yield information on which products are selling and when.


Knowledge

Information can be converted into knowledge about historical patterns and future trends. For example, summary information on retail supermarket sales can be analyzed in light of promotional efforts to provide knowledge of consumer buying behavior. Thus, a manufacturer or retailer could determine which items are most susceptible to promotional efforts.


Understanding

Understanding is an interpolative and probabilistic process. It is cognitive and analytical. It is the process by which I can take knowledge and synthesize new knowledge from the previously held knowledge. The difference between understanding and knowledge is the difference between "learning" and "memorizing". People who have understanding can undertake useful actions because they can synthesize new knowledge, or in some cases, at least new information, from what is previously known (and understood).


That is, understanding can build upon currently held information, knowledge and understanding itself. In computer parlance, AI systems possess understanding in the sense that they are able to synthesize new knowledge from previously stored information and knowledge.



Wisdom

Wisdom is an extrapolative and non-deterministic, non-probabilistic process. It calls upon all the previous levels of consciousness, and specifically upon special types of human programming (moral, ethical codes, etc.). It beckons to give us understanding about which there has previously been no understanding, and in doing so, goes far beyond understanding itself. It is the essence of philosophical probing. Unlike the previous four levels, it asks questions to which there is no (easily-achievable) answer, and in some cases, to which there can be no humanly-known answer, period. Wisdom is therefore the process by which we also discern, or judge, between right and wrong, good and bad.



Computers do not have, and will never have the ability to possess wisdom. Wisdom is a uniquely human state, or as I see it, wisdom requires one to have a soul, for it resides as much in the heart as in the mind. And a soul is something machines will never possess (or perhaps I should reword that to say, a soul is something that, in general, will never possess a machine).


The following diagram represents the transitions from data, to information, to knowledge, and finally to wisdom:

It is understanding that supports the transition from each stage to the next. Understanding is not a separate level of its own.

Concepts of Data Mining

Generally, data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information - information that can be used to increase revenue, cut costs, or both.

Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases.

Continuous innovations in computer processing power, disk storage, and statistical software are dramatically increasing the accuracy of analysis while driving down the cost.

Data mining is a technology that uses data analysis tools with sophisticated algorithms to search for useful information in large volumes of data.

Data mining is also defined as a process of automatically discovering useful information in massive data repositories.


Data mining is the practice of automatically searching large stores of data to discover patterns and trends that go beyond simple analysis. Data mining uses sophisticated mathematical algorithms to segment the data and evaluate the probability of future events. Data mining is also known as Knowledge Discovery in Data (KDD).

Data mining can answer questions that cannot be addressed through simple query and reporting techniques.

The key properties of data mining

• Automatic discovery of patterns
• Prediction of likely outcomes
• Creation of actionable information
• Focus on large data sets and databases

Automatic Discovery

Data mining is accomplished by building models. A model uses an algorithm to act on a set of data. The notion of automatic discovery refers to the execution of data mining models.

Data mining models can be used to mine the data on which they are built, but most types of models are generalizable to new data. The process of applying a model to new data is known as scoring.
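The build-then-score idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the nearest-mean classifier, the function names `build_model` and `score`, and the training data are all hypothetical and are not taken from any particular data mining product.

```python
from statistics import mean

def build_model(rows):
    """Build a tiny nearest-mean model: one mean value per class label."""
    by_label = {}
    for value, label in rows:
        by_label.setdefault(label, []).append(value)
    return {label: mean(values) for label, values in by_label.items()}

def score(model, value):
    """Score a new value: return the label whose mean is closest to it."""
    return min(model, key=lambda label: abs(model[label] - value))

# Hypothetical training data: (years of education, income band)
training = [(10, "low"), (11, "low"), (12, "low"),
            (16, "high"), (18, "high"), (17, "high")]

model = build_model(training)       # the model is built on existing data
new_people = [9, 15, 20]            # new data the model has never seen
predictions = [score(model, years) for years in new_people]
print(predictions)
```

Building happens once on the data at hand; scoring reuses the resulting model on records that were not part of the build.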

Prediction

Many forms of data mining are predictive. For example, a model might predict income based on education and other demographic factors. Predictions have an associated probability (How likely is this prediction to be true?). Prediction probabilities are also known as confidence.

Some forms of predictive data mining generate rules, which are conditions that imply a given outcome. For example, a rule might specify that a person who has a bachelor's degree and lives in a certain neighborhood is likely to have an income greater than the regional average. Rules have an associated support.


Actionable Information

Data mining can derive actionable information from large volumes of data. For example, a town planner might use a model that predicts income based on demographics to develop a plan for low-income housing. A car leasing agency might use a model that identifies customer segments to design a promotion targeting high-value customers.



Grouping

Other forms of data mining identify natural groupings in the data. For example, a model might identify the segment of the population that has an income within a specified range, that has a good driving record, and that leases a new car on a yearly basis.

Data Mining and Knowledge Discovery

Data mining is an integral part of Knowledge Discovery in Databases (KDD), which is the overall process of converting raw data into useful information, as shown in the figure below. This process consists of a series of transformation steps, from pre-processing of the data to post-processing of data mining results.

Knowledge Discovery in Databases

The following diagram represents the process of Knowledge Discovery in databases:

The Knowledge Discovery in Databases process comprises a few steps leading from raw data collections to some form of new knowledge. The iterative process consists of the following steps:

• Data cleaning: also known as data cleansing, this is the phase in which noisy and irrelevant data are removed from the collection.

• Data integration: at this stage, multiple data sources, often heterogeneous, may be combined into a common source.

• Data selection: at this step, the data relevant to the analysis is decided on and retrieved from the data collection.

• Data transformation: also known as data consolidation, this is the phase in which the selected data is transformed into forms appropriate for the mining procedure.

• Data mining: this is the crucial step in which clever techniques are applied to extract potentially useful patterns.

• Pattern evaluation: in this step, the truly interesting patterns representing knowledge are identified based on given measures.

• Knowledge representation: this is the final phase, in which the discovered knowledge is visually represented to the user. This essential step uses visualization techniques to help users understand and interpret the data mining results.


It is common to combine some of these steps together. For instance, data cleaning and data integration can be performed together as a pre-processing phase to generate a data warehouse. Data selection and data transformation can also be combined where the consolidation of the data is the result of the selection, or, as for the case of data warehouses, the selection is done on transformed data.

The KDD is an iterative process. Once the discovered knowledge is presented to the user, the evaluation measures can be enhanced, the mining can be further refined, new data can be selected or further transformed, or new data sources can be integrated, in order to get different, more appropriate results.
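The KDD steps described above can be sketched as a chain of small functions. This is a toy illustration, not a reference implementation: the record layout, the function names, and the choice of "total quantity per item" as the mined pattern are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical raw records from two sources; None marks a missing value.
store_a = [
    {"customer": "c1", "item": "milk", "qty": 2},
    {"customer": "c2", "item": None, "qty": 1},   # incomplete record
]
store_b = [
    {"customer": "c3", "item": "Milk", "qty": 1},
    {"customer": "c4", "item": "bread", "qty": 3},
    {"customer": "c5", "item": "eggs", "qty": 1},
]

def clean(records):
    """Data cleaning: drop records with missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(*sources):
    """Data integration: combine several sources into one collection."""
    return [r for source in sources for r in source]

def select(records, fields):
    """Data selection: keep only the fields relevant to the analysis."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records):
    """Data transformation: normalize item names to a common form."""
    return [{**r, "item": r["item"].lower()} for r in records]

def mine(records):
    """Data mining: total quantity per item (a deliberately simple pattern)."""
    totals = Counter()
    for r in records:
        totals[r["item"]] += r["qty"]
    return totals

def evaluate(patterns, min_qty):
    """Pattern evaluation: keep only patterns that meet the given measure."""
    return {item: qty for item, qty in patterns.items() if qty >= min_qty}

def present(patterns):
    """Knowledge representation: render the result for the user."""
    return [f"{item}: {qty}" for item, qty in sorted(patterns.items())]

records = integrate(clean(store_a), clean(store_b))
records = transform(select(records, ["item", "qty"]))
report = present(evaluate(mine(records), min_qty=2))
print(report)
```

Re-running the chain with a different `min_qty`, or with an extra source passed to `integrate`, corresponds to the refinement loop that makes the process iterative.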

Motivating Challenges

The following challenges motivated the development of data mining:

• Scalability
• High dimensionality
• Heterogeneous and complex data
• Data ownership and distribution
• Non-traditional analysis

Scalability

Scaling and performance are often considered together in Data Mining. The problem of scalability in Data Mining is not only how to process such large sets of data, but how to do it within a useful timeframe. Many of the issues of scalability in Data Mining and DBMS are similar to scaling performance issues for Data Management in general.


High Dimensionality

The variable in 1-D data is usually time. An example is the log of interrupts in a processor. 2D data can often be found in statistics like the number of financial transactions in a certain period of time. 3-D data can be positions in 3-D space or points on a surface whereas time (the 3rd dimension) varies. High-dimensional data contains all those sets of data that have more than three considered variables. Examples are locations in space that vary with time (here time is the fourth dimension) or any other combination of more than three variables, e.g. product - channel - territory - period - customer’s income.


Heterogeneous and complex data

Heterogeneous data means that a data set contains attributes of different types, whereas traditional data analysis methods deal with data sets whose attributes are all of the same type. Complex data is data with richer structure and relationships among its elements, for example web pages with hyperlinks, DNA data with sequential and 3-D structure, and climate data (temperature, pressure, mist, humidity, time, location).


Data ownership and distribution

Sometimes the data needed for an analysis is not stored in one location or owned by one organization. Instead, the data is geographically distributed among multiple entities. This requires the development of distributed data mining techniques.


Non-traditional analysis

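Traditional statistical analysis is based on a hypothesize-and-test paradigm: a hypothesis is proposed, an experiment is designed to gather data, and the data is then analyzed with respect to the hypothesis. With the huge volumes of data now held in data repositories, analysis requires generating and evaluating thousands of hypotheses, which is impractical to do by hand and motivates automating the process.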

Data Mining Functionalities

Data mining functionalities are used to specify the kind of patterns to be found in data mining tasks. In general, data mining tasks can be classified into two categories:

• Descriptive tasks: Here the objective is to derive patterns that summarize the underlying relationships in data. They find human-interpretable patterns that describe the data.

• Predictive tasks: The objective of these tasks is to predict the value of a particular attribute based on the values of other attributes. They use some variables (independent / explanatory variables) to predict unknown or future values of other variables (dependent / target variables).

There are four core tasks in data mining:

• Predictive modelling
• Association analysis
• Clustering analysis
• Anomaly detection


Predictive modelling

Predictive modelling finds missing or unavailable data values rather than class labels; this is referred to as prediction. Although prediction may refer to both data value prediction and class label prediction, it is usually confined to data value prediction and is thus distinct from classification. Prediction also encompasses the identification of distribution trends based on the available data.


Association analysis

It is the discovery of association rules showing attribute-value conditions that occur frequently together in a given set of data. For example, a data mining system may find association rules like

major(X, “computing science”) ⇒ owns(X, “personal computer”) [support = 12%, confidence = 98%]

where X is a variable representing a student. The rule indicates that of the students under study, 12% (support) major in computing science and own a personal computer. There is a 98% probability (confidence, or certainty) that a student in this group owns a personal computer.
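To make the two measures concrete, the short Python sketch below computes support and confidence for a rule of this shape. The student records and the `major:`/`owns:` tags are invented for illustration, so the resulting figures differ from the 12% and 98% quoted above.

```python
def support(records, antecedent, consequent):
    """Fraction of all records that satisfy both sides of the rule."""
    both = sum(1 for r in records if antecedent <= r and consequent <= r)
    return both / len(records)

def confidence(records, antecedent, consequent):
    """Among records satisfying the antecedent, the fraction that also satisfy the consequent."""
    matching = [r for r in records if antecedent <= r]
    if not matching:
        return 0.0
    return sum(1 for r in matching if consequent <= r) / len(matching)

# Hypothetical data: 10 students, each described by a set of attribute tags.
students = (
    [{"major:cs", "owns:pc"}] * 3
    + [{"major:cs"}]
    + [{"major:math", "owns:pc"}] * 2
    + [{"major:math"}] * 4
)

rule_support = support(students, {"major:cs"}, {"owns:pc"})
rule_confidence = confidence(students, {"major:cs"}, {"owns:pc"})
print(rule_support, rule_confidence)
```

Here 3 of the 10 students major in computing science and own a computer (support 0.3), and 3 of the 4 computing science majors own one (confidence 0.75).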


Clustering analysis

Clustering analyzes data objects without consulting a known class label. The objects are clustered or grouped based on the principle of maximizing the intra-class similarity and minimizing the interclass similarity. Each cluster that is formed can be viewed as a class of objects.


Anomaly detection

It is the task of identifying observations whose characteristics are significantly different from the rest of the data. Such observations are called anomalies or outliers. This is useful in fraud detection and network intrusions.
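One simple way to flag such observations is a z-score test, sketched below in Python. This is just one of many possible techniques and is an assumption of this example rather than a method named in the slides; the threshold of 2 standard deviations and the transaction amounts are likewise illustrative.

```python
from statistics import mean, pstdev

def find_outliers(values, threshold=2.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:            # all values identical: nothing can stand out
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical transaction amounts; one is far from the rest.
amounts = [10, 11, 9, 10, 12, 11, 10, 50]
outliers = find_outliers(amounts)
print(outliers)
```

Because a large outlier also inflates the mean and standard deviation, this test can miss anomalies in small or heavily contaminated samples.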

Classification of Data Mining systems

A data mining system can be classified according to the following criteria:

• Database Technology
• Statistics
• Machine Learning
• Information Science
• Visualization
• Other Disciplines

Apart from the previous criteria, a data mining system can also be classified based on the kind of

• Databases mined
• Knowledge mined
• Techniques utilized
• Applications adapted

Classification Based on the Databases Mined

Database systems can be classified according to different criteria such as data models, types of data, etc., and the data mining system can be classified accordingly.

For example, if we classify a database according to the data model, then we may have a relational, transactional, object-relational, or data warehouse mining system.



Classification Based on the kind of Knowledge Mined

It means the data mining system is classified on the basis of functionalities such as

• Characterization
• Discrimination
• Association and Correlation Analysis
• Classification
• Prediction
• Outlier Analysis
• Evolution Analysis


Classification Based on the Techniques Utilized

We can classify a data mining system according to the kind of techniques used. We can describe these techniques according to the degree of user interaction involved or the methods of analysis employed.


Classification Based on the Applications Adapted

We can classify a data mining system according to the applications adapted. These applications are as follows:

• Finance
• Telecommunications
• DNA
• Stock Markets
• E-mail

Integration of Data Mining systems

If a data mining system is not integrated with a database or a data warehouse system, then there will be no system to communicate with. This scheme is known as the non-coupling scheme. In this scheme, the main focus is on data mining design and on developing efficient and effective algorithms for mining the available data sets. Following are the Integration Schemes:

• No Coupling
• Loose Coupling
• Semi-tight Coupling
• Tight Coupling

No Coupling

In this scheme, the data mining system does not utilize any of the database or data warehouse functions. It fetches the data from a particular source and processes that data using some data mining algorithms. The data mining result is stored in another file.


Loose Coupling

In this scheme, the data mining system may use some of the functions of the database and data warehouse system. It fetches the data from the data repository managed by these systems and performs data mining on that data. It then stores the mining result either in a file or in a designated place in a database or in a data warehouse.


Semi-tight Coupling

In this scheme, the data mining system is linked with a database or a data warehouse system and in addition to that, efficient implementations of a few data mining primitives can be provided in the database.


Tight coupling

In this coupling scheme, the data mining system is smoothly integrated into the database or data warehouse system. The data mining subsystem is treated as one functional component of an information system.

Data Mining issues

Data mining is not an easy task, as the algorithms used can get very complex and data is not always available at one place. It needs to be integrated from various heterogeneous data sources. These factors also create some issues. The major issues are regarding

• Mining Methodology and User Interaction
• Performance Issues
• Diverse Data Types Issues

Mining Methodology and User Interaction Issues

It refers to the following kinds of issues:

• Mining different kinds of knowledge in databases: Different users may be interested in different kinds of knowledge. Therefore it is necessary for data mining to cover a broad range of knowledge discovery tasks.

• Interactive mining of knowledge at multiple levels of abstraction: The data mining process needs to be interactive because it allows users to focus the search for patterns, providing and refining data mining requests based on the returned results.


• Incorporation of background knowledge: Background knowledge can be used to guide the discovery process and to express the discovered patterns, not only in concise terms but at multiple levels of abstraction.

• Data mining query languages and ad hoc data mining: A data mining query language that allows the user to describe ad hoc mining tasks should be integrated with a data warehouse query language and optimized for efficient and flexible data mining.


• Presentation and visualization of data mining results: Once patterns are discovered, they need to be expressed in high-level languages and visual representations. These representations should be easily understandable.

• Handling noisy or incomplete data: Data cleaning methods are required to handle noise and incomplete objects while mining the data regularities. Without data cleaning methods, the accuracy of the discovered patterns will be poor.

• Pattern evaluation: Many discovered patterns are uninteresting because they either represent common knowledge or lack novelty. Interestingness measures are therefore needed to identify the patterns worth keeping.


Performance Issues

There can be performance-related issues such as follows:

• Efficiency and scalability of data mining algorithms: In order to effectively extract information from the huge amounts of data in databases, data mining algorithms must be efficient and scalable.

• Parallel, distributed and incremental mining algorithms: Factors such as the huge size of databases, the wide distribution of data, and the complexity of data mining methods motivate the development of parallel and distributed data mining algorithms. These algorithms divide the data into partitions, which are processed in parallel; the results from the partitions are then merged. Incremental algorithms incorporate database updates without mining the data again from scratch.
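The partition-then-merge idea, and the incremental update, can be sketched with item counting as the mining task. This is a minimal Python illustration under stated assumptions: the function names and transactions are hypothetical, and a thread pool merely stands in for genuinely parallel or distributed workers.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_items(partition):
    """Count, for one partition, how many transactions contain each item."""
    counts = Counter()
    for transaction in partition:
        counts.update(set(transaction))
    return counts

def partitioned_counts(transactions, n_parts):
    """Split the data into partitions, count each one concurrently, then merge."""
    parts = [transactions[i::n_parts] for i in range(n_parts)]
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partials = list(pool.map(count_items, parts))
    total = Counter()
    for partial in partials:
        total.update(partial)      # merge step: add the per-partition counts
    return total

def incremental_update(counts, new_transactions):
    """Fold newly arrived transactions into existing counts without recounting old data."""
    updated = Counter(counts)
    updated.update(count_items(new_transactions))
    return updated

# Hypothetical market-basket transactions.
transactions = [["milk", "bread"], ["milk"], ["bread", "eggs"],
                ["milk", "eggs"], ["bread"]]
counts = partitioned_counts(transactions, n_parts=2)

new_batch = [["eggs"], ["milk", "eggs"]]
updated = incremental_update(counts, new_batch)
print(counts, updated)
```

Counting merges cleanly because the per-partition results simply add up; mining tasks whose partial results do not combine this easily need more elaborate merge logic.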


Diverse Data Types Issues

Diverse Data Types Issues may be as follows:

• Handling of relational and complex types of data: The database may contain complex data objects, multimedia data objects, spatial data, temporal data, etc. It is not possible for one system to mine all these kinds of data.

• Mining information from heterogeneous databases and global information systems: The data is available at different data sources on a LAN or WAN. These data sources may be structured, semi-structured, or unstructured. Therefore mining the knowledge from them adds challenges to data mining.

Data Mining goals

Data Mining is an analytic process designed to explore data (usually large amounts of data - typically business or market related - also known as "big data") in search of consistent patterns and/or systematic relationships between variables, and then to validate the findings by applying the detected patterns to new subsets of data. The ultimate goal of data mining is prediction - and predictive data mining is the most common type of data mining and one that has the most direct business applications.

Stages of the Data Mining Process

The process of data mining consists of three stages:

• The initial exploration
• Model building or pattern identification with validation / verification
• Deployment (i.e., the application of the model to new data in order to generate predictions)

Initial Exploration

This stage usually starts with data preparation which may involve cleaning data, data transformations, selecting subsets of records and - in case of data sets with large numbers of variables ("fields") - performing some preliminary feature selection operations to bring the number of variables to a manageable range (depending on the statistical methods which are being considered).



Then, depending on the nature of the analytic problem, this first stage of the process of data mining may involve anywhere between a simple choice of straightforward predictors for a regression model, to elaborate exploratory analyses using a wide variety of graphical and statistical methods in order to identify the most relevant variables and determine the complexity and/or the general nature of models that can be taken into account in the next stage.


Model building and validation

This stage involves considering various models and choosing the best one based on their predictive performance (i.e., explaining the variability in question and producing stable results across samples). This may sound like a simple operation, but in fact, it sometimes involves a very elaborate process. There are a variety of techniques developed to achieve that goal - many of which are based on so-called "competitive evaluation of models," that is, applying different models to the same data set and then comparing their performance to choose the best. These techniques - which are often considered the core of predictive data mining - include: Bagging (Voting, Averaging), Boosting, Stacking (Stacked Generalizations), and Meta-Learning.


Deployment

This final stage involves using the model selected as best in the previous stage and applying it to new data in order to generate predictions or estimates of the expected outcome.
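The last two stages can be sketched as follows: two candidate models are applied to the same training data, compared on held-out data, and the better one is then deployed on new inputs. This is a plain holdout comparison, chosen here for brevity; it is not an implementation of bagging, boosting, or stacking, and the data and function names are hypothetical.

```python
from statistics import mean

def fit_mean(xs, ys):
    """Baseline model: always predict the mean of the training targets."""
    m = mean(ys)
    return lambda x: m

def fit_line(xs, ys):
    """Simple least-squares line through the training points."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return lambda x: intercept + slope * x

def mse(model, xs, ys):
    """Mean squared error of a model on a data set."""
    return mean((model(x) - y) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, split into a training part and a held-out part.
train_x, train_y = [1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8]
hold_x, hold_y = [5, 6], [10.1, 11.9]

# Model building and validation: fit each candidate, compare on held-out data.
candidates = {"mean": fit_mean(train_x, train_y),
              "line": fit_line(train_x, train_y)}
scores = {name: mse(model, hold_x, hold_y) for name, model in candidates.items()}
best_name = min(scores, key=scores.get)

# Deployment: apply the selected model to new data.
best_model = candidates[best_name]
predictions = [best_model(x) for x in [7]]
print(best_name, predictions)
```

Comparing on held-out data rather than on the training data is what checks that the chosen model gives stable results across samples.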

UNIT-II

Concepts of Data Warehousing

A data warehouse is constructed by integrating data from multiple heterogeneous sources. It supports analytical reporting, structured and / or ad hoc queries, and decision making. Data warehousing involves data cleaning, data integration, and data consolidation.

Data warehousing is the process of constructing and using a data warehouse. Data warehousing is defined as a process of centralized data management and retrieval.

Data Warehouse Features

The key features of a data warehouse are discussed below:

• Subject Oriented: A data warehouse is subject oriented because it provides information around a subject rather than the organization's ongoing operations. These subjects can be product, customers, suppliers, sales, revenue, etc. A data warehouse does not focus on the ongoing operations, rather it focuses on modelling and analysis of data for decision making.

• Integrated: A data warehouse is constructed by integrating data from heterogeneous sources such as relational databases, flat files, etc. This integration enhances the effective analysis of data.


• Time Variant: The data collected in a data warehouse is identified with a particular time period. The data in a data warehouse provides information from the historical point of view.

• Non-volatile: Non-volatile means the previous data is not erased when new data is added to it. A data warehouse is kept separate from the operational database, and therefore frequent changes in the operational database are not reflected in the data warehouse.

Note: A data warehouse does not require transaction processing, recovery, and concurrency controls, because it is physically stored separately from the operational database.

Data Warehouse Applications

As discussed before, a data warehouse helps business executives to organize, analyze, and use their data for decision making. A data warehouse serves as an integral part of a plan-execute-assess "closed-loop" feedback system for enterprise management. Data warehouses are widely used in the following fields:

• Financial services
• Banking services
• Consumer goods
• Retail sectors
• Controlled manufacturing

Types of Data Warehouses

Information processing, analytical processing, and data mining are the three types of data warehouse applications that are discussed below:

• Information Processing: A data warehouse allows the data stored in it to be processed. The data can be processed by means of querying, basic statistical analysis, and reporting using crosstabs, tables, charts, or graphs.

• Analytical Processing: A data warehouse supports analytical processing of the information stored in it. The data can be analyzed by means of basic OLAP operations, including slice-and-dice, drill down, drill up, and pivoting.

• Data Mining: Data mining supports knowledge discovery by finding hidden patterns and associations, constructing analytical models, and performing classification and prediction. These mining results can be presented using visualization tools.

Data Warehouse Architecture

Data warehouses normally adopt three-tier architecture:

• The bottom tier is a warehouse database server that is almost always a relational database system. Data from operational databases and from external sources are extracted using application program interfaces known as gateways. A gateway is supported by the underlying DBMS and allows client programs to generate SQL code to be executed at a server.

• The middle tier is an OLAP server that is typically implemented using a relational OLAP (ROLAP) model.

• The top tier is a client, which contains query and reporting tools, analysis tools and / or data mining tools.

Data Warehouse models

From the architecture point of view there are three data warehouse models:

• Enterprise Warehouse: An enterprise warehouse collects all of the information about subjects spanning the entire organization. It provides corporate-wide data integration, usually from one or more operational systems and from external information providers. It requires extensive business modelling and may take years to design and build.

• Data Mart: A data mart consists of a subset of corporate-wide data that is of value to a specific group of users. The scope is confined to specific selected subjects. The data contained in a data mart tend to be summarized.

• Virtual Warehouse: A virtual warehouse is a set of views over operational databases. For efficient query processing, only some of the possible summary views may be materialized. A virtual warehouse is easy to build but requires excess capacity on the operational database servers.
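The virtual warehouse idea can be illustrated with Python's built-in `sqlite3` module. The sketch below defines a summary view over a hypothetical operational `orders` table; because SQLite has no materialized views, a snapshot table created from the view stands in for a materialized summary. All table, view, and column names are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical operational table
    CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL);
    INSERT INTO orders (region, amount)
        VALUES ('East', 100.0), ('East', 50.0), ('West', 75.0);

    -- Virtual warehouse: a summary view computed on demand
    CREATE VIEW sales_by_region AS
        SELECT region, SUM(amount) AS total FROM orders GROUP BY region;

    -- Stand-in for a materialized view: a stored snapshot of the summary
    CREATE TABLE sales_by_region_snapshot AS
        SELECT * FROM sales_by_region;
""")

def totals(source):
    """Read the per-region totals from the view or the snapshot table."""
    query = f"SELECT region, total FROM {source} ORDER BY region"
    return conn.execute(query).fetchall()

# A new operational transaction arrives after the snapshot was taken.
conn.execute("INSERT INTO orders (region, amount) VALUES ('West', 25.0)")

live = totals("sales_by_region")              # recomputed from operational data
stored = totals("sales_by_region_snapshot")   # unchanged until refreshed
print(live, stored)
```

Every query against the plain view is recomputed against the operational table, which is why a virtual warehouse needs spare capacity on the operational servers; the snapshot answers quickly but goes stale until it is refreshed.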

OLAP technology

OLAP (online analytical processing) is computer processing that enables a user to easily and selectively extract and view data from different points of view.

For example, a user can request that data be analyzed to display a spreadsheet showing all of a company's beach ball products sold in Florida in the month of July, compare revenue figures with those for the same products in September, and then see a comparison of other product sales in Florida in the same time period.


OLAP data is stored in a multidimensional database. Whereas a relational database can be thought of as two-dimensional, a multidimensional database considers each data attribute (such as product, geographic sales region, and time period) as a separate "dimension."

OLAP software can locate the intersection of dimensions (all products sold in the Eastern region above a certain price during a certain time period) and display them. Attributes such as time periods can be broken down into sub-attributes.
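A multidimensional store and the basic operations on it can be sketched with a plain Python dictionary keyed by dimension values. This is a toy model, not how an OLAP server is implemented; the sales figures and the helper names `dice`, `slice_`, and `roll_up` are assumptions for illustration.

```python
from collections import defaultdict

DIMENSIONS = ("product", "region", "month")

# Hypothetical cube: (product, region, month) -> units sold
cube = {
    ("beach ball", "Florida", "July"): 120,
    ("beach ball", "Florida", "September"): 40,
    ("beach ball", "Texas", "July"): 70,
    ("sunscreen", "Florida", "July"): 200,
    ("sunscreen", "Texas", "September"): 30,
}

def dice(cube, **allowed):
    """Keep the cells whose coordinates fall within the allowed values per dimension."""
    index = {name: DIMENSIONS.index(name) for name in allowed}
    return {key: value for key, value in cube.items()
            if all(key[index[name]] in allowed[name] for name in allowed)}

def slice_(cube, dimension, value):
    """Fix a single dimension to one value."""
    return dice(cube, **{dimension: {value}})

def roll_up(cube, keep):
    """Aggregate away every dimension not listed in `keep`."""
    index = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(int)
    for key, value in cube.items():
        totals[tuple(key[i] for i in index)] += value
    return dict(totals)

# Intersection of three dimensions: beach balls sold in Florida in July.
cell = dice(cube, product={"beach ball"}, region={"Florida"}, month={"July"})
florida = slice_(cube, "region", "Florida")
by_region = roll_up(cube, ["region"])
print(cell, by_region)
```

Locating an intersection is a lookup on all dimensions at once, while summary views come from aggregating over the dimensions that are left out.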

OLAP technology – contd..

OLAP can be used for data mining or the discovery of previously undiscerned relationships between data items. An OLAP database does not need to be as large as a data warehouse, since not all transactional data is needed for trend analysis. Using Open Database Connectivity (ODBC), data can be imported from existing relational databases to create a multidimensional database for OLAP.

Data Warehouse vs. Operational Databases

A data warehouse is kept separate from operational databases for the following reasons:

• An operational database is constructed for well-known tasks and workloads such as searching particular records, indexing, etc. In contrast, data warehouse queries are often complex and they present a general form of data.

• Operational databases support concurrent processing of multiple transactions. Concurrency control and recovery mechanisms are required for operational databases to ensure robustness and consistency of the database.

Data Warehouse vs. Operational Databases – contd..

• An operational database query allows both read and modify operations, while an OLAP query needs only read-only access to stored data.

• An operational database maintains current data. On the other hand, a data warehouse maintains historical data.


Data Warehouse (OLAP) | Operational Database (OLTP)
It involves historical processing of information. | It involves day-to-day processing.
OLAP systems are used by knowledge workers such as executives, managers, and analysts. | OLTP systems are used by clerks, DBAs, or database professionals.
It is used to analyze the business. | It is used to run the business.
It focuses on Information out. | It focuses on Data in.
It is based on Star Schema, Snowflake Schema, and Fact Constellation Schema. | It is based on Entity Relationship Model.
It is subject oriented. | It is application oriented.
It contains historical data. | It contains current data.
It provides summarized and consolidated data. | It provides primitive and highly detailed data.
It provides summarized and multidimensional view of data. | It provides detailed and flat relational view of data.
The number of users is in hundreds. | The number of users is in thousands.
The number of records accessed is in millions. | The number of records accessed is in tens.
The database size is from 100 GB to 100 TB. | The database size is from 100 MB to 100 GB.
It is highly flexible. | It provides high performance.

UNIT-III

Data Mining techniques

Following is an overview of some of the most common data mining techniques in use today. The techniques have been divided into two broad categories:

• Classical Techniques: Statistics, Neighbourhoods and Clustering

• Next Generation Techniques: Trees, Networks and Rules

Each category describes a number of data mining algorithms at a high level and helps to show how each algorithm fits into the landscape of data mining techniques. Overall, six broad classes of data mining algorithms are covered.

Data Mining techniques – contd..

Classical Techniques

This category covers techniques that have been in use for decades; the next category covers techniques that have been widely used only since the early 1980s. The main techniques here are the ones used for the vast majority of existing business problems. There are certainly many others, as well as proprietary techniques from particular vendors, but in general the industry is converging on techniques that work consistently and are understandable and explainable.


Classical Techniques – contd..

Statistics

By strict definition, statistics or statistical techniques are not data mining. They were in use long before the term data mining was coined to apply to business applications. However, statistical techniques are driven by the data and are used to discover patterns and build predictive models. This is why it is important to have an idea of how statistical techniques work and how they can be applied.



Statistics – contd..

Prediction using Statistics

The term “prediction” is used here for a variety of types of analysis that may elsewhere be more precisely called regression. The looser term is used in order to simplify some of the concepts and to emphasize the common and most important aspects of predictive modelling. Nonetheless, regression is a powerful and commonly used tool in statistics.




Linear Regression

In statistics prediction is usually synonymous with regression of some form. There are a variety of different types of regression in statistics but the basic idea is that a model is created that maps values from predictors in such a way that the lowest error occurs in making a prediction. The simplest form of regression is simple linear regression that just contains one predictor and a prediction.




Linear Regression – contd..

The relationship between the two can be mapped onto a two-dimensional space, with the prediction values plotted along the Y axis and the predictor values along the X axis. The simple linear regression model can then be viewed as the line that minimizes the error between the actual prediction value and the point on the line (the prediction from the model).
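
A minimal sketch of fitting such a line follows. It assumes that "error" means the sum of squared vertical distances (ordinary least squares), which the text does not state explicitly; the function name `fit_line` and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    a = mean_y - b * mean_x
    return a, b

xs = [1.0, 2.0, 3.0, 4.0]      # predictor values (X axis)
ys = [3.0, 5.0, 7.0, 9.0]      # prediction values (Y axis), here y = 1 + 2x
a, b = fit_line(xs, ys)
predicted = a + b * 5.0        # prediction for a new predictor value
```

Because the sample points lie exactly on a line, the fit recovers the intercept 1 and slope 2; with noisy data the same formulas give the line with the smallest squared error.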




Linear Regression – contd..

Graphically this would look as it does in the figure below:



Nearest Neighbour

Clustering and the Nearest Neighbour prediction technique are among the oldest techniques used in data mining. Intuitively, clustering means that like records are grouped together. Nearest neighbour is a prediction technique that is quite similar to clustering. Its essence is that, in order to predict the prediction value for one record, one looks for records with similar predictor values in the historical database and uses the prediction value from the record that is “nearest” to the unclassified record.



Nearest Neighbour – contd..

A familiar analogy is estimating a person's income from the incomes of their neighbours. The nearest neighbour prediction algorithm works in very much the same way, except that “nearness” in a database may consist of a variety of factors, not just where the person lives. It may, for instance, be far more important to know which school someone attended and what degree they attained when predicting income. Nearest Neighbour techniques are easy to use and understand because they work in a way similar to the way that people think: by detecting closely matching examples.
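
The idea can be sketched in a few lines. The example below assumes Euclidean distance over two numeric predictors as the measure of "nearness"; the records, the predictor choice and the function name `nearest_neighbour_predict` are hypothetical.

```python
import math

# Hypothetical historical records: (years_of_education, age) -> income.
history = [
    ((12, 25), 30000),
    ((16, 30), 55000),
    ((18, 45), 90000),
]

def nearest_neighbour_predict(query, records):
    """Return the prediction value of the record closest to `query`."""
    closest = min(records, key=lambda rec: math.dist(rec[0], query))
    return closest[1]

# Predict income for an unclassified record.
income = nearest_neighbour_predict((17, 32), history)
```

In practice the predictors would usually be rescaled first, since an attribute with a large numeric range otherwise dominates the distance.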



Clustering

Clustering is basically a partition of the database so that each partition or group is similar according to some criteria or metric. Clustering according to similarity is a concept, which appears in many disciplines. If a measure of similarity is available there are a number of techniques for forming clusters. Membership of groups can be based on the level of similarity between members and from this the rules of membership can be defined. Another approach is to build set functions that measure some property of partitions i.e. groups or subsets as functions of some parameter of the partition. This latter approach achieves what is known as optimal partitioning.



Clustering – contd..

Hierarchical Clustering

The hierarchical clustering techniques create a hierarchy of clusters, from small to big. The main reason for building a hierarchy is that clustering is an unsupervised learning technique, and as such there is no absolutely correct answer. Depending on the particular application of the clustering, fewer or greater numbers of clusters may be desired. With a hierarchy of clusters defined, it is possible to choose the number of clusters desired; at the extreme, it is possible to have as many clusters as there are records in the database.




Hierarchical Clustering – contd..

There are two main types of hierarchical clustering algorithms:

• Agglomerative: Agglomerative clustering techniques start with as many clusters as there are records where each cluster contains just one record. The clusters that are nearest to each other are merged together to form the next largest cluster. This merging is continued until a hierarchy of clusters is built with just a single cluster containing all the records at the top of the hierarchy.




Hierarchical Clustering – contd..

• Divisive: Divisive clustering techniques take the opposite approach from agglomerative techniques. These techniques start with all the records in one cluster, then try to split that cluster into smaller pieces, and in turn try to split those pieces into even smaller ones.
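
The agglomerative procedure can be sketched as follows. This is only an illustration under stated assumptions: records are single numbers, distance is the absolute difference, and the distance between two clusters is that of their closest members (single linkage). None of these choices is prescribed by the text, and the function name `agglomerate` is hypothetical.

```python
def agglomerate(points):
    """Merge the two nearest clusters until one remains; return each level."""
    clusters = [[p] for p in points]            # one record per cluster
    levels = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        levels.append([list(c) for c in clusters])
    return levels

levels = agglomerate([1.0, 2.0, 9.0, 10.0, 25.0])
```

Each entry of `levels` is one layer of the hierarchy, from as many clusters as records down to a single cluster, so the desired number of clusters can be chosen afterwards by picking a layer.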




Non-Hierarchical Clustering

There are two main non-hierarchical clustering techniques. Both of them are very fast to compute on the database but have some drawbacks.

• The first are the single pass methods. They derive their name from the fact that the database must only be passed through once in order to create the clusters (i.e. each record is only read from the database once).




Non-Hierarchical Clustering – contd..

• The other class of techniques is called reallocation methods. They get their name from the movement or “reallocation” of records from one cluster to another in order to create better clusters. The reallocation techniques do use multiple passes through the database but are relatively fast in comparison to the hierarchical techniques.


Next Generation Techniques

This category of techniques include the following:

• Trees

• Networks

• Rules


Next Generation Techniques – contd..

Decision Trees

A decision tree is a predictive model that, as its name implies, can be viewed as a tree. Specifically each branch of the tree is a classification question and the leaves of the tree are partitions of the dataset with their classification.

There are some interesting things about the tree:

• It divides up the data on each branch point without losing any of the data (the number of total records in a given parent node is equal to the sum of the records contained in its two children).



Decision Trees – contd..

• The number of churners and non-churners is conserved as you move up or down the tree.

• It is pretty easy to understand how the model is being built (in contrast to the models from neural networks or from standard statistics).

• It would also be pretty easy to use this model if you actually had to target those customers that are likely to churn with a targeted marketing offer.



Neural Networks

Neural networks are an approach to computing that involves developing mathematical structures with the ability to learn. The methods are the result of academic investigations to model nervous system learning. Neural networks have the remarkable ability to derive meaning from complicated or imprecise data. This can be used to extract patterns and detect trends that are too complex to be noticed by either humans or other computer techniques. A trained neural network can be thought of as an "expert" in the category of information it has been given to analyze. This expert can then be used to provide projections given new situations of interest and answer "what if" questions.







The structure of a neural network is shown in figure below:




In the figure, the bottom layer represents the input layer, in this case with 5 inputs labelled X1 through X5. In the middle, there is the hidden layer, with a variable number of nodes. The hidden layer performs much of the work of the network. The output layer in this case has two nodes, Z1 and Z2, representing output values determined from the inputs.
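
To make the layered structure concrete, the sketch below pushes five inputs through one hidden layer to two outputs. The hidden layer size (3 nodes), the sigmoid activation and all weight values are assumptions made for illustration; the weights are arbitrary and untrained.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer(inputs, weights, biases):
    """One fully connected layer: weighted sum per node, then sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(node_weights, inputs)) + b)
        for node_weights, b in zip(weights, biases)
    ]

# Hypothetical, untrained weights: 5 inputs -> 3 hidden nodes -> 2 outputs.
hidden_w = [[0.1, -0.2, 0.3, 0.0, 0.5],
            [-0.4, 0.2, 0.1, 0.3, -0.1],
            [0.2, 0.2, -0.3, 0.1, 0.0]]
hidden_b = [0.0, 0.1, -0.1]
output_w = [[0.5, -0.6, 0.1],
            [-0.3, 0.8, 0.2]]
output_b = [0.0, 0.0]

x = [1.0, 0.5, -1.0, 2.0, 0.0]              # inputs X1 .. X5
hidden = layer(x, hidden_w, hidden_b)       # hidden layer activations
z1, z2 = layer(hidden, output_w, output_b)  # outputs Z1 and Z2
```

Training would consist of adjusting the weights so that Z1 and Z2 match known outcomes; that step is omitted here.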



Rule Induction

Rule induction is one of the major forms of data mining and is the most common form of knowledge discovery in unsupervised learning systems. Rule induction on a database can be a massive undertaking in which all possible patterns are systematically pulled out of the data and each is then given an accuracy and a significance that tell the user how strong the pattern is and how likely it is to occur again.



Rule Induction – contd..

In general these rules are relatively simple. For a market basket database of items scanned in a consumer market basket, for example, you might find interesting correlations such as:

• If bagels are purchased then cream cheese is purchased 90% of the time and this pattern occurs in 3% of all shopping baskets.

• If live plants are purchased from a hardware store then plant fertilizer is purchased 60% of the time and these two items are bought together in 6% of the shopping baskets.

UNIT-IV

Mining Association Rules

There are several efficient algorithms that cope with the popular and computationally expensive task of association rule mining. In brief, an association rule is an expression X => Y, where X and Y are sets of items. The meaning of such rules is quite intuitive: given a database D of transactions, where each transaction T ϵ D is a set of items, X => Y expresses that whenever a transaction T contains X, T probably contains Y also. The probability, or rule confidence, is defined as the percentage of transactions containing Y in addition to X with regard to the overall number of transactions containing X.
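
The confidence definition can be checked on a toy database. The sketch below also computes the share of all transactions that contain both X and Y, commonly called support (the "occurs in 3% of all shopping baskets" figure quoted for the bagels rule earlier). The transactions and function names are hypothetical.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, x, y):
    """Fraction of the transactions containing X that also contain Y."""
    return support(transactions, set(x) | set(y)) / support(transactions, x)

# Hypothetical market-basket database D.
D = [
    {"bagels", "cream cheese", "milk"},
    {"bagels", "cream cheese"},
    {"bagels", "juice"},
    {"milk", "juice"},
]
rule_support = support(D, {"bagels", "cream cheese"})
rule_confidence = confidence(D, {"bagels"}, {"cream cheese"})
```

Here bagels appear in three of the four baskets and cream cheese accompanies them in two, so the rule bagels => cream cheese has support 2/4 and confidence 2/3.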

Mining Association Rules – contd..

Below are the most common algorithms:

• BFS and Counting Occurrences

• BFS and TID-list Intersections

• DFS and Counting Occurrences

• DFS and TID-list Intersections


Distributed Algorithms

Most parallel or distributed association rule algorithms strive to parallelize either the data, known as data parallelism, or the candidates, referred to as task parallelism. With task parallelism, the candidates are partitioned and counted separately at each processor. Obviously, the partition algorithm would be easy to parallelize using the task parallelism approach.


Distributed Algorithms – contd..

Other dimensions in differentiating the parallel association rule algorithms are the load-balancing approach used and the architecture. The data parallelism algorithms have reduced communication costs compared with the task parallelism algorithms, because only the initial candidates (the set of items) and the local counts must be distributed at each iteration. With task parallelism, not only the candidates but also the local set of transactions must be broadcast to all other sites. However, the data parallelism algorithms require that memory at each processor be large enough to store all candidates at each scan (otherwise the performance will degrade considerably because I/O is required for both the database and the candidate set).



The task parallelism approaches can avoid this because only the subset of the candidates that are assigned to a processor during each scan must fit into memory. Since not all partitions of the candidates must be the same size, the task parallel algorithms can adapt to the amount of memory at each site. The only restriction is that the total size of all candidates be small enough to fit into the total size of memory in all processors.



The CDA Algorithm

One data parallelism algorithm is the count distribution algorithm (CDA). The database is divided into p partitions, one for each processor. Each processor counts the candidates for its data and then broadcasts its counts to all other processors. Each processor then determines the global counts. These then are used to determine the large item sets and to generate the candidates for the next scan.
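
One scan of CDA can be simulated in a single process as below. This is a sketch only: the "processors" are plain lists, the broadcast is modelled by summing the local counts, and the partitions, candidates and the name `count_distribution` are hypothetical.

```python
from collections import Counter

def count_distribution(partitions, candidates, min_count):
    """Simulate one CDA scan: local counts, exchange, then global counts."""
    # Each "processor" counts every candidate on its own partition only.
    local_counts = [
        Counter({c: sum(c <= t for t in part) for c in candidates})
        for part in partitions
    ]
    # Broadcasting is simulated by summing all local counts.
    global_counts = Counter()
    for counts in local_counts:
        global_counts.update(counts)
    # Candidates whose global count reaches the threshold are large.
    return {c for c in candidates if global_counts[c] >= min_count}

# Hypothetical database divided into p = 2 partitions.
partitions = [
    [frozenset("ab"), frozenset("abc")],
    [frozenset("ac"), frozenset("b")],
]
candidates = [frozenset("a"), frozenset("b"), frozenset("c"), frozenset("ab")]
large = count_distribution(partitions, candidates, min_count=3)
```

Note that every processor holds the full candidate list, which is the memory requirement of data parallelism mentioned above; only the counts travel between processors.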



The FDM Algorithm

The FDM (Fast Distributed Algorithm for Data Mining) algorithm, proposed in (Cheung et al. 1996) has the following distinguishing characteristics:

• Candidate set generation is Apriori-like. However, some interesting properties of locally and globally frequent item sets are used to generate a reduced set of candidates at each iteration, which reduces the number of messages exchanged between sites.



The FDM Algorithm – contd..

• After the candidate sets are generated, two types of reduction techniques are applied, namely a local reduction and a global reduction, to eliminate some candidate sets from each site.

• To determine whether a candidate set is frequent, the algorithm needs only O(n) messages for the exchange of support counts, where n is the number of sites in the distributed system. This is much less than a direct adaptation of Apriori, which would need O(n^2) messages for calculating the support counts.



Increasing the support factor also increases the performance of the algorithms. Good performance is also obtained when the support factor is low and the data set is large, provided the number of processors is increased.

The increase in processor number should be done relative to the dimension of the data set. Thus, for a relatively small data set, the large increase in processor number can lead to large sets of local candidates and a large number of messages, thus increasing the execution time of CDA and FDM algorithms.



The CDA algorithm has a simple synchronization scheme, using only one set of messages for every step, while the FDM algorithm uses two synchronizations and the same scheme as CDA.

The distributed mining algorithms can be used on distributed databases, as well as for mining large databases by partitioning them between sites and processing them in a distributed manner. The high flexibility, the scalability, the small cost/performance ratio and the connectivity of a distributed system make it an ideal platform for data mining.


Incremental Rules

With the increasing use of the record-based databases whose data is being continuously added, recent important applications have called for the need of incremental mining. In dynamic transaction databases, new transactions are appended and obsolete transactions are discarded as time advances. Several research works have developed feasible algorithms for deriving precise association rules efficiently and effectively in such dynamic databases.


Incremental Rules – contd..

The mining of association rules on transactional database is usually an offline process since it is costly to find the association rules in large databases. With usual market-basket applications, new transactions are generated and old transactions may be obsolete as time advances. As a result, incremental updating techniques should be developed for maintenance of the discovered association rules to avoid redoing mining on the whole updated database.


Apriori-Based Algorithms

Algorithm Apriori is an influential algorithm for mining association rules. It uses prior knowledge of frequent item set properties to help narrow the search space of required frequent item sets. Specifically, k-item sets are used to explore (k+1)-item sets during the level-wise process of frequent item set generation. The set of frequent 1-item sets (L1) is first found by scanning the whole dataset once. L1 is then used, by performing join and prune actions, to form the set of candidate 2-item sets (C2). After another data scan, the set of frequent 2-item sets (L2) is identified and extracted from C2. The whole process continues iteratively until no more candidate item sets can be formed from the previous Lk.
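
The level-wise loop can be sketched as follows. This is a simplified illustration rather than an optimized implementation: the join step unites any two frequent (k-1)-item sets whose union has k items instead of using the usual prefix-based join, the threshold is given as an absolute count, and the name `apriori` and the data are hypothetical.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return all frequent item sets (as frozensets) with their counts."""
    def count(cands):
        # One scan of the data per level.
        return {c: sum(c <= t for t in transactions) for c in cands}

    items = {frozenset([i]) for t in transactions for i in t}
    frequent = {c: n for c, n in count(items).items() if n >= min_count}  # L1
    result = dict(frequent)
    k = 2
    while frequent:
        prev = set(frequent)
        # Join: unite pairs of frequent (k-1)-item sets into k-item sets.
        joined = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent.
        cands = {c for c in joined
                 if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        frequent = {c: n for c, n in count(cands).items() if n >= min_count}
        result.update(frequent)
        k += 1
    return result

D = [frozenset(t) for t in (["a", "b", "c"], ["a", "b"], ["a", "c"],
                            ["b", "c"], ["a", "b", "c"])]
freq = apriori(D, min_count=3)
```

With a minimum count of 3, all single items and all pairs are frequent, while the candidate {a, b, c} is generated but rejected after its scan because it occurs in only two transactions.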


Apriori-Based Algorithms for Incremental Mining

The Apriori heuristic is an anti-monotone principle. Specifically, if an item set is not frequent in the database, none of its supersets can be frequent. Below are the algorithms belonging to this category that adopt a level-wise approach:

• Algorithm FUP (Fast UPdate)

• Algorithms FUP2 and FUP2H

• Algorithm UWEP (Update With Early Pruning)

• Algorithm Utilizing Negative Borders

• Algorithm DELI (Difference Estimation for Large Item sets)

• Algorithms MAAP (Maintaining Association rules with Apriori Property) and PELICAN


Partition-Based Algorithms

Several techniques have been developed in prior work to improve the efficiency of algorithm Apriori, e.g., hashing item set counts, transaction reduction, data sampling and data partitioning. Among these, data partitioning is of particular importance here, since the focus is on incremental mining, where bulks of transactions may be appended or discarded as time advances.


Partition-Based Algorithms for Incremental Mining

In contrast to the Apriori heuristic, the partition-based technique makes good use of partitioning of the whole transactional dataset. After the partitioning, if X is a frequent item set in database D, which is divided into n partitions p1, p2, ..., pn, then X must be a frequent item set in at least one of the n partitions. Consequently, algorithms belonging to this category work on each partition of data iteratively and gather the information obtained from the processing of each partition to generate the final (integrated) results.


Partition-Based Algorithms for Incremental Mining – contd..

Below are the algorithms belonging to this category:

• Algorithm SWF (Sliding-Window Filtering)

• Algorithms FI_SWF and CI_SWF


Pattern Growth Algorithms

The generation of frequent item sets in both the Apriori-based algorithms and the partition-based algorithms is in the style of candidate generate-and-test. No matter how the search space for candidate item sets is narrowed, in some cases a huge number of candidate item sets may still need to be generated. In addition, at least two database scans are required, and usually some extra scans are needed to avoid unreasonable computing overheads. These two problems are nontrivial and result from the use of the Apriori approach.


Pattern Growth Algorithms – contd..

To overcome these difficulties, tree structures that store projected information about large datasets have been used in some prior works. The algorithm TreeProjection constructs a lexicographical tree and has the whole database projected based on the frequent item sets mined so far. The transaction projection can limit the support counting to a relatively small space, and the lexicographical tree can facilitate the management of candidate item sets. These features of algorithm TreeProjection provide a great improvement in computing efficiency when mining association rules.


Pattern Growth Algorithms for Incremental Mining

Both the Apriori-based algorithms and the partition-based algorithms aim at the goal of reducing the number of scans on the entire dataset when updates occur. Generally speaking, the updated portions, i.e., ∆− and ∆+, could be scanned several times during the level wise generation of frequent item sets in works belonging to these two categories.

Below are the algorithms belonging to this category:

• Algorithms DB-tree and PotFp-tree (Potential Frequent Pattern)

• Algorithm FELINE (FrEquent/Large patterns mINing with CATS trEe)

UNIT-V

Clustering Techniques

A cluster is a collection of data objects, similar to one another within the same cluster. The objects of a particular cluster are dissimilar to those in other clusters. Below are the major clustering approaches:

• Partitioning algorithms: Construct various partitions and then evaluate them by some criterion

• Hierarchical algorithms: Create a hierarchical decomposition of the set of data (or objects) using some criterion

• Density-based algorithms: based on connectivity and density functions

• Model-based: A model is hypothesized for each of the clusters and the idea is to find the best fit of the data to that model

Clustering Techniques – contd..

Partitioning Algorithms: Basic Concept

Construct a partition of a database D of n objects into a set of k clusters. Given k, find a partition of k clusters that optimizes the chosen partitioning criterion.

• Global optimal: exhaustively enumerate all partitions

• Heuristic methods: k-means and k-medoids algorithms

• k-means (MacQueen’67): Each cluster is represented by the center of the cluster

• k-medoids or PAM (Partition around medoids) (Kaufman & Rousseeuw’87): Each cluster is represented by one of the objects in the cluster


Optimization problem

The goal is to optimize a score function. The most commonly used is the square error criterion:

$$E = \sum_{i=1}^{k} \sum_{p \in C_i} |p - m_i|^2$$


The K-Means Clustering Method

Given k, the k-means algorithm is implemented in 4 steps:

• Step 1: Partition objects into k nonempty subsets.

• Step 2: Compute seed points as the centroids of the clusters of the current partition. The centroid is the centre (mean point) of the cluster.

• Step 3: Assign each object to the cluster with the nearest seed point.

• Step 4: Go back to Step 2; stop when there are no more new assignments.
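
The four steps can be sketched for 2-D points as below. The round-robin initial partition, the squared Euclidean distance and the iteration cap are assumptions made for the example, and the sketch assumes no cluster becomes empty during the run (an empty cluster would have no mean point).

```python
def k_means(points, k, max_iter=100):
    """k-means on 2-D points, following the four steps above."""
    # Step 1: arbitrary initial partition (round-robin, an assumption).
    labels = [i % k for i in range(len(points))]
    for _ in range(max_iter):
        # Step 2: centroid (mean point) of each cluster.
        centroids = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            centroids.append((sum(x for x, _ in members) / len(members),
                              sum(y for _, y in members) / len(members)))
        # Step 3: assign each object to the nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: (p[0] - centroids[c][0]) ** 2
                                        + (p[1] - centroids[c][1]) ** 2)
            for p in points
        ]
        # Step 4: stop when no assignment changes, else repeat from Step 2.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels, centroids = k_means(points, k=2)
```

On this data the deliberately poor starting partition is corrected after one reassignment, and the second pass confirms that nothing changes.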


The K-Means Clustering Method – contd..

[Figure: four scatter-plot panels, each with both axes running from 0 to 10]



Strength

• Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n.

• Often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms

Weakness

• Applicable only when mean is defined, then what about categorical data?

• Need to specify k, the number of clusters, in advance

• Unable to handle noisy data and outliers

• Not suitable to discover clusters with non-convex shapes



• Find representative objects, called medoids, in clusters

• PAM (Partitioning Around Medoids, 1987)
  • starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering
  • PAM works effectively for small data sets, but does not scale well for large data sets

• CLARA (Kaufman & Rousseeuw, 1990)

• CLARANS (Ng & Han, 1994): Randomized sampling

• Focusing + spatial data structure (Ester et al., 1995)



PAM (Partitioning Around Medoids)

• PAM (Kaufman and Rousseeuw, 1987), built in Splus

• Use real object to represent the cluster
  • Step 1: Select k representative objects arbitrarily
  • Step 2: For each pair of non-selected object h and selected object i, calculate the total swapping cost TCih
  • Step 3: For each pair of i and h, if TCih < 0, i is replaced by h; then assign each non-selected object to the most similar representative object
  • Step 4: Repeat steps 2-3 until there is no change
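
A compact sketch of the PAM steps listed above follows. It assumes 1-D objects with distinct values, absolute difference as the distance, and the first k objects as the arbitrary initial medoids. TCih is computed directly as the change in total distance caused by the swap, which is one straightforward reading of the swapping cost rather than the only way to evaluate it; the function names are hypothetical.

```python
def total_cost(points, medoids):
    """Sum of distances from every object to its nearest medoid."""
    return sum(min(abs(p - m) for m in medoids) for p in points)

def pam(points, k):
    """PAM on distinct 1-D points with absolute difference as distance."""
    medoids = list(points[:k])                 # arbitrary initial medoids
    improved = True
    while improved:                            # repeat until there is no change
        improved = False
        for i in list(medoids):
            for h in points:
                if h in medoids:
                    continue
                candidate = [h if m == i else m for m in medoids]
                # TC_ih: total cost after swapping i for h minus cost before.
                tc_ih = total_cost(points, candidate) - total_cost(points, medoids)
                if tc_ih < 0:                  # the swap improves the clustering
                    medoids = candidate
                    improved = True
                    break
    clusters = {m: [] for m in medoids}
    for p in points:                           # assign to the most similar medoid
        clusters[min(medoids, key=lambda m: abs(p - m))].append(p)
    return sorted(medoids), clusters

medoids, clusters = pam([1, 2, 3, 10, 11, 12], k=2)
```

Every accepted swap strictly lowers the total distance, so the loop terminates; recomputing the full cost for each candidate swap is what makes PAM expensive on large data sets.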



PAM (Partitioning Around Medoids) – contd..

[Figure: four scatter-plot panels, each with both axes running from 0 to 10 and points labelled i, h, j and t, one panel per case of the cost term Cjih]

• Cjih = 0

• Cjih = d(j, h) - d(j, i)

• Cjih = d(j, t) - d(j, i)

• Cjih = d(j, h) - d(j, t)



CLARA (Clustering Large Applications)

• Built in statistical analysis packages, such as S+

• It draws multiple samples of the data set, applies PAM on each sample, and gives the best clustering as the output

• Strength:
  • deals with larger data sets than PAM

• Weakness:
  • Efficiency depends on the sample size
  • A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased



CLARANS (“Randomized” CLARA)

• CLARANS (A Clustering Algorithm based on Randomized Search) (Ng and Han’94)

• CLARANS draws sample of neighbors dynamically

• The clustering process can be presented as searching a graph where every node is a potential solution, that is, a set of k medoids

• If the local optimum is found, CLARANS starts with new randomly selected node in search for a new local optimum

• It is more efficient and scalable than both PAM and CLARA



Hierarchical Clustering

Hierarchical clustering uses a distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but it needs a termination condition.

UNIT-VI

Classification Techniques

In machine learning and statistics, classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known.

In the terminology of machine learning, classification is considered an instance of supervised learning, i.e. learning where a training set of correctly identified observations is available. The corresponding unsupervised procedure is known as clustering, and involves grouping data into categories based on some measure of inherent similarity or distance.

Classification Techniques – contd..

Statistical-based

Two main phases of work on classification can be identified within the statistical community. The first, “classical” phase concentrated on derivatives of Fisher’s early work on linear discrimination. The second, “modern” phase exploits more flexible classes of models, many of which attempt to provide an estimate of the joint distribution of the features within each class, which can in turn provide a classification rule.


Statistical-based – contd..

Statistical approaches are generally characterised by having an explicit underlying probability model, which provides a probability of being in each class rather than simply a classification. In addition, it is usually assumed that the techniques will be used by statisticians, and hence some human intervention is assumed with regard to variable selection and transformation, and overall structuring of the problem.


Distance-based

A typical distance-based classifier is Knn (K Nearest Neighbours). Knn calculates the proximity between a test instance and each of the training instances in order to select the k nearest neighbours of the test instance. Majority voting among these k nearest neighbours (training instances) is used to assign a class label to the test instance: it will be the class of the majority of the training instances in the k-nn set. The most commonly used proximity measures are Euclidean distance and cosine similarity: with instances described by the values of n attributes, proximity is computed between two instances, each of which is regarded as a vector in an n-dimensional space.
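
A minimal sketch of Knn with majority voting is given below. It uses Euclidean distance; the training instances, the class labels and the function name `knn_classify` are hypothetical.

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Majority vote among the k training instances nearest to `query`."""
    # Proximity is computed between the query and every training instance.
    by_distance = sorted(train, key=lambda inst: math.dist(inst[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training instances: (attribute vector, class label).
train = [
    ((1.0, 1.0), "low"), ((1.5, 2.0), "low"), ((2.0, 1.5), "low"),
    ((6.0, 6.0), "high"), ((6.5, 7.0), "high"),
]
label = knn_classify(train, (2.0, 2.0), k=3)
```

Sorting the whole training set for each query makes the cost of the distance calculations explicit: it grows with the number of training instances.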


Distance-based – contd..

These classifiers are simple and powerful but some of the well-known limitations of Knn are given below:

• If there are many training instances, then Knn requires many distance calculations as well.

• k-nn has the problem of model over-fitting. Model over-fitting is the situation in which the classifier relies too much on the training data for its predictions and is not able to generalize its model to new test data. Over-fitting is exemplified by the classification errors observed on the training set and on the test set respectively: the misclassification error on the training set continues to decrease whilst the error on test instances starts to increase again.


Decision Tree- based

A decision tree is a classifier expressed as a recursive partition of the instance space. The decision tree consists of nodes that form a rooted tree, meaning it is a directed tree with a node called “root” that has no incoming edges. All other nodes have exactly one incoming edge. A node with outgoing edges is called an internal or test node. All other nodes are called leaves (also known as terminal or decision nodes). In a decision tree, each internal node splits the instance space into two or more sub-spaces according to a certain discrete function of the input attributes values. In the simplest and most frequent case, each test considers a single attribute, such that the instance space is partitioned according to the attribute’s value. In the case of numeric attributes, the condition refers to a range.


Decision Tree- based – contd..

Each leaf is assigned to one class representing the most appropriate target value. Alternatively, the leaf may hold a probability vector indicating the probability of the target attribute having a certain value. Instances are classified by navigating them from the root of the tree down to a leaf, according to the outcome of the tests along the path.
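Classification by navigating from the root to a leaf can be sketched as follows. The Node class, the attribute names (outlook, humidity), the threshold of 70 and the class labels are all hypothetical, chosen only to show one nominal test and one numeric range test.

```python
class Node:
    """A leaf carries a label; an internal (test) node carries a test
    function that maps an instance to the key of one outgoing edge."""
    def __init__(self, label=None, test=None, children=None):
        self.label = label
        self.test = test
        self.children = children or {}

def classify(node, instance):
    """Follow the outcome of each test from the root down to a leaf."""
    while node.test is not None:
        node = node.children[node.test(instance)]
    return node.label

# Hypothetical tree: the root tests a nominal attribute, and the
# "sunny" branch tests a numeric attribute against a range.
tree = Node(
    test=lambda x: x["outlook"],
    children={
        "sunny": Node(
            test=lambda x: "low" if x["humidity"] <= 70 else "high",
            children={"low": Node(label="play"), "high": Node(label="stay in")},
        ),
        "overcast": Node(label="play"),
        "rainy": Node(label="stay in"),
    },
)

print(classify(tree, {"outlook": "sunny", "humidity": 65}))
print(classify(tree, {"outlook": "rainy", "humidity": 65}))
```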

Decision tree inducers are algorithms that automatically construct a decision tree from a given dataset. Typically the goal is to find the optimal decision tree by minimizing the generalization error. However, other target functions can also be defined, for instance, minimizing the number of nodes or minimizing the average depth.
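As a rough illustration of what an inducer does, the sketch below builds the smallest possible tree, a single test node with two leaves, by trying every attribute and threshold and keeping the split with the fewest training errors. Two assumptions should be noted: training error is used only as a stand-in for the generalization error, which cannot be measured directly, and the data (age, income) and the function name induce_stump are hypothetical. Practical inducers grow deeper trees and use criteria such as information gain.

```python
from collections import Counter

def majority(labels):
    """Most frequent label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]

def induce_stump(rows, labels):
    """Return (errors, attribute_index, threshold, left_label, right_label)
    for the test 'row[attribute] <= threshold' with the fewest training errors."""
    best = None
    for attr in range(len(rows[0])):
        for threshold in sorted({row[attr] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[attr] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[attr] > threshold]
            if not left or not right:
                continue  # a split must send instances to both leaves
            left_label, right_label = majority(left), majority(right)
            errors = (sum(y != left_label for y in left)
                      + sum(y != right_label for y in right))
            if best is None or errors < best[0]:
                best = (errors, attr, threshold, left_label, right_label)
    return best

# Hypothetical data: (age, income in thousands) and a yes/no class.
rows = [(25, 20), (40, 25), (35, 30), (30, 60), (50, 70), (45, 80)]
labels = ["no", "no", "no", "yes", "yes", "yes"]

print(induce_stump(rows, labels))
```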

UNIT-VII

Applications and Trends in Data Mining

Data mining is an interdisciplinary field with wide and diverse applications. There exist nontrivial gaps between data mining principles and domain-specific applications.

Some application domains:

• Financial data analysis
• Retail industry
• Telecommunication industry
• Biological data analysis

Applications and Trends in Data Mining – contd..

Financial Data Analysis

• Financial data collected in banks and financial institutions are often relatively complete, reliable, and of high quality
• Design and construction of data warehouses for multidimensional data analysis and data mining
• View the debt and revenue changes by month, by region, by sector, and by other factors
• Access statistical information such as max, min, total, average, trend, etc.
• Loan payment prediction/consumer credit policy analysis
• Feature selection and attribute relevance ranking
• Loan payment performance
• Consumer credit rating
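The by-month and by-region views with max, min, total and average listed above amount to grouping records along one dimension and summarizing each group. A minimal sketch follows, assuming hypothetical (month, region, revenue) records; a real warehouse would answer such queries through its OLAP engine rather than in application code.

```python
from collections import defaultdict

# Hypothetical records: (month, region, revenue).
records = [
    ("2024-01", "East", 120.0),
    ("2024-01", "East", 80.0),
    ("2024-01", "West", 200.0),
    ("2024-02", "East", 150.0),
    ("2024-02", "West", 50.0),
    ("2024-02", "West", 70.0),
]

def summarize(records, key_index):
    """Group revenue by the chosen dimension (0 = month, 1 = region)
    and compute summary statistics for each group."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key_index]].append(rec[2])
    return {
        key: {
            "max": max(values),
            "min": min(values),
            "total": sum(values),
            "average": sum(values) / len(values),
        }
        for key, values in groups.items()
    }

by_month = summarize(records, 0)
by_region = summarize(records, 1)
print(by_month["2024-01"])
print(by_region["West"])
```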


Financial Data Analysis – contd..

• Classification and clustering of customers for targeted marketing
• Multidimensional segmentation by nearest-neighbor, classification, decision trees, etc. to identify customer groups or associate a new customer to an appropriate customer group
• Detection of money laundering and other financial crimes
• Integration of data from multiple DBs (e.g., bank transactions, federal/state crime history DBs)
• Tools: data visualization, linkage analysis, classification, clustering tools, outlier analysis, and sequential pattern analysis tools (find unusual access sequences)


Retail Industry

• Retail industry: huge amounts of data on sales, customer shopping history, etc.
• Applications of retail data mining
• Identify customer buying behaviors
• Discover customer shopping patterns and trends
• Improve the quality of customer service
• Achieve better customer retention and satisfaction
• Enhance goods consumption ratios
• Design more effective goods transportation and distribution policies


Telecomm. Industry

• A rapidly expanding and highly competitive industry and a great demand for data mining
• Understand the business involved
• Identify telecommunication patterns
• Catch fraudulent activities
• Make better use of resources
• Improve the quality of service
• Multidimensional analysis of telecommunication data
• Intrinsically multidimensional: calling-time, duration, location of caller, location of callee, type of call, etc.


Telecomm. Industry – contd..

• Fraudulent pattern analysis and the identification of unusual patterns
• Identify potentially fraudulent users and their atypical usage patterns
• Detect attempts to gain fraudulent entry to customer accounts
• Discover unusual patterns which may need special attention
• Multidimensional association and sequential pattern analysis
• Find usage patterns for a set of communication services by customer group, by month, etc.
• Promote the sales of specific services
• Improve the availability of particular services in a region
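One simple way to surface atypical usage patterns is to flag customers whose usage lies far from the group average. The sketch below uses a z-score test; the customer identifiers, the call minutes and the threshold of 2.0 are assumptions for illustration, and this is only one of many outlier analysis techniques.

```python
import statistics

def flag_unusual(usage, z_threshold=2.0):
    """usage maps a customer id to daily call minutes.
    Returns the ids whose usage is more than z_threshold population
    standard deviations away from the mean."""
    values = list(usage.values())
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return []  # every customer behaves identically
    return [cid for cid, v in usage.items() if abs(v - mean) / spread > z_threshold]

# Hypothetical daily call minutes per customer.
minutes = {
    "c1": 30, "c2": 35, "c3": 28, "c4": 32,
    "c5": 31, "c6": 29, "c7": 33, "c8": 400,
}

print(flag_unusual(minutes))
```

A flagged customer is only a candidate for special attention; deciding whether the pattern is actually fraudulent needs further analysis.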


Biomedical Data Analysis

• DNA sequences: 4 basic building blocks (nucleotides): adenine (A), cytosine (C), guanine (G), and thymine (T).

• Gene: a sequence of hundreds of individual nucleotides arranged in a particular order

• Humans have around 30,000 genes
• Tremendous number of ways that the nucleotides can be ordered and sequenced to form distinct genes
• Semantic integration of heterogeneous, distributed genome databases
• Current: highly distributed, uncontrolled generation and use of a wide variety of DNA data
• Data cleaning and data integration methods developed in data mining will help


Choosing a Data Mining System

• Commercial data mining systems have little in common
• Different data mining functionality or methodology
• May even work with completely different kinds of data sets
• Need a multidimensional view in selection
• Data types: relational, transactional, text, time sequence, spatial?
• System issues
• Running on only one or on several operating systems?
• A client/server architecture?
• Provide Web-based interfaces and allow XML data as input and/or output?


Choosing a Data Mining System – contd..

• Data sources
• ASCII text files, multiple relational data sources
• Support ODBC connections (OLE DB, JDBC)?
• Data mining functions and methodologies
• One vs. multiple data mining functions
• One vs. variety of methods per function
• More data mining functions and methods per function provide the user with greater flexibility and analysis power
• Coupling with DB and/or data warehouse systems
• Four forms of coupling: no coupling, loose coupling, semitight coupling, and tight coupling


Choosing a Data Mining System – contd..

• Scalability
• Row (or database size) scalability
• Column (or dimension) scalability
• Curse of dimensionality: it is much more challenging to make a system column scalable than row scalable
• Visualization tools
• “A picture is worth a thousand words”
• Visualization categories: data visualization, mining result visualization, mining process visualization, and visual data mining
• Data mining query language and graphical user interface
• Easy-to-use and high-quality graphical user interface
• Essential for user-guided, highly interactive data mining

Advanced Techniques of Data Mining

Web Mining

Web mining is the use of data mining techniques to automatically discover and extract information from Web documents and services.

There are three general classes of information that can be discovered by web mining:

• Web activity, from server logs and Web browser activity tracking.
• Web graph, from links between pages, people and other data.
• Web content, for the data found on Web pages and inside of documents.

Advanced Techniques of Data Mining – contd..

Web Mining – contd..

Note that there’s no explicit reference to “search” in the above description. While search is the biggest web miner by far, and generates the most revenue, there are many other valuable end uses for web mining results. A partial list includes:

• Business intelligence
• Competitive intelligence
• Pricing analysis
• Events
• Product data
• Popularity
• Reputation


Web Mining – contd..

When extracting Web content information using web mining, there are four typical steps:

• Collect: fetch the content from the Web
• Parse: extract usable data from formatted data (HTML, PDF, etc.)
• Analyze: tokenize, rate, classify, cluster, filter, sort, etc.
• Produce: turn the results of analysis into something useful (report, search index, etc.)
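The four steps can be sketched end to end with the Python standard library. Several things here are assumptions made for the example: the collect step is stubbed with an inline HTML string so that the sketch runs without network access, the analysis is a plain term count, and the report is a short list of the most frequent terms.

```python
import re
from collections import Counter
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Parse step: keep visible text, drop markup, scripts and styles."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = False

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip = True

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = False

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)

def collect():
    """Collect step, stubbed: a real miner would fetch this over HTTP."""
    return ("<html><head><title>Price list</title>"
            "<style>p{color:red}</style></head>"
            "<body><h1>Price list</h1>"
            "<p>Tea price 40. Coffee price 60.</p></body></html>")

def parse(html):
    extractor = TextExtractor()
    extractor.feed(html)
    return " ".join(extractor.chunks)

def analyze(text):
    """Analyze step: tokenize into lowercase words and count them."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def produce(counts, top=2):
    """Produce step: a small report of the most frequent terms."""
    return counts.most_common(top)

report = produce(analyze(parse(collect())))
print(report)
```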


Web Mining versus Data Mining

When comparing web mining with traditional data mining, there are three main differences to consider:

• Scale: In traditional data mining, processing 1 million records from a database would be a large job. In web mining, even 10 million pages wouldn’t be a big number.


Web Mining versus Data Mining – contd..

• Access: When doing data mining of corporate information, the data is private and often requires access rights to read. For web mining, the data is public and rarely requires access rights. But web mining has additional constraints, due to the implicit agreement with webmasters regarding automated (non-user) access to this data. This implicit agreement is that a webmaster allows crawlers access to useful data on the website, and in return the crawler (a) promises not to overload the site, and (b) has the potential to drive more traffic to the website once the search index is published. With web mining, there often is no such index, which means the crawler has to be extra careful/polite during the crawling process, to avoid causing any problems for the webmaster.
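In practice, one common way webmasters state their terms for automated access is a robots.txt file; the text above does not name it, so treating it as the concrete form of the agreement is an assumption of this sketch. The example uses Python's urllib.robotparser on hypothetical robots.txt lines and a hypothetical user agent name, and only plans the crawl (which URLs to fetch and how long to wait) without contacting any site.

```python
import urllib.robotparser

# Hypothetical robots.txt content; a real crawler would download it
# from the target site before fetching anything else.
robots_lines = [
    "User-agent: *",
    "Crawl-delay: 5",
    "Disallow: /private/",
]

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_lines)

def polite_plan(urls, user_agent="example-miner"):
    """Return (url, delay_seconds) pairs for the URLs the site allows.
    Falls back to a 1 second delay if the site states none."""
    delay = rp.crawl_delay(user_agent) or 1
    return [(url, delay) for url in urls if rp.can_fetch(user_agent, url)]

plan = polite_plan([
    "https://example.com/index.html",
    "https://example.com/private/data.html",
])
print(plan)
```

Waiting the stated delay between requests is what keeps the crawler from overloading the site.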


Web Mining versus Data Mining – contd..

• Structure: A traditional data mining task gets information from a database, which provides some level of explicit structure. A typical web mining task is processing unstructured or semi-structured data from web pages. Even when the underlying information for web pages comes from a database, this often is obscured by HTML markup.

Note that by “traditional” data mining we mean the type of analysis supported by most vendor tools, which assumes you’re processing table-oriented data that typically comes from a database.

Text Books

• Roiger & Geatz. Data Mining. Pearson Education.
• A. K. Pujari. Data Mining. Universities Press.
• M. H. Dunham. Data Mining: Introductory and Advanced Topics. Pearson Education.
• J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann.

Reference Books

• I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann.

• D. Hand, H. Mannila and P. Smyth. Principles of Data Mining. Prentice-Hall.

Data mining: “Drowning in Data yet Starving for Knowledge”
