Data mining

Upload: howard-fields

Post on 04-Jan-2016

Page 1: Data mining. Data mining, at its core, is the transformation of large amounts of data into meaningful patterns and rules

Data mining

• Data mining, at its core, is the transformation of large amounts of data into meaningful patterns and rules.

Definition of Data Mining

• The nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data stored in structured databases. - Fayyad et al. (1996)

• Keywords in this definition: Process, nontrivial, valid, novel, potentially useful, understandable.

• Other names: knowledge extraction, pattern analysis, knowledge discovery, information harvesting, pattern searching, data dredging,…

Two types of Data mining

• Directed data mining (supervised)

• Undirected data mining (unsupervised)

Directed Data mining

• In directed data mining, you are trying to predict a particular data point.

• For example, the sale price of a house, given information about other houses for sale in the neighborhood.
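
The house-price example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the listings are invented, living area is assumed to be the only predictor, and ordinary least squares is used as one simple directed technique among many. The names `listings`, `fit_line`, and `predict_price` are hypothetical.

```python
# Hypothetical neighborhood listings: (living area in square feet, sale price in dollars).
# These numbers are invented for illustration only.
listings = [(1000, 200_000), (1500, 300_000), (2000, 400_000), (2500, 500_000)]


def fit_line(points):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Covariance of x and y, and variance of x (both left unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_price(area, model):
    """Predict the sale price of a house with the given living area."""
    slope, intercept = model
    return slope * area + intercept


model = fit_line(listings)
predicted = predict_price(1800, model)
print(round(predicted))  # prints 360000 for this invented data
```

The known sale prices play the role of the target that "directs" the mining: the model is fitted to houses whose price is known and then applied to a house whose price is not.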

Undirected Data mining

• In undirected data mining, you are trying to create groups of data, or find patterns in existing data.

• For example, every U.S. census is in effect data mining, as the government gathers data about everyone in the country and turns it into useful information.
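
As a rough sketch of "creating groups of data", the following Python code runs k-means on a handful of invented two-dimensional records. The choice of k-means, the data, the fixed starting centroids, and the names `nearest` and `k_means` are all assumptions made for illustration; no target value is supplied, which is what makes the task undirected.

```python
import math

# Hypothetical 2-D records, invented for illustration only.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]


def nearest(point, centroids):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def k_means(points, centroids, max_iterations=100):
    """Alternate assignment and centroid update until the centroids stop moving."""
    centroids = list(centroids)
    for _ in range(max_iterations):
        # Assignment step: attach every point to its nearest centroid.
        labels = [nearest(p, centroids) for p in points]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for i, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids


# Start from two of the points so the run is deterministic.
labels, centroids = k_means(points, centroids=[points[0], points[3]])
print(labels)  # prints [0, 0, 0, 1, 1, 1]
```

Real implementations usually pick the starting centroids at random and repeat the run several times; fixed starting points are used here only to keep the example reproducible.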

• Modern data mining started in the mid-1990s, when computing power and the cost of computing and storage finally reached a level where companies could do it in-house, without having to rely on outside computing powerhouses.

• The term data mining is all-encompassing, referring to dozens of techniques and procedures used to examine and transform data.

Data Mining at the Intersection of Many Disciplines

[Figure: DATA MINING shown at the intersection of Management Science & Information Systems, Databases, Pattern Recognition, Machine Learning, and Mathematical Modeling]

Data Mining Characteristics/Objectives

• Source of data for DM is often a consolidated data warehouse (not always!)

• DM environment is usually a client-server or a Web-based information systems architecture

• Data is the most critical ingredient for DM, and it may include soft/unstructured data

• The miner is often an end user

• Striking it rich requires creative thinking

• Data mining tools’ capabilities and ease of use are essential (Web, parallel processing, etc.)

Data in Data Mining

[Figure: Data divides into Categorical (Nominal, Ordinal) and Numerical (Interval, Ratio)]

• Data: a collection of facts usually obtained as the result of experiences, observations, or experiments

• Data may consist of numbers, words, images, …

• Data: lowest level of abstraction (from which information and knowledge are derived)

- DM with different data types?

- Other data types?

A Taxonomy for Data Mining Tasks

Data Mining Task       Learning Method   Popular Algorithms
Prediction             Supervised        Classification and Regression Trees, ANN, SVM, Genetic Algorithms
  Classification       Supervised        Decision trees, ANN/MLP, SVM, Rough sets, Genetic Algorithms
  Regression           Supervised        Linear/Nonlinear Regression, Regression trees, ANN/MLP, SVM
Association            Unsupervised      Apriori, OneR, ZeroR, Eclat
  Link analysis        Unsupervised      Expectation Maximization, Apriori Algorithm, Graph-based Matching
  Sequence analysis    Unsupervised      Apriori Algorithm, FP-Growth technique
Clustering             Unsupervised      K-means, ANN/SOM
  Outlier analysis     Unsupervised      K-means, Expectation Maximization (EM)
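
To make the association task in the taxonomy above more concrete, here is a minimal Python sketch of the Apriori idea: count how often itemsets occur, keep the frequent ones, and build larger candidates only from them. The market-basket transactions, the 0.6 support threshold, and the function names `support` and `apriori` are assumptions made for illustration; production implementations are considerably more optimized.

```python
from itertools import combinations

# Hypothetical market-basket transactions, invented for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    items = {item for t in transactions for item in t}
    level = {frozenset([item]) for item in items}  # 1-item candidates
    frequent = {}
    k = 1
    while level:
        # Keep only the candidates of the current size that are frequent enough.
        kept = {}
        for candidate in level:
            s = support(candidate, transactions)
            if s >= min_support:
                kept[candidate] = s
        frequent.update(kept)
        # Join step: merge frequent k-itemsets into (k+1)-item candidates.
        k += 1
        candidates = {a | b for a in kept for b in kept if len(a | b) == k}
        # Prune step: every (k-1)-item subset of a candidate must be frequent.
        level = {
            c for c in candidates
            if all(frozenset(sub) in kept for sub in combinations(c, k - 1))
        }
    return frequent


frequent = apriori(transactions, min_support=0.6)

# Confidence of the rule {bread} -> {milk}: support(bread, milk) / support(bread).
confidence = frequent[frozenset({"bread", "milk"})] / frequent[frozenset({"bread"})]
print(sorted(sorted(itemset) for itemset in frequent))
print(round(confidence, 2))  # prints 0.75
```

With this invented data, the four single items and the pair {bread, milk} meet the threshold, and the rule "bread implies milk" holds in three of the four baskets that contain bread.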

• The ultimate goal of data mining is to create a model.

• A good model can improve the way you read and interpret your existing data and help you predict your future data.

• Since there are so many techniques in data mining, the major step in creating a good model is to determine what type of technique to use. That will come with practice and experience, and some guidance. From there, the model needs to be refined to make it even more useful.

Weka as a Data mining tool

• Data mining isn't solely the domain of big companies and expensive software.

• In fact, there's a piece of software that does almost all the same things as these expensive pieces of software: it is called WEKA.

• WEKA is the product of the University of Waikato (New Zealand) and was first implemented in its modern form in 1997.

• It uses the GNU General Public License (GPL).

• The software is written in the Java™ language and contains a GUI for interacting with data files and producing visual results (think tables and curves).

• It also has a general API, so you can embed WEKA, like any other library, in your own applications to do such things as automated server-side data-mining tasks.