Part I
Data Mining Fundamentals
Data Mining: A First View
Chapter 1
1.1 Data Mining: A Definition
Data Mining
The process of employing one or more computer learning techniques to automatically analyze and extract knowledge from data.
Induction-based Learning
The process of forming general concept definitions by observing specific examples of concepts to be learned.
Knowledge Discovery in Databases (KDD)
The application of the scientific method to data mining. Data mining is one step of the KDD process.
1.2 What Can Computers Learn?
Four Levels of Learning
• Facts
• Concepts
• Procedures
• Principles
Concepts
Computers are good at learning concepts. Concepts are the output of a data mining session.
Three Concept Views
• Classical View
• Probabilistic View
• Exemplar View
Supervised Learning
• Build a learner model using data instances of known origin.
• Use the model to determine the outcome of new instances of unknown origin.
Supervised Learning:
A Decision Tree Example
Decision Tree
A tree structure where non-terminal nodes represent tests on one or more attributes and terminal nodes reflect decision outcomes.
Table 1.1 • Hypothetical Training Data for Disease Diagnosis
Patient  Sore             Swollen
ID#      Throat   Fever   Glands    Congestion  Headache  Diagnosis
1        Yes      Yes     Yes       Yes         Yes       Strep throat
2        No       No      No        Yes         Yes       Allergy
3        Yes      Yes     No        Yes         No        Cold
4        Yes      No      Yes       No          No        Strep throat
5        No       Yes     No        Yes         No        Cold
6        No       No      No        Yes         No        Allergy
7        No       No      Yes       No          No        Strep throat
8        Yes      No      No        Yes         Yes       Allergy
9        No       Yes     No        Yes         Yes       Cold
10       Yes      Yes     No        Yes         Yes       Cold
Figure 1.1 A decision tree for the data in Table 1.1
Swollen Glands
   No:  Fever
           No:  Diagnosis = Allergy
           Yes: Diagnosis = Cold
   Yes: Diagnosis = Strep Throat
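The tree in Figure 1.1 can be written as a nested structure and checked against the training data in Table 1.1. The following is a minimal sketch: the dictionary layout and the names `TREE`, `TRAINING`, and `classify` are assumptions made for illustration and do not come from the original slides.

```python
# Table 1.1, reduced to the two attributes the tree tests:
# patient ID -> (swollen_glands, fever, diagnosis)
TRAINING = {
    1: ("Yes", "Yes", "Strep throat"),
    2: ("No", "No", "Allergy"),
    3: ("No", "Yes", "Cold"),
    4: ("Yes", "No", "Strep throat"),
    5: ("No", "Yes", "Cold"),
    6: ("No", "No", "Allergy"),
    7: ("Yes", "No", "Strep throat"),
    8: ("No", "No", "Allergy"),
    9: ("No", "Yes", "Cold"),
    10: ("No", "Yes", "Cold"),
}

# Non-terminal nodes are dicts naming the attribute to test;
# terminal nodes are plain strings holding the decision outcome.
TREE = {
    "attribute": "swollen_glands",
    "branches": {
        "Yes": "Strep throat",
        "No": {
            "attribute": "fever",
            "branches": {"Yes": "Cold", "No": "Allergy"},
        },
    },
}


def classify(tree, instance):
    """Follow the branch matching each tested attribute until a leaf is reached."""
    node = tree
    while isinstance(node, dict):
        node = node["branches"][instance[node["attribute"]]]
    return node


# Count how many training instances the tree classifies correctly.
correct = sum(
    classify(TREE, {"swollen_glands": sg, "fever": fever}) == diagnosis
    for sg, fever, diagnosis in TRAINING.values()
)
print(f"{correct} of {len(TRAINING)} training instances classified correctly")
```

Sore throat, congestion, and headache are left out of the sketch because the tree never tests them.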
Table 1.2 • Data Instances with an Unknown Classification
Patient  Sore             Swollen
ID#      Throat   Fever   Glands    Congestion  Headache  Diagnosis
11       No       No      Yes       Yes         Yes       ?
12       Yes      Yes     No        No          Yes       ?
13       No       No      No        No          Yes       ?
Production Rules
IF Swollen Glands = Yes
THEN Diagnosis = Strep Throat
IF Swollen Glands = No & Fever = Yes
THEN Diagnosis = Cold
IF Swollen Glands = No & Fever = No
THEN Diagnosis = Allergy
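The three production rules can be applied mechanically to the unclassified instances of Table 1.2. The sketch below is one possible encoding; the names `RULES`, `UNKNOWN`, and `apply_rules` and the lowercase attribute keys are illustrative assumptions.

```python
# Each rule pairs the attribute values it requires with the diagnosis it concludes.
RULES = [
    ({"swollen_glands": "Yes"}, "Strep throat"),
    ({"swollen_glands": "No", "fever": "Yes"}, "Cold"),
    ({"swollen_glands": "No", "fever": "No"}, "Allergy"),
]

# The instances of Table 1.2, whose diagnosis is unknown.
UNKNOWN = {
    11: {"sore_throat": "No", "fever": "No", "swollen_glands": "Yes",
         "congestion": "Yes", "headache": "Yes"},
    12: {"sore_throat": "Yes", "fever": "Yes", "swollen_glands": "No",
         "congestion": "No", "headache": "Yes"},
    13: {"sore_throat": "No", "fever": "No", "swollen_glands": "No",
         "congestion": "No", "headache": "Yes"},
}


def apply_rules(instance):
    """Return the diagnosis of the first rule whose conditions all hold, else None."""
    for conditions, diagnosis in RULES:
        if all(instance.get(attr) == value for attr, value in conditions.items()):
            return diagnosis
    return None


predictions = {pid: apply_rules(inst) for pid, inst in UNKNOWN.items()}
for pid, diagnosis in predictions.items():
    print(pid, diagnosis)
```

Under these rules patient 11 is classified as strep throat, patient 12 as a cold, and patient 13 as an allergy.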
Unsupervised Clustering
A data mining method that builds models from data without predefined classes.
Table 1.3 • Acme Investors Incorporated
Customer  Account     Margin   Transaction  Trades/              Favorite    Annual
ID        Type        Account  Method       Month    Sex  Age    Recreation  Income
1005      Joint       No       Online       12.5     F    30–39  Tennis      40–59K
1013      Custodial   No       Broker       0.5      F    50–59  Skiing      80–99K
1245      Joint       No       Online       3.6      M    20–29  Golf        20–39K
2110      Individual  Yes      Broker       22.3     M    30–39  Fishing     40–59K
1001      Individual  Yes      Online       5.0      M    40–49  Golf        60–79K
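Because no class attribute is defined for the Acme data, a clustering method has to group customers by how alike they are. The sketch below is not a clustering algorithm and is not taken from the original slides. It only counts matching categorical attribute values for each pair of customers in Table 1.3, one simple kind of similarity that a clustering method could build on. The numeric Trades/Month column is left out, and all names are illustrative assumptions.

```python
from itertools import combinations

# Categorical attributes compared, in the order they appear in each tuple below.
ATTRIBUTES = ("account_type", "margin_account", "transaction_method",
              "sex", "age", "recreation", "income")

# Table 1.3 without the numeric Trades/Month column.
CUSTOMERS = {
    1005: ("Joint", "No", "Online", "F", "30-39", "Tennis", "40-59K"),
    1013: ("Custodial", "No", "Broker", "F", "50-59", "Skiing", "80-99K"),
    1245: ("Joint", "No", "Online", "M", "20-29", "Golf", "20-39K"),
    2110: ("Individual", "Yes", "Broker", "M", "30-39", "Fishing", "40-59K"),
    1001: ("Individual", "Yes", "Online", "M", "40-49", "Golf", "60-79K"),
}


def matches(a, b):
    """Count the attribute positions where two customers share the same value."""
    return sum(x == y for x, y in zip(a, b))


# Score every unordered pair of customers.
pair_scores = {
    (i, j): matches(CUSTOMERS[i], CUSTOMERS[j])
    for i, j in combinations(CUSTOMERS, 2)
}

# List the pairs from most to least similar.
for pair, score in sorted(pair_scores.items(), key=lambda item: -item[1]):
    print(pair, f"{score}/{len(ATTRIBUTES)}")
```

A real clustering method would also need a way to handle numeric attributes and a rule for turning pairwise similarities into groups.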
1.3 Is Data Mining Appropriate for My Problem?
Data Mining or Data Query?
• Shallow Knowledge
• Multidimensional Knowledge
• Hidden Knowledge
• Deep Knowledge
Data Mining vs. Data Query: An Example
• Use data query if you already know, or almost know, what you are looking for.
• Use data mining to find regularities in data that are not obvious.
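A data query answers a question whose form is already known. As a hedged illustration (the question, the field names, and the in-memory records are assumptions, with values copied from three columns of Table 1.3), the following sketch asks which customers with a margin account trade online:

```python
# Three columns of Table 1.3, held in memory for illustration.
ACCOUNTS = [
    {"customer_id": 1005, "margin_account": "No", "transaction_method": "Online"},
    {"customer_id": 1013, "margin_account": "No", "transaction_method": "Broker"},
    {"customer_id": 1245, "margin_account": "No", "transaction_method": "Online"},
    {"customer_id": 2110, "margin_account": "Yes", "transaction_method": "Broker"},
    {"customer_id": 1001, "margin_account": "Yes", "transaction_method": "Online"},
]

# The query states exactly what is wanted; nothing is discovered.
online_margin_ids = [
    row["customer_id"]
    for row in ACCOUNTS
    if row["margin_account"] == "Yes" and row["transaction_method"] == "Online"
]
print(online_margin_ids)
```

Data mining, by contrast, would start without such a question and look for regularities across the attributes.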
1.4 Expert Systems or Data Mining?
Expert System
A computer program that emulates the problem-solving skills of one or more human experts.
Knowledge Engineer
A person trained to interact with an expert in order to capture their knowledge.
Figure 1.2 Data mining vs. expert systems
Data -> Data Mining Tool -> If Swollen Glands = Yes Then Diagnosis = Strep Throat
Human Expert -> Knowledge Engineer -> Expert System Building Tool -> If Swollen Glands = Yes Then Diagnosis = Strep Throat
1.5 A Simple Data Mining Process Model
Figure 1.3 A simple data mining process model
Components shown: Operational Database, Data Warehouse, SQL Queries, Data Mining, Interpretation & Evaluation, Result Application
Assembling the Data
• The Data Warehouse
• Relational Databases and Flat Files
Mining the Data
Interpreting the Results
Result Application
1.7 Data Mining Applications