data mining: a closer look chapter 2. 2.1 data mining strategies

41
Data Mining: A Closer Look Chapter 2

Post on 19-Dec-2015

222 views

Category:

Documents


4 download

TRANSCRIPT

Page 1: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies

Data Mining: A Closer Look

Chapter 2

Page 2: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies

2.1 Data Mining Strategies

Page 3: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies

Figure 2.1 A hierarchy of data mining strategies

Data MiningStrategies

SupervisedLearning

Market BasketAnalysis

UnsupervisedClustering

PredictionEstimationClassification

Page 4: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies

Classification

• Learning is supervised.

• The dependent variable is categorical.

• Well-defined classes.

• Current rather than future behavior.

Page 5: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies

Estimation

• Learning is supervised.

• The dependent variable is numeric.

• Well-defined classes.

• Current rather than future behavior.

Page 6: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies

Prediction

• The emphasis is on predicting future rather than current outcomes.

• The output attribute may be categorical or numeric.

Page 7: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies

The Cardiology Patient Dataset

Page 8: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies

Table 2.1 • Cardiology Patient Data Attribute Mixed Numeric Name Values Values Comments

Age Numeric Numeric Age in years

Sex Male, Female 1, 0 Patient gender

Chest Pain Type Angina, Abnormal Angina, 1–4 NoTang = Nonanginal NoTang, Asymptomatic pain

Blood Pressure Numeric Numeric Resting blood pressure upon hospital admission

Cholesterol Numeric Numeric Serum cholesterol

Fasting Blood True, False 1, 0 Is fasting blood sugar less Sugar < 120 than 120?

Resting ECG Normal, Abnormal, Hyp 0, 1, 2 Hyp = Left ventricular hypertrophy

Maximum Heart Numeric Numeric Maximum heart rate Rate achieved

Induced Angina? True, False 1, 0 Does the patient experience angina as a result of exercise?

Old Peak Numeric Numeric ST depression induced by exercise relative to rest

Slope Up, flat, down 1–3 Slope of the peak exercise ST segment

Number Colored 0, 1, 2, 3 0, 1, 2, 3 Number of major vessels Vessels colored by fluorosopy

Thal Normal fix, rev 3, 6, 7 Normal, fixed defect, reversible defect

Concept Class Healthy, Sick 1, 0 Angiographic disease status

Page 9: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies

Table 2.2 • Most and Least Typical Instances from the Cardiology Domain

Attribute Most Typical Least Typical Most Typical Least TypicalName Healthy Class Healthy Class Sick Class Sick Class

Age 52 63 60 62Sex Male Male Male FemaleChest Pain Type NoTang Angina Asymptomatic AsymptomaticBlood Pressure 138 145 125 160Cholesterol 223 233 258 164Fasting Blood Sugar < 120 False True False FalseResting ECG Normal Hyp Hyp HypMaximum Heart Rate 169 150 141 145Induced Angina? False False True FalseOld Peak 0 2.3 2.8 6.2Slope Up Down Flat DownNumber of Colored Vessels 0 0 1 3Thal Normal Fix Rev Rev

Page 10: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies

A Healthy Class Rule for the Cardiology Patient Dataset

IF 169 <= Maximum Heart Rate <=202

THEN Concept Class = Healthy

Rule accuracy: 85.07%

Rule coverage: 34.55%

Page 11: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies

A Sick Class Rule for the Cardiology Patient Dataset

IF Thal = Rev & Chest Pain Type = Asymptomatic

THEN Concept Class = Sick

Rule accuracy: 91.14%

Rule coverage: 52.17%

Page 12: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies

Unsupervised Clustering

• Determine if concepts can be found in the data.• Evaluate the likely performance of a supervised model.• Determine a best set of input attributes for supervised learning.• Detect Outliers.

Page 13: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies

Market Basket Analysis

• Find interesting relationships among retail products.

• Uses association rule algorithms.

Page 14: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies

2.2 Supervised Data Mining Techniques

Page 15: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies

The Credit Card Promotion Database

Page 16: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies

Table 2.3 • The Credit Card Promotion Database

Income Magazine Watch Life Insurance Credit CardRange ($) Promotion Promotion Promotion Insurance Sex Age

40–50K Yes No No No Male 4530–40K Yes Yes Yes No Female 4040–50K No No No No Male 4230–40K Yes Yes Yes Yes Male 4350–60K Yes No Yes No Female 3820–30K No No No No Female 5530–40K Yes No Yes Yes Male 3520–30K No Yes No No Male 2730–40K Yes No No No Male 4330–40K Yes Yes Yes No Female 4140–50K No Yes Yes No Female 4320–30K No Yes Yes No Male 2950–60K Yes Yes Yes No Female 3940–50K No Yes No No Male 5520–30K No No Yes Yes Female 19

Page 17: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies

A Hypothesis for the Credit Card Promotion Database

A combination of one or more of the dataset attributes differentiate Acme Credit Card Company card holders who have taken advantage of the life insurance promotion and those card holders who have chosen not to participate in the promotional offer.

Page 18: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies

A Production Rule for theCredit Card Promotion Database

IF Sex = Female & 19 <=Age <= 43

THEN Life Insurance Promotion = Yes

Rule Accuracy: 100.00%

Rule Coverage: 66.67%

Page 19: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies

Production Rules

• Rule accuracy is a between-class measure.

• Rule coverage is a within-class measure.

Page 20: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies

Neural Networks

Page 21: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies

Figure 2.2 A multilayer fully connected neural network

InputLayer

OutputLayer

HiddenLayer

Page 22: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies

Table 2.4 • Neural Network Training: Actual and Computed Output

Instance Number Life Insurance Promotion Computed Output

1 0 0.024

2 1 0.998

3 0 0.023

4 1 0.986

5 1 0.999

6 0 0.050

7 1 0.999

8 0 0.262

9 0 0.060

10 1 0.997

11 1 0.999

12 1 0.776

13 1 0.999

14 0 0.023

15 1 0.999

Page 23: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies

Statistical Regression

Life insurance promotion =

0.5909 (credit card insurance) -

0.5455 (sex) + 0.7727

Page 24: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies

2.3 Association Rules

Page 25: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies

An Association Rule for the Credit Card Promotion Database

IF Sex = Female & Age = over40 &

Credit Card Insurance = No

THEN Life Insurance Promotion = Yes

Page 26: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies

2.4 Clustering Techniques

Page 27: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies

Figure 2.3 An unsupervised cluster of the credit card database

# Instances: 5Sex: Male => 3

Female => 2Age: 37.0Credit Card Insurance: Yes => 1

No => 4Life Insurance Promotion: Yes => 2

No => 3

Cluster 1

Cluster 2

Cluster 3

# Instances: 3Sex: Male => 3

Female => 0Age: 43.3Credit Card Insurance: Yes => 0

No => 3Life Insurance Promotion: Yes => 0

No => 3

# Instances: 7Sex: Male => 2

Female => 5Age: 39.9Credit Card Insurance: Yes => 2

No => 5Life Insurance Promotion: Yes => 7

No => 0

Page 28: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies

2.5 Evaluating Performance

Page 29: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies

Evaluating Supervised Learner Models

Page 30: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies

Confusion Matrix

• A matrix used to summarize the results of a supervised classification.

• Entries along the main diagonal are correct classifications.

• Entries other than those on the main diagonal are classification errors.

Page 31: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies

Table 2.5 • A Three-Class Confusion Matrix

Computed Decision

C1 C2 C3C1 C11 C12 C13

C2 C21 C22 C23

C3 C31 C32 C33

Page 32: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies

Two-Class Error Analysis

Page 33: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies

Table 2.6 • A Simple Confusion Matrix

Computed Computed

Accept Reject

Accept True FalseAccept Reject

Reject False TrueAccept Reject

Page 34: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies

Table 2.7 • Two Confusion Matrices Each Showing a 10% Error Rate

Model Computed Computed Model Computed ComputedA Accept Reject B Accept Reject

Accept 600 25 Accept 600 75Reject 75 300 Reject 25 300

Page 35: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies

Evaluating Numeric Output

• Mean absolute error

• Mean squared error

• Root mean squared error

Page 36: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies

Comparing Models by Measuring Lift

Page 37: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies

Figure 2.4 Targeted vs. mass mailing

0

200

400

600

800

1000

1200

0 10 20 30 40 50 60 70 80 90 100

NumberResponding

% Sampled

Page 38: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies

Computing Lift

)|(

)|(

PopulationCP

SampleCPLift

i

i

Page 39: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies

Table 2.8 • Two Confusion Matrices: No Model and an Ideal Model

No Computed Computed Ideal Computed ComputedModel Accept Reject Model Accept Reject

Accept 1,000 0 Accept 1,000 0Reject 99,000 0 Reject 0 99,000

Page 40: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies

Table 2.9 • Two Confusion Matrices for Alternative Models with Lift Equal to 2.25

Model Computed Computed Model Computed ComputedX Accept Reject Y Accept Reject

Accept 540 460 Accept 450 550Reject 23,460 75,540 Reject 19,550 79,450

Page 41: Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies

Unsupervised Model Evaluation