A/W & Dr. Chen, Data Mining Dr. Chen, Data Mining Part I Data Mining Fundamentals Chapter 1 Data Mining: A First View Jason C. H. Chen, Ph.D. Professor of MIS School of Business Administration Gonzaga University Spokane, WA 99223 [email protected]


Part I
Data Mining Fundamentals

Chapter 1
Data Mining: A First View

Jason C. H. Chen, Ph.D.
Professor of MIS
School of Business Administration
Gonzaga University
Spokane, WA 99223

[email protected]


1.1 Data Mining: A Definition


• The process of employing one or more computer learning techniques to automatically analyze and extract knowledge from data.


Induction-based Learning

• The process of forming general concept definitions by observing specific examples of concepts to be learned.

Knowledge Discovery in Databases (KDD)

• The application of the scientific method to data mining. Data mining is one step of the KDD process.


Data Mining Examples

• A telephone company used a data mining tool to analyze its customer data warehouse. The tool found about 10,000 supposedly residential customers who were spending over $1,000 per month on phone bills.

• After further study, the phone company discovered that these customers were really small business owners trying to avoid paying business rates.


Other Data Mining Examples

• 65% of customers who did not use the credit card in the last six months are 88% likely to cancel their accounts.

• If age < 30 and income <= $25,000 and credit rating < 3 and credit amount > $25,000 then the minimum loan term is 10 years.

• 82% of customers who bought a new TV 27" or larger are 90% likely to buy an entertainment center within the next 4 weeks.


1.2 What Can Computers Learn?


Four Levels of Learning

• Fact
– A simple statement of truth.

• Concept
– A set of objects, symbols, or events grouped together because they share certain characteristics.

• Procedure
– A step-by-step course of action to achieve a goal. We use procedures in our everyday functioning as well as in the solution of difficult problems.

• Principle
– Represents the highest level of learning. Principles are general truths or laws that are basic to other truths.

Source: Merrill and Tennyson, 1977; p. 5 of the text


Concepts

• Computers are good at learning concepts. Concepts are the output of a data mining session.

Three Concept Views

• Classical View

• Probabilistic View

• Exemplar View


• Classical View
– Attests that all concepts have definite defining properties.

• Probabilistic View
– Concepts are represented by properties that are probable of concept members.

• Exemplar View
– States that a given instance is determined to be an example of a particular concept if the instance is similar enough to a set of one or more known examples of the concept.
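The exemplar view can be illustrated with a small sketch. Everything in it is an assumption made for illustration and does not come from the slides: the "good credit risk" concept, the four Boolean attributes, the simple-matching similarity, and the 0.75 threshold.

```python
def similarity(a, b):
    """Fraction of attribute positions on which two instances agree."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


def is_example_of(instance, exemplars, threshold=0.75):
    """Exemplar view: accept the instance if it is similar enough
    to at least one known example of the concept."""
    return any(similarity(instance, e) >= threshold for e in exemplars)


# Hypothetical exemplars of a "good credit risk" concept:
# (owns home, has steady job, paid last loan on time, has savings)
good_risk_exemplars = [
    (True, True, True, True),
    (False, True, True, True),
]

print(is_example_of((True, True, True, False), good_risk_exemplars))    # True
print(is_example_of((False, False, False, True), good_risk_exemplars))  # False
```

The first instance agrees with a stored exemplar on three of four attributes, so it is accepted; the second agrees on at most two, so it is rejected. A different similarity measure or threshold would draw the concept boundary differently.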


Figure - A hierarchy of data mining strategies

Data Mining Strategies
• Unsupervised Clustering
– No output attributes
• Supervised Learning
– Classification: categorical/discrete output (current behavior)
– Estimation: numeric output
– Prediction: future outcome (categorical/numeric)
• Market Basket Analysis


Supervised Learning

Two purposes:

• 1. Build a learner (classification) model using data instances of known origin.
– This is an induction process.

• 2. Use the model to determine the outcome of new instances of unknown origin.
– This is a deduction process.

Supervised learning is the process of building classification models using data instances of known origin.


Supervised Learning: A Decision Tree Example


Decision Tree

• A tree structure where non-terminal nodes represent tests on one or more attributes and terminal nodes reflect decision outcomes.

Table 1.1 – Hypothetical Training Data for Disease Diagnosis

Patient ID#  Sore Throat  Fever  Swollen Glands  Congestion  Headache  Diagnosis
1            Yes          Yes    Yes             Yes         Yes       Strep throat
2            No           No     No              Yes         Yes       Allergy
3            Yes          Yes    No              Yes         No        Cold
4            Yes          No     Yes             No          No        Strep throat
5            No           Yes    No              Yes         No        Cold
6            No           No     No              Yes         No        Allergy
7            No           No     Yes             No          No        Strep throat
8            Yes          No     No              Yes         Yes       Allergy
9            No           Yes    No              Yes         Yes       Cold
10           Yes          Yes    No              Yes         Yes       Cold


Swollen Glands?
• Yes: Diagnosis = Strep Throat
• No: Fever?
– Yes: Diagnosis = Cold
– No: Diagnosis = Allergy

Figure 1.1 – A decision tree for the data in Table 1.1
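As a minimal sketch, the tree in Figure 1.1 can be written as nested tests and checked against the training data in Table 1.1. The slides do not use Python; the function name `diagnose` and the tuple layout are illustrative assumptions.

```python
# Table 1.1 rows:
# (patient_id, sore_throat, fever, swollen_glands, congestion, headache, diagnosis)
TRAINING_DATA = [
    (1, "Yes", "Yes", "Yes", "Yes", "Yes", "Strep throat"),
    (2, "No", "No", "No", "Yes", "Yes", "Allergy"),
    (3, "Yes", "Yes", "No", "Yes", "No", "Cold"),
    (4, "Yes", "No", "Yes", "No", "No", "Strep throat"),
    (5, "No", "Yes", "No", "Yes", "No", "Cold"),
    (6, "No", "No", "No", "Yes", "No", "Allergy"),
    (7, "No", "No", "Yes", "No", "No", "Strep throat"),
    (8, "Yes", "No", "No", "Yes", "Yes", "Allergy"),
    (9, "No", "Yes", "No", "Yes", "Yes", "Cold"),
    (10, "Yes", "Yes", "No", "Yes", "Yes", "Cold"),
]


def diagnose(fever, swollen_glands):
    """Follow the tree: test Swollen Glands first, then Fever."""
    if swollen_glands == "Yes":
        return "Strep throat"
    if fever == "Yes":
        return "Cold"
    return "Allergy"


# Collect the IDs of any training instances the tree gets wrong.
errors = [row[0] for row in TRAINING_DATA if diagnose(row[2], row[3]) != row[6]]
print("Misclassified training instances:", errors)  # []
```

The tree uses only two of the five attributes, yet it classifies all ten training instances correctly; sore throat, congestion, and headache are not needed to separate the three diagnoses in this table.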


Table 1.2 – Data Instances with an Unknown Classification

Patient ID#  Sore Throat  Fever  Swollen Glands  Congestion  Headache  Diagnosis
11           No           No     Yes             Yes         Yes       ?
12           Yes          Yes    No              No          Yes       ?
13           No           No     No              No          Yes       ?


Production Rules

• IF Swollen Glands = Yes THEN Diagnosis = Strep Throat
• IF Swollen Glands = No & Fever = Yes THEN Diagnosis = Cold
• IF Swollen Glands = No & Fever = No THEN Diagnosis = Allergy

We can translate any decision tree into a set of production rules. They are rules of the form:

IF <antecedent conditions> THEN <consequent conditions>
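One way to picture production rules is as data: each rule pairs a set of antecedent conditions with a consequent. The sketch below applies the three rules to the unclassified instances of Table 1.2, which is the deduction step of supervised learning. The names `RULES` and `apply_rules` and the dictionary layout are assumptions made for illustration.

```python
# Each rule: (antecedent conditions, consequent diagnosis)
RULES = [
    ({"swollen_glands": "Yes"}, "Strep Throat"),
    ({"swollen_glands": "No", "fever": "Yes"}, "Cold"),
    ({"swollen_glands": "No", "fever": "No"}, "Allergy"),
]


def apply_rules(instance, rules=RULES):
    """Return the consequent of the first rule whose antecedent
    conditions all hold for the instance, or None if no rule fires."""
    for antecedent, consequent in rules:
        if all(instance.get(attr) == value for attr, value in antecedent.items()):
            return consequent
    return None


# Table 1.2: instances with an unknown classification
UNKNOWN = {
    11: {"sore_throat": "No", "fever": "No", "swollen_glands": "Yes",
         "congestion": "Yes", "headache": "Yes"},
    12: {"sore_throat": "Yes", "fever": "Yes", "swollen_glands": "No",
         "congestion": "No", "headache": "Yes"},
    13: {"sore_throat": "No", "fever": "No", "swollen_glands": "No",
         "congestion": "No", "headache": "Yes"},
}

predictions = {pid: apply_rules(inst) for pid, inst in UNKNOWN.items()}
print(predictions)  # {11: 'Strep Throat', 12: 'Cold', 13: 'Allergy'}
```

Because the three antecedents are mutually exclusive and cover every combination of Swollen Glands and Fever, exactly one rule fires for each instance.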


Unsupervised Clustering

• A data mining method that builds models from data without predefined classes (see Table 1.3).

• Data instances are grouped together based on a similarity scheme defined by the clustering system.

• With the help of one or several evaluation techniques, it is up to us to decide the meaning of the formed clusters.


Table 1.3 – Acme Investors Incorporated

Customer ID  Account Type  Margin Account  Transaction Method  Trades/Month  Sex  Age    Favorite Recreation  Annual Income
1005         Joint         No              Online              12.5          F    30–39  Tennis               40–59K
1013         Custodial     No              Broker              0.5           F    50–59  Skiing               80–99K
1245         Joint         No              Online              3.6           M    20–29  Golf                 20–39K
2110         Individual    Yes             Broker              22.3          M    30–39  Fishing              40–59K
1001         Individual    Yes             Online              5             M    40–49  Golf                 60–79K
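To make "similarity scheme" concrete, the sketch below counts how many categorical attributes two Acme customers share. This simple-matching measure is only one hypothetical choice and is not taken from the slides; a real clustering system defines its own scheme and would also handle the numeric Trades/Month attribute, which is left out here.

```python
from itertools import combinations

# Categorical attributes from Table 1.3:
# (account type, margin account, transaction method, sex, age,
#  favorite recreation, annual income)
CUSTOMERS = {
    1005: ("Joint", "No", "Online", "F", "30-39", "Tennis", "40-59K"),
    1013: ("Custodial", "No", "Broker", "F", "50-59", "Skiing", "80-99K"),
    1245: ("Joint", "No", "Online", "M", "20-29", "Golf", "20-39K"),
    2110: ("Individual", "Yes", "Broker", "M", "30-39", "Fishing", "40-59K"),
    1001: ("Individual", "Yes", "Online", "M", "40-49", "Golf", "60-79K"),
}


def matching_attributes(a, b):
    """Number of attribute positions on which two customers agree."""
    return sum(1 for x, y in zip(a, b) if x == y)


# Score every pair of customers.
pair_scores = {
    (i, j): matching_attributes(CUSTOMERS[i], CUSTOMERS[j])
    for i, j in combinations(CUSTOMERS, 2)
}

# List pairs from most to least similar.
for (i, j), score in sorted(pair_scores.items(), key=lambda kv: -kv[1]):
    print(f"{i} and {j}: {score} of 7 attributes match")
```

Customers 1005 and 1245, for example, share three attributes (joint account, no margin account, online trading), while 1013 and 1001 share none. A clustering system would use scores like these to form groups, and it remains up to the analyst to decide what the groups mean.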


Possible Questions

Questions for supervised learning

1. Can I develop a general profile of an online investor? If so, what characteristics distinguish online investors from investors who use a broker?

2. Can I determine if a new customer who does not initially open a margin account is likely to do so in the future?

3. Can I build a model able to accurately predict the average number of trades per month for a new investor?

4. What characteristics differentiate female and male investors?

Questions for unsupervised learning

1. What attribute similarities group customers of Acme Investors together?

2. What differences in attribute values segment the customer database?


1.3 Is Data Mining Appropriate for My Problem?


Data Mining or Data Query?

• Shallow Knowledge
– Is factual; tools used: DBMS/SQL

• Multidimensional Knowledge
– Is factual; tools used: OLAP

• Hidden Knowledge
– Represents patterns or regularities in data that cannot be easily found; tools used: data mining

• Deep Knowledge
– Knowledge stored in a database that can only be found if we are given some direction.


Data Mining vs. Data Query: An Example

• Use data query if you already know, or almost know, what you are looking for.

• Use data mining to find regularities in data that are not obvious.
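A data query answers a question whose shape is known in advance. The sketch below uses Python's standard `sqlite3` module on three columns of Table 1.3 to ask which online customers make more than four trades per month. The table name `customers`, the column names, and the threshold are illustrative assumptions.

```python
import sqlite3

# In-memory database holding three columns of Table 1.3.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    "customer_id INTEGER, transaction_method TEXT, trades_per_month REAL)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [
        (1005, "Online", 12.5),
        (1013, "Broker", 0.5),
        (1245, "Online", 3.6),
        (2110, "Broker", 22.3),
        (1001, "Online", 5.0),
    ],
)

# The question is fully specified up front, so a query is enough.
rows = conn.execute(
    "SELECT customer_id FROM customers "
    "WHERE transaction_method = 'Online' AND trades_per_month > 4 "
    "ORDER BY customer_id"
).fetchall()
online_active = [row[0] for row in rows]
print(online_active)  # [1001, 1005]
conn.close()
```

The query returns facts already stored in the table (shallow knowledge). A question such as "what distinguishes online investors from broker-assisted ones?" has no such ready-made WHERE clause, which is where data mining comes in.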


1.4 Expert Systems or Data Mining?


Expert System and Knowledge Engineer

• An expert system is a computer program that emulates the problem-solving skills of one or more human experts.

• A knowledge engineer is a person trained to interact with an expert in order to capture their knowledge.


Figure - Data mining vs. expert systems (diagram elements)

• Data → Data Mining Tool → If Swollen Glands = Yes Then Diagnosis = Strep Throat
• Human Expert → Knowledge Engineer → Expert System Building Tool → If Swollen Glands = Yes Then Diagnosis = Strep Throat


1.5 A Simple Data Mining Process Model


Figure 1.3 - A simple data mining process model

Components shown: Operational Database; Data Warehouse; SQL Queries; Data Mining; Interpretation & Evaluation; Result Application


Characteristics of Data Warehouse

• Data Warehouse:
– Definition: a subject-oriented, integrated, time-variant, non-updatable collection of data used in support of management decision-making processes
– Subject-oriented: e.g. customers, patients, students, products
– Integrated: consistent naming conventions, formats, encoding structures; from multiple data sources
– Time-variant: can study trends and changes
– Non-updatable: read-only, periodically refreshed

• Data Mart:
– A data warehouse that is limited in scope


A four-step process for performing a data mining session

• 1. Assembling the data
– Operational database (relational databases and flat files) vs. data warehouse

• 2. Mining the data (giving the data to a mining tool)
– Instances for building the model or testing the model

• 3. Interpreting the results

• 4. Result application


1.7 Data Mining Applications (p.24)

• Fraud Detection

• Health care

• Business and finance

• Scientific applications

• Sports and gaming


Figure - Customer Intrinsic Value (scatter plot; axes: Intrinsic (Predicted) Value and Actual Value; labels A, B, C)