data mining and application part 1: data mining fundamentals part 2: tools for knowledge discovery...


Upload: clemence-webster

Post on 27-Dec-2015


Data Mining and Application

Part 1: Data Mining Fundamentals

Part 2: Tools for Knowledge Discovery

Part 3: Advanced Data Mining Techniques

Part 4: Intelligent Systems


Data Mining: A First View

Chapter 1

1.1 Data Mining: A Definition

• The process of employing one or more computer learning techniques to automatically analyze and extract knowledge from data.

Induction-based Learning

• The process of forming general concept definitions by observing specific examples of concepts to be learned.
  – Many televised golf tournaments are sponsored by online brokerage firms
  – Advertise rap music in magazines for senior citizens
  – Suspect a stolen credit card

Knowledge Discovery in Databases (KDD)

The application of the scientific method to data mining. Data mining is one step of the KDD process.

1.2 What Can Computers Learn?

Four Levels of Learning

• Facts
  – The sea is blue
• Concepts
  – Trees, rules, networks, and mathematical equations
• Procedures
  – A step-by-step course of action to achieve a goal
• Principles
  – General truths or laws

Concepts

• Computers are good at learning concepts. Concepts are the output of a data mining session.
• Three concept views
  – Classical view
  – Probabilistic view
  – Exemplar view

Classical View

All concepts have definite defining properties.

IF Annual Income >= 30,000 & Years at Current Position >= 5 & Owns Home = True
THEN Good Credit Risk = True

Probabilistic View

Concepts are represented by properties that are probable of concept members.

The majority of good credit risks own their own home.

Exemplar View

A given instance is determined to be an example of a particular concept.

Good credit risks example:

Annual Income = 32,000
Number of Years at Current Position = 6
Homeowner

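The three concept views can be contrasted on the credit-risk example from the preceding slides. The following is a minimal Python sketch and is not part of the original slides: the function names, the small `known_good_risks` list, and the tolerances used for the exemplar view are illustrative assumptions. Only the rule thresholds and the single exemplar (32,000, 6 years, homeowner) come from the slides.

```python
# Illustrative sketch of the three concept views for "good credit risk".
# Names and sample records are hypothetical unless noted otherwise.

def classical_view(income, years, owns_home):
    """Classical view: membership is decided by definite defining properties."""
    return income >= 30_000 and years >= 5 and owns_home

# Hypothetical known good credit risks:
# (annual income, years at current position, owns home)
known_good_risks = [
    (32_000, 6, True),   # the exemplar shown on the Exemplar View slide
    (45_000, 3, True),   # assumed record
    (38_000, 8, False),  # assumed record
]

def probabilistic_view(records):
    """Probabilistic view: how probable is home ownership among members?"""
    owners = sum(1 for _, _, owns_home in records if owns_home)
    return owners / len(records)

def exemplar_view(candidate, exemplars, income_tol=5_000, years_tol=2):
    """Exemplar view: a candidate is a member if it resembles a stored exemplar.
    The tolerances are arbitrary assumptions chosen for illustration."""
    income, years, owns_home = candidate
    return any(
        abs(income - e_income) <= income_tol
        and abs(years - e_years) <= years_tol
        and owns_home == e_home
        for e_income, e_years, e_home in exemplars
    )

print(classical_view(32_000, 6, True))
print(probabilistic_view(known_good_risks))
print(exemplar_view((30_000, 5, True), known_good_risks))
```

The classical view gives a yes/no answer from a fixed rule, the probabilistic view gives a proportion (here two of the three assumed members own a home), and the exemplar view depends entirely on which examples are stored and how similarity is measured.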

Supervised Learning

• Build a learner model using data instances of known origin.
• Use the model to determine the outcome of new instances of unknown origin.

Decision Tree

• A tree structure where nonterminal nodes represent tests on one or more attributes and terminal nodes reflect decision outcomes.

Table 1.1 • Hypothetical Training Data for Disease Diagnosis

Patient ID#  Sore Throat  Fever  Swollen Glands  Congestion  Headache  Diagnosis
1            Yes          Yes    Yes             Yes         Yes       Strep throat
2            No           No     No              Yes         Yes       Allergy
3            Yes          Yes    No              Yes         No        Cold
4            Yes          No     Yes             No          No        Strep throat
5            No           Yes    No              Yes         No        Cold
6            No           No     No              Yes         No        Allergy
7            No           No     Yes             No          No        Strep throat
8            Yes          No     No              Yes         Yes       Allergy
9            No           Yes    No              Yes         Yes       Cold
10           Yes          Yes    No              Yes         Yes       Cold

Swollen Glands
  Yes -> Diagnosis = Strep Throat
  No  -> Fever
           Yes -> Diagnosis = Cold
           No  -> Diagnosis = Allergy

Table 1.2 • Data Instances with an Unknown Classification

Patient ID#  Sore Throat  Fever  Swollen Glands  Congestion  Headache  Diagnosis
11           No           No     Yes             Yes         Yes       ?
12           Yes          Yes    No              No          Yes       ?
13           No           No     No              No          Yes       ?

Production Rules

IF Swollen Glands = Yes THEN Diagnosis = Strep Throat
IF Swollen Glands = No & Fever = Yes THEN Diagnosis = Cold
IF Swollen Glands = No & Fever = No THEN Diagnosis = Allergy

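The three production rules can be written as a small function, checked against the training data of Table 1.1, and then applied to the unclassified instances of Table 1.2. This is a minimal Python sketch: the names `diagnose`, `TRAINING`, and `UNKNOWN` are illustrative assumptions, and only the two attributes the rules actually test (swollen glands and fever) are encoded.

```python
# Sketch: the production rules as code. True = Yes, False = No.
# Function and variable names are assumptions made for illustration.

def diagnose(swollen_glands: bool, fever: bool) -> str:
    """Apply the three production rules in order."""
    if swollen_glands:
        return "Strep throat"
    if fever:
        return "Cold"
    return "Allergy"

# Table 1.1: patient ID -> (swollen glands, fever, known diagnosis)
TRAINING = {
    1: (True, True, "Strep throat"),
    2: (False, False, "Allergy"),
    3: (False, True, "Cold"),
    4: (True, False, "Strep throat"),
    5: (False, True, "Cold"),
    6: (False, False, "Allergy"),
    7: (True, False, "Strep throat"),
    8: (False, False, "Allergy"),
    9: (False, True, "Cold"),
    10: (False, True, "Cold"),
}

# Table 1.2: patient ID -> (swollen glands, fever), diagnosis unknown
UNKNOWN = {
    11: (True, False),
    12: (False, True),
    13: (False, False),
}

# The rules should reproduce every known diagnosis in the training data.
consistent = all(diagnose(sg, fever) == dx for sg, fever, dx in TRAINING.values())
print("Rules agree with Table 1.1:", consistent)

predictions = {pid: diagnose(sg, fever) for pid, (sg, fever) in UNKNOWN.items()}
for pid, dx in predictions.items():
    print(f"Patient {pid}: {dx}")
```

Under these rules, Patient 11 is classified as Strep throat, Patient 12 as Cold, and Patient 13 as Allergy.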

Unsupervised Clustering

A data mining method that builds models from data without predefined classes.

Table 1.3 • Acme Investors Incorporated

Customer ID  Account Type  Margin Account  Transaction Method  Trades/Month  Sex  Age    Favorite Recreation  Annual Income
1005         Joint         No              Online              12.5          F    30–39  Tennis               40–59K
1013         Custodial     No              Broker              0.5           F    50–59  Skiing               80–99K
1245         Joint         No              Online              3.6           M    20–29  Golf                 20–39K
2110         Individual    Yes             Broker              22.3          M    30–39  Fishing              40–59K
1001         Individual    Yes             Online              5.0           M    40–49  Golf                 60–79K

Question

• Can I develop a general profile of an online investor?
• Can I determine if a new customer who does not initially open a margin account is likely to do so in the future?
• Can I build a model able to accurately predict the average number of trades per month for a new investor?
• What characteristics differentiate female and male investors?

Candidate questions for unsupervised clustering

• What attribute similarities group customers?
• What differences in attribute values segment the customer database?

Three Clusters

IF Margin Account = Yes & Age = 20-29 & Annual Income = 40-59K
THEN Cluster = 1
Accuracy = 0.8, Coverage = 0.5

• Accuracy: the rule's confidence over all instances. Example: with accuracy 0.8, the rule is wrong for 20% of the instances that satisfy its conditions.
• Coverage: the rule's significance for the cluster. Example: with coverage 0.5, 50% of the instances in the cluster satisfy the conditions.

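Rule accuracy and coverage can be computed directly from their descriptions: accuracy is the share of instances satisfying the rule's conditions that really belong to the cluster, and coverage is the share of the cluster's instances that satisfy the conditions. The Python sketch below uses a made-up set of 12 instances, constructed so that the Cluster 1 rule comes out at accuracy 0.8 and coverage 0.5. The data, the `make` helper, and the function names are assumptions for illustration only.

```python
# Sketch: computing accuracy and coverage for a cluster rule.
# The instances below are fabricated to illustrate the arithmetic.

def make(margin, age, income, cluster):
    return {"margin": margin, "age": age, "income": income, "cluster": cluster}

instances = (
    [make("Yes", "20-29", "40-59K", 1)] * 4    # satisfy the rule, in cluster 1
    + [make("No", "30-39", "40-59K", 1)] * 4   # in cluster 1, rule not satisfied
    + [make("Yes", "20-29", "40-59K", 2)] * 1  # satisfy the rule, but in cluster 2
    + [make("No", "50-59", "80-99K", 2)] * 3   # neither
)

def cluster_1_rule(x):
    """Conditions of the Cluster 1 rule from the slide."""
    return x["margin"] == "Yes" and x["age"] == "20-29" and x["income"] == "40-59K"

def rule_accuracy_and_coverage(data, satisfies_rule, target_cluster):
    matched = [x for x in data if satisfies_rule(x)]
    in_cluster = [x for x in data if x["cluster"] == target_cluster]
    hits = [x for x in matched if x["cluster"] == target_cluster]
    # Guard against empty denominators.
    accuracy = len(hits) / len(matched) if matched else 0.0
    coverage = len(hits) / len(in_cluster) if in_cluster else 0.0
    return accuracy, coverage

accuracy, coverage = rule_accuracy_and_coverage(instances, cluster_1_rule, 1)
print(f"accuracy={accuracy:.2f}, coverage={coverage:.2f}")
```

Here five instances satisfy the conditions and four of them are in Cluster 1 (4/5 = 0.8), while Cluster 1 holds eight instances of which four satisfy the conditions (4/8 = 0.5).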

Other two rules

IF Account Type = Custodial & Favorite Recreation = Skiing & Annual Income = 80-90K
THEN Cluster = 2
Accuracy = 0.95, Coverage = 0.35

IF Account Type = Joint & Trades/Month > 5 & Transaction Method = Online
THEN Cluster = 3
Accuracy = 0.82, Coverage = 0.65

1.3 Is Data Mining Appropriate for My Problem?

Data Mining or Data Query?

• Shallow Knowledge
  – Is factual
• Multidimensional Knowledge
  – Is factual and stored in a multidimensional format
• Hidden Knowledge
  – Patterns or regularities
• Deep Knowledge
  – Need some direction to find it

Data Mining vs. Data Query

• Use data query if you already know, or almost know, what you are looking for.
• Use data mining to find regularities in data that are not obvious.

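A data query answers a question whose form is already known. As a minimal sketch, the Python snippet below loads the five customers of Table 1.3 into an in-memory SQLite database and asks for the average number of trades per month by transaction method. The table name `customers` and the column names are assumptions chosen for this example.

```python
import sqlite3

# Sketch: a data query over Table 1.3. Table and column names are assumed.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE customers (
           customer_id INTEGER, account_type TEXT, margin_account TEXT,
           transaction_method TEXT, trades_per_month REAL, sex TEXT,
           age TEXT, favorite_recreation TEXT, annual_income TEXT)"""
)
rows = [
    (1005, "Joint", "No", "Online", 12.5, "F", "30-39", "Tennis", "40-59K"),
    (1013, "Custodial", "No", "Broker", 0.5, "F", "50-59", "Skiing", "80-99K"),
    (1245, "Joint", "No", "Online", 3.6, "M", "20-29", "Golf", "20-39K"),
    (2110, "Individual", "Yes", "Broker", 22.3, "M", "30-39", "Fishing", "40-59K"),
    (1001, "Individual", "Yes", "Online", 5.0, "M", "40-49", "Golf", "60-79K"),
]
conn.executemany("INSERT INTO customers VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)", rows)

# We already know what we are looking for: one number per transaction method.
result = conn.execute(
    """SELECT transaction_method, AVG(trades_per_month)
       FROM customers
       GROUP BY transaction_method
       ORDER BY transaction_method"""
).fetchall()
conn.close()

for method, avg_trades in result:
    print(f"{method}: {avg_trades:.2f} trades/month")
```

The query returns exactly what was asked for and nothing more. Finding which combinations of attributes separate groups of customers, without naming those attributes in advance, is the kind of task the slides assign to data mining.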

1.4 Expert Systems or Data Mining?

Expert System

• A computer program that emulates the problem-solving skills of one or more human experts.

Knowledge Engineer

A person trained to interact with an expert in order to capture the expert's knowledge.

Data -> Data Mining Tool -> IF Swollen Glands = Yes THEN Diagnosis = Strep Throat

Human Expert -> Knowledge Engineer -> Expert System Building Tool -> IF Swollen Glands = Yes THEN Diagnosis = Strep Throat

1.5 A Simple Data Mining Process Model

• Assembling the Data
• Mining the Data
• Interpreting the Results
• Result Application

[Figure: process model diagram. Labels: Operational Database, SQL Queries, Data Warehouse, Data Mining, Interpretation & Evaluation, Result Application]

Assembling the Data

• The Data Warehouse
  – Only data useful for decision support is extracted from the operational environment
• Relational Databases and Flat Files

Mining the Data

• Supervised learning or unsupervised?
• Which instances will be used?
• Which attributes will be selected?
• How will the learning parameters be set?

Interpreting the Results

• If the results are less than optimal, we can repeat the data mining step using new attributes and/or instances.

Result Application

• Apply what has been discovered to new situations
  – Baby diapers and beer

1.6 Why Not Simple Search?

• Nearest Neighbor Classifier
• K-nearest Neighbor Classifier
• Problems:
  – Computation time
  – Differentiating between relevant and irrelevant attributes
  – Determining which attributes are able to differentiate the classes

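The problems listed above can be seen on the disease data of Table 1.1. The following Python sketch implements a plain k-nearest neighbor classifier with a Hamming distance (the number of attributes on which two patients differ). The distance measure, the 0/1 encoding, and all names are assumptions for illustration; the slides do not specify them.

```python
from collections import Counter

# Attribute order: sore throat, fever, swollen glands, congestion, headache
# (1 = Yes, 0 = No). Rows are the ten patients of Table 1.1.
TRAINING = [
    ((1, 1, 1, 1, 1), "Strep throat"),  # patient 1
    ((0, 0, 0, 1, 1), "Allergy"),       # patient 2
    ((1, 1, 0, 1, 0), "Cold"),          # patient 3
    ((1, 0, 1, 0, 0), "Strep throat"),  # patient 4
    ((0, 1, 0, 1, 0), "Cold"),          # patient 5
    ((0, 0, 0, 1, 0), "Allergy"),       # patient 6
    ((0, 0, 1, 0, 0), "Strep throat"),  # patient 7
    ((1, 0, 0, 1, 1), "Allergy"),       # patient 8
    ((0, 1, 0, 1, 1), "Cold"),          # patient 9
    ((1, 1, 0, 1, 1), "Cold"),          # patient 10
]

def distance(a, b, attributes):
    """Hamming distance restricted to the given attribute positions."""
    return sum(a[i] != b[i] for i in attributes)

def knn_classify(query, training, k=1, attributes=(0, 1, 2, 3, 4)):
    """Majority vote among the k training instances closest to the query."""
    ranked = sorted(training, key=lambda row: distance(query, row[0], attributes))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

patient_11 = (0, 0, 1, 1, 1)  # from Table 1.2

all_attributes = knn_classify(patient_11, TRAINING)
relevant_only = knn_classify(patient_11, TRAINING, attributes=(1, 2))
print("All five attributes:      ", all_attributes)
print("Fever and swollen glands: ", relevant_only)
```

With all five attributes, the single nearest neighbor of Patient 11 is Patient 2 (they differ only in swollen glands), so the sketch answers Allergy, whereas the production rules shown earlier give Strep throat. Restricting the distance to fever and swollen glands gives Strep throat. Every query also compares the new instance against the whole training set, which is the computation-time concern.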

1.7 Data Mining Applications

• Fraud Detection
• Health Care
• Business and Finance
• Scientific Applications
• Sports and Gaming

Customer Intrinsic Value

• A customer's expected value based on the historical value of similar customers.
• Once it is determined, an appropriate marketing strategy can be applied.

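One simple way to read the definition above is to estimate a customer's intrinsic value as the average historical value of customers judged similar. The Python sketch below does this under strong assumptions: "similar" means "in the same hand-labeled segment", and the segment names and values are invented for illustration. The slides do not say how similarity or value is measured.

```python
from statistics import mean

# Hypothetical history: (segment, observed customer value).
history = [
    ("online-golf", 1200.0),
    ("online-golf", 1000.0),
    ("broker-skiing", 400.0),
    ("broker-skiing", 600.0),
]

def intrinsic_value(segment, records):
    """Expected value = mean historical value of customers in the same segment."""
    similar = [value for seg, value in records if seg == segment]
    if not similar:
        raise ValueError(f"no historical customers in segment {segment!r}")
    return mean(similar)

def value_gap(actual_value, segment, records):
    """Actual value minus intrinsic (predicted) value; negative means below expectation."""
    return actual_value - intrinsic_value(segment, records)

predicted = intrinsic_value("online-golf", history)
gap = value_gap(700.0, "online-golf", history)
print(f"intrinsic value: {predicted}, gap for a customer worth 700: {gap}")
```

Comparing a customer's actual value with the estimate is one possible input to choosing a marketing strategy; how to act on the gap is not specified in the slides.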

[Figure: scatter plot with axes Intrinsic (Predicted) Value and Actual Value; customers are plotted as X and _ marks]