data mining and application part 1: data mining fundamentals part 2: tools for knowledge discovery...

Data Mining and Application

Part 1: Data Mining Fundamentals

Part 2: Tools for Knowledge Discovery

Part 3: Advanced Data Mining Techniques

Part 4: Intelligent Systems

Data Mining: A First View

Chapter 1

1.1 Data Mining: A Definition

• The process of employing one or more computer learning techniques to automatically analyze and extract knowledge from data.

Induction-based Learning

• The process of forming general concept definitions by observing specific examples of concepts to be learned.
  – Many televised golf tournaments are sponsored by online brokerage firms
  – Advertise rap music in magazines for senior citizens
  – Suspect a stolen credit card

Knowledge Discovery in Databases (KDD)

The application of the scientific method to data mining. Data mining is one step of the KDD process.

1.2 What Can Computers Learn?

Four Levels of Learning

• Facts
  – The sea is blue
• Concepts
  – Trees, rules, networks, and mathematical equations
• Procedures
  – A step-by-step course of action to achieve a goal
• Principles
  – General truths or laws

Concepts

• Computers are good at learning concepts. Concepts are the output of a data mining session.
• Three concept views
  – Classical view
  – Probabilistic view
  – Exemplar view

Classical View

All concepts have definite defining properties.

IF Annual Income >= 30,000 & Years at Current Position >= 5 & Owns Home = True
THEN Good Credit Risk = True
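The classical-view rule above can be read as a plain Boolean test: an instance belongs to the concept only if every defining property holds. The following is a minimal sketch; the function name and parameter names are illustrative assumptions, not part of the original slides.

```python
def is_good_credit_risk(annual_income: float,
                        years_at_current_position: float,
                        owns_home: bool) -> bool:
    """Classical view: all defining properties must hold at once."""
    return (annual_income >= 30_000
            and years_at_current_position >= 5
            and owns_home)


# Two hypothetical applicants, differing only in years at current position.
print(is_good_credit_risk(32_000, 6, True))   # True
print(is_good_credit_risk(32_000, 4, True))   # False
```

Under this view there is no partial membership: failing a single condition is enough to exclude the applicant from the concept.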

Probabilistic View

Concepts are represented by properties that are probable of concept members.

Example: The majority of good credit risks own their own home.

Exemplar View

A given instance is determined to be an example of a particular concept when it is similar enough to known examples of that concept.

Example of a good credit risk:

Annual Income = 32,000
Number of Years at Current Position = 6
Homeowner

Supervised Learning

• Build a learner model using data instances of known origin.
• Use the model to determine the outcome of new instances of unknown origin.

Decision Tree

• A tree structure where nonterminal nodes represent tests on one or more attributes and terminal nodes reflect decision outcomes.

Table 1.1 • Hypothetical Training Data for Disease Diagnosis

Patient ID#  Sore Throat  Fever  Swollen Glands  Congestion  Headache  Diagnosis
1            Yes          Yes    Yes             Yes         Yes       Strep throat
2            No           No     No              Yes         Yes       Allergy
3            Yes          Yes    No              Yes         No        Cold
4            Yes          No     Yes             No          No        Strep throat
5            No           Yes    No              Yes         No        Cold
6            No           No     No              Yes         No        Allergy
7            No           No     Yes             No          No        Strep throat
8            Yes          No     No              Yes         Yes       Allergy
9            No           Yes    No              Yes         Yes       Cold
10           Yes          Yes    No              Yes         Yes       Cold

Decision tree for the data in Table 1.1:

Swollen Glands
  Yes: Diagnosis = Strep Throat
  No: Fever
    Yes: Diagnosis = Cold
    No: Diagnosis = Allergy
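As a rough illustration, the decision tree for Table 1.1 can be written as a nested structure and checked against the ten training instances. The dictionary layout and the names `TREE`, `FIELDS` and `classify` are assumptions made for this sketch. The tree is entered by hand here, not learned from the data.

```python
FIELDS = ("sore_throat", "fever", "swollen_glands",
          "congestion", "headache", "diagnosis")

# Rows of Table 1.1, patients 1 through 10.
ROWS = [
    ("Yes", "Yes", "Yes", "Yes", "Yes", "Strep throat"),
    ("No",  "No",  "No",  "Yes", "Yes", "Allergy"),
    ("Yes", "Yes", "No",  "Yes", "No",  "Cold"),
    ("Yes", "No",  "Yes", "No",  "No",  "Strep throat"),
    ("No",  "Yes", "No",  "Yes", "No",  "Cold"),
    ("No",  "No",  "No",  "Yes", "No",  "Allergy"),
    ("No",  "No",  "Yes", "No",  "No",  "Strep throat"),
    ("Yes", "No",  "No",  "Yes", "Yes", "Allergy"),
    ("No",  "Yes", "No",  "Yes", "Yes", "Cold"),
    ("Yes", "Yes", "No",  "Yes", "Yes", "Cold"),
]
TRAINING = [dict(zip(FIELDS, row)) for row in ROWS]

# Nonterminal nodes are dicts that test one attribute; terminal nodes are labels.
TREE = {
    "attribute": "swollen_glands",
    "branches": {
        "Yes": "Strep throat",
        "No": {
            "attribute": "fever",
            "branches": {"Yes": "Cold", "No": "Allergy"},
        },
    },
}


def classify(tree, instance):
    """Follow the branch matching each tested attribute until a leaf is reached."""
    node = tree
    while isinstance(node, dict):
        node = node["branches"][instance[node["attribute"]]]
    return node


correct = sum(classify(TREE, p) == p["diagnosis"] for p in TRAINING)
print(f"{correct} of {len(TRAINING)} training instances classified correctly")
```

Only two of the five attributes are ever tested, which is why the tree is so much smaller than the table it summarizes.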

Table 1.2 • Data Instances with an Unknown Classification

Patient ID#  Sore Throat  Fever  Swollen Glands  Congestion  Headache  Diagnosis
11           No           No     Yes             Yes         Yes       ?
12           Yes          Yes    No              No          Yes       ?
13           No           No     No              No          Yes       ?

Production Rules

IF Swollen Glands = Yes THEN Diagnosis = Strep Throat
IF Swollen Glands = No & Fever = Yes THEN Diagnosis = Cold
IF Swollen Glands = No & Fever = No THEN Diagnosis = Allergy
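The three production rules can be applied mechanically to the unclassified patients of Table 1.2. The sketch below stores each rule as a set of conditions plus a conclusion; the names `RULES`, `UNKNOWN` and `diagnose` are illustrative assumptions.

```python
# Each rule: (conditions that must all hold, resulting diagnosis).
RULES = [
    ({"swollen_glands": "Yes"}, "Strep Throat"),
    ({"swollen_glands": "No", "fever": "Yes"}, "Cold"),
    ({"swollen_glands": "No", "fever": "No"}, "Allergy"),
]

# Patients 11 through 13 from Table 1.2.
UNKNOWN = {
    11: {"sore_throat": "No", "fever": "No", "swollen_glands": "Yes",
         "congestion": "Yes", "headache": "Yes"},
    12: {"sore_throat": "Yes", "fever": "Yes", "swollen_glands": "No",
         "congestion": "No", "headache": "Yes"},
    13: {"sore_throat": "No", "fever": "No", "swollen_glands": "No",
         "congestion": "No", "headache": "Yes"},
}


def diagnose(patient):
    """Return the conclusion of the first rule whose conditions all match."""
    for conditions, diagnosis in RULES:
        if all(patient[attr] == value for attr, value in conditions.items()):
            return diagnosis
    return None  # no rule fired


predictions = {pid: diagnose(p) for pid, p in UNKNOWN.items()}
for pid, diagnosis in predictions.items():
    print(f"Patient {pid}: {diagnosis}")
# Patient 11: Strep Throat
# Patient 12: Cold
# Patient 13: Allergy
```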

Unsupervised Clustering

A data mining method that builds models from data without predefined classes.

Table 1.3 • Acme Investors Incorporated

Customer ID  Account Type  Margin Account  Transaction Method  Trades/Month  Sex  Age    Favorite Recreation  Annual Income
1005         Joint         No              Online              12.5          F    30–39  Tennis               40–59K
1013         Custodial     No              Broker              0.5           F    50–59  Skiing               80–99K
1245         Joint         No              Online              3.6           M    20–29  Golf                 20–39K
2110         Individual    Yes             Broker              22.3          M    30–39  Fishing              40–59K
1001         Individual    Yes             Online              5.0           M    40–49  Golf                 60–79K

Questions

• Can I develop a general profile of an online investor?
• Can I determine if a new customer who does not initially open a margin account is likely to do so in the future?
• Can I build a model able to accurately predict the average number of trades per month for a new investor?
• What characteristics differentiate female and male investors?

Candidate questions for unsupervised clustering

What attribute similarities group customers?

What differences in attribute values segment the customer database?

Three Clusters

IF Margin Account = Yes & Age = 20-29 & Annual Income = 40-59K
THEN Cluster = 1
Accuracy = 0.8, Coverage = 0.5

• Accuracy: the rule's confidence over all instances. Example: this rule is wrong for 20% of the instances that satisfy its conditions.
• Coverage: the rule's significance for the cluster. Example: 50% of the instances in the cluster satisfy the rule's conditions.

Other Two Rules

IF Account Type = Custodial & Favorite Recreation = Skiing & Annual Income = 80-90K
THEN Cluster = 2
Accuracy = 0.95, Coverage = 0.35

IF Account Type = Joint & Trades/Month > 5 & Transaction Method = Online
THEN Cluster = 3
Accuracy = 0.82, Coverage = 0.65
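The accuracy and coverage figures attached to these rules can be computed directly from clustered instances. The sketch below uses a small hypothetical data set, built so that the first rule's figures come out at 0.8 and 0.5; it is not the Acme Investors data. The helper names and the two-attribute simplification of the rule are assumptions made for illustration.

```python
def make_instance(margin_account, age, cluster):
    """Hypothetical clustered instance with only the attributes needed here."""
    return {"margin_account": margin_account, "age": age, "cluster": cluster}


# Hypothetical instances, each already assigned to a cluster by the clustering step.
INSTANCES = (
    [make_instance("Yes", "20-29", 1)] * 4    # satisfy the rule, in cluster 1
    + [make_instance("Yes", "20-29", 2)] * 1  # satisfy the rule, wrong cluster
    + [make_instance("No", "30-39", 1)] * 4   # in cluster 1, rule does not fire
    + [make_instance("No", "50-59", 2)] * 3   # unrelated instances
)


def rule_accuracy_and_coverage(instances, conditions, cluster):
    """Accuracy: share of matching instances that are in the cluster.
    Coverage: share of the cluster's instances that match the rule."""
    matching = [i for i in instances
                if all(i[attr] == value for attr, value in conditions.items())]
    members = [i for i in instances if i["cluster"] == cluster]
    hits = [i for i in matching if i["cluster"] == cluster]
    accuracy = len(hits) / len(matching) if matching else 0.0
    coverage = len(hits) / len(members) if members else 0.0
    return accuracy, coverage


accuracy, coverage = rule_accuracy_and_coverage(
    INSTANCES, {"margin_account": "Yes", "age": "20-29"}, cluster=1)
print(f"Accuracy = {accuracy:.2f}, Coverage = {coverage:.2f}")
# Accuracy = 0.80, Coverage = 0.50
```

A rule can be highly accurate yet cover only a small part of its cluster, so the two figures are best read together.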

1.3 Is Data Mining Appropriate for My Problem?

Data Mining or Data Query?

• Shallow Knowledge
  – Is factual
• Multidimensional Knowledge
  – Is factual and stored in a multidimensional format
• Hidden Knowledge
  – Patterns or regularities in the data
• Deep Knowledge
  – Can be found only with some direction about what to look for

Data Mining vs. Data Query

• Use a data query if you already have a good idea of what you are looking for.
• Use data mining to find regularities in data that are not obvious.

1.4 Expert Systems or Data Mining?

Expert System

• A computer program that emulates the problem-solving skills of one or more human experts.

Knowledge Engineer

A person trained to interact with an expert in order to capture their knowledge.

Figure: data mining versus expert systems.

Data → Data Mining Tool → IF Swollen Glands = Yes THEN Diagnosis = Strep Throat
Human Expert → Knowledge Engineer → Expert System Building Tool → IF Swollen Glands = Yes THEN Diagnosis = Strep Throat

1.5 A Simple Data Mining Process Model

• Assembling the Data
• Mining the Data
• Interpreting the Results
• Result Application

Figure: a simple data mining process model. Components shown: Operational Database, SQL Queries, Data Warehouse, Data Mining, Interpretation & Evaluation, Result Application.

Assembling the Data

• The Data Warehouse
  – Only data useful for decision support is extracted from the operational environment

• Relational Databases and Flat Files

Mining the Data

• Supervised learning or unsupervised?
• Which instances will be used?
• Which attributes will be selected?
• Setting learning parameters

Interpreting the Results

• If the results are less than optimal, we can repeat the data mining step using new attributes and/or instances.

Result Application

• Apply what has been discovered to new situations
  – Baby diapers and beer

1.6 Why Not Simple Search?

• Nearest Neighbor Classifier
• K-nearest Neighbor Classifier
• Problems:
  – Computation times
  – Differentiating between relevant and irrelevant attributes
  – Determining which attributes are able to differentiate the classes
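To make the listed problems concrete, here is a minimal k-nearest neighbor sketch over the Table 1.1 training data. It assumes a Yes = 1 / No = 0 encoding, a Hamming distance (the count of differing attributes), and majority voting; none of these choices come from the slides, and the function names are hypothetical.

```python
from collections import Counter

# Table 1.1 encoded as (sore throat, fever, swollen glands, congestion, headache).
TRAINING = [
    ((1, 1, 1, 1, 1), "Strep throat"),
    ((0, 0, 0, 1, 1), "Allergy"),
    ((1, 1, 0, 1, 0), "Cold"),
    ((1, 0, 1, 0, 0), "Strep throat"),
    ((0, 1, 0, 1, 0), "Cold"),
    ((0, 0, 0, 1, 0), "Allergy"),
    ((0, 0, 1, 0, 0), "Strep throat"),
    ((1, 0, 0, 1, 1), "Allergy"),
    ((0, 1, 0, 1, 1), "Cold"),
    ((1, 1, 0, 1, 1), "Cold"),
]


def hamming(a, b):
    """Number of attributes on which two instances differ."""
    return sum(x != y for x, y in zip(a, b))


def knn_classify(instance, training, k=1):
    """Scan every training instance, keep the k closest, and take a majority vote."""
    neighbors = sorted(training, key=lambda pair: hamming(instance, pair[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


patient_11 = (0, 0, 1, 1, 1)  # from Table 1.2: swollen glands, congestion, headache
print(knn_classify(patient_11, TRAINING, k=1))  # Allergy
```

Every classification compares the new instance with the whole training set, which is the computation-time problem. Because all five attributes count equally, patient 11 lands nearest to patient 2 and is labeled Allergy, whereas the rule IF Swollen Glands = Yes THEN Diagnosis = Strep Throat gives Strep Throat. That disagreement is the relevant-versus-irrelevant attribute problem in miniature.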

1.7 Data Mining Applications

• Fraud Detection
• Health Care
• Business and Finance
• Scientific Applications
• Sports and Gaming

Customer Intrinsic Value

• A customer’s expected value, based on the historical value of similar customers.
• Once it is determined, an appropriate marketing strategy can be applied.
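One simple way to read this definition is to average the historical value of customers who resemble the new one. The sketch below treats "similar" as an exact match on two attributes and uses invented figures; the slides specify neither, so the data, the similarity test and the function name are all hypothetical.

```python
from statistics import mean

# Hypothetical history: customer attributes plus the value each one generated.
HISTORY = [
    {"account_type": "Joint", "transaction_method": "Online", "value": 1200.0},
    {"account_type": "Joint", "transaction_method": "Online", "value": 800.0},
    {"account_type": "Individual", "transaction_method": "Broker", "value": 3000.0},
    {"account_type": "Joint", "transaction_method": "Broker", "value": 500.0},
]


def intrinsic_value(customer, history,
                    keys=("account_type", "transaction_method")):
    """Mean historical value of customers matching on every key (assumed similarity test)."""
    similar = [h["value"] for h in history
               if all(h[k] == customer[k] for k in keys)]
    return mean(similar) if similar else None  # None: no similar customers on record


new_customer = {"account_type": "Joint", "transaction_method": "Online"}
print(intrinsic_value(new_customer, HISTORY))  # 1000.0
```

Comparing this predicted figure with the value a customer actually generates is what allows a marketing strategy to be chosen per customer.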

Figure: plot of Intrinsic (Predicted) Value against Actual Value, with individual customers marked X or _.
