Data Mining and Application
Part 1: Data Mining Fundamentals
Part 2: Tools for Knowledge Discovery
Part 3: Advanced Data Mining Techniques
Part 4: Intelligent Systems
Data Mining: A First View
Chapter 1
1.1 Data Mining: A Definition
•The process of employing one or more computer learning techniques to automatically analyze and extract knowledge from data.
Induction-based Learning
•The process of forming general concept definitions by observing specific examples of concepts to be learned.
–Many televised golf tournaments are sponsored by online brokerage firms
–Advertise rap music in magazines for senior citizens
–Suspect a stolen credit card
Knowledge Discovery in Databases (KDD)
The application of the scientific method to data mining. Data mining is one step of the KDD process.
1.2 What Can Computers Learn?
Four Levels of Learning
• Facts
–Sea is blue
• Concepts
–Trees, rules, networks, and mathematical equations
• Procedures
–A step-by-step course of action to achieve a goal
• Principles
–General truths or laws
Concepts
• Computers are good at learning concepts. Concepts are the output of a data mining session.
• Three concept views
–Classical view
–Probabilistic view
–Exemplar view
Classical View
All concepts have definite defining properties
IF Annual Income >= 30,000 & Years at Current Position >= 5 & Owns Home = True
THEN Good Credit Risk = True
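Under the classical view a concept is a crisp test: an instance either meets every defining property or it does not. A minimal Python sketch of the rule above follows. The function and parameter names are illustrative and do not come from the slides.

```python
def is_good_credit_risk(annual_income, years_at_current_position, owns_home):
    """Classical-view rule: all three defining properties must hold."""
    return bool(
        annual_income >= 30_000
        and years_at_current_position >= 5
        and owns_home
    )

# An applicant with income 32,000, 6 years in position, who owns a home.
print(is_good_credit_risk(32_000, 6, True))   # True
# Same applicant with only 4 years in position fails one property.
print(is_good_credit_risk(32_000, 4, True))   # False
```

Missing a single property is enough to exclude an instance, which is what separates this view from the probabilistic and exemplar views.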
Probabilistic View
Represented by properties that are probable of concept members
The majority of good credit risks own their own home
Exemplar View
A given instance is determined to be an example of a particular concept if it is similar enough to known examples of that concept
Good credit risk example:
Annual Income = 32,000
Number of Years at Current Position = 6
Homeowner
Supervised Learning
• Build a learner model using data instances of known origin.
• Use the model to determine the outcome of new instances of unknown origin.
Decision Tree
•A tree structure where nonterminal nodes represent tests on one or more attributes and terminal nodes reflect decision outcomes.
Table 1.1 • Hypothetical Training Data for Disease Diagnosis
Patient ID#   Sore Throat   Fever   Swollen Glands   Congestion   Headache   Diagnosis
1             Yes           Yes     Yes              Yes          Yes        Strep throat
2             No            No      No               Yes          Yes        Allergy
3             Yes           Yes     No               Yes          No         Cold
4             Yes           No      Yes              No           No         Strep throat
5             No            Yes     No               Yes          No         Cold
6             No            No      No               Yes          No         Allergy
7             No            No      Yes              No           No         Strep throat
8             Yes           No      No               Yes          Yes        Allergy
9             No            Yes     No               Yes          Yes        Cold
10            Yes           Yes     No               Yes          Yes        Cold
Decision tree for the data in Table 1.1:
Swollen Glands
  Yes: Diagnosis = Strep Throat
  No: Fever
    Yes: Diagnosis = Cold
    No: Diagnosis = Allergy
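The tree above can be written as a nested structure and walked from the root to a leaf. The sketch below is one possible encoding, not the output of any particular tool. The dictionary layout and all names are assumptions made for illustration, and the rows are copied from Table 1.1.

```python
FIELDS = ("sore_throat", "fever", "swollen_glands", "congestion", "headache")

# Table 1.1 rows: the five attribute values in FIELDS order, then the diagnosis.
TRAINING = [
    ("Yes", "Yes", "Yes", "Yes", "Yes", "Strep throat"),
    ("No", "No", "No", "Yes", "Yes", "Allergy"),
    ("Yes", "Yes", "No", "Yes", "No", "Cold"),
    ("Yes", "No", "Yes", "No", "No", "Strep throat"),
    ("No", "Yes", "No", "Yes", "No", "Cold"),
    ("No", "No", "No", "Yes", "No", "Allergy"),
    ("No", "No", "Yes", "No", "No", "Strep throat"),
    ("Yes", "No", "No", "Yes", "Yes", "Allergy"),
    ("No", "Yes", "No", "Yes", "Yes", "Cold"),
    ("Yes", "Yes", "No", "Yes", "Yes", "Cold"),
]

# Nonterminal nodes are dicts that test one attribute; terminal nodes are strings.
TREE = {
    "attribute": "swollen_glands",
    "branches": {
        "Yes": "Strep throat",
        "No": {
            "attribute": "fever",
            "branches": {"Yes": "Cold", "No": "Allergy"},
        },
    },
}

def classify(tree, instance):
    """Follow the branch matching each tested attribute until a leaf is reached."""
    node = tree
    while isinstance(node, dict):
        node = node["branches"][instance[node["attribute"]]]
    return node

# Count how many training instances the tree labels with their recorded diagnosis.
correct = sum(
    classify(TREE, dict(zip(FIELDS, row))) == row[-1] for row in TRAINING
)
print(f"{correct} of {len(TRAINING)} training instances classified correctly")
```

Only two of the five attributes are ever tested, so sore throat, congestion, and headache play no part in the decision.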
Table 1.2 • Data Instances with an Unknown Classification
Patient ID#   Sore Throat   Fever   Swollen Glands   Congestion   Headache   Diagnosis
11            No            No      Yes              Yes          Yes        ?
12            Yes           Yes     No               No           Yes        ?
13            No            No      No               No           Yes        ?
Production Rules
IF Swollen Glands = Yes THEN Diagnosis = Strep Throat
IF Swollen Glands = No & Fever = Yes THEN Diagnosis = Cold
IF Swollen Glands = No & Fever = No THEN Diagnosis = Allergy
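As a rough illustration, the three production rules can be stored as condition/conclusion pairs and applied to the unclassified instances of Table 1.2. The data layout and names below are assumptions for the sketch.

```python
# Each rule: (conditions that must all match, resulting diagnosis).
RULES = [
    ({"swollen_glands": "Yes"}, "Strep throat"),
    ({"swollen_glands": "No", "fever": "Yes"}, "Cold"),
    ({"swollen_glands": "No", "fever": "No"}, "Allergy"),
]

# Table 1.2: instances whose diagnosis is unknown.
UNKNOWN = {
    11: {"sore_throat": "No", "fever": "No", "swollen_glands": "Yes",
         "congestion": "Yes", "headache": "Yes"},
    12: {"sore_throat": "Yes", "fever": "Yes", "swollen_glands": "No",
         "congestion": "No", "headache": "Yes"},
    13: {"sore_throat": "No", "fever": "No", "swollen_glands": "No",
         "congestion": "No", "headache": "Yes"},
}

def diagnose(instance, rules=RULES):
    """Return the conclusion of the first rule whose conditions all hold."""
    for conditions, diagnosis in rules:
        if all(instance[attr] == value for attr, value in conditions.items()):
            return diagnosis
    return None  # no rule fired

for patient_id, instance in UNKNOWN.items():
    print(patient_id, diagnose(instance))
# 11 Strep throat
# 12 Cold
# 13 Allergy
```

The three rules are mutually exclusive and cover every combination of the two tested attributes, so exactly one rule fires for each instance.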
Unsupervised Clustering
A data mining method that builds models from data without predefined classes.
Table 1.3 • Acme Investors Incorporated
Customer ID   Account Type   Margin Account   Transaction Method   Trades/Month   Sex   Age     Favorite Recreation   Annual Income
1005          Joint          No               Online               12.5           F     30–39   Tennis                40–59K
1013          Custodial      No               Broker               0.5            F     50–59   Skiing                80–99K
1245          Joint          No               Online               3.6            M     20–29   Golf                  20–39K
2110          Individual     Yes              Broker               22.3           M     30–39   Fishing               40–59K
1001          Individual     Yes              Online               5.0            M     40–49   Golf                  60–79K
Questions
• Can I develop a general profile of an online investor?
• Can I determine if a new customer who does not initially open a margin account is likely to do so in the future?
• Can I build a model able to accurately predict the average number of trades per month for a new investor?
• What characteristics differentiate female and male investors?
Candidate questions for unsupervised clustering
What attribute similarities group customers?
What differences in attribute values segment the customer database?
Three Clusters
IF Margin Account = Yes & Age = 20-29 & Annual Income = 40-59K
THEN Cluster = 1
Accuracy = 0.8, Coverage = 0.5
• Accuracy: rule confidence over all instances. Example: this rule is wrong for 20% of the instances that satisfy its conditions.
• Coverage: rule significance for the cluster. Here, 50% of the instances in the cluster satisfy the conditions.
Other two rules
IF Account Type = Custodial & Favorite Recreation = Skiing & Annual Income = 80-90K
THEN Cluster = 2
Accuracy = 0.95, Coverage = 0.35
IF Account Type = Joint & Trades/Month > 5 & Transaction Method = Online
THEN Cluster = 3
Accuracy = 0.82, Coverage = 0.65
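The two rule measures can be computed directly from clustered instances. The sketch below assumes that accuracy is the fraction of instances satisfying the conditions that belong to the cluster, and that coverage is the fraction of cluster members that satisfy the conditions. The instances are invented solely to reproduce the 0.8 and 0.5 figures of the first rule. They are not real Acme data, and the rule is shortened to two conditions.

```python
def rule_accuracy_and_coverage(instances, conditions, cluster):
    """Return (accuracy, coverage) of the rule IF conditions THEN cluster."""
    def satisfies(inst):
        return all(inst[attr] == value for attr, value in conditions.items())

    matching = [inst for inst in instances if satisfies(inst)]
    members = [inst for inst in instances if inst["cluster"] == cluster]
    hits = [inst for inst in matching if inst["cluster"] == cluster]

    accuracy = len(hits) / len(matching) if matching else 0.0
    coverage = len(hits) / len(members) if members else 0.0
    return accuracy, coverage

# Made-up instances: 8 in cluster 1 (4 satisfy the rule), 4 in cluster 2 (1 satisfies it).
INSTANCES = (
    [{"margin_account": "Yes", "age": "20-29", "cluster": 1}] * 4
    + [{"margin_account": "No", "age": "20-29", "cluster": 1}] * 4
    + [{"margin_account": "Yes", "age": "20-29", "cluster": 2}] * 1
    + [{"margin_account": "No", "age": "50-59", "cluster": 2}] * 3
)
CONDITIONS = {"margin_account": "Yes", "age": "20-29"}

print(rule_accuracy_and_coverage(INSTANCES, CONDITIONS, cluster=1))  # (0.8, 0.5)
```

Five instances satisfy the conditions and four of them are in cluster 1, giving accuracy 4/5. Cluster 1 has eight members and four satisfy the conditions, giving coverage 4/8.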
1.3 Is Data Mining Appropriate for My Problem?
Data Mining or Data Query?
• Shallow Knowledge
–Is factual
• Multidimensional Knowledge
–Is factual and stored in a multidimensional format
• Hidden Knowledge
–Patterns or regularities
• Deep Knowledge
–Need some direction to find it
Data Mining vs. Data Query
• Use data query if you already know, or almost know, what you are looking for.
• Use data mining to find regularities in data that are not obvious.
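A data query suits a question whose shape is already known, for instance "which customers trade online and hold a margin account?". The sketch below loads a few columns of Table 1.3 into an in-memory SQLite database and asks exactly that. The table and column names are assumptions chosen for the example.

```python
import sqlite3

# Rows copied from Table 1.3 (subset of columns).
CUSTOMERS = [
    (1005, "Joint", "No", "Online", 12.5),
    (1013, "Custodial", "No", "Broker", 0.5),
    (1245, "Joint", "No", "Online", 3.6),
    (2110, "Individual", "Yes", "Broker", 22.3),
    (1001, "Individual", "Yes", "Online", 5.0),
]

def online_margin_customers(rows):
    """Return the IDs of customers who trade online and hold a margin account."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customer ("
        "id INTEGER, account_type TEXT, margin_account TEXT, "
        "transaction_method TEXT, trades_per_month REAL)"
    )
    conn.executemany("INSERT INTO customer VALUES (?, ?, ?, ?, ?)", rows)
    cursor = conn.execute(
        "SELECT id FROM customer "
        "WHERE transaction_method = 'Online' AND margin_account = 'Yes' "
        "ORDER BY id"
    )
    ids = [row[0] for row in cursor.fetchall()]
    conn.close()
    return ids

print(online_margin_customers(CUSTOMERS))  # [1001]
```

The query returns stored facts only. It does not say which attributes distinguish online investors in general, and that is the kind of question data mining addresses.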
1.4 Expert Systems or Data Mining?
Expert System
•A computer program that emulates the problem-solving skills of one or more human experts.
Knowledge Engineer
A person trained to interact with an expert in order to capture their knowledge.
[Figure: two routes to the same rule. Data feeds a Data Mining Tool; a Human Expert works with a Knowledge Engineer and an Expert System Building Tool. Both produce: IF Swollen Glands = Yes THEN Diagnosis = Strep Throat]
1.5 A Simple Data Mining Process Model
• Assembling the Data
• Mining the Data
• Interpreting the Results
• Result Application
[Figure: data mining process model with the stages Operational Database (SQL Queries), Data Warehouse, Data Mining, Interpretation & Evaluation, Result Application]
Assembling the Data
• The Data Warehouse
–Only data useful for decision support is extracted from the operational environment
• Relational Databases and Flat Files
Mining the Data
• Supervised learning or unsupervised?
• Which instances will be used?
• Which attributes will be selected?
• Setting learning parameters
Interpreting the Results
• If the results are less than optimal, we can repeat the data mining step using new attributes and/or instances.
Result Application
• Apply what has been discovered to new situations
–Baby diapers and beer
1.6 Why Not Simple Search?
• Nearest Neighbor Classifier
• K-nearest Neighbor Classifier
• Problem:
–Computation times
–Differentiating between relevant and irrelevant attributes
–Which attributes are able to differentiate the classes
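Both problems show up in a small k-nearest-neighbor sketch over Table 1.1. Distance here is the number of attributes on which two instances disagree, which is an assumption since the slides do not name a distance measure. All function and variable names are illustrative.

```python
from collections import Counter

FIELDS = ("sore_throat", "fever", "swollen_glands", "congestion", "headache")

# Table 1.1: attribute values in FIELDS order, paired with the diagnosis.
TRAINING = [
    (("Yes", "Yes", "Yes", "Yes", "Yes"), "Strep throat"),
    (("No", "No", "No", "Yes", "Yes"), "Allergy"),
    (("Yes", "Yes", "No", "Yes", "No"), "Cold"),
    (("Yes", "No", "Yes", "No", "No"), "Strep throat"),
    (("No", "Yes", "No", "Yes", "No"), "Cold"),
    (("No", "No", "No", "Yes", "No"), "Allergy"),
    (("No", "No", "Yes", "No", "No"), "Strep throat"),
    (("Yes", "No", "No", "Yes", "Yes"), "Allergy"),
    (("No", "Yes", "No", "Yes", "Yes"), "Cold"),
    (("Yes", "Yes", "No", "Yes", "Yes"), "Cold"),
]

def distance(a, b, attributes):
    """Count the selected attributes on which two instances disagree."""
    return sum(a[i] != b[i] for i in attributes)

def knn_classify(query, k=1, attributes=range(len(FIELDS))):
    """Compare the query with every stored instance and let the k closest vote."""
    attributes = list(attributes)
    ranked = sorted(TRAINING, key=lambda pair: distance(query, pair[0], attributes))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

patient_11 = ("No", "No", "Yes", "Yes", "Yes")  # first row of Table 1.2

# All five attributes: the single closest training instance is patient 2.
print(knn_classify(patient_11))                      # Allergy
# Only fever (index 1) and swollen glands (index 2).
print(knn_classify(patient_11, attributes=[1, 2]))   # Strep throat
```

Every classification scans the whole training set, so cost grows with the amount of stored data. With all five attributes weighted equally, patient 11 comes out as Allergy, even though the rule IF Swollen Glands = Yes THEN Diagnosis = Strep Throat says otherwise. Restricting the comparison to fever and swollen glands restores that answer, but the classifier has no built-in way to find out which attributes matter.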
1.7 Data Mining Applications
• Fraud Detection
• Health Care
• Business and Finance
• Scientific Applications
• Sports and Gaming
Customer Intrinsic Value
• Customer’s expected value based on the historical value of similar customers.
• Once it is determined, an appropriate marketing strategy can be applied
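One way to read "expected value based on the historical value of similar customers" is as an average over the most similar past customers. The sketch below takes that reading. The similarity measure (Euclidean distance on age and trades per month), the choice of k, and every number in the history are hypothetical and do not come from the slides.

```python
import math

# Hypothetical history: ((age, trades_per_month), historical value) per customer.
HISTORY = [
    ((35, 12.0), 900.0),
    ((55, 1.0), 200.0),
    ((25, 4.0), 350.0),
    ((33, 20.0), 1500.0),
    ((45, 5.0), 600.0),
]

def intrinsic_value(profile, history, k=2):
    """Average the historical value of the k customers most similar to the profile."""
    nearest = sorted(history, key=lambda rec: math.dist(profile, rec[0]))[:k]
    return sum(value for _, value in nearest) / len(nearest)

# A customer aged 34 with 15 trades per month is closest to the first and fourth records.
print(intrinsic_value((34, 15.0), HISTORY))  # 1200.0
```

Comparing this predicted figure with a customer's actual value gives a basis for choosing a marketing strategy. In practice the attributes would need scaling before distances are compared, and the history must be non-empty.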
[Figure: scatter plot of Intrinsic (Predicted) Value against Actual Value; customers are marked X or _]