CSE 711: Data Mining
Sargur N. Srihari
E-mail: [email protected]
Phone: 645-6164, ext. 113
CSE 711 Texts
Required Text
1. Witten, I. H., and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, 2000.
Recommended Texts
1. Adriaans, P., and D. Zantinge, Data Mining, Addison-Wesley, 1998.
2. Groth, R., Data Mining: A Hands-on Approach for Business Professionals, Prentice-Hall PTR, 1997.
3. Kennedy, R., Y. Lee, et al., Solving Data Mining Problems through Pattern Recognition, Prentice-Hall PTR, 1998.
4. Weiss, S., and N. Indurkhya, Predictive Data Mining: A Practical Guide, Morgan Kaufmann, 1998.
Introduction
• Challenge: How to manage ever-increasing amounts of information
• Solution: Data Mining and Knowledge Discovery in Databases (KDD)
Information as a Production Factor
• Most international organizations produce more information in a week than many people could read in a lifetime
Data Mining Motivation
• Mechanical production of data creates the need for mechanical consumption of data
• Large databases = vast amounts of information
• Difficulty lies in accessing it
KDD and Data Mining
• KDD: Extraction of knowledge from data
• Official definition: “non-trivial extraction of implicit, previously unknown & potentially useful knowledge from data”
• Data Mining: Discovery stage of the KDD process
Data Mining
• Process of discovering patterns, automatically or semi-automatically, in large quantities of data
• Patterns discovered must be useful: meaningful in that they lead to some advantage, usually economic
KDD and Data Mining
[Figure 1.1: Data mining is a multi-disciplinary field; KDD draws on expert systems, statistics, machine learning, databases, and visualization.]
Data Mining vs. Query Tools
• SQL: When you know exactly what you are looking for
• Data Mining: When you only vaguely know what you are looking for
Practical Applications
• KDD more complicated than initially thought
• 80% of the effort: preparing data
• 20% of the effort: mining data
Data Mining Techniques
• Not so much a single technique
• More the idea that there is more knowledge hidden in the data than shows itself on the surface
Data Mining Techniques
• Any technique that helps to extract more out of data is useful
• Query tools
• Statistical techniques
• Visualization
• On-line analytical processing (OLAP)
• Case-based learning (k-nearest neighbor)
Data Mining Techniques
• Decision trees
• Association rules
• Neural networks
• Genetic algorithms
Machine Learning and the Methodology of Science
[Figure: The empirical cycle of scientific research: observation, analysis, theory, prediction.]
Machine Learning... Analysis
[Figure: Theory formation. From a limited number of observations we form the theory "All swans are white," while reality contains an infinite number of swans.]
Machine Learning... Prediction
[Figure: Theory falsification. A single observation can falsify the theory "All swans are white"; reality contains an infinite number of swans.]
A Kangaroo in Mist
[Figure, panels a-f: Complexity of search spaces.]
Association Rules
Definition: Given a set of transactions, where each transaction is a set of items, an association rule is an expression X ⇒ Y, where X and Y are sets of items.
Association Rules
Intuitive meaning of such a rule: transactions in the database which contain the items in X tend also to contain the items in Y.
Association Rules
Example: 98% of customers that purchase tires and automotive accessories also buy some automotive services.
Here, 98% is called the confidence of the rule. The support of the rule X ⇒ Y is the percentage of transactions that contain both X and Y.
Association Rules
Problem: The problem of mining association rules is to find all rules which satisfy a user-specified minimum support and minimum confidence. Applications include cross-marketing, attached mailing, catalog design, loss leader analysis, add-on sales, store layout and customer segmentation based on buying patterns.
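To make these definitions concrete, here is a minimal Python sketch of computing support and confidence for a rule X ⇒ Y over a set of transactions; the helper name rule_stats and the toy transactions are illustrative, not from the course materials.

def rule_stats(transactions, X, Y):
    # Support: fraction of all transactions containing both X and Y.
    # Confidence: of the transactions containing X, the fraction also containing Y.
    both = sum(1 for t in transactions if X <= t and Y <= t)
    with_x = sum(1 for t in transactions if X <= t)
    support = both / len(transactions)
    confidence = both / with_x if with_x else 0.0
    return support, confidence

transactions = [
    {"tires", "accessories", "services"},
    {"tires", "accessories", "services"},
    {"tires", "accessories"},
    {"bulbs"},
]
s, c = rule_stats(transactions, {"tires", "accessories"}, {"services"})
print(f"support = {s:.2f}, confidence = {c:.2f}")  # support = 0.50, confidence = 0.67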
Example Data Sets
• Contact Lens (symbolic)
• Weather (symbolic data)
• Weather (numeric + symbolic)
• Iris (numeric; outcome: symbolic)
• CPU Performance (numeric; outcome: numeric)
• Labor Negotiations (missing values)
• Soybean
Contact Lens Data

age             spectacle prescription  astigmatism  tear production rate  recommended lenses
young           myope                   no           reduced               none
young           myope                   no           normal                soft
young           myope                   yes          reduced               none
young           myope                   yes          normal                hard
young           hypermetrope            no           reduced               none
young           hypermetrope            no           normal                soft
young           hypermetrope            yes          reduced               none
young           hypermetrope            yes          normal                hard
pre-presbyopic  myope                   no           reduced               none
pre-presbyopic  myope                   no           normal                soft
pre-presbyopic  myope                   yes          reduced               none
pre-presbyopic  myope                   yes          normal                hard
pre-presbyopic  hypermetrope            no           reduced               none
pre-presbyopic  hypermetrope            no           normal                soft
pre-presbyopic  hypermetrope            yes          reduced               none
pre-presbyopic  hypermetrope            yes          normal                none
presbyopic      myope                   no           reduced               none
presbyopic      myope                   no           normal                none
presbyopic      myope                   yes          reduced               none
presbyopic      myope                   yes          normal                hard
presbyopic      hypermetrope            no           reduced               none
presbyopic      hypermetrope            no           normal                soft
presbyopic      hypermetrope            yes          reduced               none
presbyopic      hypermetrope            yes          normal                none
Structural Patterns
• Part of structural description
• The example is simplistic because all combinations of possible attribute values are represented in the table
If tear production rate = reduced then recommendation = none
Otherwise, if age = young and astigmatic = no then recommendation = soft
Structural Patterns
• In most learning situations, the set of examples given as input is far from complete
• Part of the job is to generalize to other, new examples
Weather Data

outlook   temperature  humidity  windy  play
sunny     hot          high      false  no
sunny     hot          high      true   no
overcast  hot          high      false  yes
rainy     mild         high      false  yes
rainy     cool         normal    false  yes
rainy     cool         normal    true   no
overcast  cool         normal    true   yes
sunny     mild         high      false  no
sunny     cool         normal    false  yes
rainy     mild         normal    false  yes
sunny     mild         normal    true   yes
overcast  mild         high      true   yes
overcast  hot          normal    false  yes
rainy     mild         high      true   no
Weather Problem
• The four attributes yield 36 possible combinations (3 × 3 × 2 × 2 = 36), of which 14 are present in the set of examples
If outlook = sunny and humidity = high then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity = normal then play = yes
If none of the above then play = yes
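As a sketch, the five rules above can be applied directly as an ordered decision list; this Python fragment is illustrative (the function name classify and the argument encoding are assumptions, not from the course materials).

def classify(outlook, humidity, windy):
    # Rules are tried in order; the first matching rule fires.
    if outlook == "sunny" and humidity == "high":
        return "no"
    if outlook == "rainy" and windy:
        return "no"
    if outlook == "overcast":
        return "yes"
    if humidity == "normal":
        return "yes"
    return "yes"  # if none of the above

print(classify("sunny", "high", windy=False))   # no
print(classify("rainy", "normal", windy=True))  # no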
Weather Data with Some Numeric Attributes

outlook   temperature  humidity  windy  play
sunny     85           85        false  no
sunny     80           90        true   no
overcast  83           86        false  yes
rainy     70           96        false  yes
rainy     68           80        false  yes
rainy     65           70        true   no
overcast  64           65        true   yes
sunny     72           95        false  no
sunny     69           70        false  yes
rainy     75           80        false  yes
sunny     75           70        true   yes
overcast  72           90        true   yes
overcast  81           75        false  yes
rainy     71           91        true   no
Classification and Association Rules
• Classification Rules: rules that predict the classification of the example in terms of whether to play or not
If outlook = sunny and humidity > 83 then play = no
Classification and Association Rules
• Association Rules: rules which strongly associate different attribute values
• Association rules derived from the weather table:
If temperature = cool then humidity = normal
If humidity = normal and windy = false then play = yes
If outlook = sunny and play = no then humidity = high
If windy = false and play = no then outlook = sunny and humidity = high
Rules for Contact Lens Data

If tear production rate = reduced then recommendation = none
If age = young and astigmatic = no and tear production rate = normal then recommendation = soft
If age = pre-presbyopic and astigmatic = no and tear production rate = normal then recommendation = soft
If age = presbyopic and spectacle prescription = myope and astigmatic = no then recommendation = none
If spectacle prescription = hypermetrope and astigmatic = no and tear production rate = normal then recommendation = soft
If spectacle prescription = myope and astigmatic = yes and tear production rate = normal then recommendation = hard
If age = young and astigmatic = yes and tear production rate = normal then recommendation = hard
If age = pre-presbyopic and spectacle prescription = hypermetrope and astigmatic = yes then recommendation = none
If age = presbyopic and spectacle prescription = hypermetrope and astigmatic = yes then recommendation = none
Decision Tree for Contact Lens Data
[Figure: A decision tree that tests tear production rate, then astigmatism, then spectacle prescription; the leaves are none, soft, hard, and none.]
Iris Data

     sepal length  sepal width  petal length  petal width  type
1    5.1           3.5          1.4           0.2          Iris setosa
2    4.9           3.0          1.4           0.2          Iris setosa
3    4.7           3.2          1.3           0.2          Iris setosa
4    4.6           3.1          1.5           0.2          Iris setosa
5    5.0           3.6          1.4           0.2          Iris setosa
…
51   7.0           3.2          4.7           1.4          Iris versicolor
52   6.4           3.2          4.5           1.5          Iris versicolor
53   6.9           3.1          4.9           1.5          Iris versicolor
54   5.5           2.3          4.0           1.3          Iris versicolor
55   6.5           2.8          4.6           1.5          Iris versicolor
…
101  6.3           3.3          6.0           2.5          Iris virginica
102  5.8           2.7          5.1           1.9          Iris virginica
103  7.1           3.0          5.9           2.1          Iris virginica
104  6.3           2.9          5.6           1.8          Iris virginica
105  6.5           3.0          5.8           2.2          Iris virginica
Iris Rules Learned
• If petal-length < 2.45 then Iris-setosa
• If sepal-width < 2.10 then Iris-versicolor
• If sepal-width < 2.45 and petal-length < 4.55 then Iris-versicolor
• ...
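Rules like these can be recovered by fitting a decision tree to the Iris data; the following scikit-learn sketch is one modern way to do it (the course text used Java/Weka-style tools, so this library choice is an assumption).

# Sketch: learn a small decision tree on Iris and print its tests,
# which include splits such as "petal length (cm) <= 2.45" for setosa.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(iris.data, iris.target)
print(export_text(tree, feature_names=iris.feature_names))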
CPU Performance Data

     cycle time  main memory (Kb)   cache  channels      performance
     (ns)        min      max       (Kb)   min    max
     MYCT        MMIN     MMAX      CACH   CHMIN  CHMAX   PRP
1    125         256      6000      256    16     128     198
2    29          8000     32000     32     8      32      269
3    29          8000     32000     32     8      32      220
4    29          8000     32000     32     8      32      172
5    29          8000     16000     32     8      16      132
…
207  125         2000     8000      0      2      14      52
208  480         512      8000      32     0      0       67
209  480         1000     4000      0      0      0       45
CPU Performance
• Numerical Prediction: outcome as linear sum of weighted attributes
• Regression equation:
• PRP = -55.9 + 0.049 MYCT + … + 1.48 CHMAX
• Regression can discover linear relationships, not non-linear ones
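A regression like this can be fit by least squares; the NumPy sketch below uses only the five sample rows shown in the table above, so its coefficients are illustrative and will not match the full-data equation.

import numpy as np

# Attribute columns: MYCT, MMIN, MMAX, CACH, CHMIN, CHMAX; target: PRP.
X = np.array([[125,  256,  6000, 256, 16, 128],
              [ 29, 8000, 32000,  32,  8,  32],
              [ 29, 8000, 32000,  32,  8,  32],
              [ 29, 8000, 32000,  32,  8,  32],
              [ 29, 8000, 16000,  32,  8,  16]], dtype=float)
y = np.array([198, 269, 220, 172, 132], dtype=float)

# Prepend a column of ones so the model learns an intercept term.
A = np.hstack([np.ones((len(X), 1)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print("intercept:", coef[0])
print("weights:  ", coef[1:])
print("fitted PRP:", A @ coef)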
Linear Regression
[Figure: A simple linear regression for the loan data set, plotting debt against income with a fitted regression line.]
Labor Negotiations Data

attribute                   type                    1     2     3     …  40
duration                    (number of years)       1     2     3        2
wage increase first year    percentage              2%    4%    4.3%     4.5
wage increase second year   percentage              ?     5%    4.4%     4.0
wage increase third year    percentage              ?     ?     ?        ?
cost of living adjustment   {none, tcf, tc}         none  tcf   ?        none
working hours per week      (number of hours)       28    35    38       40
pension                     {none, ret-allw, …}     none  ?     ?        ?
standby pay                 percentage              ?     13%   ?        ?
shift-work supplement       percentage              ?     5%    4%       4
education allowance         {yes, no}               yes   ?     ?        ?
statutory holidays          (number of days)        11    15    12       12
vacation                    {below-avg, avg, gen}   avg   gen   gen      avg
long-term disability        {yes, no}               no    ?     ?        yes
dental plan contribution    {none, half, full}      none  ?     full     full
bereavement assistance      {yes, no}               no    ?     ?        yes
health plan contribution    {none, half, full}      none  ?     full     half
acceptability of contract   {good, bad}             bad   good  good     good
Decision Trees for ...
[Figure: A decision tree for the labor negotiations data. The root tests wage increase first year (≤ 2.5 leads to Bad); for > 2.5 it tests statutory holidays (> 10 leads to Good), otherwise it tests wage increase first year again (< 4 leads to Bad, ≥ 4 to Good).]
… Labor Negotiations Data
[Figure: A second, more complex decision tree for the labor negotiations data. It tests wage increase first year (≤ 2.5 vs. > 2.5), working hours per week (≤ 36 vs. > 36), health plan contribution (none, half, full), statutory holidays (> 10 vs. ≤ 10), and wage increase first year again (< 4 vs. ≥ 4), with leaves Bad and Good.]
Soybean Data

              attribute                 number of values  sample value
Environment   time of occurrence        7                 July
              precipitation             3                 above normal
              temperature               3                 normal
Seed          condition                 2                 normal
              mold growth               2                 absent
              discoloration             2                 absent
Fruit         condition of fruit pods   4                 normal
Leaves        condition                 2                 abnormal
              yellow leaf spot halo     3                 absent
              leaf spot margins         3                 no data
Stem          condition                 2                 abnormal
              stem lodging              2                 yes
              stem cankers              4                 above the soil line
Roots         condition                 3                 normal
Diagnosis                               19                diaporthe stem canker
Two Example Rules

If [leaf condition is normal and
    stem condition is abnormal and
    stem cankers is below soil line and
    canker lesion color is brown]
then
    diagnosis is rhizoctonia root rot

If [leaf malformation is absent and
    stem condition is abnormal and
    stem cankers is below soil line and
    canker lesion color is brown]
then
    diagnosis is rhizoctonia root rot
Classification
[Figure: A simple linear classification boundary for the loan data set, plotting debt against income; the shaded region denotes class "no loan."]
Clustering
[Figure: A simple clustering of the loan data set into 3 clusters, plotting debt against income; note that the original labels are replaced by +'s.]
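A clustering like this can be produced with k-means; the sketch below runs scikit-learn's KMeans on made-up (income, debt) points, since the loan data set itself is not reproduced here.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three synthetic groups of (income, debt) points, for illustration only.
points = np.vstack([
    rng.normal([30, 10], 3, size=(20, 2)),
    rng.normal([60, 40], 3, size=(20, 2)),
    rng.normal([90, 15], 3, size=(20, 2)),
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)
print(kmeans.cluster_centers_)  # one (income, debt) center per cluster
print(kmeans.labels_[:10])      # cluster index assigned to the first points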
Non-Linear Classification
[Figure: Classification boundaries learned by a non-linear classifier (such as a neural network) for the loan data set, plotting debt against income.]
Nearest Neighbor Classifier
[Figure: Classification boundaries for a nearest neighbor classifier on the loan data set, plotting debt against income.]
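To make the idea concrete, here is a 1-nearest-neighbor classifier in plain Python over (income, debt) points; the training points and labels are made up for illustration, not taken from the loan data set.

import math

# Each entry is ((income, debt), class label); values are illustrative.
train = [((30.0, 40.0), "no loan"),
         ((35.0, 45.0), "no loan"),
         ((70.0, 10.0), "loan"),
         ((80.0, 15.0), "loan")]

def nearest_neighbor(query):
    # Assign the class of the closest training point (Euclidean distance).
    return min(train, key=lambda p: math.dist(p[0], query))[1]

print(nearest_neighbor((75.0, 12.0)))  # loan
print(nearest_neighbor((32.0, 42.0)))  # no loan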