Page 1: CSE 711: DATA MINING Sargur N. Srihari E-mail: srihari@cedar.buffalo Phone: 645-6164, ext. 113

CSE 711: DATA MINING

Sargur N. Srihari

E-mail: [email protected]
Phone: 645-6164, ext. 113

CSE 711 Texts

Required Text

1. Witten, I. H., and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, 2000.

Recommended Texts

1. Adriaans, P., and D. Zantinge, Data Mining, Addison-Wesley, 1998.

2. Groth, R., Data Mining: A Hands-on Approach for Business Professionals, Prentice-Hall PTR, 1997.

3. Kennedy, R., Y. Lee, et al., Solving Data Mining Problems through Pattern Recognition, Prentice-Hall PTR, 1998.

4. Weiss, S., and N. Indurkhya, Predictive Data Mining: A Practical Guide, Morgan Kaufmann, 1998.

Introduction

• Challenge: How to manage ever-increasing amounts of information

• Solution: Data Mining and Knowledge Discovery in Databases (KDD)

Information as a Production Factor

• Most international organizations produce more information in a week than many people could read in a lifetime

Data Mining Motivation

• Mechanical production of data creates a need for mechanical consumption of data

• Large databases = vast amounts of information

• Difficulty lies in accessing it

KDD and Data Mining

• KDD: Extraction of knowledge from data

• Official definition: “non-trivial extraction of implicit, previously unknown & potentially useful knowledge from data”

• Data Mining: Discovery stage of the KDD process

Data Mining

• Process of discovering patterns, automatically or semi-automatically, in large quantities of data

• Patterns discovered must be useful: meaningful in that they lead to some advantage, usually economic

KDD and Data Mining

[Figure: KDD shown together with expert systems, statistics, machine learning, databases, and visualization]

Figure 1.1 Data mining is a multi-disciplinary field.

Data Mining vs. Query Tools

• SQL: When you know exactly what you are looking for

• Data Mining: When you only vaguely know what you are looking for

Practical Applications

• KDD more complicated than initially thought

• 80% preparing data

• 20% mining data

Data Mining Techniques

• Not so much a single technique

• More the idea that there is more knowledge hidden in the data than shows itself on the surface

• Any technique that helps to extract more out of data is useful

• Query tools

• Statistical techniques

• Visualization

• On-line analytical processing (OLAP)

• Case-based learning (k-nearest neighbor)

• Decision trees

• Association rules

• Neural networks

• Genetic algorithms

Machine Learning and the Methodology of Science

[Figure: cycle linking Analysis, Observation, Prediction, and Theory]

Empirical cycle of scientific research

Machine Learning... Analysis

[Figure: Reality (infinite number of swans); a limited number of observations; theory “All swans are white”]

Theory formation

Machine Learning... Prediction

[Figure: Reality (infinite number of swans); a single observation; theory “All swans are white”]

Theory falsification

A Kangaroo in Mist

[Figure: six panels, a.) to f.)]

Complexity of search spaces

Association Rules

Definition: Given a set of transactions, where each transaction is a set of items, an association rule is an expression X ⇒ Y, where X and Y are sets of items.

Intuitive meaning of such a rule: transactions in the database which contain the items in X tend also to contain the items in Y.

Example: 98% of customers that purchase tires and automotive accessories also buy some automotive services.

Here, 98% is called the confidence of the rule. The support of the rule X ⇒ Y is the percentage of transactions that contain both X and Y.

Problem: The problem of mining association rules is to find all rules which satisfy a user-specified minimum support and minimum confidence. Applications include cross-marketing, attached mailing, catalog design, loss leader analysis, add-on sales, store layout and customer segmentation based on buying patterns.
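The definitions of support and confidence, and the mining problem stated above, can be sketched in a few lines of Python. This is only a brute-force illustration under stated assumptions: the five transactions are made up for the example (they are not from the course material), and the names `support`, `confidence`, and `mine_rules` are hypothetical. Practical miners prune the search rather than enumerating every itemset.

```python
from itertools import combinations

# Made-up transactions, for illustration only (not data from the slides).
transactions = [
    {"tires", "accessories", "services"},
    {"tires", "accessories", "services"},
    {"tires", "services"},
    {"accessories"},
    {"tires", "accessories"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y, transactions):
    """Among transactions that contain X, the fraction that also contain Y."""
    support_x = support(x, transactions)
    if support_x == 0:
        return 0.0
    return support(set(x) | set(y), transactions) / support_x

def mine_rules(transactions, min_support, min_confidence):
    """Try every split of every itemset into X => Y and keep the rules
    that meet both user-specified thresholds."""
    items = sorted(set().union(*transactions))
    rules = []
    for size in range(2, len(items) + 1):
        for itemset in combinations(items, size):
            sup = support(itemset, transactions)
            if sup < min_support:
                continue
            for k in range(1, size):
                for x in combinations(itemset, k):
                    y = tuple(i for i in itemset if i not in x)
                    conf = confidence(x, y, transactions)
                    if conf >= min_confidence:
                        rules.append((x, y, sup, conf))
    return rules

for x, y, sup, conf in mine_rules(transactions, 0.5, 0.9):
    print(x, "=>", y, "support", sup, "confidence", conf)
```

Raising `min_support` shrinks the set of itemsets considered; raising `min_confidence` filters the rules built from them.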

Example Data Sets

• Contact Lens (symbolic)

• Weather (symbolic data)

• Weather (numeric + symbolic)

• Iris (numeric; outcome: symbolic)

• CPU Perf. (numeric; outcome: numeric)

• Labor Negotiations (missing values)

• Soybean

Contact Lens Data

age             spectacle prescription  astigmatism  tear production rate  recommended lenses
young           myope                   no           reduced               none
young           myope                   no           normal                soft
young           myope                   yes          reduced               none
young           myope                   yes          normal                hard
young           hypermetrope            no           reduced               none
young           hypermetrope            no           normal                soft
young           hypermetrope            yes          reduced               none
young           hypermetrope            yes          normal                hard
pre-presbyopic  myope                   no           reduced               none
pre-presbyopic  myope                   no           normal                soft
pre-presbyopic  myope                   yes          reduced               none
pre-presbyopic  myope                   yes          normal                hard
pre-presbyopic  hypermetrope            no           reduced               none
pre-presbyopic  hypermetrope            no           normal                soft
pre-presbyopic  hypermetrope            yes          reduced               none
pre-presbyopic  hypermetrope            yes          normal                none
presbyopic      myope                   no           reduced               none
presbyopic      myope                   no           normal                none
presbyopic      myope                   yes          reduced               none
presbyopic      myope                   yes          normal                hard
presbyopic      hypermetrope            no           reduced               none
presbyopic      hypermetrope            no           normal                soft
presbyopic      hypermetrope            yes          reduced               none
presbyopic      hypermetrope            yes          normal                none

Structural Patterns

• Part of structural description

• Example is simplistic because all combinations of possible values are represented in table

If tear production rate = reduced then recommendation = none

Otherwise, if age = young and astigmatic = no then recommendation = soft

• In most learning situations, the set of examples given as input is far from complete

• Part of the job is to generalize to other, new examples

Weather Data

outlook   temperature  humidity  windy  play
sunny     hot          high      false  no
sunny     hot          high      true   no
overcast  hot          high      false  yes
rainy     mild         high      false  yes
rainy     cool         normal    false  yes
rainy     cool         normal    true   no
overcast  cool         normal    true   yes
sunny     mild         high      false  no
sunny     cool         normal    false  yes
rainy     mild         normal    false  yes
sunny     mild         normal    true   yes
overcast  mild         high      true   yes
overcast  hot          normal    false  yes
rainy     mild         high      true   no

Weather Problem

• The four attributes create 36 possible combinations (3 × 3 × 2 × 2 = 36), of which 14 are present in the set of examples

If outlook = sunny and humidity = high then play = no

If outlook = rainy and windy = true then play = no

If outlook = overcast then play = yes

If humidity = normal then play = yes

If none of the above then play = yes
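The count of combinations and the five rules above can be checked directly against the 14 weather examples. The sketch below assumes the rules are read in order as a decision list, with the first rule that fires deciding the outcome; the name `classify` and the dictionary layout are choices made for this illustration.

```python
from itertools import product

FIELDS = ("outlook", "temperature", "humidity", "windy", "play")
# The 14 examples from the Weather Data table, one per line.
TABLE = """\
sunny hot high false no
sunny hot high true no
overcast hot high false yes
rainy mild high false yes
rainy cool normal false yes
rainy cool normal true no
overcast cool normal true yes
sunny mild high false no
sunny cool normal false yes
rainy mild normal false yes
sunny mild normal true yes
overcast mild high true yes
overcast hot normal false yes
rainy mild high true no"""
WEATHER = [dict(zip(FIELDS, line.split())) for line in TABLE.splitlines()]

def classify(row):
    """Apply the rules in order (assumed decision-list reading)."""
    if row["outlook"] == "sunny" and row["humidity"] == "high":
        return "no"
    if row["outlook"] == "rainy" and row["windy"] == "true":
        return "no"
    if row["outlook"] == "overcast":
        return "yes"
    if row["humidity"] == "normal":
        return "yes"
    return "yes"  # "none of the above"

# All value combinations of the four input attributes.
values = [sorted({row[f] for row in WEATHER}) for f in FIELDS[:4]]
combinations_possible = len(list(product(*values)))
combinations_seen = len({tuple(row[f] for f in FIELDS[:4]) for row in WEATHER})
correct = sum(classify(row) == row["play"] for row in WEATHER)

print(combinations_possible, combinations_seen, correct)
```

Read this way, the rule list reproduces the play column for every example in the table; taken individually and out of order, some of the rules would be wrong (for instance, humidity = normal does not always mean play = yes).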

Weather Data with Some Numeric Attributes

outlook   temperature  humidity  windy  play
sunny     85           85        false  no
sunny     80           90        true   no
overcast  83           86        false  yes
rainy     70           96        false  yes
rainy     68           80        false  yes
rainy     65           70        true   no
overcast  64           65        true   yes
sunny     72           95        false  no
sunny     69           70        false  yes
rainy     75           80        false  yes
sunny     75           70        true   yes
overcast  72           90        true   yes
overcast  81           75        false  yes
rainy     71           91        true   no

Classification and Association Rules

• Classification Rules: rules which predict the classification of the example in terms of whether to play or not

If outlook = sunny and humidity > 83 then play = no
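With numeric attributes a rule tests a threshold instead of a symbolic value. As a small check, the sketch below applies the rule above to the numeric weather table; the helper name `rule_fires` is an assumption made for this illustration.

```python
FIELDS = ("outlook", "temperature", "humidity", "windy", "play")
# The 14 examples from the numeric weather table, one per line.
TABLE = """\
sunny 85 85 false no
sunny 80 90 true no
overcast 83 86 false yes
rainy 70 96 false yes
rainy 68 80 false yes
rainy 65 70 true no
overcast 64 65 true yes
sunny 72 95 false no
sunny 69 70 false yes
rainy 75 80 false yes
sunny 75 70 true yes
overcast 72 90 true yes
overcast 81 75 false yes
rainy 71 91 true no"""

def parse(line):
    outlook, temperature, humidity, windy, play = line.split()
    return {"outlook": outlook, "temperature": int(temperature),
            "humidity": int(humidity), "windy": windy, "play": play}

WEATHER = [parse(line) for line in TABLE.splitlines()]

def rule_fires(row):
    """If outlook = sunny and humidity > 83 then play = no."""
    return row["outlook"] == "sunny" and row["humidity"] > 83

covered = [row for row in WEATHER if rule_fires(row)]
print(len(covered), {row["play"] for row in covered})
```

The rule covers the three sunny days with humidity 85, 90, and 95, and all three have play = no.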

• Association Rules: rules which strongly associate different attribute values

• Association rules which derive from weather table

If temperature = cool then humidity = normal

If humidity = normal and windy = false then play = yes

If outlook = sunny and play = no then humidity = high

If windy = false and play = no then outlook = sunny and humidity = high
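Unlike a classification rule, an association rule may predict any attribute or combination of attributes. The sketch below counts, for each of the four rules above, how many rows of the symbolic weather table match the left-hand side and what fraction of those also match the right-hand side. The function name `rule_stats` is an assumption made for this illustration.

```python
FIELDS = ("outlook", "temperature", "humidity", "windy", "play")
# The 14 examples from the symbolic weather table, one per line.
TABLE = """\
sunny hot high false no
sunny hot high true no
overcast hot high false yes
rainy mild high false yes
rainy cool normal false yes
rainy cool normal true no
overcast cool normal true yes
sunny mild high false no
sunny cool normal false yes
rainy mild normal false yes
sunny mild normal true yes
overcast mild high true yes
overcast hot normal false yes
rainy mild high true no"""
WEATHER = [dict(zip(FIELDS, line.split())) for line in TABLE.splitlines()]

def matches(row, conditions):
    return all(row[attr] == value for attr, value in conditions.items())

def rule_stats(antecedent, consequent, rows):
    """Return (rows matching the antecedent, confidence of the rule)."""
    covered = [row for row in rows if matches(row, antecedent)]
    hits = [row for row in covered if matches(row, consequent)]
    return len(covered), len(hits) / len(covered)

RULES = [
    ({"temperature": "cool"}, {"humidity": "normal"}),
    ({"humidity": "normal", "windy": "false"}, {"play": "yes"}),
    ({"outlook": "sunny", "play": "no"}, {"humidity": "high"}),
    ({"windy": "false", "play": "no"}, {"outlook": "sunny", "humidity": "high"}),
]

stats = [rule_stats(lhs, rhs, WEATHER) for lhs, rhs in RULES]
for (lhs, rhs), (count, conf) in zip(RULES, stats):
    print(lhs, "=>", rhs, "covers", count, "confidence", conf)
```

Each of the four rules holds for every row it covers in this table, although they cover only 4, 4, 3, and 2 of the 14 rows respectively.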

Rules for Contact Lens Data

If tear production rate = reduced then recommendation = none

If age = young and astigmatic = no and tear production rate = normal then recommendation = soft

If age = pre-presbyopic and astigmatic = no and tear production rate = normal then recommendation = soft

If age = presbyopic and spectacle prescription = myope and astigmatic = no then recommendation = none

If spectacle prescription = hypermetrope and astigmatic = no and tear production rate = normal then recommendation = soft

If spectacle prescription = myope and astigmatic = yes and tear production rate = normal then recommendation = hard

If age = young and astigmatic = yes and tear production rate = normal then recommendation = hard

If age = pre-presbyopic and spectacle prescription = hypermetrope and astigmatic = yes then recommendation = none

If age = presbyopic and spectacle prescription = hypermetrope and astigmatic = yes then recommendation = none
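As a consistency check, the nine contact lens rules can be applied to the 24 rows of the contact lens table. The sketch below rebuilds the table from its attribute values (the rows appear in product order) and looks for rows that no rule covers or where a firing rule disagrees with the recommendation. The field names and the helper `firing` are choices made for this illustration.

```python
from itertools import product

AGES = ["young", "pre-presbyopic", "presbyopic"]
PRESCRIPTIONS = ["myope", "hypermetrope"]
ASTIGMATIC = ["no", "yes"]
TEAR_RATES = ["reduced", "normal"]
# Recommendations in the same row order as the Contact Lens Data table.
LABELS = ("none soft none hard none soft none hard "
          "none soft none hard none soft none none "
          "none none none hard none soft none none").split()
FIELDS = ("age", "prescription", "astigmatic", "tear_rate")
ROWS = [
    dict(zip(FIELDS, combo), recommendation=label)
    for combo, label in zip(
        product(AGES, PRESCRIPTIONS, ASTIGMATIC, TEAR_RATES), LABELS)
]

# The nine rules, as (conditions, recommendation) pairs.
RULES = [
    ({"tear_rate": "reduced"}, "none"),
    ({"age": "young", "astigmatic": "no", "tear_rate": "normal"}, "soft"),
    ({"age": "pre-presbyopic", "astigmatic": "no", "tear_rate": "normal"}, "soft"),
    ({"age": "presbyopic", "prescription": "myope", "astigmatic": "no"}, "none"),
    ({"prescription": "hypermetrope", "astigmatic": "no", "tear_rate": "normal"}, "soft"),
    ({"prescription": "myope", "astigmatic": "yes", "tear_rate": "normal"}, "hard"),
    ({"age": "young", "astigmatic": "yes", "tear_rate": "normal"}, "hard"),
    ({"age": "pre-presbyopic", "prescription": "hypermetrope", "astigmatic": "yes"}, "none"),
    ({"age": "presbyopic", "prescription": "hypermetrope", "astigmatic": "yes"}, "none"),
]

def firing(row):
    """Recommendations of every rule whose conditions all hold for row."""
    return [outcome for conditions, outcome in RULES
            if all(row[k] == v for k, v in conditions.items())]

uncovered = [row for row in ROWS if not firing(row)]
conflicts = [row for row in ROWS
             if any(outcome != row["recommendation"] for outcome in firing(row))]
print(len(ROWS), len(uncovered), len(conflicts))
```

On this table every row is covered by at least one rule and no firing rule contradicts the recorded recommendation, so the rules can be applied in any order here.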

Decision Tree for Contact Lens Data

[Figure: decision tree testing tear production rate, then astigmatism, then spectacle prescription; leaves: none, soft, hard, none]
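A decision tree of this shape is equivalent to nested if statements. The transcript keeps only the node and leaf labels, not the branch labels, so the branch values in the sketch below are assumptions chosen to agree with the first contact lens rule and the order of the leaves; the function name is also hypothetical.

```python
def recommend_lenses(tear_rate, astigmatic, prescription):
    """Walk the tree from the root: tear production rate, then
    astigmatism, then spectacle prescription.

    Branch values are assumed; the slide transcript does not show them.
    """
    if tear_rate == "reduced":
        return "none"
    if astigmatic == "no":
        return "soft"
    if prescription == "myope":
        return "hard"
    return "none"

print(recommend_lenses("reduced", "no", "myope"))
print(recommend_lenses("normal", "no", "hypermetrope"))
print(recommend_lenses("normal", "yes", "myope"))
print(recommend_lenses("normal", "yes", "hypermetrope"))
```

Note that the tree ignores age, so under these assumed branches it is a more compact description than the nine rules but cannot reproduce every row of the table (for example, a presbyopic myope without astigmatism and with normal tear production is recorded as none).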

Iris Data

      sepal length  sepal width  petal length  petal width  type
1     5.1           3.5          1.4           0.2          Iris setosa
2     4.9           3.0          1.4           0.2          Iris setosa
3     4.7           3.2          1.3           0.2          Iris setosa
4     4.6           3.1          1.5           0.2          Iris setosa
5     5.0           3.6          1.4           0.2          Iris setosa
…
51    7.0           3.2          4.7           1.4          Iris versicolor
52    6.4           3.2          4.5           1.5          Iris versicolor
53    6.9           3.1          4.9           1.5          Iris versicolor
54    5.5           2.3          4.0           1.3          Iris versicolor
55    6.5           2.8          4.6           1.5          Iris versicolor
…
101   6.3           3.3          6.0           2.5          Iris virginica
102   5.8           2.7          5.1           1.9          Iris virginica
103   7.1           3.0          5.9           2.1          Iris virginica
104   6.3           2.9          5.6           1.8          Iris virginica
105   6.5           3.0          5.8           2.2          Iris virginica

Iris Rules Learned

• If petal-length < 2.45 then Iris-setosa

• If sepal-width < 2.10 then Iris-versicolor

• If sepal-width < 2.45 and petal-length < 4.55 then Iris-versicolor

• ...
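The three rules shown can be tried on a few rows of the Iris table. The sketch assumes the rules are applied in the order listed, and it returns `None` when none of the three fires, because the rest of the rule list is elided on the slide. The function name is hypothetical.

```python
def classify_iris(sepal_length, sepal_width, petal_length, petal_width):
    """Apply only the three rules shown, in order (assumed)."""
    if petal_length < 2.45:
        return "Iris-setosa"
    if sepal_width < 2.10:
        return "Iris-versicolor"
    if sepal_width < 2.45 and petal_length < 4.55:
        return "Iris-versicolor"
    return None  # would be decided by rules elided on the slide

print(classify_iris(5.1, 3.5, 1.4, 0.2))  # row 1
print(classify_iris(5.5, 2.3, 4.0, 1.3))  # row 54
print(classify_iris(6.3, 3.3, 6.0, 2.5))  # row 101
```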

CPU Performance Data

       cycle      main memory (Kb)   cache   channels        performance
       time (ns)  min     max        (Kb)    min     max
       MYCT       MMIN    MMAX       CACH    CHMIN   CHMAX   PRP
1      125        256     6000       256     16      128     198
2      29         8000    32000      32      8       32      269
3      29         8000    32000      32      8       32      220
4      29         8000    32000      32      8       32      172
5      29         8000    16000      32      8       16      132
…
207    125        2000    8000       0       2       14      52
208    480        512     8000       32      0       0       67
209    480        1000    4000       0       0       0       45

CPU Performance

• Numerical Prediction: outcome as linear sum of weighted attributes

• Regression equation:

• PRP = -55.9 + 0.049 MYCT + … + 1.48 CHMAX

• Regression can discover linear relationships, not non-linear ones
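To make the idea of a weighted linear sum concrete, the sketch below fits a least-squares line with a single attribute, predicting PRP from MMAX, using only the eight rows visible in the CPU performance table. This is a simplification chosen for illustration: the equation on the slide uses several attributes and the full data set, so the coefficients here will not match it. The function name `fit_line` is hypothetical.

```python
# (MMAX, PRP) pairs from the eight rows shown in the CPU Performance Data table.
rows = [(6000, 198), (32000, 269), (32000, 220), (32000, 172),
        (16000, 132), (8000, 52), (8000, 67), (4000, 45)]

def fit_line(points):
    """Closed-form least squares for y = intercept + slope * x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

intercept, slope = fit_line(rows)
residuals = [y - (intercept + slope * x) for x, y in rows]
print(intercept, slope)
```

The first row (MMAX = 6000, PRP = 198) sits well above the fitted line, a reminder that one attribute and a straight line capture only part of the relationship.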

Linear Regression

[Figure: plot with axes Income and Debt, showing a regression line]

A simple linear regression for the loan data set

Labor Negotiations Data

attribute                  type                 1     2     3     …  40
duration                   (number of years)    1     2     3        2
wage increase first year   percentage           2%    4%    4.3%     4.5
wage increase second year  percentage           ?     5%    4.4%     4.0
wage increase third year   percentage           ?     ?     ?        ?
cost of living adjustment  {none, tcf, tc}      none  tcf   ?        none
working hours per week     (number of hours)    28    35    38       40
pension                    {none, ret-allw, …}  none  ?     ?        ?
standby pay                percentage           ?     13%   ?        ?
shift-work supplement      percentage           ?     5%    4%       4
education allowance        {yes, no}            yes   ?     ?        ?
statutory holidays         (number of days)     11    15    12       12
vacation                   {below-avg, avg, …}  avg   gen   gen      avg
long-term disability       {yes, no}            no    ?     ?        yes
dental plan contribution   {none, half, full}   none  ?     full     full
bereavement assistance     {yes, no}            no    ?     ?        yes
health plan contribution   {none, half, full}   none  ?     full     half
acceptability of contract  {good, bad}          bad   good  good     good

Decision Trees for ...

[Figure: decision tree testing wage increase first year (threshold 2.5), then statutory holidays (threshold 10), then wage increase first year (threshold 4); leaves: bad, good, bad, good]

… Labor Negotiations Data

[Figure: larger decision tree testing wage increase first year (threshold 2.5), working hours per week (threshold 36), statutory holidays (threshold 10), health plan contribution (none, half, full), and wage increase first year (threshold 4); leaves: bad and good]

Soybean Data

             Attribute                Number of Values  Sample Value
Environment  time of occurrence       7                 July
             precipitation            3                 above normal
             temperature              3                 normal
Seed         condition                2                 normal
             mold growth              2                 absent
             discoloration            2                 absent
Fruit        condition of fruit pods  4                 normal
Leaves       condition                2                 abnormal
             yellow leaf spot halo    3                 absent
             leaf spot margins        3                 no data
Stem         condition                2                 abnormal
             stem lodging             2                 yes
             stem cankers             4                 above the soil line
Roots        condition                3                 normal
Diagnosis                             19                diaporthe stem canker

Two Example Rules

If [leaf condition is normal and
    stem condition is abnormal and
    stem cankers is below soil line and
    canker lesion color is brown]
then
    diagnosis is rhizoctonia root rot

If [leaf malformation is absent and
    stem condition is abnormal and
    stem cankers is below soil line and
    canker lesion color is brown]
then
    diagnosis is rhizoctonia root rot

Classification

[Figure: loan data plotted with axes Income and Debt; regions labeled Loan and No loan]

A simple linear classification boundary for the loan data set; shaded region denotes class “no loan”

Clustering

[Figure: loan data plotted with axes Income and Debt; groups labeled Cluster 1, Cluster 2, and Cluster 3]

A simple clustering of the loan data set into 3 clusters; note that the original labels are replaced by +’s

Non-Linear Classification

[Figure: loan data plotted with axes Income and Debt; regions labeled Loan and No Loan]

An example of classification boundaries learned by a non-linear classifier (such as a neural network) for the loan data set

Nearest Neighbor Classifier

[Figure: loan data plotted with axes Income and Debt; regions labeled Loan and No Loan]

Classification boundaries for a nearest neighbor classifier for the loan data set
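A nearest neighbor classifier stores the training examples and labels a new point with the class of the closest stored example. The loan data behind the figure is not given numerically, so the sketch below uses the fifteen Iris rows shown earlier in this deck as a stand-in. The labels for rows 51 to 55 are truncated in the transcript and are assumed here to be Iris versicolor; the function name and the query points are hypothetical.

```python
import math

# (sepal length, sepal width, petal length, petal width), type
# Rows 51-55 are assumed to be "Iris versicolor" (label truncated in the source).
TRAINING = [
    ((5.1, 3.5, 1.4, 0.2), "Iris setosa"),
    ((4.9, 3.0, 1.4, 0.2), "Iris setosa"),
    ((4.7, 3.2, 1.3, 0.2), "Iris setosa"),
    ((4.6, 3.1, 1.5, 0.2), "Iris setosa"),
    ((5.0, 3.6, 1.4, 0.2), "Iris setosa"),
    ((7.0, 3.2, 4.7, 1.4), "Iris versicolor"),
    ((6.4, 3.2, 4.5, 1.5), "Iris versicolor"),
    ((6.9, 3.1, 4.9, 1.5), "Iris versicolor"),
    ((5.5, 2.3, 4.0, 1.3), "Iris versicolor"),
    ((6.5, 2.8, 4.6, 1.5), "Iris versicolor"),
    ((6.3, 3.3, 6.0, 2.5), "Iris virginica"),
    ((5.8, 2.7, 5.1, 1.9), "Iris virginica"),
    ((7.1, 3.0, 5.9, 2.1), "Iris virginica"),
    ((6.3, 2.9, 5.6, 1.8), "Iris virginica"),
    ((6.5, 3.0, 5.8, 2.2), "Iris virginica"),
]

def nearest_neighbor(query, training=TRAINING):
    """Return the label of the training example closest to query
    under Euclidean distance."""
    _, label = min(training, key=lambda example: math.dist(query, example[0]))
    return label

# Hypothetical query points, not taken from the data set.
print(nearest_neighbor((5.0, 3.4, 1.5, 0.2)))
print(nearest_neighbor((6.0, 2.9, 4.5, 1.5)))
print(nearest_neighbor((6.4, 3.0, 5.7, 2.1)))
```

Because every stored example defines its own region of influence, the resulting decision boundaries are piecewise and can be far more irregular than a single straight line.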