TRANSCRIPT

Data Mining Demystified
John Aleshunas
Fall Faculty Institute
October 2006
Data Mining Stories

“My bank called and said that they saw that I bought two surfboards at Laguna Beach, California.” - credit card fraud detection
The NSA is using data mining to analyze telephone call data to track al-Qaeda activities
Victoria’s Secret uses data mining to control product distribution based on typical customer buying patterns at individual stores
Preview

Why data mining?
Example data sets
Data mining methods
Example application of data mining
Social issues of data mining

Source: Han
Why Data Mining?

Database systems have been around since the 1970s
Organizations have a vast digital history of the day-to-day pieces of their processes
Simple queries no longer provide satisfying results:
• They take too long to execute
• They cannot help us find new opportunities

Source: Han
Why Data Mining?

Data doubles about every year while useful information seems to be decreasing
Vast data stores overload traditional decision making processes
We are data rich, but information poor
Data Mining: a definition

Simply stated, data mining refers to the extraction of knowledge from large amounts of data.

Source: Dunham
Data Mining Models: A Taxonomy

Data Mining
• Predictive: Classification, Regression, Time Series Analysis, Prediction
• Descriptive: Clustering, Summarization, Association Rules, Sequence Discovery

Source: Fisher
Iris Dataset

Created by R.A. Fisher (1936)
150 instances
Three cultivars (Setosa, Virginica, Versicolor), 50 instances each
4 measurements (petal width, petal length, sepal width, sepal length)
One cultivar (Setosa) is easily separable; the others are not - noisy data
Iris Dataset Analysis

[Figure 4: Petal width (cm) by record number, plotted separately for Iris-Setosa, Iris-Versicolor, and Iris-Virginica]
[Figure 2: Sepal width (cm) by record number, plotted separately for Iris-Setosa, Iris-Versicolor, and Iris-Virginica]

Source: UCI Machine Learning Repository
Wine Dataset

This data is the result of a chemical analysis of wines grown in the same region in Italy but derived from three different varieties.
153 instances with 13 constituents found in each of the three types of wines.
Wine Dataset Analysis

[Figure: Flavanoid values by instance for Class 1, Class 2, and Class 3]
[Figure: Ash values by instance for Class 1, Class 2, and Class 3]

Source: UCI Machine Learning Repository
Diabetes Dataset

Data is based on a population of women who were at least 21 years old, of Pima Indian heritage, and living near Phoenix in 1990
768 instances
9 attributes (Pregnancies, PG Concentration, Diastolic BP, Tri Fold Thick, Serum Ins, BMI, DP Function, Age, Diabetes)
Dataset has many missing values; only 532 instances are complete
Diabetes Dataset Analysis

[Figure: PG Concentration values by instance for Healthy and Sick groups]
[Figure: Diastolic BP values by instance for Healthy and Sick groups]
Classification

Classification builds a model using a training dataset with known classes of data
That model is used to classify new, unknown data into those classes
Classification Techniques

K-Nearest Neighbors
Decision Tree Classification (ID3, C4.5)
K-Nearest Neighbors Example

[Figure: two classes of labeled points, A and B, with an unknown point X assigned to the class of its nearest neighbors]

• Easy to explain
• Simple to implement
• Sensitive to the selection of the classification population
• Not always conclusive for complex data

Source: Indelicato
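The neighbor-vote idea can be sketched in a few lines of Python. The points and labels below are hypothetical toy data in the spirit of the A/B figure, not actual coordinates from any of the datasets:

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Classify `query` by majority vote among its k nearest
    training points, using Euclidean distance."""
    neighbors = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical two-class points: one cloud of A's, one cloud of B's
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
         ((4.0, 4.0), "B"), ((4.2, 3.9), "B"), ((3.8, 4.1), "B")]

print(knn_classify(train, (1.1, 0.9)))  # A
print(knn_classify(train, (4.1, 4.0)))  # B
```

The sensitivity noted above shows up directly here: changing the training population or k changes the vote.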
K-Nearest Neighbors Example

Misclassification Percentages

Iris Dataset
                All Attributes     Petal Length and Petal Width
  Setosa        0/150 = 0%         0/150 = 0%
  Versicolor    0/150 = 0%         0/150 = 0%
  Virginica     9/150 = 6%         7/150 = 4.67%
  Total         6%                 4.67%

Wine Dataset
                All Attributes     Phenols, Flavanoids, OD280/OD315
  Class 1       0/153 = 0%         2/153 = 1.31%
  Class 2       9/153 = 5.88%      30/153 = 19.61%
  Class 3       0/153 = 0%         0/153 = 0%
  Total         5.88%              20.92%
Decision Tree Example (C4.5)

C4.5 is a decision tree generating algorithm based on the ID3 algorithm. It contains several improvements, especially ones needed for software implementation.
Choice of the best splitting attribute is based on an entropy calculation.
These improvements include:
• Choosing an appropriate attribute selection measure
• Handling training data with missing attribute values
• Handling attributes with differing costs
• Handling continuous attributes

Source: Seidler
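The entropy calculation behind the split choice can be illustrated with a short sketch. ID3 picks the attribute with the highest information gain (C4.5 refines this with a gain ratio); the rows below are a made-up toy table, not the actual iris or wine data:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attr, label):
    """Entropy reduction from splitting `rows` (list of dicts) on `attr`."""
    base = entropy([r[label] for r in rows])
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[label] for r in rows if r[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Hypothetical training rows: "petal" separates the classes
# perfectly, "sepal" tells us nothing.
rows = [
    {"petal": "short", "sepal": "wide",   "cls": "setosa"},
    {"petal": "short", "sepal": "narrow", "cls": "setosa"},
    {"petal": "long",  "sepal": "wide",   "cls": "virginica"},
    {"petal": "long",  "sepal": "narrow", "cls": "virginica"},
]
print(information_gain(rows, "petal", "cls"))  # 1.0 (perfect split)
print(information_gain(rows, "sepal", "cls"))  # 0.0 (uninformative)
```

The tree builder simply recurses: pick the highest-gain attribute, split, and repeat on each subset.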
Decision Tree Example (C4.5)

Iris dataset: Accuracy 97.67%
Wine dataset: Accuracy 86.7%
Decision Tree Example (C4.5)

C4.5 produces a complex tree (195 nodes)
The simplified (pruned) tree reduces the classification accuracy

Diabetes dataset
            Before Pruning    After Pruning
  Size      195               69
  Errors    40 (5.2%)         102 (13.3%)
  Accuracy  94.8%             86.7%
Association Rules

Association rules are used to show the relationships between data items.
Purchasing one product when another product is purchased is an example of an association rule.
They do not represent any causality or correlation.
Association Rule Techniques

Market Basket Analysis

Terminology
• Transaction database
• Association rule - implication {A, B} => {C}
• Support - % of transactions in which {A, B, C} occurs
• Confidence - ratio of the number of transactions that contain {A, B, C} to the number of transactions that contain {A, B}

Source: UCI Machine Learning Repository
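Both measures fall straight out of the transaction database. A minimal sketch, using a hypothetical market-basket list rather than any real transaction data:

```python
def support(transactions, itemset):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction
    that also contain the consequent."""
    return (support(transactions, set(antecedent) | set(consequent))
            / support(transactions, antecedent))

# Hypothetical market-basket transactions
transactions = [{"bread", "milk"},
                {"bread", "milk", "eggs"},
                {"bread", "eggs"},
                {"milk", "eggs"}]

print(support(transactions, {"bread", "milk"}))       # 0.5
print(confidence(transactions, {"bread"}, {"milk"}))  # 2/3
```

Rule-mining algorithms such as Apriori then search for all rules whose support and confidence clear user-set thresholds.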
Association Rule Example
1984 United States Congressional Voting Records Database

Attribute Information:
1. Class Name: 2 (democrat, republican)
2. handicapped-infants: 2 (y,n)
3. water-project-cost-sharing: 2 (y,n)
4. adoption-of-the-budget-resolution: 2 (y,n)
5. physician-fee-freeze: 2 (y,n)
6. El-Salvador-aid: 2 (y,n)
7. religious-groups-in-schools: 2 (y,n)
8. anti-satellite-test-ban: 2 (y,n)
9. aid-to-Nicaraguan-contras: 2 (y,n)
10. MX-missile: 2 (y,n)
11. immigration: 2 (y,n)
12. synfuels-corporation-cutback: 2 (y,n)
13. education-spending: 2 (y,n)
14. superfund-right-to-sue: 2 (y,n)
15. crime: 2 (y,n)
16. duty-free-exports: 2 (y,n)
17. export-administration-act-south-africa: 2 (y,n)

Rules:
{budget resolution = no, MX-missile = no, aid to El Salvador = yes} => {Republican}, confidence 91.0%
{budget resolution = yes, MX-missile = yes, aid to El Salvador = no} => {Democrat}, confidence 97.5%
{crime = yes, right-to-sue = yes, Physician fee freeze = yes} => {Republican}, confidence 93.5%
{crime = no, right-to-sue = no, Physician fee freeze = no} => {Democrat}, confidence 100.0%
Clustering

Clustering is similar to classification in that data are grouped.
Unlike classification, the groups are not predefined; they are discovered.
Grouping is accomplished by finding similarities between data according to characteristics found in the actual data.
Clustering Techniques

K-Means Clustering
Neural Network Clustering (SOM)
K-Means Example

The K-Means algorithm is a method to cluster objects based on their attributes into k partitions.
It assumes that the k clusters exhibit normal distributions.
Its objective is to minimize the variance within the clusters.
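The iteration itself (Lloyd's algorithm) is short enough to sketch in full. The 1-D measurements below are made up for the example, and the deterministic seeding is a simplification; real implementations usually use random restarts or k-means++:

```python
def kmeans_1d(points, iters=20):
    """Lloyd's algorithm on 1-D data with k = 3: assign each point
    to its nearest mean, recompute the means, and repeat."""
    # Deterministic seeding (min, overall mean, max) keeps this
    # sketch reproducible.
    means = [min(points), sum(points) / len(points), max(points)]
    for _ in range(iters):
        clusters = [[] for _ in means]
        for p in points:
            nearest = min(range(len(means)), key=lambda i: abs(p - means[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous mean
        means = [sum(c) / len(c) if c else means[i]
                 for i, c in enumerate(clusters)]
    return means, clusters

# Hypothetical 1-D measurements with three visible groups
points = [0.2, 0.3, 0.25, 1.3, 1.4, 1.35, 2.1, 2.3, 2.2]
means, clusters = kmeans_1d(points)
print(sorted(round(m, 2) for m in means))  # [0.25, 1.35, 2.2]
```

Each pass shrinks the within-cluster variance, which is exactly the objective stated above.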
K-Means ExampleK-Means Example
Cluster 1Cluster 1 Cluster 2Cluster 2 Cluster 3Cluster 346 Versicolor46 Versicolor
3 Virginica3 Virginica
Cluster mean 4.22857Cluster mean 4.22857
4 Versicolor4 Versicolor
47 Virginica47 Virginica
Cluster mean 5.55686Cluster mean 5.55686
50 Setosa50 Setosa
Cluster mean 1.46275Cluster mean 1.46275
Cluster 1Cluster 1 Cluster 2Cluster 2 Cluster 3Cluster 347 Versicolor47 Versicolor
49 Virginica49 Virginica
Mean 6.30, 2.89, 4.96, 1.70Mean 6.30, 2.89, 4.96, 1.70
21 Setosa21 Setosa
1 Virginica1 Virginica
Mean 4.59, 3.07, 1.44, 0.29Mean 4.59, 3.07, 1.44, 0.29
29 Setosa29 Setosa
3 Versicolor3 Versicolor
Mean 5.21, 3.53, 1.67, 0.35Mean 5.21, 3.53, 1.67, 0.35
Iris dataset, only the petal width attribute, Accuracy 95.33%Iris dataset, only the petal width attribute, Accuracy 95.33%
Iris dataset, all attributes, Accuracy 66.0Iris dataset, all attributes, Accuracy 66.0 %%
Cluster Cluster 11
Cluster Cluster 22
Cluster Cluster 33
Cluster Cluster 44
Cluster Cluster 55
Cluster Cluster 66
Cluster Cluster 77
23 23 VirginicaVirginica
1 Virginica1 Virginica 26 Setosa26 Setosa 12 12 VirginicaVirginica
24 24 VersicolorVersicolor
1 Virginica1 Virginica
26 26 VersicolorVersicolor
13 13 VirginicaVirginica
24 Setosa24 Setosa
Iris dataset, all attributes, Accuracy 90.67Iris dataset, all attributes, Accuracy 90.67 %%
Self-Organizing Map Example

The Self-Organizing Map was first described by the Finnish professor Teuvo Kohonen and is thus sometimes referred to as a Kohonen map.
SOM is especially good for visualizing high-dimensional data.
SOM maps input vectors onto a two-dimensional grid of nodes.
Nodes that are close together have similar attribute values and nodes that are far apart have different attribute values.
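The training loop can be sketched as follows: for each input, find the best-matching node, then pull that node and its grid neighbors toward the input, shrinking the neighborhood and learning rate over time. Grid size, decay schedules, and the toy inputs are all illustrative choices, not Kohonen's exact formulation:

```python
import math
import random

def train_som(data, rows=4, cols=4, epochs=60):
    """Train a small self-organizing map on a list of input vectors."""
    random.seed(0)
    dim = len(data[0])
    grid = {(r, c): [random.random() for _ in range(dim)]
            for r in range(rows) for c in range(cols)}
    for epoch in range(epochs):
        alpha = 0.5 * (1 - epoch / epochs)              # decaying learning rate
        radius = max(1.0, (rows / 2) * (1 - epoch / epochs))
        for x in data:
            # Best-matching unit: node whose weights are closest to x
            bmu = min(grid, key=lambda n: math.dist(grid[n], x))
            for node, w in grid.items():
                d = math.dist(node, bmu)                # distance on the grid
                if d <= radius:
                    h = math.exp(-d * d / (2 * radius * radius))
                    for i in range(dim):
                        w[i] += alpha * h * (x[i] - w[i])
    return grid

# Hypothetical 2-D inputs from two well-separated groups: after
# training, each group should map to its own region of the grid.
data = [[0.1, 0.1], [0.15, 0.05], [0.9, 0.9], [0.85, 0.95]]
grid = train_som(data)
bmu_a = min(grid, key=lambda n: math.dist(grid[n], [0.1, 0.1]))
bmu_b = min(grid, key=lambda n: math.dist(grid[n], [0.9, 0.9]))
print(bmu_a != bmu_b)  # the two groups land on different nodes
```

The neighborhood update is what produces the topology seen in the maps that follow: similar inputs end up on nearby nodes.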
Self-Organizing Map Example
Virginica Virginica Virginica Versicolor Setosa Setosa Setosa Setosa
Virginica Virginica Virginica Versicolor Setosa Setosa
Virginica Virginica Virginica Versicolor Versicolor Setosa Setosa Setosa Setosa
Virginica Virginica Versicolor Setosa Setosa Setosa
Virginica Virginica Virginica Versicolor Versicolor Setosa Setosa Setosa
Virginica Versicolor Versicolor Versicolor Versicolor Setosa
Virginica Virginica Versicolor Versicolor Versicolor Versicolor Versicolor
Virginica Virginica Virginica Versicolor Versicolor Versicolor Versicolor Versicolor Versicolor Versicolor
Virginica Versicolor Versicolor Versicolor Versicolor Versicolor Versicolor Versicolor Versicolor
Virginica Virginica Versicolor Versicolor Virginica Versicolor Versicolor
Iris Data
Self-Organizing Map Example
Class-2 Class-2 Class-2 Class-2 Class-3 Class-2 Class-2 Class-3
Class-2 Class-2 Class-2 Class-2 Class-3 Class-3 Class-2 Class-3
Class-2 Class-2 Class-3 Class-2 Class-2 Class-2 Class-3
Class-2 Class-3 Class-3 Class-3 Class-3 Class-3 Class-1
Class-3 Class-3 Class-2 Class-3 Class-3 Class-3 Class-2 Class-1
Class-3 Class-3 Class-2 Class-3 Class-3 Class-3
Class-3 Class-3 Class-1 Class-1
Class-1 Class-1 Class-1 Class-1 Class-1
Class-2 Class-1 Class-1 Class-3 Class-1 Class-1
Class-2 Class-2 Class-1 Class-1 Class-1 Class-1
Wine Data
Self-Organizing Map Example
Healthy Healthy Healthy Healthy Healthy Healthy Healthy Healthy Healthy Healthy
Sick Healthy Healthy Healthy Healthy Healthy Healthy Healthy Healthy Healthy
Healthy Healthy Healthy Healthy Healthy Healthy Healthy Healthy Healthy Healthy
Healthy Healthy Sick Healthy Sick Healthy Healthy Healthy Healthy Healthy
Healthy Healthy Healthy Healthy Healthy Healthy Healthy Healthy Healthy
Healthy Healthy Sick Healthy Healthy Healthy Healthy Healthy Healthy Healthy
Sick Sick Healthy Sick Sick Healthy Sick Healthy Healthy Healthy
Sick Healthy Healthy Sick Sick Healthy Healthy Healthy Healthy
Sick Sick Healthy Sick Healthy Sick Sick Healthy Healthy
Sick Healthy Sick Sick Sick Sick Sick Sick Healthy Sick
Diabetes Data
Source: McKee
NFL Quarterback Analysis

Data from 2005 for 42 NFL quarterbacks
Preprocessed data to normalize for a full 16-game regular season
Used SOM to cluster individuals based on performance and descriptive data

Source: McKee
NFL Quarterback Analysis

[Figures: SOM maps of QB passing rating and of the overall clustering]
Data Mining Stories - Revisited

Credit card fraud detection
NSA telephone network analysis
Supply chain management
Social Issues of Data Mining

Impacts on personal privacy and confidentiality
Classification and clustering are similar to profiling
Association rules resemble logical implications
Data mining is an imperfect process subject to interpretation
Conclusion

Why data mining?
Example data sets
Data mining methods
Example application of data mining
Social issues of data mining
What on earth would a man do with himself if something did not stand in his way? - H.G. Wells
I don’t think necessity is the mother of invention – invention, in my opinion, arises directly from idleness, probably also from laziness, to save oneself trouble. - Agatha Christie, from “An Autobiography, Pt III, Growing Up”
References

Dunham, Margaret, Data Mining Introductory and Advanced Topics, Pearson Education, Inc., 2003
Fisher, R.A., "The Use of Multiple Measurements in Taxonomic Problems", Annals of Eugenics 7, pp. 179-188
Han, Jiawei, Data Mining: Concepts and Techniques, Elsevier Inc., 2006
Indelicato, Nicolas, Analysis of the K-Nearest Neighbors Algorithm, MATH 4500: Foundations of Data Mining, 2004
McKee, Kevin, The Self Organized Map Applied to 2005 NFL Quarterbacks, MATH 4200: Data Mining Foundations, 2006
Newman, D.J., Hettich, S., Blake, C.L. & Merz, C.J. (1998). UCI Repository of machine learning databases [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California, Department of Information and Computer Science
Seidler, Toby, The C4.5 Project: An Overview of the Algorithm with Results of Experimentation, MATH 4500: Foundations of Data Mining, 2004