TRANSCRIPT

Data Mining Demystified
John Aleshunas
Fall Faculty Institute
October 2006
Data Mining Stories

“My bank called and said that they saw that I bought two surfboards at Laguna Beach, California.” - credit card fraud detection
The NSA is using data mining to analyze telephone call data to track al-Qaeda activities
Victoria’s Secret uses data mining to control product distribution based on typical customer buying patterns at individual stores
Preview

Why data mining?
Example data sets
Data mining methods
Example application of data mining
Social issues of data mining

Source: Han
Why Data Mining?

Database systems have been around since the 1970s
Organizations have a vast digital history of the day-to-day pieces of their processes
Simple queries no longer provide satisfying results:
• They take too long to execute
• They cannot help us find new opportunities

Source: Han
Why Data Mining?

Data doubles about every year while useful information seems to be decreasing
Vast data stores overload traditional decision making processes
We are data rich, but information poor
Data Mining: a definition

Simply stated, data mining refers to the extraction of knowledge from large amounts of data.

Source: Dunham
Data Mining Models: A Taxonomy

Data Mining
• Predictive: Classification, Regression, Time Series Analysis, Prediction
• Descriptive: Clustering, Summarization, Association Rules, Sequence Discovery

Source: Fisher
Iris Dataset

Created by R.A. Fisher (1936)
150 instances
Three cultivars (Setosa, Virginica, Versicolor), 50 instances each
4 measurements (petal width, petal length, sepal width, sepal length)
One cultivar (Setosa) is easily separable; the others are not - noisy data
Iris Dataset Analysis

[Figure 4: Petal width (cm) by record number, plotted separately for Iris-Setosa, Iris-Versicolor, and Iris-Virginica]
[Figure 2: Sepal width (cm) by record number, plotted separately for Iris-Setosa, Iris-Versicolor, and Iris-Virginica]

Source: UCI Machine Learning Repository
Wine Dataset

This data is the result of a chemical analysis of wines grown in the same region in Italy but derived from three different varieties.
153 instances with 13 constituents found in each of the three types of wines.
Wine Dataset Analysis

[Figure: Flavanoid values by instance for Class 1, Class 2, and Class 3]
[Figure: Ash values by instance for Class 1, Class 2, and Class 3]

Source: UCI Machine Learning Repository
Diabetes Dataset

Data is based on a population of women who were at least 21 years old, of Pima Indian heritage, and living near Phoenix in 1990
768 instances
9 attributes (Pregnancies, PG Concentration, Diastolic BP, Tri Fold Thick, Serum Ins, BMI, DP Function, Age, Diabetes)
Dataset has many missing values; only 532 instances are complete
Diabetes Dataset Analysis

[Figure: PG Concentration values by instance for Healthy and Sick groups]
[Figure: Diastolic BP values by instance for Healthy and Sick groups]
Classification

Classification builds a model using a training dataset with known classes of data
That model is used to classify new, unknown data into those classes
Classification Techniques

K-Nearest Neighbors
Decision Tree Classification (ID3, C4.5)
K-Nearest Neighbors Example

[Figure: two classes of labeled points, A and B, with an unknown point X assigned to the class of its nearest neighbors]

• Easy to explain
• Simple to implement
• Sensitive to the selection of the classification population
• Not always conclusive for complex data

Source: Indelicato
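The neighbor-vote idea can be sketched in a few lines of Python. The points and labels below are hypothetical toy data in the spirit of the A/B figure, not actual coordinates from any of the datasets:

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Classify `query` by majority vote among its k nearest
    training points, using Euclidean distance."""
    neighbors = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical two-class points: one cloud of A's, one cloud of B's
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
         ((4.0, 4.0), "B"), ((4.2, 3.9), "B"), ((3.8, 4.1), "B")]

print(knn_classify(train, (1.1, 0.9)))  # A
print(knn_classify(train, (4.1, 4.0)))  # B
```

The sensitivity noted above shows up directly here: changing the training population or k changes the vote.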
K-Nearest Neighbors Example

Misclassification Percentages

Iris Dataset
                All Attributes     Petal Length and Petal Width
  Setosa        0/150 = 0%         0/150 = 0%
  Versicolor    0/150 = 0%         0/150 = 0%
  Virginica     9/150 = 6%         7/150 = 4.67%
  Total         6%                 4.67%

Wine Dataset
                All Attributes     Phenols, Flavanoids, OD280/OD315
  Class 1       0/153 = 0%         2/153 = 1.31%
  Class 2       9/153 = 5.88%      30/153 = 19.61%
  Class 3       0/153 = 0%         0/153 = 0%
  Total         5.88%              20.92%
Decision Tree Example (C4.5)

C4.5 is a decision tree generating algorithm based on the ID3 algorithm. It contains several improvements, especially ones needed for software implementation.
Choice of the best splitting attribute is based on an entropy calculation.
These improvements include:
• Choosing an appropriate attribute selection measure
• Handling training data with missing attribute values
• Handling attributes with differing costs
• Handling continuous attributes

Source: Seidler
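The entropy calculation behind the split choice can be illustrated with a short sketch. ID3 picks the attribute with the highest information gain (C4.5 refines this with a gain ratio); the rows below are a made-up toy table, not the actual iris or wine data:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attr, label):
    """Entropy reduction from splitting `rows` (list of dicts) on `attr`."""
    base = entropy([r[label] for r in rows])
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[label] for r in rows if r[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Hypothetical training rows: "petal" separates the classes
# perfectly, "sepal" tells us nothing.
rows = [
    {"petal": "short", "sepal": "wide",   "cls": "setosa"},
    {"petal": "short", "sepal": "narrow", "cls": "setosa"},
    {"petal": "long",  "sepal": "wide",   "cls": "virginica"},
    {"petal": "long",  "sepal": "narrow", "cls": "virginica"},
]
print(information_gain(rows, "petal", "cls"))  # 1.0 (perfect split)
print(information_gain(rows, "sepal", "cls"))  # 0.0 (uninformative)
```

The tree builder simply recurses: pick the highest-gain attribute, split, and repeat on each subset.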
Decision Tree Example (C4.5)

Iris dataset: Accuracy 97.67%
Wine dataset: Accuracy 86.7%
Decision Tree Example (C4.5)

C4.5 produces a complex tree (195 nodes)
The simplified (pruned) tree reduces the classification accuracy

Diabetes dataset
            Before Pruning    After Pruning
  Size      195               69
  Errors    40 (5.2%)         102 (13.3%)
  Accuracy  94.8%             86.7%
Association Rules

Association rules are used to show the relationships between data items.
Purchasing one product when another product is purchased is an example of an association rule.
They do not represent any causality or correlation.
Association Rule Techniques

Market Basket Analysis

Terminology
• Transaction database
• Association rule - implication {A, B} => {C}
• Support - % of transactions in which {A, B, C} occurs
• Confidence - ratio of the number of transactions that contain {A, B, C} to the number of transactions that contain {A, B}

Source: UCI Machine Learning Repository
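Both measures fall straight out of the transaction database. A minimal sketch, using a hypothetical market-basket list rather than any real transaction data:

```python
def support(transactions, itemset):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction
    that also contain the consequent."""
    return (support(transactions, set(antecedent) | set(consequent))
            / support(transactions, antecedent))

# Hypothetical market-basket transactions
transactions = [{"bread", "milk"},
                {"bread", "milk", "eggs"},
                {"bread", "eggs"},
                {"milk", "eggs"}]

print(support(transactions, {"bread", "milk"}))       # 0.5
print(confidence(transactions, {"bread"}, {"milk"}))  # 2/3
```

Rule-mining algorithms such as Apriori then search for all rules whose support and confidence clear user-set thresholds.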
Association Rule Example
1984 United States Congressional Voting Records Database

Attribute Information:
1. Class Name: 2 (democrat, republican)
2. handicapped-infants: 2 (y,n)
3. water-project-cost-sharing: 2 (y,n)
4. adoption-of-the-budget-resolution: 2 (y,n)
5. physician-fee-freeze: 2 (y,n)
6. El-Salvador-aid: 2 (y,n)
7. religious-groups-in-schools: 2 (y,n)
8. anti-satellite-test-ban: 2 (y,n)
9. aid-to-Nicaraguan-contras: 2 (y,n)
10. MX-missile: 2 (y,n)
11. immigration: 2 (y,n)
12. synfuels-corporation-cutback: 2 (y,n)
13. education-spending: 2 (y,n)
14. superfund-right-to-sue: 2 (y,n)
15. crime: 2 (y,n)
16. duty-free-exports: 2 (y,n)
17. export-administration-act-south-africa: 2 (y,n)

Rules:
{budget resolution = no, MX-missile = no, aid to El Salvador = yes} => {Republican}, confidence 91.0%
{budget resolution = yes, MX-missile = yes, aid to El Salvador = no} => {Democrat}, confidence 97.5%
{crime = yes, right-to-sue = yes, Physician fee freeze = yes} => {Republican}, confidence 93.5%
{crime = no, right-to-sue = no, Physician fee freeze = no} => {Democrat}, confidence 100.0%
Clustering

Clustering is similar to classification in that data are grouped.
Unlike classification, the groups are not predefined; they are discovered.
Grouping is accomplished by finding similarities between data according to characteristics found in the actual data.
Clustering Techniques

K-Means Clustering
Neural Network Clustering (SOM)
K-Means Example

The K-Means algorithm is a method to cluster objects based on their attributes into k partitions.
It assumes that the k clusters exhibit normal distributions.
Its objective is to minimize the variance within the clusters.
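The iteration itself (Lloyd's algorithm) is short enough to sketch in full. The 1-D measurements below are made up for the example, and the deterministic seeding is a simplification; real implementations usually use random restarts or k-means++:

```python
def kmeans_1d(points, iters=20):
    """Lloyd's algorithm on 1-D data with k = 3: assign each point
    to its nearest mean, recompute the means, and repeat."""
    # Deterministic seeding (min, overall mean, max) keeps this
    # sketch reproducible.
    means = [min(points), sum(points) / len(points), max(points)]
    for _ in range(iters):
        clusters = [[] for _ in means]
        for p in points:
            nearest = min(range(len(means)), key=lambda i: abs(p - means[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous mean
        means = [sum(c) / len(c) if c else means[i]
                 for i, c in enumerate(clusters)]
    return means, clusters

# Hypothetical 1-D measurements with three visible groups
points = [0.2, 0.3, 0.25, 1.3, 1.4, 1.35, 2.1, 2.3, 2.2]
means, clusters = kmeans_1d(points)
print(sorted(round(m, 2) for m in means))  # [0.25, 1.35, 2.2]
```

Each pass shrinks the within-cluster variance, which is exactly the objective stated above.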
K-Means ExampleK-Means Example
Cluster 1Cluster 1 Cluster 2Cluster 2 Cluster 3Cluster 346 Versicolor46 Versicolor
3 Virginica3 Virginica
Cluster mean 4.22857Cluster mean 4.22857
4 Versicolor4 Versicolor
47 Virginica47 Virginica
Cluster mean 5.55686Cluster mean 5.55686
50 Setosa50 Setosa
Cluster mean 1.46275Cluster mean 1.46275
Cluster 1Cluster 1 Cluster 2Cluster 2 Cluster 3Cluster 347 Versicolor47 Versicolor
49 Virginica49 Virginica
Mean 6.30, 2.89, 4.96, 1.70Mean 6.30, 2.89, 4.96, 1.70
21 Setosa21 Setosa
1 Virginica1 Virginica
Mean 4.59, 3.07, 1.44, 0.29Mean 4.59, 3.07, 1.44, 0.29
29 Setosa29 Setosa
3 Versicolor3 Versicolor
Mean 5.21, 3.53, 1.67, 0.35Mean 5.21, 3.53, 1.67, 0.35
Iris dataset, only the petal width attribute, Accuracy 95.33%Iris dataset, only the petal width attribute, Accuracy 95.33%
Iris dataset, all attributes, Accuracy 66.0Iris dataset, all attributes, Accuracy 66.0 %%
Cluster Cluster 11
Cluster Cluster 22
Cluster Cluster 33
Cluster Cluster 44
Cluster Cluster 55
Cluster Cluster 66
Cluster Cluster 77
23 23 VirginicaVirginica
1 Virginica1 Virginica 26 Setosa26 Setosa 12 12 VirginicaVirginica
24 24 VersicolorVersicolor
1 Virginica1 Virginica
26 26 VersicolorVersicolor
13 13 VirginicaVirginica
24 Setosa24 Setosa
Iris dataset, all attributes, Accuracy 90.67Iris dataset, all attributes, Accuracy 90.67 %%
Self-Organizing Map Example

The Self-Organizing Map was first described by the Finnish professor Teuvo Kohonen and is thus sometimes referred to as a Kohonen map.
SOM is especially good for visualizing high-dimensional data.
SOM maps input vectors onto a two-dimensional grid of nodes.
Nodes that are close together have similar attribute values and nodes that are far apart have different attribute values.
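The training loop can be sketched as follows: for each input, find the best-matching node, then pull that node and its grid neighbors toward the input, shrinking the neighborhood and learning rate over time. Grid size, decay schedules, and the toy inputs are all illustrative choices, not Kohonen's exact formulation:

```python
import math
import random

def train_som(data, rows=4, cols=4, epochs=60):
    """Train a small self-organizing map on a list of input vectors."""
    random.seed(0)
    dim = len(data[0])
    grid = {(r, c): [random.random() for _ in range(dim)]
            for r in range(rows) for c in range(cols)}
    for epoch in range(epochs):
        alpha = 0.5 * (1 - epoch / epochs)              # decaying learning rate
        radius = max(1.0, (rows / 2) * (1 - epoch / epochs))
        for x in data:
            # Best-matching unit: node whose weights are closest to x
            bmu = min(grid, key=lambda n: math.dist(grid[n], x))
            for node, w in grid.items():
                d = math.dist(node, bmu)                # distance on the grid
                if d <= radius:
                    h = math.exp(-d * d / (2 * radius * radius))
                    for i in range(dim):
                        w[i] += alpha * h * (x[i] - w[i])
    return grid

# Hypothetical 2-D inputs from two well-separated groups: after
# training, each group should map to its own region of the grid.
data = [[0.1, 0.1], [0.15, 0.05], [0.9, 0.9], [0.85, 0.95]]
grid = train_som(data)
bmu_a = min(grid, key=lambda n: math.dist(grid[n], [0.1, 0.1]))
bmu_b = min(grid, key=lambda n: math.dist(grid[n], [0.9, 0.9]))
print(bmu_a != bmu_b)  # the two groups land on different nodes
```

The neighborhood update is what produces the topology seen in the maps that follow: similar inputs end up on nearby nodes.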
Self-Organizing Map Example
Virginica Virginica Virginica Versicolor Setosa Setosa Setosa Setosa
Virginica Virginica Virginica Versicolor Setosa Setosa
Virginica Virginica Virginica Versicolor Versicolor Setosa Setosa Setosa Setosa
Virginica Virginica Versicolor Setosa Setosa Setosa
Virginica Virginica Virginica Versicolor Versicolor Setosa Setosa Setosa
Virginica Versicolor Versicolor Versicolor Versicolor Setosa
Virginica Virginica Versicolor Versicolor Versicolor Versicolor Versicolor
Virginica Virginica Virginica Versicolor Versicolor Versicolor Versicolor Versicolor Versicolor Versicolor
Virginica Versicolor Versicolor Versicolor Versicolor Versicolor Versicolor Versicolor Versicolor
Virginica Virginica Versicolor Versicolor Virginica Versicolor Versicolor
Iris Data
Self-Organizing Map Example
Class-2 Class-2 Class-2 Class-2 Class-3 Class-2 Class-2 Class-3
Class-2 Class-2 Class-2 Class-2 Class-3 Class-3 Class-2 Class-3
Class-2 Class-2 Class-3 Class-2 Class-2 Class-2 Class-3
Class-2 Class-3 Class-3 Class-3 Class-3 Class-3 Class-1
Class-3 Class-3 Class-2 Class-3 Class-3 Class-3 Class-2 Class-1
Class-3 Class-3 Class-2 Class-3 Class-3 Class-3
Class-3 Class-3 Class-1 Class-1
Class-1 Class-1 Class-1 Class-1 Class-1
Class-2 Class-1 Class-1 Class-3 Class-1 Class-1
Class-2 Class-2 Class-1 Class-1 Class-1 Class-1
Wine Data
Self-Organizing Map Example
Healthy Healthy Healthy Healthy Healthy Healthy Healthy Healthy Healthy Healthy
Sick Healthy Healthy Healthy Healthy Healthy Healthy Healthy Healthy Healthy
Healthy Healthy Healthy Healthy Healthy Healthy Healthy Healthy Healthy Healthy
Healthy Healthy Sick Healthy Sick Healthy Healthy Healthy Healthy Healthy
Healthy Healthy Healthy Healthy Healthy Healthy Healthy Healthy Healthy
Healthy Healthy Sick Healthy Healthy Healthy Healthy Healthy Healthy Healthy
Sick Sick Healthy Sick Sick Healthy Sick Healthy Healthy Healthy
Sick Healthy Healthy Sick Sick Healthy Healthy Healthy Healthy
Sick Sick Healthy Sick Healthy Sick Sick Healthy Healthy
Sick Healthy Sick Sick Sick Sick Sick Sick Healthy Sick
Diabetes Data
Source: McKee
NFL Quarterback Analysis

Data from 2005 for 42 NFL quarterbacks
Preprocessed data to normalize for a full 16-game regular season
Used SOM to cluster individuals based on performance and descriptive data

Source: McKee
NFL Quarterback Analysis

[Figures: SOM maps of QB passing rating and of the overall clustering]
Data Mining Stories - Revisited

Credit card fraud detection
NSA telephone network analysis
Supply chain management
Social Issues of Data Mining

Impacts on personal privacy and confidentiality
Classification and clustering are similar to profiling
Association rules resemble logical implications
Data mining is an imperfect process subject to interpretation
Conclusion

Why data mining?
Example data sets
Data mining methods
Example application of data mining
Social issues of data mining
What on earth would a man do with himself if something did not stand in his way? - H.G. Wells
I don’t think necessity is the mother of invention – invention, in my opinion, arises directly from idleness, probably also from laziness, to save oneself trouble. - Agatha Christie, from “An Autobiography, Pt III, Growing Up”
References

Dunham, Margaret, Data Mining Introductory and Advanced Topics, Pearson Education, Inc., 2003
Fisher, R.A., "The Use of Multiple Measurements in Taxonomic Problems", Annals of Eugenics 7, pp. 179-188
Han, Jiawei, Data Mining: Concepts and Techniques, Elsevier Inc., 2006
Indelicato, Nicolas, Analysis of the K-Nearest Neighbors Algorithm, MATH 4500: Foundations of Data Mining, 2004
McKee, Kevin, The Self Organized Map Applied to 2005 NFL Quarterbacks, MATH 4200: Data Mining Foundations, 2006
Newman, D.J., Hettich, S., Blake, C.L. & Merz, C.J. (1998). UCI Repository of machine learning databases [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California, Department of Information and Computer Science
Seidler, Toby, The C4.5 Project: An Overview of the Algorithm with Results of Experimentation, MATH 4500: Foundations of Data Mining, 2004