Data Warehousing FS 2007
Dr. Jens-Peter Dittrich
jens.dittrich@inf
www.inf.ethz.ch/~jensdi
Institute of Information Systems
Announcement: Credit Suisse Workshop (June 8, 8:15 - 12:00)
On June 8, Credit Suisse will present their IT architecture. The workshop will take place from 8:15 am to 12:00 noon in CAB G51 (the regular lecture hall). Of course, there will be breaks and refreshments served. If possible, please plan to attend the whole workshop.
Data Mining
Based on tutorial slides by Gregory Piatetsky-Shapiro, KDnuggets.com
© 2006 KDnuggets

Outline
Introduction
Data Mining Tasks
Classification & Evaluation
Clustering
Application Examples
Trends Leading to Data Flood
More data is generated:
Web, text, images, …
Business transactions, calls, ...
Scientific data: astronomy, biology, etc.
More data is captured:
Storage technology is faster and cheaper
DBMSs can handle bigger databases
Big Data Examples
Europe's Very Long Baseline Interferometry (VLBI) has 16 telescopes, each of which produces 1 Gigabit/second of astronomical data over a 25-day observation session
storage and analysis are a big problem
AT&T handles billions of calls per day
so much data that it cannot all be stored; analysis has to be done "on the fly", on streaming data
Largest Databases in 2005
Winter Corp. 2005 Commercial Database Survey:
Max Planck Inst. for Meteorology, 222 TB
Yahoo ~100 TB (largest data warehouse)
AT&T ~94 TB
www.wintercorp.com/VLDB/2005_TopTen_Survey/TopTenWinners_2005.asp
From terabytes to exabytes to …
UC Berkeley 2003 estimate: 5 exabytes (5 million terabytes) of new data was created in 2002.
www.sims.berkeley.edu/research/projects/how-much-info-2003/
US produces ~40% of new stored data worldwide
2006 estimate: 161 exabytes (IDC study) www.usatoday.com/tech/news/2007-03-05-data_N.htm
2010 projection: 988 exabytes
Data Growth
In 2 years (2003 to 2005), the size of the largest database TRIPLED!
Data Growth Rate
Twice as much information was created in 2002 as in 1999 (~30% annual growth rate)
Other growth rate estimates are even higher
Very little data will ever be looked at by a human
Knowledge Discovery is NEEDED to make sense and use of data.
Knowledge Discovery Definition
Knowledge Discovery in Data is the non-trivial process of identifying
valid
novel
potentially useful
and ultimately understandable patterns in data.
from Advances in Knowledge Discovery and Data Mining, Fayyad, Piatetsky-Shapiro, Smyth, and Uthurusamy (Chapter 1), AAAI/MIT Press, 1996
Related Fields
Statistics
Machine Learning
Databases
Visualization
Data Mining and Knowledge Discovery
Statistics, Machine Learning and Data Mining
Statistics: more theory-based
more focused on testing hypotheses
Machine learning: more heuristic
focused on improving the performance of a learning agent
also looks at real-time learning and robotics – areas not part of data mining
Data Mining and Knowledge Discovery: integrates theory and heuristics
focus on the entire process of knowledge discovery, including data cleaning, learning, and integration and visualization of results
Distinctions are fuzzy
Historical Note: Many Names of Data Mining
Data Fishing, Data Dredging: 1960s–
used by statisticians (as a pejorative)
Data Mining: 1990s–
used in the DB community and in business
Knowledge Discovery in Databases: 1989–
used by the AI and Machine Learning community
also Data Archaeology, Information Harvesting, Information Discovery, Knowledge Extraction, ...
Currently, Data Mining and Knowledge Discovery are used interchangeably
Data Mining Tasks
Some Definitions
Instance (also Item or Record):
an example, described by a number of attributes
e.g. a day can be described by temperature, humidity and cloud status
Attribute (or Field):
measures an aspect of the Instance, e.g. temperature
Class (Label):
a grouping of instances, e.g. days good for playing
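These definitions map directly onto code. A minimal sketch (the attribute names and values are illustrative, not from the slides):

```python
# One instance (item/record): a day described by its attributes.
instance = {"temperature": "hot", "humidity": "high", "cloud_status": "overcast"}

# The class (label) groups instances, e.g. days that are good for playing.
label = "Play"

# A labeled dataset is then a list of (instance, class) pairs.
dataset = [(instance, label)]
print(dataset[0][1])  # Play
```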
Major Data Mining Tasks
Classification: predicting an item's class
Clustering: finding clusters in data
Associations: e.g. A & B & C occur frequently
Visualization: to facilitate human discovery
Summarization: describing a group
Deviation Detection: finding changes
Estimation: predicting a continuous value
Link Analysis: finding relationships
…
Classification
Learn a method for predicting the instance class from pre-labeled (classified) instances
Many approaches: Statistics, Decision Trees, Neural Networks, ...
Clustering
Find a "natural" grouping of instances given un-labeled data
Association Rules & Frequent Itemsets

Transactions:

| TID | Produce |
|-----|---------|
| 1 | MILK, BREAD, EGGS |
| 2 | BREAD, SUGAR |
| 3 | BREAD, CEREAL |
| 4 | MILK, BREAD, SUGAR |
| 5 | MILK, CEREAL |
| 6 | BREAD, CEREAL |
| 7 | MILK, CEREAL |
| 8 | MILK, BREAD, CEREAL, EGGS |
| 9 | MILK, BREAD, CEREAL |

Frequent Itemsets:
Milk, Bread (4)
Bread, Cereal (3)
Milk, Bread, Cereal (2)
…

Rules:
Milk => Bread (66%)
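The supports and the rule confidence can be checked by direct counting over the nine listed transactions. A brute-force sketch (not the Apriori algorithm, just counting):

```python
# Count itemset supports and one rule's confidence over the transactions above.
transactions = [
    {"MILK", "BREAD", "EGGS"},
    {"BREAD", "SUGAR"},
    {"BREAD", "CEREAL"},
    {"MILK", "BREAD", "SUGAR"},
    {"MILK", "CEREAL"},
    {"BREAD", "CEREAL"},
    {"MILK", "CEREAL"},
    {"MILK", "BREAD", "CEREAL", "EGGS"},
    {"MILK", "BREAD", "CEREAL"},
]

def support(itemset):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

print(support({"MILK", "BREAD"}))            # 4, as on the slide
print(support({"MILK", "BREAD", "CEREAL"}))  # 2

# Confidence of MILK => BREAD: support({MILK, BREAD}) / support({MILK})
conf = support({"MILK", "BREAD"}) / support({"MILK"})
print(int(100 * conf))  # 66: 4 of the 6 milk transactions also contain bread
```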
Visualization & Data Mining
Visualizing the data to facilitate human discovery
Presenting the discovered results in a visually "nice" way
Summarization
Describe the features of the selected group
Use natural language and graphics
Usually in combination with deviation detection or other methods
Data Mining Central Quest
Find true patterns and avoid overfitting (finding seemingly significant but really random patterns due to searching too many possibilities)
Classification Methods
Classification
Learn a method for predicting the instance class from pre-labeled (classified) instances
Many approaches: Regression, Decision Trees, Bayesian, Neural Networks, ...
Given a set of points from known classes, what is the class of a new point?
Classification: Linear Regression
Linear Regression
Decision boundary: w0 + w1 x + w2 y >= 0
Regression computes the wi from the data to minimize the squared error to 'fit' the data
Not flexible enough
Regression for Classification
Any regression technique can be used for classification
Training: perform a regression for each class, setting the output to 1 for training instances that belong to the class, and 0 for those that don't
Prediction: predict the class corresponding to the model with the largest output value (membership value)
For linear regression this is known as multi-response linear regression
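A sketch of multi-response linear regression as described above, using NumPy least squares. The 2-D points and class layout are made up for illustration:

```python
import numpy as np

# One least-squares regression per class with 0/1 targets; predict the class
# whose model gives the largest output.
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],   # class 0
              [6.0, 5.0], [7.0, 6.5], [6.5, 7.0]])  # class 1
y = np.array([0, 0, 0, 1, 1, 1])

Xb = np.hstack([np.ones((len(X), 1)), X])  # add intercept column for w0

# One weight vector per class: fit the 0/1 membership indicator of that class.
W = [np.linalg.lstsq(Xb, (y == c).astype(float), rcond=None)[0] for c in (0, 1)]

def predict(point):
    xb = np.concatenate([[1.0], point])
    outputs = [w @ xb for w in W]   # membership value for each class
    return int(np.argmax(outputs))  # class with the largest output

print(predict([1.2, 1.1]))  # near the class-0 points -> 0
print(predict([6.8, 6.0]))  # near the class-1 points -> 1
```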
Classification: Decision Trees
(scatter plot in the X-Y plane, with splits at X = 5, X = 2 and Y = 3)
if X > 5 then blue
else if Y > 3 then blue
else if X > 2 then green
else blue
DECISION TREE
An internal node is a test on an attribute.
A branch represents an outcome of the test, e.g., Color = red.
A leaf node represents a class label or class label distribution.
At each node, one attribute is chosen to split the training examples into classes as distinct as possible.
A new instance is classified by following a matching path to a leaf node.
Weather Data: Play or not Play?

| Outlook | Temperature | Humidity | Windy | Play? |
|----------|-------------|----------|-------|-------|
| sunny | hot | high | false | No |
| sunny | hot | high | true | No |
| overcast | hot | high | false | Yes |
| rain | mild | high | false | Yes |
| rain | cool | normal | false | Yes |
| rain | cool | normal | true | No |
| overcast | cool | normal | true | Yes |
| sunny | mild | high | false | No |
| sunny | cool | normal | false | Yes |
| rain | mild | normal | false | Yes |
| sunny | mild | normal | true | Yes |
| overcast | mild | high | true | Yes |
| overcast | hot | normal | false | Yes |
| rain | mild | high | true | No |

Note: Outlook is the forecast; no relation to the Microsoft email program.
Example Tree for "Play?"
Outlook:
sunny → Humidity: high → No, normal → Yes
overcast → Yes
rain → Windy: true → No, false → Yes
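The tree above can be transcribed directly into executable rules. A sketch (attribute values as lowercase strings; Temperature never appears because the tree does not test it):

```python
# The example "Play?" tree, written as if/else rules.
def play(outlook, temperature, humidity, windy):
    if outlook == "overcast":
        return "Yes"
    if outlook == "sunny":
        # sunny days: play only when humidity is normal
        return "Yes" if humidity == "normal" else "No"
    # outlook == "rain": play only when it is not windy
    return "No" if windy else "Yes"

print(play("sunny", "hot", "high", False))     # No
print(play("overcast", "hot", "high", False))  # Yes
print(play("rain", "mild", "high", True))      # No
```

Checking it against the weather table shows the tree classifies all 14 training instances correctly.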
Classification: Neural Nets
Can select more complex regions
Can be more accurate
But can also overfit the data – find patterns in random noise
Classification: other approaches
Naïve Bayes
Rules
Support Vector Machines
Genetic Algorithms
…
See www.KDnuggets.com/software/
Evaluation
Evaluating which method works best for classification
No model is uniformly the best
Dimensions for comparison:
speed of training
speed of model application
noise tolerance
explanation ability
Best results: hybrid, integrated models
Comparison of Major Classification Approaches

| | Train time | Run time | Noise tolerance | Can use prior knowledge | Accuracy on customer modelling | Understandable |
|---|---|---|---|---|---|---|
| Decision Trees | fast | fast | poor | no | medium | medium |
| Rules | med | fast | poor | no | medium | good |
| Neural Networks | slow | fast | good | no | good | poor |
| Bayesian | slow | fast | good | yes | good | good |

A hybrid method will have higher accuracy
Evaluation of Classification Models
How predictive is the model we learned?
Error on the training data is not a good indicator of performance on future data
The new data will probably not be exactly the same as the training data!
Overfitting – fitting the training data too precisely – usually leads to poor results on new data
Evaluation issues
Possible evaluation measures:
Classification Accuracy
Total cost/benefit – when different errors involve different costs
Lift and ROC (Receiver Operating Characteristic) curves
Error in numeric predictions
How reliable are the predicted results?
Classifier error rate
Natural performance measure for classification problems: error rate
Success: the instance's class is predicted correctly
Error: the instance's class is predicted incorrectly
Error rate: proportion of errors made over the whole set of instances
The training set error rate is way too optimistic!
you can find patterns even in random data
Evaluation on "LARGE" data
If many (>1000) examples are available, including >100 examples from each class,
a simple evaluation will give useful results
Randomly split the data into training and test sets (usually 2/3 for training, 1/3 for testing)
Build a classifier using the training set and evaluate it using the test set
Classification Step 1: Split data into train and test sets
(diagram: historical data with known results – THE PAST – is split into a training set and a testing set)
Classification Step 2: Build a model on the training set
(diagram: the training set is fed to a model builder; the testing set is held aside)
Classification Step 3: Evaluate on the test set (re-train?)
(diagram: the model built from the training set predicts Y/N for each testing-set instance, and the predictions are evaluated against the known results)
Unbalanced data
Sometimes, classes have very unequal frequency
Attrition prediction: 97% stay, 3% attrite (in a month)
medical diagnosis: 90% healthy, 10% disease
eCommerce: 99% don't buy, 1% buy
Security: >99.99% of Americans are not terrorists
Similar situation with multiple classes
A majority-class classifier can be 97% correct, but useless
Handling unbalanced data – how?
If we have two classes that are very unbalanced, how can we evaluate our classifier method?
Balancing unbalanced data, 1
With two classes, a good approach is to build BALANCED train and test sets, and train the model on a balanced set
randomly select the desired number of minority-class instances
add an equal number of randomly selected majority-class instances
How do we generalize "balancing" to multiple classes?
Balancing unbalanced data, 2
Generalize "balancing" to multiple classes
Ensure that each class is represented with approximately equal proportions in train and test
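One way to sketch this multi-class balancing; the helper name and the dataset are made up:

```python
import random
from collections import defaultdict

# Sample the same number of instances from every class.
def balance(dataset, per_class, seed=0):
    """dataset: list of (instance, label); returns a class-balanced sample."""
    by_class = defaultdict(list)
    for item, label in dataset:
        by_class[label].append((item, label))
    rng = random.Random(seed)
    sample = []
    for label, items in by_class.items():
        sample.extend(rng.sample(items, per_class))  # equal count per class
    rng.shuffle(sample)
    return sample

# A very unbalanced three-class dataset: 970 / 20 / 10 instances.
data = ([(i, "A") for i in range(970)]
        + [(i, "B") for i in range(20)]
        + [(i, "C") for i in range(10)])
balanced = balance(data, per_class=10)
print(len(balanced))  # 30: ten instances from each of the three classes
```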
A note on parameter tuning
It is important that the test data is not used in any way to create the classifier
Some learning schemes operate in two stages:
Stage 1: build the basic structure
Stage 2: optimize parameter settings
The test data can't be used for parameter tuning!
The proper procedure uses three sets: training data, validation data, and test data
Validation data is used to optimize parameters
Making the most of the data
Once evaluation is complete, all the data can be used to build the final classifier
Generally, the larger the training data, the better the classifier
The larger the test data, the more accurate the error estimate
Classification: Train, Validation, Test split
(diagram: a model is built on the training set and evaluated on the validation set; parameters are tuned and the model re-built; the final model is then evaluated once on a held-out final test set)
Cross-validation
Cross-validation avoids overlapping test sets
First step: the data is split into k subsets of equal size
Second step: each subset in turn is used for testing and the remaining k-1 subsets for training
This is called k-fold cross-validation
Often the subsets are stratified before the cross-validation is performed (stratified = each subset has the same class distribution as the whole data)
The error estimates are averaged to yield an overall error estimate
Cross-validation example:
break up the data into subsets of the same size
hold aside one subset for testing and use the rest for training
repeat for each subset
More on cross-validation
Standard method for evaluation: stratified ten-fold cross-validation
Why ten? Extensive experiments have shown that this is the best choice to get an accurate estimate
Stratification reduces the estimate's variance
Even better: repeated stratified cross-validation
e.g. ten-fold cross-validation is repeated ten times and the results are averaged (further reduces the variance)
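The index bookkeeping behind k-fold cross-validation can be sketched as follows (stratification omitted for brevity; for simplicity this assumes k divides n evenly):

```python
# Each of the k folds is the test set exactly once; the rest is training data.
def k_fold_splits(n, k):
    """Yield (train_indices, test_indices) pairs for n items and k folds."""
    fold_size = n // k
    indices = list(range(n))
    for fold in range(k):
        test = indices[fold * fold_size:(fold + 1) * fold_size]
        train = indices[:fold * fold_size] + indices[(fold + 1) * fold_size:]
        yield train, test

splits = list(k_fold_splits(n=20, k=10))
print(len(splits))   # 10 folds
print(splits[0][1])  # first test fold: [0, 1]
```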
Direct Marketing Paradigm
Find the most likely prospects to contact
Not everybody needs to be contacted
The number of targets is usually much smaller than the number of prospects
Typical applications:
retailers, catalogues, direct mail (and e-mail)
customer acquisition, cross-sell, attrition prediction
...
Direct Marketing Evaluation
Accuracy on the entire dataset is not the right measure
Approach:
develop a target model
score all prospects and rank them by decreasing score
select the top P% of prospects for action
How do we decide what is the best subset of prospects?
Model-Sorted List

Use a model to assign a score to each customer
Sort customers by decreasing score
Expect more targets (hits) near the top of the list

| No | CustID | Age | Score | Target |
|-----|--------|-----|-------|--------|
| 1 | 1746 | … | 0.97 | Y |
| 2 | 1024 | … | 0.95 | N |
| 3 | 2478 | … | 0.94 | Y |
| 4 | 3820 | … | 0.93 | Y |
| 5 | 4897 | … | 0.92 | N |
| … | … | … | … | … |
| 99 | 2734 | … | 0.11 | N |
| 100 | 2422 | … | 0.06 | N |

3 hits in the top 5% of the list
If there are 15 targets overall, then the top 5 has 3/15 = 20% of targets
CPH (Cumulative Pct Hits)
(chart: Cumulative % Hits vs. Pct of list; the Random line has 5% of targets in the first 5% of the list)
Definition: CPH(P, M) = % of all targets in the first P% of the list scored by model M
CPH is frequently called Gains
CPH: Random List vs Model-ranked List
(chart: Random and Model gains curves; 5% of the random list has 5% of the targets, but 5% of the model-ranked list has 21% of the targets: CPH(5%, model) = 21%)
Lift
(chart: Lift vs. percent of list)
Lift(P, M) = CPH(P, M) / P, where P is the percent of the list
Lift (at 5%) = 21% / 5% = 4.2 times better than random
Note: some authors use "Lift" for what we call CPH.
Lift – a measure of model quality
Lift helps us decide which models are better
If cost/benefit values are not available or changing, we can use Lift to select a better model
The model with the higher Lift curve will generally be better
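CPH and Lift can be computed directly from a model-scored list. A sketch on made-up data with 100 prospects and 15 targets, 3 of which land in the top 5% (mirroring the model-sorted list slide rather than the 21%/4.2 chart):

```python
# Sort by decreasing score, then compare the share of targets found in the
# top P% of the list against P itself.
def cph_and_lift(scored, p):
    """scored: list of (score, is_target); p: fraction of the list, e.g. 0.05."""
    ranked = sorted(scored, key=lambda st: st[0], reverse=True)
    top = ranked[:int(round(p * len(ranked)))]
    total_targets = sum(t for _, t in ranked)
    cph = sum(t for _, t in top) / total_targets  # fraction of targets in top P%
    return cph, cph / p                           # Lift = CPH / P

targets = {0, 2, 4, 10, 20, 30, 40, 50, 55, 60, 65, 70, 80, 90, 95}  # 15 targets
scored = [(1.0 - i / 100, 1 if i in targets else 0) for i in range(100)]
cph, lift = cph_and_lift(scored, p=0.05)
print(cph, lift)  # top 5% holds 3/15 = 20% of targets, 4x better than random
```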
Clustering
Classification vs. Clustering
Classification (supervised learning): learns a method for predicting the instance class from pre-labeled (classified) instances
Clustering
Unsupervised learning: finds a "natural" grouping of instances given un-labeled data
Clustering Methods
Many different methods and algorithms:
for numeric and/or symbolic data
deterministic vs. probabilistic
exclusive vs. overlapping
hierarchical vs. flat
top-down vs. bottom-up
Clusters: exclusive vs. overlapping
(diagram: the same items a-k shown twice: as non-overlapping clusters in a simple 2-D representation, and as overlapping clusters in a Venn diagram)
Clustering Evaluation
Manual inspection
Benchmarking on existing labels
Cluster quality measures
distance measures
high similarity within a cluster, low across clusters
The distance function
Simplest case: one numeric attribute A
Distance(X,Y) = |A(X) – A(Y)|
Several numeric attributes:
Distance(X,Y) = Euclidean distance between X and Y
Nominal attributes (i.e., no ordering on the values): distance is 1 if the values are different, 0 if they are equal
Are all attributes equally important?
Weighting the attributes might be necessary
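The distance function described above, with optional attribute weights, might be sketched like this (folding the weights into a Euclidean sum is an illustrative assumption):

```python
import math

# Mixed-attribute distance: numeric positions use the difference, nominal
# positions use 0/1 matching; everything is combined Euclidean-style.
def distance(x, y, weights=None):
    """x, y: equal-length attribute tuples, numeric or nominal per position."""
    weights = weights or [1.0] * len(x)
    total = 0.0
    for a, b, w in zip(x, y, weights):
        if isinstance(a, (int, float)) and isinstance(b, (int, float)):
            d = a - b                   # numeric: difference
        else:
            d = 0.0 if a == b else 1.0  # nominal: 0 if equal, else 1
        total += w * d * d
    return math.sqrt(total)

print(distance((1.0, 4.0), (4.0, 8.0)))       # 5.0 (classic 3-4-5 triangle)
print(distance(("red", 0.0), ("blue", 0.0)))  # 1.0 (nominal mismatch)
```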
Simple Clustering: K-means
Works with numeric data only
1) Pick a number (K) of cluster centers (at random)
2) Assign every item to its nearest cluster center (e.g. using Euclidean distance)
3) Move each cluster center to the mean of its assigned items
4) Repeat steps 2-3 until convergence (change in cluster assignments less than a threshold)
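A minimal sketch of these four steps in pure Python on 2-D points (convergence is taken as "assignments stop changing" rather than a threshold):

```python
import random

def kmeans(points, k, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # 1) pick K cluster centers at random
    assignment = None
    while True:
        # 2) assign every point to its nearest center (squared Euclidean dist.)
        new_assignment = [
            min(range(k), key=lambda c: (p[0] - centers[c][0]) ** 2
                                        + (p[1] - centers[c][1]) ** 2)
            for p in points]
        if new_assignment == assignment:  # 4) stop when nothing changes
            return centers, assignment
        assignment = new_assignment
        # 3) move each center to the mean of its assigned points
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = (sum(p[0] for p in members) / len(members),
                              sum(p[1] for p in members) / len(members))

# Two obvious groups; K-means should place one center in each.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
          (9.0, 9.0), (9.1, 9.2), (9.2, 9.1)]
centers, assignment = kmeans(points, k=2)
print(sorted(assignment))  # three points in each cluster
```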
K-means example, step 1
(diagram: points in the X-Y plane; pick 3 initial cluster centers c1, c2, c3 at random)
K-means example, step 2
(diagram: assign each point to the closest cluster center)
K-means example, step 3
(diagram: move each cluster center to the mean of its assigned points)
K-means example, step 4a
(diagram: reassign points that are now closest to a different cluster center. Q: Which points are reassigned?)
K-means example, step 4b
(diagram: the same reassignment question, shown on the updated centers)
K-means example, step 4c
(diagram: A: three points are reassigned)
K-means example, step 4d
(diagram: re-compute the cluster means)
K-means example, step 5
(diagram: move the cluster centers to the cluster means)
Discussion, 1
What can be the problems with K-means clustering?
Discussion, 2
The result can vary significantly depending on the initial choice of seeds (number and position)
Q: What can be done?
To increase the chance of finding the global optimum: restart with different random seeds.
K-means clustering summary
Advantages:
simple, understandable
items are automatically assigned to clusters
Disadvantages:
must pick the number of clusters beforehand
all items are forced into a cluster
too sensitive to outliers
K-means clustering – outliers?
What can be done about outliers?
K-means variations
K-medoids – instead of the mean, use the median of each cluster
Mean of 1, 3, 5, 7, 9 is 5
Mean of 1, 3, 5, 7, 1009 is 205
Median of 1, 3, 5, 7, 1009 is 5
Median advantage: not affected by extreme values
For large databases, use sampling
*Hierarchical clustering
Bottom up:
start with single-instance clusters
at each step, join the two closest clusters
design decision: distance between clusters, e.g. two closest instances in the clusters vs. distance between their means
Top down:
start with one universal cluster
find two clusters
proceed recursively on each subset
can be very fast
Both methods produce a dendrogram
(dendrogram with leaves g a c i e d k b j f h)
Data Mining Applications
Problems Suitable for Data Mining
require knowledge-based decisions
have a changing environment
have sub-optimal current methods
have accessible, sufficient, and relevant data
provide high payoff for the right decisions!
![Page 86: Data Warehousing FS 2007](https://reader034.vdocuments.us/reader034/viewer/2022051514/54b418b54a79599e1f8b470d/html5/thumbnails/86.jpg)
Major Application Areas for Data Mining Solutions
 Advertising
 Bioinformatics
 Customer Relationship Management (CRM)
 Database Marketing
 Fraud Detection
 eCommerce
 Health Care
 Investment/Securities
 Manufacturing, Process Control
 Sports and Entertainment
 Telecommunications
 Web
![Page 87: Data Warehousing FS 2007](https://reader034.vdocuments.us/reader034/viewer/2022051514/54b418b54a79599e1f8b470d/html5/thumbnails/87.jpg)
Application: Search Engines
Before Google, web search engines used mainly keywords on a page – results were easily subject to manipulation
Google's early success was partly due to its algorithm, which uses mainly links to the page
Google founders Sergey Brin and Larry Page were students at Stanford in the 1990s
Their research in databases and data mining led to Google
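The link-based idea behind Google's ranking can be illustrated with a toy PageRank power iteration. This is a textbook sketch, not Google's actual algorithm or data; the example graph and parameter values are made up.

```python
def pagerank(links, damping=0.85, iters=50):
    """Toy PageRank by power iteration.

    `links` maps each page to the list of pages it links to.
    Returns page -> rank; ranks sum to (approximately) 1."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1.0 - damping) / n for p in pages}
        for p, outs in links.items():
            targets = outs if outs else pages   # dangling page: spread evenly
            for q in targets:
                new[q] += damping * rank[p] / len(targets)
        rank = new
    return rank

# Pages B, C, D all link to A, so A ends up with the highest rank:
# the rank comes from who links to you, not from your own keywords.
ranks = pagerank({"A": ["B"], "B": ["A"], "C": ["A"], "D": ["A"]})
```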
![Page 88: Data Warehousing FS 2007](https://reader034.vdocuments.us/reader034/viewer/2022051514/54b418b54a79599e1f8b470d/html5/thumbnails/88.jpg)
Microarrays: Classifying Leukemia
Leukemia: Acute Lymphoblastic (ALL) vs. Acute Myeloid (AML), Golub et al., Science, v. 286, 1999
72 examples (38 train, 34 test), about 7,000 genes
(Figure: ALL and AML microarray images)
Visually similar, but genetically very different
Best model: 97% accuracy, 1 error (sample suspected mislabelled)
![Page 89: Data Warehousing FS 2007](https://reader034.vdocuments.us/reader034/viewer/2022051514/54b418b54a79599e1f8b470d/html5/thumbnails/89.jpg)
Microarray Potential Applications
New and better molecular diagnostics
 Jan 11, 2005: FDA approved the Roche Diagnostics AmpliChip, based on Affymetrix technology
New molecular targets for therapy
 few new drugs, large pipeline, …
Improved treatment outcome
 partially depends on genetic signature
Fundamental biological discovery
 finding and refining biological pathways
Personalized medicine?!
![Page 90: Data Warehousing FS 2007](https://reader034.vdocuments.us/reader034/viewer/2022051514/54b418b54a79599e1f8b470d/html5/thumbnails/90.jpg)
Application: Direct Marketing and CRM
Most major direct marketing companies are using modeling and data mining
Most financial companies are using customer modeling
Modeling is easier than changing customer behaviour
Example: Verizon Wireless reduced its customer attrition rate (churn rate) from 2% to 1.5%, saving many millions of dollars
![Page 91: Data Warehousing FS 2007](https://reader034.vdocuments.us/reader034/viewer/2022051514/54b418b54a79599e1f8b470d/html5/thumbnails/91.jpg)
Application: e-Commerce
Amazon.com recommendations
if you bought (viewed) X, you are likely to buy Y
Netflix
If you liked "Monty Python and the Holy Grail",
you get a recommendation for "This is Spinal Tap"
Comparison shopping
Froogle, mySimon, Yahoo Shopping, …
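The "if you bought X, you are likely to buy Y" idea can be approximated with simple co-purchase counting. This is a toy sketch of the concept, not Amazon's or Netflix's actual method; the function names and data are made up.

```python
from collections import Counter
from itertools import combinations

def build_copurchase(baskets):
    """Count how often each pair of items appears in the same basket."""
    pair_counts = Counter()
    for basket in baskets:
        for a, b in combinations(sorted(set(basket)), 2):
            pair_counts[(a, b)] += 1
    return pair_counts

def recommend(item, pair_counts, top=3):
    """Return the items most often bought together with `item`."""
    scores = Counter()
    for (a, b), n in pair_counts.items():
        if a == item:
            scores[b] += n
        elif b == item:
            scores[a] += n
    return [other for other, _ in scores.most_common(top)]
```

Given baskets [["X", "Y"], ["X", "Y"], ["X", "Z"]], recommend("X", ...) ranks Y above Z because they were co-purchased more often.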
![Page 92: Data Warehousing FS 2007](https://reader034.vdocuments.us/reader034/viewer/2022051514/54b418b54a79599e1f8b470d/html5/thumbnails/92.jpg)
Application: Security and Fraud Detection
Credit card fraud detection
 over 20 million credit cards protected by neural networks (Fair, Isaac)
Securities fraud detection
 NASDAQ KDD system
Phone fraud detection
 AT&T, Bell Atlantic, British Telecom/MCI
![Page 93: Data Warehousing FS 2007](https://reader034.vdocuments.us/reader034/viewer/2022051514/54b418b54a79599e1f8b470d/html5/thumbnails/93.jpg)
Data Mining, Privacy, and Security
TIA: Terrorism (formerly Total) Information Awareness Program
 TIA program closed by Congress in 2003 because of privacy concerns
However, in 2006 we learned that the NSA is analyzing US domestic call info to find potential terrorists
Invasion of privacy or needed intelligence?
![Page 94: Data Warehousing FS 2007](https://reader034.vdocuments.us/reader034/viewer/2022051514/54b418b54a79599e1f8b470d/html5/thumbnails/94.jpg)
Criticism of Analytic Approaches to Threat Detection
Data mining will be ineffective – generate millions of false positives – and invade privacy
First, can data mining be effective?
![Page 95: Data Warehousing FS 2007](https://reader034.vdocuments.us/reader034/viewer/2022051514/54b418b54a79599e1f8b470d/html5/thumbnails/95.jpg)
Can Data Mining and Statistics be Effective for Threat Detection?
Criticism: Databases have 5% errors, so analyzing 100 million suspects will generate 5 million false positives
Reality: Analytical models correlate many items of information to reduce false positives
Example: Identify one biased coin out of 1,000
 After one throw of each coin, we cannot tell the biased coin from the fair ones
 After 30 throws, the one biased coin will stand out with high probability
 Can identify 19 biased coins out of 100 million with a sufficient number of throws
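The coin example is easy to simulate. In this sketch, coin 0 is the biased one; the bias probability and seed are arbitrary choices for illustration, not values from the slide.

```python
import random

def find_biased_coin(n_coins=1000, n_throws=30, p_biased=0.95, seed=42):
    """Throw every coin n_throws times; return the index with the most heads.

    Coin 0 is biased.  After a single throw it is indistinguishable from
    the fair coins, but after ~30 throws its head count almost always
    stands out above all 999 fair coins."""
    rng = random.Random(seed)
    heads = []
    for coin in range(n_coins):
        p = p_biased if coin == 0 else 0.5
        heads.append(sum(rng.random() < p for _ in range(n_throws)))
    return max(range(n_coins), key=heads.__getitem__)
```

Correlating many observations of the same suspect is what drives the false-positive rate down, exactly as the slide argues.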
![Page 96: Data Warehousing FS 2007](https://reader034.vdocuments.us/reader034/viewer/2022051514/54b418b54a79599e1f8b470d/html5/thumbnails/96.jpg)
Another Approach: Link Analysis
Can find unusual patterns in the network structure
![Page 97: Data Warehousing FS 2007](https://reader034.vdocuments.us/reader034/viewer/2022051514/54b418b54a79599e1f8b470d/html5/thumbnails/97.jpg)
Analytic technology can be effective
Data mining is just one additional tool to help analysts
Combining multiple models and link analysis can reduce false positives
Today there are millions of false positives with manual analysis
Analytic technology has the potential to reduce the current high rate of false positives
![Page 98: Data Warehousing FS 2007](https://reader034.vdocuments.us/reader034/viewer/2022051514/54b418b54a79599e1f8b470d/html5/thumbnails/98.jpg)
Data Mining with Privacy
Data Mining looks for patterns, not people!
Technical solutions can limit privacy invasion
Replacing sensitive personal data with anon. ID
Give randomized outputs
Multi-party computation – distributed data
…
Bayardo & Srikant, Technological Solutions for Protecting Privacy, IEEE Computer, Sep 2003
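"Give randomized outputs" can be made concrete with randomized response (Warner, 1965), a classic technique in this family; the 50/50 coin-flip design below is one standard variant, chosen here for illustration.

```python
import random

def randomized_response(true_answer, rng):
    """With probability 1/2 report the truth; otherwise report a coin flip.
    No single reported answer can be trusted, which protects the individual."""
    if rng.random() < 0.5:
        return true_answer
    return rng.random() < 0.5

def estimate_true_rate(reports):
    """The population rate is still recoverable:
    E[reported yes] = 0.5 * p_true + 0.25, so invert that relation."""
    observed = sum(reports) / len(reports)
    return (observed - 0.25) / 0.5
```

Aggregating many randomized reports recovers the population rate closely, while no individual answer is exposed – patterns, not people.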
![Page 99: Data Warehousing FS 2007](https://reader034.vdocuments.us/reader034/viewer/2022051514/54b418b54a79599e1f8b470d/html5/thumbnails/99.jpg)
The Hype Curve for Data Mining and Knowledge Discovery
(Figure: expectations vs. performance over 1990–2005 – rising expectations, over-inflated expectations, disappointment, then growing acceptance and mainstreaming)
![Page 100: Data Warehousing FS 2007](https://reader034.vdocuments.us/reader034/viewer/2022051514/54b418b54a79599e1f8b470d/html5/thumbnails/100.jpg)
Summary
Data Mining and Knowledge Discovery are needed to deal with the flood of data
Knowledge Discovery is a process!
Avoid overfitting (finding random patterns by searching too many possibilities)
![Page 101: Data Warehousing FS 2007](https://reader034.vdocuments.us/reader034/viewer/2022051514/54b418b54a79599e1f8b470d/html5/thumbnails/101.jpg)
Additional Resources
www.KDnuggets.com – data mining software, jobs, courses, etc.
www.acm.org/sigkdd – ACM SIGKDD, the professional society for data mining