Data Mining: Staying Ahead in the Information Age
A Tutorial in Data Mining, YOR11, Cambridge, 29th March 2000.
Robert Burbidge
Computer Science, UCL, London, UK.
http://www.cs.ucl.ac.uk/staff/r.burbidge
Definition
‘We are drowning in information, but starving for knowledge’
John Naisbitt
• Data Mining is the search for ‘nuggets’ of useful information
• Data Mining is an automated search for ‘interesting’ patterns in large databases
Overview
[Diagram labels: Data, Pre-Processing, Analysis, Business Solutions; Aims; Domain Knowledge]
Before We Begin ...
• Getting the Data
• Assessing Usefulness of the Data
• Noise in the Data
• Volume of Available Data
• Domain Knowledge and Expertise
Getting the Data
• Are the data easily available?
– What format are the data in?
– Are the data in a live database or a data warehouse?
– Are the data online?
[Figure: raw data in mixed formats (binary strings, hexadecimal codes, text records such as "Jones, H., 24") alongside a data matrix labelled "objects" and "variables"]
Assessing Usefulness of the Data
• Are the available data relevant to the task at hand?
– E.g. to predict ice-cream sales, information about the FTSE would (probably) not be useful
• Are there missing factors which are likely to be predictive?
– E.g. temperature is likely to be predictive of ice-cream sales
Noise in the Data
• Are the data contaminated by noise?
– E.g. experimental error, typing mistakes, corrupted storage media
• Can this be eliminated?
– E.g. improved experimental set-up, data cleaning
• How seriously is this likely to affect the results?
Volume of Available Data
• Are there enough data ...
– ... to learn a useful concept?
– ... to give statistically significant results?
• Should more data be collected?
– More examples
– More information about the examples
– Metadata
Domain Knowledge
• Domain knowledge can be incorporated into some techniques
– To choose priors in Bayesian analysis
– To encode invariances in the data
– Expert systems
• Use of expertise can avoid blind search
– Feature selection
– Building a model
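As one possible illustration of choosing a prior from domain knowledge, the sketch below uses a Beta prior on a response rate. The function name, the prior pseudo-counts and the campaign figures are all hypothetical and are not taken from the tutorial.

```python
def posterior_mean(successes, trials, prior_successes, prior_failures):
    """Posterior mean of a rate under a Beta(prior_successes, prior_failures) prior."""
    return (successes + prior_successes) / (trials + prior_successes + prior_failures)

# Hypothetical campaign: 3 responses from only 4 mailings.
# Flat prior Beta(1, 1): the estimate follows the tiny sample.
flat_estimate = posterior_mean(3, 4, 1, 1)

# Assumed expert belief: response rates are usually near 10%,
# encoded as Beta(2, 18), i.e. worth about 20 earlier observations.
expert_estimate = posterior_mean(3, 4, 2, 18)
```

With so few observations the expert prior keeps the estimate close to the rate the expert expects, while the flat prior lets four data points dominate.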
Résumé 1
• Before we begin we must
– Obtain the data
– Make sure it’s useful
– Make sure there’s enough
– Identify available expert knowledge
• This is all pretty obvious
– If you don’t do this you’re headed for trouble
Pre-Processing
• Visualization
• Feature Selection
• Feature Extraction
• Feature Derivation
• Data Reduction
Visualization
• Histogram plots
– Identify distributions
• Clustering
– k-means
– Kohonen nets
– Relational
– Hierarchical
– Outlier detection
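A minimal sketch of the k-means idea named above, restricted to one-dimensional data so it needs only the standard library. The function name, the fixed iteration count and the toy data are assumptions made for illustration.

```python
import random

def kmeans_1d(values, k, iterations=20, seed=0):
    """Plain k-means (Lloyd's algorithm) on one-dimensional data."""
    rng = random.Random(seed)
    centres = rng.sample(values, k)  # start from k of the data points
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centre.
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centres[i]))
            clusters[nearest].append(v)
        # Update step: move each centre to the mean of its cluster
        # (an empty cluster keeps its previous centre).
        centres = [sum(c) / len(c) if c else centres[i]
                   for i, c in enumerate(clusters)]
    return sorted(centres)

# Hypothetical data: two well-separated groups.
data = [1.0, 1.2, 0.8, 1.1, 9.0, 9.3, 8.7, 9.1]
centres = kmeans_1d(data, k=2)
```

On data this cleanly separated the two centres settle on the group means; real data usually need several restarts and some way of choosing k.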
Feature Selection
• Performance Measures
– Filters
– Wrappers
• Search Algorithms
– Exhaustive
– Branch-and-bound
– Mathematical Programming
– Stochastic
[Figure: feature selection reduces the objects-by-variables data matrix to the same objects with fewer variables]
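The filter approach listed above can be sketched as follows: each feature is scored on its own, here by absolute Pearson correlation with the target, and the top scorers are kept. A wrapper would instead score feature subsets by the performance of the model trained on them. The scoring choice, function names and toy figures (which echo the ice-cream example) are assumptions.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def filter_select(features, target, keep):
    """Rank features by |correlation with target| and keep the best ones."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:keep]

# Hypothetical daily observations.
features = {
    "temperature": [12, 18, 25, 30, 22],
    "ftse_index": [6100, 6400, 6150, 6300, 6500],
    "weekday": [1, 2, 3, 4, 5],
}
ice_cream_sales = [110, 160, 240, 290, 200]
selected = filter_select(features, ice_cream_sales, keep=1)
```

A filter is cheap because no model is trained, but it scores features one at a time and so can miss features that are only useful in combination.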
Feature Extraction
• Domain knowledge
– E.g. edges in images
• Informative features
– Kohonen nets
– Principal components analysis
• Useful for visualization
– Projecting data to two or three dimensions
– Identifying the number of clusters
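A rough sketch of principal components analysis, assuming power iteration on the sample covariance matrix as the way of finding the leading direction (production code would normally call a linear algebra library instead). The function name and the toy data are hypothetical.

```python
import math

def first_principal_component(rows, iterations=100):
    """Return the leading principal direction and each row's score along it."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centred data.
    cov = [[sum(r[i] * r[j] for r in centred) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: repeated multiplication converges to the top eigenvector.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Project every centred row onto the direction: a one-dimensional view.
    scores = [sum(r[j] * v[j] for j in range(d)) for r in centred]
    return v, scores

# Hypothetical data lying close to the line y = 2x.
rows = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.0]]
direction, scores = first_principal_component(rows)
```

Keeping the first two or three such directions gives the low-dimensional projection mentioned on the slide.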
Feature Derivation
• Transforming continuous attributes to discrete attributes
– Fuzzy or rough linguistic concepts
– Binning
• Deriving numeric features
– Products, ratios, differences, etc.
– E.g. taking differences of start and finish times, taking ratios of price changes
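The derivations above might look like this in code. The cut points, labels and helper names are hypothetical; crisp bins are used here, whereas the fuzzy or rough variants on the slide would replace the hard edges with graded membership.

```python
def bin_value(value, edges, labels):
    """Map a continuous value to the label of the first bin whose upper edge exceeds it."""
    for edge, label in zip(edges, labels):
        if value < edge:
            return label
    return labels[-1]  # at or above the last edge

AGE_EDGES = [30, 60]  # assumed cut points
AGE_LABELS = ["young", "middle", "old"]

def duration(start, finish):
    """Difference feature: elapsed time between two timestamps."""
    return finish - start

def price_ratios(prices):
    """Ratio feature: each price divided by the one before it."""
    return [later / earlier for earlier, later in zip(prices, prices[1:])]

age_band = bin_value(45, AGE_EDGES, AGE_LABELS)
ratios = price_ratios([100.0, 110.0, 99.0])
```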
Data Reduction
• Large amounts of data require longer training times
– Some data points are more relevant than others
• Reducing the modality of a variable
– Makes solutions more easily interpretable
[Figure: Support Vector Machine]
Résumé 2
• Assess the data statistically
• Visualize the data
• Identify, extract or create useful features
• Reduce the size of the problem if necessary
Discovering Patterns and Rules
• Rule Induction
• Statistical Pattern Recognition
• Neural Networks
• Hybrid Systems
• Performance Analysis
Rule Induction
• Discover rules that describe the data
– e.g. marketing – who buys what?
  • IF age > 55 AND income > 20 000 THEN holiday
  • IF age < 40 AND age > 20 THEN pension
• Easy to understand – identifies important features
• Can be fuzzified
  • IF age_low AND income_high THEN car_high
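The sketch below does not induce rules; it only shows how a candidate rule such as the holiday rule above could be scored against data, which is the test a rule induction algorithm applies repeatedly. The customer records and the support and confidence measures are assumptions chosen for illustration.

```python
def rule_holiday(customer):
    """IF age > 55 AND income > 20 000 THEN holiday (rule quoted on the slide)."""
    return customer["age"] > 55 and customer["income"] > 20000

def support_and_confidence(rule, customers, outcome):
    """Support: share of customers the rule covers. Confidence: share of those with the outcome."""
    covered = [c for c in customers if rule(c)]
    if not covered:
        return 0.0, 0.0
    correct = sum(1 for c in covered if c[outcome])
    return len(covered) / len(customers), correct / len(covered)

# Hypothetical customer records.
customers = [
    {"age": 62, "income": 30000, "holiday": True},
    {"age": 58, "income": 25000, "holiday": True},
    {"age": 70, "income": 21000, "holiday": False},
    {"age": 35, "income": 40000, "holiday": False},
    {"age": 60, "income": 15000, "holiday": False},
]
support, confidence = support_and_confidence(rule_holiday, customers, "holiday")
```

Here the rule covers three of the five customers and is right for two of those three.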
Statistical Pattern Recognition
• Model the underlying distribution
– Classification
  • Bayesian solution is optimal
  • Gives confidence values
– Regression
  • Identifies useful features
  • Robust techniques to handle noise
• Difficult in many practical applications
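To make "model the underlying distribution" and "gives confidence values" concrete, here is a sketch of a Bayes classifier under the assumption that each class is Gaussian in a single feature with known parameters. The class names, priors, means and standard deviations are invented for the example.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))

def posterior(x, classes):
    """classes maps label -> (prior, mean, std); returns P(label | x) by Bayes' rule."""
    joint = {label: prior * gaussian_pdf(x, mean, std)
             for label, (prior, mean, std) in classes.items()}
    evidence = sum(joint.values())
    return {label: p / evidence for label, p in joint.items()}

# Assumed class models for a customer's age.
classes = {"buyer": (0.5, 60.0, 10.0), "non_buyer": (0.5, 40.0, 10.0)}
undecided = posterior(50.0, classes)  # midway between the class means
confident = posterior(60.0, classes)  # at the buyer mean
```

The posterior is the confidence value: it is even at the midpoint and moves towards "buyer" as the age approaches that class's mean. The difficulty noted on the slide is that in practice the class distributions are not known and must be estimated.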
Neural Networks
• Based on neuronal brain model
• Each neuron forms a weighted sum of its inputs
• Flexible learners
• Prone to over-fitting
• Messy optimization problem
[Figure: network diagram with inputs, hidden layer and output]
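A small sketch of the weighted sum described above, followed by a forward pass through one hidden layer. The slide specifies only the weighted sum; the bias term, the logistic activation and all weights below are assumptions, and no training is shown.

```python
import math

def neuron(inputs, weights, bias):
    """Weighted sum of the inputs passed through a logistic (sigmoid) activation."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-total))

def forward(inputs, hidden_layer, output_unit):
    """inputs -> hidden layer -> single output, as in the slide's diagram."""
    hidden = [neuron(inputs, weights, bias) for weights, bias in hidden_layer]
    out_weights, out_bias = output_unit
    return neuron(hidden, out_weights, out_bias)

# Hypothetical weights for two inputs, two hidden units and one output.
hidden_layer = [([0.5, -0.25], 0.0), ([1.0, 1.0], -1.0)]
output_unit = ([2.0, -1.0], 0.5)
prediction = forward([1.0, 2.0], hidden_layer, output_unit)
```

Training means searching for weights that minimise the error on the data, which is the messy optimization problem the slide refers to.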
Hybrid Systems
• Combine techniques for increased functionality and accuracy
– function replacing
  • neural network accurate but unreadable
  • combine with a decision tree
– committee
  • multiple classifiers with different set-ups
  • aggregate with a decision tree
[Figure: inputs feed NN1, NN2 and NN3; a Decision Tree combines them into the output]
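A deliberately simplified sketch of the committee idea. Two substitutions are made to keep it short: the three members are hand-written threshold rules standing in for NN1 to NN3, and the aggregator is a majority vote rather than the decision tree used on the slide. All names and thresholds are hypothetical.

```python
from collections import Counter

def committee_predict(classifiers, x):
    """Ask every member for a prediction and return the most common answer."""
    votes = [classify(x) for classify in classifiers]
    return Counter(votes).most_common(1)[0][0]

# Stand-ins for three classifiers with different set-ups.
classifiers = [
    lambda x: x["age"] > 50,
    lambda x: x["income"] > 20000,
    lambda x: x["age"] + x["income"] / 1000 > 80,
]

accepted = committee_predict(classifiers, {"age": 60, "income": 25000})
rejected = committee_predict(classifiers, {"age": 60, "income": 15000})
```

A learned aggregator, such as the decision tree on the slide, can weight members differently in different regions of the input space, which a plain vote cannot.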
Performance Analysis
• Accuracy
– error rate
– discrimination
– variable costs
• Readability
• Time
– training
– using
[Figure: ROC curve; Neyman-Pearson at 20%]
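The sketch below computes ROC points from classifier scores and then picks an operating point in the Neyman-Pearson style: the highest true positive rate subject to a cap on the false positive rate. Reading "at 20%" as a 20% false positive cap is an assumption, as are the scores and labels.

```python
def roc_points(scores, labels):
    """Return (threshold, false_positive_rate, true_positive_rate) per candidate threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((threshold, fp / negatives, tp / positives))
    return points

def best_threshold(points, max_fpr):
    """Among points within the false positive cap, return the one with the highest TPR."""
    allowed = [p for p in points if p[1] <= max_fpr]
    return max(allowed, key=lambda p: p[2]) if allowed else None

# Hypothetical classifier scores and true labels (1 = positive).
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 1, 0, 0, 1, 0, 0]
operating_point = best_threshold(roc_points(scores, labels), max_fpr=0.2)
```

Fixing the false positive rate and maximising detection is useful when the two kinds of error have very different costs, which is the "variable costs" item in the list above.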
Résumé 3
• Identify key criteria
• Assess data characteristics
• Choose an algorithm
• Set the parameters
• Try combining multiple techniques to improve results
• Assess statistical significance
Post-Processing
• Understanding
• Significance
• Implementation
Understanding
• What does it mean?
– if easily understandable, does it make sense?
– if numeric, how to interpret?
• Which features were important?
– sensitivity analysis
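One simple form of sensitivity analysis nudges each input in turn and records how far the model output moves. The sketch assumes a one-at-a-time perturbation around a single baseline; the stand-in model and its coefficients are hypothetical.

```python
def sensitivity(model, baseline, delta=1.0):
    """Change in model output when each input is increased by delta, one at a time."""
    base_output = model(baseline)
    effects = {}
    for name in baseline:
        nudged = dict(baseline)
        nudged[name] += delta
        effects[name] = model(nudged) - base_output
    return effects

def sales_model(x):
    """Hypothetical fitted model standing in for any black-box predictor."""
    return 10.0 * x["temperature"] + 0.5 * x["weekday"]

effects = sensitivity(sales_model, {"temperature": 20.0, "weekday": 3.0})
most_important = max(effects, key=lambda name: abs(effects[name]))
```

For a non-linear model the answer depends on the baseline chosen, so the analysis is usually repeated at several representative points.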
Significance
• Are the results interesting?
– are they new and non-obvious?
  • e.g. IF age > 100 THEN NOT pension
– are they relevant?
• What is the significance?
– are further studies required?
  • with more data specific to the discovered pattern
– change of business plan
Implementation
• How to convince the money men
– solid results
– clear and concise
• How to test your hypothesis
– experimental design
– controlled studies to eliminate sampling bias
Résumé 4
• Assess the usefulness of the results
– Interpretability
– Relevance to initial problem
• Identify the next step
– Sales pitch
– Further experiments
– Field trials
– Towards knowledge discovery
Example Applications at UCL
• Intelligent fraud detection with Fuzzy GAs (Lloyds TSB)
• Drug Design by SVMs (SmithKline Beecham and Glaxo-Wellcome)
• Consumer Profiling with Bayes Nets (Unilever)
• Process Control (AstraZeneca)
‘Data Snooping’ – A Warning
• Artefacts – ‘patterns’ that aren’t there
• Sampling bias
• Statistical tests may not show significance
– this does not mean results aren’t significant
• The extremum of a collection of Gaussians is highly skewed – beware coincidence
• Data mining is a dangerous tool in the wrong hands
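The warning about the extremum of a collection of Gaussians can be checked with a short simulation. Each trial draws 100 independent standard normal values, standing in for 100 candidate patterns that contain no real effect, and keeps only the largest. The counts, the seed and the 1.96 cut-off are assumptions for illustration.

```python
import random
import statistics

def max_of_gaussians(n_variables, n_trials, seed=0):
    """For each trial, draw n_variables standard normal values and keep the largest."""
    rng = random.Random(seed)
    return [max(rng.gauss(0.0, 1.0) for _ in range(n_variables))
            for _ in range(n_trials)]

maxima = max_of_gaussians(n_variables=100, n_trials=500)
mean_of_maxima = statistics.mean(maxima)
# Share of trials whose best pure-noise pattern would look significant
# if it were tested on its own against the usual 1.96 cut-off.
share_looking_significant = sum(1 for m in maxima if m > 1.96) / len(maxima)
```

Although every individual value has mean zero, the best of 100 typically lies well above two standard deviations, so reporting only the strongest pattern found in a search will routinely turn coincidence into an apparent discovery.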
Summary
• Get the right data
• Use domain knowledge
• Pre-process the data
• Discover patterns and rules
– machine learning
– statistics
• Analyze results – but be wary
Conclusions
With vast amounts of data available, it has become necessary to use automated techniques.
Advances in data processing, machine learning and statistics have made this possible.
Data mining is a necessary tool for business survival in the information age.
Internet Resources
• www.kdnuggets.com
• www.data-miners.com
• www.crisp-dm.org
• www.research.microsoft.com/profiles/fayyad
• www.cs.sfu.ca/~han
• etc...