data mining and knowledge discovery in databases
![Page 1: Data Mining and Knowledge Discovery in Databases](https://reader035.vdocuments.us/reader035/viewer/2022062216/56649e165503460f94b008e4/html5/thumbnails/1.jpg)
Data Mining and Knowledge Discovery
in Databases
![Page 2: Data Mining and Knowledge Discovery in Databases](https://reader035.vdocuments.us/reader035/viewer/2022062216/56649e165503460f94b008e4/html5/thumbnails/2.jpg)
Outline
• What is Data Mining and KDD?
• Characteristics
• Applications
• Methods
• Packages & Close Relatives
![Page 3: Data Mining and Knowledge Discovery in Databases](https://reader035.vdocuments.us/reader035/viewer/2022062216/56649e165503460f94b008e4/html5/thumbnails/3.jpg)
What is Data Mining & KDD?
• “The process of identifying hidden patterns and relationships within data”
or
• “Data mining helps end users extract useful business information from large databases”
![Page 4: Data Mining and Knowledge Discovery in Databases](https://reader035.vdocuments.us/reader035/viewer/2022062216/56649e165503460f94b008e4/html5/thumbnails/4.jpg)
What’s the Appeal?
• Hidden nuggets of valuable information buried deep within a mountain of otherwise unremarkable data
• Pervasive data
• Seek competitive advantage
![Page 5: Data Mining and Knowledge Discovery in Databases](https://reader035.vdocuments.us/reader035/viewer/2022062216/56649e165503460f94b008e4/html5/thumbnails/5.jpg)
The Challenge
[Slide shows a screen filled with long, undelimited strings of raw numeric record data.]
![Page 6: Data Mining and Knowledge Discovery in Databases](https://reader035.vdocuments.us/reader035/viewer/2022062216/56649e165503460f94b008e4/html5/thumbnails/6.jpg)
Process: Knowledge Discovery In Databases
[Process diagram with the labels: databases; cleaning & integration; data warehouse; collect and transform; data mining (data mining engines, models); discovered patterns; evaluation & presentation; user interface and expert knowledge domain. Feedback arrows: modify data selection; modify methods, parameters.]
![Page 7: Data Mining and Knowledge Discovery in Databases](https://reader035.vdocuments.us/reader035/viewer/2022062216/56649e165503460f94b008e4/html5/thumbnails/7.jpg)
Context
• Where you stand on Data Mining depends on where you sit:
• Business User
• Researcher
• Computer Scientist
![Page 8: Data Mining and Knowledge Discovery in Databases](https://reader035.vdocuments.us/reader035/viewer/2022062216/56649e165503460f94b008e4/html5/thumbnails/8.jpg)
Data Mining Might Mean…
• Statistics
• Visualization
• Artificial intelligence
• Machine learning
• Database technology
• Neural networks
• Pattern recognition
• Knowledge-based systems
• Knowledge acquisition
• Information retrieval
• High performance computing
• And so on...
![Page 9: Data Mining and Knowledge Discovery in Databases](https://reader035.vdocuments.us/reader035/viewer/2022062216/56649e165503460f94b008e4/html5/thumbnails/9.jpg)
What’s needed?
• Suitable data
• Computing power
• Data mining software
• Skilled operator who knows both the nature of the data and the software tools
• Reason, theory, or hunch
![Page 10: Data Mining and Knowledge Discovery in Databases](https://reader035.vdocuments.us/reader035/viewer/2022062216/56649e165503460f94b008e4/html5/thumbnails/10.jpg)
Typical Applications of Data Mining & KDD
• Marketing
  • Market Basket Analysis
  • Customer Relationship Management
  • New Product Development
![Page 11: Data Mining and Knowledge Discovery in Databases](https://reader035.vdocuments.us/reader035/viewer/2022062216/56649e165503460f94b008e4/html5/thumbnails/11.jpg)
Typical Applications of Data Mining & KDD
• Financial Services
  • Credit Approval
  • Fraud Detection
  • Marketing
![Page 12: Data Mining and Knowledge Discovery in Databases](https://reader035.vdocuments.us/reader035/viewer/2022062216/56649e165503460f94b008e4/html5/thumbnails/12.jpg)
Typical Applications of Data Mining & KDD
• Health Care
  • Epidemiological Analysis - incidence and prevalence of disease in large populations and detection of the source and cause of epidemics of infectious disease
  • Knowledge for funding
  • Policy, programs
![Page 13: Data Mining and Knowledge Discovery in Databases](https://reader035.vdocuments.us/reader035/viewer/2022062216/56649e165503460f94b008e4/html5/thumbnails/13.jpg)
Two Basic Approaches
• Supervised
  • A dependent or target variable
• Unsupervised
  • “Pure Data Mining”
  • Fewer assumptions
  • Typically used for clustering techniques
![Page 14: Data Mining and Knowledge Discovery in Databases](https://reader035.vdocuments.us/reader035/viewer/2022062216/56649e165503460f94b008e4/html5/thumbnails/14.jpg)
Automation
• The ability to aim a tool at some data and push a button
• Some methods of KDD/Data mining are more suitable for automation than others
![Page 15: Data Mining and Knowledge Discovery in Databases](https://reader035.vdocuments.us/reader035/viewer/2022062216/56649e165503460f94b008e4/html5/thumbnails/15.jpg)
Seven Basic Methods:
1. Decision Trees
2. (Artificial) Neural Networks
3. Cluster/Nearest Neighbour
4. Genetic Algorithms/Evolutionary Computing
5. Bayesian Networks
6. Statistics
7. Hybrids
![Page 16: Data Mining and Knowledge Discovery in Databases](https://reader035.vdocuments.us/reader035/viewer/2022062216/56649e165503460f94b008e4/html5/thumbnails/16.jpg)
Decision Trees
• Graphical representations of relationships with data
• Excel at Classification & Prediction Models
![Page 17: Data Mining and Knowledge Discovery in Databases](https://reader035.vdocuments.us/reader035/viewer/2022062216/56649e165503460f94b008e4/html5/thumbnails/17.jpg)
Sample of a Decision Tree
[Tree diagram: root node "gender" (female / male); internal nodes "age" (<65 / >=65), "married?", "good health?", "urban?" and "pet owner?" (each yes / no); leaves labelled + or -.]
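A tree like this one can be sketched as nested lookups. The Python below is only an illustration: the attribute names echo the slide, but the branch layout, the key names (such as `age_65_plus`) and the +/- leaves are assumptions, since the exact shape of the original diagram is not recoverable here.

```python
# Hypothetical tree: attribute names follow the slide, but the branch
# layout and the +/- leaves are assumptions made for illustration.
TREE = {
    "attribute": "gender",
    "branches": {
        "female": {
            "attribute": "married",
            "branches": {"yes": "-", "no": "+"},
        },
        "male": {
            "attribute": "age_65_plus",
            "branches": {
                "yes": {
                    "attribute": "good_health",
                    "branches": {"yes": "+", "no": "-"},
                },
                "no": "-",
            },
        },
    },
}


def classify(record, node=TREE):
    """Walk from the root to a leaf, following the branch that matches the record."""
    while isinstance(node, dict):
        value = record[node["attribute"]]
        node = node["branches"][value]
    return node  # a leaf label, "+" or "-"


example = {"gender": "male", "age_65_plus": "yes", "good_health": "yes"}
print(classify(example))  # prints "+"
```

Each record only needs the attributes that lie on its own path through the tree, which is part of why small trees are easy to read.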
![Page 18: Data Mining and Knowledge Discovery in Databases](https://reader035.vdocuments.us/reader035/viewer/2022062216/56649e165503460f94b008e4/html5/thumbnails/18.jpg)
Decision Trees
• Strengths
  • Easily understood and interpreted
  • Represent complexity in a compact form
  • Handle non-linear data well
  • Relatively well suited to automation
• Weaknesses
  • Large trees with large numbers of variables become difficult to understand
  • Missing data must be appropriately managed in construction and use of the models
![Page 19: Data Mining and Knowledge Discovery in Databases](https://reader035.vdocuments.us/reader035/viewer/2022062216/56649e165503460f94b008e4/html5/thumbnails/19.jpg)
Neural Networks
• Derived from Artificial Intelligence Research
• Modelled on the Human Neuron
![Page 20: Data Mining and Knowledge Discovery in Databases](https://reader035.vdocuments.us/reader035/viewer/2022062216/56649e165503460f94b008e4/html5/thumbnails/20.jpg)
Neural Networks
[Network diagram: input variables Age, Gender and Income feed a hidden layer, which feeds a single Prediction output. Connection weights shown: 0.6, 0.3, 0.1, 0.5, 0.7, 0.8, 0.4, 0.3, 0.2.]
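How a network of this kind turns inputs into a prediction can be sketched as a single forward pass. This is a minimal sketch under stated assumptions: two hidden units, a sigmoid activation, inputs scaled to the range 0 to 1, and illustrative weight values. None of these details are confirmed by the slide, and the mapping of weights to connections is a guess.

```python
import math


def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def forward(inputs, hidden_weights, output_weights):
    """One forward pass: inputs -> hidden layer -> single prediction."""
    hidden = [
        sigmoid(sum(w * x for w, x in zip(weights, inputs)))
        for weights in hidden_weights
    ]
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)))


# Assumed layout: three inputs (age, gender, income, each scaled to 0..1),
# two hidden units, one output. The weight values are illustrative only.
hidden_weights = [[0.6, 0.1, 0.7], [0.3, 0.5, 0.8]]
output_weights = [0.4, 0.2]

prediction = forward([0.5, 1.0, 0.3], hidden_weights, output_weights)
print(round(prediction, 3))
```

Training a network means adjusting the weights until predictions match known outcomes; that step is left out of this sketch.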
![Page 21: Data Mining and Knowledge Discovery in Databases](https://reader035.vdocuments.us/reader035/viewer/2022062216/56649e165503460f94b008e4/html5/thumbnails/21.jpg)
Neural Networks
• Strengths
  • Accuracy of prediction
  • Robust performance with a wide variety of data types
• Weaknesses
  • Prone to overfitting
  • Poor clarity of model
![Page 22: Data Mining and Knowledge Discovery in Databases](https://reader035.vdocuments.us/reader035/viewer/2022062216/56649e165503460f94b008e4/html5/thumbnails/22.jpg)
Clustering/Nearest Neighbour
• Aim to assign “like” records to a group
• Groups assigned according to some target variable or criteria
• Nearest neighbour used for prediction
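Nearest-neighbour prediction can be sketched in a few lines: find the records closest to a new case and let them vote. The records, the two features (age and income) and the choice of k = 3 with Euclidean distance are all assumptions for illustration.

```python
import math
from collections import Counter


def nearest_neighbour_predict(train, query, k=3):
    """Predict the label of `query` by majority vote among its k closest records."""
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical records: (age, income in thousands) -> known outcome.
train = [
    ((25, 30), "no"),
    ((30, 35), "no"),
    ((35, 40), "no"),
    ((50, 80), "yes"),
    ((55, 90), "yes"),
    ((60, 85), "yes"),
]

print(nearest_neighbour_predict(train, (52, 82)))  # prints "yes"
```

In practice the features would usually be rescaled first, so that a variable with a large numeric range does not dominate the distance.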
![Page 23: Data Mining and Knowledge Discovery in Databases](https://reader035.vdocuments.us/reader035/viewer/2022062216/56649e165503460f94b008e4/html5/thumbnails/23.jpg)
Clustering/Nearest Neighbour
• Applications:
  • Text processing: search engines
  • Image processing: radiology/image processing
  • Fraud detection: outliers
![Page 24: Data Mining and Knowledge Discovery in Databases](https://reader035.vdocuments.us/reader035/viewer/2022062216/56649e165503460f94b008e4/html5/thumbnails/24.jpg)
Clustering/Nearest Neighbour
• Strengths
  • Easily understood and interpreted
  • Easily implemented in basic situations
• Weaknesses
  • Complex data not well suited to automation (much preprocessing required)
![Page 25: Data Mining and Knowledge Discovery in Databases](https://reader035.vdocuments.us/reader035/viewer/2022062216/56649e165503460f94b008e4/html5/thumbnails/25.jpg)
Genetic Algorithms/Evolutionary Computing
• Grounded in Darwin – applied using mathematics
• Require
  • a way to represent a solution to a problem
  • a way to test the “fitness” of the solution
• Solutions are mathematically “mutated”
• Fittest solutions survive
• Convergence
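The ingredients listed above (a representation, a fitness test, mutation, survival of the fittest) can be sketched on a toy problem. The example below is an assumption-laden illustration: solutions are bit lists, fitness is simply the count of 1s, and the loop uses mutation only, without the crossover step that many genetic algorithms also include.

```python
import random


def fitness(bits):
    """Toy fitness test: count the 1s in the candidate solution."""
    return sum(bits)


def mutate(bits, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [1 - b if rng.random() < rate else b for b in bits]


def evolve(length=20, population_size=30, generations=200, rate=0.05, seed=0):
    """Evolve bit lists toward higher fitness and return the best one found."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(population_size)]
    for _ in range(generations):
        # Fittest solutions survive: keep the better half unchanged.
        population.sort(key=fitness, reverse=True)
        survivors = population[: population_size // 2]
        # Refill the population with mutated copies of the survivors.
        children = [mutate(rng.choice(survivors), rate, rng) for _ in survivors]
        population = survivors + children
    return max(population, key=fitness)


best = evolve()
print(fitness(best), "of", len(best))
```

Because the better half is carried over unchanged each generation, the best fitness never decreases, which is one simple way to get the convergence the slide mentions.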
![Page 26: Data Mining and Knowledge Discovery in Databases](https://reader035.vdocuments.us/reader035/viewer/2022062216/56649e165503460f94b008e4/html5/thumbnails/26.jpg)
Genetic Algorithms/Evolutionary Computing
• Strengths • Suited to novel
problems that are poorly understood
• Suitable where data is dirty or missing
• May be useful where other methods cannot be applied
• Weaknesses• Not easily automated• Require creativity in
their application
![Page 27: Data Mining and Knowledge Discovery in Databases](https://reader035.vdocuments.us/reader035/viewer/2022062216/56649e165503460f94b008e4/html5/thumbnails/27.jpg)
Bayesian Networks
• Based on Bayes’ rule:
  • P(a|b) = P(b|a) * P(a) / P(b)
• Can construct networks of linked events, each with prior probabilities
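A worked instance of the rule may help. The figures below are hypothetical fraud-screening numbers chosen only for illustration; the function simply evaluates P(a|b) = P(b|a) * P(a) / P(b).

```python
def posterior(prior_a, likelihood_b_given_a, prob_b):
    """Bayes' rule: P(a|b) = P(b|a) * P(a) / P(b)."""
    return likelihood_b_given_a * prior_a / prob_b


# Hypothetical fraud-screening figures, chosen only for illustration.
p_fraud = 0.01             # P(a): prior probability a transaction is fraudulent
p_flag_given_fraud = 0.90  # P(b|a): a fraudulent transaction is flagged
p_flag_given_legit = 0.05  # a legitimate transaction is flagged anyway

# Total probability of a flag, P(b), summed over both cases.
p_flag = p_flag_given_fraud * p_fraud + p_flag_given_legit * (1 - p_fraud)

print(round(posterior(p_fraud, p_flag_given_fraud, p_flag), 3))  # prints 0.154
```

Even with a 90% detection rate, only about 15% of flagged transactions are fraudulent under these assumed numbers, because the prior P(a) is so small.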
![Page 28: Data Mining and Knowledge Discovery in Databases](https://reader035.vdocuments.us/reader035/viewer/2022062216/56649e165503460f94b008e4/html5/thumbnails/28.jpg)
Bayesian Network Example
[Network diagram with nodes: J.R. Shot; Bobby shot him; Just a dream sequence; Mistress shot him; Wife shot him; Suicide; J.R. treated for depression; Bobby publicly threatened; Producers desperate for ratings; Big fight between wife, mistress.]
![Page 29: Data Mining and Knowledge Discovery in Databases](https://reader035.vdocuments.us/reader035/viewer/2022062216/56649e165503460f94b008e4/html5/thumbnails/29.jpg)
Bayesian Networks
• Strengths
  • Clarity of the resulting models
  • Good precision in predicting
  • Easily adapt to new probabilities
• Weaknesses
  • Time consuming to construct and maintain
  • Poor at predicting rare events
![Page 30: Data Mining and Knowledge Discovery in Databases](https://reader035.vdocuments.us/reader035/viewer/2022062216/56649e165503460f94b008e4/html5/thumbnails/30.jpg)
Statistics
• With an outcome or dependent variable:
  • Correlations
  • ANOVA
  • Regression
• Used by themselves or to confirm findings of another method
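Two of the techniques named above, correlation and regression, can be sketched with plain Python. The data values are hypothetical, and the helper names (`pearson_r`, `least_squares_line`) are introduced here for illustration; a statistical package would normally do this work and also report significance tests.

```python
import math


def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def least_squares_line(xs, ys):
    """Slope and intercept of the ordinary least-squares line y = slope*x + intercept."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx


# Hypothetical data: a predictor (xs) and an outcome or dependent variable (ys).
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

print(round(pearson_r(xs, ys), 3))
print(least_squares_line(xs, ys))
```

A result from another method, such as a split chosen by a decision tree, could be checked this way by testing whether the same relationship shows up in a conventional statistic.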
![Page 31: Data Mining and Knowledge Discovery in Databases](https://reader035.vdocuments.us/reader035/viewer/2022062216/56649e165503460f94b008e4/html5/thumbnails/31.jpg)
Statistics
• Strengths
  • “Gold Standard” – valid and trusted in scientific circles
• Weaknesses
  • Limits findings to those techniques that are applied and their associated limitations (normality, linearity, and so on)
![Page 32: Data Mining and Knowledge Discovery in Databases](https://reader035.vdocuments.us/reader035/viewer/2022062216/56649e165503460f94b008e4/html5/thumbnails/32.jpg)
Hybrids
• Techniques used in combination
• Example: use of a genetic algorithm to identify target variables for inclusion in a neural network model
![Page 33: Data Mining and Knowledge Discovery in Databases](https://reader035.vdocuments.us/reader035/viewer/2022062216/56649e165503460f94b008e4/html5/thumbnails/33.jpg)
Recap
• Data Mining is the core activity or method within a process of Knowledge Discovery in Databases
• Done to find useful information in large amounts of data that “conventional” approaches cannot uncover
• Variety of methods
• Requires knowledge of the data domain and the methods, as well as creativity
![Page 34: Data Mining and Knowledge Discovery in Databases](https://reader035.vdocuments.us/reader035/viewer/2022062216/56649e165503460f94b008e4/html5/thumbnails/34.jpg)
Data Mining Packages
• Major vendors of database/data management products (IBM, SPSS, Oracle PeopleSoft, SAS, and so on)
• Added as a component of turnkey packages
• May incorporate several methods (SAS Enterprise Miner)
• Single method (TreeAge Software Inc.: a dedicated decision tree product)
![Page 35: Data Mining and Knowledge Discovery in Databases](https://reader035.vdocuments.us/reader035/viewer/2022062216/56649e165503460f94b008e4/html5/thumbnails/35.jpg)
How to implement?
• Do it yourself (you know the data domain)
• Put a team together (domain and method specialists)
• Hire a consultant (who knows both your domain and the tools)
• Vertical markets in data mining
![Page 36: Data Mining and Knowledge Discovery in Databases](https://reader035.vdocuments.us/reader035/viewer/2022062216/56649e165503460f94b008e4/html5/thumbnails/36.jpg)
Close Relatives of Data Mining
• On-Line Analytical Processing (OLAP)
• Pivot tables in spreadsheets
• General statistical packages
• Intelligent Data Analysis – comprises the use of data mining methods in the analysis of “small” datasets