Data Mining Techniques
Wojtek Kowalczyk
www.cs.vu.nl/~wojtek
www.cs.vu.nl/~wojtek/DataMine
www.cs.vu.nl/ci/DataMine/DIANA
Outline
• Organization of the course
• What is Data Mining?
• Course overview
• Data Mining Tasks
• Data Mining Cycle
• Data Mining Techniques
Objectives of the course
• Provide an overview of the most common algorithms and techniques used in Data Mining (lectures)
• Provide extensive “hands-on” experience with applying these techniques (practicum)
• Provide a survey of typical (and future) applications of data mining
Organization of the course
• 12 lectures (1sp) + 3 assignments (3sp) (1sp=40hrs work)
• no exams; grades based on assignments (theory & practice)
• assignments on: 8.03, 12.04, 03.05
• deadlines: 3 weeks later: 5.04, 3.05, 24.05
• work in pairs(?); registration obligatory (before 1.03)
by e-mail to [email protected]
Subject: DMT-registration
Body: Full name; e-mail address; student number; {AI|BWI|…}
Full name; e-mail address; student number; {AI|BWI|…}
Materials
• Slides, notes, assignments:
www.cs.vu.nl/~wojtek/DataMine
• Book: “Data Mining” by Ian H. Witten and Eibe Frank,
www.cs.waikato.ac.nz/~ml/weka/book.html
• Internet: www.kdnuggets.com
• Further readings from different perspectives:
- business aspects: Berry & Linoff
- theory: Hand, Mannila, Smyth;
Tan, Steinbach, Kumar
- latest: proceedings of KDD, PKDD, PAKDD, ML, ...
Origins of Data Mining
• Every day the world creates a few exabytes of data
1 exabyte = 1000 petabytes
1 petabyte = 1000 terabytes
1 terabyte = 1000 gigabytes
• Only 4% of the data is used for any purpose (IBM)
• If we could only do something useful with this data ...
➨ ... the field of DATA MINING is born
Sources of data
• satellites (images)
• business: banks, telecom, insurance, retail, airlines, …
• internet (only a few terabytes in the late 90’s)
• libraries (e.g., Library of Congress: 20 TB - 3 PB)
• law enforcement agencies (FBI fingerprints DB: 1 PB)
• Bioinformatics? RFID-tags? Homeland security?
Typical data mining applications
• fraud detection (credit cards, telecom, insurance, taxes, …)
• credit scoring and control (“to give or not to give?”)
• marketing (mailing selection, modeling churn/retention,
attrition, cross-selling, market basket analysis, etc.)
• Customer Relationship Management (CRM)
• criminal investigations (text mining)
• ….
In Holland every citizen is “present” in
800-1000 databases !!!
What is Data Mining?
• Data mining is a step in the KDD process consisting of applying data analysis and discovery algorithms that, under acceptable computational efficiency limitations, produce a particular enumeration of patterns over the data.
(U. Fayyad, G. Piatetsky-Shapiro and P. Smyth, KDD-96)
• Data mining is an area in the intersection of machine learning, statistics, and databases.
(M. Holsheimer, M. Kersten, H. Mannila and H. Toivonen)
• Data mining is the process of selecting, exploring, and modeling large amounts of data to uncover previously unknown patterns for a business advantage.
(SAS Institute)
Sorts of Data Mining Tasks
Predictive Data Mining (“supervised”):
• Classification
• Regression
• Time series
Knowledge Discovery (“unsupervised”):
• Deviation Detection
• Segmentation
• Clustering
• Association Rules
• Summarization
• Visualization
Examples
• Medical diagnosis: soft or hard contact lenses?
• Credit application scoring: grant a loan or not?
• Fraud detection: is the transaction suspicious or not?
• Direct mailing: who should be offered a given product?

• CPU performance: how to configure computers?
• Remote sensing: determine water pollution from spectral images
• Load forecasting: predict future demand for electric power
• Intelligent ATMs: how much cash will be there tomorrow?

• identify groups of similar credit card users
• automatically organize incoming e-mails
• characterize interests of an Internet user
• etc.
Contact lenses: a classification task
Can I use contact lenses?
Possible output: none, soft, hard.
Decision based on:
- age
- spectacle prescription
- astigmatism
- tear production rate
Hypothetical Decision Table
age             prescription   astigmatism   tear p.r.   lenses
young           myope          no            reduced     none
young           myope          no            normal      soft
young           hypermetrope   yes           reduced     none
pre-presbyopic  myope          no            reduced     none
pre-presbyopic  hypermetrope   yes           normal      soft
pre-presbyopic  hypermetrope   yes           reduced     none
presbyopic      myope          no            normal      hard
presbyopic      myope          no            reduced     none
presbyopic      hypermetrope   yes           reduced     none
Classifiers: classification procedures
• A set of “if-then” rules
• A decision tree
• A Neural Network
• A formula (e.g. “scoring model”)
• A classification procedure
Figure 1.1 Rules for the contact lens data.
If tear production rate = reduced then recommendation = none
If age = young and astigmatic = no and tear production rate = normal then recommendation = soft
If age = pre-presbyopic and astigmatic = no and tear production rate = normal then recommendation = soft
If age = presbyopic and spectacle prescription = myope and astigmatic = no then recommendation = none
If spectacle prescription = hypermetrope and astigmatic = no and tear production rate = normal then recommendation = soft
If spectacle prescription = myope and astigmatic = yes and tear production rate = normal then recommendation = hard
If age = young and astigmatic = yes and tear production rate = normal then recommendation = hard
If age = pre-presbyopic and spectacle prescription = hypermetrope and astigmatic = yes then recommendation = none
If age = presbyopic and spectacle prescription = hypermetrope and astigmatic = yes then recommendation = none
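A rule set like the one in Figure 1.1 can be applied mechanically. The sketch below is only an illustration: it assumes the rules are tried in order and the first match wins, and the dictionary keys (`age`, `prescription`, `astigmatic`, `tear`), the function name `recommend`, and the `"unknown"` fallback are hypothetical names chosen for this example, not part of the course material.

```python
# Rules transcribed from Figure 1.1 as (conditions, recommendation) pairs.
# Assumption: rules are evaluated top to bottom and the first match wins.
RULES = [
    ({"tear": "reduced"}, "none"),
    ({"age": "young", "astigmatic": "no", "tear": "normal"}, "soft"),
    ({"age": "pre-presbyopic", "astigmatic": "no", "tear": "normal"}, "soft"),
    ({"age": "presbyopic", "prescription": "myope", "astigmatic": "no"}, "none"),
    ({"prescription": "hypermetrope", "astigmatic": "no", "tear": "normal"}, "soft"),
    ({"prescription": "myope", "astigmatic": "yes", "tear": "normal"}, "hard"),
    ({"age": "young", "astigmatic": "yes", "tear": "normal"}, "hard"),
    ({"age": "pre-presbyopic", "prescription": "hypermetrope", "astigmatic": "yes"}, "none"),
    ({"age": "presbyopic", "prescription": "hypermetrope", "astigmatic": "yes"}, "none"),
]


def recommend(patient, default="unknown"):
    """Return the recommendation of the first rule whose conditions all hold."""
    for conditions, recommendation in RULES:
        if all(patient.get(key) == value for key, value in conditions.items()):
            return recommendation
    # Hypothetical fallback for patients that no rule covers.
    return default


patient = {"age": "young", "prescription": "myope", "astigmatic": "no", "tear": "normal"}
print(recommend(patient))
```

Keeping the rules as data rather than as nested `if` statements makes it easy to swap in a different rule set, for example one produced by a rule induction algorithm.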
Figure 1.2 Decision tree for the contact lens data.
CPU performance: regression problem
Computer’s CPU performance (PRP) depends on a number of factors:
- cycle time (MYCT)
- main memory (MMIN, MMAX)
- cache (CACH)
- number of channels (CHMIN, CHMAX)
Problem: express PRP as a function of all these factors.
Figure 3.6(a) Models for the CPU performance data: linear regression.
PRP = - 56.1
      + 0.049 MYCT
      + 0.015 MMIN
      + 0.006 MMAX
      + 0.630 CACH
      - 0.270 CHMIN
      + 1.46 CHMAX
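The linear model of Figure 3.6(a) is just a weighted sum, so evaluating it takes a few lines. In the sketch below the coefficients are copied from the figure; the function name `predict_prp` and the input values in `example` are made up for illustration and are not claimed to be a record from the CPU data set.

```python
# Coefficients copied from Figure 3.6(a).
INTERCEPT = -56.1
COEFFICIENTS = {
    "MYCT": 0.049,
    "MMIN": 0.015,
    "MMAX": 0.006,
    "CACH": 0.630,
    "CHMIN": -0.270,
    "CHMAX": 1.46,
}


def predict_prp(machine):
    """Evaluate the linear regression formula for one machine configuration."""
    return INTERCEPT + sum(
        weight * machine[name] for name, weight in COEFFICIENTS.items()
    )


# Hypothetical configuration, chosen only to exercise the formula.
example = {"MYCT": 125, "MMIN": 256, "MMAX": 6000, "CACH": 256, "CHMIN": 16, "CHMAX": 128}
print(round(predict_prp(example), 3))
```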
Figure 3.6(b) Models for the CPU performance data: regression tree.
Figure 3.6(c) Models for the CPU performance data: model tree.
Association Rules
A shop sells products a, b, …, z
Clients buy them in collections, e.g., {a, c}, {c, d, z}, …
Each set is called a “transaction” or an “item set”
What are the most frequent item sets?
What are the most significant “association rules”, e.g., {c, g} ==> {z}?
Association Rules II
Rule significance is measured in terms of:
- support (percentage of transactions that match the LHS)
- confidence (accuracy of the rule)
Problems:
• combinatorial explosion of item sets
• huge number of rules
• two conflicting performance measures (we want rules to have high support and high accuracy)
There are efficient algorithms for finding rules !!!
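The two measures can be computed directly from a list of transactions. The sketch below follows the definitions on this slide: support is the fraction of transactions that contain the rule's left-hand side, and confidence is the fraction of those transactions that also contain the right-hand side. Note that many texts instead define support over the left- and right-hand sides together. The transactions and function names are invented for illustration; this is a brute-force computation for one rule, not one of the efficient rule-finding algorithms.

```python
# Hypothetical transactions over the products a, b, ..., z.
transactions = [
    {"a", "c"},
    {"c", "d", "z"},
    {"c", "g", "z"},
    {"a", "c", "g", "z"},
    {"c", "g"},
    {"b", "g"},
]


def support(lhs, transactions):
    """Fraction of transactions that contain every item of the LHS."""
    return sum(lhs <= t for t in transactions) / len(transactions)


def confidence(lhs, rhs, transactions):
    """Among transactions matching the LHS, the fraction that also contain the RHS."""
    covered = [t for t in transactions if lhs <= t]
    if not covered:
        return 0.0
    return sum(rhs <= t for t in covered) / len(covered)


lhs, rhs = {"c", "g"}, {"z"}
print(support(lhs, transactions))
print(confidence(lhs, rhs, transactions))
```

With these six transactions, three contain both c and g, and two of those three also contain z.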
Interdependencies: Link Analysis
What influences what and to what extent?
Bayesian networks: graphical models of knowledge
Networks constructed from data and knowledge !!!
[Figure: Bayesian network over the nodes s, x, a, h, r, d]
s = smoker, x = sex, a = age, h = health, r = resistance, d = life/death
Putting similar things together: Clustering
Example: Credit card users might be clustered according to the way they use their cards:
• frequent/seldom usage
• domestic/foreign transactions
• high/low amounts of money
• transactions of specific type
• …
Then a separate fraud detection system may be developed for every group. Or different products might be offered to each group…
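To make the idea of grouping similar card users concrete, here is a minimal sketch using k-means, one common clustering algorithm (the slide does not name a specific method, so this choice is an assumption). Each hypothetical user is described by two made-up features, transactions per month and average amount; the initial centroids are fixed so that the result is reproducible.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then re-average."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for point in points:
            # Squared Euclidean distance to every centroid.
            distances = [
                sum((a - b) ** 2 for a, b in zip(point, centroid))
                for centroid in centroids
            ]
            clusters[distances.index(min(distances))].append(point)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical users: (transactions per month, average amount).
users = [(2, 20.0), (3, 25.0), (4, 30.0), (40, 200.0), (45, 220.0), (50, 210.0)]
centroids, clusters = kmeans(users, centroids=[users[0], users[3]])
print(centroids)
```

In practice the features would usually be rescaled first, since otherwise the feature with the largest numeric range dominates the distance.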
Characteristics of the data:
Huge quantities
Redundancy
Irrelevancy
Bad quality:
• missing values
• incompleteness
• inconsistency
• errors
• outdated
• outliers
High dimensionality
Unstructured (e.g. textual)
Data Mining Cycle
• Problem understanding and formulating
• Identification of relevant data
• Data gathering
• Data cleaning
• Data preprocessing
• Model building
• Model analysis
• Model implementation
• Model maintenance
Accents
1) Algorithms & Techniques
2) Technical skills (AWK, Matlab, Weka)
3) Performance Challenge
4) Applications
5) Recent Developments (text mining, web mining, mining data streams, etc.)
Data Preprocessing
• exploratory data analysis
• discretization and grouping of values
• reduction of dimensionality
• feature extraction
• treatment of missing values and outliers
• sampling
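Two of the preprocessing steps listed above, treatment of missing values and discretization, can be sketched in a few lines. The example assumes mean imputation and equal-width binning, which are only two of many possible choices; the function names and the `ages` data are invented for illustration.

```python
def impute_mean(values):
    """Replace missing entries (None) with the mean of the known values."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]


def discretize_equal_width(values, bins):
    """Map each value to the index of one of `bins` equally wide intervals."""
    low, high = min(values), max(values)
    width = (high - low) / bins
    if width == 0:
        return [0 for _ in values]
    # The maximum value would land in bin `bins`, so clamp it to the last bin.
    return [min(int((v - low) / width), bins - 1) for v in values]


# Hypothetical attribute with two missing values.
ages = [23, None, 35, 47, None, 61]
filled = impute_mean(ages)
age_groups = discretize_equal_width(filled, bins=3)
print(filled)
print(age_groups)
```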
Model Building
• Rule Induction
• Decision Trees
• Bayesian Classifiers
• Regression Trees
• Association Rules
• Instance-based learning
• Clustering Algorithms
• Combining models: Bagging, Boosting, Stacking, etc.
To remember:
• There are various definitions of “Data Mining”
• Most common tasks of Data Mining are:
  • Classification,
  • Regression / numerical prediction,
  • Discovery of Associations,
  • Clustering
• The road “from data to results” involves many steps
• The course covers 3 aspects of DM:
  • data preprocessing
  • model building
  • model evaluation