Tutorial on Data Mining. Workshop of the Indian Database Research Community. Sunita Sarawagi, School of IT, IIT Bombay. Data mining: process of semi-automatically analyzing large databases to find interesting and useful patterns.
![Page 1: Tutorial on Data Mining](https://reader035.vdocuments.us/reader035/viewer/2022062519/56814cec550346895db9e9b2/html5/thumbnails/1.jpg)
Tutorial on Data Mining
Workshop of the Indian Database Research Community
Sunita Sarawagi
School of IT, IIT Bombay
![Page 2: Tutorial on Data Mining](https://reader035.vdocuments.us/reader035/viewer/2022062519/56814cec550346895db9e9b2/html5/thumbnails/2.jpg)
Data mining
• Process of semi-automatically analyzing large databases to find interesting and useful patterns
• Overlaps with machine learning, statistics, artificial intelligence and databases, but is
– more scalable in number of features and instances
– more automated to handle heterogeneous data
![Page 3: Tutorial on Data Mining](https://reader035.vdocuments.us/reader035/viewer/2022062519/56814cec550346895db9e9b2/html5/thumbnails/3.jpg)
Outline
• Applications
• Usage scenarios
• Overview of operations
• Mining research groups
• Relevance in India
• Ten research problems
![Page 4: Tutorial on Data Mining](https://reader035.vdocuments.us/reader035/viewer/2022062519/56814cec550346895db9e9b2/html5/thumbnails/4.jpg)
Applications
• Customer relationship management: identify those who are likely to leave for a competitor
• Targeted marketing: identify likely responders to promotions
• Fraud detection: telecommunications, financial transactions
• Manufacturing and production:
• Medicine: disease outcome, effectiveness of treatments
• Molecular/Pharmaceutical: identify new drugs
• Scientific data analysis:
• Web site/store design and promotion
![Page 5: Tutorial on Data Mining](https://reader035.vdocuments.us/reader035/viewer/2022062519/56814cec550346895db9e9b2/html5/thumbnails/5.jpg)
Usage scenarios
• Data warehouse mining:
– assimilate data from operational sources
– mine static data
• Mining log data
• Continuous mining: example in process control
• Stages in mining:
– data selection → pre-processing: cleaning → transformation → mining → result evaluation → visualization
![Page 6: Tutorial on Data Mining](https://reader035.vdocuments.us/reader035/viewer/2022062519/56814cec550346895db9e9b2/html5/thumbnails/6.jpg)
Some basic operations
• Predictive:
– Regression
– Classification
• Descriptive:
– Clustering / similarity matching
– Association rules and variants
– Deviation detection
![Page 7: Tutorial on Data Mining](https://reader035.vdocuments.us/reader035/viewer/2022062519/56814cec550346895db9e9b2/html5/thumbnails/7.jpg)
Classification
• Given old data about customers and payments, predict new applicant’s loan eligibility.
[Figure: records of previous customers (age, salary, profession, location, customer type) are fed to a classifier, which produces decision rules such as Salary > 5 L and Prof. = Exec; the rules label a new applicant’s data as good/bad.]
![Page 8: Tutorial on Data Mining](https://reader035.vdocuments.us/reader035/viewer/2022062519/56814cec550346895db9e9b2/html5/thumbnails/8.jpg)
Classification methods
Goal: predict class Ci = f(x1, x2, ..., xn)
• Regression (linear or any other polynomial):
– a*x1 + b*x2 + c = Ci
• Nearest neighbour
• Decision tree classifier: divide decision space into piecewise constant regions
• Probabilistic/generative models
• Neural networks: partition by non-linear boundaries
![Page 9: Tutorial on Data Mining](https://reader035.vdocuments.us/reader035/viewer/2022062519/56814cec550346895db9e9b2/html5/thumbnails/9.jpg)
Nearest neighbor
• Define proximity between instances, find neighbors of new instance and assign majority class
• Case based reasoning: when attributes are more complicated than real-valued
• Cons
– Slow during application
– No feature selection
– Notion of proximity vague
• Pros
+ Fast training
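The nearest neighbor idea can be sketched in a few lines. This is only an illustration: Euclidean distance, k = 3 and the (age, salary) records are assumptions made here, since the slide leaves the proximity measure open.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of (feature_tuple, label) pairs; query: a feature tuple."""
    # Euclidean distance is an assumed proximity measure.
    by_distance = sorted(train, key=lambda row: math.dist(row[0], query))
    # Majority vote among the k closest training instances.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical records: (age, salary in lakhs) -> customer type
train = [((25, 2.0), "bad"), ((30, 2.5), "bad"), ((45, 8.0), "good"),
         ((50, 9.0), "good"), ((40, 7.5), "good")]
print(knn_predict(train, (42, 8.2)))
```

There is no training step beyond storing the data, but every prediction scans all stored instances, which matches the pros and cons listed above.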
![Page 10: Tutorial on Data Mining](https://reader035.vdocuments.us/reader035/viewer/2022062519/56814cec550346895db9e9b2/html5/thumbnails/10.jpg)
Decision trees
• Tree where internal nodes are simple decision rules on one or more attributes and leaf nodes are predicted class labels.
[Figure: example tree with internal nodes Salary < 1 M, Prof = teacher and Age < 30, and leaves labelled Good or Bad.]
![Page 11: Tutorial on Data Mining](https://reader035.vdocuments.us/reader035/viewer/2022062519/56814cec550346895db9e9b2/html5/thumbnails/11.jpg)
Algorithm for tree building
• Greedy top-down construction.
Gen_Tree (node, data)
    Make node a leaf? If yes, stop.
    Find best attribute and best split on attribute (selection criteria)
    Partition data on split condition
    For each child j of node: Gen_Tree (node_j, data_j)
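The Gen_Tree recursion above can be sketched as runnable code. The sketch assumes numeric attributes, binary splits of the form attribute < threshold, the Gini index as the selection criterion, and "node is pure or cannot be split" as the leaf test; none of these choices is fixed by the slide.

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return (weighted gini, attribute index, threshold), or None if no split exists."""
    best = None
    for a in range(len(rows[0])):
        for t in sorted({r[a] for r in rows})[1:]:      # candidate thresholds
            left = [y for r, y in zip(rows, labels) if r[a] < t]
            right = [y for r, y in zip(rows, labels) if r[a] >= t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, a, t)
    return best

def gen_tree(rows, labels):
    # Make the node a leaf when it is pure or no split is possible.
    split = None if len(set(labels)) == 1 else best_split(rows, labels)
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    _, a, t = split
    # Partition the data on the split condition and recurse on each child.
    left = [(r, y) for r, y in zip(rows, labels) if r[a] < t]
    right = [(r, y) for r, y in zip(rows, labels) if r[a] >= t]
    return {"attr": a, "threshold": t,
            "left": gen_tree(*zip(*left)), "right": gen_tree(*zip(*right))}

def predict(tree, row):
    while isinstance(tree, dict):
        tree = tree["left"] if row[tree["attr"]] < tree["threshold"] else tree["right"]
    return tree
```

Trees grown this way until every leaf is pure tend to overfit; a real implementation would add a stopping rule or a pruning pass.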
![Page 12: Tutorial on Data Mining](https://reader035.vdocuments.us/reader035/viewer/2022062519/56814cec550346895db9e9b2/html5/thumbnails/12.jpg)
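Split criteria
• k classes, set S of instances partitioned into r subsets. Subset S_j has fraction p_ij of instances of class i.
• Information entropy:
$$-\sum_{j=1}^{r} \frac{|S_j|}{|S|} \sum_{i=1}^{k} p_{ij} \log p_{ij}$$
• Gini index:
$$\sum_{j=1}^{r} \frac{|S_j|}{|S|} \left(1 - \sum_{i=1}^{k} p_{ij}^{2}\right)$$
[Figure: impurity plotted over the interval 0 to 1 for Gini with r = 1, k = 2; a tick at 1/4 is marked.]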
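As a rough illustration of the two split criteria, the functions below compute the weighted entropy and the weighted Gini index of a partition. The base-2 logarithm and the representation of each subset as a plain list of class labels are assumptions made for this sketch.

```python
import math

def split_entropy(subsets):
    """subsets: one list of class labels per subset S_j of the split."""
    total = sum(len(s) for s in subsets)
    result = 0.0
    for s in subsets:
        fractions = [s.count(c) / len(s) for c in set(s)]   # p_ij for classes present in S_j
        result -= len(s) / total * sum(p * math.log2(p) for p in fractions)
    return result

def split_gini(subsets):
    total = sum(len(s) for s in subsets)
    return sum(len(s) / total * (1.0 - sum((s.count(c) / len(s)) ** 2 for c in set(s)))
               for s in subsets)

# One unsplit node with two equally frequent classes (r = 1, k = 2)
print(split_entropy([["good", "bad"]]), split_gini([["good", "bad"]]))
```

Both measures are zero when every subset is pure and largest when the classes are evenly mixed, so a lower value means a better split.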
![Page 13: Tutorial on Data Mining](https://reader035.vdocuments.us/reader035/viewer/2022062519/56814cec550346895db9e9b2/html5/thumbnails/13.jpg)
Scalable algorithm
• Input: table of records
• Vertically partition data and sort on <attribute value, class>
• Finding best split:
– Scan and maintain class counts in memory and find gini incrementally.
• Performing split:
– Use split attribute to build rid to L/R hash in memory.
– Divide other attributes using above hash table.
[Figure: table with columns (rid, A1, A2, A3, C) vertically partitioned into lists (A1, C, rid), (A2, C, rid) and (A3, C, rid).]
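The scan over one sorted attribute list might look like the sketch below. It keeps only two sets of class counts in memory and updates them one record at a time. The tuple layout (value, class, rid), the function name and the sample data are hypothetical.

```python
from collections import Counter

def best_split_sorted(attr_list):
    """attr_list: (value, class_label, rid) tuples sorted by value."""
    above = Counter(label for _, label, _ in attr_list)   # records not yet scanned
    below = Counter()                                     # records already scanned
    n = len(attr_list)
    best = (float("inf"), None)
    for pos, (value, label, _) in enumerate(attr_list[:-1], start=1):
        below[label] += 1
        above[label] -= 1
        if attr_list[pos][0] == value:
            continue                                      # cannot split between equal values
        g_below = 1.0 - sum((c / pos) ** 2 for c in below.values())
        g_above = 1.0 - sum((c / (n - pos)) ** 2 for c in above.values())
        score = (pos * g_below + (n - pos) * g_above) / n
        if score < best[0]:
            best = (score, attr_list[pos][0])             # split condition: value < threshold
    return best

# Hypothetical attribute list for salary, already sorted by value
salary_list = [(2.0, "bad", 0), (2.5, "bad", 1), (8.0, "good", 2), (9.0, "good", 3)]
score, threshold = best_split_sorted(salary_list)
# rid -> L/R hash used to divide the other attribute lists
side = {rid: ("L" if value < threshold else "R") for value, _, rid in salary_list}
print(score, threshold, side)
```

Because the lists stay sorted after being divided with the hash, no re-sorting is needed at lower levels of the tree.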
![Page 14: Tutorial on Data Mining](https://reader035.vdocuments.us/reader035/viewer/2022062519/56814cec550346895db9e9b2/html5/thumbnails/14.jpg)
Issues
• Preventing overfitting
– Occam’s razor:
  • prefer the simplest hypothesis that fits the data
– Tree pruning methods:
  • Cross validation with separate test data
  • Minimum description length (MDL) criteria
• Multi attribute tests on nodes to handle correlated attributes
– Linear multivariate
– Non-linear multivariate e.g. a neural net at each node.
• Methods of handling missing values
![Page 15: Tutorial on Data Mining](https://reader035.vdocuments.us/reader035/viewer/2022062519/56814cec550346895db9e9b2/html5/thumbnails/15.jpg)
Pros and Cons of decision trees
• Cons
– Cannot handle complicated relationship between features
– Simple decision boundaries
– Problems with lots of missing data
• Pros
+ Reasonable training time
+ Fast application
+ Easy to interpret
+ Easy to implement
+ Can handle large number of features
More information: http://www.stat.wisc.edu/~limt/treeprogs.html
![Page 16: Tutorial on Data Mining](https://reader035.vdocuments.us/reader035/viewer/2022062519/56814cec550346895db9e9b2/html5/thumbnails/16.jpg)
Neural networks
• Useful for learning complex data like handwriting, speech and image recognition
[Figure: decision boundaries of a neural network compared with those of a classification tree.]
![Page 17: Tutorial on Data Mining](https://reader035.vdocuments.us/reader035/viewer/2022062519/56814cec550346895db9e9b2/html5/thumbnails/17.jpg)
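Neural network
• Set of nodes connected by directed weighted edges
Basic NN unit: inputs x1, x2, x3 with weights w1, w2, w3 produce the output
$$o = \sigma\left(\sum_{i=1}^{n} w_i x_i\right), \qquad \sigma(y) = \frac{1}{1 + e^{-y}}$$
[Figure: a basic NN unit next to a more typical NN, in which inputs x1, x2, x3 feed hidden nodes that feed output nodes.]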
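A basic unit of this kind is small enough to write out directly. The sketch below assumes a sigmoid activation applied to the weighted sum of the inputs; the weights and inputs are made-up values.

```python
import math

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def unit_output(weights, inputs):
    # Weighted sum of the inputs, squashed into the interval (0, 1).
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)))

# Hypothetical weights w1..w3 and inputs x1..x3
print(unit_output([1.0, -1.0, 0.5], [2.0, 1.0, 0.0]))
```

A network stacks such units: the outputs of the hidden nodes become the inputs of the output nodes.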
![Page 18: Tutorial on Data Mining](https://reader035.vdocuments.us/reader035/viewer/2022062519/56814cec550346895db9e9b2/html5/thumbnails/18.jpg)
Pros and Cons of Neural Network
• Cons
– Slow training time
– Hard to interpret
– Hard to implement: trial and error for choosing number of nodes
• Pros
+ Can learn more complicated class boundaries
+ Fast application
+ Can handle large number of features
Conclusion: Use neural nets only if decision trees/nearest neighbor fail.
![Page 19: Tutorial on Data Mining](https://reader035.vdocuments.us/reader035/viewer/2022062519/56814cec550346895db9e9b2/html5/thumbnails/19.jpg)
Bayesian learning
• Assume a probability model on generation of data.
• Apply Bayes' theorem to find most likely class as:
$$\text{predicted class: } c = \arg\max_{c_j} p(c_j \mid d) = \arg\max_{c_j} \frac{p(d \mid c_j)\, p(c_j)}{p(d)}$$
• Naïve Bayes: Assume attributes conditionally independent given class value
$$c = \arg\max_{c_j} \frac{p(c_j)}{p(d)} \prod_{i=1}^{n} p(a_i \mid c_j)$$
• Easy to learn probabilities by counting
• Useful in some domains e.g. text
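Learning the naïve Bayes probabilities by counting can be sketched as follows. The sketch assumes categorical attributes, applies no smoothing (an unseen attribute value gives a class probability of zero), and drops p(d) because it is the same for every class; the function names and sample records are hypothetical.

```python
from collections import Counter, defaultdict

def train_nb(rows, labels):
    class_counts = Counter(labels)
    attr_counts = defaultdict(Counter)     # (class, attribute index) -> value counts
    for row, c in zip(rows, labels):
        for i, a in enumerate(row):
            attr_counts[(c, i)][a] += 1
    return class_counts, attr_counts

def predict_nb(model, row):
    class_counts, attr_counts = model
    n = sum(class_counts.values())
    def score(c):
        p = class_counts[c] / n                               # p(c_j)
        for i, a in enumerate(row):
            p *= attr_counts[(c, i)][a] / class_counts[c]     # p(a_i | c_j)
        return p
    return max(class_counts, key=score)

# Hypothetical (profession, location) -> customer type
rows = [("exec", "urban"), ("exec", "rural"), ("clerk", "urban"),
        ("clerk", "rural"), ("exec", "urban")]
labels = ["good", "good", "bad", "bad", "good"]
model = train_nb(rows, labels)
print(predict_nb(model, ("exec", "rural")))
```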
![Page 20: Tutorial on Data Mining](https://reader035.vdocuments.us/reader035/viewer/2022062519/56814cec550346895db9e9b2/html5/thumbnails/20.jpg)
Bayesian belief network
• Find joint probability over set of variables making use of conditional independence whenever known
• Learning parameters hard when hidden units: use gradient descent / EM algorithms
• Learning structure of network harder
[Figure: network over variables a, b, d, e and C, with a conditional probability table whose columns are the four combinations of a and d (rows 0.1, 0.2, 0.3, 0.4 and 0.3, 0.2, 0.1, 0.5). Variable e is independent of d given b.]
![Page 21: Tutorial on Data Mining](https://reader035.vdocuments.us/reader035/viewer/2022062519/56814cec550346895db9e9b2/html5/thumbnails/21.jpg)
Clustering
• Unsupervised learning when old data with class labels is not available, e.g. when introducing a new product to a customer base
• Group/cluster existing customers based on time series of payment history such that similar customers are in the same cluster.
• Identify micro-markets and develop policies for each
• Key requirement: Need a good measure of similarity between instances
![Page 22: Tutorial on Data Mining](https://reader035.vdocuments.us/reader035/viewer/2022062519/56814cec550346895db9e9b2/html5/thumbnails/22.jpg)
Distance functions
• Numeric data: Euclidean, Manhattan distances
• Categorical data: 0/1 to indicate presence/absence followed by
– Hamming distance (# dissimilarity)
– Jaccard coefficients: #similarity in 1s/(# of 1s)
– data dependent measures: similarity of A and B depends on co-occurrence with C.
• Combined numeric and categorical data:
– weighted normalized distance:
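For the 0/1 encoding of categorical data, the Hamming distance and the Jaccard coefficient can be sketched as below. Reading "# of 1s" as the number of positions where at least one vector has a 1, and returning 1.0 for two all-zero vectors, are assumptions of this sketch.

```python
def hamming(u, v):
    # Number of positions where the two 0/1 vectors disagree.
    return sum(a != b for a, b in zip(u, v))

def jaccard(u, v):
    both = sum(a and b for a, b in zip(u, v))      # positions where both are 1
    either = sum(a or b for a, b in zip(u, v))     # positions where at least one is 1
    return both / either if either else 1.0

# Hypothetical presence/absence vectors for two baskets
u = [1, 1, 0, 0, 1]
v = [1, 0, 0, 1, 1]
print(hamming(u, v), jaccard(u, v))
```

Unlike Hamming distance, the Jaccard coefficient ignores positions where both vectors are 0, which matters when most items are absent from most instances.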
![Page 23: Tutorial on Data Mining](https://reader035.vdocuments.us/reader035/viewer/2022062519/56814cec550346895db9e9b2/html5/thumbnails/23.jpg)
Distance functions on high dimensional data
• Example: Time series, Text, Images
• Euclidean measures make all points equally far
• Reduce number of dimensions:
– choose subset of original features using random projections, feature selection techniques
– transform original features using statistical methods like Principal Component Analysis
• Define domain specific similarity measures: e.g. for images define features like number of objects, color histogram; for time series define shape based measures.
• Define non-distance based (model-based) clustering methods:
![Page 24: Tutorial on Data Mining](https://reader035.vdocuments.us/reader035/viewer/2022062519/56814cec550346895db9e9b2/html5/thumbnails/24.jpg)
Clustering methods
• Hierarchical clustering
– agglomerative vs. divisive
– single link vs. complete link
• Partitional clustering
– distance-based: K-means
– model-based: EM
– density-based:
![Page 25: Tutorial on Data Mining](https://reader035.vdocuments.us/reader035/viewer/2022062519/56814cec550346895db9e9b2/html5/thumbnails/25.jpg)
Partitional methods: K-means
• Criteria: minimize sum of squares of distances
– between each point and centroid of the cluster
– between each pair of points in the cluster
• Algorithm:
– Select initial partition with K clusters: random, first K, K separated points
– Repeat until stabilization:
  • Assign each point to closest cluster center
  • Generate new cluster centers
  • Adjust clusters by merging/splitting
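A minimal sketch of the K-means loop follows. It assumes Euclidean distance and the "first K" initialisation, and it leaves out the optional merging/splitting adjustment.

```python
import math

def kmeans(points, k, iterations=100):
    centers = [tuple(p) for p in points[:k]]          # "first K" initial partition
    assignment = None
    for _ in range(iterations):
        # Assign each point to the closest cluster center.
        new_assignment = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                          for p in points]
        if new_assignment == assignment:              # stabilised
            break
        assignment = new_assignment
        # Generate new cluster centers as the mean of each cluster's members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:                               # keep the old center if a cluster empties
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, assignment

# Hypothetical 2-D points forming two obvious groups
points = [(0, 0), (10, 10), (0, 1), (1, 0), (10, 11), (11, 10)]
centers, assignment = kmeans(points, 2)
print(centers, assignment)
```

The result depends on the initial partition, which is one reason the method may stop at a local rather than a global optimum.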
![Page 26: Tutorial on Data Mining](https://reader035.vdocuments.us/reader035/viewer/2022062519/56814cec550346895db9e9b2/html5/thumbnails/26.jpg)
Properties
• May not reach global optima
• Converges fast in practice: guaranteed for certain forms of optimization function
• Complexity: O(KndI):
– I number of iterations, n number of points, d number of dimensions, K number of clusters.
• Database research on scalable algorithms:
– Birch: one/two pass of data by keeping R-tree like index in memory [Sigmod 96]
![Page 27: Tutorial on Data Mining](https://reader035.vdocuments.us/reader035/viewer/2022062519/56814cec550346895db9e9b2/html5/thumbnails/27.jpg)
Model based clustering
• Assume data generated from K probability distributions. Need to find distribution parameters.
EM algorithm: K Gaussian mixtures
• Iterate between two steps
– Expectation step: assign points to clusters
$$P(d_i \in c_k) = \Pr\left(d_i \mid N(\mu_k, \sigma_k^{2})\right)$$
– Maximization step: estimate model parameters
$$\mu_k = \frac{\sum_i d_i\, P(d_i \in c_k)}{\sum_i P(d_i \in c_k)}$$
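The two EM steps can be sketched for one-dimensional data as below. The mixing weights, the per-point normalisation of the memberships, the variance update and the small variance floor come from the standard Gaussian mixture formulation and are assumptions here, not something taken from the slide.

```python
import math

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(data, initial_means, iterations=50):
    mus = list(initial_means)
    k = len(mus)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(iterations):
        # Expectation step: soft membership of every point in every cluster.
        resp = []
        for x in data:
            likes = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, mus, variances)]
            total = sum(likes)
            resp.append([like / total for like in likes])
        # Maximization step: re-estimate parameters from the weighted points.
        for j in range(k):
            mass = sum(r[j] for r in resp)
            mus[j] = sum(r[j] * x for r, x in zip(resp, data)) / mass
            var = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, data)) / mass
            variances[j] = max(var, 1e-6)             # floor avoids a collapsed component
            weights[j] = mass / len(data)
    return mus, variances, weights

# Hypothetical data drawn around 1 and 5, with rough initial means
mus, variances, weights = em_gmm_1d([1.0, 1.2, 0.8, 5.0, 5.2, 4.8], [0.0, 6.0])
print(mus)
```

Compared with K-means, each point contributes to every cluster in proportion to its membership probability instead of being assigned to exactly one.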
![Page 28: Tutorial on Data Mining](https://reader035.vdocuments.us/reader035/viewer/2022062519/56814cec550346895db9e9b2/html5/thumbnails/28.jpg)
Association rules
• Given set T of groups of items
• Example: set of baskets of items purchased
• Goal: find all rules on itemsets of the form a --> b such that
– support of a and b > user threshold s
– conditional probability (confidence) of b given a > user threshold c
• Example: Milk --> bread
• Lot of work done on scalable algorithms
[Figure: example set T of baskets: {Milk, cereal}, {Tea, milk}, {Tea, rice, bread}, {cereal}.]
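Support and confidence can be computed directly from their definitions. The sketch below uses the four example baskets from the slide, lowercased; the function names are illustrative, and confidence assumes the left-hand side occurs in at least one basket.

```python
def support(transactions, items):
    items = set(items)
    # Fraction of baskets that contain every item in the itemset.
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, a, b):
    # Estimated conditional probability of b given a.
    return support(transactions, set(a) | set(b)) / support(transactions, a)

T = [{"milk", "cereal"}, {"tea", "milk"}, {"tea", "rice", "bread"}, {"cereal"}]
print(support(T, {"milk", "cereal"}), confidence(T, {"milk"}, {"cereal"}))
```

A rule a --> b is reported only when both values exceed the user thresholds s and c. Scalable algorithms avoid enumerating every itemset this naively.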
![Page 29: Tutorial on Data Mining](https://reader035.vdocuments.us/reader035/viewer/2022062519/56814cec550346895db9e9b2/html5/thumbnails/29.jpg)
Variants
• High confidence may not imply high correlation
• Use correlations. Find expected support and treat large departures from it as interesting.
– Brin et al. Limited attempt.
– More complete work in statistical literature on contingency tables.
• Still too many rules, need to prune...
• Does not imply causality as in Bayesian networks
![Page 30: Tutorial on Data Mining](https://reader035.vdocuments.us/reader035/viewer/2022062519/56814cec550346895db9e9b2/html5/thumbnails/30.jpg)
Prevalent ≠ Interesting
• Analysts already know about prevalent rules
• Interesting rules are those that deviate from prior expectation
• Mining’s payoff is in finding surprising phenomena
[Figure: cartoon. 1995: "Milk and cereal sell together!" 1998: "Zzzz... Milk and cereal sell together!"]
![Page 31: Tutorial on Data Mining](https://reader035.vdocuments.us/reader035/viewer/2022062519/56814cec550346895db9e9b2/html5/thumbnails/31.jpg)
What makes a rule surprising?
• Does not match prior expectation
– Correlation between milk and cereal remains roughly constant over time
• Cannot be trivially derived from simpler rules
– Milk 10%, cereal 10%
– Milk and cereal 10% … surprising (independence would give 10% × 10% = 1%)
– Eggs 10%
– Milk, cereal and eggs 0.1% … surprising!
– Expected 1% (10% for milk and cereal × 10% for eggs)
![Page 32: Tutorial on Data Mining](https://reader035.vdocuments.us/reader035/viewer/2022062519/56814cec550346895db9e9b2/html5/thumbnails/32.jpg)
Applications of fast itemset counting
Find correlated events:
• Applications in medicine: find redundant tests
• Cross selling in retail, banking
• Improve predictive capability of classifiers that assume attribute independence
• New similarity measures of categorical attributes [Mannila et al, KDD 98]
![Page 33: Tutorial on Data Mining](https://reader035.vdocuments.us/reader035/viewer/2022062519/56814cec550346895db9e9b2/html5/thumbnails/33.jpg)
Mining market
• Around 20 to 30 mining tool vendors: 1/5th the size of OLAP market.
• Major players:
– Clementine,
– IBM’s Intelligent Miner,
– SGI’s MineSet,
– SAS’s Enterprise Miner.
• All pretty much the same set of tools
• Many embedded products: fraud detection, electronic commerce applications
![Page 34: Tutorial on Data Mining](https://reader035.vdocuments.us/reader035/viewer/2022062519/56814cec550346895db9e9b2/html5/thumbnails/34.jpg)
Integrating mining with DBMS
• Need to
– intermix operations
– iterate through results
– flexibly query and filter results and data
• Existing file-based, batched approach not satisfactory.
• Research challenge: Identify a collection of primitive, composable operators like in relational DBMS and build a “mining engine”
![Page 35: Tutorial on Data Mining](https://reader035.vdocuments.us/reader035/viewer/2022062519/56814cec550346895db9e9b2/html5/thumbnails/35.jpg)
OLAP Mining integration
• OLAP (On Line Analytical Processing)
– Multidimensional view of data: factors are dimensions, quantities to be analyzed are measures/cells.
– Facilitates fast interactive exploration of multidimensional aggregates.
• OLAP products provide a minimal set of tools for analysis:
• Heavy reliance on manual operations for analysis:
– tedious and error-prone on large multidimensional data
• Ideal platform for vertical integration of mining but needs to be interactive instead of batch.
![Page 36: Tutorial on Data Mining](https://reader035.vdocuments.us/reader035/viewer/2022062519/56814cec550346895db9e9b2/html5/thumbnails/36.jpg)
State of art in mining OLAP integration
• Decision trees [Information discovery, Cognos]
– find factors influencing high profits
• Clustering [Pilot software]
– segment customers to define hierarchy on that dimension
• Time series analysis [Seagate’s Holos]
– Query for various shapes along time: spikes, outliers etc
• Multi-level Associations [Han et al.]
– find association between members of dimensions
![Page 37: Tutorial on Data Mining](https://reader035.vdocuments.us/reader035/viewer/2022062519/56814cec550346895db9e9b2/html5/thumbnails/37.jpg)
New approach
Identify complex operations with specific OLAP needs in mind (what does an analyst need?) rather than looking at mining operations and choosing what fits
Two examples:
• Exceptions in data to guide exploration:
– One reason for manual exploration is to make sure that there are no surprises.
– Pre-mine abnormalities in data and point them out to analysts using highlights at aggregate levels
• Reasons for specific why questions at aggregate level
– most compactly represent the answer so that the user can quickly assimilate it
![Page 38: Tutorial on Data Mining](https://reader035.vdocuments.us/reader035/viewer/2022062519/56814cec550346895db9e9b2/html5/thumbnails/38.jpg)
Vertical integration: Mining on the web
• Web log analysis for site design:
– what are popular pages,
– what links are hard to find.
• Electronic stores sales enhancements:
– recommendations, advertisement:
– Collaborative filtering: Net perception, Wisewire
– Inventory control: what was a shopper looking for and could not find..
![Page 39: Tutorial on Data Mining](https://reader035.vdocuments.us/reader035/viewer/2022062519/56814cec550346895db9e9b2/html5/thumbnails/39.jpg)
Research problems
• Automatic model selection: different ways of solving same problem, which one to use?
• Automatic classification of complex data types especially time series data.
• Refreshing mined results: explaining and modeling changes along time
• Quality of mined results: guarding against wrong conclusions, chance discoveries
• Incorporating domain knowledge to filter results and improve result quality
![Page 40: Tutorial on Data Mining](https://reader035.vdocuments.us/reader035/viewer/2022062519/56814cec550346895db9e9b2/html5/thumbnails/40.jpg)
Research problems
• Close integration with data sources to be mined
• Distributed mining across multiple relations at a single site or spread across multiple sites.
• Integration with other data analysis tools: example statistical tools, OLAP and SQL querying
• Interactive data mining: toolkit of micro operators
• Mixed media mining: link textual reports with images and numeric fields
![Page 41: Tutorial on Data Mining](https://reader035.vdocuments.us/reader035/viewer/2022062519/56814cec550346895db9e9b2/html5/thumbnails/41.jpg)
Relevance in India
• Emerging application areas especially in the banking, retail industry and manufacturing processes
• Mining large scientific databases: export laws might require indigenous technology
• Rich research area with interesting algorithm components -- just need to implement.
• Too expensive to purchase US/Europe products
![Page 42: Tutorial on Data Mining](https://reader035.vdocuments.us/reader035/viewer/2022062519/56814cec550346895db9e9b2/html5/thumbnails/42.jpg)
Need to build usable prototypes not simply tweak algorithms for publications.
![Page 43: Tutorial on Data Mining](https://reader035.vdocuments.us/reader035/viewer/2022062519/56814cec550346895db9e9b2/html5/thumbnails/43.jpg)
Summary
• What is data mining and an overview of the various operations:
– Classification: regression, nearest neighbour, neural network, Bayesian
– Clustering: distance based (k-means), distribution based (EM)
– Itemset counting
• Several operations: challenge is choosing the right operation for the problem
• New directions and identification of research problems