Data Mining and Machine Learning: A Brief Introduction
TRANSCRIPT
Outline
- A brief introduction to learning algorithms
  - Classification algorithms
  - Clustering algorithms
- Addressing privacy issues in learning
  - Single dataset publishing
  - Distributed multiple datasets
  - How data is partitioned
A quick review
Machine learning algorithms:
- Supervised learning (classification): training data have class labels; the goal is to find the boundary between classes.
- Unsupervised learning (clustering): training data have no labels; the similarity measure is the key, and records are grouped based on that measure.
A quick review
Good tutorials:
- http://www.cs.utexas.edu/~mooney/cs391L/
- "Top 10 data mining algorithms": www.cs.uvm.edu/~icdm/algorithms/10Algorithms-08.pdf
We will review the basic ideas of some of these algorithms.
C4.5 decision tree (classification)
- Based on the ID3 algorithm
- Converts the decision tree to a rule set: each path from the root to a leaf becomes a rule
- Prunes the rules
Cross validation:
- Split the data into N folds
- In each round: training, validating, testing
  - Validating is for choosing the best parameters
  - Testing is for measuring the generalization power
- Final result: the average of the N testing results
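The fold-based procedure can be sketched in plain Python. This is an illustrative outline, simplified to one train/test evaluation per fold rather than the slide's three-way train/validate/test split; `train_fn` and `eval_fn` are hypothetical stand-ins for a learner and a scoring function:

```python
# Minimal N-fold cross-validation sketch. "train_fn" and "eval_fn"
# are hypothetical stand-ins for a learner and a scoring function.

def n_fold_indices(n_samples, n_folds):
    """Yield (train_indices, test_indices) for each of the N folds."""
    fold_size = n_samples // n_folds
    indices = list(range(n_samples))
    for f in range(n_folds):
        test = indices[f * fold_size:(f + 1) * fold_size]
        train = indices[:f * fold_size] + indices[(f + 1) * fold_size:]
        yield train, test

def cross_validate(data, labels, train_fn, eval_fn, n_folds=5):
    """Train on N-1 folds, score on the held-out fold, average."""
    scores = []
    for train_idx, test_idx in n_fold_indices(len(data), n_folds):
        model = train_fn([data[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        scores.append(eval_fn(model,
                              [data[i] for i in test_idx],
                              [labels[i] for i in test_idx]))
    return sum(scores) / len(scores)  # the average of N testing results
```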
Naïve Bayes (classification)
Two classes (0/1); feature vector x = (x1, x2, …, xn).
Apply Bayes' rule: P(class | x) ∝ P(x | class) · P(class)
Assume independent features: P(x | class) = P(x1 | class) · P(x2 | class) · … · P(xn | class)
It is easy to estimate f(xi | class label) by counting over the training data.
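The counting step can be sketched as follows. This is a minimal illustration for discrete features and 0/1 labels; the tiny fallback probability for unseen feature values is an assumption for robustness, not something from the slides:

```python
# Naive Bayes by counting: estimate P(class) and P(x_i = v | class)
# from frequencies, then score a new vector by multiplying the
# per-feature conditionals (the independence assumption).
from collections import defaultdict

def train_nb(X, y):
    """X: list of discrete feature vectors, y: list of 0/1 labels."""
    prior = defaultdict(float)
    cond = defaultdict(lambda: defaultdict(float))  # cond[c][(i, v)]
    for xv, c in zip(X, y):
        prior[c] += 1
        for i, v in enumerate(xv):
            cond[c][(i, v)] += 1
    n = len(y)
    for c in prior:
        for key in cond[c]:
            cond[c][key] /= prior[c]   # P(x_i = v | class = c)
        prior[c] /= n                  # P(class = c)
    return prior, cond

def predict_nb(model, xv):
    prior, cond = model
    best, best_score = None, -1.0
    for c in prior:
        score = prior[c]
        for i, v in enumerate(xv):
            score *= cond[c].get((i, v), 1e-9)  # unseen value: tiny prob
        if score > best_score:
            best, best_score = c, score
    return best
```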
K-nearest neighbor (classification)
- An "instance-based learning" method
- Classifies a point by a vote among its k nearest training examples
- Decision area: Dz
- More general: kernel methods
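A minimal k-NN sketch, using Euclidean distance and a majority vote (function names are illustrative):

```python
# k-nearest-neighbor classification: rank the training points by
# Euclidean distance to the query and take a majority vote over
# the labels of the k closest ones.
from collections import Counter
import math

def knn_predict(X, y, query, k=3):
    """X: list of points, y: their labels, query: point to classify."""
    nearest = sorted(range(len(X)), key=lambda i: math.dist(X[i], query))[:k]
    votes = Counter(y[i] for i in nearest)
    return votes.most_common(1)[0][0]
```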
Linear classifier (classification)
- Decision boundary: w^T x + b = 0
- One side of the boundary has w^T x + b > 0, the other w^T x + b < 0
- Classification rule: f(x) = sign(w^T x + b)
Examples:
- Perceptron
- Linear discriminant analysis (LDA)
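The perceptron named above can be sketched with its classic mistake-driven update, assuming labels in {-1, +1} (the fixed epoch count is an arbitrary choice for this illustration):

```python
# Perceptron sketch (labels in {-1, +1}): predict with
# f(x) = sign(w^T x + b) and, on each mistake, apply the classic
# update w <- w + t * x, b <- b + t.

def sign(v):
    return 1 if v >= 0 else -1

def perceptron_predict(w, b, xv):
    return sign(sum(wi * xi for wi, xi in zip(w, xv)) + b)

def train_perceptron(X, y, epochs=20):
    """X: list of feature tuples, y: labels in {-1, +1}."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xv, t in zip(X, y):
            if perceptron_predict(w, b, xv) != t:   # a mistake
                w = [wi + t * xi for wi, xi in zip(w, xv)]
                b += t
    return w, b
```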
There are an infinite number of linear separators. Which one is optimal?
Support Vector Machine (classification)
- The distance from an example xi to the separator is r = |w^T xi + b| / ‖w‖
- Examples closest to the hyperplane are support vectors.
- The margin ρ of the separator is the distance between the support vectors on either side; training maximizes ρ.
- Extended to handle: 1. nonlinear boundaries, 2. noisy margins, 3. large datasets
Boosting (classification)
- Classifier ensembles: combine the predictions of a set of classifiers trained on the same data, H(x) = Σ hi(x)
- The learning examples are reweighted for each new classifier hi(x) based on the previous classifiers, with emphasis on incorrectly predicted examples
Intuition:
- Sample weighting directs each new classifier toward the hard examples
- Averaging can reduce the variance of the prediction
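The reweighting idea can be sketched with decision stumps on 1-D data. This is an illustrative simplification in the AdaBoost style, not the exact algorithm from the Freund-Schapire paper:

```python
# Boosting sketch: decision stumps on 1-D data, labels in {-1, +1}.
# Each round fits a stump to the weighted data, then increases the
# weights of misclassified examples so the next stump concentrates
# on them; the ensemble predicts by a weighted vote.
import math

def stump_fit(X, y, w):
    """Return (error, threshold, polarity) minimizing weighted error."""
    best = (float('inf'), X[0], 1)
    for thr in X:
        for pol in (1, -1):
            err = sum(wi for xi, yi, wi in zip(X, y, w)
                      if (pol if xi >= thr else -pol) != yi)
            if err < best[0]:
                best = (err, thr, pol)
    return best

def adaboost(X, y, rounds=10):
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []                       # (alpha, threshold, polarity)
    for _ in range(rounds):
        err, thr, pol = stump_fit(X, y, w)
        err = max(err, 1e-10)           # avoid division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, thr, pol))
        # emphasize the incorrectly predicted examples
        w = [wi * math.exp(-alpha * yi * (pol if xi >= thr else -pol))
             for xi, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def ensemble_predict(ensemble, x):
    score = sum(a * (p if x >= t else -p) for a, t, p in ensemble)
    return 1 if score >= 0 else -1
```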
AdaBoost: Freund Y, Schapire RE (1997). A decision-theoretic generalization of on-line learning and an application to boosting. J Comput Syst Sci.
Gradient boosting: J. Friedman, Stochastic Gradient Boosting, http://citeseer.ist.psu.edu/old/126259.html
Clustering
Definition of similarity measures:
- Point-wise:
  - Euclidean
  - Cosine (document similarity)
  - Correlation
  - …
- Set-wise:
  - Min/max distance between two sets
  - Entropy-based (categorical data)
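The point-wise measures can be sketched for plain Python lists. Euclidean distance is smaller for more similar points; cosine similarity is larger, and is the common choice for documents:

```python
# Two point-wise similarity measures: Euclidean distance and
# cosine similarity (the angle-based measure used for documents).
import math

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def cosine(a, b):
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm_a = math.sqrt(sum(ai * ai for ai in a))
    norm_b = math.sqrt(sum(bi * bi for bi in b))
    return dot / (norm_a * norm_b)
```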
Types of clustering algorithms
- Hierarchical:
  1. Merge the most similar pair at each step
  2. Repeat until reaching the desired number of clusters
- Partitioning (k-means):
  1. Set the initial centroids
  2. Partition the data
  3. Adjust the centroids
  4. Iterate on 2 and 3 until convergence
Another way to classify the algorithms:
- Agglomerative (bottom-up) methods
- Divisive (partitional, top-down) methods
Challenges in Clustering
- Efficiency of the algorithm on large datasets
  - Linear-cost algorithms: k-means
  - However, the cost of many algorithms is quadratic
- A common remedy is three-phase processing:
  1. Sampling
  2. Clustering
  3. Labeling
Challenges in Clustering (continued)
- Irregularly shaped clusters and noise
Sample clustering algorithms
- Typical ones: k-means, Expectation-Maximization (EM)
- Many other clustering algorithms address different challenges
- Good survey: A. K. Jain et al., "Data Clustering: A Review," ACM Computing Surveys, 1999
K-means illustration
1. Randomly select the initial centroids
2. Assign a cluster label to each point according to its distance to the centroids
K-means illustration (continued)
3. Recalculate the centroids and recluster the points
4. Repeat until the cluster labels do not change, or the changes in the centroids are very small
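The loop walked through on the two k-means slides can be sketched as follows. Taking the first k points as the initial centroids is a simplification of random selection:

```python
# K-means sketch: initialize centroids, assign each point to its
# nearest centroid, recompute each centroid as the mean of its
# cluster, and stop when the labels no longer change.
import math

def kmeans(points, k, max_iter=100):
    centroids = [points[i] for i in range(k)]   # simplified init
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:        # cluster labels did not change
            break
        labels = new_labels
        for c in range(k):              # recalculate the centroids
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members)
                                     for dim in zip(*members))
    return labels, centroids
```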
PPDM (privacy-preserving data mining) issues
- How data is collected:
  - A single party releases its data
  - Multiple parties collaboratively mine data, by pooling the data or via cryptographic protocols
- How data is partitioned: horizontally or vertically
Single party
- Data perturbation:
  - Rakesh00, for decision trees
  - Chen05, for many classifiers and clustering algorithms
- Anonymization:
  - Top-down/bottom-up: decision trees
Multiple parties
Party 1
data
Party 2
data
Party n
dataserver
data
user 1 user 1 user 1
Perturbeddata
network
Service-based computing Peer-to-peer computing
•Perturbation & anonymization•Papers: 89,92,94,185,
•Cryptographic approaches•Papers: 95-99,104,107,108
How data is partitioned
- Horizontally partitioned:
  - All additive (and some multiplicative) perturbation methods
  - Protocols: k-means, SVM, naïve Bayes, Bayesian networks, …
- Vertically partitioned:
  - All additive perturbation methods
  - Protocols: k-means, Bayesian networks, …
Challenges and opportunities
- Many modeling methods have no privacy-preserving version
- Cost of protocol-based approaches
- Limitations of column-based additive perturbation
- Complexity
- Opportunity: privacy-preserving methods that apply to a whole class of DM algorithms, e.g., geometric perturbation