Statistical Machine Learning on Large Scale Data
krishnarajpm.com/bigdata/pramod.pdf
By
Pramod N
Questions!!? What? Why? Where? How? ML-201

Why would machine learning techniques work?
• Data, data, data: the nature of the data
• Modelling different kinds of problems to form relevant data
• Size of the data available for training

What?: Statistical Machine Learning
• Exploits the nature of the data
• A way to cope with uncertainty
• Data are noisy but follow a distribution
• Employs basic probabilistic decision rules (Bayes' theorem)
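
For reference, Bayes' theorem in generic notation. The symbols $y$ (a class label) and $x$ (an observation) are assumptions used for illustration and do not come from the slides:

```latex
P(y \mid x) = \frac{P(x \mid y)\, P(y)}{P(x)},
\qquad
\hat{y} = \arg\max_{y}\; P(x \mid y)\, P(y)
```

A classifier built on this rule would pick the label with the largest posterior; the denominator $P(x)$ is the same for every label, so it can be dropped from the comparison.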

Why?: Use Statistical ML
• Models are mature
• Mathematical support for optimization
• Can handle high-dimensional data efficiently

Types and Models
• SVM (Support Vector Machines): libSVM, LIBLINEAR
• GMM (Gaussian Mixture Models): K-means, ISODATA

Support Vector Machines: Mechanism
• Developed by Vladimir N. Vapnik as early as 1979; current implementations are based on the 1995 work of Vapnik and Corinna Cortes
• Separation by hyperplanes
• Maximize the margin of classification

How?: SVMs
Classification: choice of boundary
Are these really “equally valid”?
[Figure label: Hyperplane]

Max Margin
Are these really “equally valid”? How can we pick which is best?
Maximize the size of the margin.
[Figure labels: Small Margin, Large Margin, Hyperplane]

Support Vectors
Support vectors are those input points (vectors) closest to the decision boundary.
• They are vectors
• They “support” the decision hyperplane
Support Vectors
The decision hyperplane:
Decision and Margin Function:
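
The formulas for these two headings are not present in the transcript. A sketch of the standard linear SVM forms, assuming the usual notation (weight vector $\mathbf{w}$, bias $b$, labels $y_i \in \{-1, +1\}$), would be:

```latex
\begin{aligned}
\text{Decision hyperplane:} \quad & \mathbf{w}^{\top}\mathbf{x} + b = 0 \\
\text{Decision function:}   \quad & f(\mathbf{x}) = \operatorname{sign}\!\left(\mathbf{w}^{\top}\mathbf{x} + b\right) \\
\text{Margin condition:}    \quad & y_i\left(\mathbf{w}^{\top}\mathbf{x}_i + b\right) \ge 1 \quad \text{for every training point } i
\end{aligned}
```

Under this scaling, the points for which the margin condition holds with equality are the support vectors.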

How do we represent the size of the margin in terms of w?
There must be at least one point that lies on each support hyperplane.
If not, we could define a larger-margin support hyperplane that does touch the nearest point(s).
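
One common way to answer the question above is to scale w and b so that the two support hyperplanes are w.x + b = +1 and w.x + b = -1; the distance between them is then 2/||w||. The snippet below is a minimal sketch under that assumption. The function names and the example values of w, b and the points are hypothetical, chosen only to illustrate the arithmetic.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def margin_width(w):
    """Distance between the hyperplanes w.x + b = +1 and w.x + b = -1."""
    return 2.0 / math.sqrt(dot(w, w))


def functional_margin(w, b, x, y):
    """y * (w.x + b); equals 1 for a point lying on its support hyperplane."""
    return y * (dot(w, x) + b)


# Hypothetical example: w = (2, 0), b = -2 places the support
# hyperplanes at x1 = 1.5 (positive side) and x1 = 0.5 (negative side).
w, b = [2.0, 0.0], -2.0
print(margin_width(w))
print(functional_margin(w, b, [1.5, 0.0], +1))
print(functional_margin(w, b, [0.5, 3.0], -1))
```

Because the margin is 2/||w||, maximizing the margin is the same as minimizing ||w|| subject to every training point having a functional margin of at least 1.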
Summary
Summary- Multi Class

libSVM
• An integrated software package for support vector classification
• Supports multi-class classification
• Both C++ and Java sources
• Python, R, MATLAB, Perl, Ruby, Weka, Common LISP, CLISP, Haskell, LabVIEW, and PHP interfaces
• C# .NET code and a CUDA extension are available

libSVM- command line options
svm-train [options] [input]
optimization finished, #iter = 87
nu = 0.471645
obj = -67.299458, rho = 0.203495
nSV = 88, nBSV = 72
Total nSV = 88
svm-predict [input] [model] [output]
Output labels written to output
Accuracy = 83% (83/100) (classification)

libSVM- Input Format
1 1:0 2:1 3:1 4:239 5:486 6:8 7:8 8:2 9:2 10:7 11:6
1 1:2 2:1 3:3 4: 5:486 6:8 7:8 8:2 9:2 10:7 11:6
1 1:0 2:1 3:5 4:1367 5:335 6:3 7:3 8:2 9:2 10:21 11:72
6 1:182 2:1 3:6 4:1511 5:2957 6:1 7:1 8:2 9:2 10:1 11:3
1 1:0 2:1 3:1 4:239 5:486 6:8 7:8 8:2 9:2 10:7 11:6
6 1:2 2:1 3:3 4: 5:486 6:8 7:8 8:2 9:2 10:7 11:6
1 1:0 2:1 3:5 4:1367 5:335 6:3 7:3 8:2 9:2 10:21 11:72
6 1:182 2:1 3:6 4:1511 5:2957 6:1 7:1 8:2 9:2 10:1 11:3
[label] [index1]:[value1] [index2]:[value2] ...
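
The `[label] [index]:[value]` layout can be read with a few lines of code. The sketch below is a hypothetical helper written for illustration, not part of libSVM; it assumes whitespace-separated tokens and rejects any feature whose value is missing (such as the `4:` entries in the sample rows above).

```python
def parse_libsvm_line(line):
    """Parse one '[label] [index]:[value] ...' line into (label, {index: value})."""
    tokens = line.split()
    if not tokens:
        raise ValueError("empty line")
    label = float(tokens[0])
    features = {}
    for token in tokens[1:]:
        index_text, _, value_text = token.partition(":")
        if not value_text:
            # Covers both 'index:' with no value and tokens without a colon.
            raise ValueError(f"feature {index_text!r} has no value")
        features[int(index_text)] = float(value_text)
    return label, features


label, features = parse_libsvm_line(
    "6 1:182 2:1 3:6 4:1511 5:2957 6:1 7:1 8:2 9:2 10:1 11:3"
)
print(label, features[4], len(features))
```

Storing features in a dictionary keyed by index keeps the representation sparse: indices that do not appear on a line are simply absent.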

libSVM- Types
-S <int> Set type of SVM (default: 0)
• 0 = C-SVC
• 1 = nu-SVC
• 2 = one-class SVM
• 3 = epsilon-SVR
• 4 = nu-SVR
C-SVC: binary classification
One-class SVM: anomaly detection, high-dimensional distribution estimation
ϵ-Support Vector Regression (ϵ-SVR), nu-SVR

libSVM- kernels
-K <int> Set type of kernel function (default: 2)
○ 0 = linear: u'*v
○ 1 = polynomial: (gamma*u'*v + coef0)^degree
○ 2 = radial basis function: exp(-gamma*|u-v|^2)
○ 3 = sigmoid: tanh(gamma*u'*v + coef0)
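
The four kernel formulas listed above translate directly into code. This is a plain-Python sketch for illustration only; the function names are assumptions and none of this is libSVM's own implementation. The parameter names gamma, coef0 and degree follow the formulas.

```python
import math


def dot(u, v):
    """Inner product u'*v of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(u, v):
    """Kernel 0: u'*v."""
    return dot(u, v)


def polynomial_kernel(u, v, gamma, coef0, degree):
    """Kernel 1: (gamma*u'*v + coef0)^degree."""
    return (gamma * dot(u, v) + coef0) ** degree


def rbf_kernel(u, v, gamma):
    """Kernel 2: exp(-gamma*|u-v|^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)


def sigmoid_kernel(u, v, gamma, coef0):
    """Kernel 3: tanh(gamma*u'*v + coef0)."""
    return math.tanh(gamma * dot(u, v) + coef0)


u, v = [1.0, 2.0], [3.0, 4.0]
print(linear_kernel(u, v))
print(polynomial_kernel(u, v, gamma=1.0, coef0=1.0, degree=2))
print(rbf_kernel(u, v, gamma=0.5))
print(sigmoid_kernel(u, v, gamma=0.1, coef0=0.0))
```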

A Cousin- LIBLINEAR
A Library for Large Linear Classification
• The LIBLINEAR package includes a library and command-line tools for the learning task
• libSVM and LIBLINEAR share similar usage and application program interfaces (APIs), which makes it easy to move between them
• The models after training are quite different (in particular, LIBLINEAR stores w in the model, but libSVM does not)
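
The difference between the two kinds of model can be illustrated with a small sketch. It assumes a linear model is fully described by a weight vector w and a bias b, while a kernel model keeps its support vectors together with one coefficient per support vector. The function names, the sign convention for b and the toy numbers are all hypothetical, and the code does not read either library's model files.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def predict_with_w(w, b, x):
    """Linear model: one dot product with the stored weight vector."""
    return 1 if dot(w, x) + b >= 0 else -1


def predict_with_support_vectors(support_vectors, coefs, b, x, kernel=dot):
    """Kernel model: one kernel evaluation per stored support vector."""
    score = sum(c * kernel(sv, x) for sv, c in zip(support_vectors, coefs)) + b
    return 1 if score >= 0 else -1


def collapse_to_w(support_vectors, coefs):
    """With a linear kernel only, the support vectors fold into a single w."""
    dim = len(support_vectors[0])
    return [sum(c * sv[i] for sv, c in zip(support_vectors, coefs)) for i in range(dim)]


# Toy problem: one positive point at (2, 0) and one negative point at (0, 0).
support_vectors = [[2.0, 0.0], [0.0, 0.0]]
coefs = [0.5, -0.5]  # coefficient times label for each support vector
b = -1.0

w = collapse_to_w(support_vectors, coefs)
for x in ([3.0, 1.0], [0.5, 5.0]):
    print(predict_with_w(w, b, x), predict_with_support_vectors(support_vectors, coefs, b, x))
```

With a linear kernel the two predictions agree, and the w form needs one dot product per prediction regardless of how many support vectors there are. With a non-linear kernel no such w exists in the input space, so the support vectors have to be kept.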

When to Use LIBLINEAR and libSVM
• Number of instances << number of features: linear kernel or LIBLINEAR preferred
• Both the number of instances and the number of features are large: LIBLINEAR is preferred
• Number of instances >> number of features: non-linear kernels (RBF) are preferred
Where to find help?
http://www.csie.ntu.edu.tw/~cjlin/libsvm/
http://www.csie.ntu.edu.tw/~cjlin/liblinear/
Documentation API Tools Datasets
GMM- Gaussian Mixture Models
Gaussian Distribution
Carl Friedrich Gauss invented the normal distribution in 1809 as a way to rationalize the method of least squares
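
For reference, the Gaussian density in its one-dimensional and $D$-dimensional forms. The symbols $\mu$, $\sigma^2$, $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ (mean, variance, mean vector, covariance matrix) are the conventional ones and are assumed here, since the slides do not spell them out:

```latex
\mathcal{N}(x \mid \mu, \sigma^{2})
  = \frac{1}{\sqrt{2\pi\sigma^{2}}}
    \exp\!\left(-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right)

\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})
  = \frac{1}{(2\pi)^{D/2}\,\lvert\boldsymbol{\Sigma}\rvert^{1/2}}
    \exp\!\left(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\top}
    \boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)
```

The second expression reduces to the first when $D = 1$ and $\boldsymbol{\Sigma} = \sigma^{2}$.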

GMM
[Figure panels: D=1, D=2]

What is a Gaussian mixture model?
• Problem:
Given a set of data X = {x1, x2, ..., xN} drawn from an unknown distribution (probably a GMM), estimate the parameters θ of the GMM that fits the data.
• Solution:
Maximize the likelihood p(X|θ) of the data with regard to the model parameters.
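
Written out, the problem and its solution take the following form. The number of components $K$, the mixing weights $\pi_k$ and the per-component parameters $\boldsymbol{\mu}_k$, $\boldsymbol{\Sigma}_k$ are conventional notation assumed here, and the data points are assumed to be independent:

```latex
p(\mathbf{x} \mid \theta)
  = \sum_{k=1}^{K} \pi_k \,
    \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k),
\qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1

\hat{\theta}
  = \arg\max_{\theta} \; \log p(X \mid \theta)
  = \arg\max_{\theta} \; \sum_{n=1}^{N}
    \log \sum_{k=1}^{K} \pi_k \,
    \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)
```

The sum inside the logarithm is what prevents a closed-form solution, which is why an iterative method is normally used.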

The Expectation-Maximization algorithm
One of the most popular approaches to maximize the likelihood is to use the Expectation-Maximization (EM) algorithm.
• Basic ideas of the EM algorithm:
- Introduce a hidden variable such that its knowledge would simplify the maximization of the likelihood.
- At each iteration:
• E-Step: Estimate the distribution of the hidden variable given the data and the current value of the parameters.
• M-Step: Maximize the joint distribution of the data and the hidden variable.

The EM for the GMM (graphical view 1)
Hidden variable: for each point, which Gaussian generated it?

The EM for the GMM (graphical view 2)
E-Step: for each point, estimate the probability that each Gaussian generated it.

The EM for the GMM (graphical view 3)
M-Step: modify the parameters according to the hidden variable to maximize the likelihood of the data (and the hidden variable).
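
The E-step and M-step described in these three views can be sketched for a one-dimensional mixture. This is a minimal illustration under stated assumptions, not a library implementation: the function names, the `min_variance` floor, the initial parameters and the toy data are all hypothetical.

```python
import math


def gaussian_pdf(x, mean, variance):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)


def log_likelihood(data, weights, means, variances):
    """log p(X | theta) for a 1-D mixture, assuming independent points."""
    return sum(
        math.log(sum(w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)))
        for x in data
    )


def em_step(data, weights, means, variances, min_variance=1e-6):
    """One EM iteration; returns updated (weights, means, variances)."""
    k = len(weights)
    # E-step: probability that each Gaussian generated each point.
    responsibilities = []
    for x in data:
        joint = [weights[j] * gaussian_pdf(x, means[j], variances[j]) for j in range(k)]
        total = sum(joint)
        responsibilities.append([p / total for p in joint])
    # M-step: re-estimate parameters from the responsibility-weighted data.
    new_weights, new_means, new_variances = [], [], []
    for j in range(k):
        nj = sum(r[j] for r in responsibilities)
        mean = sum(r[j] * x for r, x in zip(responsibilities, data)) / nj
        var = sum(r[j] * (x - mean) ** 2 for r, x in zip(responsibilities, data)) / nj
        new_weights.append(nj / len(data))
        new_means.append(mean)
        new_variances.append(max(var, min_variance))  # guard against collapse
    return new_weights, new_means, new_variances


# Toy data: two well-separated groups.
data = [-0.2, 0.0, 0.1, 0.3, 4.8, 5.0, 5.1, 5.3]
weights, means, variances = [0.5, 0.5], [1.0, 4.0], [1.0, 1.0]
for _ in range(20):
    weights, means, variances = em_step(data, weights, means, variances)
print(weights, means, variances)
```

Each iteration should leave the log-likelihood no lower than before, which can be checked by calling `log_likelihood` before and after `em_step`.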

ISODATA
What is ISODATA? Iterative Self-Organizing Data Analysis Technique

Properties
• ISODATA is a method of semi-unsupervised classification
• Don’t need to know the number of clusters
• Algorithm splits and merges clusters
• User defines threshold values for parameters
• Algorithm runs for many iterations until threshold is reached

How does ISODATA work?
• Cluster centers are randomly placed and pixels are assigned to the nearest center (shortest-distance method)
• The standard deviation within each cluster and the distance between cluster centers are calculated
• Clusters are split if one or more standard deviations are greater than the user-defined threshold
• Clusters are merged if the distance between them is less than the user-defined threshold
Visualize

ISODATA- Continued
• A second iteration is performed with the new cluster centers
• Further iterations are performed until:
- the average inter-center distance falls below the user-defined threshold,
- the average change in the inter-center distance between iterations is less than a threshold, or
- the maximum number of iterations is reached
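
The assign, split, merge and iterate cycle can be sketched in one dimension. This is a simplified illustration under several assumptions and not a reference ISODATA implementation: the data are scalars rather than multi-band pixels, a split places the two new centers one standard deviation either side of the old mean, merged centers are replaced by their midpoint, and the loop stops when the centers stop moving or `max_iter` is reached. All names and thresholds are hypothetical.

```python
import statistics


def assign(points, centers):
    """Assign each point to its nearest center; empty clusters are dropped."""
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
        clusters[nearest].append(p)
    return [c for c in clusters if c]


def isodata_1d(points, initial_centers, split_std, merge_dist, max_iter=20, tol=1e-9):
    """Simplified 1-D ISODATA; returns (centers, clusters)."""
    centers = sorted(initial_centers)
    for _ in range(max_iter):
        clusters = assign(points, centers)
        new_centers = []
        for cluster in clusters:
            mean = statistics.fmean(cluster)
            std = statistics.pstdev(cluster)
            if std > split_std:
                # Split: spread too large, so replace one center with two.
                new_centers.extend([mean - std, mean + std])
            else:
                new_centers.append(mean)
        new_centers.sort()
        # Merge: neighbouring centers closer than merge_dist become one.
        merged = [new_centers[0]]
        for c in new_centers[1:]:
            if c - merged[-1] < merge_dist:
                merged[-1] = (merged[-1] + c) / 2.0
            else:
                merged.append(c)
        converged = len(merged) == len(centers) and all(
            abs(a - b) <= tol for a, b in zip(merged, centers)
        )
        if converged:
            break
        centers = merged
    return centers, assign(points, centers)


points = [0.0, 0.2, 0.4, 5.0, 5.2, 5.4, 9.8, 10.0, 10.2]
centers, clusters = isodata_1d(points, [4.0], split_std=1.0, merge_dist=1.0)
print(centers)
print(clusters)
```

Starting from a single center, the splits let the number of clusters grow until every cluster is tighter than `split_std`, which is how the method avoids needing the cluster count in advance.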

ALSO!!! Highlights
• Clusters associated with fewer than the user-specified minimum number of pixels are eliminated
• Lone pixels are either put back in the pool for reclassification, or ignored as “unclassifiable”

Drawbacks of ISODATA
• May be time consuming if the data is very unstructured
• Algorithm can spiral out of control, leaving only one class

Advantages of ISODATA
• Don’t need to know much about the data beforehand
• Little user effort required
• ISODATA is very effective at identifying spectral clusters in data

Other Tools
• Apache Mahout: scalable machine learning library
- Improved clustering and classification algorithms
• R, with Hadoop plugin
- Implementation of most of the standard algorithms

Challenges
• Sparse Learning in High Dimensions
• Semi-Supervised Learning
• Computation and Risk
• Structured Prediction
• Heavily dependent on PAST!!

Future Scope
• Lots of areas can benefit from statistical learning
• More parallel implementations of algorithms
- On massively parallel platforms
- Distributed platforms
- GPU computing
Thank You