![Page 1: A Brief Introduction and Issues on the Classification Problem Jin Mao Postdoc, School of Information, University of Arizona Sept 18, 2015](https://reader036.vdocuments.us/reader036/viewer/2022062413/5a4d1b2c7f8b9ab05999972c/html5/thumbnails/1.jpg)
A Brief Introduction and Issues on the Classification Problem
Jin Mao
Postdoc, School of Information, University of Arizona
Sept 18, 2015
Outline
Classification Examples
Spam email filtering
Fraud detection
Self-driving cars
The Classification Problem
Classic Classifiers
Naïve Bayes
Decision Tree: J48 (C4.5)
KNN
RandomForest
SVM: SMO, LibSVM
Neural Network
…
How to Choose the Classifier?
Observe your data: amount, features
Consider your application: precision/recall requirements, explainability, incremental updates, complexity
A decision tree is easy to understand, but a classification tree such as C4.5 predicts class labels rather than numeric values, and it can be slow to train on large data sets.
Naïve Bayes is fairly robust and easy to update incrementally.
Neural networks and SVMs are "black boxes". An SVM is fast at binary (yes/no) prediction.
Never mind: you can try all of them.
Model Selection with Cross Validation
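As a rough illustration of model selection with cross validation, the sketch below compares two toy candidates by their mean k-fold accuracy. The classifiers (`majority_classifier`, `one_nn_classifier`), the helper `cv_accuracy`, and the data are all made up for this example; in practice the candidates would be the classifiers listed earlier, taken from a library.

```python
from collections import Counter

def majority_classifier(train_X, train_y):
    """Baseline: always predict the most frequent training label."""
    label = Counter(train_y).most_common(1)[0][0]
    return lambda x: label

def one_nn_classifier(train_X, train_y):
    """1-nearest-neighbor with squared Euclidean distance."""
    def predict(x):
        dists = [sum((a - b) ** 2 for a, b in zip(x, t)) for t in train_X]
        return train_y[dists.index(min(dists))]
    return predict

def cv_accuracy(make_classifier, X, y, k=5):
    """Mean accuracy over k folds; fold i holds out items i, i+k, i+2k, ..."""
    scores = []
    for i in range(k):
        test_idx = set(range(i, len(X), k))
        train_X = [x for j, x in enumerate(X) if j not in test_idx]
        train_y = [t for j, t in enumerate(y) if j not in test_idx]
        predict = make_classifier(train_X, train_y)
        hits = sum(predict(X[j]) == y[j] for j in test_idx)
        scores.append(hits / len(test_idx))
    return sum(scores) / k

# Toy data (made up): two well-separated clusters on a line.
X = [(v,) for v in [0.0, 0.1, 0.2, 0.3, 0.4, 5.0, 5.1, 5.2, 5.3, 5.4]]
y = ["a"] * 5 + ["b"] * 5

candidates = {"majority": majority_classifier, "1-NN": one_nn_classifier}
results = {name: cv_accuracy(make, X, y, k=5) for name, make in candidates.items()}
best = max(results, key=results.get)  # candidate with the highest mean accuracy
print(results, "->", best)
```

The same loop works for any number of candidates; only the dictionary of factory functions changes.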
How to Choose the Classifier?
Discussions:
http://stackoverflow.com/questions/2595176/when-to-choose-which-machine-learning-classifier
http://stats.stackexchange.com/questions/7610/top-five-classifiers-to-try-first
http://nlp.stanford.edu/IR-book/html/htmledition/choosing-what-kind-of-classifier-to-use-1.html
http://www.researchgate.net/post/How_to_decide_the_best_classifier_based_on_the_data-set_provided
Train Your Classifier
Obtain Training Set
Instances should be labeled.
From running systems in practice
Annotation by multiple experts (check inter-rater agreement)
Crowdsourcing (Google's CAPTCHA)
…
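One common way to quantify inter-rater agreement between two annotators is Cohen's kappa, which corrects the raw agreement rate for the agreement expected by chance. The slides do not say which measure was intended, so this is only a sketch with made-up labels; `cohen_kappa` is a name chosen for the example.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same instances.

    Undefined (division by zero) when both annotators use a single,
    identical label for every instance.
    """
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability that both pick the same label independently.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    return (observed - expected) / (1 - expected)

annotator_1 = ["spam", "spam", "ham", "ham", "spam", "ham", "ham", "ham"]
annotator_2 = ["spam", "ham", "ham", "ham", "spam", "ham", "spam", "ham"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

Here the annotators agree on 6 of 8 instances, but part of that agreement is expected by chance, so kappa is well below the raw 0.75.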
Obtain Training Set
Large enough
More data can reduce the effect of noise.
The benefit of having enough data can even outweigh the choice of classification algorithm.
Redundant data helps little.
Selection strategies: nearest neighbors, ordered removals, random sampling, particle swarms, or evolutionary methods
Obtain Training Set
Unbalanced Training Instances for Different Classes
Evaluation: with simple measures such as accuracy (or precision/recall computed on the majority class), a classifier that predicts only the majority class (the class with many samples) still gets a high score. (AUC is better.)
Not enough information about the features of the rare class to find the class boundaries.
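The evaluation pitfall can be seen with a small made-up example: a "classifier" that always predicts the majority class looks good on accuracy while finding none of the rare instances. The 95/5 class ratio is an assumption chosen for illustration.

```python
# Made-up labels: 95 negatives (majority class), 5 positives (rare class).
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100  # always predict the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_pos = sum(t == 1 for t in y_true)
rare_recall = true_pos / actual_pos  # share of rare instances that were found

print(f"accuracy={accuracy:.2f}, recall on rare class={rare_recall:.2f}")
```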
Obtain Training Set
Strategies:
Divide the data (typically the majority class) into L distinct clusters, train L predictors, and average them as the final one.
Generate synthetic data for the rare class (e.g., SMOTE).
Reduce the imbalance level, e.g., by cutting down the majority class.
…
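The synthetic-data idea can be sketched as follows: pick a rare-class point, pick one of its nearest rare-class neighbors, and create a new point somewhere on the segment between them. This is a simplified SMOTE-style illustration, not the reference algorithm or any library's implementation; `smote_like`, its parameters, and the data are assumptions. It needs at least two rare-class points.

```python
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Other minority points, sorted by squared distance to `base`.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbor = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbor
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbor)))
    return synthetic

# Made-up rare-class instances with two numeric features.
rare = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_like(rare, n_new=6)
print(new_points)
```

Because each new point lies between two existing rare-class points, the synthetic data stays inside the region the rare class already occupies.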
Obtain Training Set
More materials
https://www.quora.com/In-classification-how-do-you-handle-an-unbalanced-training-set
http://stats.stackexchange.com/questions/57259/highly-unbalanced-test-data-set-and-balanced-training-data-in-classification
He, Haibo, and Edwardo Garcia. "Learning from imbalanced data." Knowledge and Data Engineering, IEEE Transactions on 21, no. 9 (2009): 1263-1284.
Feature Selection
Why
Unrelated features: noise, heavy computation
Interdependent features: redundant features
Better model
http://machinelearningmastery.com/an-introduction-to-feature-selection/
Guyon and Elisseeff, "An Introduction to Variable and Feature Selection" (PDF)
Feature Selection
Feature Selection Methods
Filter methods: apply a statistical measure to assign a score to each feature, e.g., the chi-squared test, information gain, and correlation coefficient scores.
Wrapper methods: treat the selection of a set of features as a search problem.
Embedded methods: learn which features contribute most to the accuracy of the model while the model is being created, e.g., LASSO, Elastic Net, and Ridge Regression.
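A filter method can be sketched with the chi-squared statistic: each feature is scored against the class label independently of any classifier, and the features are then ranked by score. The feature names and values below are invented for the example, and `chi_squared` is a hand-written helper, not a library call.

```python
def chi_squared(feature, labels):
    """Chi-squared statistic between a categorical feature and the class."""
    n = len(labels)
    score = 0.0
    for f in sorted(set(feature)):
        for c in sorted(set(labels)):
            observed = sum(1 for x, y in zip(feature, labels) if x == f and y == c)
            # Count expected in this cell if feature and class were independent.
            expected = feature.count(f) * labels.count(c) / n
            score += (observed - expected) ** 2 / expected
    return score

labels = [1, 1, 1, 1, 0, 0, 0, 0]
features = {
    "contains_offer": [1, 1, 1, 1, 0, 0, 0, 0],  # tracks the label exactly
    "sent_on_monday": [1, 0, 1, 0, 1, 0, 1, 0],  # unrelated to the label
}
scores = {name: chi_squared(col, labels) for name, col in features.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(scores, ranked)
```

A filter would keep the top-ranked features and drop the rest before training.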
Evaluation Method
Basic Evaluation Methods
Precision
Confusion matrix
Per-class accuracy
AUC (Area Under the Curve): the ROC curve shows the sensitivity of the classifier by plotting the true positive rate against the false positive rate.
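To make the ROC/AUC idea concrete, the sketch below computes one point of the ROC curve at a chosen threshold and the AUC itself. It uses the fact that the AUC equals the fraction of (positive, negative) pairs in which the positive instance receives the higher score. Function names, labels, and scores are made up for illustration.

```python
def roc_point(y_true, scores, threshold):
    """(false positive rate, true positive rate) at one decision threshold."""
    tp = sum(t == 1 and s >= threshold for t, s in zip(y_true, scores))
    fp = sum(t == 0 and s >= threshold for t, s in zip(y_true, scores))
    return fp / y_true.count(0), tp / y_true.count(1)

def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly;
    ties count as half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]  # classifier confidence for class 1

fpr, tpr = roc_point(y_true, scores, threshold=0.65)
auc = roc_auc(y_true, scores)
print(fpr, tpr, auc)
```

Sweeping the threshold from high to low traces the whole curve; the AUC summarizes it in one number that does not depend on any single threshold.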
Evaluation Method
Cross Validation
Random Subsampling
K-fold Cross Validation
Leave-one-out Cross Validation
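The three schemes differ only in how the test indices are chosen. The generators below are a minimal sketch of that difference; the function names and parameters are assumptions for this example, and each one yields `(train_indices, test_indices)` pairs.

```python
import random

def random_subsampling(n, n_splits, test_size, seed=0):
    """Each split draws a fresh random test set; test sets may overlap."""
    rng = random.Random(seed)
    for _ in range(n_splits):
        test = set(rng.sample(range(n), test_size))
        yield [i for i in range(n) if i not in test], sorted(test)

def k_fold(n, k):
    """Each index is used for testing exactly once across the k folds."""
    for fold in range(k):
        test = list(range(fold, n, k))
        yield [i for i in range(n) if i % k != fold], test

def leave_one_out(n):
    """K-fold with k = n: every test set holds a single instance."""
    return k_fold(n, n)

folds = list(k_fold(10, 5))
loo = list(leave_one_out(4))
subsamples = list(random_subsampling(10, n_splits=3, test_size=2))
```

In each scheme the classifier is trained on the training indices, scored on the test indices, and the scores are averaged over the splits.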
Cross Validation
Random Subsampling
Cross Validation
K-fold Cross Validation
Cross Validation
Leave-one-out Cross Validation
Cross Validation
Three-way data splits
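A three-way split is usually understood as a training set for fitting, a validation set for choosing the model or its parameters, and a test set that is touched only once for the final error estimate. The slide's original figure is not available, so the 60/20/20 proportions and the helper `three_way_split` below are assumptions.

```python
import random

def three_way_split(items, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle, then cut into training, validation, and test sets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

# 50 instance ids stand in for real labeled instances.
train, val, test = three_way_split(range(50))
print(len(train), len(val), len(test))
```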
Apply the Classifier
Save the Model
Make the Model dynamic
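One possible reading of these two points: persist the trained model's state so it can be reloaded later, and keep updating it as new labeled instances arrive. The sketch below uses per-class word counts (the statistics a Naïve Bayes text classifier is built from) and JSON for storage; the class `WordCountModel` and its methods are hypothetical, and it stores counts only, with no prediction step.

```python
import json
from collections import defaultdict

class WordCountModel:
    """Per-class word counts that can be saved, restored, and updated."""

    def __init__(self, counts=None):
        self.counts = defaultdict(lambda: defaultdict(int))
        for label, words in (counts or {}).items():
            self.counts[label].update(words)

    def update(self, words, label):
        """Incremental training: fold one new labeled instance into the counts."""
        for w in words:
            self.counts[label][w] += 1

    def to_json(self):
        """Serialize the model state to a string."""
        return json.dumps(self.counts)

    @classmethod
    def from_json(cls, text):
        """Rebuild a model from a string produced by to_json."""
        return cls(json.loads(text))

model = WordCountModel()
model.update(["cheap", "offer"], "spam")
saved = model.to_json()                      # write this string to a file or database
restored = WordCountModel.from_json(saved)
restored.update(["offer", "now"], "spam")    # keep learning after reload
print(dict(restored.counts["spam"]))
```

Models that cannot be updated in place would instead be retrained periodically and saved again.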
Thank you!