A Brief Introduction and Issues on the Classification Problem
Jin Mao, Postdoc, School of Information, University of Arizona
Sept 18, 2015
Outline
Classification Examples
Spam Email filtering
Fraud detection
Self-piloting automobile
The Classification Problem
Classic Classifiers
Naïve Bayes
Decision Tree : J48(C4.5)
KNN
RandomForest
SVM : SMO, LibSVM
Neural Network
…
How to Choose the Classifier?
Observe your data: amount of data, types of features
Consider your application: precision/recall requirements, explainability, incremental updates, complexity
A decision tree is easy to understand, but cannot predict numerical values and is slow.
Naïve Bayes is fairly robust and easy to update incrementally.
Neural networks and SVMs are "black boxes". SVMs are fast at predicting yes or no.
Never mind: you can try all of them.
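The "try all of them" advice can be sketched as a small experiment, here with scikit-learn and a synthetic dataset (the dataset and parameters are illustrative, not a prescription):

```python
# Compare several classic classifiers with 5-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

classifiers = {
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "SVM": SVC(kernel="rbf"),
}

# Mean cross-validated accuracy for each candidate model
scores = {name: cross_val_score(clf, X, y, cv=5).mean()
          for name, clf in classifiers.items()}

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

The highest-scoring model on held-out folds is a reasonable default choice, which is exactly the model-selection-by-cross-validation idea discussed next.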
Model Selection with Cross Validation
How to Choose the Classifier?
Discussions:
http://stackoverflow.com/questions/2595176/when-to-choose-which-machine-learning-classifier
http://stats.stackexchange.com/questions/7610/top-five-classifiers-to-try-first
http://nlp.stanford.edu/IR-book/html/htmledition/choosing-what-kind-of-classifier-to-use-1.html
http://www.researchgate.net/post/How_to_decide_the_best_classifier_based_on_the_data-set_provided
Train Your Classifier
Obtain Training Set
Instances should be labeled.
From running systems in practice
Annotation by multiple experts (inter-rater agreement)
Crowdsourcing (Google's CAPTCHA)
…
Obtain Training Set
Large enough
More data can reduce noise.
The benefit of enough data can even dominate that of the choice of classification algorithm.
Redundant data will help little.
Selection strategies: nearest neighbors, ordered removals, random sampling, particle swarms, or evolutionary methods
Obtain Training Set
Unbalanced training instances across classes
Evaluation: with simple measures such as precision/recall, a classifier that always predicts the majority class (the class with many samples) can still score highly. (AUC is a better measure.)
Not enough information in the minority-class instances for the features to find the class boundaries.
Obtain Training Set
Strategies:
Divide the majority class into L distinct clusters, train L predictors, and average them into the final one.
Generate synthetic data for the rare class (SMOTE).
Reduce the imbalance level: cut down the majority class.
…
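The last strategy, cutting down the majority class, can be sketched as random under-sampling with NumPy (a toy dataset is used for illustration; SMOTE would instead synthesize new minority samples):

```python
# Balance a 90/10 dataset by randomly under-sampling the majority class.
import numpy as np

rng = np.random.default_rng(0)

# Imbalanced toy data: 90 majority (class 0), 10 minority (class 1)
X = rng.normal(size=(100, 3))
y = np.array([0] * 90 + [1] * 10)

minority_idx = np.flatnonzero(y == 1)
majority_idx = np.flatnonzero(y == 0)

# Keep all minority samples; draw an equal number from the majority
keep_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)
keep = np.concatenate([minority_idx, keep_majority])

X_balanced, y_balanced = X[keep], y[keep]
print(np.bincount(y_balanced))  # prints [10 10]
```

Under-sampling discards information, so it is most attractive when the majority class is large and redundant; otherwise the synthetic-data route (SMOTE) is usually preferred.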
Obtain Training Set
More materials:
https://www.quora.com/In-classification-how-do-you-handle-an-unbalanced-training-set
http://stats.stackexchange.com/questions/57259/highly-unbalanced-test-data-set-and-balanced-training-data-in-classification
He, Haibo, and Edwardo Garcia. "Learning from imbalanced data." IEEE Transactions on Knowledge and Data Engineering 21, no. 9 (2009): 1263-1284.
Feature Selection
Why?
Unrelated features: noise, heavy computation
Interdependent features: redundant features
Better model
http://machinelearningmastery.com/an-introduction-to-feature-selection/
Guyon and Elisseeff, "An Introduction to Variable and Feature Selection" (PDF)
Feature Selection
Feature selection methods
Filter methods: apply a statistical measure to assign a score to each feature. E.g., the chi-squared test, information gain, and correlation coefficient scores.
Wrapper methods: treat the selection of a set of features as a search problem.
Embedded methods: learn which features best contribute to the accuracy of the model while the model is being created. E.g., LASSO, Elastic Net, and Ridge Regression.
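A filter method is the simplest to sketch; here the chi-squared test scores each feature and keeps the top k, using scikit-learn and the Iris dataset for illustration (the choice of k is arbitrary):

```python
# Filter-method feature selection: score features with the chi-squared
# test and keep the two highest-scoring ones.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)       # 4 non-negative features
selector = SelectKBest(score_func=chi2, k=2)
X_selected = selector.fit_transform(X, y)

print(X.shape, "->", X_selected.shape)  # (150, 4) -> (150, 2)
print(selector.scores_)                 # one chi-squared score per feature
```

Note that the chi-squared test requires non-negative feature values; for general data, information gain (mutual information) or correlation-based scores are common filter alternatives.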
Evaluation Method
Basic evaluation measures
Precision
Confusion matrix
Per-class accuracy
AUC (Area Under the Curve): the ROC curve shows the sensitivity of the classifier by plotting the true positive rate against the false positive rate.
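These measures can be computed directly with scikit-learn; the dataset and model below are illustrative stand-ins:

```python
# Confusion matrix and ROC AUC for a binary classifier.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)
y_score = clf.predict_proba(X_te)[:, 1]  # scores used for the ROC curve

print(confusion_matrix(y_te, y_pred))    # rows: true class, cols: predicted
print("AUC:", roc_auc_score(y_te, y_score))
```

Per-class accuracy can be read off the diagonal of the confusion matrix divided by each row's total, which is why the matrix is the starting point for most other measures.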
Evaluation Method
Cross Validation
Random Subsampling
K-fold Cross Validation
Leave-one-out Cross Validation
Cross Validation
Random Subsampling
Cross Validation
K-fold Cross Validation
Cross Validation
Leave-one-out Cross Validation
Cross Validation
Three-way data splits
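A three-way data split can be sketched by applying scikit-learn's train_test_split twice; the proportions below are illustrative:

```python
# Three-way split: train / validation / test.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=100, random_state=0)

# First carve off 20% as the held-out test set
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Then split the remainder into train (75%) and validation (25%)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # prints 60 20 20
```

The validation set is used to choose among models or hyperparameters, while the test set is touched only once, for the final performance estimate.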
Apply the Classifier
Save the Model
Make the Model dynamic
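Saving the model so the running application can reuse it might look like the following joblib sketch (the filename is illustrative):

```python
# Persist a trained classifier to disk and restore it.
import joblib  # installed alongside scikit-learn; also available standalone
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
model = GaussianNB().fit(X, y)

joblib.dump(model, "classifier.joblib")       # save after training
restored = joblib.load("classifier.joblib")   # load in the serving process

# The restored model makes the same predictions as the original
print((restored.predict(X) == model.predict(X)).all())
```

Making the model dynamic then amounts to periodically retraining (or incrementally updating, for models like Naïve Bayes) and overwriting the saved file.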
Thank you!