A Brief Introduction and Issues on the Classification Problem
Jin Mao, Postdoc, School of Information, University of Arizona
Sept 18, 2015
Outline
Classification Examples
Spam Email filtering
Fraud detection
Self-piloting automobile
The Classification Problem
Classic Classifiers
Naïve Bayes
Decision Tree : J48(C4.5)
KNN
RandomForest
SVM : SMO, LibSVM
Neural Network
…
How to Choose the Classifier?
Observe your data: amount of data, types of features
Consider your application: precision/recall requirements, explainability, incremental updates, complexity
A decision tree is easy to understand, but cannot predict numerical values and is slow.
Naïve Bayes is fairly robust and easy to update incrementally.
Neural networks and SVMs are "black boxes". SVMs are fast at predicting yes or no.
Never mind: you can try all of them.
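The "try all of them" advice can be sketched as a small experiment, here with scikit-learn and a synthetic dataset (the dataset and parameters are illustrative, not a prescription):

```python
# Compare several classic classifiers with 5-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

classifiers = {
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "SVM": SVC(kernel="rbf"),
}

# Mean cross-validated accuracy for each candidate model
scores = {name: cross_val_score(clf, X, y, cv=5).mean()
          for name, clf in classifiers.items()}

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

The highest-scoring model on held-out folds is a reasonable default choice, which is exactly the model-selection-by-cross-validation idea discussed next.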
Model Selection with Cross Validation
How to Choose the Classifier?
Discussions:
http://stackoverflow.com/questions/2595176/when-to-choose-which-machine-learning-classifier
http://stats.stackexchange.com/questions/7610/top-five-classifiers-to-try-first
http://nlp.stanford.edu/IR-book/html/htmledition/choosing-what-kind-of-classifier-to-use-1.html
http://www.researchgate.net/post/How_to_decide_the_best_classifier_based_on_the_data-set_provided
Train Your Classifier
Obtain Training Set
Instances should be labeled.
From running systems in practice
Annotation by multiple experts (inter-rater agreement)
Crowdsourcing (Google's CAPTCHA)
…
Obtain Training Set
Large enough
More data can reduce noise.
The benefit of enough data can even dominate that of the choice of classification algorithm.
Redundant data will help little.
Selection strategies: nearest neighbors, ordered removals, random sampling, particle swarms, or evolutionary methods
Obtain Training Set
Unbalanced training instances across classes
Evaluation: with simple measures such as precision/recall, a classifier that always predicts the majority class (the class with many samples) can still score highly. (AUC is a better measure.)
Not enough information in the minority-class instances for the features to find the class boundaries.
Obtain Training Set
Strategies:
Divide the majority class into L distinct clusters, train L predictors, and average them into the final one.
Generate synthetic data for the rare class (SMOTE).
Reduce the imbalance level: cut down the majority class.
…
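The last strategy, cutting down the majority class, can be sketched as random under-sampling with NumPy (a toy dataset is used for illustration; SMOTE would instead synthesize new minority samples):

```python
# Balance a 90/10 dataset by randomly under-sampling the majority class.
import numpy as np

rng = np.random.default_rng(0)

# Imbalanced toy data: 90 majority (class 0), 10 minority (class 1)
X = rng.normal(size=(100, 3))
y = np.array([0] * 90 + [1] * 10)

minority_idx = np.flatnonzero(y == 1)
majority_idx = np.flatnonzero(y == 0)

# Keep all minority samples; draw an equal number from the majority
keep_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)
keep = np.concatenate([minority_idx, keep_majority])

X_balanced, y_balanced = X[keep], y[keep]
print(np.bincount(y_balanced))  # prints [10 10]
```

Under-sampling discards information, so it is most attractive when the majority class is large and redundant; otherwise the synthetic-data route (SMOTE) is usually preferred.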
Obtain Training Set
More materials:
https://www.quora.com/In-classification-how-do-you-handle-an-unbalanced-training-set
http://stats.stackexchange.com/questions/57259/highly-unbalanced-test-data-set-and-balanced-training-data-in-classification
He, Haibo, and Edwardo Garcia. "Learning from imbalanced data." IEEE Transactions on Knowledge and Data Engineering 21, no. 9 (2009): 1263-1284.
Feature Selection
Why?
Unrelated features: noise, heavy computation
Interdependent features: redundant features
Better model
http://machinelearningmastery.com/an-introduction-to-feature-selection/
Guyon and Elisseeff, "An Introduction to Variable and Feature Selection" (PDF)
Feature Selection
Feature selection methods
Filter methods: apply a statistical measure to assign a score to each feature. E.g., the chi-squared test, information gain, and correlation coefficient scores.
Wrapper methods: treat the selection of a set of features as a search problem.
Embedded methods: learn which features best contribute to the accuracy of the model while the model is being created. E.g., LASSO, Elastic Net, and Ridge Regression.
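A filter method is the simplest to sketch; here the chi-squared test scores each feature and keeps the top k, using scikit-learn and the Iris dataset for illustration (the choice of k is arbitrary):

```python
# Filter-method feature selection: score features with the chi-squared
# test and keep the two highest-scoring ones.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)       # 4 non-negative features
selector = SelectKBest(score_func=chi2, k=2)
X_selected = selector.fit_transform(X, y)

print(X.shape, "->", X_selected.shape)  # (150, 4) -> (150, 2)
print(selector.scores_)                 # one chi-squared score per feature
```

Note that the chi-squared test requires non-negative feature values; for general data, information gain (mutual information) or correlation-based scores are common filter alternatives.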
Evaluation Method
Basic evaluation measures
Precision
Confusion matrix
Per-class accuracy
AUC (Area Under the Curve): the ROC curve shows the sensitivity of the classifier by plotting the true positive rate against the false positive rate.
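These measures can be computed directly with scikit-learn; the dataset and model below are illustrative stand-ins:

```python
# Confusion matrix and ROC AUC for a binary classifier.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)
y_score = clf.predict_proba(X_te)[:, 1]  # scores used for the ROC curve

print(confusion_matrix(y_te, y_pred))    # rows: true class, cols: predicted
print("AUC:", roc_auc_score(y_te, y_score))
```

Per-class accuracy can be read off the diagonal of the confusion matrix divided by each row's total, which is why the matrix is the starting point for most other measures.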
Evaluation Method
Cross Validation
Random Subsampling
K-fold Cross Validation
Leave-one-out Cross Validation
Cross Validation
Random Subsampling
Cross Validation
K-fold Cross Validation
Cross Validation
Leave-one-out Cross Validation
Cross Validation
Three-way data splits
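A three-way data split can be sketched by applying scikit-learn's train_test_split twice; the proportions below are illustrative:

```python
# Three-way split: train / validation / test.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=100, random_state=0)

# First carve off 20% as the held-out test set
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Then split the remainder into train (75%) and validation (25%)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # prints 60 20 20
```

The validation set is used to choose among models or hyperparameters, while the test set is touched only once, for the final performance estimate.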
Apply the Classifier
Save the Model
Make the Model dynamic
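Saving the model so the running application can reuse it might look like the following joblib sketch (the filename is illustrative):

```python
# Persist a trained classifier to disk and restore it.
import joblib  # installed alongside scikit-learn; also available standalone
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
model = GaussianNB().fit(X, y)

joblib.dump(model, "classifier.joblib")       # save after training
restored = joblib.load("classifier.joblib")   # load in the serving process

# The restored model makes the same predictions as the original
print((restored.predict(X) == model.predict(X)).all())
```

Making the model dynamic then amounts to periodically retraining (or incrementally updating, for models like Naïve Bayes) and overwriting the saved file.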
Thank you!