Bias-Variance Analysis


Upload: nikhil-ketkar

Post on 13-Jul-2016




Systematic Improvement of Binary Classification Models using Bias-Variance Analysis

 

Binary classification (building a model that distinguishes between two classes) is pretty much the hammer of machine learning: a lot of real-world problems can be framed as binary classification problems. Let's say you have data points categorized into two classes (canonically referred to as the positive and negative classes), and a model for categorization which business users say is not good enough. It's your job to improve the model. Where do you start? The subject of this post is the systematic improvement of a binary classification model using bias-variance analysis.

 

The first step is to set up a harness for cross-validation (10-fold is recommended) to determine whether you have more of a bias (underfitting) problem or more of a variance (overfitting) problem, using a metric that is invariant to the class distribution (F1 score is recommended). So, let's say you do a 10-fold cross-validation and determine the mean (and the variance) of the training F1 score (denoted by FTrain) and the test F1 score (denoted by FTest) over the folds. There are only four possibilities.
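Such a harness can be sketched with scikit-learn; the dataset and model below are placeholders for illustration, not a recommendation:

```python
# Minimal cross-validation harness: mean and variance of the
# training and test F1 scores over 10 folds.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
model = LogisticRegression(max_iter=1000)

cv = cross_validate(model, X, y, cv=10, scoring="f1", return_train_score=True)
f_train = cv["train_score"]  # FTrain per fold
f_test = cv["test_score"]    # FTest per fold

print(f"FTrain: mean={f_train.mean():.3f} var={f_train.var():.3f}")
print(f"FTest:  mean={f_test.mean():.3f} var={f_test.var():.3f}")
```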

1. There is no significant difference between your FTrain and FTest, and they are sufficiently high for the business requirement. You have a Ground Truth problem.
2. There is no significant difference between your FTrain and FTest, and they are much below the business requirement. You have a Bias problem.
3. Your FTrain is high enough for the business requirement, but your FTest is significantly lower than your FTrain. You have a Variance problem.
4. Your FTrain is too low for the business requirement, and your FTest is significantly lower still. You have both a Bias and a Variance problem.
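The four-way diagnosis can be sketched as a small function. The threshold `gap` for a "significant" train/test difference is an assumption for illustration; in practice you would use a statistical test over the fold scores:

```python
# Map mean FTrain, mean FTest, and the business target to one of the
# four cases described above. `gap` is an illustrative threshold.
def diagnose(f_train, f_test, target, gap=0.05):
    significant_gap = (f_train - f_test) > gap
    train_ok = f_train >= target
    if not significant_gap and train_ok:
        return "ground truth problem"      # case 1
    if not significant_gap and not train_ok:
        return "bias problem"              # case 2
    if significant_gap and train_ok:
        return "variance problem"          # case 3
    return "bias and variance problem"     # case 4

print(diagnose(0.95, 0.93, target=0.90))  # ground truth problem
print(diagnose(0.70, 0.68, target=0.90))  # bias problem
print(diagnose(0.95, 0.75, target=0.90))  # variance problem
print(diagnose(0.80, 0.60, target=0.90))  # bias and variance problem
```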

Steps to fix a variance problem (in recommended order): 

1. Use Regularization. 

2. Revisit feature representation. 

3. Revisit feature normalization. 

4. Do feature selection. 

5. Do feature extraction. 

6. Do homogeneous model ensembles. 

7. Analyze the convergence of the underlying optimization problem. 

8. Add more training data. 
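Step 1 (regularization) can be sketched with scikit-learn's LogisticRegression, whose `C` parameter is the inverse regularization strength; the wide synthetic dataset below is an assumption for illustration. Stronger regularization (smaller `C`) typically shrinks the train/test F1 gap:

```python
# Sweep the inverse regularization strength C on a small, wide synthetic
# dataset and report the train/test F1 gap for each setting.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = make_classification(n_samples=300, n_features=50, n_informative=5,
                           random_state=0)

gaps = {}
for C in (100.0, 1.0, 0.01):
    cv = cross_validate(LogisticRegression(C=C, max_iter=2000), X, y,
                        cv=10, scoring="f1", return_train_score=True)
    gaps[C] = cv["train_score"].mean() - cv["test_score"].mean()
    print(f"C={C:>6}: train/test F1 gap = {gaps[C]:.3f}")
```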

 

Steps to fix a bias problem (in recommended order): 

1. Add more features. 

2. Explore a richer hypothesis space. 

3. Explore an alternative hypothesis space. 

4. Revisit feature normalization. 

5. Revisit feature representation. 

6. Do heterogeneous model ensembles. 

7. Analyze the convergence of the underlying optimization problem. 

8. Check, measure and fix the label noise. 

9. Add more training data. 
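Step 2 (a richer hypothesis space) can be sketched by giving a linear classifier polynomial features; `make_moons` below is a stand-in dataset, chosen only because its true decision boundary is non-linear:

```python
# A linear model underfits a non-linear boundary (bias); adding
# polynomial features enriches the hypothesis space without
# changing the underlying classifier.
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X, y = make_moons(n_samples=500, noise=0.2, random_state=0)

plain = LogisticRegression(max_iter=1000)
richer = make_pipeline(PolynomialFeatures(degree=3),
                       LogisticRegression(max_iter=1000))

f1_plain = cross_val_score(plain, X, y, cv=10, scoring="f1").mean()
f1_richer = cross_val_score(richer, X, y, cv=10, scoring="f1").mean()
print(f"linear F1={f1_plain:.3f}, polynomial F1={f1_richer:.3f}")
```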

 

Overall Guidelines: 

1. Make sure you don’t have a ground truth problem. 

2. Fix the variance problem before the bias problem. 

3. Don’t ship out a model with a variance problem. 

4. Maintain a research notebook that records how you went about improving the model. A short description and a git SHA for each significant improvement, allowing your colleagues to reproduce your results, are recommended. 
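One way to keep such notebook entries reproducible is to record the current git SHA alongside each description; the file name and entry format below are illustrative assumptions, not a standard:

```python
# Append one reproducible entry (timestamp, git SHA, description,
# scores) per experiment to a JSON-lines notebook file.
import json
import subprocess
from datetime import datetime, timezone

def log_experiment(description, f_train, f_test,
                   path="research_notebook.jsonl"):
    try:
        sha = subprocess.run(["git", "rev-parse", "HEAD"],
                             capture_output=True, text=True).stdout.strip()
    except FileNotFoundError:
        sha = ""
    entry = {"when": datetime.now(timezone.utc).isoformat(),
             "sha": sha or "unknown", "description": description,
             "f_train": f_train, "f_test": f_test}
    with open(path, "a") as f:
        f.write(json.dumps(entry) + "\n")
    return entry

entry = log_experiment("added L2 regularization, C=0.1", 0.91, 0.88)
print(entry["description"])
```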

 

 

Nikhil Ketkar