Systematic Improvement of Binary Classification Models Using Bias-Variance Analysis
Binary classification (building a model that distinguishes between two classes) is pretty much the hammer of machine learning. A lot of real-world problems can be framed as binary classification problems. Let's say you have data points labelled as belonging to one of two classes (canonically referred to as the positive and negative classes). You also have a model for categorization which business users say is not good enough. It's your job to improve the model. Where do you start? The subject of this post is the systematic improvement of a binary classification model using bias-variance analysis.
The first step is to set up a harness for cross-validation (10-fold is recommended) to determine whether you have more of a bias (underfitting) problem or more of a variance (overfitting) problem, using a class-distribution-invariant metric (the F1 score is recommended).
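As a refresher, the F1 score is the harmonic mean of precision and recall. A minimal stdlib-only sketch (the function name is illustrative; in practice you would use a library implementation):

```python
def f1_score(y_true, y_pred):
    """F1 = harmonic mean of precision and recall, for labels 1/0."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```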
So, let's say you do a 10-fold cross-validation and determine the mean (and the variance) of the training F1 score (denoted by FTrain) and the test F1 score (denoted by FTest) over the folds. There are only four possibilities.
1. There is no significant difference between your FTrain and FTest and they are sufficiently high for the business requirement. You are having a Ground Truth problem.
2. There is no significant difference between your FTrain and FTest and they are much below the business requirement. You are having a Bias Problem.
3. Your FTrain is high enough for the business requirement but your FTest is significantly lower than your FTrain. You are having a Variance Problem.
4. Your FTrain is too low for the business requirement and your FTest is significantly lower. You are having both a Bias as well as a Variance problem.
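The four cases above can be sketched as a simple decision rule. This is only an illustration: the fixed `tolerance` stands in for a proper significance test over the fold scores, and all names are hypothetical.

```python
def diagnose(f_train, f_test, requirement, tolerance=0.05):
    """Map mean training/test F1 scores to one of the four cases.

    `tolerance` is a crude placeholder for a significance test on
    the difference between FTrain and FTest across the folds.
    """
    gap = f_train - f_test
    if gap <= tolerance:  # no significant train/test difference
        if f_test >= requirement:
            return "ground truth problem"
        return "bias problem"
    # significant gap between training and test scores
    if f_train >= requirement:
        return "variance problem"
    return "bias and variance problem"
```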
Steps to fix a variance problem (in recommended order):
1. Use Regularization.
2. Revisit feature representation.
3. Revisit feature normalization.
4. Do feature selection.
5. Do feature extraction.
6. Do homogeneous model ensembles.
7. Analyze the convergence of the underlying optimization problem.
8. Add more training data.
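Step 1 (regularization) usually comes down to choosing a penalty strength by validation. A sketch under stated assumptions: `train_and_eval` is a hypothetical callback that trains a regularized model at a given strength and returns its cross-validated F1.

```python
def pick_regularization(train_and_eval,
                        strengths=(0.001, 0.01, 0.1, 1.0, 10.0)):
    """Sweep candidate regularization strengths and keep the one
    that maximizes validation F1 (higher penalty -> less variance,
    but too much penalty reintroduces bias)."""
    scores = {lam: train_and_eval(lam) for lam in strengths}
    best = max(scores, key=scores.get)
    return best, scores[best]
```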
Steps to fix a bias problem (in recommended order):
1. Add more features.
2. Explore a richer hypothesis space.
3. Explore an alternative hypothesis space.
4. Revisit feature normalization.
5. Revisit feature representation.
6. Do heterogeneous model ensembles.
7. Analyze the convergence of the underlying optimization problem.
8. Check for, measure, and fix label noise.
9. Add more training data.
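Step 6 (heterogeneous ensembles) combines models drawn from different hypothesis spaces, e.g. a tree, a linear model, and a nearest-neighbour classifier. A minimal majority-vote sketch, where each model is any hypothetical callable returning 0 or 1:

```python
def majority_vote(models, x):
    """Heterogeneous ensemble: dissimilar models vote on the label.
    Returns the majority class (ties go to the negative class)."""
    votes = sum(m(x) for m in models)
    return 1 if votes > len(models) / 2 else 0
```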
Overall Guidelines:
1. Make sure you don’t have a ground truth problem.
2. Fix the variance problem before the bias problem.
3. Don’t ship out a model with a variance problem.
4. Maintain a research notebook that records how you went about improving the model. Recording a short description and a git SHA for each significant improvement allows your colleagues to reproduce your results.
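A notebook entry can be as simple as one tab-separated line per significant improvement. A sketch with illustrative names; `sha` would come from something like `git rev-parse HEAD`:

```python
def record_entry(notebook_path, sha, description, mean_f1):
    """Append one experiment record: git SHA, mean test F1, description."""
    line = f"{sha}\t{mean_f1:.4f}\t{description}\n"
    with open(notebook_path, "a") as fh:
        fh.write(line)
```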