Bias-Variance Analysis


Upload: nikhil-ketkar

Post on 13-Jul-2016




Systematic Improvement of Binary Classification Models using Bias-Variance Analysis

 

Binary classification (building a model that distinguishes between two classes) is pretty much the hammer of machine learning: a lot of real-world problems can be framed as binary classification problems. Let's say you have data points categorized into two classes (canonically referred to as the positive and negative classes), and a model for categorization which business users say is not good enough. It's your job to improve the model. Where do you start? The subject of this post is the systematic improvement of a binary classification model using bias-variance analysis.

 

The first step is to set up a harness for cross-validation (10-fold is recommended) to determine whether you have more of a bias (underfitting) problem or more of a variance (overfitting) problem, using a metric that is invariant to the class distribution (F1 score is recommended). So, let's say you do a 10-fold cross-validation and determine the mean (and the variance) of the training F1 score (denoted by FTrain) and the test F1 score (denoted by FTest) over the folds. There are only four possibilities.
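Such a harness can be sketched with scikit-learn; the dataset and model below are placeholders for illustration, not a recommendation:

```python
# Minimal cross-validation harness: mean and variance of the
# training and test F1 scores over 10 folds.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
model = LogisticRegression(max_iter=1000)

cv = cross_validate(model, X, y, cv=10, scoring="f1", return_train_score=True)
f_train = cv["train_score"]  # FTrain per fold
f_test = cv["test_score"]    # FTest per fold

print(f"FTrain: mean={f_train.mean():.3f} var={f_train.var():.3f}")
print(f"FTest:  mean={f_test.mean():.3f} var={f_test.var():.3f}")
```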

1. There is no significant difference between your FTrain and FTest, and they are sufficiently high for the business requirement. You have a Ground Truth problem.
2. There is no significant difference between your FTrain and FTest, and they are much below the business requirement. You have a Bias problem.
3. Your FTrain is high enough for the business requirement, but your FTest is significantly lower than your FTrain. You have a Variance problem.
4. Your FTrain is too low for the business requirement, and your FTest is significantly lower still. You have both a Bias and a Variance problem.
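The four-way diagnosis can be sketched as a small function. The threshold `gap` for a "significant" train/test difference is an assumption for illustration; in practice you would use a statistical test over the fold scores:

```python
# Map mean FTrain, mean FTest, and the business target to one of the
# four cases described above. `gap` is an illustrative threshold.
def diagnose(f_train, f_test, target, gap=0.05):
    significant_gap = (f_train - f_test) > gap
    train_ok = f_train >= target
    if not significant_gap and train_ok:
        return "ground truth problem"      # case 1
    if not significant_gap and not train_ok:
        return "bias problem"              # case 2
    if significant_gap and train_ok:
        return "variance problem"          # case 3
    return "bias and variance problem"     # case 4

print(diagnose(0.95, 0.93, target=0.90))  # ground truth problem
print(diagnose(0.70, 0.68, target=0.90))  # bias problem
print(diagnose(0.95, 0.75, target=0.90))  # variance problem
print(diagnose(0.80, 0.60, target=0.90))  # bias and variance problem
```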

Steps to fix a variance problem (in recommended order): 

1. Use Regularization. 

2. Revisit feature representation. 

3. Revisit feature normalization. 

4. Do feature selection. 

5. Do feature extraction. 

6. Do homogeneous model ensembles. 

7. Analyze the convergence of the underlying optimization problem. 

8. Add more training data. 
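Step 1 (regularization) can be sketched with scikit-learn's LogisticRegression, whose `C` parameter is the inverse regularization strength; the wide synthetic dataset below is an assumption for illustration. Stronger regularization (smaller `C`) typically shrinks the train/test F1 gap:

```python
# Sweep the inverse regularization strength C on a small, wide synthetic
# dataset and report the train/test F1 gap for each setting.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = make_classification(n_samples=300, n_features=50, n_informative=5,
                           random_state=0)

gaps = {}
for C in (100.0, 1.0, 0.01):
    cv = cross_validate(LogisticRegression(C=C, max_iter=2000), X, y,
                        cv=10, scoring="f1", return_train_score=True)
    gaps[C] = cv["train_score"].mean() - cv["test_score"].mean()
    print(f"C={C:>6}: train/test F1 gap = {gaps[C]:.3f}")
```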

 

Steps to fix a bias problem (in recommended order): 

1. Add more features. 

2. Explore a richer hypothesis space. 

3. Explore an alternative hypothesis space. 

4. Revisit feature normalization. 

5. Revisit feature representation. 

6. Do heterogeneous model ensembles. 

7. Analyze the convergence of the underlying optimization problem. 

8. Check, measure and fix the label noise. 

9. Add more training data. 
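Step 2 (a richer hypothesis space) can be sketched by giving a linear classifier polynomial features; `make_moons` below is a stand-in dataset, chosen only because its true decision boundary is non-linear:

```python
# A linear model underfits a non-linear boundary (bias); adding
# polynomial features enriches the hypothesis space without
# changing the underlying classifier.
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X, y = make_moons(n_samples=500, noise=0.2, random_state=0)

plain = LogisticRegression(max_iter=1000)
richer = make_pipeline(PolynomialFeatures(degree=3),
                       LogisticRegression(max_iter=1000))

f1_plain = cross_val_score(plain, X, y, cv=10, scoring="f1").mean()
f1_richer = cross_val_score(richer, X, y, cv=10, scoring="f1").mean()
print(f"linear F1={f1_plain:.3f}, polynomial F1={f1_richer:.3f}")
```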

 

Overall Guidelines: 

1. Make sure you don’t have a ground truth problem. 

2. Fix the variance problem before the bias problem. 

3. Don’t ship out a model with a variance problem. 

4. Maintain a research notebook that records how you went about improving the model. A short description and a git SHA for each significant improvement, allowing your colleagues to reproduce your results, are recommended. 
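One way to keep such notebook entries reproducible is to record the current git SHA alongside each description; the file name and entry format below are illustrative assumptions, not a standard:

```python
# Append one reproducible entry (timestamp, git SHA, description,
# scores) per experiment to a JSON-lines notebook file.
import json
import subprocess
from datetime import datetime, timezone

def log_experiment(description, f_train, f_test,
                   path="research_notebook.jsonl"):
    try:
        sha = subprocess.run(["git", "rev-parse", "HEAD"],
                             capture_output=True, text=True).stdout.strip()
    except FileNotFoundError:
        sha = ""
    entry = {"when": datetime.now(timezone.utc).isoformat(),
             "sha": sha or "unknown", "description": description,
             "f_train": f_train, "f_test": f_test}
    with open(path, "a") as f:
        f.write(json.dumps(entry) + "\n")
    return entry

entry = log_experiment("added L2 regularization, C=0.1", 0.91, 0.88)
print(entry["description"])
```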

 

 

Nikhil Ketkar