Comparison of Data Mining Algorithms on Bioinformatics Dataset
Melissa K. Carroll
Advisor: Sung-Hyuk Cha
March 4, 2003


Page 1: Comparison of Data Mining Algorithms on Bioinformatics Dataset

Comparison of Data Mining Algorithms on Bioinformatics Dataset

Melissa K. Carroll

Advisor: Sung-Hyuk Cha

March 4, 2003

Page 2: Comparison of Data Mining Algorithms on Bioinformatics Dataset

Overview

• Began as an independent study project completed with Dr. Cha in Spring 2002

• Initial goal: Compare data mining algorithms on a public bioinformatics dataset

• Later: evaluate stacked generalization approach

• Organization of presentation

– Introduction to task

– Base models and performance

– “Stacked” models and performance

– Conclusion and Future Work

Page 3: Comparison of Data Mining Algorithms on Bioinformatics Dataset

Introduction: Data Mining

• Application of machine learning algorithms to large databases

• Often used to generate models to classify future data based on “training” dataset of known classifications

• If data is organized well, domain knowledge is not necessary for the data mining practitioner

Page 4: Comparison of Data Mining Algorithms on Bioinformatics Dataset

Introduction: Bioinformatics and Protein Localization

• Bioinformatics: the use of computational methods, e.g. data mining, to provide insights into molecular biology

• Have large databases of information about genes; want to figure out the function of their encoded proteins

• Proteins are expressed in a specific tissue, cell type, or subcellular component (localization)

• Knowledge of protein localization can shed light on protein’s function

Page 5: Comparison of Data Mining Algorithms on Bioinformatics Dataset

Introduction

Page 6: Comparison of Data Mining Algorithms on Bioinformatics Dataset

Introduction: KDD Cup Dataset

• KDD Cup: Annual data mining competition sponsored by ACM SIGKDD

• A training set with the target variable supplied, and a test set with the target variable withheld

• Participants submit predictions for test set’s target variable

• Submissions with the highest accuracy rate (correct predictions/total instances in test set) win

• Test set’s target variable is publicly available once competition is over
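The winning criterion above is a simple accuracy rate; as a minimal sketch (hypothetical labels):

```python
def accuracy(preds, truth):
    """KDD Cup score: correct predictions / total instances in the test set."""
    return sum(p == t for p, t in zip(preds, truth)) / len(truth)

print(accuracy(["nuc", "cyt", "mit"], ["nuc", "cyt", "nuc"]))  # -> 0.6666666666666666
```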

Page 7: Comparison of Data Mining Algorithms on Bioinformatics Dataset

Introduction: KDD Cup Dataset Continued

• 2001 competition focused on bioinformatics, including a protein localization task

• Dataset consisted of various information about anonymized genes of a particular organism, including class, phenotype, chromosome, whether essential, and other genes with which each interacts

• Purpose of this project: compare data mining algorithms on the KDD Cup 2001 protein localization dataset

Page 8: Comparison of Data Mining Algorithms on Bioinformatics Dataset

Methods

• Simplify dataset: reduce the number of variables to facilitate working with a commercial data mining package (SAS Enterprise Miner)

• Decided to eliminate variables pertaining to interactions between genes
– there were more of these variables than other types
– a sophisticated relational algorithm would have been necessary to take full advantage of them

• Correspondingly, decreased the number of target values

Page 9: Comparison of Data Mining Algorithms on Bioinformatics Dataset

Frequency of Classes in KDD Cup Training Set

Localization        Frequency  Percent  Cum. Frequency  Cum. Percent
nucleus                   366    42.46             366         42.46
cytoplasm                 192    22.27             558         64.73
mitochondria               69     8.00             627         72.74
cytoskeleton               58     6.73             685         79.47
er                         43     4.99             728         84.45
plasma membrane            43     4.99             771         89.44
golgi                      35     4.06             806         93.50
vacuole                    18     2.09             824         95.59
transport vesicles         17     1.97             841         97.56
peroxisome                 10     1.16             851         98.72
endosome                    4     0.46             855         99.19
integral membrane           3     0.35             858         99.54
extracellular               2     0.23             860         99.77
cell wall                   1     0.12             861         99.88
lipid particles             1     0.12             862        100.00

Page 10: Comparison of Data Mining Algorithms on Bioinformatics Dataset

Frequency of Classes in KDD Cup Test Set

Localization        Frequency  Percent  Cum. Frequency  Cum. Percent
nucleus                   174    45.55             174         45.55
cytoplasm                  66    17.28             240         62.83
mitochondria               35     9.16             275         71.99
cytoskeleton               30     7.85             305         79.84
er                         25     6.54             330         86.39
plasma membrane            18     4.71             348         91.10
golgi                      12     3.14             360         94.24
transport vesicles          9     2.36             369         96.60
vacuole                     8     2.09             377         98.69
peroxisome                  3     0.79             380         99.48
extracellular               2     0.52             382        100.00

Page 11: Comparison of Data Mining Algorithms on Bioinformatics Dataset

Methods Continued

• Created subsets by selecting only instances whose target was among nucleus, cytoplasm, and mitochondria, and only non-relational variables

• Divided training subset into two random subsets of 314 and 313 instances (training and validation)

• Two actual training datasets were created from this training set:
– non-sampled raw data (314 instances)
– sampled dataset in which each target value appeared in equal amounts, with a frequency variable added
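The second training dataset above (equal class representation plus a frequency variable) might be built as in this sketch, using hypothetical gene data; the project itself used SAS Enterprise Miner's sampling facilities:

```python
import random
from collections import Counter

random.seed(0)
# Hypothetical raw training data: (gene id, localization class)
rows = [("g%d" % i, c) for i, c in enumerate(
    ["nucleus"] * 150 + ["cytoplasm"] * 90 + ["mitochondria"] * 74)]

by_class = {}
for gene, c in rows:
    by_class.setdefault(c, []).append((gene, c))

n = min(len(v) for v in by_class.values())   # equal amounts per class
freq = Counter(c for _, c in rows)
sampled = []
for c, members in by_class.items():
    for gene, _ in random.sample(members, n):
        # The frequency variable records the class's share in the raw
        # data, so downstream models can recover the original priors
        sampled.append((gene, c, freq[c] / len(rows)))

print(Counter(c for _, c, _ in sampled))     # each class appears n times
```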

Page 12: Comparison of Data Mining Algorithms on Bioinformatics Dataset

Methods Continued

• A variable was excluded as an input if:
– more than 50% of its data was missing (none excluded)
– effectively unary (274 variables excluded)
– in a hierarchy and not the most detailed level (none excluded)

• Resulting training sets: 171 variables (170 binary, 1 non-binary categorical)

• No missing values in any variables
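The first two screening rules above can be sketched with pandas (the `screen_inputs` helper and demo frame are hypothetical; the hierarchy rule needs domain metadata and is omitted):

```python
import pandas as pd

def screen_inputs(df, max_missing=0.5):
    """Drop candidate input columns that are mostly missing or
    effectively unary, mirroring the screening rules above."""
    keep = []
    for col in df.columns:
        s = df[col]
        if s.isna().mean() > max_missing:   # >50% of data missing
            continue
        if s.dropna().nunique() <= 1:       # effectively unary
            continue
        keep.append(col)
    return df[keep]

demo = pd.DataFrame({
    "mostly_missing": [None, None, None, 1],  # 75% missing -> dropped
    "unary":          [0, 0, 0, 0],           # one value   -> dropped
    "binary":         [0, 1, 0, 1],           # kept
})
print(list(screen_inputs(demo).columns))      # -> ['binary']
```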

Page 13: Comparison of Data Mining Algorithms on Bioinformatics Dataset

Base Models

Page 14: Comparison of Data Mining Algorithms on Bioinformatics Dataset

Models

• Artificial Neural Network
– Fully connected feedforward network
– One input node for each dummy variable from the 171 inputs
– 1 hidden node and 2 output nodes: dummy values for nucleus and mitochondria
– 191 randomly initialized weights
– Trained using dual quasi-Newton optimization to minimize the misclassification rate on the training set
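A roughly comparable network can be sketched in scikit-learn (an assumption: the project used SAS Enterprise Miner; sklearn's `lbfgs` is a quasi-Newton solver but minimizes cross-entropy, not the misclassification rate, and it uses one output per class rather than two dummy outputs):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(100, 171)).astype(float)  # 171 binary dummies (toy data)
y = rng.choice(["nucleus", "mitochondria", "cytoplasm"], size=100)

# One hidden node, as on the slide; 'lbfgs' is a quasi-Newton solver
net = MLPClassifier(hidden_layer_sizes=(1,), solver="lbfgs",
                    max_iter=500, random_state=0)
net.fit(X, y)
print(net.predict(X[:5]).shape)  # -> (5,)
```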

Page 15: Comparison of Data Mining Algorithms on Bioinformatics Dataset

Models Continued

• Decision Tree

– Used a CHAID-like algorithm with a chi-squared p-value splitting criterion of 0.2 and model selection based on the proportion of instances correctly classified

• Hybrid ANN/Tree

– Difficult for the ANN to learn with so many variables

– Used the decision tree as a feature selector to determine which variables to use in training the ANN
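The tree-as-feature-selector hybrid can be sketched as follows (assuming scikit-learn and toy data; the original models were built in SAS Enterprise Miner):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(200, 171)).astype(float)
y = (X[:, 3] + X[:, 40] > 1).astype(int)  # toy target driven by 2 variables

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
selected = np.flatnonzero(tree.feature_importances_ > 0)  # variables the tree split on

# Train the ANN only on the tree-selected variables
net = MLPClassifier(hidden_layer_sizes=(1,), solver="lbfgs",
                    max_iter=500, random_state=0).fit(X[:, selected], y)
print(len(selected) < X.shape[1])  # the tree prunes the input space
```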

Page 16: Comparison of Data Mining Algorithms on Bioinformatics Dataset

Models Continued

• Nearest Neighbor

– Simple Nearest Neighbor algorithm: assigned each instance in dataset to be predicted to class of instance in training set which matched on the greatest number of variables

– Match defined as having the exact same value

– In case of ties, the class that occurred most frequently in the raw training set (from among the tied classes) was used, even when applying to the equally distributed training set
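The matching rule above amounts to nearest neighbor under Hamming similarity; a minimal pure-Python sketch (hypothetical data, with the tie deliberately triggered to show the frequency-based tie-break):

```python
from collections import Counter

def predict(instance, train, train_labels, raw_labels):
    """Assign the class of the training instance matching on the most
    variables; break ties by class frequency in the raw training set."""
    matches = [sum(a == b for a, b in zip(instance, row)) for row in train]
    best = max(matches)
    tied = {train_labels[i] for i, m in enumerate(matches) if m == best}
    freq = Counter(raw_labels)
    return max(tied, key=lambda c: freq[c])

train = [(0, 1, 1), (1, 1, 0), (0, 0, 0)]
labels = ["nucleus", "cytoplasm", "mitochondria"]
raw = ["nucleus"] * 5 + ["cytoplasm"] * 3 + ["mitochondria"] * 2
# (0, 1, 0) matches every training row on 2 of 3 variables, so the
# tie-break picks the most frequent raw class
print(predict((0, 1, 0), train, labels, raw))  # -> nucleus
```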

Page 17: Comparison of Data Mining Algorithms on Bioinformatics Dataset

Preliminary Results

• Accuracy rates

                        ANN    Tree     NN    Hybrid
Non-Sampled Validation  65.81  72.20  73.80   71.25
            Test        64.73  65.09  70.55   71.27
Equal Dist  Validation  62.94  61.66  64.86   62.30
            Test        66.91  59.64  65.09   61.82

• Statistical comparisons
– Hybrid Tree-ANN significantly better for non-sampled than equally distributed on the test dataset (p < 0.01)
– Non-sampled Hybrid Tree-ANN not significantly better than non-sampled Tree (p ≤ 0.06) but significantly better than non-sampled ANN (p < 0.05)

Page 18: Comparison of Data Mining Algorithms on Bioinformatics Dataset

Reference Point for Results

• Highest accuracy rate in the actual competition: 71.1%
• Next 5 entries between 68.5% and 70.6%
• My accuracy rates are just slightly off due to a gene with two localizations
• The actual competition required predictions with many more possible values for the target variable
• However, actual competitors had more variables with which to work (the relational ones)

Page 19: Comparison of Data Mining Algorithms on Bioinformatics Dataset

“Stacked” Models

Page 20: Comparison of Data Mining Algorithms on Bioinformatics Dataset

Stacking

• Method for combining models
• Not as common as other combining methods, and there is no standard way of doing it
• Part of the training set is used to train the level-0, or base, models as usual
• A dataset is built from the predictions of the base models on the remainder of the set (the validation set in this project)
• The level-1 model is derived from this prediction dataset, rather than from training-set predictions, to prevent the level-1 model from favoring overfit base models
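The level-0/level-1 data flow described above can be sketched as follows (assuming scikit-learn and toy data; the project's actual base models were an ANN, tree, hybrid, and nearest neighbor):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
X = rng.integers(0, 2, size=(300, 20)).astype(float)
y = (X[:, 0] + X[:, 1] > 1).astype(int)

# Split the training data: one part for level-0 models, one for level-1
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.5, random_state=0)

# Level-0 (base) models trained as usual
base = [DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_tr, y_tr),
        LogisticRegression().fit(X_tr, y_tr)]

# Level-1 training data = base-model predictions on the held-out half,
# so the level-1 model cannot simply favor an overfit base model
meta_X = np.column_stack([m.predict(X_val) for m in base])
meta = LogisticRegression().fit(meta_X, y_val)

def stacked_predict(X_new):
    return meta.predict(np.column_stack([m.predict(X_new) for m in base]))

print(stacked_predict(X[:3]).shape)  # -> (3,)
```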

Page 21: Comparison of Data Mining Algorithms on Bioinformatics Dataset

Methods Continued

• Level-1 ANN
– Same as level-0 ANN (used Levenberg-Marquardt optimization because there are fewer weights)

• Level-1 Decision Tree
– Same as level-0 tree

• Level-1 Naïve Bayes
– Calculated likelihood of each target value based on Bayes' rule applied to the level-0 predictions

– Predicted the value with the highest likelihood
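The level-1 naive Bayes step can be sketched in plain Python (hypothetical predictions; simple counts with add-one smoothing stand in for the fitted likelihoods):

```python
from collections import Counter, defaultdict

def fit_nb(level0_preds, truths):
    """Estimate P(class) and P(model m predicts p | class) from the
    level-0 prediction dataset."""
    prior = Counter(truths)
    cond = defaultdict(Counter)  # (model index, class) -> prediction counts
    for preds, t in zip(level0_preds, truths):
        for m, p in enumerate(preds):
            cond[(m, t)][p] += 1
    return prior, cond

def predict_nb(preds, prior, cond):
    def score(c):
        s = prior[c] / sum(prior.values())
        for m, p in enumerate(preds):
            # add-one smoothing over the class vocabulary
            s *= (cond[(m, c)][p] + 1) / (prior[c] + len(prior))
        return s
    return max(prior, key=score)  # value with the highest likelihood

# Rows: (ANN prediction, Tree prediction); truths: actual localization
rows = [("nuc", "nuc"), ("nuc", "cyt"), ("mit", "mit"), ("nuc", "nuc")]
truths = ["nuc", "cyt", "mit", "nuc"]
prior, cond = fit_nb(rows, truths)
print(predict_nb(("nuc", "nuc"), prior, cond))  # -> nuc
```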

Page 22: Comparison of Data Mining Algorithms on Bioinformatics Dataset

Results of Stacking Approach

[Diagram: the level-1 decision tree, with internal nodes splitting on the level-0 predictions (NN, Hybrid, ANN) and each node listing the distribution of true classes among the validation instances reaching it; the root covers 313 instances: nucleus 60.4% (189), mitochondria 10.9% (34), cytoplasm 28.75% (90).]

Page 23: Comparison of Data Mining Algorithms on Bioinformatics Dataset

Results of Stacking Approach Continued

• Accuracy rates

• Statistical comparisons

– For non-sampled, all level 1 models significantly better than level 0 ANN

– For equally distributed, no level 1 models significantly better than level 0 ANN

– For non-sampled, no level 1 models significantly better than level 0 NN on same dataset

                  Level-1 Bayes  Level-1 ANN  Level-1 Tree
Non-Sampled Test          72.00        71.27         71.64
Equal Dist Test           70.18        65.82         67.64

Page 24: Comparison of Data Mining Algorithms on Bioinformatics Dataset

Conclusion and Future Work

Page 25: Comparison of Data Mining Algorithms on Bioinformatics Dataset

Conclusion

• Stacked generalization produced more accurate predictors of test data than the base models overall, though not necessarily significantly so
– Consistent with intuition and other findings

• Nearest Neighbor and Hybrid Tree-ANN were more accurate than the ANN and Tree alone, though not necessarily significantly so
– May need better-trained ANN and tree

Page 26: Comparison of Data Mining Algorithms on Bioinformatics Dataset

Conclusion Continued

• The three types of level-1 models performed comparably
– Other research suggests linear models may work best for stacking, so the Bayesian model might be expected to perform best

– An Apriori-type search on the prediction dataset before Bayesian training, to reject conclusions without enough support, may improve the Bayesian model's performance

Page 27: Comparison of Data Mining Algorithms on Bioinformatics Dataset

Conclusion Continued

• The non-sampled training dataset (with the target distribution found in the raw data) produced more accurate models than the equally distributed training dataset
– Sample size may have been too small
– Could try without the weight variable, since it's likely that the prior probabilities aren't known (unless the localization of all genes for this organism is known)

Page 28: Comparison of Data Mining Algorithms on Bioinformatics Dataset

Future Work

• Use cross-validation to obtain better estimates of error, both overall and for creating the level-1 training dataset

– Dividing training into two may have resulted in too few instances and inputs

• Changing stacking approach

– Use posterior probabilities instead of predictions

– Use different or modified algorithms (more linear; add Apriori to the Bayesian model)

– Use a level-2 model on these level-1 models
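The cross-validation idea above (building the level-1 training set from out-of-fold predictions instead of a single holdout) can be sketched, assuming scikit-learn and toy data:

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
X = rng.integers(0, 2, size=(300, 20)).astype(float)
y = (X[:, 0] + X[:, 1] > 1).astype(int)

# Out-of-fold predictions: every training instance gets a level-0
# prediction from a model that never saw it, so no instances are lost
# to a separate validation split
base = [DecisionTreeClassifier(max_depth=2, random_state=0),
        LogisticRegression()]
meta_X = np.column_stack([cross_val_predict(m, X, y, cv=5) for m in base])

# The level-1 model now trains on all 300 instances
meta = LogisticRegression().fit(meta_X, y)
print(meta_X.shape)  # -> (300, 2)
```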

Page 29: Comparison of Data Mining Algorithms on Bioinformatics Dataset

Future Work Continued

• Stratify training and validation datasets to keep distribution the same as in the original training set

• Run chi-squares on all combinations of models and adjust for multiple comparisons (cross-validation is usually the preferred method)

• Try on complete KDD Cup dataset
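One plausible setup for the pairwise chi-square comparisons above, sketched in plain Python on a hypothetical 2×2 table of paired correct/incorrect counts (a Bonferroni adjustment would divide α by the number of comparisons; for paired predictions, McNemar's test on the discordant cells is the more standard choice):

```python
def chi2_2x2(a, b, c, d):
    """Pearson chi-square statistic for a 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Hypothetical counts: rows = model A correct/incorrect,
# columns = model B correct/incorrect
stat = chi2_2x2(180, 20, 40, 35)
print(stat > 3.841)  # -> True; 3.841 is the 0.05 critical value (1 df),
                     # which rises once the alpha is Bonferroni-adjusted
```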