Comparison of Data Mining Algorithms on Bioinformatics Dataset
Melissa K. Carroll
Advisor: Sung-Hyuk Cha
March 4, 2003


Page 1: Comparison of Data Mining Algorithms on Bioinformatics Dataset

Comparison of Data Mining Algorithms on Bioinformatics Dataset

Melissa K. Carroll

Advisor: Sung-Hyuk Cha

March 4, 2003

Page 2: Comparison of Data Mining Algorithms on Bioinformatics Dataset

Overview

• Began as an independent study project completed with Dr. Cha in Spring 2002

• Initial goal: Compare data mining algorithms on a public bioinformatics dataset

• Later: evaluate stacked generalization approach

• Organization of presentation

– Introduction to task

– Base models and performance

– “Stacked” models and performance

– Conclusion and Future Work

Page 3: Comparison of Data Mining Algorithms on Bioinformatics Dataset

Introduction: Data Mining

• Application of machine learning algorithms to large databases

• Often used to generate models to classify future data based on “training” dataset of known classifications

• If data is organized well, domain knowledge is not necessary for the data mining practitioner

Page 4: Comparison of Data Mining Algorithms on Bioinformatics Dataset

Introduction: Bioinformatics and Protein Localization

• Bioinformatics: the use of computational methods, e.g. data mining, to provide insights into molecular biology

• Have large databases of information about genes; want to figure out the function of their encoded proteins

• Proteins are expressed in a specific tissue, cell type, or subcellular component (localization)

• Knowledge of protein localization can shed light on protein’s function

Page 5: Comparison of Data Mining Algorithms on Bioinformatics Dataset

Introduction

Page 6: Comparison of Data Mining Algorithms on Bioinformatics Dataset

Introduction: KDD Cup Dataset

• KDD Cup: Annual data mining competition sponsored by ACM SIGKDD

• A training set with the target variable supplied, and a test set with the target variable withheld

• Participants submit predictions for test set’s target variable

• Submissions with the highest accuracy rate (correct predictions/total instances in test set) win

• Test set’s target variable is publicly available once competition is over
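The winning criterion above is a simple accuracy rate; as a minimal sketch (hypothetical labels):

```python
def accuracy(preds, truth):
    """KDD Cup score: correct predictions / total instances in the test set."""
    return sum(p == t for p, t in zip(preds, truth)) / len(truth)

print(accuracy(["nuc", "cyt", "mit"], ["nuc", "cyt", "nuc"]))  # -> 0.6666666666666666
```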

Page 7: Comparison of Data Mining Algorithms on Bioinformatics Dataset

Introduction: KDD Cup Dataset Continued

• 2001 competition focused on bioinformatics, including a protein localization task

• Dataset consisted of various information about anonymized genes of a particular organism, including class, phenotype, chromosome, whether essential, and other genes with which each interacts

• Purpose of this project: compare data mining algorithms on the KDD Cup 2001 protein localization dataset

Page 8: Comparison of Data Mining Algorithms on Bioinformatics Dataset

Methods

• Simplify dataset: reduce the number of variables to facilitate working with a commercial data mining package (SAS Enterprise Miner)

• Decided to eliminate variables pertaining to interactions between genes
– there were more of these variables than other types
– a sophisticated relational algorithm would have been necessary to take full advantage of them

• Correspondingly, decreased the number of target values

Page 9: Comparison of Data Mining Algorithms on Bioinformatics Dataset

Frequency of Classes in KDD Cup Training Set

Localization        Frequency  Percent  Cum. Frequency  Cum. Percent
nucleus                   366    42.46             366         42.46
cytoplasm                 192    22.27             558         64.73
mitochondria               69     8.00             627         72.74
cytoskeleton               58     6.73             685         79.47
er                         43     4.99             728         84.45
plasma membrane            43     4.99             771         89.44
golgi                      35     4.06             806         93.50
vacuole                    18     2.09             824         95.59
transport vesicles         17     1.97             841         97.56
peroxisome                 10     1.16             851         98.72
endosome                    4     0.46             855         99.19
integral membrane           3     0.35             858         99.54
extracellular               2     0.23             860         99.77
cell wall                   1     0.12             861         99.88
lipid particles             1     0.12             862        100.00

Page 10: Comparison of Data Mining Algorithms on Bioinformatics Dataset

Frequency of Classes in KDD Cup Test Set

Localization        Frequency  Percent  Cum. Frequency  Cum. Percent
nucleus                   174    45.55             174         45.55
cytoplasm                  66    17.28             240         62.83
mitochondria               35     9.16             275         71.99
cytoskeleton               30     7.85             305         79.84
er                         25     6.54             330         86.39
plasma membrane            18     4.71             348         91.10
golgi                      12     3.14             360         94.24
transport vesicles          9     2.36             369         96.60
vacuole                     8     2.09             377         98.69
peroxisome                  3     0.79             380         99.48
extracellular               2     0.52             382        100.00

Page 11: Comparison of Data Mining Algorithms on Bioinformatics Dataset

Methods Continued

• Created subsets by selecting only instances whose target was among nucleus, cytoplasm, and mitochondria, and only non-relational variables

• Divided training subset into two random subsets of 314 and 313 instances (training and validation)

• Two actual training datasets were created from this training set:
– non-sampled raw data (314 instances)
– sampled dataset in which each target value appeared in equal amounts, with a frequency variable added
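The second training dataset above (equal class representation plus a frequency variable) might be built as in this sketch, using hypothetical gene data; the project itself used SAS Enterprise Miner's sampling facilities:

```python
import random
from collections import Counter

random.seed(0)
# Hypothetical raw training data: (gene id, localization class)
rows = [("g%d" % i, c) for i, c in enumerate(
    ["nucleus"] * 150 + ["cytoplasm"] * 90 + ["mitochondria"] * 74)]

by_class = {}
for gene, c in rows:
    by_class.setdefault(c, []).append((gene, c))

n = min(len(v) for v in by_class.values())   # equal amounts per class
freq = Counter(c for _, c in rows)
sampled = []
for c, members in by_class.items():
    for gene, _ in random.sample(members, n):
        # The frequency variable records the class's share in the raw
        # data, so downstream models can recover the original priors
        sampled.append((gene, c, freq[c] / len(rows)))

print(Counter(c for _, c, _ in sampled))     # each class appears n times
```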

Page 12: Comparison of Data Mining Algorithms on Bioinformatics Dataset

Methods Continued

• A variable was excluded as an input if:
– more than 50% of its data was missing (none excluded)
– effectively unary (274 variables excluded)
– in a hierarchy and not the most detailed level (none excluded)

• Resulting training sets: 171 variables (170 binary, 1 non-binary categorical)

• No missing values in any variables
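The first two screening rules above can be sketched with pandas (the `screen_inputs` helper and demo frame are hypothetical; the hierarchy rule needs domain metadata and is omitted):

```python
import pandas as pd

def screen_inputs(df, max_missing=0.5):
    """Drop candidate input columns that are mostly missing or
    effectively unary, mirroring the screening rules above."""
    keep = []
    for col in df.columns:
        s = df[col]
        if s.isna().mean() > max_missing:   # >50% of data missing
            continue
        if s.dropna().nunique() <= 1:       # effectively unary
            continue
        keep.append(col)
    return df[keep]

demo = pd.DataFrame({
    "mostly_missing": [None, None, None, 1],  # 75% missing -> dropped
    "unary":          [0, 0, 0, 0],           # one value   -> dropped
    "binary":         [0, 1, 0, 1],           # kept
})
print(list(screen_inputs(demo).columns))      # -> ['binary']
```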

Page 13: Comparison of Data Mining Algorithms on Bioinformatics Dataset

Base Models

Page 14: Comparison of Data Mining Algorithms on Bioinformatics Dataset

Models

• Artificial Neural Network
– Fully connected feedforward network
– One input node for each dummy variable from the 171 inputs
– 1 hidden node and 2 output nodes: dummy values for nucleus and mitochondria
– 191 randomly initialized weights
– Trained using dual quasi-Newton optimization to minimize the misclassification rate on the training set
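A roughly comparable network can be sketched in scikit-learn (an assumption: the project used SAS Enterprise Miner; sklearn's `lbfgs` is a quasi-Newton solver but minimizes cross-entropy, not the misclassification rate, and it uses one output per class rather than two dummy outputs):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(100, 171)).astype(float)  # 171 binary dummies (toy data)
y = rng.choice(["nucleus", "mitochondria", "cytoplasm"], size=100)

# One hidden node, as on the slide; 'lbfgs' is a quasi-Newton solver
net = MLPClassifier(hidden_layer_sizes=(1,), solver="lbfgs",
                    max_iter=500, random_state=0)
net.fit(X, y)
print(net.predict(X[:5]).shape)  # -> (5,)
```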

Page 15: Comparison of Data Mining Algorithms on Bioinformatics Dataset

Models Continued

• Decision Tree

– Used a CHAID-like algorithm with a chi-squared p-value splitting criterion of 0.2 and model selection based on the proportion of instances correctly classified

• Hybrid ANN/Tree

– Difficult for the ANN to learn with so many variables

– Used the decision tree as a feature selector to determine which variables to use in training the ANN
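The tree-as-feature-selector hybrid can be sketched as follows (assuming scikit-learn and toy data; the original models were built in SAS Enterprise Miner):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(200, 171)).astype(float)
y = (X[:, 3] + X[:, 40] > 1).astype(int)  # toy target driven by 2 variables

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
selected = np.flatnonzero(tree.feature_importances_ > 0)  # variables the tree split on

# Train the ANN only on the tree-selected variables
net = MLPClassifier(hidden_layer_sizes=(1,), solver="lbfgs",
                    max_iter=500, random_state=0).fit(X[:, selected], y)
print(len(selected) < X.shape[1])  # the tree prunes the input space
```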

Page 16: Comparison of Data Mining Algorithms on Bioinformatics Dataset

Models Continued

• Nearest Neighbor

– Simple Nearest Neighbor algorithm: assigned each instance in dataset to be predicted to class of instance in training set which matched on the greatest number of variables

– Match defined as having the exact same value

– In case of ties, the class that occurred most frequently in the raw training set (from among the tied classes) was used, even when applying to the equally distributed training set
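The matching rule above amounts to nearest neighbor under Hamming similarity; a minimal pure-Python sketch (hypothetical data, with the tie deliberately triggered to show the frequency-based tie-break):

```python
from collections import Counter

def predict(instance, train, train_labels, raw_labels):
    """Assign the class of the training instance matching on the most
    variables; break ties by class frequency in the raw training set."""
    matches = [sum(a == b for a, b in zip(instance, row)) for row in train]
    best = max(matches)
    tied = {train_labels[i] for i, m in enumerate(matches) if m == best}
    freq = Counter(raw_labels)
    return max(tied, key=lambda c: freq[c])

train = [(0, 1, 1), (1, 1, 0), (0, 0, 0)]
labels = ["nucleus", "cytoplasm", "mitochondria"]
raw = ["nucleus"] * 5 + ["cytoplasm"] * 3 + ["mitochondria"] * 2
# (0, 1, 0) matches every training row on 2 of 3 variables, so the
# tie-break picks the most frequent raw class
print(predict((0, 1, 0), train, labels, raw))  # -> nucleus
```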

Page 17: Comparison of Data Mining Algorithms on Bioinformatics Dataset

Preliminary Results

• Accuracy rates

                        ANN    Tree     NN    Hybrid
Non-Sampled Validation  65.81  72.20  73.80   71.25
            Test        64.73  65.09  70.55   71.27
Equal Dist  Validation  62.94  61.66  64.86   62.30
            Test        66.91  59.64  65.09   61.82

• Statistical comparisons
– Hybrid Tree-ANN significantly better for non-sampled than equally distributed on the test dataset (p < 0.01)
– Non-sampled Hybrid Tree-ANN not significantly better than non-sampled Tree (p ≤ 0.06) but significantly better than non-sampled ANN (p < 0.05)

Page 18: Comparison of Data Mining Algorithms on Bioinformatics Dataset

Reference Point for Results

• Highest accuracy rate in the actual competition: 71.1%
• Next 5 entries between 68.5% and 70.6%
• My accuracy rates are just slightly off due to a gene with two localizations
• The actual competition required predictions with many more possible values for the target variable
• However, actual competitors had more variables with which to work (the relational ones)

Page 19: Comparison of Data Mining Algorithms on Bioinformatics Dataset

“Stacked” Models

Page 20: Comparison of Data Mining Algorithms on Bioinformatics Dataset

Stacking

• Method for combining models
• Not as common as other combining methods, and there is no standard way of doing it
• Part of the training set is used to train the level-0, or base, models as usual
• A dataset is built from the predictions of the base models on the remainder of the set (the validation set in this project)
• The level-1 model is derived from this prediction dataset, rather than from training-set predictions, to prevent the level-1 model from favoring overfit base models
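The level-0/level-1 data flow described above can be sketched as follows (assuming scikit-learn and toy data; the project's actual base models were an ANN, tree, hybrid, and nearest neighbor):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
X = rng.integers(0, 2, size=(300, 20)).astype(float)
y = (X[:, 0] + X[:, 1] > 1).astype(int)

# Split the training data: one part for level-0 models, one for level-1
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.5, random_state=0)

# Level-0 (base) models trained as usual
base = [DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_tr, y_tr),
        LogisticRegression().fit(X_tr, y_tr)]

# Level-1 training data = base-model predictions on the held-out half,
# so the level-1 model cannot simply favor an overfit base model
meta_X = np.column_stack([m.predict(X_val) for m in base])
meta = LogisticRegression().fit(meta_X, y_val)

def stacked_predict(X_new):
    return meta.predict(np.column_stack([m.predict(X_new) for m in base]))

print(stacked_predict(X[:3]).shape)  # -> (3,)
```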

Page 21: Comparison of Data Mining Algorithms on Bioinformatics Dataset

Methods Continued

• Level-1 ANN
– Same as level-0 ANN (used Levenberg-Marquardt optimization because there are fewer weights)

• Level-1 Decision Tree
– Same as level-0 tree

• Level-1 Naïve Bayes
– Calculated likelihood of each target value based on Bayes' rule applied to the level-0 predictions

– Predicted the value with the highest likelihood
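The level-1 naive Bayes step can be sketched in plain Python (hypothetical predictions; simple counts with add-one smoothing stand in for the fitted likelihoods):

```python
from collections import Counter, defaultdict

def fit_nb(level0_preds, truths):
    """Estimate P(class) and P(model m predicts p | class) from the
    level-0 prediction dataset."""
    prior = Counter(truths)
    cond = defaultdict(Counter)  # (model index, class) -> prediction counts
    for preds, t in zip(level0_preds, truths):
        for m, p in enumerate(preds):
            cond[(m, t)][p] += 1
    return prior, cond

def predict_nb(preds, prior, cond):
    def score(c):
        s = prior[c] / sum(prior.values())
        for m, p in enumerate(preds):
            # add-one smoothing over the class vocabulary
            s *= (cond[(m, c)][p] + 1) / (prior[c] + len(prior))
        return s
    return max(prior, key=score)  # value with the highest likelihood

# Rows: (ANN prediction, Tree prediction); truths: actual localization
rows = [("nuc", "nuc"), ("nuc", "cyt"), ("mit", "mit"), ("nuc", "nuc")]
truths = ["nuc", "cyt", "mit", "nuc"]
prior, cond = fit_nb(rows, truths)
print(predict_nb(("nuc", "nuc"), prior, cond))  # -> nuc
```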

Page 22: Comparison of Data Mining Algorithms on Bioinformatics Dataset

Results of Stacking Approach

[Diagram: the level-1 decision tree, with internal nodes splitting on the level-0 predictions (NN, Hybrid, ANN) and each node listing the distribution of true classes among the validation instances reaching it; the root covers 313 instances: nucleus 60.4% (189), mitochondria 10.9% (34), cytoplasm 28.75% (90).]

Page 23: Comparison of Data Mining Algorithms on Bioinformatics Dataset

Results of Stacking Approach Continued

• Accuracy rates

• Statistical comparisons

– For non-sampled, all level 1 models significantly better than level 0 ANN

– For equally distributed, no level 1 models significantly better than level 0 ANN

– For non-sampled, no level 1 models significantly better than level 0 NN on same dataset

                  Level-1 Bayes  Level-1 ANN  Level-1 Tree
Non-Sampled Test          72.00        71.27         71.64
Equal Dist Test           70.18        65.82         67.64

Page 24: Comparison of Data Mining Algorithms on Bioinformatics Dataset

Conclusion and Future Work

Page 25: Comparison of Data Mining Algorithms on Bioinformatics Dataset

Conclusion

• Stacked generalization produced more accurate predictors of test data than the base models overall, though not necessarily significantly so
– Consistent with intuition and other findings

• Nearest Neighbor and Hybrid Tree-ANN were more accurate than the ANN and Tree alone, though not necessarily significantly so
– May need better-trained ANN and tree

Page 26: Comparison of Data Mining Algorithms on Bioinformatics Dataset

Conclusion Continued

• The three types of level-1 models performed comparably
– Other research suggests linear models may work best for stacking, so the Bayesian model might be expected to perform best

– An Apriori-type search on the prediction dataset before Bayesian training, to reject conclusions without enough support, may improve the Bayesian model's performance

Page 27: Comparison of Data Mining Algorithms on Bioinformatics Dataset

Conclusion Continued

• The non-sampled training dataset (with the target distribution found in the raw data) produced more accurate models than the equally distributed training dataset
– Sample size may have been too small
– Could try without the weight variable, since it's likely that the prior probabilities aren't known (unless the localization of all genes for this organism is known)

Page 28: Comparison of Data Mining Algorithms on Bioinformatics Dataset

Future Work

• Use cross-validation to obtain better estimates of error, both overall and for creating the level-1 training dataset

– Dividing training into two may have resulted in too few instances and inputs

• Changing stacking approach

– Use posterior probabilities instead of predictions

– Use different or modified algorithms (more linear; add Apriori to the Bayesian model)

– Use a level-2 model on these level-1 models
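The cross-validation idea above (building the level-1 training set from out-of-fold predictions instead of a single holdout) can be sketched, assuming scikit-learn and toy data:

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
X = rng.integers(0, 2, size=(300, 20)).astype(float)
y = (X[:, 0] + X[:, 1] > 1).astype(int)

# Out-of-fold predictions: every training instance gets a level-0
# prediction from a model that never saw it, so no instances are lost
# to a separate validation split
base = [DecisionTreeClassifier(max_depth=2, random_state=0),
        LogisticRegression()]
meta_X = np.column_stack([cross_val_predict(m, X, y, cv=5) for m in base])

# The level-1 model now trains on all 300 instances
meta = LogisticRegression().fit(meta_X, y)
print(meta_X.shape)  # -> (300, 2)
```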

Page 29: Comparison of Data Mining Algorithms on Bioinformatics Dataset

Future Work Continued

• Stratify training and validation datasets to keep distribution the same as in the original training set

• Run chi-squares on all combinations of models and adjust for multiple comparisons (cross-validation is usually the preferred method)

• Try on complete KDD Cup dataset
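One plausible setup for the pairwise chi-square comparisons above, sketched in plain Python on a hypothetical 2×2 table of paired correct/incorrect counts (a Bonferroni adjustment would divide α by the number of comparisons; for paired predictions, McNemar's test on the discordant cells is the more standard choice):

```python
def chi2_2x2(a, b, c, d):
    """Pearson chi-square statistic for a 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Hypothetical counts: rows = model A correct/incorrect,
# columns = model B correct/incorrect
stat = chi2_2x2(180, 20, 40, 35)
print(stat > 3.841)  # -> True; 3.841 is the 0.05 critical value (1 df),
                     # which rises once the alpha is Bonferroni-adjusted
```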