A Comprehensive Evaluation of Multicategory Classification Methods for Microarray Gene Expression Cancer Diagnosis Presented by: Renikko Alleyne


TRANSCRIPT

Page 1:

A Comprehensive Evaluation of Multicategory Classification Methods for Microarray Gene Expression Cancer Diagnosis

Presented by: Renikko Alleyne

Page 2:

Outline

• Motivation

• Major Concerns

• Methods

– SVMs

– Non-SVMs

– Ensemble Classification

• Datasets

• Experimental Design

• Gene Selection

• Performance Metrics

• Overall Design

• Results

• Discussion & Limitations

• Contributions

• Conclusions

Page 3:

Why? Clinical Applications of Gene Expression Microarray Technology

• Gene Discovery
• Disease Diagnosis (e.g., Cancer, Infectious Diseases)
• Drug Discovery
• Prediction of clinical outcomes in response to treatment

Page 4:

GEMS (Gene Expression Model Selector)

• Creation of powerful and reliable cancer diagnostic models from microarray data
• Equipped with the best classifier, gene selection, and cross-validation methods
• Evaluation of major algorithms for multicategory classification, gene selection methods, ensemble classifier methods & 2 cross-validation designs
• 11 datasets spanning 74 diagnostic categories, 41 cancer types & 12 normal tissue types

Page 5:

Major Concerns

• Prior studies conducted limited experiments in terms of the number of classifiers, gene selection algorithms, number of datasets, and types of cancer involved.

• From these studies one cannot determine which classifier performs best.

• The best combinations of classification and gene selection algorithms across most array-based cancer datasets are poorly understood.

• Overfitting.

• Underfitting.

Page 6:

Goals for the Development of an Automated System that creates high-quality diagnostic models for use in clinical applications

• Investigate which classifier currently available for gene expression diagnosis performs the best across many cancer types

• How classifiers interact with existing gene selection methods in datasets with varying sample size, number of genes and cancer types

• Whether it is possible to increase diagnostic performance further using meta-learning in the form of ensemble classification

• How to parameterize the classifiers and gene selection procedures to avoid overfitting

Page 7:

Why use Support Vector Machines (SVMs)?

• Achieve superior classification performance compared to other learning algorithms

• Fairly insensitive to the curse of dimensionality

• Efficient enough to handle very large-scale classification problems in both samples and variables

Page 8:

How SVMs Work

• Objects in the input space are mapped using a set of mathematical functions (kernels).

• The mapped objects in the feature (transformed) space are linearly separable, and instead of drawing a complex curve, an optimal line (maximum-margin hyperplane) can be found to separate the two classes.
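The mapping-then-separating idea above can be sketched with scikit-learn (an assumed dependency; the XOR-style toy data are purely illustrative):

```python
# Minimal sketch: an RBF kernel implicitly maps the inputs into a feature
# space where a maximum-margin hyperplane separates classes that are not
# linearly separable in the input space.
import numpy as np
from sklearn.svm import SVC

# XOR-style data: no straight line separates the two classes in 2-D.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]] * 10, dtype=float)
y = np.array([0, 1, 1, 0] * 10)

clf = SVC(kernel="rbf", gamma=2.0, C=10.0)  # the kernel does the mapping
clf.fit(X, y)
print(clf.score(X, y))  # 1.0: perfectly separable in the feature space
```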

Page 9:

SVM Classification Methods

SVMs:

• Binary SVMs

• Multiclass SVMs: OVR, OVO, DAGSVM, WW, CS

Page 10:

Binary SVMs

• Main idea is to identify the maximum-margin hyperplane that separates training instances.

• Selects a hyperplane that maximizes the width of the gap between the two classes.

• The hyperplane is specified by support vectors.

• New instances are classified according to the side of the hyperplane on which they fall.

(Figure: support vectors and the maximum-margin hyperplane)
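As a hedged sketch of the slide above (scikit-learn assumed; the two point clouds are synthetic stand-ins for training instances):

```python
# A linear SVC fit on two separable point clouds; the fitted model exposes
# the support vectors that specify the maximum-margin hyperplane, and new
# points are classified by the side of the hyperplane they fall on.
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) - 2, rng.randn(20, 2) + 2])
y = np.array([0] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
print(clf.support_vectors_.shape[0])  # only a few points define the plane
print(clf.predict([[3.0, 3.0]]))      # side of the hyperplane -> class 1
```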

Page 11:

1. Multiclass SVMs: one-versus-rest (OVR)

• Simplest MC-SVM

• Construct k binary SVM classifiers:
– Each class (positive) vs. all other classes (negatives).

• Computationally expensive because there are k quadratic programming (QP) optimization problems of size n to solve.
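The k-binary-classifiers structure can be sketched with scikit-learn's one-versus-rest wrapper (an assumed dependency; the dataset is synthetic):

```python
# Sketch of one-versus-rest: k binary SVMs, each trained as "one class
# (positive) vs. the rest (negatives)".
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=90, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
ovr = OneVsRestClassifier(LinearSVC(max_iter=10000)).fit(X, y)
print(len(ovr.estimators_))  # k = 3 binary classifiers, one per class
```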

Page 12:

2. Multiclass SVMs: one-versus-one (OVO)

• Involves construction of binary SVM classifiers for all pairs of classes

• A decision function assigns an instance to a class that has the largest number of votes (Max Wins strategy)

• Computationally less expensive
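The pairwise construction and Max Wins voting can be sketched with scikit-learn's one-versus-one wrapper (an assumed dependency; the dataset is synthetic):

```python
# Sketch of one-versus-one: one binary SVM per pair of classes; the
# prediction is the class collecting the most pairwise votes (Max Wins).
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=90, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
ovo = OneVsOneClassifier(LinearSVC(max_iter=10000)).fit(X, y)
print(len(ovo.estimators_))  # k(k-1)/2 = 3 pairwise classifiers for k = 3
```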

Page 13:

3. Multiclass SVMs: DAGSVM

• Constructs a decision tree

• Each node is a binary SVM for a pair of classes

• k leaves: k classification decisions

• Non-leaf node (p, q) has two edges:
– Left edge: "not p" decision
– Right edge: "not q" decision
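A hypothetical sketch of DAGSVM evaluation (the function names and the toy pairwise rule are illustrative stand-ins for trained binary SVMs): starting from the full class list, each (p, q) node eliminates one class via its "not p" or "not q" edge until one class remains after k−1 decisions.

```python
# Toy DAGSVM decision procedure: each pairwise node removes the losing
# class from the candidate list; the last remaining class is predicted.
def dag_predict(x, classes, pairwise_decide):
    """pairwise_decide(x, p, q) returns the winner of the (p, q) node."""
    remaining = list(classes)
    while len(remaining) > 1:
        p, q = remaining[0], remaining[-1]
        winner = pairwise_decide(x, p, q)
        # Follow the loser's "not" edge: eliminate the losing class.
        remaining.remove(p if winner == q else q)
    return remaining[0]

# Illustrative pairwise rule standing in for trained binary SVMs:
decide = lambda x, p, q: p if x[p] >= x[q] else q
print(dag_predict({0: 0.2, 1: 0.9, 2: 0.5}, [0, 1, 2], decide))  # 1
```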

Page 14:

4 & 5. Multiclass SVMs: Weston & Watkins (WW) and Crammer & Singer (CS)

• Both construct a single classifier by maximizing the margin between all the classes simultaneously.

• Both require the solution of a single QP problem of size (k−1)n, but the CS MC-SVM uses fewer slack variables in the constraints of the optimization problem, making it computationally less expensive.
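As a hedged sketch, scikit-learn (an assumed dependency) exposes the Crammer & Singer single-machine formulation via `LinearSVC(multi_class="crammer_singer")`, which solves one joint optimization over all classes rather than k separate problems:

```python
# Single-machine multiclass SVM in the Crammer & Singer formulation;
# the dataset is synthetic and purely illustrative.
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=90, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
cs = LinearSVC(multi_class="crammer_singer", max_iter=10000).fit(X, y)
print(cs.coef_.shape)  # one weight vector per class from one joint QP
```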

Page 15:

Non-SVM Classification Methods

Non-SVMs

KNN NN PNN

Page 16:

K-Nearest Neighbors (KNN)

• For each case to be classified, locate the k closest members of the training dataset.

• A Euclidean distance measure is used to calculate the distance between the training dataset members and the target case.

• The weighted sum of the variable of interest is found for the k nearest neighbors.

• This procedure is repeated for the remaining target set cases.
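The steps above can be sketched in plain numpy (the toy training set is illustrative; simple majority voting stands in for the weighted sum):

```python
# Minimal KNN sketch: Euclidean distances to the training set, take the
# k closest members, and vote among their labels.
import numpy as np

def knn_predict(x, X_train, y_train, k=3):
    d = np.sqrt(((X_train - x) ** 2).sum(axis=1))  # Euclidean distances
    nearest = np.argsort(d)[:k]                    # k closest training cases
    votes = np.bincount(y_train[nearest])          # majority vote on labels
    return votes.argmax()

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(np.array([4.8, 5.1]), X_train, y_train, k=3))  # 1
```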

Page 17:

Backpropagation Neural Networks (NN) & Probabilistic Neural Networks (PNNs)

• Backpropagation Neural Networks (NNs):
– Feed-forward neural networks with signals propagated forward through the layers of units.
– The unit connections have weights, which are adjusted by the backpropagation learning algorithm when there is an error.

• Probabilistic Neural Networks (PNNs):
– Similar in design to NNs, except that the hidden layer is made up of a competitive layer and a pattern layer, and the unit connections do not have weights.

Page 18:

Ensemble Classification Methods

In order to improve performance, the outputs of N base classifiers (Classifier 1 … Classifier N, producing Output 1 … Output N) are combined into an ensemble.

Techniques: Majority Voting, Decision Trees, MC-SVM (OVR, OVO, DAGSVM)
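A minimal sketch of the majority-voting technique listed above (the base predictions are illustrative stand-ins for trained classifiers' outputs):

```python
# Majority voting: the most frequent prediction among the N classifier
# outputs becomes the ensemble's prediction.
from collections import Counter

def majority_vote(predictions):
    return Counter(predictions).most_common(1)[0][0]

print(majority_vote(["ALL", "AML", "ALL"]))  # ALL
```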

Page 19:

Datasets & Data Preparatory Steps

• Nine multicategory cancer diagnosis datasets

• Two binary cancer diagnosis datasets

• All datasets were produced by oligonucleotide-based technology

• The oligonucleotides or genes with absent calls in all samples were excluded from analysis to reduce any noise.

Page 20:

Datasets

Page 21:

Experimental Designs

• Two experimental designs to obtain reliable performance estimates and avoid overfitting.

• Data are split into mutually exclusive sets.

• The outer loop estimates performance by training on all splits but one (used for testing).

• The inner loop determines the best parameters of the classifier.

Page 22:

Experimental Designs

• Design I uses stratified 10-fold cross-validation in both loops, while Design II uses 10-fold cross-validation in its inner loop and leave-one-out cross-validation in its outer loop.

• Building the final diagnostic model involves:
– Finding the best parameters for the classifier using a single loop of cross-validation
– Building the classifier on all data using the previously found best parameters
– Estimating a conservative bound on the classifier's accuracy using either design
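The nested inner/outer structure can be sketched with scikit-learn (an assumed dependency; the parameter grid and synthetic dataset are illustrative):

```python
# Nested cross-validation sketch: the inner loop (GridSearchCV) picks the
# classifier parameter, the outer loop estimates performance on held-out
# splits, mirroring the two-loop designs described above.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=20, random_state=0)
inner = GridSearchCV(SVC(), {"C": [0.1, 1, 10]},
                     cv=StratifiedKFold(n_splits=10))      # inner loop
scores = cross_val_score(inner, X, y,
                         cv=StratifiedKFold(n_splits=10))  # outer loop
print(scores.mean())  # performance estimate for the tuned classifier
```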

Page 23:

Gene Selection

Gene Selection Methods:

• Ratio of genes' between-categories to within-category sum of squares (BW)

• Signal-to-noise scores (S2N), in one-versus-rest (S2N-OVR) and one-versus-one (S2N-OVO) variants

• Kruskal-Wallis non-parametric one-way ANOVA (KW)

Page 24:

Performance Metrics

• Accuracy:
– Easy to interpret
– Simplifies statistical testing
– Sensitive to prior class probabilities
– Does not describe the actual difficulty of the decision problem for unbalanced distributions

• Relative classifier information (RCI) corrects for the differences in:
– Prior probabilities of the diagnostic categories
– Number of categories

Page 25:

Overall Research Design

Stage 1: Conducted a factorial design involving datasets & classifiers w/o gene selection

Stage 2: Conducted a factorial design w/ gene selection, using the datasets for which the full gene sets yielded poor performance

2.6 million diagnostic models generated

Selection of one model for each combination of algorithm and dataset

Page 26:

Statistical Comparison Among Classifiers

To test that differences between the best method and the other methods are non-random:

• Null hypothesis (H0): classification algorithm X is as good as Y.

• Obtain the permutation distribution of the difference ∆XY by repeatedly rearranging the outcomes of X and Y at random.

• Compute the p-value of ∆XY being greater than or equal to the observed difference ∆XY over 10,000 permutations.

• If p < 0.05, reject H0: algorithm X is not as good as Y in terms of classification accuracy.

• If p > 0.05, accept H0: algorithm X is as good as Y in terms of classification accuracy.
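The permutation procedure can be sketched in plain numpy (the per-dataset accuracies below are illustrative, not the study's results):

```python
# Permutation test for the difference in mean accuracy between two
# algorithms X and Y: under H0 their outcomes are exchangeable, so we
# randomly swap them and recompute the difference.
import numpy as np

rng = np.random.RandomState(0)
acc_x = np.array([0.95, 0.92, 0.97, 0.90, 0.96])  # illustrative accuracies
acc_y = np.array([0.85, 0.80, 0.88, 0.79, 0.86])
observed = (acc_x - acc_y).mean()

count = 0
n_perm = 10000
for _ in range(n_perm):
    flip = rng.rand(len(acc_x)) < 0.5            # randomly swap X and Y
    diff = np.where(flip, acc_y - acc_x, acc_x - acc_y)
    if diff.mean() >= observed:
        count += 1
p_value = count / n_perm
print(p_value < 0.05)  # True here: reject H0 for these illustrative data
```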

Page 27:

Performance Results (Accuracies) without Gene Selection Using Design I

Page 28:

Performance Results (RCI) without Gene Selection Using Design I

Page 29:

Total Time of Classification Experiments w/o gene selection for all 11 datasets and two experimental designs

• Executed in a Matlab R13 environment on 8 dual-CPU workstations connected in a cluster.

• Fastest MC-SVMs: WW & CS
• Fastest overall algorithm: KNN

• Slowest MC-SVM: OVR
• Slowest overall algorithms: NN and PNN

Page 30:

Performance Results (Accuracies) with Gene Selection Using Design I

Applied the 4 gene selection methods to the 4 most challenging datasets

(Chart annotation: improvement by gene selection)

Page 31:

Performance Results (RCI) with Gene Selection Using Design I

Applied the 4 gene selection methods to the 4 most challenging datasets

(Chart annotation: improvement by gene selection)

Page 32:

Discussion & Limitations

• Limitations:
– Use of the two performance metrics
– Choice of KNN, PNN and NN classifiers

• Future Research:
– Improving existing gene selection procedures with selection of the optimal number of genes by cross-validation
– Applying multivariate Markov blanket and local neighborhood algorithms
– Extending comparisons with more MC-SVMs as they become available
– Updating the GEMS system to make it more user-friendly

Page 33:

Contributions of Study

• Conducted the most comprehensive systematic evaluation to date of multicategory diagnosis algorithms applied to the majority of multicategory cancer-related gene expression human datasets.

• Creation of the GEMS system, which automates the experimental procedures in the study in order to:
– Develop optimal classification models for the domain of cancer diagnosis with microarray gene expression data
– Estimate their performance in future patients

Page 34:

Conclusions

• MC-SVMs are the best family of algorithms for these types of data and medical tasks; they outperform non-SVM machine learning techniques

• Among MC-SVM methods, OVR, CS and WW are the best w.r.t. classification performance

• Gene selection can improve the performance of both MC-SVM and non-SVM methods

• Ensemble classification does not further improve the classification performance of the best MC-SVM methods