Classification of cancerous and non-cancerous tissues
Cancerous Tissue Classification (Using Microarray Gene Expression)
Meenal Goyal, Pankhuri Goyal
Background
● Decoding gene expression is an important active research area in molecular biology and bioinformatics.
● Microarray technology is used to measure gene expression levels in different cells.
● Applications:
  ○ Tissue classification (cancer vs. non-cancer)
  ○ Identifying novel targets for drug design
  ○ Extracting and analysing patterns
Problem
● Binary classification of cancerous and normal tissue.
● Investigate feature selection and classification (supervised and unsupervised) algorithms.
● Detecting cancer in its early stages improves diagnosis, prognosis, and treatment planning.
● Challenges:
  ○ High dimension of the input features
  ○ Limited number of tissue samples
Dataset
● GSE3 (renal clear cell carcinoma):
  ○ Modality: numeric
  ○ # features: 36,864 genes
  ○ # samples: 81 cancerous and 90 normal
● High dimensional feature space, not sparse.
● Cell ( i, j ) represents expression level of gene j in tissue i.
Classification Pipeline
● Input: GSE3 dataset
● Feature selection:
  1. T-test
  2. Volcano plot
  3. mRMR
  4. PCA
  5. Weighted k-means (Fisher weights)
● Model:
  ○ Supervised learning (KNN, SVM, boosting)
  ○ Unsupervised learning (k-means, hierarchical clustering)
● Output: resulting error rate and accuracy
Feature Selection Methods
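T-Test
● T-scores: computed per gene with a two-sample t-test between the two classes.
● Null hypothesis: both classes have equal means.
● P-values: probability of observing a t-score at least this extreme if the null hypothesis is true.
● Features with p-values <= 0.01 are selected.
● GSE3 data: 916 features selected.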
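The selection rule above can be sketched in a few lines. This is a minimal illustration on synthetic data, not the GSE3 matrix; the function name `select_by_ttest` and the use of Welch's unequal-variance variant are assumptions, since the slides do not say which form of the t-test was used.

```python
import numpy as np
from scipy import stats

def select_by_ttest(X, y, alpha=0.01):
    """Return indices of genes whose two-sample t-test p-value is <= alpha.

    X: (n_samples, n_genes) expression matrix; y: 0/1 class labels.
    """
    # Welch's variant (equal_var=False) is an assumption, not stated on the slides.
    t_scores, p_values = stats.ttest_ind(
        X[y == 1], X[y == 0], axis=0, equal_var=False)
    selected = np.flatnonzero(p_values <= alpha)
    return selected, t_scores, p_values

# Synthetic stand-in for the expression matrix (hypothetical data).
rng = np.random.default_rng(0)
y = np.array([1] * 40 + [0] * 40)
X = rng.normal(size=(80, 500))
X[y == 1, :10] += 3.0  # make the first 10 genes differentially expressed

selected, t_scores, p_values = select_by_ttest(X, y)
print(len(selected), "genes selected")
```

With a threshold of 0.01 and no multiple-testing correction, roughly 1% of uninformative genes are expected to pass by chance, which is worth keeping in mind when reading the selected-feature counts.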
Volcano Plot
● GSE3 dataset: p-values < 0.01, fold change = 2
● Features extracted: 492
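A volcano-plot filter combines a significance cut-off with a fold-change cut-off. The sketch below assumes the matrix holds positive, non-log intensities and that "fold change = 2" means at least a two-fold change in either direction; both are assumptions, and `volcano_select` is a hypothetical helper name.

```python
import numpy as np
from scipy import stats

def volcano_select(X, y, p_max=0.01, min_fold=2.0):
    """Pick genes passing both a p-value and a fold-change cut-off.

    Assumes X holds positive, non-log intensities (an assumption; the
    slides do not say how the data were scaled).
    """
    cancer, normal = X[y == 1], X[y == 0]
    # Test on the log scale, where expression values are closer to normal.
    _, p_values = stats.ttest_ind(np.log2(cancer), np.log2(normal), axis=0)
    log2_fc = np.log2(cancer.mean(axis=0) / normal.mean(axis=0))
    keep = (p_values < p_max) & (np.abs(log2_fc) >= np.log2(min_fold))
    return np.flatnonzero(keep), log2_fc, p_values

# Synthetic stand-in data (hypothetical, not GSE3).
rng = np.random.default_rng(1)
y = np.array([1] * 30 + [0] * 30)
X = rng.lognormal(mean=5.0, sigma=0.3, size=(60, 300))
X[y == 1, :5] *= 4.0    # up-regulated in the cancer class
X[y == 1, 5:10] /= 4.0  # down-regulated in the cancer class

selected, log2_fc, p_values = volcano_select(X, y)
print(selected)
```

Plotting `log2_fc` against `-np.log10(p_values)` gives the usual volcano shape, with the selected genes in the upper left and upper right corners.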
Minimum redundancy-maximum relevance (MRMR)
● Relevance of each gene is scored with an F-test value.
● The top 20 features are selected by F-test score.
● The remaining 130 features are extracted with the linear incremental search algorithm MRMR-FDM.
● Total features selected for GSE3 data: 150
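The two-stage selection above (seed with the best F-test scores, then grow the set incrementally) might look roughly like the sketch below. It assumes that FDM scores a candidate gene by its F-test relevance multiplied by its mean Euclidean distance to the genes already chosen; that reading, the standardisation step, and the name `mrmr_fdm` are all assumptions, so treat this as an illustration of greedy incremental search rather than the exact procedure used.

```python
import numpy as np
from sklearn.feature_selection import f_classif

def mrmr_fdm(X, y, n_seed=20, n_total=150):
    """Greedy mRMR sketch: F-test relevance times mean distance to chosen genes."""
    F, _ = f_classif(X, y)                               # relevance of each gene
    Z = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)   # standardise each gene
    selected = list(np.argsort(F)[::-1][:n_seed])        # seed with top F scores
    # Running sum of distances from every gene to the selected genes.
    dist_sum = np.zeros(X.shape[1])
    for j in selected:
        dist_sum += np.linalg.norm(Z - Z[:, [j]], axis=0)
    while len(selected) < n_total:
        score = F * dist_sum / len(selected)  # high relevance, low redundancy
        score[selected] = -np.inf             # never pick a gene twice
        best = int(np.argmax(score))
        selected.append(best)
        dist_sum += np.linalg.norm(Z - Z[:, [best]], axis=0)
    return np.array(selected)

# Synthetic stand-in data (hypothetical, not GSE3), with smaller set sizes.
rng = np.random.default_rng(2)
y = np.array([1] * 30 + [0] * 30)
X = rng.normal(size=(60, 200))
X[y == 1, :8] += 2.0
features = mrmr_fdm(X, y, n_seed=5, n_total=15)
print(features)
```

Each step costs one pass over all genes, so the search is linear in the number of features selected, which is what makes it practical for tens of thousands of genes.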
PCA
● The top 3,000 genes for GSE3, ranked by two-sample t-test, are passed to PCA.
● Features selected: 170
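A minimal sketch of the t-test pre-filter followed by PCA, using scikit-learn on synthetic data. The helper name `ttest_then_pca` is hypothetical, the ranking by absolute t-score is an assumption, and the demo uses smaller sizes than the 3,000 genes and 170 components quoted above so that it runs quickly.

```python
import numpy as np
from scipy import stats
from sklearn.decomposition import PCA

def ttest_then_pca(X, y, n_genes=3000, n_components=170):
    """Keep the n_genes with the largest |t|, then project onto principal components."""
    t_scores, _ = stats.ttest_ind(X[y == 1], X[y == 0], axis=0)
    top = np.argsort(np.abs(t_scores))[::-1][:n_genes]
    pca = PCA(n_components=n_components)
    components = pca.fit_transform(X[:, top])
    return components, pca, top

# Synthetic stand-in data (hypothetical, not GSE3).
rng = np.random.default_rng(3)
y = np.array([1] * 30 + [0] * 30)
X = rng.normal(size=(60, 400))
X[y == 1, :20] += 1.5

components, pca, top = ttest_then_pca(X, y, n_genes=100, n_components=20)
print(components.shape, pca.explained_variance_ratio_.sum())
```

PCA cannot return more components than there are samples, so with 171 tissue samples a figure such as 170 is close to the maximum available.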
Weighted-kMeans (using Fisher Weights)
● Top 10,000 features selected from two sample t-test for Fisher analysis.
● Fisher score calculated by $F(w) = \frac{(\mu_1 - \mu_2)^2}{\sigma_1^2 + \sigma_2^2}$, where $\mu_k$ and $\sigma_k^2$ are the mean and variance of feature $w$ in class $k$.
● Weighted k-means applied on the feature space using Fisher values as weights.
● Centroid from each cluster is selected as a desired feature.
● Total features for GSE3 dataset: 200
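One way to realise this step is to cluster the genes (one point per gene) with k-means, passing the Fisher scores as sample weights, and then keep one representative per cluster. The sketch below takes the gene nearest to each centroid as that representative, which is an assumption, as are the helper names; it also skips the t-test pre-filter and uses small synthetic data.

```python
import numpy as np
from sklearn.cluster import KMeans

def fisher_scores(X, y):
    """Fisher score per gene: squared mean difference over summed class variances."""
    a, b = X[y == 1], X[y == 0]
    return (a.mean(axis=0) - b.mean(axis=0)) ** 2 / (a.var(axis=0) + b.var(axis=0) + 1e-12)

def weighted_kmeans_features(X, y, n_features=200, seed=0):
    """Cluster genes with Fisher-weighted k-means; keep the gene nearest each centroid."""
    weights = fisher_scores(X, y)
    genes = X.T  # one row per gene, one column per tissue sample
    km = KMeans(n_clusters=n_features, n_init=10, random_state=seed)
    km.fit(genes, sample_weight=weights)
    chosen = []
    for c in range(n_features):
        members = np.flatnonzero(km.labels_ == c)
        if members.size == 0:
            continue  # guard against an empty cluster
        dist = np.linalg.norm(genes[members] - km.cluster_centers_[c], axis=1)
        chosen.append(int(members[np.argmin(dist)]))
    return np.array(chosen), weights

# Synthetic stand-in data (hypothetical, not GSE3).
rng = np.random.default_rng(4)
y = np.array([1] * 20 + [0] * 20)
X = rng.normal(size=(40, 120))
X[y == 1, :5] += 2.0

chosen, weights = weighted_kmeans_features(X, y, n_features=10)
print(chosen)
```

The weights pull each centroid towards the genes with high Fisher scores, so the representative of a cluster tends to be one of its more discriminative members.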
Classification Algorithms
K-nearest neighbours (k-NN)
● Test/train data divided using
  ○ Holdout -> test : train = 1 : 1
  ○ K-fold -> k = 5 (each fold holds out 1/5 of the samples for testing)
● K parameter varied from k = 1 to 10.
● Distance metric: Euclidean
KNN misclassification error rate plotted for all feature selection methods
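The k-NN evaluation described above can be sketched with scikit-learn as follows. The data are synthetic, the stratified split is an assumption, and `knn_error_rates` is a hypothetical helper; in the actual pipeline `X` would be the output of one of the feature selection methods.

```python
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

def knn_error_rates(X, y, ks=range(1, 11), seed=0):
    """Misclassification error per k for a 1:1 holdout split and for 5-fold CV."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.5, stratify=y, random_state=seed)
    results = {}
    for k in ks:
        clf = KNeighborsClassifier(n_neighbors=k, metric="euclidean")
        holdout_err = 1.0 - clf.fit(X_tr, y_tr).score(X_te, y_te)
        kfold_err = 1.0 - cross_val_score(clf, X, y, cv=5).mean()
        results[k] = (holdout_err, kfold_err)
    return results

# Synthetic stand-in for a selected feature matrix (hypothetical data).
rng = np.random.default_rng(5)
y = np.array([1] * 50 + [0] * 50)
X = rng.normal(size=(100, 20))
X[y == 1] += 2.0

results = knn_error_rates(X, y)
for k, (holdout_err, kfold_err) in results.items():
    print(k, round(holdout_err, 3), round(kfold_err, 3))
```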
Support Vector Machine (SVM)
● Test/train data divided using
  ○ Holdout -> test : train = 0.2
  ○ K-fold -> k = 5 (each fold holds out 1/5 of the samples for testing)
● Kernel functions used
  ○ Linear
  ○ Polynomial: order = 2
  ○ Radial
● C parameter varied from 0.01 to 0.3 (for the linear kernel, holdout method).
Misclassification error rate vs c-parameter for all feature selection methods. (Linear kernel)
Accuracy matrix for SVM, covering T-test, volcano plot, MRMR, PCA, and weighted k-means (Fisher weights).
Best accuracy observed in linear kernel for all cases.
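A sketch of the SVM experiments using scikit-learn's `SVC`, on synthetic data. Reading "test : train = 0.2" as a 20% held-out fraction is an assumption, as are the fixed `C=0.1` in the kernel comparison, the ten-point grid for C, and the helper names.

```python
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC

def linear_c_sweep(X, y, c_values, seed=0):
    """Holdout error of a linear SVM for each value of C (20% held out)."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed)
    errors = {}
    for c in c_values:
        clf = SVC(kernel="linear", C=c).fit(X_tr, y_tr)
        errors[float(c)] = 1.0 - clf.score(X_te, y_te)
    return errors

def kernel_cv_accuracy(X, y, C=0.1):
    """Mean 5-fold accuracy for the three kernels listed on the slide."""
    kernels = {
        "linear": SVC(kernel="linear", C=C),
        "polynomial (order 2)": SVC(kernel="poly", degree=2, C=C),
        "radial": SVC(kernel="rbf", C=C),
    }
    return {name: cross_val_score(clf, X, y, cv=5).mean()
            for name, clf in kernels.items()}

# Synthetic stand-in for a selected feature matrix (hypothetical data).
rng = np.random.default_rng(6)
y = np.array([1] * 50 + [0] * 50)
X = rng.normal(size=(100, 30))
X[y == 1] += 1.5

errors = linear_c_sweep(X, y, np.linspace(0.01, 0.3, 10))
accuracy = kernel_cv_accuracy(X, y)
print(errors)
print(accuracy)
```

With far more genes than samples, the classes are often close to linearly separable, which is one plausible reason a linear kernel does as well as or better than the non-linear ones.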
AdaBoost
● Test/train data divided using holdout with ratio 1 : 1.
● Weak learner = decision tree
● Number of weak learners used = 100
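The configuration above maps onto scikit-learn's `AdaBoostClassifier` roughly as follows. The tree depth of 1 (a decision stump), the stratified split, and the synthetic data are assumptions; the slides only state that the weak learner is a decision tree.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a selected feature matrix (hypothetical data).
rng = np.random.default_rng(7)
y = np.array([1] * 60 + [0] * 60)
X = rng.normal(size=(120, 30))
X[y == 1, :5] += 2.5

# 1:1 holdout split, as on the slide.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0)

# Depth-1 trees (stumps) are an assumption; the slides do not give the depth.
stump = DecisionTreeClassifier(max_depth=1)
clf = AdaBoostClassifier(stump, n_estimators=100, random_state=0)
clf.fit(X_tr, y_tr)

error_rate = 1.0 - clf.score(X_te, y_te)
print("holdout misclassification error:", round(error_rate, 3))
```

The weak learner is passed positionally because the keyword for it has been renamed between scikit-learn releases.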
K-Means
● Test/train set divided as
  ○ Holdout -> test : train = 1 : 1
  ○ K-fold -> k = 5 (each fold holds out 1/5 of the samples for testing)
● K parameter varied from k = 1 to 5.
Objective function
Misclassification error vs k for all feature selection methods.
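Since k-means is unsupervised, turning clusters into class predictions needs an extra step. The sketch below labels each cluster with the majority class of its training members, which is an assumption about how the misclassification error was obtained. It also reports scikit-learn's `inertia_`, the within-cluster sum of squared distances to the centroids, which is the standard k-means objective.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

def kmeans_error(X_tr, y_tr, X_te, y_te, k, seed=0):
    """Cluster the training set, label clusters by majority vote, score on the test set."""
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X_tr)
    cluster_to_class = np.zeros(k, dtype=int)
    for c in range(k):
        members = y_tr[km.labels_ == c]
        if members.size:
            cluster_to_class[c] = np.bincount(members, minlength=2).argmax()
    predictions = cluster_to_class[km.predict(X_te)]
    return float(np.mean(predictions != y_te)), float(km.inertia_)

# Synthetic stand-in for a selected feature matrix (hypothetical data).
rng = np.random.default_rng(8)
y = np.array([1] * 50 + [0] * 50)
X = rng.normal(size=(100, 10))
X[y == 1] += 4.0

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0)
errors = {}
for k in range(1, 6):
    errors[k], objective = kmeans_error(X_tr, y_tr, X_te, y_te, k)
    print(k, round(errors[k], 3), round(objective, 1))
```

With k = 1 every sample falls into one cluster, so the error equals the share of the minority class; that makes k = 1 a useful baseline on the error curve.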
Hierarchical Clustering
● Some cancer types can contain an arbitrary number of subtypes and usually it is unknown how many or what subtypes a specific cancer has.
● Green, black, and red colors in the heat maps indicate a low, medium, and high expression of the corresponding gene in the sample.
● Accuracy is lower than with the other algorithms.
Heat maps (one panel per feature selection method): T-test, volcano plot, MRMR, PCA, weighted k-means.
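A minimal sketch of hierarchical clustering with SciPy, cut into two clusters and compared against the known labels. Average linkage with Euclidean distance is an assumption, since the slides do not name the linkage method, and the data are synthetic.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def hierarchical_labels(X, n_clusters=2, method="average"):
    """Agglomerative clustering of samples, cut into at most n_clusters groups."""
    tree = linkage(X, method=method, metric="euclidean")
    return fcluster(tree, t=n_clusters, criterion="maxclust") - 1  # 0-based labels

# Synthetic stand-in for a selected feature matrix (hypothetical data).
rng = np.random.default_rng(9)
y = np.array([1] * 20 + [0] * 20)
X = rng.normal(size=(40, 20))
X[y == 1] += 3.0

labels = hierarchical_labels(X)
# Cluster ids are arbitrary, so score both possible labelings and keep the better one.
mismatch = np.mean(labels != y)
error_rate = float(min(mismatch, 1.0 - mismatch))
print("misclassification error:", error_rate)
```

Cutting the same tree at a larger `n_clusters` is a cheap way to explore possible subtypes without re-running the clustering.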
Thank you