Feature Subset Selection from Gene Expression Dataset
Shivani, M.Tech. (C.S.E.)
Guided by Dr. Dhaval Patel
Developments in the Biological Domain
• Recent developments in biotechnology have enabled high-throughput data generation from biological samples.
• Biological data can be generated at many different levels:
  1. Genomics (DNA)
  2. Functional genomics (RNA)
  3. Proteomics (proteins)
  4. Biomedical research
Data Acquisition – Microarray Technology
• A DNA microarray is a collection of microscopic DNA spots attached to a solid surface.
• Expression levels of a large number of genes can be measured simultaneously.
Data Storage – Gene Expression Matrix
[Figure: a gene expression matrix with genes as rows, samples/conditions as columns, and a row of class labels. Typical size: 20,000 genes x 300 samples.]
Challenges
Some problems related to knowledge discovery techniques when applied to high-dimensional gene expression data are:
• Large number of attributes (often redundant)
• Very few samples
• Increased complexity of classification algorithms
• Notion of distance/similarity for clustering algorithms
Feature Subset Selection
• Given a set of n features, the role of feature selection is to select a subset of size d (d < n) that leads to the smallest classification error.
• Fundamentally different from dimensionality reduction (e.g., PCA or LDA), which constructs new features rather than selecting a subset of the original ones.
Feature Subset Selection Methods
Classification of feature subset selection methods, based on the papers I read:
• Classification
  – Filter
    ** Rank based
    ** Space search based
  – Wrapper
  – Embedded
• Clustering
  – Message based
Rank Based Filter Method: Steps
iPcc – Iterative Pearson Correlation Coefficient [1]
• Input matrix (samples x genes):
       g1  g2  g3  g4  g5  g6
  S1    1   1   1   0   0   0
  S2    0   0   0   1   1   1
  S3    1   0   1   0   1   0
• Calculate the sample-by-sample Pearson correlation matrix:
        s1    s2    s3
  s1   1.0  -1.0   0.3
  s2  -1.0   1.0  -0.3
  s3   0.3  -0.3   1.0
• Iteratively employ the correlation on the correlation feature space itself; it converges to the t-order correlation feature space:
        s1    s2    s3
  s1   1.0  -1.0   1.0
  s2  -1.0   1.0  -1.0
  s3   1.0  -1.0   1.0
• The resulting feature space is used for disease class discovery and prediction (a sketch of the iteration follows).
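A minimal numpy sketch of this iteration, assuming expression profiles arrive as a samples-by-genes array; the function name ipcc and the iteration count are illustrative choices, not the authors' code:

```python
import numpy as np

def ipcc(expr, n_iter=3):
    """Sketch of the iPcc idea [1]: repeatedly replace each sample's
    profile by its vector of Pearson correlations with all samples."""
    feats = np.asarray(expr, dtype=float)
    for _ in range(n_iter):
        # np.corrcoef treats rows as variables, so this yields the
        # sample-by-sample Pearson correlation matrix.
        feats = np.corrcoef(feats)
    return feats

# Toy input from the slide: 3 samples x 6 genes.
X = np.array([[1, 1, 1, 0, 0, 0],
              [0, 0, 0, 1, 1, 1],
              [1, 0, 1, 0, 1, 0]])
print(np.round(ipcc(X, n_iter=1), 2))  # first-order correlations (0.33 ~ the 0.3 above)
print(np.round(ipcc(X, n_iter=5), 2))  # iterated correlations sharpen to +/-1
```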
Advantages of correlation features:
• Even with noisy gene expression profiles, iPcc can successfully highlight the similarities among samples.
• The accuracy of classification and clustering algorithms is improved.
• iPcc gives good results for disease class discovery on a real prostate cancer dataset with multiple classes.
Constructing Features Using Mutual Information [2]
• Uses gene interaction information for dimension reduction.
• Blaise Hanczar et al. proposed an algorithm to capture the mutual information and synergic interactions between gene pairs and the class.
Algorithm:
• For each gene Gi and the class C, the mutual information I(Gi, C) is calculated.
• The gene Gi* with the highest mutual information is paired with each remaining gene Gj, and the mutual information between the gene pair and the class, I(Gi*Gj, C), is calculated.
• The pair of genes (Gi*, Gj) maximizing this mutual information is selected, and both genes are removed from the set.
• These steps are iterated m times to obtain m gene pairs (see the sketch below).
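A hedged sketch of the pairing loop, assuming expression values have already been discretized into integer bins (the paper's estimator details differ); select_gene_pairs and the joint-variable encoding are illustrative choices:

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def select_gene_pairs(X, y, m=10):
    """Illustrative version of the pairing scheme in [2].
    X: samples x genes, integer-discretized expression; y: class labels."""
    remaining = list(range(X.shape[1]))
    pairs = []
    for _ in range(m):
        if len(remaining) < 2:
            break
        # Gene with the highest mutual information with the class.
        mi = [mutual_info_score(X[:, g], y) for g in remaining]
        g_star = remaining[int(np.argmax(mi))]
        # Pair g_star with every other gene; the joint variable of a pair
        # is encoded by combining the two discrete expression values.
        best_pair, best_mi = None, -1.0
        for g in remaining:
            if g == g_star:
                continue
            joint = X[:, g_star] * (X[:, g].max() + 1) + X[:, g]
            score = mutual_info_score(joint, y)
            if score > best_mi:
                best_pair, best_mi = (g_star, g), score
        pairs.append(best_pair)
        remaining.remove(best_pair[0])
        remaining.remove(best_pair[1])
    return pairs
```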
Feature Construction from Gene Pairs
From a pair of genes (Gi, Gj), a new feature Ai,j is constructed by the following steps:
• The set of M instances E = {e1, ..., eM} is projected onto the two-dimensional space defined by the expression of Gi and Gj.
• The value Ai,j(x) of the feature Ai,j for every point x in this space is defined as
  $A_{i,j}(x) = \frac{n_a(x)}{k} - \left(1 - \frac{n_a(x)}{k}\right) = -1 + \frac{2\,n_a(x)}{k}$
  where $n_a(x)$ is the number of samples belonging to class Ca among the k nearest neighbours of x.
• The value of the newly constructed feature lies in the range -1 to +1.
• When a sample is close to samples belonging to class Ca, the value is near +1; when it is close to class Cb, the value is near -1. Finally, we have a set of m features (a sketch follows).
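The k-NN step above translates almost directly into code; a sketch assuming binary labels (class Ca encoded as 1, Cb as 0) and the hypothetical helper name pair_feature:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def pair_feature(X2, y, k=5):
    """Compute A_{i,j}(x) = -1 + 2*n_a(x)/k for every sample x, where X2 is
    the samples x 2 matrix of expression values of the pair (Gi, Gj)."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X2)
    _, idx = nn.kneighbors(X2)
    idx = idx[:, 1:]                             # drop each point itself
    n_a = (np.asarray(y)[idx] == 1).sum(axis=1)  # neighbours from class Ca
    return -1.0 + 2.0 * n_a / k
```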
Space Search Based Filter Method
[Flowchart: from the complete set of features, Subset Generation produces a candidate set; Subset Evaluation (relevance and redundancy) updates the current best subset; if the termination condition is not met, generation repeats; otherwise the current best subset is returned as the optimal feature subset.]
Drawback: the high computational cost of the subset search makes these methods inefficient for high-dimensional data.
Feature Selection via Analysis of Relevance and Redundancy [3]
Selecting relevant features:
• Calculate the symmetrical uncertainty between each feature Fi and the class C:
  $SU(X,Y) = 2\,\frac{IG(X|Y)}{H(X) + H(Y)}$
  Here, H(X) is the entropy of variable X, defined as $H(X) = -\sum_i P(x_i)\log_2 P(x_i)$, and IG(X|Y) is the information gain, defined as $IG(X|Y) = H(X) - H(X|Y)$.
• The relevant features with correlation above the threshold are selected and stored in descending order in S'_list.
[Figure: Original set → Relevance Analysis → Relevant subset → Redundancy Analysis → Selected subset]
Removing redundant features:
• The feature with the highest correlation is chosen as the predominant feature and is used to remove every other feature for which it forms a Markov blanket.
• A feature Fj forms a Markov blanket for Fi iff $SU_{j,c} \ge SU_{i,c}$ and $SU_{i,j} \ge SU_{i,c}$.
• The algorithm stops when no more predominant features can be selected (a sketch of the full procedure follows).
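A compact sketch of the whole relevance/redundancy procedure (essentially FCBF [3]), assuming discretized features; entropy, symmetrical_uncertainty, and fcbf are names chosen here for illustration:

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def entropy(x):
    """H(X) = -sum_i P(x_i) log2 P(x_i) for a discrete variable."""
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * IG(X|Y) / (H(X) + H(Y)), scaled to [0, 1]."""
    ig = mutual_info_score(x, y) / np.log(2)   # nats -> bits
    return 2.0 * ig / (entropy(x) + entropy(y))

def fcbf(X, y, delta=0.0):
    """Keep features with SU(F, C) > delta, then let each predominant
    feature remove the features it forms a Markov blanket for."""
    su_c = np.array([symmetrical_uncertainty(X[:, j], y)
                     for j in range(X.shape[1])])
    order = [j for j in np.argsort(-su_c) if su_c[j] > delta]  # S'_list
    selected = []
    while order:
        fi = order.pop(0)        # current predominant feature
        selected.append(fi)
        # Markov blanket test: remove Fj when SU(Fi, Fj) >= SU(Fj, C).
        order = [fj for fj in order
                 if symmetrical_uncertainty(X[:, fi], X[:, fj]) < su_c[fj]]
    return selected
```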
Filter Method: Limitations
• Fails to capture the intrinsic properties of genes: interdependence between genes is ignored, and redundant features are sometimes useful.
• Ignores the performance of the classification algorithm: some features do not work well with a particular algorithm.
• Some genes only differentiate between one or two classes: for skewed sample distributions with 3 or more classes, an overall ranking gives poor results.
Unsupervised Feature Selection [4]
It requires two key steps.
1. Attribute similarity metric – maximal information coefficient (MIC)
• For a 2-dimensional set of elements, one dimension is taken as x and the other as y.
• The x-values are divided into x bins and the y-values into y bins, giving an x-by-y grid G.
• For a given grid G, the distribution obtained by restricting the set D to the cells of G is denoted D|G.
• The characteristic matrix is constructed as
  $M(D)_{x,y} = \frac{I^*(D,x,y)}{\log \min\{x,y\}}$
  where $I^*(D,x,y) = \max_G I(D|G)$ is the largest mutual information of D|G over all x-by-y grids G.
• MIC can then be calculated as
  $\mathrm{MIC}(D) = \max_{xy < B(n)} M(D)_{x,y}$
  where B(n) bounds the grid size (typically $B(n) = n^{0.6}$). A toy approximation follows.
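A rough illustration of MIC using only equal-width bins; the real estimator also optimizes where the grid boundaries fall, so treat mic_approx (a name chosen here) as a toy approximation:

```python
import numpy as np

def mic_approx(x, y):
    """Equal-width-bin approximation of MIC [4]; illustrative only."""
    n = len(x)
    b = int(n ** 0.6)                    # grid size bound B(n) = n^0.6
    best = 0.0
    for nx in range(2, b + 1):
        for ny in range(2, b + 1):
            if nx * ny > b:
                continue
            # Mutual information of the x-by-y grid via a 2-D histogram.
            pxy, _, _ = np.histogram2d(x, y, bins=(nx, ny))
            pxy /= n
            px, py = pxy.sum(axis=1), pxy.sum(axis=0)
            nz = pxy > 0
            mi = np.sum(pxy[nz] * np.log(pxy[nz] / np.outer(px, py)[nz]))
            # Characteristic matrix entry: normalize by log min(nx, ny).
            best = max(best, mi / np.log(min(nx, ny)))
    return best
```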
2. Clustering algorithm – affinity propagation clustering
It takes the similarity matrix as input and iterates the following steps.
• Calculating responsibilities given the availabilities. Initially all availabilities are set to zero; the responsibilities are updated as
  $r(i,k) \leftarrow s(i,k) - \max_{k' \neq k}\{a(i,k') + s(i,k')\}$
  r(i,k) shows how well point k serves as a centroid (exemplar) for point i.
• Updating the availabilities given the responsibilities:
  $a(i,k) \leftarrow \min\{0,\; r(k,k) + \sum_{i' \notin \{i,k\}} \max(0, r(i',k))\}$ for $i \neq k$, and $a(k,k) \leftarrow \sum_{i' \neq k} \max(0, r(i',k))$
  a(i,k) shows how suitable it is for point i to choose point k as its centroid.
• Combining the responsibilities and availabilities to generate clusters. In every iteration, the point k that maximizes $a(i,k) + r(i,k)$ is taken as the centroid for point i.
• Termination condition: the algorithm terminates when the changes in the messages fall below a previously set threshold. The final result is the list of all centroids, which forms the selected subset of features (see the sketch below).
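In practice the message-passing loop need not be hand-coded; a sketch using scikit-learn's AffinityPropagation on a precomputed gene-by-gene similarity matrix, with |Pearson correlation| standing in for MIC to keep the example self-contained:

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))          # 100 samples x 20 genes (toy data)
S = np.abs(np.corrcoef(X.T))            # gene-by-gene similarity matrix

ap = AffinityPropagation(affinity="precomputed", random_state=0).fit(S)
selected = ap.cluster_centers_indices_  # exemplar genes = selected features
print(selected)
```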
Extracting biological relevance of gene list using DAVID[5]
• Database for Annotation, Visualization and Integrated Discovery.
• Released in 2003 (Dennis et al., Genome Biol.; Hosack et al., Genome Biol.).
• Extracts biological features/meaning associated with large gene list.
What is Gene Ontology ?
• The Gene Ontology project provides controlled vocabularies of defined terms representing gene product properties of any organism.
• GO Term - Each GO term within the ontology has a term name, which may be a word or string of words; a unique alphanumeric identifier; a definition with cited sources; and a namespace indicating the domain to which it belongs.
The 3 Gene Ontologies
• Biological Process – series of molecular events
• Cellular Component – parts of a cell or its extracellular environment
• Molecular Function – gene product activity at the molecular level
P-value & enrichment calculation
• Gene list: the subset of features to be tested.
• Background list: chosen from the options provided by DAVID.
• N = all genes (the universe)
• M = all genes belonging to the pathway
• n = genes in your gene list
• m = genes of your gene list that belong to the pathway
A worked example follows.
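The enrichment p-value is the upper tail of a hypergeometric distribution (DAVID itself reports a modified Fisher's exact test, the EASE score); a sketch with made-up counts:

```python
from scipy.stats import hypergeom

N, M, n, m = 20000, 150, 300, 12   # universe, pathway, gene list, overlap
# P(X >= m) when drawing n genes without replacement from N, M of which
# belong to the pathway.
p_value = hypergeom.sf(m - 1, N, M, n)
print(f"enrichment p-value: {p_value:.3g}")
```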
Application of feature subset selection – Breast Cancer Subtyping
Breast Cancer Subtyping
• Gene expression profiling has revealed that breast cancer is highly heterogeneous.
• It is critical to classify it into different subtypes, both for research purposes and for clinical practice. The subtyping schemes are:
Tumor subtypes
• Molecular
  – Basal
  – Luminal A
  – Luminal B
  – HER2 enriched
  – Normal
• Clinical
  – Grade 1
  – Grade 2
  – Grade 3
• Metastasis
  – Metastatic
  – Non-metastatic
Challenges in Analyzing Microarray Datasets
• Microarray data is noisy and heterogeneous, so different clustering techniques tend to find different clusters for the same type of dataset.
• Gene signatures, also known as "biomarkers", do not correlate the underlying biology of the tumour with the clinical outcome; hence they do not fulfil the purpose of identifying the right drug and the right treatment strategy for the patient.
Breast Cancer Subtyping using Multiple Datasets [6]
Consensus clustering
• Aims at finding similar clusters across multiple datasets. It identifies partitions in the datasets that should comply with the following three properties:
• Dataset partitions should be supported by the same set of biomarkers.
  Partition the samples of one dataset D1 into classes c1 and c2. Using the biomarkers identified for the partitions of D1 and a nearest-neighbour classification model, partitions are generated in the other datasets.
• Classification accuracy of samples should be high across all datasets.
  10-fold cross-validation is performed on every dataset Di. The overall accuracy accD is the average of the per-dataset accuracies:
  $acc_D = \frac{1}{|D|}\sum_{i=1}^{|D|} acc(D_i)$
• The proportion of samples falling in a cluster should be similar in all datasets.
  The mean proportion of samples belonging to class Ci over all datasets is
  $\mu_i = \frac{1}{|D|}\sum_{j=1}^{|D|} p_{i,j}$
  where $p_{i,j}$ is the proportion of samples of dataset Dj assigned to class Ci. The balance B sums one such deviation-from-the-mean term per class (see [6] for the exact expression).
• Finally, combining all three required characteristics, the objective function takes the product form
  $obj = n_{markers} \cdot acc_D \cdot B$
• The objective function jointly optimizes the count of biomarkers for D1 (nmarkers), a high average classification accuracy across datasets (accD), and a similar proportion of samples falling in a cluster across the datasets (B). A toy calculation follows.
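To make the product-form objective concrete, a purely hypothetical calculation; acc_D and the mean proportion follow the formulas above, while balance is a stand-in proxy because the exact expression for B in [6] is not reproduced on the slide:

```python
import numpy as np

acc = np.array([0.91, 0.87, 0.89])    # 10-fold CV accuracy per dataset Di
acc_D = acc.mean()                    # average accuracy across datasets

p_c1 = np.array([0.55, 0.48, 0.52])   # proportion of samples in class c1
mu = p_c1.mean()                      # mean proportion across datasets
balance = 1.0 - np.abs(p_c1 - mu).mean()   # hypothetical proxy for B

n_markers = 40                        # biomarkers supporting the D1 partition
obj = n_markers * acc_D * balance     # reconstructed product-form objective
print(round(acc_D, 3), round(balance, 3), round(obj, 2))
```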
Associating Molecular Properties and Clinical Outcome [7]
• Gene signatures correlate the molecular properties of tumours with aggressiveness markers such as grade, with survival outcomes (prognostic), and with response to therapy (predictive).
• Gene signatures like PAM50 are used clinically for classifying breast cancer samples.
• But the classification is more related to molecular subtype than to prognostic or predictive outcome.
• The authors used a large cohort of breast cancer tumours from The Cancer Genome Atlas to evaluate 8 biological pathways and 3 prognostic signatures across a variety of classification scenarios.
Signature type        Signature    Number of genes
Benchmark             PAM50        50
Prognostic            MammaPrint   70
                      Wang-76      76
                      OncotypeDX   21
                      Blood        14
Biological pathways   Cell cycle   114
                      DDR          182
                      Notch        48
                      PI3K         346
                      RAS          227
                      RB           159
                      TGF-b        80
                      Wnt          139
Results
• Biological pathways that are not supposed to determine prognosis nevertheless gave good results in distinguishing survival outcome.
• Similarly, the prognostic signatures designed for estimating survival outcomes showed fine results when subtyping the samples on the basis of molecular properties.
• The biological pathways best at distinguishing survival outcome and the prognostic signatures best at distinguishing molecular subtype were combined into a 12-gene signature.
• These 12 genes gave better classification performance than the signatures used independently, and were second only to PAM50.
Conclusion
• It can be concluded that a constructive prediction model should consider anatomical, histopathological, molecular and prognostic parameters for better drug discovery as well as better clinical strategies.
Conclusion
• Filter methods are preferred for feature subset selection because they are computationally less expensive than wrapper and embedded methods and are scalable to high dimensional datasets.
• The selected set of gene signatures cannot predict breast cancer subtype accurately across heterogeneous datasets.
• They are also unable to predict the clinical outcome of a sample by linking it with its molecular properties.
• A future goal would be to generate an optimal set of features which can accurately classify a breast cancer sample into its subtype and predict its clinical outcome.
References
[1] X. Ren, Y. Wang, X. S. Zhang and Q. Jin, "iPcc: a novel feature extraction method for accurate disease class discovery and prediction", Nucleic Acids Research, 2013.
[2] B. Hanczar, J. D. Zucker, C. Henegar and L. Saitta, "Feature construction from synergic pairs to improve microarray-based classification", Bioinformatics, vol. 23, issue 21, pp. 2866-2872, 2007.
[3] L. Yu and H. Liu, "Efficient Feature Selection via Analysis of Relevance and Redundancy", The Journal of Machine Learning Research, vol. 5, pp. 1205-1224, 2004.
[4] X. Zhao, W. Deng and Y. Shi, "Feature Selection with Attributes Clustering by Maximal Information Coefficient", Procedia Computer Science, vol. 17, pp. 70-79, 2013.
[5] G. Dennis Jr., B. T. Sherman, D. A. Hosack, J. Yang, M. W. Baseler, H. C. Lane and R. A. Lempicki, "DAVID: Database for Annotation, Visualization, and Integrated Discovery", Genome Biology, vol. 4, issue 5, 2003.
[6] A. Mendes, "Identification of Breast Cancer Subtypes Using Multiple Gene Expression Microarray Datasets", Advances in Artificial Intelligence, vol. 7106, pp. 92-101, 2011.
[7] A. T. Fard, S. Srihari and M. A. Ragan, "Breast cancer classification: linking molecular mechanism to disease prognosis", Briefings in Bioinformatics, 2014.