Feature Subset Selection from Gene Expression Dataset
Shivani, M.Tech. (C.S.E.)
Guided by Dr. Dhaval Patel
Developments in the Biological Domain
• Recent developments in biotechnology have enabled high-throughput data generation from biological samples.
• Biological data can be generated at many different levels:
  1. Genomics (DNA)
  2. Functional genomics (RNA)
  3. Proteomics (proteins)
  4. Biomedical research
Data Acquisition – Microarray Technology
• A DNA microarray is a collection of microscopic DNA spots attached to a solid surface.
• Expression levels of a large number of genes can be measured simultaneously.
Data Storage – Gene Expression Matrix
[Figure: a gene expression matrix with genes as rows, samples/conditions as columns, and a row of class labels. Typical size: 20,000 genes x 300 samples.]
Challenges
Some problems related to knowledge discovery techniques when applied to high-dimensional gene expression data are:
• Large number of attributes (often redundant)
• Very few samples
• Increased complexity of classification algorithms
• Notion of distance/similarity for clustering algorithms
Feature Subset Selection
• Given a set of n features, the role of feature selection is to select a subset of size d (d < n) that leads to the smallest classification error.
• Fundamentally different from dimensionality reduction (e.g., PCA or LDA), which constructs new features rather than selecting a subset of the original ones.
Feature Subset Selection Methods
Classification of feature subset selection methods, based on the papers I read:
• Classification
  – Filter
    ** Rank based
    ** Space search based
  – Wrapper
  – Embedded
• Clustering
  – Message based
Rank Based Filter Method: Steps
iPcc – Iterative Pearson Correlation Coefficient [1]
• Input matrix (samples x genes):
       g1  g2  g3  g4  g5  g6
  S1    1   1   1   0   0   0
  S2    0   0   0   1   1   1
  S3    1   0   1   0   1   0
• Calculate the sample-by-sample Pearson correlation matrix:
        s1    s2    s3
  s1   1.0  -1.0   0.3
  s2  -1.0   1.0  -0.3
  s3   0.3  -0.3   1.0
• Iteratively employ the correlation on the correlation feature space itself; it converges to the t-order correlation feature space:
        s1    s2    s3
  s1   1.0  -1.0   1.0
  s2  -1.0   1.0  -1.0
  s3   1.0  -1.0   1.0
• The resulting feature space is used for disease class discovery and prediction (a sketch of the iteration follows).
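A minimal numpy sketch of this iteration, assuming expression profiles arrive as a samples-by-genes array; the function name ipcc and the iteration count are illustrative choices, not the authors' code:

```python
import numpy as np

def ipcc(expr, n_iter=3):
    """Sketch of the iPcc idea [1]: repeatedly replace each sample's
    profile by its vector of Pearson correlations with all samples."""
    feats = np.asarray(expr, dtype=float)
    for _ in range(n_iter):
        # np.corrcoef treats rows as variables, so this yields the
        # sample-by-sample Pearson correlation matrix.
        feats = np.corrcoef(feats)
    return feats

# Toy input from the slide: 3 samples x 6 genes.
X = np.array([[1, 1, 1, 0, 0, 0],
              [0, 0, 0, 1, 1, 1],
              [1, 0, 1, 0, 1, 0]])
print(np.round(ipcc(X, n_iter=1), 2))  # first-order correlations (0.33 ~ the 0.3 above)
print(np.round(ipcc(X, n_iter=5), 2))  # iterated correlations sharpen to +/-1
```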
Advantages of correlation features:
• Even with noisy gene expression profiles, iPcc can successfully highlight the similarities among samples.
• The accuracy of classification and clustering algorithms is improved.
• iPcc gives good results for disease class discovery on a real prostate cancer dataset with multiple classes.
Constructing Features Using Mutual Information [2]
• Uses gene interaction information for dimension reduction.
• Blaise Hanczar et al. proposed an algorithm to capture the mutual information and synergic interactions between gene pairs and the class.
Algorithm:
• For each gene Gi and the class C, the mutual information I(Gi, C) is calculated.
• The gene Gi* with the highest mutual information is paired with each remaining gene Gj, and the mutual information between the gene pair and the class, I(Gi*Gj, C), is calculated.
• The pair of genes (Gi*, Gj) maximizing this mutual information is selected, and both genes are removed from the set.
• These steps are iterated m times to obtain m gene pairs (see the sketch below).
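A hedged sketch of the pairing loop, assuming expression values have already been discretized into integer bins (the paper's estimator details differ); select_gene_pairs and the joint-variable encoding are illustrative choices:

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def select_gene_pairs(X, y, m=10):
    """Illustrative version of the pairing scheme in [2].
    X: samples x genes, integer-discretized expression; y: class labels."""
    remaining = list(range(X.shape[1]))
    pairs = []
    for _ in range(m):
        if len(remaining) < 2:
            break
        # Gene with the highest mutual information with the class.
        mi = [mutual_info_score(X[:, g], y) for g in remaining]
        g_star = remaining[int(np.argmax(mi))]
        # Pair g_star with every other gene; the joint variable of a pair
        # is encoded by combining the two discrete expression values.
        best_pair, best_mi = None, -1.0
        for g in remaining:
            if g == g_star:
                continue
            joint = X[:, g_star] * (X[:, g].max() + 1) + X[:, g]
            score = mutual_info_score(joint, y)
            if score > best_mi:
                best_pair, best_mi = (g_star, g), score
        pairs.append(best_pair)
        remaining.remove(best_pair[0])
        remaining.remove(best_pair[1])
    return pairs
```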
Feature Construction from Gene Pairs
From a pair of genes (Gi, Gj), a new feature Ai,j is constructed by the following steps:
• The set of M instances E = {e1, ..., eM} is projected onto the two-dimensional space defined by the expression of Gi and Gj.
• The value Ai,j(x) of the feature Ai,j for every point x in this space is defined as
  $A_{i,j}(x) = \frac{n_a(x)}{k} - \left(1 - \frac{n_a(x)}{k}\right) = -1 + \frac{2\,n_a(x)}{k}$
  where $n_a(x)$ is the number of samples belonging to class Ca among the k nearest neighbours of x.
• The value of the newly constructed feature lies in the range -1 to +1.
• When a sample is close to samples belonging to class Ca, the value is near +1; when it is close to class Cb, the value is near -1. Finally, we have a set of m features (a sketch follows).
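The k-NN step above translates almost directly into code; a sketch assuming binary labels (class Ca encoded as 1, Cb as 0) and the hypothetical helper name pair_feature:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def pair_feature(X2, y, k=5):
    """Compute A_{i,j}(x) = -1 + 2*n_a(x)/k for every sample x, where X2 is
    the samples x 2 matrix of expression values of the pair (Gi, Gj)."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X2)
    _, idx = nn.kneighbors(X2)
    idx = idx[:, 1:]                             # drop each point itself
    n_a = (np.asarray(y)[idx] == 1).sum(axis=1)  # neighbours from class Ca
    return -1.0 + 2.0 * n_a / k
```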
Space Search Based Filter Method
[Flowchart: from the complete set of features, Subset Generation produces a candidate set; Subset Evaluation (relevance and redundancy) updates the current best subset; if the termination condition is not met, generation repeats; otherwise the current best subset is returned as the optimal feature subset.]
Drawback: the high computational cost of the subset search makes these methods inefficient for high-dimensional data.
Feature Selection via Analysis of Relevance and Redundancy [3]
Selecting relevant features:
• Calculate the symmetrical uncertainty between each feature Fi and the class C:
  $SU(X,Y) = 2\,\frac{IG(X|Y)}{H(X) + H(Y)}$
  Here, H(X) is the entropy of variable X, defined as $H(X) = -\sum_i P(x_i)\log_2 P(x_i)$, and IG(X|Y) is the information gain, defined as $IG(X|Y) = H(X) - H(X|Y)$.
• The relevant features with correlation above the threshold are selected and stored in descending order in S'_list.
[Figure: Original set → Relevance Analysis → Relevant subset → Redundancy Analysis → Selected subset]
Removing redundant features:
• The feature with the highest correlation is chosen as the predominant feature and is used to remove every other feature for which it forms a Markov blanket.
• A feature Fj forms a Markov blanket for Fi iff $SU_{j,c} \ge SU_{i,c}$ and $SU_{i,j} \ge SU_{i,c}$.
• The algorithm stops when no more predominant features can be selected (a sketch of the full procedure follows).
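A compact sketch of the whole relevance/redundancy procedure (essentially FCBF [3]), assuming discretized features; entropy, symmetrical_uncertainty, and fcbf are names chosen here for illustration:

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def entropy(x):
    """H(X) = -sum_i P(x_i) log2 P(x_i) for a discrete variable."""
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * IG(X|Y) / (H(X) + H(Y)), scaled to [0, 1]."""
    ig = mutual_info_score(x, y) / np.log(2)   # nats -> bits
    return 2.0 * ig / (entropy(x) + entropy(y))

def fcbf(X, y, delta=0.0):
    """Keep features with SU(F, C) > delta, then let each predominant
    feature remove the features it forms a Markov blanket for."""
    su_c = np.array([symmetrical_uncertainty(X[:, j], y)
                     for j in range(X.shape[1])])
    order = [j for j in np.argsort(-su_c) if su_c[j] > delta]  # S'_list
    selected = []
    while order:
        fi = order.pop(0)        # current predominant feature
        selected.append(fi)
        # Markov blanket test: remove Fj when SU(Fi, Fj) >= SU(Fj, C).
        order = [fj for fj in order
                 if symmetrical_uncertainty(X[:, fi], X[:, fj]) < su_c[fj]]
    return selected
```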
Filter Method: Limitations
• Fails to capture the intrinsic properties of genes: interdependence between genes is ignored, and redundant features are sometimes useful.
• Ignores the performance of the classification algorithm: some features do not work well with a particular algorithm.
• Some genes only differentiate between one or two classes: for skewed sample distributions with 3 or more classes, an overall ranking gives poor results.
Unsupervised Feature Selection [4]
It requires two key steps.
1. Attribute similarity metric – maximal information coefficient (MIC)
• For a 2-dimensional set of elements, one dimension is taken as x and the other as y.
• The x-values are divided into x bins and the y-values into y bins, giving an x-by-y grid G.
• For a given grid G, the distribution obtained by restricting the set D to the cells of G is denoted D|G.
• The characteristic matrix is constructed as
  $M(D)_{x,y} = \frac{I^*(D,x,y)}{\log \min\{x,y\}}$
  where $I^*(D,x,y) = \max_G I(D|G)$ is the largest mutual information of D|G over all x-by-y grids G.
• MIC can then be calculated as
  $\mathrm{MIC}(D) = \max_{xy < B(n)} M(D)_{x,y}$
  where B(n) bounds the grid size (typically $B(n) = n^{0.6}$). A toy approximation follows.
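A rough illustration of MIC using only equal-width bins; the real estimator also optimizes where the grid boundaries fall, so treat mic_approx (a name chosen here) as a toy approximation:

```python
import numpy as np

def mic_approx(x, y):
    """Equal-width-bin approximation of MIC [4]; illustrative only."""
    n = len(x)
    b = int(n ** 0.6)                    # grid size bound B(n) = n^0.6
    best = 0.0
    for nx in range(2, b + 1):
        for ny in range(2, b + 1):
            if nx * ny > b:
                continue
            # Mutual information of the x-by-y grid via a 2-D histogram.
            pxy, _, _ = np.histogram2d(x, y, bins=(nx, ny))
            pxy /= n
            px, py = pxy.sum(axis=1), pxy.sum(axis=0)
            nz = pxy > 0
            mi = np.sum(pxy[nz] * np.log(pxy[nz] / np.outer(px, py)[nz]))
            # Characteristic matrix entry: normalize by log min(nx, ny).
            best = max(best, mi / np.log(min(nx, ny)))
    return best
```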
2. Clustering algorithm – affinity propagation clustering
It takes the similarity matrix as input and iterates the following steps.
• Calculating responsibilities given the availabilities. Initially all availabilities are set to zero; the responsibilities are updated as
  $r(i,k) \leftarrow s(i,k) - \max_{k' \neq k}\{a(i,k') + s(i,k')\}$
  r(i,k) shows how well point k serves as a centroid (exemplar) for point i.
• Updating the availabilities given the responsibilities:
  $a(i,k) \leftarrow \min\{0,\; r(k,k) + \sum_{i' \notin \{i,k\}} \max(0, r(i',k))\}$ for $i \neq k$, and $a(k,k) \leftarrow \sum_{i' \neq k} \max(0, r(i',k))$
  a(i,k) shows how suitable it is for point i to choose point k as its centroid.
• Combining the responsibilities and availabilities to generate clusters. In every iteration, the point k that maximizes $a(i,k) + r(i,k)$ is taken as the centroid for point i.
• Termination condition: the algorithm terminates when the changes in the messages fall below a previously set threshold. The final result is the list of all centroids, which forms the selected subset of features (see the sketch below).
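In practice the message-passing loop need not be hand-coded; a sketch using scikit-learn's AffinityPropagation on a precomputed gene-by-gene similarity matrix, with |Pearson correlation| standing in for MIC to keep the example self-contained:

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))          # 100 samples x 20 genes (toy data)
S = np.abs(np.corrcoef(X.T))            # gene-by-gene similarity matrix

ap = AffinityPropagation(affinity="precomputed", random_state=0).fit(S)
selected = ap.cluster_centers_indices_  # exemplar genes = selected features
print(selected)
```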
Extracting biological relevance of gene list using DAVID[5]
• Database for Annotation, Visualization and Integrated Discovery.
• Released in 2003 (Dennis et al., Genome Biol.; Hosack et al., Genome Biol.).
• Extracts biological features/meaning associated with large gene list.
What is Gene Ontology ?
• The Gene Ontology project provides controlled vocabularies of defined terms representing gene product properties of any organism.
• GO Term - Each GO term within the ontology has a term name, which may be a word or string of words; a unique alphanumeric identifier; a definition with cited sources; and a namespace indicating the domain to which it belongs.
The 3 Gene Ontologies
• Biological Process – series of molecular events
• Cellular Component – parts of a cell or its extracellular environment
• Molecular Function – gene product activity at the molecular level
P-value & enrichment calculation
• Gene list: the subset of features to be tested.
• Background list: chosen from the options provided by DAVID.
• N = all genes (the universe)
• M = all genes belonging to the pathway
• n = genes in your gene list
• m = genes of your gene list that belong to the pathway
A worked example follows.
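The enrichment p-value is the upper tail of a hypergeometric distribution (DAVID itself reports a modified Fisher's exact test, the EASE score); a sketch with made-up counts:

```python
from scipy.stats import hypergeom

N, M, n, m = 20000, 150, 300, 12   # universe, pathway, gene list, overlap
# P(X >= m) when drawing n genes without replacement from N, M of which
# belong to the pathway.
p_value = hypergeom.sf(m - 1, N, M, n)
print(f"enrichment p-value: {p_value:.3g}")
```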
Application of feature subset selection – Breast Cancer Subtyping
Breast Cancer Subtyping
• Gene expression profiling has revealed that breast cancer is highly heterogeneous.
• It is critical to classify it into different subtypes, both for research purposes and for clinical practice. The subtyping schemes are:
Tumor subtypes
• Molecular
  – Basal
  – Luminal A
  – Luminal B
  – HER2 enriched
  – Normal
• Clinical
  – Grade 1
  – Grade 2
  – Grade 3
• Metastasis
  – Metastatic
  – Non-metastatic
Challenges in Analyzing Microarray Datasets
• Microarray data is noisy and heterogeneous, so different clustering techniques tend to find different clusters for the same type of dataset.
• Gene signatures, also known as "biomarkers", do not correlate the underlying biology of the tumour with the clinical outcome; hence they do not fulfil the purpose of identifying the right drug and the right treatment strategy for the patient.
Breast Cancer Subtyping using Multiple Datasets [6]
Consensus clustering
• Aims at finding similar clusters across multiple datasets. It identifies partitions in the datasets that should comply with the following three properties:
• Dataset partitions should be supported by the same set of biomarkers.
  Partition the samples of one dataset D1 into classes c1 and c2. Using the biomarkers identified for the partitions of D1 and a nearest-neighbour classification model, partitions are generated in the other datasets.
• Classification accuracy of samples should be high across all datasets.
  10-fold cross-validation is performed on every dataset Di. The overall accuracy accD is the average of the per-dataset accuracies:
  $acc_D = \frac{1}{|D|}\sum_{i=1}^{|D|} acc(D_i)$
• The proportion of samples falling in a cluster should be similar in all datasets.
  The mean proportion of samples belonging to class Ci over all datasets is
  $\mu_i = \frac{1}{|D|}\sum_{j=1}^{|D|} p_{i,j}$
  where $p_{i,j}$ is the proportion of samples of dataset Dj assigned to class Ci. The balance B sums one such deviation-from-the-mean term per class (see [6] for the exact expression).
• Finally, combining all three required characteristics, the objective function takes the product form
  $obj = n_{markers} \cdot acc_D \cdot B$
• The objective function jointly optimizes the count of biomarkers for D1 (nmarkers), a high average classification accuracy across datasets (accD), and a similar proportion of samples falling in a cluster across the datasets (B). A toy calculation follows.
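To make the product-form objective concrete, a purely hypothetical calculation; acc_D and the mean proportion follow the formulas above, while balance is a stand-in proxy because the exact expression for B in [6] is not reproduced on the slide:

```python
import numpy as np

acc = np.array([0.91, 0.87, 0.89])    # 10-fold CV accuracy per dataset Di
acc_D = acc.mean()                    # average accuracy across datasets

p_c1 = np.array([0.55, 0.48, 0.52])   # proportion of samples in class c1
mu = p_c1.mean()                      # mean proportion across datasets
balance = 1.0 - np.abs(p_c1 - mu).mean()   # hypothetical proxy for B

n_markers = 40                        # biomarkers supporting the D1 partition
obj = n_markers * acc_D * balance     # reconstructed product-form objective
print(round(acc_D, 3), round(balance, 3), round(obj, 2))
```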
Associating Molecular Properties and Clinical Outcome [7]
• Gene signatures correlate the molecular properties of tumours with aggressiveness markers such as grade, with survival outcomes (prognostic), and with response to therapy (predictive).
• Gene signatures like PAM50 are used clinically for classifying breast cancer samples.
• But the classification is more related to molecular subtype than to prognostic or predictive outcome.
• The authors used a large cohort of breast cancer tumours from The Cancer Genome Atlas to evaluate 8 biological pathways and 3 prognostic signatures across a variety of classification scenarios.
Signature type        Signature    Number of genes
Benchmark             PAM50        50
Prognostic            MammaPrint   70
                      Wang-76      76
                      OncotypeDX   21
                      Blood        14
Biological pathways   Cell cycle   114
                      DDR          182
                      Notch        48
                      PI3K         346
                      RAS          227
                      RB           159
                      TGF-b        80
                      Wnt          139
Results
• Biological pathways that are not supposed to determine prognosis nevertheless gave good results in distinguishing survival outcome.
• Similarly, the prognostic signatures designed for estimating survival outcomes showed fine results when subtyping the samples on the basis of molecular properties.
• The biological pathways best at distinguishing survival outcome and the prognostic signatures best at distinguishing molecular subtype were combined into a 12-gene signature.
• These 12 genes gave better classification performance than the signatures used independently, and were second only to PAM50.
Conclusion
• It can be concluded that a constructive prediction model should consider anatomical, histopathological, molecular and prognostic parameters for better drug discovery as well as better clinical strategies.
Conclusion
• Filter methods are preferred for feature subset selection because they are computationally less expensive than wrapper and embedded methods and are scalable to high dimensional datasets.
• The selected set of gene signatures cannot predict breast cancer subtype accurately across heterogeneous datasets.
• They are also unable to predict the clinical outcome of a sample by linking it with its molecular properties.
• A future goal would be to generate an optimal set of features which can accurately classify a breast cancer sample into its subtype and predict its clinical outcome.
References
[1] X. Ren, Y. Wang, X. S. Zhang and Q. Jin, "iPcc: a novel feature extraction method for accurate disease class discovery and prediction", Nucleic Acids Research, 2013.
[2] B. Hanczar, J. D. Zucker, C. Henegar and L. Saitta, "Feature construction from synergic pairs to improve microarray-based classification", Bioinformatics, vol. 23, issue 21, pp. 2866-2872, 2007.
[3] L. Yu and H. Liu, "Efficient Feature Selection via Analysis of Relevance and Redundancy", The Journal of Machine Learning Research, vol. 5, pp. 1205-1224, 2004.
[4] X. Zhao, W. Deng and Y. Shi, "Feature Selection with Attributes Clustering by Maximal Information Coefficient", Procedia Computer Science, vol. 17, pp. 70-79, 2013.
[5] G. Dennis Jr., B. T. Sherman, D. A. Hosack, J. Yang, M. W. Baseler, H. C. Lane and R. A. Lempicki, "DAVID: Database for Annotation, Visualization, and Integrated Discovery", Genome Biology, vol. 4, issue 5, 2003.
[6] A. Mendes, "Identification of Breast Cancer Subtypes Using Multiple Gene Expression Microarray Datasets", Advances in Artificial Intelligence, vol. 7106, pp. 92-101, 2011.
[7] A. T. Fard, S. Srihari and M. A. Ragan, "Breast cancer classification: linking molecular mechanism to disease prognosis", Briefings in Bioinformatics, 2014.