seminar slides

Feature Subset Selection from Gene Expression Dataset. Shivani, M.Tech. (C.S.E.). Guided by Dr. Dhaval Patel

Posted on 09-Aug-2015

TRANSCRIPT

Page 1: Seminar Slides

Feature Subset Selection from Gene Expression Dataset

Shivani, M.Tech. (C.S.E.)

Guided by Dr. Dhaval Patel

Page 2: Seminar Slides

Developments in the Biological Domain

• Recent developments in biotechnology have enabled high-throughput data generation from biological samples.

• Biological data can be generated at many different levels:

1. Genomics (DNA)
2. Functional genomics (RNA)
3. Proteomics (proteins)
4. Biomedical research

Page 3: Seminar Slides

Data Acquisition – Microarray Technology

• A DNA microarray is a collection of microscopic DNA spots attached to a solid surface.

• Expression levels of a large number of genes can be measured simultaneously.

Page 4: Seminar Slides

Data Storage – Gene Expression Matrix

The data is stored as a matrix with genes as rows and samples/conditions as columns, with a class label attached to each sample. Typical size: 20000 x 300.

Page 5: Seminar Slides

Challenges

Some problems related to knowledge discovery techniques when applied to high-dimensional gene expression data are:

• Large number of attributes (often redundant)

• Very few samples

• Increased complexity of classification algorithms

• Notion of distance / similarity for clustering algorithms

Page 6: Seminar Slides

Feature Subset Selection

• Given a set of n features, the role of feature selection is to select a subset of size d (d<n) that leads to the smallest classification error.

• Fundamentally different from dimensionality reduction (e.g., PCA or LDA), which constructs new combined features instead of retaining a subset of the original ones.

Page 7: Seminar Slides

Feature Subset Selection Methods

Feature subset selection
• Classification
  - Filter
    * Rank based
    * Space search based
  - Wrapper
  - Embedded
• Clustering
  - Message based

Classification of feature subset selection methods based on the papers I read.

Page 8: Seminar Slides

Rank based Filter Method: Steps

Page 9: Seminar Slides

iPcc – Iterative Pearson Correlation Coefficient [1]

Input matrix (samples x genes):

     g1 g2 g3 g4 g5 g6
s1    1  1  1  0  0  0
s2    0  0  0  1  1  1
s3    1  0  1  0  1  0

Calculating Pearson correlation gives the first-order correlation feature space (pairwise correlation of samples):

      s1    s2    s3
s1   1.0  -1.0   0.3
s2  -1.0   1.0  -0.3
s3   0.3  -0.3   1.0

Iteratively employing correlation, the t-order correlation feature space converges to:

      s1  s2  s3
s1     1  -1   1
s2    -1   1  -1
s3     1  -1   1

The sharpened sample-similarity structure is then used for disease class discovery & prediction.
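The iteration sketched above can be written in a few lines of NumPy (an illustrative sketch of the idea in [1], not the authors' reference implementation; `ipcc` is a name chosen here):

```python
import numpy as np

def ipcc(X, iterations=3):
    """Iteratively re-express each sample as its vector of Pearson
    correlations with all samples, then correlate again in that space."""
    F = np.asarray(X, dtype=float)  # rows = samples, columns = features
    for _ in range(iterations):
        # np.corrcoef treats each row as one variable (one sample profile)
        F = np.corrcoef(F)
    return F

# Toy matrix from the slide: 3 samples x 6 genes
X = np.array([[1, 1, 1, 0, 0, 0],
              [0, 0, 0, 1, 1, 1],
              [1, 0, 1, 0, 1, 0]])
C = ipcc(X, iterations=2)
```

One pass reproduces the slide's first-order matrix (the 0.3 entries are exactly 1/3); further passes push the off-diagonal entries towards ±1, which is the sharpening effect the method relies on.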

Page 10: Seminar Slides

Advantages of correlation features:

• Even with noisy gene expression profiles, iPcc can successfully highlight the similarities among samples.

• The accuracy of classification and clustering algorithms is improved.

• iPcc gives good results for disease class discovery on a real prostate cancer dataset with multiple classes.

Page 11: Seminar Slides

Constructing features using mutual information [2]

• Uses interaction information between genes for dimensionality reduction.

• B. Hanczar et al. proposed an algorithm to capture the mutual information and synergistic interactions between gene pairs and the class. Algorithm:

• For each gene Gi, the mutual information I(Gi, C) with the class C is calculated.

• The gene Gi* with the highest mutual information is paired with each remaining gene Gj, and the mutual information between the gene pair and the class, I(Gi*Gj, C), is calculated.

Page 12: Seminar Slides

• The pair of genes (Gi, Gj) maximizing the mutual information is selected, and both genes are removed from the set.

• These steps are iterated m times to obtain m gene pairs.
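Assuming discretized (e.g. binarized) expression values, the greedy search can be sketched as follows (`mutual_info` and `select_gene_pairs` are names invented here for illustration):

```python
import numpy as np
from collections import Counter
from math import log2

def mutual_info(xs, c):
    """I(X;C) in bits for two discrete sequences of equal length."""
    n = len(c)
    px, pc = Counter(xs), Counter(c)
    joint = Counter(zip(xs, c))
    return sum(cnt / n * log2((cnt / n) / ((px[x] / n) * (pc[y] / n)))
               for (x, y), cnt in joint.items())

def select_gene_pairs(X, y, m=2):
    """Greedy sketch of the pair search described above: pick the most
    informative gene, pair it with the partner maximising the joint
    information I(Gi*Gj, C), remove both, repeat m times."""
    remaining = list(range(X.shape[1]))
    pairs = []
    for _ in range(m):
        if len(remaining) < 2:
            break
        best = max(remaining,
                   key=lambda g: mutual_info(tuple(X[:, g]), y))
        partner = max((g for g in remaining if g != best),
                      key=lambda g: mutual_info(list(zip(X[:, best], X[:, g])), y))
        pairs.append((best, partner))
        remaining.remove(best)
        remaining.remove(partner)
    return pairs
```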

Page 13: Seminar Slides

Feature construction from gene pairs

From a pair of genes (Gi, Gj) a new feature Ai,j is constructed by the following steps:

• The set of M instances E = {e1, ..., eM} is projected onto the two-dimensional space defined by the expression of Gi and Gj.

• The value Ai,j(x) of the feature Ai,j for every point x in this space is defined as

  A(x) = na(x)/k - (1 - na(x)/k) = -1 + 2 na(x)/k

where na(x) is the number of samples belonging to class Ca among the k nearest neighbours of x.

Page 14: Seminar Slides

• The value of the newly constructed feature lies in the range -1 to +1.

• When a sample is close to samples belonging to class Ca, the value is near +1; when it is close to class Cb, the value is near -1. Finally, we obtain a set of m features.
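A minimal sketch of the construction, assuming Euclidean distance in the (Gi, Gj) plane (`construct_feature` is a name invented here):

```python
import numpy as np

def construct_feature(points, classes, k=3, target_class=1):
    """A(x) = -1 + 2 * n_a(x) / k, where n_a(x) counts class-C_a samples
    among the k nearest neighbours of x in the (G_i, G_j) plane."""
    P = np.asarray(points, dtype=float)
    classes = np.asarray(classes)
    A = np.empty(len(P))
    for idx in range(len(P)):
        d = np.linalg.norm(P - P[idx], axis=1)
        d[idx] = np.inf            # exclude the point itself
        nn = np.argsort(d)[:k]     # indices of the k nearest neighbours
        n_a = int(np.sum(classes[nn] == target_class))
        A[idx] = -1.0 + 2.0 * n_a / k
    return A
```

A point surrounded by class-Ca neighbours gets +1, one surrounded by class-Cb neighbours gets -1, matching the range described above.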

Page 15: Seminar Slides

Space search based filter method

Starting from the complete set of features, the search loop is:

• Subset generation – a candidate subset is produced from the search space.

• Subset evaluation (relevancy and redundancy) – the candidate is compared against the current best subset.

• Termination condition – if it is not met, generation continues; when it is met, the current best subset is returned as the optimal feature subset.

Drawback – the high computational cost of the subset search makes these methods inefficient for high-dimensional data.

Page 16: Seminar Slides

Feature selection via analysis of relevance and redundancy [3]

The method works in two stages: relevance analysis reduces the original set to a relevant subset, and redundancy analysis reduces that to the selected subset.

Selecting relevant features

• Calculate the symmetrical uncertainty between feature Fi and the class C:

  SU(X, Y) = 2 [ IG(X|Y) / (H(X) + H(Y)) ]

Here, H(X) is the entropy of variable X, defined as

  H(X) = - Σi P(xi) log2 P(xi)

and IG(X|Y) is called the information gain, defined as

  IG(X|Y) = H(X) - H(X|Y)

• The relevant features with correlation above a threshold are selected and stored in sorted order in S'list.

Page 17: Seminar Slides

Removing redundant features

• The feature with the highest correlation is chosen as the predominant feature and is used to remove the other features for which it forms a Markov blanket.

• A feature Fj forms a Markov blanket for Fi iff SU(j,C) >= SU(i,C) and SU(i,j) >= SU(i,C).

• The algorithm stops when no more predominant features can be selected.
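Both phases can be sketched together in a simplified FCBF-style routine over discrete features (`sym_uncertainty` and `fcbf` are illustrative names; the Markov blanket test is the approximate one stated above):

```python
from collections import Counter
from math import log2

def entropy(xs):
    n = len(xs)
    return -sum(c / n * log2(c / n) for c in Counter(xs).values())

def sym_uncertainty(xs, ys):
    """SU(X,Y) = 2 * IG(X|Y) / (H(X) + H(Y)), with IG(X|Y) = H(X) - H(X|Y)."""
    hx, hy = entropy(xs), entropy(ys)
    ig = hx + hy - entropy(list(zip(xs, ys)))  # H(X) - H(X|Y) via joint entropy
    return 2 * ig / (hx + hy) if hx + hy else 0.0

def fcbf(features, c, delta=0.0):
    """Keep features with SU(F, C) > delta, rank by relevance, then drop
    any feature blanketed by a more relevant one already selected."""
    su_c = {i: sym_uncertainty(f, c) for i, f in enumerate(features)}
    ranked = [i for i in sorted(su_c, key=su_c.get, reverse=True) if su_c[i] > delta]
    selected = []
    for i in ranked:
        # Fj blankets Fi iff SU(j,C) >= SU(i,C) and SU(i,j) >= SU(i,C)
        blanketed = any(su_c[j] >= su_c[i] and
                        sym_uncertainty(features[i], features[j]) >= su_c[i]
                        for j in selected)
        if not blanketed:
            selected.append(i)
    return selected
```

A duplicated (fully redundant) feature is blanketed by its copy and dropped, while an irrelevant one never passes the SU threshold.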

Page 18: Seminar Slides

Filter Method: Limitations

• Fails to capture the intrinsic properties of genes: interdependence between genes is ignored, and redundant features are sometimes useful.

• Ignores the performance of the classification algorithm: some features do not work well with a particular algorithm.

• Some genes only differentiate between one or two classes: for skewed sample distributions with 3 or more classes, an overall ranking gives poor results.

Page 19: Seminar Slides

Unsupervised feature selection [4]

It requires two key steps.

1. Attribute similarity metric – maximal information coefficient (MIC)

• One dimension of a two-dimensional set of elements is taken as x and the other dimension as y.

• The x-values are divided into x bins and the y-values into y bins, giving an x-by-y grid G.

Page 20: Seminar Slides

• For an x-by-y grid G, the distribution that G induces on the set D is denoted D|G.

• The characteristic matrix is constructed as

  M(D)x,y = max I(D|G) / log min(x, y)

where the maximum is taken over all x-by-y grids G, and I(D|G) is the mutual information of D|G.

• MIC can be calculated as

  MIC(D) = max over {x, y : xy < B(n)} of M(D)x,y

where B(n) = n^0.6 is the grid-size budget used by Reshef et al.
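A rough approximation with equal-width bins conveys the idea (the real MIC estimator optimises grid placements over all partitions, so this sketch only lower-bounds it; `grid_mi` and `mic` are names chosen here):

```python
import numpy as np
from math import log2

def grid_mi(xs, ys, nx, ny):
    """Mutual information I(D|G) for an equal-width nx-by-ny grid G."""
    counts, _, _ = np.histogram2d(xs, ys, bins=[nx, ny])
    n = counts.sum()
    px = counts.sum(axis=1) / n
    py = counts.sum(axis=0) / n
    mi = 0.0
    for i in range(nx):
        for j in range(ny):
            p = counts[i, j] / n
            if p > 0:
                mi += p * log2(p / (px[i] * py[j]))
    return mi

def mic(xs, ys, b=None):
    """Characteristic-matrix entry I(D|G) / log2(min(x, y)), maximised
    over grids with x * y <= B(n); B(n) = n**0.6 as in Reshef et al."""
    n = len(xs)
    b = b or int(n ** 0.6)
    best = 0.0
    for nx in range(2, b + 1):
        for ny in range(2, b + 1):
            if nx * ny <= b:
                best = max(best, grid_mi(xs, ys, nx, ny) / log2(min(nx, ny)))
    return best
```

On a noiseless linear relationship even this crude version reaches the maximal score of 1.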

Page 21: Seminar Slides

2. Clustering algorithm – affinity propagation clustering

It takes the similarity matrix as input and iterates the following steps.

• Calculating responsibilities given the availabilities. Initially the availabilities are set to zero; the responsibilities are calculated as

  r(i,k) <- s(i,k) - max over k' != k of { a(i,k') + s(i,k') }

r(i,k) shows how well point k serves as centroid for point i.

Page 22: Seminar Slides

• Updating all the availabilities given the responsibilities. Availabilities are calculated as

  a(i,k) <- min { 0, r(k,k) + Σ over i' not in {i,k} of max(0, r(i',k)) }   for i != k
  a(k,k) <- Σ over i' != k of max(0, r(i',k))

a(i,k) shows how suitable it is for point i to choose point k as its centroid.

• Combining the responsibilities and availabilities to generate clusters. In every iteration, the point k that maximizes a(i,k) + r(i,k) is regarded as the centroid for point i.

• Termination condition: the algorithm terminates when the change in the messages falls below a previously set threshold. The final result is the list of all centroids, which is the selected optimal subset of features.
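The two message updates can be vectorised as below (a compact sketch of the Frey–Dueck updates with the standard damping trick; for real use, scikit-learn's AffinityPropagation implements the same scheme):

```python
import numpy as np

def affinity_propagation(S, iters=200, damping=0.5):
    """Sketch of the two message updates; S is the similarity matrix and
    the diagonal S[k, k] (the "preference") controls how readily point k
    becomes a centroid."""
    n = S.shape[0]
    R = np.zeros((n, n))  # responsibilities r(i, k)
    A = np.zeros((n, n))  # availabilities a(i, k), initially zero
    for _ in range(iters):
        # r(i,k) <- s(i,k) - max_{k' != k} [ a(i,k') + s(i,k') ]
        AS = A + S
        top = np.argmax(AS, axis=1)
        first = AS[np.arange(n), top]
        AS[np.arange(n), top] = -np.inf
        second = AS.max(axis=1)
        Rnew = S - first[:, None]
        Rnew[np.arange(n), top] = S[np.arange(n), top] - second
        R = damping * R + (1 - damping) * Rnew
        # a(i,k) <- min(0, r(k,k) + sum_{i' not in {i,k}} max(0, r(i',k)))
        # a(k,k) <-        sum_{i' != k} max(0, r(i',k))
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, R.diagonal())
        Anew = Rp.sum(axis=0)[None, :] - Rp
        diag = Anew.diagonal().copy()
        Anew = np.minimum(Anew, 0)
        np.fill_diagonal(Anew, diag)
        A = damping * A + (1 - damping) * Anew
    return np.argmax(A + R, axis=1)  # centroid chosen by each point

# Toy run: 1-D points in two groups, similarity = negative squared distance
x = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
S = -(x[:, None] - x[None, :]) ** 2
np.fill_diagonal(S, -1.0)  # hand-set preference for this toy example
labels = affinity_propagation(S)
```

With this preference the two tight groups each elect one of their own points as centroid.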

Page 23: Seminar Slides

Extracting biological relevance of gene list using DAVID[5]

• Database for Annotation, Visualization and Integrated Discovery.

• Released in 2003 (Dennis et al., Genome Biol.; Hosack et al., Genome Biol.)

• Extracts biological features/meaning associated with large gene list.

Page 24: Seminar Slides

What is Gene Ontology ?

• The Gene Ontology project provides controlled vocabularies of defined terms representing gene product properties of any organism.

• GO Term - Each GO term within the ontology has a term name, which may be a word or string of words; a unique alphanumeric identifier; a definition with cited sources; and a namespace indicating the domain to which it belongs.

Page 25: Seminar Slides

The 3 Gene Ontologies

• Biological Process – molecular events

• Cellular Component – parts of a cell or its extracellular environment

• Molecular Function – gene product activity at the molecular level

Page 26: Seminar Slides

P-value & enrichment calculation

Gene list: the subset of features to be tested.
Background list: chosen from the options provided by DAVID.

N = all genes (universe)
M = all genes belonging to a pathway
n = genes in your gene list
m = genes of your gene list that belong to the pathway
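With these four counts, an enrichment p-value is the upper tail of the hypergeometric distribution; DAVID's EASE score is a conservative, modified variant of this Fisher exact test (`enrichment_p` is a name chosen here):

```python
from math import comb

def enrichment_p(N, M, n, m):
    """P(X >= m) for X ~ Hypergeometric(N, M, n): the chance that a random
    n-gene list contains at least m genes from an M-gene pathway."""
    return sum(comb(M, k) * comb(N - M, n - k)
               for k in range(m, min(n, M) + 1)) / comb(N, n)
```

For example, a 2-gene list landing entirely inside a 5-gene pathway drawn from a 10-gene universe gives p = C(5,2)/C(10,2) ≈ 0.22.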

Page 27: Seminar Slides
Page 28: Seminar Slides

Application of feature subset selection – Breast Cancer Subtyping

Page 29: Seminar Slides

Breast Cancer Subtyping

• Gene expression profiling has revealed that breast cancer is highly heterogeneous.

• It is critical to classify it into different subtypes, both for research purposes and for clinical practice. Subtyping schemes are:

Page 30: Seminar Slides

Tumor Subtypes

Molecular: Basal, Luminal A, Luminal B, HER2-enriched, Normal

Clinical: Grade 1, Grade 2, Grade 3

Metastasis: Metastatic, Non-metastatic

Page 31: Seminar Slides

Challenges in analyzing micro-array datasets

• Microarray data is noisy and heterogeneous; different clustering techniques tend to find different clusters in the same dataset.

• Gene signatures, also known as "biomarkers", do not correlate the underlying biology of the tumour with the clinical outcome; hence they do not fulfil the purpose of identifying the right drug and the right treatment strategy for the patient.

Page 32: Seminar Slides

Breast Cancer Subtyping using Multiple Datasets [6]

Consensus clustering

• Aims at finding similar clusters across multiple datasets. It identifies partitions in the datasets that should comply with the following three properties:

• Dataset partitions should be supported by the same set of biomarkers. The samples of one dataset D1 are partitioned into classes c1 and c2; using the biomarkers identified for the partitions of D1 and a nearest-neighbour classification model, partitions are generated in the other datasets.

Page 33: Seminar Slides

• Classification accuracy of samples should be high across all datasets. 10-fold cross validation is performed on each dataset Di, and the overall accuracy accD is taken as the mean of the per-dataset accuracies.

• The proportion of samples falling in a cluster should be similar in all datasets. The mean proportion of samples belonging to class Ci across all datasets is calculated.

Page 34: Seminar Slides

• A balance term B is computed from how far each dataset's class proportions deviate from these means.

• Finally, combining all three required characteristics, the objective function is the product of three terms: it rewards a small count of biomarkers for D1 (nmarkers), a high average classification accuracy across datasets (accD), and a similar proportion of samples falling in a cluster across multiple datasets (B).

Page 35: Seminar Slides

Associating molecular properties and clinical outcome [7]

• Gene signatures correlate the molecular properties of tumours with aggressiveness markers such as grade, with survival outcomes (prognostic), and with response to therapy (predictive).

• Gene signatures like PAM50 are used clinically for classifying breast cancer samples.

• But the classification is more related to molecular subtype than to prognostic or predictive outcome.

• The authors used a large cohort of breast cancer tumours from The Cancer Genome Atlas to evaluate 8 biological pathways and 3 prognostic signatures across a variety of classification scenarios.

Page 36: Seminar Slides

Signature type         Signature    Number of genes

Benchmark              PAM50        50

Prognostic             MammaPrint   70
                       Wang-76      76
                       OncotypeDX   21
                       Blood        14

Biological pathways    Cell cycle   114
                       DDR          182
                       Notch        48
                       PI3K         346
                       RAS          227
                       RB           159
                       TGF-b        80
                       Wnt          139

Page 37: Seminar Slides

Results

• Biological pathways that are not supposed to determine prognosis nevertheless gave good results when distinguishing survival outcomes.

• Similarly, the prognostic signatures designed for estimating survival outcomes showed good results for subtyping the samples on the basis of molecular properties.

Page 38: Seminar Slides

• The biological pathways best at distinguishing survival outcome and the prognostic signatures best at distinguishing molecular subtype were combined to give a 12-gene signature.

• These 12 genes gave better classification performance than the signatures used independently, second only to PAM50.

Conclusion

• It can be concluded that an effective prediction model should consider anatomical, histopathological, molecular and prognostic parameters for better drug discovery as well as clinical strategies.

Page 39: Seminar Slides

Conclusion

• Filter methods are preferred for feature subset selection because they are computationally less expensive than wrapper and embedded methods and are scalable to high dimensional datasets.

• The selected set of gene signatures cannot predict breast cancer subtype accurately across heterogeneous datasets.

• They are also unable to predict the clinical outcome of a sample by linking it with its molecular properties.

• The future goal is to generate an optimal set of features that can accurately classify a breast cancer sample into its subtype and predict its clinical outcome.

Page 40: Seminar Slides

References

[1] X. Ren, Y. Wang, X. S. Zhang and Q. Jin, "iPcc: a novel feature extraction method for accurate disease class discovery and prediction", Nucleic Acids Research, 2013.

[2] B. Hanczar, J. D. Zucker, C. Henegar and L. Saitta, "Feature construction from synergic pairs to improve microarray-based classification", Bioinformatics, vol. 23, issue 21, pp. 2866-2872, 2007.

[3] L. Yu and H. Liu, "Efficient Feature Selection via Analysis of Relevance and Redundancy", The Journal of Machine Learning Research, vol. 5, pp. 1205-1224, 2004.

[4] X. Zhao, W. Deng and Y. Shi, "Feature Selection with Attributes Clustering by Maximal Information Coefficient", Procedia Computer Science, vol. 17, pp. 70-79, 2013.

Page 41: Seminar Slides

[5] G. Dennis Jr., B. T. Sherman, D. A. Hosack, J. Yang, M. W. Baseler, H. C. Lane and R. A. Lempicki, "DAVID: Database for Annotation, Visualization, and Integrated Discovery", Genome Biology, vol. 4, issue 5, 2003.

[6] A. Mendes, "Identification of Breast Cancer Subtypes Using Multiple Gene Expression Microarray Datasets", Advances in Artificial Intelligence, vol. 7106, pp. 92-101, 2011.

[7] A. T. Fard, S. Srihari and M. A. Ragan, "Breast cancer classification: linking molecular mechanism to disease prognosis", Briefings in Bioinformatics, 2014.