Advanced Gene Selection Algorithms Designed for Microarray Datasets

Limitation of current feature selection methods: they ignore gene-gene interactions. This applies both to single-gene discriminative scores and to correlation (redundancy) based algorithms.
Virtual Gene Algorithm: using correlations between genes
Gene Ontology Based Gene Selection: integrating domain knowledge
Boost Selection: feature selection based on bootstraps

Virtual Gene Algorithm
Gene-to-gene correlations are generally ignored in feature selection algorithms. In this work, we examine using such correlations, instead of ignoring them, for the purpose of gene selection. Motivating examples from both synthetic and real datasets are shown in the next two pages.

Virtual Gene: Motivating Example

Virtual Gene Algorithm
The expression level of any single gene does not capture the class label distinction. However, the combination of the expression levels of two genes captures the class label distinction quite well.
Virtual gene: a linear combination of genes.

Virtual gene definitions

Systematically examining all possible virtual genes
There are on the order of 2^n possible virtual genes that can be constructed from a set of n genes, one per subset of constituent genes. Pairwise virtual genes are those virtual genes whose constituent gene set is limited to size 2. This reduces computation enormously. Clustering algorithms are further used to reduce the number of gene pairs to be considered. The clustering algorithm identifies genes that potentially interact or share similar functions.

Pairwise virtual gene algorithm
Our experiments show that limiting pairwise virtual gene computation to genes in the same cluster greatly reduces computational complexity while preserving classification accuracy.
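The idea of pairwise virtual genes restricted to clusters can be illustrated with a minimal sketch. The virtual gene definitions did not survive in this transcript, so the sketch makes two loud assumptions: the linear combination weights come from Fisher's linear discriminant, and discriminative power is measured by an absolute Welch t-statistic. The function names, the gene names gA to gE, and the hand-assigned cluster labels (standing in for the output of a clustering step) are all hypothetical.

```python
import itertools
import math
import random
import statistics


def t_score(values, labels):
    """Absolute Welch t-statistic of one expression vector for 0/1 class labels."""
    a = [v for v, c in zip(values, labels) if c == 0]
    b = [v for v, c in zip(values, labels) if c == 1]
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return abs(statistics.mean(a) - statistics.mean(b)) / se


def fisher_pair_weights(x1, x2, labels):
    """Fisher discriminant weights for two genes (assumed weighting, not from the slides)."""
    s11 = s12 = s22 = 0.0  # entries of the pooled within-class scatter matrix
    means = {}
    for c in (0, 1):
        u = [v for v, l in zip(x1, labels) if l == c]
        w = [v for v, l in zip(x2, labels) if l == c]
        mu, mw = statistics.mean(u), statistics.mean(w)
        means[c] = (mu, mw)
        s11 += sum((p - mu) ** 2 for p in u)
        s22 += sum((q - mw) ** 2 for q in w)
        s12 += sum((p - mu) * (q - mw) for p, q in zip(u, w))
    d1 = means[1][0] - means[0][0]
    d2 = means[1][1] - means[0][1]
    det = s11 * s22 - s12 * s12  # near zero if the two genes are almost collinear
    # Closed-form solution of the 2x2 system S_w * w = (d1, d2).
    return (s22 * d1 - s12 * d2) / det, (s11 * d2 - s12 * d1) / det


def rank_pairwise_virtual_genes(expr, labels, clusters):
    """Score every within-cluster gene pair; expr maps gene name -> expression list."""
    ranked = []
    for g1, g2 in itertools.combinations(sorted(expr), 2):
        if clusters[g1] != clusters[g2]:
            continue  # pairs spanning two clusters are never examined
        w1, w2 = fisher_pair_weights(expr[g1], expr[g2], labels)
        virtual = [w1 * a + w2 * b for a, b in zip(expr[g1], expr[g2])]
        ranked.append((t_score(virtual, labels), (g1, g2)))
    return sorted(ranked, reverse=True)


# Synthetic data: gA and gB share a large common component that hides a small,
# opposite class shift in each gene; gC carries only the common component.
random.seed(0)
labels = [0] * 50 + [1] * 50
shared = [random.gauss(0, 3) for _ in labels]
expr = {
    "gA": [s + 0.5 * c + random.gauss(0, 0.1) for s, c in zip(shared, labels)],
    "gB": [s - 0.5 * c + random.gauss(0, 0.1) for s, c in zip(shared, labels)],
    "gC": [s + random.gauss(0, 0.1) for s in shared],
    "gD": [random.gauss(0, 1) for _ in labels],
    "gE": [random.gauss(0, 1) for _ in labels],
}
clusters = {"gA": 0, "gB": 0, "gC": 0, "gD": 1, "gE": 1}  # hypothetical cluster labels

ranked = rank_pairwise_virtual_genes(expr, labels, clusters)
for score, pair in ranked:
    print(pair, round(score, 2))
```

With five genes there are 10 possible pairs, but only the 4 within-cluster pairs are scored. In this toy setup the combination of gA and gB is expected to score far higher than either gene alone, because the weighted combination cancels the shared component and leaves the class shift.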
Pairwise virtual gene algorithm
The pairwise virtual gene algorithm runs in three stages:
1. Cluster genes into gene clusters using the k-means algorithm.
2. Compute pairwise virtual genes within clusters, their virtual gene expressions, and their discriminative power.
3. Select the top-ranked virtual gene, then degrade discriminative power using two parameters supplied by the user.

Pairwise virtual gene algorithm
Parameters to the pairwise virtual gene algorithm:
First degradation parameter, range [0,1]: the likelihood of virtual genes with the same constituent genes being selected.
Second degradation parameter, range [0,1]: the likelihood of virtual genes whose constituent genes come from the same cluster being selected.
k: the number of virtual genes to be selected.

Experiments: Virtual Gene
Extensive experiments are performed on three publicly available datasets: colon cancer, leukemia, and multi-class cancer. We briefly discuss the performance on these datasets and report more detailed results on the colon cancer dataset. Performance is measured by a cross-validation procedure using three classifiers (SVM, KNN, DLD). The performance of four feature subset selection (FSS) algorithms is compared.

Experiments: Virtual Gene
Summary of classification performance of the virtual gene algorithm.

Experiments: Virtual Gene
More detailed results on the colon cancer dataset:
Study how the choice of the number of clusters in the pairwise virtual gene algorithm affects classification performance.
Study how the choice of initial cluster centers in the pairwise virtual gene algorithm affects gene selection performance.

Experiments: Virtual Gene, number of clusters

Experiments: Virtual Gene, initial cluster centers

The limit of pairwise virtual gene algorithm
A biological process can involve more than two genes at a time, so the pairwise virtual gene algorithm might be too restrictive in this sense. Our goal is to investigate the relative expression values of biologically related genes.
Using domain knowledge enables us to do just that, to some degree.

Different levels of feature selection
Single-gene discriminative scores ignore feature correlations completely. Exhaustive search of the power set is too slow. The GO based virtual gene algorithm uses domain knowledge to decide intelligently which gene sets to explore.

More on GO and GO annotation
The Gene Ontology (GO) consists of GO terms, which form a shared biological vocabulary. GO terms are connected by is-a or is-part-of relationships. Together, the GO terms and the relationships between them form a DAG (directed acyclic graph). Genes are annotated with GO terms by GO collaborators. Gene annotations are assumed to be transitive in this thesis: if a gene is annotated by a GO term, it is also considered to be annotated by all the parent GO terms of that GO term.

Domain knowledge in form of gene ontology annotations

Some definitions

Explanation of Definitions
The GO distance between genes measures how close two genes are, based on the information embedded in their GO annotations. The gene connectivity graph shows the overall gene affinity. We want to examine correlation in gene expression between tightly related genes. Our algorithm is best demonstrated using the graph in the next slide.

GO based virtual gene algorithm
First, GO distances between genes are computed. Genes that are close to each other are identified by finding cliques in the gene connectivity graph. Each small gene clique is used to create a virtual gene. Virtual genes are then ranked using single-gene discriminative scores.

Experiment Setup
Two publicly available microarray expression data sets are used: colon cancer and leukemia. Three gene ontology branches are used separately. Three classifiers are used. GO annotations are extracted from Stanford's online database SOURCE.

Experiments: GO Virtual Gene
Experiment result on the colon cancer data set.

Experiments: GO Virtual Gene
Experiment result on the leukemia data set.
Conclusion: GO Virtual Gene
Using the domain knowledge embedded in GO annotations enables us to examine expression correlations among a large set of genes. The GO based virtual gene algorithm sometimes improves gene selection performance significantly.
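The first steps of the GO based virtual gene algorithm (transitive annotation, GO distance, connectivity graph, cliques) can be sketched as follows. The exact GO distance definition is not recoverable from this transcript, so the sketch assumes a Jaccard distance between propagated annotation sets. The distance threshold, the term names, and the gene names are likewise hypothetical, and "parent GO terms" is read here as all ancestors in the DAG.

```python
from itertools import combinations


def propagate(annotations, parents):
    """Transitive annotation: each gene also receives every ancestor of its GO terms."""
    full = {}
    for gene, terms in annotations.items():
        closed = set(terms)
        stack = list(terms)
        while stack:
            for parent in parents.get(stack.pop(), ()):
                if parent not in closed:
                    closed.add(parent)
                    stack.append(parent)
        full[gene] = closed
    return full


def go_distance(a, b):
    """Assumed GO distance: Jaccard distance between two propagated term sets."""
    return 1.0 - len(a & b) / len(a | b)


def connectivity_graph(full, threshold):
    """Connect two genes when their assumed GO distance is at most `threshold`."""
    adj = {g: set() for g in full}
    for g, h in combinations(sorted(full), 2):
        if go_distance(full[g], full[h]) <= threshold:
            adj[g].add(h)
            adj[h].add(g)
    return adj


def maximal_cliques(adj):
    """Bron-Kerbosch enumeration (without pivoting) of all maximal cliques."""
    found = []

    def expand(r, p, x):
        if not p and not x:
            found.append(sorted(r))
            return
        for v in sorted(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return sorted(found)


# Hypothetical toy ontology: child term -> parent terms (is-a / is-part-of edges).
parents = {
    "t_a": ["root"], "t_b": ["root"],
    "t_a1": ["t_a"], "t_a2": ["t_a"], "t_b1": ["t_b"],
}
annotations = {
    "g1": {"t_a1"}, "g2": {"t_a1", "t_a2"}, "g3": {"t_a2"}, "g4": {"t_b1"},
}

full = propagate(annotations, parents)
cliques = maximal_cliques(connectivity_graph(full, threshold=0.6))
candidate_sets = [c for c in cliques if len(c) >= 2]  # gene sets for virtual genes
print(candidate_sets)
```

Each gene set in `candidate_sets` would then be turned into one virtual gene and ranked with a single-gene discriminative score, as the algorithm slide describes. Bron-Kerbosch is used here only as one standard way to enumerate cliques; the slides do not say which clique-finding method was used.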
