LSI based framework to predict gene regulatory information



BioMed Central
BMC Bioinformatics

Open Access
Meeting abstract

LSI based framework to predict gene regulatory information
Sujoy Roy 1,2, Lijing Xu 1,2 and Ramin Homayouni* 1,2

Address: 1 Department of Biology, University of Memphis, Memphis, TN 38152, USA and 2 Bioinformatics Program, University of Memphis, Memphis, TN 38152, USA

Email: Ramin Homayouni* - [email protected]

* Corresponding author

Background
Latent Semantic Indexing (LSI), a vector space model for information retrieval, has shown promise in predicting functional relationships between genes using textual information in MEDLINE abstracts. The underlying principle is that genes may be represented as document vectors in a multi-dimensional hyperspace, and the conceptual relationship between any two genes is determined by the cosine of the angle between their vectors [1]. In this study, we sought to extend this concept for identification of putative transcription factors (TFs) that regulate a group of co-regulated genes. We hypothesized that co-expressed genes identified by microarray experiments are functionally related and that at least some of these genes have previously been linked explicitly or implicitly to TFs in the literature. A transcriptional module is then defined as a set of genes clustered together in LSI space with closely related TFs (Figure 1).

We devised a framework using these assumptions to identify transcriptional modules from microarray and promoter motif data (Figure 2). The framework requires as input co-expressed genes from a microarray dataset and a set of TFs that have consensus motifs in the promoter regions of the co-expressed genes. Usually the set of such motif-derived TFs is large and makes the identification of the critical ones difficult. The framework first identifies functionally related clusters of co-expressed genes based on their latent relationships from literature, and then adds to each cluster TFs that are closely associated with the genes in the cluster. The putative transcriptional modules are ranked based on the degree of relative literature coherence amongst the entities in them.
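The cosine principle stated in the Background can be illustrated with a minimal Python sketch. It is not the authors' implementation: the gene names and the three-dimensional vectors are invented toy data, and the helper names `cosine` and `pairwise_cosines` are hypothetical. In practice the vectors would come from a reduced-rank decomposition of a term-by-gene-document matrix built from MEDLINE abstracts.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumed convention for an all-zero vector
    return dot / (norm_u * norm_v)


# Hypothetical gene document vectors (toy data, not real LSI output).
gene_vectors = {
    "geneA": [1.0, 0.0, 0.0],
    "geneB": [1.0, 1.0, 0.0],
    "geneC": [0.0, 0.0, 2.0],
}


def pairwise_cosines(vectors):
    """Map each unordered pair of names to the cosine of their vectors."""
    names = sorted(vectors)
    return {
        (a, b): cosine(vectors[a], vectors[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


similarities = pairwise_cosines(gene_vectors)
```

In this toy example geneA and geneB point in similar directions and so score a high cosine, while geneC is orthogonal to both and scores zero. A clustering step would group genes by such scores; the abstract does not specify which clustering method is used.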

Results and discussion
The LSI-based algorithm allows prediction of TFs based on latent (implicit) relationships in the literature. A preliminary evaluation of our method using previously published knock-out experiments revealed that it has reasonable recall and precision. A more rigorous evaluation of the method will require several additional TF knock-out microarray experiments. This work provides proof of principle that the combination of motif analysis and LSI may be used to identify putative transcriptional modules from microarray data.

from UT-ORNL-KBRIN Bioinformatics Summit 2009
Pikeville, TN, USA. 20–22 March 2009

Published: 25 June 2009

BMC Bioinformatics 2009, 10(Suppl 7):A5 doi:10.1186/1471-2105-10-S7-A5

Supplement: UT-ORNL-KBRIN Bioinformatics Summit 2009. Editors: Eric C Rouchka and Julia Krushkal. Meeting abstracts: a single PDF containing all abstracts in this Supplement is available at http://www.biomedcentral.com/content/pdf/1471-2105-10-S7-full.pdf. Supplement URL: http://www.biomedcentral.com/content/pdf/1471-2105-10-S7-info.pdf

This abstract is available from: http://www.biomedcentral.com/1471-2105/10/S7/A5

© 2009 Roy et al; licensee BioMed Central Ltd.

Figure 1. Gene clusters and transcriptional modules in LSI space. Gene document vectors (blue) are clustered together in LSI space based on the closeness of the angle between them. Next, transcription factor vectors (green) are added to the clusters if their cosine value is close to the average cosine of the gene cluster.

Figure 2. Overview of Framework.
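The TF-assignment rule given in the Figure 1 caption and the ranking of modules by literature coherence might be sketched as follows. This is a sketch under stated assumptions, not the published method: "close to the average cosine" is interpreted here as a TF's mean cosine to the cluster genes being no more than a hypothetical `tolerance` below the cluster's own mean pairwise cosine, and "coherence" is taken to be the mean pairwise cosine over all genes and TFs in a module. All names and vectors are invented for illustration.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumed convention for an all-zero vector
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(vectors):
    """Average cosine over all unordered pairs; 0.0 for fewer than two vectors."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def build_module(cluster, tf_vectors, tolerance=0.1):
    """Add TFs to a non-empty gene cluster (dict of name -> vector).

    Assumed rule: keep a TF if its mean cosine to the cluster genes is at
    least the cluster's mean pairwise cosine minus `tolerance`.
    """
    genes = list(cluster.values())
    cluster_avg = mean_pairwise_cosine(genes)
    added = {}
    for name, vec in tf_vectors.items():
        tf_avg = sum(cosine(vec, g) for g in genes) / len(genes)
        if tf_avg >= cluster_avg - tolerance:
            added[name] = vec
    # Assumed coherence measure: mean pairwise cosine over genes and TFs.
    coherence = mean_pairwise_cosine(genes + list(added.values()))
    return {"genes": sorted(cluster), "tfs": sorted(added), "coherence": coherence}


def rank_modules(clusters, tf_vectors, tolerance=0.1):
    """Build one module per cluster and sort by decreasing coherence."""
    modules = [build_module(c, tf_vectors, tolerance) for c in clusters]
    return sorted(modules, key=lambda m: m["coherence"], reverse=True)


# Toy two-dimensional vectors (hypothetical, not real LSI output).
clusters = [
    {"g3": [0.0, 1.0], "g4": [1.0, 1.0]},
    {"g1": [1.0, 0.0], "g2": [1.0, 0.2]},
]
tf_vectors = {"tfX": [1.0, 0.1], "tfY": [0.0, 1.0]}

ranked = rank_modules(clusters, tf_vectors)
```

With this toy data, `tfX` joins the tight g1/g2 cluster and `tfY` joins the looser g3/g4 cluster, and the g1/g2 module ranks first because its entities are more mutually similar. The tolerance value strongly affects which TFs are retained and would need tuning against real data.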

References
1. Homayouni R, Heinrich K, Wei L, Berry MW: Gene clustering by latent semantic indexing of MEDLINE abstracts. Bioinformatics 2005, 21(1):104-115.