Finding Functional Gene Relationships Using the Semantic Gene Organizer (SGO)
Kevin Heinrich
Master’s Defense
July 16, 2004
Outline
• Problem / Goals
• Related Work
• Information Retrieval
– Vector Space Model
– Latent Semantic Indexing (LSI)
• Biological Databases
• SGO Use & Results
Problem
• Biological tools are creating vast amounts of data.
• Current techniques are time-consuming and expensive.
• Want to know phenotype (function) from genotype (structure/sequence).
Goals
• Develop a tool to aid researchers in finding and understanding functional gene relationships.
• Use information that covers whole genome, e.g. literature.
Related Work
• Jenssen et al. (2001) developed PubGene.
– Literature network
– Assigns a functional association if there is a co-occurrence of gene symbols
• Wilkinson and Huberman (2004) expanded this idea to find communities of related genes.
• Yandell and Majoros (2002) use natural language processing techniques to identify nature of relationships.
Related Work
• Almost all literature-based techniques rely on term co-occurrence.
• What about gene aliases?
• Solution: Apply a more robust technique.
Information Retrieval: Vector Space Model
• Documents are parsed into tokens.
• Each token is assigned a weight $w_{ij}$, the weight of the $i$th token in the $j$th document.
• An $m \times n$ term-by-document matrix $A = [w_{ij}]$ is created, where
– Documents are $m$-dimensional vectors.
– Tokens are $n$-dimensional vectors.
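As a rough illustration of the matrix described above, the following sketch builds a small term-by-document matrix of raw counts. The whitespace tokenizer, the function name `term_document_matrix`, and the three toy documents are assumptions made for this example, not part of SGO.

```python
from collections import Counter

def term_document_matrix(docs):
    """Return (terms, A), where A[i][j] is the raw count of term i in document j."""
    # Naive tokenization (an assumption): lowercase, split on whitespace.
    counts = [Counter(doc.lower().split()) for doc in docs]
    terms = sorted(set().union(*counts))
    # Rows are tokens (n-dimensional), columns are documents (m-dimensional).
    A = [[c[t] for c in counts] for t in terms]
    return terms, A

# Toy documents, invented for illustration only.
docs = ["reelin binds vldlr", "vldlr signals dab1", "dab1 dab1 phosphorylation"]
terms, A = term_document_matrix(docs)
for t, row in zip(terms, A):
    print(f"{t:16s} {row}")
```

Raw counts are used here as the weights; any of the weighting schemes on the following slides could be applied to the same matrix.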
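Information Retrieval: Term Weights
• Term weights are the product of a local and a global component (and a document normalization factor $d_j$):
$$w_{ij} = l_{ij}\, g_i\, d_j$$
• tf: $l_{ij} = f_{ij}$
• idf: $g_i = \dfrac{\sum_j f_{ij}}{\sum_j \chi(f_{ij})}$
• idf2: $g_i = \log_2\!\left(\dfrac{n}{\sum_j \chi(f_{ij})}\right) + 1$
Here $f_{ij}$ is the frequency of token $i$ in document $j$, and $\chi(f_{ij})$ is 1 if $f_{ij} > 0$ and 0 otherwise.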
Information Retrieval: Term Weights (cont’d)
• log-entropy:
$$l_{ij} = \log(1 + f_{ij}), \qquad g_i = 1 + \frac{\sum_j p_{ij} \log_2 p_{ij}}{\log_2 n}, \qquad p_{ij} = \frac{f_{ij}}{\sum_j f_{ij}}$$
• Goal is to give distinguishing terms more weight.
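A minimal sketch of log-entropy weighting follows, assuming a raw frequency matrix `F` with one row per term and one column per document. The function name is hypothetical, and the base-2 logarithm for the local weight is an assumption (the slide does not state a base). It requires at least two documents and no all-zero rows.

```python
import math

def log_entropy_weights(F):
    """F[i][j] = raw frequency of term i in document j; returns the weighted matrix."""
    n = len(F[0])  # number of documents (must be > 1)
    W = []
    for row in F:
        total = sum(row)  # sum_j f_ij (must be > 0)
        # sum_j p_ij * log2(p_ij), with p_ij = f_ij / sum_j f_ij; zero entries contribute 0.
        entropy = sum((f / total) * math.log2(f / total) for f in row if f > 0)
        g = 1.0 + entropy / math.log2(n)  # global weight g_i
        # Local weight l_ij = log(1 + f_ij); base 2 is an assumption here.
        W.append([math.log2(1 + f) * g for f in row])
    return W

# A term spread evenly over all documents versus a term concentrated in one.
F = [[1, 1, 1, 1],
     [4, 0, 0, 0]]
W = log_entropy_weights(F)
print(W)
```

The evenly spread term receives a global weight of 0, while the concentrated term receives a global weight of 1, which is the sense in which distinguishing terms get more weight.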
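Information Retrieval: Query & Similarity
• Queries are represented by a pseudo-document vector
$$q_0 = (g_1, g_2, \ldots, g_m)$$
• Similarity is the cosine of the angle between the query and document vectors:
$$\mathrm{sim}(q_0, d_j) = \cos\theta_j = \frac{q_0^T d_j}{\|q_0\|\,\|d_j\|} = \frac{\sum_{k=1}^{m} w_{kj}\, g_k}{\sqrt{\sum_{k=1}^{m} w_{kj}^2}\,\sqrt{\sum_{k=1}^{m} g_k^2}}$$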
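The cosine ranking can be sketched in a few lines of plain Python. The names `cosine_similarity` and `rank_documents`, and the toy matrix and query, are illustrative assumptions; the matrix is stored with rows as terms and columns as documents.

```python
import math

def cosine_similarity(q, d):
    """Cosine of the angle between query vector q and document vector d."""
    dot = sum(g * w for g, w in zip(q, d))
    norm_q = math.sqrt(sum(g * g for g in q))
    norm_d = math.sqrt(sum(w * w for w in d))
    if norm_q == 0 or norm_d == 0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_q * norm_d)

def rank_documents(q, A):
    """Return (document indices by decreasing similarity, similarity scores)."""
    n = len(A[0])
    columns = [[row[j] for row in A] for j in range(n)]  # document vectors
    scores = [cosine_similarity(q, col) for col in columns]
    order = sorted(range(n), key=lambda j: scores[j], reverse=True)
    return order, scores

# Toy 3-term x 3-document matrix and a query containing terms 0 and 1.
A = [[1, 0, 1],
     [1, 1, 0],
     [0, 1, 0]]
q = [1, 1, 0]
order, scores = rank_documents(q, A)
print(order, [round(s, 3) for s in scores])
```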
Information Retrieval: Latent Semantic Indexing (LSI)
LSI performs a truncated SVD on the term-by-document matrix $A$, where
$$A = U \Sigma V^T$$
• $U$ is the $m \times r$ matrix whose columns are eigenvectors of $AA^T$
• $V^T$ is the $r \times n$ matrix whose rows are eigenvectors of $A^TA$
• $\Sigma$ is the $r \times r$ diagonal matrix containing the $r$ nonnegative singular values of $A$
• $r$ is the rank of $A$
A rank-$k$ approximation is given by $A_k = U_k \Sigma_k V_k^T$
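One way to compute the rank-k approximation is sketched below, assuming NumPy is available. The helper name `truncated_svd` and the toy matrix are made up for illustration; a dense SVD like this would not be the practical choice for a large sparse collection.

```python
import numpy as np

def truncated_svd(A, k):
    """Return U_k, the k largest singular values, and V_k^T for matrix A."""
    # full_matrices=False gives the thin SVD; NumPy returns the singular values
    # in descending order, so slicing keeps the k largest factors.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

# Toy 4-term x 4-document matrix (made-up counts).
A = np.array([[1.0, 1.0, 0.0, 0.0],
              [1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 2.0]])
Uk, sk, Vtk = truncated_svd(A, k=2)
Ak = Uk @ np.diag(sk) @ Vtk  # rank-2 approximation A_k
print(np.round(Ak, 3))
```

The Frobenius norm of `A - Ak` equals the square root of the sum of the squared discarded singular values, which gives a quick check on how much is lost for a given k.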
Information Retrieval: LSI (cont’d)
• Document-to-document similarity is
$$A_k^T A_k = V_k \Sigma_k^2 V_k^T$$
• Queries are projected into the low-rank approximation space
$$q = q_0^T U_k \Sigma_k^{-1}$$
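The two formulas above can be checked numerically with a short NumPy sketch. The toy matrix, the choice k = 2, and the helper name `project_query` are assumptions for illustration only.

```python
import numpy as np

# Toy 4-term x 3-document matrix (made-up weights).
A = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, sk, Vk = U[:, :k], s[:k], Vt[:k, :].T  # rows of Vk are document coordinates

def project_query(q0, Uk, sk):
    """Project a term-space vector q0 into the k-dimensional space: q0^T U_k Sigma_k^{-1}."""
    return (q0 @ Uk) / sk  # dividing by sk applies the inverse of the diagonal Sigma_k

# Document-to-document similarities in the rank-k space: V_k Sigma_k^2 V_k^T.
doc_sims = Vk @ np.diag(sk ** 2) @ Vk.T

# A query containing only the first term.
q0 = np.array([1.0, 0.0, 0.0, 0.0])
q_hat = project_query(q0, Uk, sk)
print(np.round(q_hat, 3))
print(np.round(doc_sims, 3))
```

Projecting a column of `A` with `project_query` returns the corresponding row of `Vk`, so queries and documents end up in the same k-dimensional space.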
Information Retrieval: LSI (cont’d)
• Scaled document vectors can be computed once and stored for quick retrieval.
• The lower-dimensional space forces queries and documents to be compared in a more conceptual manner and saves storage.
• Choice of number of factors is an open question.
• End Effect: LSI can find similarities between documents that have no term co-occurrence.
Information Retrieval: Evaluation Measures
• Precision – ratio of relevant returned documents to the total number of returned documents.
• Recall – ratio of relevant returned documents to the total number of relevant documents.
• Goal is to have high precision at all levels of recall.
• Systems are often evaluated by average precision (AP), which is the average of the interpolated precision values at the 11 standard recall levels (0%, 10%, ..., 100%).
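A sketch of 11-point interpolated average precision is given below. The function name is hypothetical, and it assumes the usual interpolation rule: precision at a recall level is the maximum precision observed at any recall greater than or equal to that level.

```python
def eleven_point_average_precision(ranked, relevant):
    """Average of interpolated precision at recall levels 0.0, 0.1, ..., 1.0."""
    relevant = set(relevant)
    hits = 0
    points = []  # (recall, precision) measured at each relevant item in the ranking
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    interpolated = []
    for i in range(11):
        level = i / 10
        # Interpolated precision: best precision at any recall >= this level.
        candidates = [p for r, p in points if r >= level - 1e-12]
        interpolated.append(max(candidates, default=0.0))
    return sum(interpolated) / 11

# Two relevant items, returned at ranks 1 and 3 (labels are placeholders).
ap = eleven_point_average_precision(["g1", "x", "g2", "y"], {"g1", "g2"})
print(round(ap, 4))
```

Precision is 1.0 up to 50% recall and 2/3 beyond it in this toy ranking, so six of the eleven levels contribute 1.0 and five contribute 2/3.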
Biological Databases: MEDLINE
• MEDLINE (NLM)
– Contains 14+ million references to journal articles with a concentration in medicine
– Spans over 4,600 journals worldwide
– 1966 to present
– ~500,000 citations added annually
– Each citation is manually indexed with MeSH terms.
Biological Databases: PubMed
• PubMed
– Retrieves articles from MEDLINE and other journals.
– Can be queried via any combination of attributes.
Biological Databases: LocusLink
• NCBI human-curated database
• Single query interface to a comprehensive directory for genes and gene reference sequences for key genomes.
• Provides links to related records in PubMed and other citations when applicable.
• Provides RefSeq Summary of gene function and links to key MEDLINE citations relevant to each gene.
Biological Databases: Overview
• MEDLINE has a lot of information
– Not all articles relate to genes
– Gene terminology problem
• LocusLink does not cover all relevant citations, only a representative few.
Biological Databases: Gene Document Construction
• Concatenate titles and abstracts of MEDLINE citations cross-referenced in Human, Rat, and Mouse LocusLink entries.
• Sequencing abstracts are included, which adds noise.
• LocusLink references are not comprehensive, so recall of all relevant abstracts is not guaranteed.
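The concatenation step could look roughly like the sketch below. The data layout (a gene-to-citation-ID mapping and a dictionary of citation records with `title` and `abstract` fields), the function name, and all identifiers in the toy data are assumptions for illustration, not the actual LocusLink or MEDLINE formats.

```python
def build_gene_documents(gene_to_citations, citations):
    """Concatenate the titles and abstracts of each gene's cross-referenced citations."""
    docs = {}
    for gene, citation_ids in gene_to_citations.items():
        parts = []
        for cid in citation_ids:
            record = citations.get(cid)
            if record is None:
                continue  # cross-reference with no citation text available
            parts.append(record["title"])
            parts.append(record["abstract"])
        # Skip empty fields so missing abstracts do not add stray spaces.
        docs[gene] = " ".join(p for p in parts if p)
    return docs

# Placeholder genes and citation records, invented for illustration.
gene_to_citations = {"geneA": ["c1", "c2"], "geneB": ["c3", "c9"]}
citations = {
    "c1": {"title": "T1", "abstract": "A1"},
    "c2": {"title": "T2", "abstract": "A2"},
    "c3": {"title": "T3", "abstract": ""},
}
gene_docs = build_gene_documents(gene_to_citations, citations)
print(gene_docs)
```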
SGO
• Primarily uses LSI to rank genes.
• Enables user to specify query method
– Gene query
– Keyword query
– Number of factors
– Show latent matches
• Saves previous query sessions.
SGO: Interface
SGO: Interface (cont’d)
SGO: Trees
• Unfortunately, ranked lists mean little to biologists.
• Pairwise distances can be formed into a matrix
$$D = [d_{ij}], \qquad d_{ij} = 1 - \cos\theta_{ij}$$
where $\cos\theta_{ij}$ is the similarity between documents $i$ and $j$.
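A small sketch of building the distance matrix from document vectors follows. The function names and the toy vectors are assumptions; in SGO the vectors would presumably be the LSI document vectors, but any set of nonzero vectors works for the illustration.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def distance_matrix(vectors):
    """D[i][j] = 1 - cos(theta_ij); the diagonal is set to exactly 0."""
    n = len(vectors)
    return [[0.0 if i == j else 1.0 - cosine(vectors[i], vectors[j])
             for j in range(n)]
            for i in range(n)]

# Toy document vectors (made up): two orthogonal documents and one in between.
vectors = [[1, 0], [0, 1], [1, 1]]
D = distance_matrix(vectors)
for row in D:
    print([round(x, 3) for x in row])
```

Identical directions give distance 0 and orthogonal documents give distance 1, so the matrix can be passed to a distance-based tree method.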
SGO: Trees (cont’d)
• The Fitch-Margoliash (1967) method in PHYLIP is applied to D to generate hierarchical trees.
• Thresholds can be applied to the self-similarity matrix to produce graphs.
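Thresholding a similarity matrix into a graph can be sketched as below. The function name, the gene labels, the similarity values, and the threshold of 0.4 are all made up for illustration.

```python
def similarity_graph(labels, S, threshold):
    """Return edges (label_i, label_j, similarity) for pairs at or above the threshold."""
    edges = []
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):  # upper triangle only: S is symmetric
            if S[i][j] >= threshold:
                edges.append((labels[i], labels[j], S[i][j]))
    return edges

# Placeholder labels and similarities.
labels = ["geneA", "geneB", "geneC"]
S = [[1.0, 0.8, 0.1],
     [0.8, 1.0, 0.4],
     [0.1, 0.4, 1.0]]
edges = similarity_graph(labels, S, threshold=0.4)
print(edges)
```

Raising the threshold removes weaker edges, so the same matrix can yield sparser or denser graphs depending on how strict a relationship the user wants to see.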
SGO: Hierarchical Tree
SGO: Graph or Nodal Tree
SGO: Coding Issues
• Web interface must be interactive
– Queries are processed on click
– Document collections are parsed offline
– Trees are constructed offline
• Storage will eventually become an issue.
Results: Test Data Set
• A 50-gene test data set was constructed.
– Alzheimer’s Disease
– Cancer
– Development
• Reelin signaling pathway used as basis for evaluation
– 5 primary genes (directly associated)
– 7 secondary genes (indirectly associated)
Results: Primary AP
• AP for 5 primary genes
– 61% for 5 factors
– 84% for 25 factors
– 84% for 50 factors
Results: Secondary AP
• AP for 12 secondary genes
– 53% for 5 factors
– 59% for 25 factors
– 61% for 50 factors
Results: Comparison
• LSI is comparable to tf-idf for the 5 primary genes
• LSI is far superior to tf-idf for the 12 secondary genes
– PubMed co-citation identifies 2 of the 7 indirectly related genes
– Abstract overlap of LocusLink citations fails to identify any indirectly related genes
• tf-idf fails on many keyword queries
• Tested on Gene Ontology classifications (not shown)
– Similar tendencies are observed
Results: Abstract Representation
• To simulate scaling up, decrease representation of reelin-related genes
• AP of 47% on 20,856 Human LocusLink abstracts
Results: Hierarchical Tree
Conclusions
• SGO allows genes to be compared to each other and to keywords (functions).
• SGO identifies latent relationships with promising accuracy.
• SGO is not meant to replace existing technologies, but to assist researchers
– Verify current results
– Direct future exploration
Future Work
• Scale up to entire genome
• Document construction
• Incorporate structural or other information for multi-modal similarity
• Test other models, e.g., NMF or QR
• Interactive tree building
• Keep collections current