Discovering Gene-Disease Association using On-line Scientific Text Abstracts.
Raj Adhikari
Advisor: Javed Mostafa
April 20, 2023 Bioinformatics capstone project 2
Motivation
A central problem in bioinformatics is how to capture information from the vast scientific literature and create an automated system for “knowledge discovery” that can be used in various areas.
I address the special case of gene-disease interactions and show that the frequencies and relevance of words in PubMed abstracts can be used to find genes related to a disease.
Goal
Use a combination of statistical methods and a database to:
  retrieve research abstracts from PubMed;
  extract relevant information from the free text using statistical methods;
  measure the accuracy of the results and display them using a Web-based system.
Complement and support existing knowledge base systems like GeneCards.
Resources used in creating the database
PubMed
  The US National Library of Medicine's database, which contains more than 11 million references to journal articles in the health sciences.
  http://www.ncbi.nlm.nih.gov/entrez/query.fcgi
GeneCards
  A database of human genes, their products and their involvement in diseases.
  http://bioinfo.weizmann.ac.il/cards/index.shtml
HGNC
  HUGO Gene Nomenclature Committee (has approved over 19000 human gene symbols).
  Consistent with OMIM and LocusLink.
  http://www.gene.ucl.ac.uk/nomenclature
Tools used: Perl, CGI, Java, MySQL
Creating the database
Data I used:
  A relatively small list of genes and diseases in humans.
  An article set (around 8000 articles). For each PubMed article:
    PMID
    Article title
    Abstract (filtered with a list of stop words)
  The HUGO dataset.
  A list of around 3500 related gene-disease pairs from GeneCards.
Populating the database tables
  Use the book Genes and Disease at OMIM to generate a list of around 60 diseases and 90 genes.
  Search PubMed for each gene-disease pair on the Title/Abstract field.
  Use ESearch (a tool that provides access to the PubMed database outside of the web interface) to retrieve data in XML file format.
  Use the XML::Simple Perl package to parse the XML file.
  Filter the text using stop words and store each title and abstract along with the related PMID in a database table.
  Add more genes using HUGO.
OMIM: Database of genetic diseases with references to molecular medicine, cell biology, biochemistry and clinical details of the diseases.
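The filter-and-store step can be sketched as follows. This is a minimal illustration in Python rather than the Perl and MySQL used in the project: the stop-word list, the table name `article`, the sample record and the in-memory SQLite database are all assumptions made for the example.

```python
import re
import sqlite3

# Hypothetical stop-word list; the list actually used is not given in the slides.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to", "with"}

def filter_stop_words(text):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9][a-z0-9-]*", text.lower())
    return " ".join(t for t in tokens if t not in STOP_WORDS)

def store_article(conn, pmid, title, abstract):
    """Store the filtered title and abstract under the article's PMID."""
    conn.execute(
        "INSERT OR REPLACE INTO article (pmid, title, abstract) VALUES (?, ?, ?)",
        (pmid, filter_stop_words(title), filter_stop_words(abstract)),
    )

# In-memory SQLite stands in for the MySQL database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE article (pmid INTEGER PRIMARY KEY, title TEXT, abstract TEXT)"
)

# Made-up record for illustration only; PMID 1 is a placeholder.
store_article(
    conn,
    1,
    "Mutations of BRCA2 in the pancreas",
    "BRCA2 is associated with pancreatic cancer.",
)
row = conn.execute("SELECT title, abstract FROM article WHERE pmid = 1").fetchone()
print(row)
```

Note that this sketch lowercases gene symbols along with everything else; a real pipeline may want to preserve case for symbols.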
Populating the database tables
Parse the retrieved text files and create the following tables.
Table structures:
  HUGO (HGNC) table: genesymbol, alias
  GeneCards table: genesymbol, disease
  Derivative table: Term, PMID, Tfreq, Dfreq, Tfidf, LSI
Generating term weights
Basic idea: compare co-occurrence of terms in a document and across a set of documents by generating term weights.
Within a document: Term Frequency
  tf measures term density within a document.
Across the document set: Inverse Document Frequency
  idf measures the “informativeness” of a term across a dataset.
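Thus:
$$\mathrm{tfidf}_i = tf_i \cdot \log\frac{n}{df_i}$$
$$\mathrm{idf}_i = \log\frac{n}{df_i}$$
where n is the number of documents in the set and df_i is the number of documents that contain term i.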
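A minimal Python sketch of the weighting above, assuming tokenized documents and the natural logarithm (the slides do not state the base). The document ids and tokens are made up for illustration.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Return {doc_id: {term: tf * log(n / df)}} for tokenized documents."""
    n = len(docs)
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))  # count each term once per document
    weights = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)  # raw term frequency within this document
        weights[doc_id] = {t: tf[t] * math.log(n / df[t]) for t in tf}
    return weights

# Toy documents (hypothetical), already filtered and tokenized.
docs = {
    "d1": ["brca2", "pancreatic", "cancer", "brca2"],
    "d2": ["psen1", "alzheimer", "disease"],
    "d3": ["brca2", "breast", "cancer"],
}
w = tfidf_weights(docs)
for term, weight in sorted(w["d1"].items()):
    print(term, round(weight, 3))
```

A term that occurs in every document gets weight 0 under this formula, which is the intended effect: such a term carries no information about any particular document.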
Latent Semantic Indexing
Calculating co-occurrence of terms might not suffice because of possible “noise” in the dataset.
Use LSI, a statistical technique, to estimate a latent structure.
Assume some underlying semantic structure in the dataset which could be partially obscured.
Implementation:
  Build a term-by-document matrix (tends to be sparse).
  Convert matrix entries to weights, e.g. tfidf.
  Analyze the matrix by singular value decomposition (SVD) to derive a latent semantic structure model.
SVD
A mathematical decomposition of a matrix (unique in its singular values) into the product of three matrices:
  two with orthonormal columns;
  one with the singular values on the diagonal.
Finds the optimal (least-squares) projection into a low-dimensional space.
A tool for dimension reduction.
SVD: Singular Value Decomposition
$$A = U E V^{T}$$
Where:
  U has orthonormal, unit-length columns: $U^{T} U = I$
  E is a diagonal matrix of non-negative real numbers (the singular values)
  V has orthonormal, unit-length columns: $V^{T} V = I$
SVD
Approximate A by A_k, keeping only the first k singular values and the corresponding columns of the U and V matrices.
The new matrix A_k does not exactly match the original term-by-document matrix A. (It gets closer and closer as more singular values are kept.)
This is what we want: we don’t want a perfect fit, since we think some of the 0’s in A should not be 0 and vice versa.
Limitations of SVD: very memory intensive, cannot handle large datasets.
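The rank-k approximation can be sketched with NumPy's `numpy.linalg.svd`. This is an illustration under assumptions, not the project's implementation: the toy matrix is made up, and NumPy is used in place of whatever SVD routine the project relied on.

```python
import numpy as np

def rank_k_approximation(A, k):
    """Return A_k, built from the k largest singular values of A."""
    # full_matrices=False gives the compact form: U (m x r), s (r,), Vt (r x n).
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Toy term-by-document matrix (hypothetical): rows are terms, columns are documents.
A = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
])

# Frobenius-norm error of A_k for k = 1..4; it shrinks as k grows.
errors = [np.linalg.norm(A - rank_k_approximation(A, k)) for k in range(1, 5)]
print(np.round(errors, 3))

# With k = 1, an entry that is 0 in A becomes non-zero in A_1.
A1 = rank_k_approximation(A, 1)
print(A[0, 2], round(float(A1[0, 2]), 3))
```

In this toy case the first three terms co-occur pairwise, so the rank-1 model assigns a positive weight to a term-document pair that never appears in A. That is the smoothing effect the slide describes.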
Scoring Matrix Generation
A scoring matrix is generated for each term weighting method using the data stored in the database.
This matrix is used to find the relationships between genes and diseases.
This is a relatively fast process, since the weights are pre-computed and stored in a database.
Finding relationships
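[Figure: a binary document-by-term matrix (rows D1 ... Dn, columns T1 ... Tn) and the term-by-term matrix derived from it, in which the row for T1 shows a weight of 2.]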
Use the doc-term matrix to establish relationships between genes and disease
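One way to turn stored per-document weights into gene-disease scores is sketched below in Python. The scoring rule (sum over documents of the product of the gene's and the disease's weight) is an assumption for illustration; with binary weights it reduces to a co-occurrence count. The documents, and the treatment of each disease name as a single term, are also hypothetical.

```python
from collections import defaultdict

def score_pairs(doc_weights, genes, diseases):
    """Sum, over all documents, the product of a gene's and a disease's weight."""
    scores = defaultdict(float)
    for weights in doc_weights.values():
        for g in genes.intersection(weights):
            for d in diseases.intersection(weights):
                scores[(g, d)] += weights[g] * weights[d]
    return scores

def top_genes(scores, disease, limit=5):
    """Rank genes for one disease by descending score (ties broken by name)."""
    ranked = [(g, s) for (g, d), s in scores.items() if d == disease]
    return sorted(ranked, key=lambda gs: (-gs[1], gs[0]))[:limit]

# Hypothetical binary weights per document: {doc_id: {term: weight}}.
doc_weights = {
    "d1": {"kras": 1.0, "pancreatic cancer": 1.0},
    "d2": {"kras": 1.0, "brca2": 1.0, "pancreatic cancer": 1.0},
    "d3": {"kras": 1.0, "pancreatic cancer": 1.0},
    "d4": {"psen1": 1.0, "alzheimer disease": 1.0},
}
genes = {"kras", "brca2", "psen1"}
diseases = {"pancreatic cancer", "alzheimer disease"}

scores = score_pairs(doc_weights, genes, diseases)
print(top_genes(scores, "pancreatic cancer"))
```

Swapping the binary weights for tfidf or LSI weights yields one scoring matrix per weighting method without changing the code.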
Verification of the relationship
Data from GeneCards and HUGO has been stored in a database.
For each gene, if the symbol is an official gene symbol (according to HUGO), then search for the gene symbol in GeneCards and display the diseases associated with it.
Else (if the symbol is an alias), use HUGO to find the official gene symbol, search GeneCards using this gene symbol, and display the diseases associated with the gene.
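The lookup described above can be sketched in Python with two dictionaries standing in for the HUGO and GeneCards tables. The alias rows follow the example in the slides; the gene-disease rows are illustrative placeholders, not data taken from GeneCards.

```python
def official_symbol(symbol, hugo_aliases):
    """Return the official symbol for `symbol`, or None if it is unknown."""
    if symbol in hugo_aliases:
        return symbol  # already an official symbol
    for official, aliases in hugo_aliases.items():
        if symbol in aliases:
            return official  # first match; ambiguous aliases are not handled here
    return None

def known_diseases(symbol, hugo_aliases, genecards):
    """Resolve the symbol through HUGO, then look up its diseases in GeneCards."""
    official = official_symbol(symbol, hugo_aliases)
    return sorted(genecards.get(official, set()))

# Stand-ins for the database tables (assumptions for illustration).
hugo_aliases = {
    "BRCA2": {"FAD1"},
    "PSEN1": {"S182", "PS1"},
}
genecards = {
    "BRCA2": {"breast cancer"},
    "PSEN1": {"Alzheimer disease"},
}

print(known_diseases("S182", hugo_aliases, genecards))
print(known_diseases("BRCA2", hugo_aliases, genecards))
```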
Using gene aliases
Make use of gene aliases from HUGO to increase the chances of detecting the correct genes for a given disease.
Method:
  Increment the weight of an official gene by adding the weight of each of its aliases.
  Group the aliases together with the official gene.
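The method above amounts to folding alias weights into the official symbol, which can be sketched as follows. The weights are made-up numbers and the alias map is a hypothetical excerpt.

```python
def merge_alias_weights(weights, alias_to_official):
    """Add each alias's weight to the weight of its official gene symbol."""
    merged = {}
    for symbol, weight in weights.items():
        # Symbols without an alias entry are treated as official symbols.
        official = alias_to_official.get(symbol, symbol)
        merged[official] = merged.get(official, 0.0) + weight
    return merged

# Hypothetical weights for one disease, keyed by the symbol found in the text.
weights = {"PSEN1": 1.5, "S182": 0.5, "PS1": 1.0, "BRCA2": 2.0}
alias_to_official = {"S182": "PSEN1", "PS1": "PSEN1"}

merged = merge_alias_weights(weights, alias_to_official)
print(merged)
```

In this toy case PSEN1 overtakes BRCA2 once its aliases are counted, which is how alias handling can change a top-five list.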
Results for Pancreatic Cancer
Top five genes – without considering alias
Top five genes – considering alias
Using gene aliases - problems
Problem: HUGO may list the same alias under multiple official gene symbols.
  Such an alias could increase the weight of a gene that is not related to the disease.
Example:
  HGNC ID   Official symbol   Aliases
  3585      FANCD2            FAD, FA-D2
  1101      BRCA2             FAD, FAD1
  9508      PSEN1             FAD, S182, PS1
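Shared aliases such as FAD can be detected before any weights are merged. The sketch below uses the three rows from the example; what to do with a shared alias (skip it, or split its weight) is left open, since the slides do not say.

```python
from collections import defaultdict

def ambiguous_aliases(hugo_aliases):
    """Return {alias: sorted official symbols} for aliases shared by several genes."""
    owners = defaultdict(set)
    for official, aliases in hugo_aliases.items():
        for alias in aliases:
            owners[alias].add(official)
    return {a: sorted(o) for a, o in owners.items() if len(o) > 1}

# Rows from the example above.
hugo_aliases = {
    "FANCD2": {"FAD", "FA-D2"},
    "BRCA2": {"FAD", "FAD1"},
    "PSEN1": {"FAD", "S182", "PS1"},
}

shared = ambiguous_aliases(hugo_aliases)
print(shared)
```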
Verification
In addition, the number of PubMed articles containing a disease and a gene symbol can be an indication of how strong the association between a disease and a gene is.
The same theory applies to a gene-gene relationship.
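An article count of this kind can be requested from the NCBI E-utilities ESearch endpoint, which accepts `db`, `term` and `rettype=count` parameters. The Python sketch below only builds the request URL and makes no network call; the helper name and the example gene-disease pair are assumptions, and the exact query the project used is not given in the slides.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def count_query_url(first, second):
    """Build an ESearch URL asking how many PubMed records mention both terms
    in the Title/Abstract field."""
    term = f"{first}[Title/Abstract] AND {second}[Title/Abstract]"
    params = {"db": "pubmed", "term": term, "rettype": "count"}
    return ESEARCH + "?" + urlencode(params)

# Hypothetical gene-disease pair; a gene-gene pair works the same way.
url = count_query_url("BRCA2", "pancreatic cancer")
print(url)
```

The response is an XML document whose `Count` element holds the number of matching records.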
Gene-Gene Relationships
In addition, we can use the doc-term matrix to find gene(s) that are related to any given gene.

Document-by-gene matrix:
        g1   g2   g3   ...  gn
  D1     1    1    1
  D2     1    1    1
  ...    1    0    1
  Dn     1    0    0

Gene-by-gene matrix:
        g1   g2   g3   ...  gn
  g1
  g2               2
  ...
  gn

Using the matrices above, we see that g2 is related to g3 and the weight is 2.
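The weight of 2 between g2 and g3 can be reproduced by multiplying the document-by-gene matrix with its own transpose. The Python sketch below assumes that the three value columns on the slide belong to g1, g2 and g3, which is a reading of the extracted figure rather than something the slides state.

```python
# Document-by-gene matrix from the slide: rows D1, D2, ..., Dn; columns g1, g2, g3.
doc_gene = [
    [1, 1, 1],
    [1, 1, 1],
    [1, 0, 1],
    [1, 0, 0],
]

def cooccurrence(matrix):
    """Return M^T M: entry [i][j] counts documents that mention both gene i and gene j."""
    cols = list(zip(*matrix))  # one tuple per gene column
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]

gene_gene = cooccurrence(doc_gene)
for row in gene_gene:
    print(row)
```

The diagonal holds each gene's own document count, so only the off-diagonal entries are read as relationship weights.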
Discovering additional gene-gene relationships
We can make use of the possibility that two genes might be related to each other via a disease, as in:
  gene1 -> disease1 -> gene2
  gene1 -> disease2 -> gene2
to establish relationships between gene1 and gene2.
In our case, the fact that gene1 and gene2 are related to each other via two different diseases makes the relationship between them even stronger.
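A minimal Python sketch of this idea: group genes by disease, then record for each gene pair the diseases that link them. Using the number of shared diseases as the strength of the link is an assumption for illustration, and the pairs are placeholders in the spirit of the example.

```python
from collections import defaultdict
from itertools import combinations

def genes_linked_by_disease(gene_disease_pairs):
    """Map each gene pair to the sorted list of diseases both genes are related to."""
    genes_of = defaultdict(set)
    for gene, disease in gene_disease_pairs:
        genes_of[disease].add(gene)
    shared = defaultdict(set)
    for disease, genes in genes_of.items():
        for g1, g2 in combinations(sorted(genes), 2):
            shared[(g1, g2)].add(disease)
    return {pair: sorted(ds) for pair, ds in shared.items()}

# Placeholder gene-disease relationships.
pairs = [
    ("gene1", "disease1"),
    ("gene2", "disease1"),
    ("gene1", "disease2"),
    ("gene2", "disease2"),
    ("gene3", "disease2"),
]
links = genes_linked_by_disease(pairs)
for pair, via in sorted(links.items()):
    print(pair, len(via), via)
```

Here gene1 and gene2 are linked through two diseases, while gene3 is linked to each of them through only one.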
System Demonstration
http://biokdd.informatics.indiana.edu/radhikar/search.html
Related URLs:
  GeneCards: http://bioinfo.weizmann.ac.il/cards/index.shtml
  HGNC: http://www.gene.ucl.ac.uk/nomenclature/
Summary
Using a combination of statistical methods and a database, the process of establishing gene-disease relationships from literature data is fast and efficient.
With minimal changes, our system can be extended to discover other relationships, such as protein-protein interactions.
Future Work
  Extend our system to incorporate the entire Medline dataset.
  Incorporate full gene names.
  Find a better way to verify the gene-gene relationships.
  Incorporate other on-line scientific literature databases.